OpenAI and training data company Handshake AI are asking third-party contractors to upload real work they did at past and current jobs, according to a report in Wired. The request is part of a broader AI industry strategy to generate high-quality training data that could eventually allow models to automate white-collar work. A company presentation reportedly asks contractors to describe tasks performed at other jobs and upload examples of “real, on-the-job work,” including “concrete output” files such as Word documents, PDFs, PowerPoints, Excel spreadsheets, images, and code repositories. OpenAI instructs contractors to delete proprietary and personally identifiable information before uploading and points them to a ChatGPT “Superstar Scrubbing” tool for this purpose, but intellectual property lawyer Evan Brown warns the approach puts OpenAI “at great risk” by requiring “a lot of trust in its contractors to decide what is and isn’t confidential.”
The fundamental problem is that the contractors asked to determine what constitutes proprietary or confidential information in their past work likely lack the legal expertise to make those determinations accurately. What seems like non-sensitive material to a contractor might contain trade secrets, competitive intelligence, client information, or intellectual property that former employers consider protected. A marketing presentation might reveal strategic positioning that competitors could exploit. A code repository might contain proprietary algorithms or architectural decisions that constitute valuable IP. An Excel spreadsheet might include formulas, methodologies, or data structures that represent significant business investment. Asking contractors to judge confidentiality creates massive liability exposure both for the contractors, who might violate non-disclosure agreements or employment contracts, and for OpenAI, which could face lawsuits from companies whose proprietary information ends up in training data.
The “Superstar Scrubbing” tool that OpenAI provides for removing sensitive information raises questions about its effectiveness and the standards it applies. If the tool uses AI to identify and remove proprietary information, it might miss context-specific confidentiality that requires industry knowledge or legal judgment. If it relies on contractors manually identifying sensitive content, it provides no more protection than asking contractors to delete such information themselves. Either way, the burden of determining confidentiality falls on the people least equipped to make those judgments: contractors performing data-labeling work rather than lawyers with expertise in intellectual property, employment law, and confidentiality agreements.
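To make that limitation concrete, here is a minimal sketch, purely hypothetical and not a description of OpenAI’s actual tool, of what pattern-based scrubbing can and cannot catch. A redaction script reliably strips strings that match known formats such as email addresses or phone numbers, but it has no way to recognize that a pricing strategy buried in a slide note is a trade secret.

```python
import re

# Hypothetical illustration, not OpenAI's actual tool: a pattern-based scrubber
# can only redact information that matches formats it already knows about.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace strings matching known PII patterns with redaction markers."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

slide_note = (
    "Contact jane.doe@client.example (206-555-0147) about the Q3 pricing plan: "
    "undercut Competitor X by 12% in the Northwest region."
)

print(scrub(slide_note))
# The email and phone number are redacted, but the pricing strategy, a likely
# trade secret, passes through untouched because no generic pattern can flag it
# without business context or legal judgment.
```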
For contractors being asked to provide this material, the request creates significant personal risk. Most employment contracts and non-disclosure agreements prohibit sharing work product with third parties, even after employment ends. Violating those agreements can result in lawsuits seeking damages, injunctions preventing future work in the industry, and reputational harm that affects career prospects. A contractor who uploads work files to OpenAI believing they’ve adequately scrubbed sensitive information might discover years later that a former employer considers the uploads to violate confidentiality agreements, potentially resulting in litigation the contractor can’t afford to defend against and might lose despite good-faith efforts to comply with OpenAI’s scrubbing instructions.
The broader strategy Wired describes, in which AI companies hire contractors to generate high-quality training data that could eventually automate white-collar work, reveals the ultimate goal: using examples of actual professional work to train models that can perform those same tasks. If OpenAI can collect thousands of real marketing presentations, legal briefs, financial analyses, code repositories, and other professional work products, its models can learn the patterns, structures, and techniques that practitioners use in actual business contexts rather than synthetic examples created for training purposes. That real-world data presumably produces models better at generating work that meets professional standards, but it requires access to vast quantities of proprietary professional work that companies and individuals created through significant investment.
The tension between OpenAI’s need for high-quality professional training data and the intellectual property rights of the companies that created or paid for that work reflects fundamental questions about AI training that remain legally unresolved. Can contractors provide their past work to AI companies for training purposes? Does doing so violate employment agreements? Does it constitute theft of trade secrets if the work contains proprietary information? Courts haven’t definitively answered these questions, creating a gray area in which AI companies push boundaries by asking contractors for material that might or might not be legally available for training use.
For Seattle’s tech industry, where many workers have employment contracts with IP assignment clauses stating that work created during employment belongs to the company, not the employee, this OpenAI practice creates particular concern. A software engineer who worked at Amazon, Microsoft, or local startups and uploads code repositories to OpenAI might be violating IP assignment agreements even if the code seems generic or non-confidential. A product manager who uploads strategy documents might be sharing competitive intelligence that former employers consider protected. A data scientist who uploads analysis spreadsheets might be revealing methodologies and approaches that constitute valuable IP. Whether those uploads violate contracts depends on specific agreement language and the nature of material shared, but the risk is substantial enough that employment lawyers would likely advise clients to refuse such requests rather than attempt to determine what’s safe to share.
The instruction to delete “proprietary and personally identifiable information” treats those categories as if they’re clearly defined and easily identified, but in practice determining what constitutes proprietary information requires legal analysis of specific contractual relationships and business contexts. Information that seems generic might be proprietary if it reveals processes, methodologies, or insights that give a company competitive advantage. A template might be proprietary if it embodies strategic approaches developed through significant investment. Even publicly available information assembled in specific ways might constitute protectable trade secrets if the compilation itself provides value. Contractors tasked with making these determinations will inevitably make errors, either being overly cautious and providing less useful training data, or being insufficiently cautious and sharing material that violates confidentiality.
The approach also raises questions about informed consent from the clients and employers whose work ultimately ends up in training data. When a contractor uploads a presentation created for a client, did that client consent to their proprietary strategic thinking being used to train AI models that might eventually serve their competitors? When a contractor uploads code written for an employer, did that employer agree to their IP being incorporated into commercial AI products? The contractual relationship between OpenAI and contractors doesn’t include the third parties whose work is being collected, creating potential liability for using material without authorization from actual rights holders.
Intellectual property lawyer Evan Brown’s warning that this approach puts OpenAI “at great risk” reflects the scale of potential liability. If dozens or hundreds of contractors upload thousands of files containing proprietary information from their past employers, and those employers discover their IP in OpenAI’s training data or in outputs from OpenAI’s models, the resulting litigation could involve claims for misappropriation of trade secrets, breach of contract, copyright infringement, and other IP violations. Even if OpenAI argues they relied on contractors’ representations that material was properly scrubbed, that might not provide adequate defense if courts determine OpenAI should have implemented more robust verification processes before accepting potentially confidential material.
The requirement for “concrete output” rather than summaries reveals OpenAI’s need for actual work products that models can learn from directly. Summaries or descriptions of work provide less training value than the work itself because models learn by analyzing patterns, structures, and techniques visible in real documents. A summary of a marketing strategy doesn’t show how that strategy was presented, formatted, or argued in the actual deliverable to the client. The real PowerPoint deck shows all those elements, providing rich training data about professional communication standards, persuasive techniques, and industry conventions. But requiring actual files rather than summaries sharply increases IP risk, because files are far more likely to contain proprietary information that summaries would omit.
For OpenAI’s competitors developing their own AI models, watching this approach provides strategic intelligence about how OpenAI is sourcing training data. If collecting real professional work through contractors proves effective for improving model capabilities, competitors might adopt similar approaches despite the legal risks. If it triggers significant litigation or regulatory backlash, competitors might avoid similar practices. That this information reached Wired at all suggests sources, whether concerned contractors or others close to the program, were troubled by its ethical and legal implications, indicating internal discomfort with the approach even as company leadership pursues it.
The OpenAI spokesperson’s decision to decline comment is notable because the company typically defends its training data practices publicly when questioned. Silence suggests either that the practice is still being refined and leadership doesn’t want to commit to defending it publicly, or that legal concerns about discussing it outweigh the benefits of transparency. Either way, the lack of a public defense from OpenAI indicates this isn’t a practice the company is eager to spotlight, even as it apparently pursues it behind the scenes through contractor relationships.
For contractors considering whether to comply with these requests, the calculation involves weighing compensation from OpenAI against personal legal risk and ethical concerns about sharing work that might violate confidentiality agreements or harm former employers and clients. Some contractors might refuse, determining the risk isn’t worth whatever OpenAI pays for training data contributions. Others might comply, either believing they can adequately identify and remove sensitive information, or simply prioritizing immediate income over potential future liability. That puts contractors in the impossible position of making legal determinations they’re not qualified to make, facing consequences if they guess wrong, with minimal protection if disputes arise.
The name of the ChatGPT “Superstar Scrubbing” tool is itself interesting, suggesting either internal tool-naming conventions at OpenAI or a branded feature marketed to contractors as solving confidentiality concerns. Whether the tool actually provides reliable scrubbing of proprietary information, or whether it’s primarily designed to create the appearance of due diligence while pushing liability onto contractors who must ultimately judge what to scrub, determines whether the practice provides real protection or just plausible deniability for OpenAI when confidential information inevitably ends up in training data.
For companies whose employees become contractors for OpenAI or similar firms, this practice creates new risks that employment agreements and IP protections might not adequately address. Traditional non-disclosure and IP assignment agreements assume employees might share confidential information through intentional disclosure to competitors or through careless security practices. They might not contemplate scenarios where AI training companies systematically solicit professional work products from contractors who previously worked at other companies, creating industrial-scale potential for confidential information to flow into training data that gets incorporated into commercial products serving entire industries including direct competitors.
The ultimate question is whether the AI industry’s need for high-quality training data justifies approaches that require contractors to make complex legal judgments about confidentiality and intellectual property rights, exposing them to personal liability while potentially violating rights of third parties whose work is being collected. OpenAI’s apparent willingness to pursue this approach despite obvious legal risks suggests the competitive pressure to improve model capabilities through better training data outweighs concerns about potential litigation or regulatory backlash. Whether that calculation proves correct depends on whether the practice generates lawsuits that create precedents restricting how AI companies can source training data, and whether regulators eventually impose restrictions that retroactively make current practices illegal or actionable.
For now, contractors face requests to upload real work from past jobs, armed with minimal guidance on determining confidentiality and tools of uncertain effectiveness for scrubbing sensitive information, while being asked to make legal judgments that could expose them to liability if they get them wrong. That arrangement serves OpenAI’s need for training data while pushing risk onto the most vulnerable participants in the AI development process: contractors with limited resources to defend against potential litigation from former employers whose proprietary information ends up training models that might eventually compete with their businesses.


