April 12, 2026 · 8 min read
Your Writing Has Already Fed the Machine
AI companies trained their models on human-created content without asking. Here is what happened, and what it means.
Did AI companies train on human writing without permission?
Yes, in documented cases. The New York Times reported that OpenAI scraped more than one million hours of YouTube videos to train GPT-4, in violation of YouTube's terms of service. Google later confirmed that it trains Gemini on some YouTube content. A federal judge found that Anthropic had knowingly downloaded more than seven million pirated books. These are the documented cases; the full scope has never been publicly disclosed.
There is a specific kind of discomfort that comes from realising that something you made has been used in ways you did not agree to, by parties you did not authorise, for purposes that benefit them and not you. It is not quite like theft, because what was taken was not removed from you; your essay is still there, your notes are still there, the thing that was taken was a copy. But the copy contributed to building systems that now compete with human writers, human researchers, human thinkers. And you were not asked, and you were not paid, and in most cases you were not told.
What was actually taken
The scale requires a moment to absorb. One million hours of YouTube video is not a set of recordings anyone could watch; it is a duration that exceeds a human life. Played continuously, twenty-four hours a day from the moment you were born, it would run for roughly 114 years; you would die before it finished. And that is the minimum estimate for a single company's scrape of a single platform.
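The back-of-the-envelope version of that figure, as a sanity check and nothing more (the only input is the reported one million hours):

```python
hours = 1_000_000
years = hours / 24 / 365.25   # continuous playback: no sleep, no pauses
print(f"{years:.0f} years of nonstop video")   # ≈ 114 years
```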
The New York Times investigation, published in 2024, described OpenAI facing a supply problem: having already processed most of the internet's readily available text, the company could not find useful new data at the rate scaling demanded. The response, per the investigation, was to use whatever was available, regardless of whether permission existed. A million hours of YouTube was a start.
The pattern, not the exception
It would be convenient if these were isolated failures of individual companies rather than a structural pattern. The evidence does not support that interpretation. Google knew, according to internal sources cited by the New York Times, that OpenAI was using YouTube content — and said nothing, because Google was doing the same thing. The silence was strategic: calling out a competitor for a practice you are also engaged in creates risks you do not want to create.
Anthropic's case is, if anything, more straightforward. A federal judge found that the company had not simply scraped publicly available content but had downloaded books that were explicitly pirated, in bulk, with knowledge of their status. The settlement, reported at approximately US $1.5 billion, is substantial by any measure, but small relative to the value of the models trained.
The legal argument AI companies make
The defence offered in court, and in various public statements, runs approximately as follows: the training process is transformative. What is produced is not a copy of the original work but something new, which learned from the original in the way a student learns from a book. The student does not owe the author a royalty on every thought they subsequently have. The same logic, the argument goes, should apply here.
This is a genuine argument with genuine legal precedent on its side, and courts have reached different conclusions depending on the facts. What makes it uncomfortable is not that it is obviously wrong but that it is the argument a party would make regardless of whether they believed it, because it is the argument that permits them to keep the models they built. The question of whether they would have made it before the models existed — whether data dignity was on anyone's radar in the years when the training was happening — has a fairly clear answer.
The enclosure parallel
Five hundred years ago, English nobles began enclosing common land: placing fences around pasture that had been shared, seizing it for private use, evicting the people who had farmed it. The mechanism was not straightforward theft; it used ambiguous law, specifically a statute that permitted enclosure as long as "adequate" common land remained, and the question of what constituted adequate was never clearly defined. The ambiguity was the instrument.
The web was built as a commons. Tim Berners-Lee, who created it, surrendered all rights deliberately, believing the value of an open and free web exceeded whatever he might have earned from it. For a generation, a kind of symbiosis developed: platforms managed the digital commons, invited people to create within it, and shared some portion of the resulting value with creators through advertising. The arrangement was imperfect but functional.
What AI training did was take the accumulated product of that commons — everything anyone had ever written publicly on the web — and process it into private value. The commons did not disappear; your essay is still there. But the distillation of everything the commons contained — the patterns, the knowledge, the voice — is now in privately held models that you do not own and cannot access without paying.
What it means when the data runs out
The documentation of AI companies' training practices becomes more legible when you understand what they are racing toward. Sam Altman, in a public interview, described the plan directly: scale up training on internet data until that supply runs out, then transition to AI-generated training data. The scramble — the YouTube scraping, the pirated books, the legal risks taken to access more text — is intelligible as a race against a deadline. If you can get to the point where AI-generated data is good enough to train the next generation of models before the legal environment forces you to stop, the question of data provenance becomes moot.
A 2024 paper in Nature found that models trained exclusively on AI-generated data collapse: quality degrades recursively, each generation worse than the last. Human-created data is, for now, irreplaceable. The scramble is therefore also a gamble: get enough human data processed before either the supply runs out or the courts impose constraints, and hope the bootstrapping threshold is reached before either limit binds.
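The tail-loss dynamic behind that finding is easy to reproduce in a toy model. The sketch below is not the Nature paper's experiment; the vocabulary size, corpus size, and Zipf-like frequencies are all made up for illustration. Each "generation" estimates token frequencies from the previous corpus and samples a new corpus from that estimate; any token that draws zero samples once is gone from every generation after.

```python
import random
from collections import Counter

random.seed(1)

# Toy "human" corpus: 1,000 distinct tokens with Zipf-like frequencies.
vocab = list(range(1000))
weights = [1.0 / (rank + 1) for rank in range(1000)]
corpus = random.choices(vocab, weights=weights, k=20_000)

for gen in range(1, 11):
    freq = Counter(corpus)                  # "train": estimate token frequencies
    tokens, counts = zip(*freq.items())
    corpus = random.choices(tokens, weights=counts, k=20_000)  # "generate" next corpus
    print(f"generation {gen:2d}: distinct tokens = {len(set(corpus))}")
```

The distinct-token count can only fall, never recover: each pass narrows the distribution a little further, which is the recursive degradation the paper describes at scale.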
What it means to build the other way
What Rodin attempts is, in a structural sense, the opposite. Your writing is not processed into a model that you will later interact with as a service. It generates a fingerprint that belongs to you: a structured analysis of how you think, derived from your writing, which you can share, update, or delete. The AI is the mechanism; you are the beneficiary of what it extracts.
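As a purely hypothetical sketch (Rodin's actual data model is not public, and every name below is invented for illustration), the structural difference amounts to where the extracted pattern lives and who controls it:

```python
from dataclasses import dataclass

@dataclass
class Fingerprint:
    # Illustrative placeholders only, not Rodin's real schema.
    owner: str                   # the writer, not the model vendor
    derived_from: list[str]      # which notes or essays were analyzed
    patterns: dict[str, float]   # e.g. recurring concepts and their weights
    shareable: bool = False      # visibility is the owner's decision

    def delete(self) -> None:
        """Owner-controlled erasure: the analysis itself goes away, not just access to it."""
        self.derived_from.clear()
        self.patterns.clear()
```

The point is the shape, not the field names: the output is a record the writer holds, rather than weights inside a model someone else owns.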
This is not a solution to the structural problem of data taking; one tool cannot undo years of training at scale. But it is a different proposition: instead of your writing making a model more capable, your writing makes your intellectual identity more visible — to you, and to the people who think like you. What was taken without asking was the pattern in your thinking. Rodin is built around the premise that the pattern should come back to you.
Your notes already contain your fingerprint.
Extract yours →