Digits and Androminas

The FBI is hunting the web archive that's making the media uncomfortable.

US authorities want to identify the founder of Archive Today, a service that allows users to bypass media paywalls.

[Image: someone looking at the Archive Today website]
15/11/2025

The FBI has launched an operation to uncover the anonymous founder of Archive Today, a service that allows users to bypass media paywalls. It is a new episode in the ongoing conflict between the owners of online content and the platforms that preserve it, now complicated by the training of artificial intelligence models.

The federal subpoena demands that the domain registrar Tucows provide comprehensive information about the domain owner: customer name, addresses, call and message logs, payment information, IP addresses, and "any other identifying information." This is all part of a "federal criminal investigation" that does not specify the crime, although copyright infringement is the most likely charge after the News/Media Alliance successfully shut down the similar service 12ft.io last July. Tucows has until November 29th to comply with the authorities' demands, but the anonymous operator—rumors point to "Denis Petrov" in Prague and "Masha Rabinovich" in Berlin—continues to operate normally through its mirror domains (archive.is, archive.ph, archive.vn) and even an encrypted Tor service.

While Archive Today preserves hundreds of millions of web pages on a shoestring budget, large media conglomerates are mobilizing the authorities to persecute an archive that, among other things, documents how they themselves modify or delete news without leaving a trace. But business is business, and paywalls are sacrosanct in times of dwindling readership and clicks stolen by AI-generated summaries.

The big datasets that train commercial AI

While Archive Today makes the media uncomfortable for direct economic reasons, Common Crawl does so in a more subtle and massive way. This Californian non-profit, founded by Gil Elbaz, has been scouring the web since 2007 with monthly scans of 2 to 5 billion pages, each lasting a couple of weeks and occupying between 250 and 460 terabytes. In this way, it has built up a 9.5 petabyte archive that it makes freely available to the public. But this seemingly philanthropic generosity has had a lucrative side effect: it has become the raw material for training most of the large language models (LLMs).
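
To get a sense of how open that archive is: every crawl is exposed through a public CDX index that anyone can query over HTTP. Here is a minimal sketch in Python; the crawl label CC-MAIN-2024-10 is just an illustrative choice, since the current list of crawls is published at index.commoncrawl.org.

```python
# Minimal sketch: query Common Crawl's public CDX index for the captures
# of a given site. The crawl label below is an example; pick a current one
# from the list published at https://index.commoncrawl.org/.
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

resp = requests.get(
    INDEX,
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# The index answers with one JSON object per line and per capture:
# URL, timestamp, MIME type, HTTP status, and the offset of the raw
# record inside a WARC file of the public dataset.
for line in resp.text.splitlines()[:5]:
    print(line)
```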

According to The Atlantic, OpenAI and Anthropic each gave Common Crawl $250,000 in 2023, the year they were commercially scaling their GPT-4 and Claude models. It doesn't take a genius to figure out why: between 60% and 82% of the content used to train GPT-3 came from Common Crawl. Models like Meta's Llama, Google's T5, Bloom, and dozens of others draw from the same well.

From the perspective of content owners, the crux of the problem lies in how Common Crawl works: each crawl captures the complete HTML of pages, including the text that paywalls then hide using JavaScript. This opens an unintentional "backdoor" to restricted content from hundreds of publications, from the New York Times to the ARA. The organization says it respects the nofollow tags and robots.txt files that website owners can use to avoid being crawled, but publishers have unsuccessfully demanded the removal of already archived content. Common Crawl responds that the technical format complicates such removal, but it sounds like an excuse. Since mid-2023, the service has boasted on its homepage about its role in training AI models, claiming that 82% of the tokens (the units of data on which GPT-3 was trained) come from its archive.
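
For publishers who want to opt out before the fact, the documented mechanism is precisely that robots.txt file: Common Crawl's crawler identifies itself as CCBot, so blocking it takes two lines at the site root. A sketch of the standard directive follows; whether content captured before adding it ever disappears is exactly what is in dispute.

```
# robots.txt at the site root: ask Common Crawl's crawler to stay out
User-agent: CCBot
Disallow: /
```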

The universe of web archives

Archive Today, which relies on crowdfunding ($800 a week, about €36,000 a year), holds 700 terabytes and some 500 million pages stored since 2012. Its architecture captures each page in full, in two formats: a functional HTML version with live links, and a static screenshot, each capture subject to a maximum size.

But Archive Today is tiny compared to the 99 petabytes of the Internet Archive, with its 745 nodes, 28,000 disks, and four data centers. Its Wayback Machine, which has preserved over a trillion web pages since 1995, is the benchmark for web archives: institutional, transparent, and with the status of a legal repository.
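
That institutional character shows in the tooling: the Wayback Machine offers a public availability API that returns the archived snapshot of a URL closest to a given date. A minimal sketch using that documented endpoint (the example URL and date are arbitrary):

```python
# Minimal sketch: ask the Wayback Machine for the snapshot of a URL
# closest to a given date, via its public availability API.
import requests

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "ara.cat", "timestamp": "20120101"},  # date as YYYYMMDD
    timeout=30,
)
resp.raise_for_status()

closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])  # when it was captured, and where
else:
    print("No snapshot found")
```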

Effects on Catalan in AI

A collateral aspect of these huge text archives is linguistic. The ecosystem is dominated by English, which accounts for almost half of Common Crawl; German, Russian, Japanese, Chinese, French, and Spanish each represent less than 6%, and Catalan is practically invisible. Of the 90 languages with which OpenAI trained GPT-3, English accounted for 92.7% of the content and Catalan for only 0.017%. Hence the importance of Project Aina, led by the Generalitat of Catalonia and the Barcelona Supercomputing Center: in its Salamandra models, English represents less than 40% of the training content, and Catalan's share reaches almost 2%, a hundred times greater than in GPT-3.

Between legal persecution and open preservation

The campaign to criminalize digital archives has a new and noteworthy episode: Google has removed from its search results 749 million links to Anna's Archive, the successor to Z-Library after the US government seized its domains in 2022. The website offered 51 million books and nearly 100 million academic articles. Interestingly, the same Google that is excluding pirated books from search results en masse belongs to Alphabet, the parent company of Google DeepMind, which has trained its Gemini models with Common Crawl data. Anna's Archive has openly admitted to giving 30 LLM developers access to train on its "illegal book archive," but unlike OpenAI or Meta (which was accused of pirating 81.7 TB of books to train its Llama model) it does not receive strategic donations. The site remains operational with three domains that host no pirated content, only links: a legal gray area defended with the argument that "preserving and hosting these files is morally right."

The underlying debate is whether preserving the open web is a public good or a crime when it inconveniences commercial interests. The media point to lost subscriptions and advertising revenue; archivists defend historical preservation, fact-checking, and access to information. But when the FBI busies itself pursuing anonymous archive operators while OpenAI and Anthropic (valued at tens of billions) train their models on industrially exploited content without compensating the creators, the disparity in treatment raises questions. Perhaps the key isn't who archives, but who has enough money to do it through strategic donations. Tech marketing calls it "democratizing access to knowledge." Media lawyers call it "theft." Perhaps the answer is archived on some cloud server, waiting for someone to leak it with the right prompt.
