Anthropic trained Claude on millions of scanned books plus 7M pirated ones — yet shadow libraries carry few Taiwan/HK titles, so law-abiding, niche cultures risk vanishing from AI’s worldview. Open civic media is the lifeline; pay to keep it alive.
In November 2022, OpenAI launched ChatGPT and raised the curtain on generative AI. Its user numbers have led the field by a wide margin ever since, and ChatGPT became a byword for AI itself. Three and a half years on, ChatGPT still commands the largest market share, but the far less famous Claude has caught up with it in revenue, and Claude’s developer, Anthropic, may even beat OpenAI to an IPO.
How Claude Was Forged
Anthropic has always occupied the moral high ground, emphasising the safe deployment of AI and registering as a Public Benefit Corporation (PBC) rather than an ordinary for-profit C Corporation. The company, ever insistent on the proper use of AI, frequently accuses Chinese companies of distilling Claude — acquiring American research and development on the cheap. But if Chinese models were distilled from Claude, how did Anthropic train Claude in the first place? A copyright lawsuit offers some clues.
In August 2024, three American authors sued Anthropic for using vast quantities of both legitimately purchased and pirated books to train its large language model, Claude. Court proceedings revealed that in February 2024, Anthropic poached Tom Turvey, the partnerships lead on Google’s book-scanning project. The company then bought up enormous numbers of mostly second-hand print books, scanned them, extracted the text with optical character recognition (OCR), and destroyed the originals. At the same time, Anthropic downloaded more than seven million pirated books from shadow libraries such as Library Genesis.
In June 2025, a federal judge ruled that scanning and destroying legitimately purchased print books amounted to format shifting, and that digesting huge volumes of text to learn the structure of language — conceptual machine learning — was highly transformative. Both, the court held, were fair use permitted under the law.
As for the pirated books: even if the training itself was lawful, acquiring and retaining illegal data sources was not. In late 2025 the two sides reached a settlement, with the authors receiving more than US$1.5 billion. The case was closed — and several million legitimate print books, plus seven-million-odd pirated e-books, became the nourishment of today’s leading large language models.
Ordinary people like me have no real sense of what US$1.5 billion means. For reference: in September 2025, shortly before the settlement, Anthropic had just raised US$13 billion. And according to estimates reported by the Wall Street Journal, Anthropic’s revenue in the first quarter of 2026 was US$4.8 billion, with the second quarter expected to reach US$10.9 billion — more than doubling.
The More Law-Abiding, the More Absent
On the subject of pirated books, let me share a personal experience. A few years ago, to free up space and simplify my life, I decided to part with a batch of print books, while trying to keep digital copies of each. Working through my own book list, I searched online for pirated versions one by one. The sources were much the same as those Anthropic used — shadow libraries such as Library Genesis. The result: most English-language and mainland Chinese books could be found, but pirated copies of Hong Kong and Taiwan books were few and far between.
From this little story, some will read pride: copyright awareness in Taiwan and Hong Kong far exceeds that on the mainland. Others will read “underlying logic”: the Taiwan and Hong Kong markets are simply too small for piracy to reach commercial scale. Either way, when legitimate copies dominate and piracy never takes hold, rights holders win.
This essay’s concern, however, lies on another level: the fact that shadow libraries like Library Genesis carry hardly any Hong Kong or Taiwan editions means that, among the material Claude learned from, Chinese-language books come overwhelmingly from the mainland — missing the local perspectives of Hong Kong and Taiwan. In a market this fiercely competitive and this transparent, the other leading large language models are very likely in a similar position.
The more law-abiding, the more refined, the more niche a cultural sphere is, the more easily it ends up structurally absent from AI’s collective unconscious — a deeply ironic yet entirely real cultural phenomenon. To say that Taiwan and Hong Kong may disappear in the age of AI is not scaremongering but genuine anxiety: fewer and fewer people pick up books, more and more rely on AI for their information, and yet Taiwan and Hong Kong editions are all but absent from AI’s training material. How can that not be cause for worry?
Free as in Freedom
Of course, the training material for large language models goes beyond books: public web pages, forums, code and more. PTT, Dcard, Bahamut, LIHKG, HKGolden, open Substack newsletters and blogs — all of it is fodder for AI. The absence of books — the most information-dense of long-form texts, cultural expression distilled over years — may well affect the deeper frameworks of narrative and values. Yet mainstream AI can still read and write traditional Chinese, including Cantonese, idioms, even memes, and can keep its finger on the pulse of the times.
In other words, Taiwan and Hong Kong still exist, for now, in AI’s worldview — and the key is the open internet. Civic media outlets that bear heavy operating costs yet insist on keeping all their content open — Hong Kong’s The Collective (集誌社), InMedia (獨立媒體), Court News (庭刊), The Witness (法庭線), HK Feature (誌) and Hong Kong Free Press, and Taiwan’s The Reporter (報導者) — not only provide open content directly to readers, but also serve as nourishment for machine learning, supplying the major AIs with Hong Kong and Taiwan’s local perspectives, and thereby serving the public indirectly.
The point of open information is freedom, not free of charge. Under the shadow of authoritarianism, every one of these civic media outlets struggles to keep going; their daily reporting on society can never be taken for granted. For a society to ignore the importance of paid support is slow suicide. If it does, then one day not only will the civic media fold one by one, but Taiwan and Hong Kong themselves will disappear in the age of AI.
Support The Collective’s 1:1 Matching Campaign
What matters is action. Before the end of this month, buy the e-book The Collective: Our Records on the Ground (《集誌——我們在地記錄》, HKD 140) at 3ook.com, and add any extra amount of support you wish. I will personally match, 1:1, both the book price and any additional amount, paid directly to The Collective, up to a cap of HKD 120,000.
Example: if 100 people buy The Collective: Our Records on the Ground at 3ook.com before the end of the month, each adding HKD 1,000 in support → I will match 140 × 100 + 1,000 × 100 = HKD 114,000.
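For anyone who wants to check their own scenario, the matching arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `matched_amount` is made up for this sketch, every buyer is assumed to add the same extra amount, and the HKD 120,000 cap is assumed to apply to the total match.

```python
BOOK_PRICE_HKD = 140      # e-book price stated in the campaign
MATCH_CAP_HKD = 120_000   # stated cap; assumed here to apply to the total match


def matched_amount(buyers: int, extra_per_buyer_hkd: int = 0) -> int:
    """Return the 1:1 match in HKD for a given number of buyers.

    Hypothetical helper: assumes every buyer adds the same extra amount.
    """
    # Both the book price and the extra support are matched 1:1.
    uncapped = buyers * (BOOK_PRICE_HKD + extra_per_buyer_hkd)
    # The match never exceeds the cap.
    return min(uncapped, MATCH_CAP_HKD)


print(matched_amount(100, 1_000))  # 114000, the example above
print(matched_amount(200, 1_000))  # 120000, the cap is reached
```

In the second call the uncapped figure would be HKD 228,000, so the match stops at the cap.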
p.s. Not wanting to steal the spotlight from the civic media in the main text, let me say this quietly down here: DHK Newsletter aims to provide web3 civic education and to supplement discourse beyond the official narrative. It, too, keeps all its articles fully open — and needs your support.

