You can now browse the pirated books Meta used to train its AI.

Unveiled: Use of Book Piracy Data for Meta's AI Training

In the current digital age, two forces are jeopardizing our access to literature: a U.S. government under the thumb of tech magnates, and some of those tech magnates themselves. On one hand, AI conglomerates like Meta are sucking up countless books from piracy sites so that their models don't serve up AI-generated nonsense. On the other hand, the Trump administration is on the verge of crippling one of the major funding sources for public libraries.

The Atlantic has led an investigation into these publicly available data troves, zeroing in on Library Genesis (LibGen), a stockpile of pirated media containing millions of books, academic papers, and articles. The Atlantic recently released a handy tool for sifting through this ocean of dodgy content, letting you search for your favorite authors and check whether their work has been used to train models from OpenAI, Mistral, and Meta.

Library Genesis, an online shadow library, holds nearly 7.5 million books and 81 million academic papers. Despite the abundance of copyrighted material, it offers invaluable support to various communities. For instance, scientists have used it to access academic works without paying exorbitant publisher fees. Even groups like the Electronic Frontier Foundation have embraced shadow libraries like Sci-Hub for their role in advancing science.

We reached out to Meta for comment but did not immediately hear back. OpenAI told Gizmodo that the AI models powering ChatGPT and its API today were not developed using these datasets. Still, the cat is clearly out of the bag, and it is all too clear who was on board the pirate ship.

Last year, a former OpenAI employee alleged that the company was flagrantly breaching copyright law. OpenAI has since defended itself in court, arguing that using copyrighted works for AI training is fair use. Tech sites like The Verge have reported that Meta wanted to use LibGen to compete with OpenAI and Mistral. Court documents from a class action suit led by comedian Sarah Silverman indicate that Meta senior researcher Melanie Kambadur said she needed books quickly, stating that books were essential for AI training because web data paled in comparison. Further documents reveal that company staff had contemplated licensing books for their AI research but opted for a pirated archive instead. One director of engineering suggested that licensing even a single book would undermine their fair use claim.

If Elon Musk sincerely believes that replacing fired staff with AI will make the government more efficient, he might want to think again. Chatbots can generate responses to prompts, but they are highly unlikely to match the capabilities of a fully staffed agency. In the end, these tech moguls could effectively wrench literature away from us, first by stealing authors' work and then by limiting our access to books altogether.

The irony is palpable as the Trump administration attempts to dismantle the financial backbone supporting public libraries while leaning on AI for services traditionally performed by humans. On March 14, Trump issued an executive order intended to grind the Institute of Museum and Library Services (IMLS) to a halt. This agency distributes funding to public libraries nationwide. Local and state taxes usually provide financial support for libraries, but many U.S. institutions rely heavily on federal grant funding for basic services. This extends to digital services offered by libraries, such as apps like Libby and Hoopla that let users borrow e-books and audiobooks from their local libraries. Jeff Jankowski, president of Hoopla Digital, warned NPR that the absence of federal funding could compel libraries to curtail or even drop their digital services entirely. Users might face longer waiting times for e-books or discover that the books they want are unavailable altogether.

So there you have it: our access to literature is under siege. Let's hope justice prevails and we're not left in a world where our favorite books have been pilfered and our access to them has been obstructed. In the meantime, keep up with the barrage of legal battles and unsettling revelations as they unfold.

Artificial intelligence conglomerates, such as Meta, have been mining pirated books from sites like Library Genesis to avoid producing AI-generated nonsense.
Powerful tech companies, including Meta, are exploiting weakened copyright protections to amass pirated literature, risking a crisis for access to literature.
The future of audiobooks, e-books, and other digital library services might be jeopardized due to recent attempts by the Trump administration to restrict federal funding for public libraries.
The contrast between the tech oligarchs' efforts to pilfer literature and their struggle to prove that training AI on copyrighted works is legal highlights the complexities and ironies of our digital age.