Big tech AIs like ChatGPT have ingested millions of books. That allows them to summarise novels, or even imitate an author's style. But experts say the key purpose of the exercise is to teach the AIs how to write and deliver more fluent responses.
Well-known Kiwi writers are among the authors whose works have been dragged into legal action against Facebook owner Meta, claiming it used their books, without their permission, to train its artificial intelligence (AI) system - and pirated copies, to boot.
A lawsuit filed in California by comedian-memoirist Sarah Silverman and horror writers Richard Kadrey and Christopher Golden claims Meta used the “Books3” database, which consists of pirated copies of some 90,000 titles, to train its “Llama” artificial intelligence. “Llama” stands for Large Language Model Meta AI. It’s a “generative” artificial intelligence system, like OpenAI’s popular ChatGPT, but at this point is mainly being pitched to researchers.
Books3 includes copies of books by writers of global best-sellers such as Stephen King, JK Rowling, Margaret Atwood and George RR Martin, plus numerous works by New Zealand authors including Eleanor Catton, Elizabeth Knox, Emily Perkins, Alan Duff and Albert Wendt, among many others. The degree of use of Books3 is an open question, but all the big tech firms have been open about using sources such as Project Gutenberg - a database of copyright-expired and copyright-free titles - for AI training.
Meta declined to offer anyone for interview, but said in a statement to the Herald, “Llama 2 is trained on datasets from publicly available sources. These sources contain a range of materials and may include materials protected by copyright. We respect third-party IP [intellectual property] rights and believe our use of the materials are consistent with existing law.”
A second lawsuit, filed in New York on September 20 by authors including John Grisham, Jonathan Franzen and, again, George RR Martin, also targeted what it saw as the illegal use of Books3 for AI training by Meta and its partners Microsoft and Bloomberg.
It cited growing concerns that authors could be replaced by systems like OpenAI’s ChatGPT that “generate low-quality ebooks, impersonating authors and displacing human-authored books”.
Today, if you’re researching an essay, you can ask the likes of ChatGPT or Google’s Bard to give you the plot of a book, summarise critical or academic reaction, or - as Grisham and company fear - write a short story in the style of an author (and we’ll have author Elizabeth Knox’s verdict on ChatGPT’s efforts on that front shortly).
But Victoria University AI expert Dr Andrew Lensen explains that’s not the core reason why the big AI systems have ingested hundreds of thousands of novels.
Rather, books play a crucial role in the training of “generative” AI systems like ChatGPT and Bard, helping them to deliver more human-sounding responses to any query. “Generative” AI can create words, images, videos or computer code when prompted with questions, drawing on databases of content from various sources, plus their users’ responses to previous replies.
Beyond thousands of books, version three of ChatGPT was also trained using Wikipedia articles, chat logs and other data posted to the internet, according to a New York Times report. “By pinpointing patterns in all that text, this system learned to predict the next word in a sequence. When someone typed a few words into this ‘large language model’, it could complete the thought with entire paragraphs of text. In this way, the system could write its own [X] posts, speeches, poems and news articles.”
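The next-word-prediction idea described in that quote can be illustrated with a toy sketch: count which word most often follows each word in a corpus, then use those counts to “complete the thought”. This is a vastly simplified stand-in for a real large language model (the corpus and function name below are invented for illustration), but the underlying principle - predict the most likely next word from patterns in the text - is the same.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real systems train on hundreds of billions of words.
corpus = "the cat sat on the mat and the cat slept on the rug".split()

# For each word, count the words that follow it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict(word):
    # Return the word most frequently seen after `word`.
    return following[word].most_common(1)[0][0]

print(predict("the"))  # "cat" - it follows "the" more often than "mat" or "rug"
```

A real model predicts over a vocabulary of tens of thousands of word fragments, conditioned on long stretches of preceding text rather than a single word - which is where book-length training data comes in.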
But if ChatGPT wrote too much like a robot, it would all fall flat. Which is why novels are so crucial to the process.
As freelance writer and programmer Alex Reisner put it in his story for the Atlantic that exposed big tech’s use of Books3, novels “provide information about how to construct long, thematically consistent paragraphs - something that’s essential to creating the illusion of intelligence. Consequently, tech companies use huge data sets of books, typically without permission, purchase, or licensing”.
As an open letter from the Authors Guild, signed by some 8000 writers, put it, “You’re spending billions of dollars to develop AI technology. It is only fair that you compensate us for using our writings, without which AI would be banal and extremely limited.”
In court documents filed in support of a bid to dismiss the authors’ claim, Meta said neither outputs from the company’s generative AI nor the model itself are “substantially similar” to existing books.
Copyright laws don’t protect authors
Experts say this “using books for training, not regurgitating them” argument stands up (although there are some other possible angles of attack, which we’ll get to).
“Our copyright laws do not protect authors from AI,” Lensen said.
“Copyright prevents copying – but AI systems learn from data, such as books, rather than directly copying them. Much like how [Game of Thrones author] George RR Martin was inspired by JRR Tolkien, an AI system can be ‘inspired’ by books it reads.”
Certainly, the authors who filed the lawsuit in California have found it an uphill battle so far. On November 10, US District Judge Vince Chhabria said he would grant Meta’s motion to dismiss the authors’ allegations that text generated by Llama infringes their copyrights. But he also indicated he would give the authors permission to amend most of their claims, in what the Reuters news agency called a “trimmed” version of the original lawsuit.
Graeme Cosslett, the immediate past president of the Publishers Association of New Zealand (Panz), told the Herald he was not aware of any New Zealand author or publisher joining the legal actions against the AI makers.
‘Fair use’ defence
Ian Finch, a partner with James & Wells, a law firm that specialises in intellectual property issues, said: “New Zealand authors are automatically entitled to copyright protection in over 180 countries - including the United States - under the Berne Convention. This also includes moral rights, which is the right to be attributed to your work and in some cases object to its modification, distortion or mutilation, which could be considered prejudicial to the author’s honour and reputation.”
Finch added: “Since the training of these AI systems appears to have occurred in the US, the question of whether the act of training the models using the pirated material constitutes copyright infringement will likely be subject to US copyright laws.”
“Under US copyright law, ‘Fair Use’ is a common defence to copyright infringement, and one that Meta has flagged in their motion to dismiss the case. In particular, it is argued that the training of the AI models is a transformative use of the work and does not reproduce the source works. It therefore is argued that the use of the works constitutes Fair Use under US law.”
Given that the US Fair Use defence is quite permissive, one option for New Zealand authors would be to assert their rights against the use of these models in New Zealand where the copyright law is much more restrictive, Finch said.
But this would still involve a qualitative test rather than a more straightforward quantitative benchmark (such as a set percentage of words or sentences copied). Whether a significant line or paragraph from a novel had been closely imitated would be in the eye of the beholder.
Finch also added an interesting wrinkle: “Since these large language models are prone to hallucinations, biases and errors and do not attribute the source material, in some cases their outputs could constitute a breach of the authors’ moral rights. Particularly if they provide incorrect or defamatory information about the works in question.”
Logistical, financial challenges
“As always, there needs to be a trade-off between the costs of enforcing authors’ rights and the potential benefits of having these materials removed from the training data sets. We suspect that many authors will wait to see the outcome of the pending cases before deciding which course of action is appropriate in their situation,” Finch said.
Litigation would probably drag on and be expensive. Meta and other firms with deep pockets could afford an extended fight, Finch said. Publishers tend to have fewer resources.
And while New Zealand might have tighter copyright laws than the US, big tech firms could still argue for cases to be heard in their home jurisdiction. In a 2012 High Court hearing in Auckland, a psychiatrist - known only as “A” - wanted Google NZ held accountable as a publisher. The case was dismissed by Associate Judge David Abbott after Google argued its NZ office was primarily a sales operation. “A” would have to pursue Google NZ’s parent, Google Inc (now Alphabet), in the US. The lawyer for “A” said his client could not afford a fight in the US courts.
But can AI write?
The Herald prompted ChatGPT to “Write a short story in the style of The Vintner’s Luck by Elizabeth Knox”.
The first three paragraphs:
In the heart of a picturesque French village nestled between rolling vineyards, there lived a vintner named Lucien. His vineyard, Maison d’Etoiles, was known throughout the region for producing the finest wines, but Lucien was a man who believed that his success was not solely the result of his own labour, but rather a gift from the heavens.
One crisp autumn morning, as the first rays of dawn kissed the golden leaves of his grapevines, Lucien had a dream unlike any he had ever experienced. He found himself standing in his vineyard, surrounded by a soft, ethereal light. In front of him stood a radiant woman, her long hair as black as the midnight sky, and her eyes shimmering like the stars.
“You have been chosen, Lucien,” she whispered, her voice as melodic as a symphony, “to receive a blessing from the vintner’s angels.”
Bad at fiction, uncannily good at blurbs
“Ew,” Knox told the Herald. It was bad enough that she saw it as evidence ChatGPT had not ingested her work. “It probably made an aggregate of inspirational guardian angel stories floating about the internet, with maybe some promotional material about the movie thrown in. Its bodiless, ingratiating, glib ‘life-affirming’ noises don’t have the novel or my style as their source.”
“Is there an audience for writing like that?” asked Victoria University Press publisher Fergus Barrowman.
“It takes a few of the very basic elements then adds appallingly cliche elements and soapy self-help elements.
“It’s worse than bad fan art by a long measure, and shows why AI is a long way from being a threat to literary fiction.”
Having said that, Barrowman added that his publishing house had asked ChatGPT to create promotional blurbs for a couple of poetry books. “And that was uncanny. Depressingly, the blurbs were almost there.”
Closer to the bone
The Herald also asked ChatGPT to write a short story in the style of Once Were Warriors author Alan Duff (it declined to write a novel).
We’ll spare you anything more than the opening paragraph, which was clunky to the point of being laughable:
The old Holden rumbled along the gravel road, its tires crunching the loose stones beneath them. In the driver’s seat, Jack Daniels gripped the steering wheel, his knuckles white against the dark leather. He hadn’t been back to this part of New Zealand in years, and the memories were flooding back like a tidal wave.
Given the same prompt, Google’s Bard chatbot produced a short story it called The Lost Ones, which mirrored many elements of Duff’s Warriors, including the names of the original novel’s characters.
The first three paragraphs:
The Heke whānau lived in a small, run-down house on the edge of town. The house was always full of noise and chaos, with the six Heke children running around and their parents, Jake and Beth, arguing constantly.
Jake was a violent man, and he often beat Beth and the children. Beth was a heavy drinker, and she often passed out on the couch, leaving the children to fend for themselves.
The eldest Heke child, Grace, was a bright and intelligent girl. She tried her best to keep the family together, but it was a difficult task. Her younger siblings were often hungry and neglected, and they often turned to crime and violence in order to survive.
Exploited labour?
Knox said a class action against big tech AI systems “might be argued from laws around the exploitation of labour rather than copyright. I mean, that’s what I’d do if I [were] the lawyers”.
But that argument could be hard to make as well.
“The use of published text for training these models at this scale is something that our regulations are not at all prepared for,” said Victoria University’s Lensen.
“Some will use the analogy that authors themselves learn to write based on other authors’ writing.
“But I think that’s a pretty disingenuous argument – after all, no author is reading 10,000 books an hour.
“I’d love to see compensation mandated. That wouldn’t fix the root problem of human creativity being diminished, but it is some sort of silver lining that provides some recognition of the hard work by authors.”
‘Responsibility to protect our authors’
Lawmakers have been slow to grapple with AI. The European Union passed a draft Artificial Intelligence Act in June, which will require big tech firms to disclose more about how they train their AI, and subject AI to risk tests. How this works in practice will have to wait until the final version of the AI Act passes - which could be next month.
In the United States, US President Joe Biden issued an executive order on October 30 to establish new standards to maximise the benefits of AI while safeguarding against its risks.
A number of US government departments are now charged with creating those standards, on an open timetable. But at least the process is under way.
Australia’s Budget 2023 allocated A$101.2 million ($108.6m) to a critical technology fund to help create governance rules for AI and support small and medium-sized enterprises’ adoption of AI technologies.
Here, “our Government has a responsibility to protect our creative industry, including our distinctive Kiwi authors”, Lensen said. “Yet, none of the major parties have an AI strategy or policy.”
“This technology is impacting society at a rapid rate, and we need our political leaders to take it seriously so it is used in a responsible and equitable way.”
While neither Labour nor National included AI policy in their election platforms, presumptive Technology Minister Judith Collins has consulted widely on the field. But when the Herald asked if she could give any hint on where the incoming Government might head on AI, or other tech issues like cyber security, she said: “We have policy work done on the areas mentioned but have decided these are probably best undertaken with the full resources of Government.”
Maybe ChatGPT could offer some advice on how to close a coalition deal.
Licensing deals the way forward?
For Cosslett, the answer lies in existing New Zealand and international laws.
“There is a clear way forward - the solution to this is licensing,” Cosslett said.
“If AI companies wish to train their tools with books under copyright, they have a legal and ethical responsibility to seek a licence from the creative owner.”
A licence agreement gives someone permission to use or reproduce copyrighted content - in some instances with a fixed annual fee covering blanket use.
“Using databases of books to train AIs comes down to whether or not the lawful rights of authors and publishers are recognised, remunerated and enforced,” Cosslett said.
“We are already seeing examples of this in some big tech firms. Adobe and Nvidia are only using licensed images for their AI training models.”
He also points to Microsoft’s “Copilot Copyright Commitment”, published in September, which included the line, “It is critical for authors to retain control of their rights under copyright law and earn a healthy return on their creations.” Copilot is Microsoft’s new “AI companion” that sits alongside its software apps, adding ChatGPT-style smarts.
That’s promising, given that Microsoft is OpenAI’s single largest backer, following its US$10 billion investment in the ChatGPT maker in January this year.
Cosslett said: “If the big tech companies have negotiated a licence to ensure legal use of books protected under copyright law, then this can be a legitimate business partnership.”
Chris Keall is an Auckland-based member of the Herald’s business team. He joined the Herald in 2018 and is the technology editor and a senior business writer.