GPT-4 memorizes contents of copyrighted books, and it could be a cultural issue


A study points to another potential copyright problem and cultural challenge of today’s large language models: The more famous and popular a book is, the better the language model memorizes its content.

Researchers at the University of California, Berkeley, tested ChatGPT, GPT-4, and BERT for their ability to memorize books. According to the study, the language models memorized “a wide collection of copyrighted materials.” The more often the content of a book is found on the web, the better the language model memorizes it.

Book archeology in a large language model

According to the study, OpenAI’s models are particularly good at memorizing science fiction, fantasy, and bestsellers. These include classics such as 1984, Dracula, and Frankenstein, as well as more recent works such as Harry Potter and the Philosopher’s Stone.

The researchers compared Google’s BERT with ChatGPT and GPT-4 because BERT’s training data is known. To their surprise, they found that “BookCorpus,” a training set of supposedly free books by unknown authors, also included works by Dan Brown as well as Fifty Shades of Grey. BERT memorized information from these books because they were part of its training data.


Image: Kent K. Chang et al.

The more often a book appears on the web, the more thoroughly a large language model memorizes it, the researchers write. They tested memorization with fill-in-the-blank prompts in which ChatGPT and GPT-4 had to supply a masked proper name.

You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.

Example:

Input: Stay gold, [MASK], stay gold.
Output: Ponyboy

Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: Gerty

Input: My back’s to the window. I expect a stranger, but it’s [MASK] who pushes open the door, flicks on the light. I can’t place that, unless he’s one of them. There was always that possibility.


Example prompt
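To make the test procedure concrete, here is a minimal Python sketch of how such a fill-in-the-blank test could be assembled and scored. It is an illustration under stated assumptions, not the researchers' code: the function names `build_prompt` and `cloze_accuracy` are hypothetical, the `complete` callable stands in for whatever model API is used, and scoring by exact string match is an assumption.

```python
from typing import Callable, Iterable, List, Tuple

# Instruction text as quoted in the article's example prompt.
INSTRUCTIONS = (
    "You have seen the following passage in your training data. "
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word). "
    "You must make a guess, even if you are uncertain."
)

# Few-shot examples adapted from the example prompt in the article.
EXAMPLES: List[Tuple[str, str]] = [
    ("Stay gold, [MASK], stay gold.", "Ponyboy"),
    ("The door opened, and [MASK], dressed and hatted, "
     "entered with a cup of tea.", "Gerty"),
]


def build_prompt(passage: str) -> str:
    """Assemble instructions, few-shot examples, and the masked passage."""
    lines = [INSTRUCTIONS, "", "Example:"]
    for text, name in EXAMPLES:
        lines += [f"Input: {text}", f"Output: {name}", ""]
    lines += [f"Input: {passage}", "Output:"]
    return "\n".join(lines)


def cloze_accuracy(
    items: Iterable[Tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Share of passages where the model's guess equals the masked name.

    `complete` is a hypothetical stand-in for a model call: it takes a
    prompt string and returns the model's text answer. Exact match after
    stripping whitespace is an assumed scoring rule.
    """
    pairs = list(items)
    if not pairs:
        return 0.0
    correct = sum(
        1 for passage, gold in pairs
        if complete(build_prompt(passage)).strip() == gold
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Stub model for demonstration only; a real test would call an LLM here.
    def stub_model(prompt: str) -> str:
        return "Gerty"

    demo_items = [
        ("The door opened, and [MASK] entered.", "Gerty"),
        ("It was [MASK] who pushed open the door.", "Nick"),
    ]
    print(cloze_accuracy(demo_items, stub_model))
```

Computing this accuracy separately for each book would give a per-book score. A high score for a given book suggests the model has seen its text, since a masked name is hard to guess from a short passage alone.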

Memorization determines how well the language model performs downstream tasks about a book: the better known a book is, the more likely the model is to succeed at tasks such as naming its year of publication or correctly identifying its characters.

Image: Kent K. Chang et al.

Language models as a tool for cultural analysis may suffer from narrative bias

The researchers are not primarily concerned with copyright issues. Rather, they are concerned with the potential opportunities and problems of using large-scale language models for cultural analysis, particularly the social biases caused by common narratives in popular science fiction and fantasy works.

Cultural analysis research could be heavily influenced by large-scale language models. Because the models perform differently depending on how well a book is represented in the training material, this could introduce bias into the research.


A list of popular books represented in the language models is available here. The code used and more data from the study are available on GitHub.

As with image models, whether book citations become a copyright issue will likely depend on how closely the texts generated by the model match those of the books in the dataset. This will have to be decided in court.
