‘Millions’ of NYT and NY Daily News stories taken by OpenAI for training data
The New York Times, OpenAI, and Bing applications available on mobile devices. Image credit: Shutterstock/Tada Images.
News publishers say that in just three weeks of searching OpenAI's training data, they have already found millions of articles from The New York Times and the New York Daily News.
The publishers are sifting through the information made available to them to identify where their copyrighted content was used to train OpenAI's models — work they argue the tech company should be required to do itself by disclosing the information directly.
They are asking the court for a ruling that would compel OpenAI to "disclose and acknowledge" which specific copyrighted works were used to train its language models, from GPT-1 through GPT-4o.
OpenAI, the maker of ChatGPT, opposed the request, saying the publishers are seeking details on almost 20 million works referenced in the case, which it says amounts to around 500 million separate queries.
On Friday, the publishers told the court that their need for the AI company's help in reviewing the data would greatly diminish if OpenAI admitted that it used all, or nearly all, of the News Plaintiffs' copyrighted material to train its models.
A letter submitted to the court said: "Although millions of works belonging to the News Plaintiffs have already been identified in the training datasets, there's no way of knowing how many additional works might still be hidden. OpenAI, having made the decision to copy these works, should be required to disclose this information."
In December of last year, The New York Times became the first prominent news organization to initiate a copyright lawsuit against OpenAI and its collaborator Microsoft.
In April, the New York Daily News and seven other newspapers owned by Alden Global Capital filed a similar suit. The two cases have since been consolidated after OpenAI and Microsoft argued they involve nearly identical claims about the same technology.
In the recent letter, the news publishers argued that identifying which specific copyrighted works were used to train the GPT models is essential: it goes to the heart of their claims and defines the scope of the alleged infringement.
"However, News Plaintiffs and OpenAI fundamentally disagree on who should be responsible for recognizing this information."
The publishers reported that they have submitted many requests since February seeking details about the contents of OpenAI’s training datasets. In response, the tech company stated, "OpenAI will allow access to the pretraining data for the models used for ChatGPT, which will be provided for review according to an inspection protocol that will be negotiated between the involved parties, following a reasonable search."
Following extensive negotiations that began last month, the publishers have been able to inspect OpenAI's training data, but only under tight restrictions, in what court filings refer to as a "sandbox": a carefully controlled environment in which only approved tools can be run.
However, the news organizations said they have faced significant and ongoing technical problems that have prevented them from searching the data properly and determining the full extent of OpenAI's alleged infringement.
They described the process as "slow, overwhelming, and extremely costly," saying they have spent the equivalent of 27 days of lawyers' and experts' time in the OpenAI sandbox and are "still far from finished."
Financial results published by The New York Times Company on Monday showed it has already spent at least $7.6 million on the lawsuit against OpenAI and Microsoft.
OpenAI: Navigating Uncharted Data Searches
In the same filing, OpenAI responded to the publishers' complaints about the inspection process, saying the issues had either been resolved or were under active discussion. It blamed the problems on the publishers' consultants, who it said had overloaded the file system with malformed search queries.
OpenAI mentioned: “Looking at it from a broader perspective, it's clear that all parties are exploring unknown territory when it comes to discovering training data.”
The case is unprecedented, OpenAI argued, because the plaintiffs are seeking to search hundreds of terabytes of unstructured text for content that OpenAI cannot readily pinpoint itself. Under Rule 34, OpenAI offered to make the data available for inspection in its standard format rather than produce copies, and, given the sheer volume involved, built the hardware and software needed for the plaintiffs to carry out that inspection.
Specifically, OpenAI loaded hundreds of terabytes of training data into a dedicated storage system for the plaintiffs, provisioned a powerful virtual machine capable of accessing, searching, and analyzing the data, installed numerous software tools, loaded tens of gigabytes of the plaintiffs' own data on request, and set up the firewalls and secure virtual private network required for the review.
OpenAI said it will keep helping the publishers resolve technical difficulties so long as they act in good faith. However, it added, "This hasn't always happened," claiming that some publishers have delayed progress for months and made numerous unrelated requests.
Representatives from the Authors Guild and the progressive news outlet Raw Story Media have also examined the OpenAI training data in relation to their own situations.
OpenAI had previously asked a judge to compel The New York Times to hand over its reporters' private notes. The publisher warned the move would have serious and far-reaching consequences, and the request was rejected in September.
Reach out to us at [email protected] if you notice any errors, have suggestions for stories, or want to submit a letter for our "Letters Page" blog.