Artificial Intelligence & Machine Learning
Data Is King. It Is Also Often Unlicensed or Faulty
Study Found Over 70% of Analyzed Datasets Omitted Licensing Information
Artificial intelligence is no longer a buzzword. Nearly all businesses - and their employees - use the technology in some capacity, driving an unprecedented pace of AI adoption. The AI market is projected to reach $407 billion by 2027. Attribute it to the vast trove of datasets used to train large language models, or LLMs, such as ChatGPT to make them responsive, accessible and tailor-made.
But there's a caveat. LLMs are trained on massive amounts of data drawn from publicly available sources, including Wikipedia, news sites, blogs, publications and Common Crawl, which hosts a large corpus of webpages - 250 billion pages, to be precise. The problem lies in the datasets themselves: how they are licensed, sourced and used.
A report published in the journal Nature Machine Intelligence presents a large-scale audit of dataset licensing and attribution in AI, analyzing more than 1,800 datasets used to train AI models on platforms such as Hugging Face. The study revealed widespread miscategorization: over 70% of datasets omitted licensing information and over 50% contained errors. In 66% of cases, the licensing category was more permissive than the authors intended.
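The scale of the problem is easy to probe, at least crudely. Below is a minimal sketch, assuming the `huggingface_hub` Python library and a hypothetical sample of dataset IDs, that flags repositories whose self-declared metadata omits a license - with the caveat the study itself raises: a declared tag can still misstate the terms the dataset's authors intended.

```python
# Minimal sketch of a small-scale license audit: flag Hugging Face dataset
# repos whose metadata declares no license. Dataset IDs are illustrative only.
from huggingface_hub import HfApi

api = HfApi()
dataset_ids = ["squad", "glue", "some-org/some-dataset"]  # hypothetical sample

for dataset_id in dataset_ids:
    try:
        info = api.dataset_info(dataset_id)
    except Exception as exc:  # repo may be private, renamed or deleted
        print(f"{dataset_id}: lookup failed ({exc})")
        continue
    # Hugging Face exposes self-declared licenses as "license:<id>" tags.
    licenses = [t.split(":", 1)[1] for t in (info.tags or [])
                if t.startswith("license:")]
    if licenses:
        print(f"{dataset_id}: declared license(s): {', '.join(licenses)}")
    else:
        print(f"{dataset_id}: no license declared")
```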
The report warns of a "crisis in misattribution and informed use of popular datasets" - datasets that are driving recent AI breakthroughs but also raising serious legal risks.
"Data that includes private information should be used with care because it is possible that this information will be reproduced in a model output," said Robert Mahari, co-author of the report and JD-PhD at MIT and Harvard Law School.
In the vast ocean of data, licensing defines the legal boundaries of how data can be used. Particularly in industries such as healthcare and finance, where sensitive data is involved, correct licensing ensures compliance with sector-specific regulations.
"The rise in restrictive data licensing has already caused legal battles and will continue to plague AI development with uncertainty," said Shayne Longpre, co-author of the report and research Ph.D. candidate at MIT. "There is uncertainty with which licenses need to be heeded and which do not. This is complicated by jurisdiction, who the copyright owner is, and the frequent re-packaging and re-licensing of datasets by downstream developers."
The Copyright Conundrum: Is AI a Derivative Work?
Copyright law grants data creators the exclusive right to make derivatives of their work. The legal question is whether a trained ML model counts as a derivative work of its training data. To arrive at an output, AI crawls through existing data, making a "digital copy of the underlying data." If the crawled data is copyrighted, using it to train LLMs can have legal consequences.
"It remains challenging to determine which data is problematic vs. permissive. Data opacity also prevents better understanding of model safety, security and privacy," Longpre said.
In December 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, alleging their LLMs were built by copying millions of its copyrighted news articles. In January 2023, Getty Images accused Stability AI, creator of the image-generation model Stable Diffusion, of using millions of images from Getty's database to train its model without permission or compensation. Earlier this year, a California court partially dismissed a copyright case against OpenAI brought by several authors, including comedian Sarah Silverman, who alleged OpenAI violated the Digital Millennium Copyright Act by removing copyright management information.
But while there have been lawsuits, many news organizations have instead partnered with OpenAI to retain more control over their data. The Financial Times signed a licensing agreement with OpenAI that enables ChatGPT users to "see select attributed summaries, quotes and links to FT journalism in response to relevant queries." Axel Springer was the first publishing house to partner with OpenAI, allowing its content to be used to train OpenAI's LLMs. News Corp and Le Monde have struck similar agreements.
Broader Industry Impact
Further exploration of these legal loopholes reveals the role of third parties. OpenAI, for example, imposes terms and conditions that restrict how its APIs can be used. Those terms act as a contract between OpenAI and its users, who must adhere to the restrictions to avoid violating intellectual property laws. But contractual terms are not the same as copyright or licensing law: third parties who never agreed to OpenAI's terms may still use its content, leaving them less accountable for data transparency.
This lack of clarity also creates a legal gray area in which it is unclear who owns the data generated by OpenAI. Licensing uncertainty can have a "chilling effect" on innovation: businesses may slow AI adoption and scale back R&D for fear of potential lawsuits.
"Greater transparency by AI developers is an important next step, but developers fear that it will encourage copyright lawsuits … so [the developers] have abstained from transparency," Longpre said.
Faulty Datasets = Faulty LLMs
Besides the legal and ethical ramifications of using unlicensed or miscategorized datasets for training LLMs, erroneous datasets can compromise their accuracy and effectiveness. The researchers said faulty and unlicensed data can lead to "data leakages between training and test data, expose personally identifiable information, present unintended biases or behaviors, and result in lower quality models."
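One failure mode the researchers name - leakage between training and test data - is concrete enough to check for directly. A minimal sketch, using hypothetical plain-text examples and exact-match detection only (real audits would also need near-duplicate checks), follows:

```python
# Minimal sketch: detect exact-duplicate leakage between train and test splits
# by hashing whitespace-normalized, lowercased records. Data is hypothetical.
import hashlib

def fingerprint(text: str) -> str:
    """Hash a normalized record so trivially reformatted duplicates match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

train = ["The quick brown fox.", "LLMs are trained on web data."]
test = ["LLMs are trained on  WEB data.", "An unseen example."]

train_hashes = {fingerprint(t) for t in train}
leaked = [t for t in test if fingerprint(t) in train_hashes]

print(f"{len(leaked)} of {len(test)} test examples also appear in training data")
for example in leaked:
    print("leaked:", example)
```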
Such failures are not unprecedented. IBM Watson was touted as revolutionizing healthcare with AI by transforming medical diagnosis and treatment. But trained on faulty data and a limited set of medical databases, the AI gave inaccurate or irrelevant recommendations, marring Watson's credibility in the healthcare space before IBM ultimately sold off the business in 2022.
How 'Fine' Can Fine-Tuning LLMs Be?
A key part of training an LLM is fine-tuning, in which the model is refined on targeted datasets and evaluated with human feedback to produce higher-quality responses. But manually vetting every dataset used to train an LLM is practically infeasible.
"AI fine-tuning data has become more restricted in recent years, which may affect how data can be used," Mahari said. "While 'fair use' is often used to justify the use of web-based pre-training data for AI, using fine-tuning data, which was created for the sole purpose of training AI, is not likely to qualify as fair use."
There is also a growing need for clearer licensing frameworks specific to AI training data. AI researchers have advocated consent-based frameworks in which data owners can approve or reject the use of their work for LLM training.
To mitigate legal risks, model developers need to put safeguards in place, but that requires knowing the origin, or provenance, of their data. Tools such as the Data Provenance Explorer, an open-source repository, let developers download, filter and explore data provenance and characteristics, providing detailed information on hundreds of fine-tuning datasets.
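The workflow such tools support can be illustrated simply. Here is a minimal sketch, using hypothetical records rather than the Data Provenance Explorer's actual schema or API, that filters a pool of fine-tuning datasets down to those with a permissive declared license:

```python
# Minimal sketch: screen fine-tuning datasets by declared license before
# training. The records and license set below are hypothetical and do not
# reflect the Data Provenance Explorer's actual schema or API.
PERMISSIVE = {"mit", "apache-2.0", "cc-by-4.0", "cc0-1.0"}

datasets = [
    {"name": "helpful-dialogues", "license": "apache-2.0", "source": "crowdsourced"},
    {"name": "news-summaries", "license": "cc-by-nc-4.0", "source": "scraped news"},
    {"name": "qa-pairs", "license": None, "source": "unknown"},
]

# Treat a missing license as non-permissive: unknown terms are a legal risk.
usable = [d for d in datasets
          if str(d.get("license") or "").lower() in PERMISSIVE]

print(f"{len(usable)} of {len(datasets)} datasets carry a permissive license")
for d in usable:
    print(f"  {d['name']} ({d['license']}, {d['source']})")
```

Note the conservative default: a dataset with no declared license is excluded rather than assumed safe, mirroring the study's finding that missing or miscategorized licenses are the norm rather than the exception.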
But as AI evolves, so should legal frameworks and regulations. "Legal clarity regarding when data can be used and how is needed urgently to balance AI innovation with incentives for continued human expression and compensation for content creators," Mahari said.