
Can Content Licensing Opportunities for AI Training Last?

Data licensing for AI training has rapidly emerged as a fast-evolving market over the last couple of years. Since 2023, AI companies have pursued licensing deals with media rights holders to secure access to content and use it as high-quality data to train AI models across modalities, including text, image, music and video.
Thirty deals between content owners — including news media publishers, stock image companies and others — and AI developers have been publicly confirmed, according to VIP+ tracking. Meanwhile, more deals are occurring privately.
SEE ALSO: Complete Updated Index of Content Licensing Deals Struck With Publishers
Yet a bigger question has lingered about the future and longevity of licensing content for AI training. Is this type of licensing just a brief window of opportunity for creators and rights holders to monetize their content — or can the licensing market for AI training data persist and grow longer term?
“I’ve asked multiple AI companies for their opinions on this. One told me very bluntly they think demand will only be a short-term phenomenon. But most think [licensing AI training data] in some form will continue at scale for a very long time,” said Dave Davis, general manager at Protege Media, the audiovisual content licensing arm of AI training data platform Protege.
VIP+ sources agreed that perspectives are divided among both rights holders and AI developers — with one camp seeing content licensing for AI training as a “one-shot” opportunity that would only last a couple years, while the other camp expects the market to be ongoing and to grow.
SEE ALSO: The VIP+ Special Report ‘Generative AI & Licensing’
Wait-and-see mindsets are starting to fade, as rights holders consider the potential precariousness of the licensing opportunity. “People see [licensing their data] as an attractive opportunity and don’t want to miss out or be left behind,” said Alex Bestall, CEO and founder at music licensing agency Rightsify, which operates the opt-in dataset licensing division Global Copyright Exchange. “In the beginning, a lot of people were saying, ‘Let’s wait for these court cases to settle,’ but that’s not going to be for a while.”
Definitive conclusions about licensing are difficult to draw because the underlying dependencies are still uncertain, fast-evolving and constantly subject to change.
At its simplest, a durable licensing market for human-created content to train AI will depend on consistent demand for data from developers. VIP+ sources observed that the nature of that demand has shifted over the last couple of years as licensing activity has ramped up.
These shifts suggest a few takeaways about how licensing for training data could progress:
1. Licensing demand will fluctuate with new approaches to building AI models: AI companies are expected to continually develop new model architectures or ways of building and powering AI models that could change their need for data.
Adopting more efficient approaches that reduce the need for scaled data — long thought necessary to create high-performing generalist large language models — could in turn reduce developer interest in licensing. Most notably, Chinese AI developer DeepSeek’s R1 model has undermined the scaled-data narrative: its developers claimed it was trained on less data through a process called distillation, in which a smaller model learns from the outputs of a larger one — a process one VIP+ source called a “complete sidestep of any kind of content licensing.”
“Right now, everyone’s kind of thrown off by DeepSeek,” said Bestall. “Everything changes so fast, so it’s hard to tell how it will be in a year or two. There could be a new breakthrough that shows scaling combined with a new architecture is the way forward.”
On the other hand, some new approaches would require developers to retrain their models from scratch, which could push them to pursue new licensing deals or renew existing ones to acquire high-value data deemed necessary for the model.
2. Licensing will behave differently depending on the modality: Already, sources suggested that LLMs have gotten good enough that additional licensing of English-language text isn’t likely to be needed for pre-training, outside of specialized subject matter — though news media deals will likely retain lasting importance when paired with retrieval-augmented generation (RAG) in AI-powered search.
Likewise, image generation has gotten supremely good thanks to the billions of scraped images available on the web. “Images have not been licensed as much for the last year,” a source who preferred to speak on background told VIP+. By contrast, video generation models still fail to accurately simulate the 3D real world, and developers continue to need more, higher-quality data.
SEE ALSO: How Licensing Deals Look in Training AI for TV & Film
3. Data licensing will be less of a volume game and more about specialization: More than data scale, some researchers have argued performance gains are coming from the quality of the data that’s used to train models, including the accuracy and completeness of annotations (labeling) applied to the data.
Even though most AI developers claim that training on any publicly available content shared online is fair use, both Davis and Bestall noted companies have been willing to license in order to access higher-quality, more specialized data that’s otherwise inaccessible — data they can’t get from scraping the internet, whether because it’s privately stored or doesn’t exist in sufficient quantities. These focused datasets have been used for different purposes, including fine-tuning, filling “gaps” in training (e.g., subject matter diversity, such as pictures of things not well represented in stock libraries) and improving the quality of outputs.
“It’s less about how many hours of data and more about which categories and how much in each category,” said Bestall.
“Instead of needing 50,000 hours of movies, [developers] might just need 1,000 shots of very specific stuff, like horses running across a prairie,” said Davis. “Most of these models are training to be able to output HD 1080p quality video. As they try to improve pixel quality, they might want to train only on HDR or Dolby Vision content.”
Bestall argued that longer-term content licensing for AI would be more of a “long-tail business, with many clients needing smaller, specialized datasets” rather than just a handful of AI companies licensing large-scale datasets to pre-train large models.
“Instead of five to 10 companies acquiring lots of data, it’ll be a lot of companies acquiring smaller amounts of data,” he said. “A lot of these companies still need large amounts of data, but as things move forward, it will be more about synthetic data and fine-tuning. We’ve seen it with some clients we’ve worked with for two years now. Their recent needs are much more specific, whereas before it was, like, we need as much as possible.”
4. Increased use of synthetic data could blunt demand for human-created data, offset by the risk of model collapse: The rising use of synthetic (AI-generated) data to train models is a threat to the licensing of human-created data. Some developers think synthetic data can train models as well as human-created data can — and that, increasingly, at least some portion of the data used in pre-training will be synthetic going forward. The production of synthetic data from a model trained on copyrighted works has been referred to as a form of copyright laundering, since it disguises the original source material used to create it.
Yet even as synthetic data use rises, sources expected developers to continue to need human-created data for several reasons. First, few models are trained entirely on synthetic data, meaning training will be done with a mix of human- and machine-made data.
Second, researchers have warned that overtraining on low-quality synthetic data can result in a phenomenon called model collapse, where a model’s outputs degrade into incoherence. This would be a particular obstacle for video models, as AI-generated video still often contains errors. Further, human-made data will always be priced at a premium to synthetic data, Bestall noted.
SEE ALSO: Why Synthetic Data Will Transform AI Licensing
5. Licensing will continue among developers building “clean-data” models: Though still rare, developers such as Moonvalley and Adobe have prioritized training AI models exclusively on “ethically sourced” data. The definition of ethical data for AI training is not established industrywide, but it generally means data used with consent — in practice, data that’s owned, licensed or in the public domain. For example, the nonprofit Fairly Trained has been certifying AI models that meet such criteria for data sourcing.
SEE ALSO: A Complete Updated Index of Content Owner Lawsuits Against AI Companies
6. Licensing will hinge on the legality of AI training on data without permission: While convictions fall on either side of the fair use debate, no litigation outcome, legislation or regulation has yet concretely established a legal basis requiring content licensing to occur. As courts begin to rule on infringement lawsuits in the near term — sources highlighted the Thomson Reuters v. Ross Intelligence ruling that favored the publisher — legal doubt could spur AI companies to strike deals to avoid added liability. But the reverse is also possible as the U.S. and other jurisdictions advance pro-AI policies. “Since the new administration came in, people are more confident in the fair use angle, even though the Thomson Reuters case was ruled on,” said Bestall.
