Thesis · April 2026

The Great Data Inversion

Why entertainment's most distressed services companies are its most valuable AI assets.

The AI training data market is $2.7 billion today and heading for $11 billion by 2030. The companies sitting on the most valuable datasets in it have no idea they have them. They think they're in the translation business.

CuriosityStream was losing money. Streaming company, documentaries, not exactly Netflix. Then someone realized the library, 210,000 hours of it, was worth more as AI training data than as a streaming service. The company went from a net loss to $8.2M in EBITDA. The content didn't change. The buyer for it did.

Reddit locked in $60 million a year from Google. Then renegotiated because the first deal was too cheap. Total AI licensing: roughly $130 million annually. Ten percent of the company's revenue. For text that users typed for free.

Shutterstock made $104 million licensing images to AI companies in 2023. By 2025 that was $203 million. Twenty-one percent of total revenue, growing faster than everything else in the business. They signed a six-year deal with OpenAI. The stock photo company is quietly becoming a data company.

Then it got strange. OpenAI offered $500 million for Medal.tv. A platform where gamers upload clips of themselves playing video games. Two billion clips a year. Ten million users. The primary asset is teenagers screen-recording their Fortnite kills.

Medal said no. Spun out an AI lab, raised $134 million from Khosla, and kept the data. Half a billion dollars on the table and they walked away because the data was worth more.

Figure: AI data licensing. Four deals. 26 months. The content didn't change. The buyer for it did. (Bars drawn to proportional scale, max = $500M.)

01. CuriosityStream (2023): 210,000 hrs of documentary content. $23.4M, 1× (baseline). "Net loss to $8.2M EBITDA. Same library."
02. Reddit · Google (2024): 1B+ posts, 16 years of human discourse. $60M, 2.6×. "First deal was too cheap. They renegotiated."
03. Shutterstock (2025): 700M+ licensed images, 6-yr OpenAI deal. $203M, 8.7×. "21% of total revenue from AI licensing. Growing fastest."
04. Medal.tv (declined): 2B clips/yr, 10M gamers, OpenAI offer. $500M offered, walked away, 21.4×. "$134M from Khosla. Kept the data. Built the lab."

Total value escalation: $23.4M → $500M in 26 months, across text, image, video, and gaming clips.

Sources: Public filings, press releases · OpenAI–Medal.tv offer reported, unconfirmed by OpenAI · Reddit renegotiated above initial $60M/yr figure · Shutterstock 6-year deal signed with OpenAI
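
The multiples in the figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the four dollar figures exactly as quoted above and treating CuriosityStream's $23.4M as the baseline. The names `deals` and `multiple_of_baseline` are illustrative only. The figures are not like-for-like (annual licensing revenue sits next to an acquisition offer), so the ratios show rough scale, nothing more.

```python
# Deal values in millions of USD, as quoted in the figure above.
deals = {
    "CuriosityStream": 23.4,
    "Reddit / Google": 60.0,
    "Shutterstock": 203.0,
    "Medal.tv (offer, declined)": 500.0,
}

baseline = deals["CuriosityStream"]


def multiple_of_baseline(value_musd: float, baseline_musd: float) -> float:
    """Return value / baseline, rounded to one decimal place."""
    return round(value_musd / baseline_musd, 1)


multiples = {
    name: multiple_of_baseline(value, baseline) for name, value in deals.items()
}

for name, multiple in multiples.items():
    print(f"{name}: ${deals[name]:.1f}M = {multiple}x baseline")
```

The rounded ratios match the 2.6×, 8.7×, and 21.4× labels in the figure.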

Text, images, video, audio. All getting licensed. And gaming services companies, the ones generating the most structured, most commercially specific operational data in entertainment, are pricing that asset at zero.

They're pricing the labor. They should be pricing the data.

This has happened before. In 2016, Quintiles merged with IMS Health. Quintiles ran clinical trials for pharma companies. IMS counted pills across pharmacies. Neither was exciting. Together, they became IQVIA. Market cap: north of $45 billion. The services produced proprietary data. The data made the services defensible. Neither half could have built the other.

That was healthcare. The same structure exists in entertainment right now. Nobody is building it.

The data nobody's pricing

So who actually owns this data?

Most people looking at gaming services stop at "translation." They see localization companies and think labor arbitrage. But a company that's localized 5,000 games across 60 languages isn't a translation vendor. It's sitting on a parallel text dataset. The same content, professionally translated into dozens of language pairs. That is exactly what multilingual AI models need for training. And the companies producing it every day have no idea what it's worth.

Each service line produces a different type of data. Each type maps to a different AI market:

| Service Line | Data Generated | AI Application | Readiness |
| --- | --- | --- | --- |
| Localization | Parallel text datasets, 60+ language pairs | Machine translation, multilingual LLMs | High |
| Trust & Safety | Toxicity classifications, moderation decisions | AI safety models, content moderation | High |
| QA / Testing | Bug patterns, gameplay friction, usability data | AI game testing, player experience optimization | Moderate |
| Voice / Dubbing | Multi-language voice recordings | Voice AI, dubbing, voice cloning | High |

Three of four are ready now. Localization data is already being traded. Trust and safety data is in massive demand: 80% of platform moderation budgets go to AI tools, and the market is growing at 26.6% CAGR toward $10 billion. Voice and dubbing connects to the $20 billion voice AI market. ElevenLabs just raised at $11 billion on $330 million in revenue. They need voice data. These companies have been recording it for years.

One caveat. This is the kill assumption, so I won't bury it: ownership. Standard localization contracts give the client ownership of the translation memory. The vendor is just holding it. Which means a localization company's text corpus might belong to its clients.

That's checkable in diligence. And it shifts the thesis toward the data types where ownership is less contested. Moderation decision logs. QA testing patterns. Voice recordings. These are operational byproducts, not deliverables. Nobody wrote a contract clause over a company's internal bug-tracking data. The localization data is still valuable if the vendor retained rights. But it's the upside case, not the foundation.

The Publicis precedent

Publicis acquired Epsilon's first-party data for $4.4 billion in 2019. Analysts were skeptical. CNBC ran the headline. By 2025, Publicis's operating margin was 14.1% versus WPP's 5.7%. Annualized ten-year return: Publicis +7.4%, WPP -9.2%. The re-rating wasn't immediate. It compounded over five years as data revenue grew. Wall Street doubted the deal, then spent half a decade watching Publicis pull away from every competitor that didn't make the same bet.
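
To see what those annualized figures mean cumulatively, compound them over the ten years. This is a minimal sketch, assuming the +7.4% and -9.2% rates quoted above apply uniformly each year. `cumulative_growth` is a hypothetical helper written for this illustration.

```python
def cumulative_growth(annual_return: float, years: int) -> float:
    """Value of 1.0 after compounding `annual_return` for `years` years."""
    return (1.0 + annual_return) ** years


# Annualized ten-year returns as quoted in the text.
publicis = cumulative_growth(0.074, 10)
wpp = cumulative_growth(-0.092, 10)

print(f"Publicis: 1.00 -> {publicis:.2f}")
print(f"WPP:      1.00 -> {wpp:.2f}")
print(f"Relative gap: {publicis / wpp:.1f}x")
```

Under that assumption, a unit invested in Publicis roughly doubles over the decade, while the same unit in WPP loses more than half its value.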

Where we are in the cycle

Every services-to-platform conversion follows the same four-phase pattern. I've tracked it across six sectors. The sequence is remarkably consistent.

Phase 1: first mover proves the concept, then flames out or goes dark. PhyMatrix in healthcare. Rolled up physician practices in the '90s, went bankrupt. Keywords Studios in gaming. 130 acquisitions, proved the consolidation model, then EQT took them private at £2.2 billion. Hipgnosis in music catalogs. Proved the asset class, then the valuation collapsed. Every one of them validated the thesis and cleared the field.

Phase 2: the vacuum. Smart operators build quietly while everyone else looks the other way. DaVita and US Physical Therapy built during the post-PPM bust from 2002 to 2010. PE had fled healthcare services. No competing bidders. Distressed multiples. Eight years of compounding without a bidding war.

Gaming services is in Phase 2 right now. Keywords proved it works and disappeared behind PE walls. 14,600 gaming employees were laid off in 2024. Structural, not cyclical. The market is $25-30 billion globally. Keywords held roughly 10% share. The other 90% is hundreds of sub-scale companies, most doing $2-20 million in revenue, operating independently, without institutional backing.

Figure: Services-to-platform cycle, six-sector pattern.

1. First Mover. Proves the concept, then exits or flames out. PhyMatrix in healthcare. Keywords Studios in gaming. Hipgnosis in music. Each validated the thesis and cleared the field.
2. The Vacuum (we are here). Smart operators build quietly after the first mover exits. Distressed multiples. No competing bidders. 14,600 layoffs. This phase makes the next decade.
3. Platform Expansion. Standalone units at 3–4× become platforms at 10–15×. Multiple expansion happens before market maturity produces the big outcomes.
4. Data Differentiation. Platforms with proprietary data separate permanently from pure services players. The gap compounds. Publicis vs. WPP is the analog.

Phase 3 is platform multiple expansion. Standalone practices at 3-4x become platforms at 10-15x. Phase 4 is data differentiation. The platforms with proprietary data separate permanently from the ones that are just services.

The question isn't whether this plays out. The analogs are too consistent. The question is who's building during the vacuum.

The operators who build during Phase 2, after the first mover proves the concept and before the market reprices the opportunity, own the category for decades.

The move

Entry multiples on gaming services companies are 5-8x right now. The cycle analogs say 3-5 years before those rise to 8-13x. Fifteen- to twenty-year runway to full maturity. This is the point where CrowdStrike entered cybersecurity: nine years after SOX created the compliance demand, and well before market maturity produced $80 billion outcomes.
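
The entry-versus-exit arithmetic can be sketched as follows. This assumes enterprise value is simply EBITDA times a multiple, with EBITDA held flat. The $10M EBITDA is a hypothetical input chosen for illustration, not a number from any deal discussed here. The multiple ranges are the 5-8x and 8-13x quoted above.

```python
def enterprise_value(ebitda_musd: float, multiple: float) -> float:
    """Enterprise value in $M under a simple EBITDA-multiple valuation."""
    return ebitda_musd * multiple


EBITDA_MUSD = 10.0        # hypothetical, for illustration only
ENTRY_RANGE = (5.0, 8.0)  # current entry multiples quoted in the text
EXIT_RANGE = (8.0, 13.0)  # multiples the cycle analogs point to

# Pair low with low and high with high. EBITDA is held flat,
# so any change in value comes from the multiple alone.
low_uplift = (
    enterprise_value(EBITDA_MUSD, EXIT_RANGE[0])
    - enterprise_value(EBITDA_MUSD, ENTRY_RANGE[0])
)
high_uplift = (
    enterprise_value(EBITDA_MUSD, EXIT_RANGE[1])
    - enterprise_value(EBITDA_MUSD, ENTRY_RANGE[1])
)

print(f"Uplift from re-rating alone: ${low_uplift:.0f}M to ${high_uplift:.0f}M")
```

With flat EBITDA, the re-rating alone adds value. Any EBITDA growth during the hold period would come on top of that.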

The play is straightforward. Acquire fragmented gaming services companies at services multiples. Architect for data capture from Day 1. Not as a bolt-on. Not as Deal 5 optimization. As infrastructure.

40% of PE return variance is explained by entry price alone, across 50,000+ deals.

The Publicis lesson is that data ownership is not something you figure out later. It's something you build for from the beginning.

Then the cross-sector positioning that no gaming services incumbent has pulled off. The same talent running QA for a major battle royale can run QA for BMW's virtual showroom. The same artists building game environments can build architectural visualizations. Qualitest proved it. Cross-sector QA positioning got them 10-12x at $200 million revenue from Bridgepoint. BISim bridged gaming to defense simulation and exited to BAE Systems at $200 million. Same talent. Different label. 2x the multiple.

Right now, IT services firms like Globant, Accenture, and Capgemini are hiring gaming talent at a premium, marking it up 3-4x, and selling it to enterprise clients. That is a talent arbitrage being captured by the wrong intermediary. Build the $30-50 million version at gaming services entry multiples and you get enterprise exit multiples without needing 130 acquisitions to get there.

EU AI Act enforcement begins in August. Four months from now. After that, web-scraped training data is a compliance liability in every European market. Licensed, provenance-tracked datasets become the only legal option for AI companies operating in Europe. The window where these companies are priced as distressed services plays while sitting on appreciating data assets? That window has a clock on it.

Regulatory Timeline

EU AI Act enforcement begins August 2026. Article 10 requires AI training data to have documented provenance and licensing. Web-scraped datasets without clear rights chains become non-compliant. Companies with licensed, human-generated training data — localization firms, dubbing studios, QA providers — hold the only compliant supply.

Same deals. Same talent. Same EBITDA. $50-110 million more in exit value from positioning alone.
