The Great Data Inversion
JC Langley

CuriosityStream was losing money. Streaming company, documentaries, not exactly Netflix. Then someone realized the library, 210,000 hours of it, was worth more as AI training data than as a streaming service. Adjusted EBITDA swung from negative to positive $8.2 million. The content didn't change. The buyer for it did.

The AI training data market is $2.7 billion today and heading for $11 billion by 2030. And the companies sitting on the most valuable datasets in it have no idea they have them. They think they're in the translation business.
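To put that growth in rate terms, here is a minimal sketch of the implied compound annual growth rate. The start year is an assumption: the text says only "today," so the horizon is left as a parameter and shown for both four and five years. The function name is illustrative.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1

# Market sizes in billions of USD, as quoted in the text.
START_B, END_B = 2.7, 11.0

# The number of years until 2030 is an assumption, not a figure from the text.
for years in (4, 5):
    rate = implied_cagr(START_B, END_B, years)
    print(f"{years} years: {rate:.1%} per year")
```

Under either horizon the implied rate is well above 30% a year.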

CuriosityStream was content as data. A library someone could point to and say “license this.” What comes next is stranger, because the most valuable datasets in entertainment aren't libraries at all. They're operational exhaust.

Reddit locked in $60 million a year from Google. Then renegotiated because the first deal was too cheap. Total AI licensing: roughly $130 million annually. Ten percent of the company's revenue. For text that users typed for free.

Shutterstock made $104 million licensing images to AI companies in 2023. By 2025 that was $203 million. Twenty-one percent of total revenue, growing faster than everything else in the business. They signed a six-year deal with OpenAI. The stock photo company is quietly becoming a data company.

Then it got strange. OpenAI offered $500 million for Medal.tv. A platform where gamers upload clips of themselves playing video games. Two billion clips a year. Ten million users. The primary asset is teenagers screen-recording their Fortnite kills.

Medal said no. Spun out an AI lab, raised $134 million from Khosla and General Catalyst, and kept the data. Half a billion dollars on the table and they walked away because the data was worth more.

AI Data Licensing: four deals, 26 months

Company	Year	Dataset	Value	Multiple of baseline	Note
CuriosityStream	2023	300,000+ hrs of documentary content	$23.4M	1× (baseline)	“Adjusted EBITDA went from negative to positive $8.2 million. Same library.”
Reddit · Google	2024	1B+ posts, 16 years of human discourse	$60M	2.6×	“First deal was too cheap. They renegotiated.”
Shutterstock	2025	700M+ licensed images, 6-yr OpenAI deal	$203M	8.7×	“21% of total revenue from AI licensing. Growing fastest.”
Medal.tv	Declined	2B clips/yr, 10M gamers, OpenAI offer	$500M (offered, declined)	21.4×	“$134M from Khosla. Kept the data. Built the lab.”

Total value escalation: $23.4M → $500M over 26 months, across text, image, video, and gaming clips.

Sources: Public filings, press releases · OpenAI–Medal.tv offer reported, unconfirmed by OpenAI · Reddit renegotiated above initial $60M/yr figure · Shutterstock 6-year deal signed with OpenAI
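The multiples in the figure above are each deal's value divided by the CuriosityStream baseline. A minimal sketch of that arithmetic follows, using only the dollar figures quoted in the figure. The dictionary keys and function name are illustrative. Note that the inputs are not like-for-like: the Reddit number is an annual licensing figure and the Medal.tv number is a reported acquisition offer, so the ratios show scale, not a strict comparison.

```python
# Deal values in millions of USD, as quoted in the figure.
# Reddit uses the initial $60M/yr figure; the text says it was later renegotiated higher.
deals = {
    "CuriosityStream": 23.4,
    "Reddit (Google)": 60.0,
    "Shutterstock": 203.0,
    "Medal.tv (offer, declined)": 500.0,
}

def multiples_of_baseline(values, baseline_key):
    """Express every value as a multiple of the chosen baseline, to one decimal."""
    base = values[baseline_key]
    return {name: round(value / base, 1) for name, value in values.items()}

multiples = multiples_of_baseline(deals, "CuriosityStream")
for name, multiple in multiples.items():
    print(f"{name}: ${deals[name]:.1f}M is {multiple}x the baseline")
```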

Text, images, video, audio. All getting licensed. And gaming services companies, the ones generating the most structured, most commercially specific operational data in entertainment, are pricing that asset at zero.

They're pricing the labor. They should be pricing the data.

This has happened before. In 2016, Quintiles merged with IMS Health. Quintiles ran clinical trials for pharma companies. IMS counted pills across pharmacies. Neither was exciting. Together, they became IQVIA. Market cap peaked north of $45 billion. The services produced proprietary data. The data made the services defensible. Neither half could have built the other.

That was healthcare. The same structure exists in entertainment right now. Nobody is building it.

The data nobody's pricing

So who actually owns this data?

Most people looking at gaming services stop at "translation." They see localization companies and think labor arbitrage. But a company that's localized 5,000 games across 60 languages isn't a translation vendor. It's sitting on a parallel text dataset. The same content, professionally translated into dozens of language pairs. That is exactly what multilingual AI models need for training. And the companies producing it every day have no idea what it's worth.

One company already proved it works. Flitto, a Korean localization platform, pivoted its multilingual parallel corpora into AI training data licensing. Revenue grew 77% year-over-year. Over 65% of revenue now comes from overseas data markets. They won a $7 million export award for it. Lionbridge and RWS both launched dedicated AI training data divisions. The localization-to-data pipeline isn't theoretical. It's operating.

Each service line produces a different type of data. Each type maps to a different AI market:

Service Line	Data Generated	AI Application	Readiness
Localization	Parallel text datasets, 60+ language pairs	Machine translation, multilingual LLMs	High
Trust & Safety	Toxicity classifications, moderation decisions	AI safety models, content moderation	High
QA / Testing	Bug patterns, gameplay friction, usability data	AI game testing, player experience optimization	Moderate
Voice / Dubbing	Multi-language voice recordings	Voice AI, dubbing, voice cloning	High

Three of four are ready now. Localization data is already being traded. Trust and safety data is in massive demand: 80% of platform moderation budgets go to AI tools, and the market is growing at 26.6% CAGR toward $10 billion. Voice and dubbing connects to the $20 billion voice AI market. ElevenLabs just raised at $11 billion on $330 million in revenue. They need voice data. These companies have been recording it for years.

One caveat. This is the kill assumption, so I won't bury it: ownership. Standard localization contracts give the client ownership of the translation memory. The vendor is just holding it. Which means a localization company's text corpus might belong to its clients.

That's checkable in diligence. And it shifts the thesis toward the data types where ownership is less contested. Moderation decision logs. QA testing patterns. Voice recordings. These are operational byproducts, not deliverables. Nobody wrote a contract clause over a company's internal bug-tracking data. The localization data is still valuable if the vendor retained rights. But it's the upside case, not the foundation.

The Publicis precedent

Publicis acquired Epsilon, and with it Epsilon's first-party data, for $4.4 billion in 2019. Analysts were skeptical. CNBC ran the headline. By 2025, Publicis had dramatically outperformed WPP on every financial metric. Annualized ten-year return: Publicis +7.4%, WPP -9.2%. The re-rating wasn't immediate. It compounded over five years as data revenue grew. Wall Street doubted the deal, then spent half a decade watching Publicis pull away from every competitor that didn't make the same bet.
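Annualized figures understate how far apart the two ended up. A minimal sketch, assuming the quoted returns compound at a constant rate over the full ten years (a simplification, since real returns vary year to year). The function name is illustrative.

```python
def cumulative_growth(annual_return, years):
    """Value of 1 unit after compounding a constant annual return for `years`."""
    return (1 + annual_return) ** years

YEARS = 10
publicis = cumulative_growth(0.074, YEARS)   # +7.4% annualized, from the text
wpp = cumulative_growth(-0.092, YEARS)       # -9.2% annualized, from the text

print(f"Publicis: 1.00 grows to {publicis:.2f}")
print(f"WPP:      1.00 shrinks to {wpp:.2f}")
print(f"Gap: Publicis ends at {publicis / wpp:.1f}x the WPP value")
```

Under that assumption, one unit in Publicis roughly doubles while one unit in WPP loses around 60% of its value.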

Where we are in the cycle

Every services-to-platform conversion follows the same four-phase pattern. I've tracked it across six sectors. The sequence is remarkably consistent.

Phase 1: first mover proves the concept, then flames out or goes dark. PhyMatrix in healthcare. Rolled up physician practices in the '90s, collapsed. Keywords Studios in gaming. Over 60 acquisitions, proved the consolidation model, then EQT took them private at £2.2 billion. Hipgnosis in music catalogs. Proved the asset class, then the valuation collapsed. Every one of them validated the thesis and cleared the field.

Phase 2: the vacuum. Smart operators build quietly while everyone else looks the other way. US Physical Therapy built during the post-PPM bust from 2002 to 2010. PE had fled healthcare services. No competing bidders. Distressed multiples. Eight years of compounding without a bidding war.

Gaming services is in Phase 2 right now. Keywords proved it works and disappeared behind PE walls. 14,600 gaming employees were laid off in 2024. Structural, not cyclical. The market is roughly $13 billion globally. Keywords held about 6% share. The other 94% is hundreds of sub-scale companies, most doing $2-20 million in revenue, operating independently, without institutional backing.

Services-to-Platform Cycle: the six-sector pattern

Phase 1: First Mover. Proves the concept, then exits or flames out. PhyMatrix in healthcare. Keywords Studios in gaming. Hipgnosis in music. Each validated the thesis and cleared the field.

Phase 2: The Vacuum (we are here). Smart operators build quietly after the first mover exits. Distressed multiples. No competing bidders. 14,600 layoffs. This phase makes the next decade.

Phase 3: Platform Expansion. Standalone units at 3–4× become platforms at 10–15×. Multiple expansion happens before market maturity produces the big outcomes.

Phase 4: Data Differentiation. Platforms with proprietary data separate permanently from pure services players. The gap compounds. Publicis vs. WPP is the analog.

The question isn't whether this plays out. The analogs are too consistent. The question is who's building during the vacuum.

The operators who build during Phase 2, after the first mover proves the concept and before the market reprices the opportunity, own the category for decades.

The move

Entry multiples on gaming services companies are 5-8x right now. The cycle analogs say 3-5 years before those rise to 8-13x. Fifteen- to twenty-year runway to full maturity. This is the point where CrowdStrike entered cybersecurity: nine years after SOX created the compliance demand, and well before market maturity produced $80 billion outcomes.
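What that re-rating is worth on its own can be sketched with simple arithmetic. The multiples come from the text. The $10 million EBITDA figure is purely hypothetical, and EBITDA is held flat so that only the change in multiple shows up. Function names are illustrative.

```python
def enterprise_value(ebitda_m, multiple):
    """Enterprise value in millions of USD at a given EBITDA multiple."""
    return ebitda_m * multiple

def rerating_uplift(ebitda_m, entry_range, exit_range):
    """Value added by multiple expansion alone, pairing low with low and high with high."""
    low = enterprise_value(ebitda_m, exit_range[0]) - enterprise_value(ebitda_m, entry_range[0])
    high = enterprise_value(ebitda_m, exit_range[1]) - enterprise_value(ebitda_m, entry_range[1])
    return low, high

HYPOTHETICAL_EBITDA_M = 10.0   # assumption for illustration, not a figure from the text
ENTRY_MULTIPLES = (5, 8)       # current entry range quoted in the text
EXIT_MULTIPLES = (8, 13)       # range the cycle analogs point to

low, high = rerating_uplift(HYPOTHETICAL_EBITDA_M, ENTRY_MULTIPLES, EXIT_MULTIPLES)
print(f"Uplift from re-rating alone: ${low:.0f}M to ${high:.0f}M")
```

Any EBITDA growth during the hold period would come on top of this.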

The play is straightforward. Acquire fragmented gaming services companies at services multiples. Architect for data capture from Day 1. Not as a bolt-on. Not as Deal 5 optimization. As infrastructure.

Entry price alone explains 40% of PE return variance, across 50,000+ deals.

The Publicis lesson is that data ownership is not something you figure out later. It's something you build for from the beginning.
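One way to picture "building for it from the beginning" is to attach rights and provenance to every operational record at the moment it is created. The sketch below is hypothetical: the class, field names, and filter rule are illustrative assumptions, not an established schema or a statement of what any regulation requires.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class OperationalRecord:
    """Hypothetical metadata captured alongside each piece of operational data."""
    service_line: str         # e.g. "localization", "trust_safety", "qa", "voice"
    payload_ref: str          # pointer to the stored artifact, not the data itself
    created: date
    rights_holder: str        # "vendor" or "client", per the governing contract
    contract_ref: str         # the agreement the rights claim rests on
    consent_documented: bool  # whether contributor consent is on file

def licensable(records):
    """Keep only records the vendor owns and can document consent for."""
    return [r for r in records if r.rights_holder == "vendor" and r.consent_documented]

records = [
    # Translation memory owned by the client under a standard contract.
    OperationalRecord("localization", "tm/0001", date(2024, 3, 1), "client", "MSA-A", True),
    # Internal moderation log: an operational byproduct retained by the vendor.
    OperationalRecord("trust_safety", "modlog/0001", date(2024, 3, 2), "vendor", "MSA-B", True),
    # Voice recording the vendor owns but without documented consent.
    OperationalRecord("voice", "vo/0001", date(2024, 3, 3), "vendor", "MSA-C", False),
]

sellable = licensable(records)
print([r.payload_ref for r in sellable])
```

The point of the design is that ownership becomes a queryable field rather than something reconstructed from contracts during diligence.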

Then the cross-sector positioning that no gaming services incumbent has pulled off. The same talent running QA for a major battle royale can run QA for BMW's virtual showroom. The same artists building game environments can build architectural visualizations. Qualitest proved it. Cross-sector QA positioning got them acquired by Bridgepoint at $200 million in revenue. BISim bridged gaming to defense simulation and exited to BAE Systems at $200 million. Same talent. Different label. 2x the multiple.

Right now, IT services firms like Globant, Accenture, and Capgemini are hiring gaming talent at a premium, marking it up 3-4x, and selling it to enterprise clients. That is a talent arbitrage being captured by the wrong intermediary. Build the $30-50 million version at gaming services entry multiples and you get enterprise exit multiples without needing sixty acquisitions to get there.

The obvious objection is synthetic data. If AI can generate its own training data, the scarcity premium on human-generated datasets collapses. Meta, Google, and Nvidia are all investing in exactly that. For commodity data, they're probably right. But operational byproduct data is definitionally resistant to synthesis. You can't synthesize authentic moderation failures. You can't generate the pattern of where real code actually breaks across 5,000 QA cycles. You can't fabricate how a native Korean speaker actually pronounces English game dialogue. Synthetic data works for the generic middle. It doesn't work for the specific edges where AI models actually fail.

EU AI Act enforcement begins in August. That is four months away. After that, web-scraped training data is a compliance liability in every European market. Licensed, provenance-tracked datasets become the only legal option for AI companies operating in Europe.

Regulatory Timeline

EU AI Act enforcement begins August 2026. Article 10 requires AI training data to have documented provenance and licensing. Web-scraped datasets without clear rights chains become non-compliant. Companies with licensed, human-generated training data — localization firms, dubbing studios, QA providers — hold the only compliant supply.

The window where these companies are priced as distressed services plays while sitting on appreciating data assets has a clock on it. Same deals. Same talent. Same EBITDA. $50 million to $110 million more in exit value from positioning alone. That gap is available right now because nobody in gaming services is thinking about it yet.