The AI training data market is $2.7 billion today and heading for $11 billion by 2030. The companies sitting on the most valuable datasets in it have no idea they have them. They think they're in the translation business.
CuriosityStream was losing money. Streaming company, documentaries, not exactly Netflix. Then someone realized the library, 210,000 hours of it, was worth more as AI training data than as a streaming service. The company went from a net loss to $8.2M in EBITDA. The content didn't change. The buyer for it did.
Reddit locked in $60 million a year from Google. Then renegotiated because the first deal was too cheap. Total AI licensing: roughly $130 million annually. Ten percent of the company's revenue. For text that users typed for free.
Shutterstock made $104 million licensing images to AI companies in 2023. By 2025 that was $203 million. Twenty-one percent of total revenue, growing faster than everything else in the business. They signed a six-year deal with OpenAI. The stock photo company is quietly becoming a data company.
Then it got strange. OpenAI offered $500 million for Medal.tv. A platform where gamers upload clips of themselves playing video games. Two billion clips a year. Ten million users. The primary asset is teenagers screen-recording their Fortnite kills.
Medal said no. Spun out an AI lab, raised $134 million from Khosla, and kept the data. Half a billion dollars on the table and they walked away because the data was worth more.
Sources: Public filings, press releases · OpenAI–Medal.tv offer reported, unconfirmed by OpenAI · Reddit renegotiated above initial $60M/yr figure · Shutterstock 6-year deal signed with OpenAI
Text, images, video, audio. All getting licensed. And gaming services companies, the ones generating the most structured, most commercially specific operational data in entertainment, are pricing that asset at zero.
They're pricing the labor. They should be pricing the data.
This has happened before. In 2016, Quintiles merged with IMS Health. Quintiles ran clinical trials for pharma companies. IMS counted pills across pharmacies. Neither was exciting. Together, they became IQVIA. Market cap: north of $45 billion. The services produced proprietary data. The data made the services defensible. Neither half could have built the other.
That was healthcare. The same structure exists in entertainment right now. Nobody is building it.
The data nobody's pricing
So who actually owns this data?
Most people looking at gaming services stop at "translation." They see localization companies and think labor arbitrage. But a company that's localized 5,000 games across 60 languages isn't a translation vendor. It's sitting on a parallel text dataset. The same content, professionally translated into dozens of language pairs. That is exactly what multilingual AI models need for training. And the companies producing it every day have no idea what it's worth.
Each service line produces a different type of data. Each type maps to a different AI market:
| Service Line | Data Generated | AI Application | Readiness |
|---|---|---|---|
| Localization | Parallel text datasets, 60+ language pairs | Machine translation, multilingual LLMs | High |
| Trust & Safety | Toxicity classifications, moderation decisions | AI safety models, content moderation | High |
| QA / Testing | Bug patterns, gameplay friction, usability data | AI game testing, player experience optimization | Moderate |
| Voice / Dubbing | Multi-language voice recordings | Voice AI, dubbing, voice cloning | High |
Three of four are ready now. Localization data is already being traded. Trust and safety data is in massive demand: 80% of platform moderation budgets go to AI tools, and the market is growing at 26.6% CAGR toward $10 billion. Voice and dubbing connects to the $20 billion voice AI market. ElevenLabs just raised at $11 billion on $330 million in revenue. They need voice data. These companies have been recording it for years.
One caveat. This is the kill assumption, so I won't bury it: ownership. Standard localization contracts give the client ownership of the translation memory. The vendor is just holding it. Which means a localization company's text corpus might belong to its clients.
That's checkable in diligence. And it shifts the thesis toward the data types where ownership is less contested. Moderation decision logs. QA testing patterns. Voice recordings. These are operational byproducts, not deliverables. Nobody wrote a contract clause over a company's internal bug-tracking data. The localization data is still valuable if the vendor retained rights. But it's the upside case, not the foundation.
Publicis acquired Epsilon's first-party data for $4.4 billion in 2019. Analysts were skeptical. CNBC ran the headline. By 2025, Publicis's operating margin was 14.1% versus WPP's 5.7%. Annualized ten-year return: Publicis +7.4%, WPP -9.2%. The re-rating wasn't immediate. It compounded over five years as data revenue grew. Wall Street doubted the deal, then spent half a decade watching Publicis pull away from every competitor that didn't make the same bet.
Where we are in the cycle
Every services-to-platform conversion follows the same four-phase pattern. I've tracked it across six sectors. The sequence is remarkably consistent.
Phase 1: first mover proves the concept, then flames out or goes dark. PhyMatrix in healthcare. Rolled up physician practices in the '90s, went bankrupt. Keywords Studios in gaming. 130 acquisitions, proved the consolidation model, then EQT took them private at £2.2 billion. Hipgnosis in music catalogs. Proved the asset class, then the valuation collapsed. Every one of them validated the thesis and cleared the field.
Phase 2: the vacuum. Smart operators build quietly while everyone else looks the other way. DaVita and US Physical Therapy built during the post-PPM bust from 2002 to 2010. PE had fled healthcare services. No competing bidders. Distressed multiples. Eight years of compounding without a bidding war.
Gaming services is in Phase 2 right now. Keywords proved it works and disappeared behind PE walls. 14,600 gaming employees were laid off in 2024. Structural, not cyclical. The market is $25-30 billion globally. Keywords held roughly 10% share. The other 90% is hundreds of sub-scale companies, most doing $2-20 million in revenue, operating independently, without institutional backing.
Phase 3 is platform multiple expansion. Standalone practices at 3-4x become platforms at 10-15x. Phase 4 is data differentiation. The platforms with proprietary data separate permanently from the ones that are just services.
The question isn't whether this plays out. The analogs are too consistent. The question is who's building during the vacuum.
The operators who build during Phase 2, after the first mover proves the concept and before the market reprices the opportunity, own the category for decades.
The move
Entry multiples on gaming services companies are 5-8x right now. The cycle analogs say 3-5 years before those rise to 8-13x, with a fifteen-to-twenty-year runway to full maturity. This is the point where CrowdStrike entered cybersecurity: nine years after SOX created the compliance demand, and well before market maturity produced $80 billion outcomes.
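To make the multiple arithmetic concrete, here is a minimal sketch of what the 5-8x to 8-13x shift means at constant earnings. The $10 million EBITDA figure and the function name are assumptions chosen purely for illustration, not numbers from any deal discussed here.

```python
def enterprise_value_range(ebitda_musd, multiple_low, multiple_high):
    """Return (low, high) enterprise value in $M for an EBITDA and a multiple range."""
    return (ebitda_musd * multiple_low, ebitda_musd * multiple_high)


# Hypothetical EBITDA in $M, used only to illustrate the arithmetic.
EBITDA_MUSD = 10.0

# Entry multiples quoted in the text (5-8x) and the range the analogs point to (8-13x).
entry = enterprise_value_range(EBITDA_MUSD, 5, 8)
rerated = enterprise_value_range(EBITDA_MUSD, 8, 13)

# Value added by the multiple alone, comparing low end to low end and high end to high end.
uplift_low = rerated[0] - entry[0]
uplift_high = rerated[1] - entry[1]

print(f"Entry value:   ${entry[0]:.0f}M to ${entry[1]:.0f}M")
print(f"Rerated value: ${rerated[0]:.0f}M to ${rerated[1]:.0f}M")
print(f"Uplift:        ${uplift_low:.0f}M to ${uplift_high:.0f}M")
```

Under these assumed inputs, the same earnings are worth $30-50 million more after the re-rating. The sketch holds EBITDA flat, so any operating growth during the hold period would come on top.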
The play is straightforward. Acquire fragmented gaming services companies at services multiples. Architect for data capture from Day 1. Not as a bolt-on. Not as Deal 5 optimization. As infrastructure.
The Publicis lesson is that data ownership is not something you figure out later. It's something you build for from the beginning.
Then the cross-sector positioning that no gaming services incumbent has pulled off. The same talent running QA for a major battle royale can run QA for BMW's virtual showroom. The same artists building game environments can build architectural visualizations. Qualitest proved it. Cross-sector QA positioning got them 10-12x at $200 million revenue from Bridgepoint. BISim bridged gaming to defense simulation and exited to BAE Systems at $200 million. Same talent. Different label. 2x the multiple.
Right now, IT services firms like Globant, Accenture, and Capgemini are hiring gaming talent at a premium, marking it up 3-4x, and selling it to enterprise clients. That is a talent arbitrage being captured by the wrong intermediary. Build the $30-50 million version at gaming services entry multiples and you get enterprise exit multiples without needing 130 acquisitions to get there.
EU AI Act enforcement begins in August 2026. Four months. Article 10 requires AI training data to have documented provenance and licensing. After that, web-scraped datasets without clear rights chains become a compliance liability in every European market, and licensed, provenance-tracked datasets become the only legal option for AI companies operating in Europe. Companies with licensed, human-generated training data (localization firms, dubbing studios, QA providers) hold the only compliant supply. The window where these companies are priced as distressed services plays while sitting on appreciating data assets? That window has a clock on it.
Same deals. Same talent. Same EBITDA. $50-110 million more in exit value from positioning alone.
