Market Report: AI Data Licensing Deals (2020

Market Report: AI Data Licensing Deals (2020 - Present)

Mar 23, 2025

Freeman Lewin

A comprehensive summary of significant publicly reported data and content licensing deals as of the above date, with a particular focus on details disclosed in US SEC filings and earnings transcripts. These deals are not just financial transactions but strategic moves that highlight the changing business landscape for enterprise data rights holders

If you're interested in licensing your data to AI models, reach out to us here.

News and Media Content Licensing Deals

Associated Press – OpenAI (July 2023): The Associated Press (AP) agreed to license its entire news story archive (dating back to 1985) to OpenAI (AP News). Financial terms were not disclosed, but AP received access to OpenAI’s technology and product expertise as part of the deal. The license is non-exclusive and focused on using AP text for training OpenAI’s models (not for direct content display). Industry: Digital and print media.
Axel Springer – OpenAI (Dec 2023): Axel Springer (owner of Politico, Bild, Business Insider, etc.) announced a multi-year content licensing deal with OpenAI for its publications’ archives (Axel Springer). In return, OpenAI reportedly gives Axel Springer content a “favorable position” in ChatGPT’s search results. Terms (likely non-exclusive) and payment were not publicly disclosed. Industry: Digital and print media.
Le Monde – OpenAI (Mar 2024): France’s Le Monde signed a multi-year agreement to license its news content to OpenAI (Le Monde). OpenAI can train on Le Monde’s full corpus, and ChatGPT responses referencing Le Monde articles will include the publication’s logo, titles, and hyperlinks to the source. Financial terms weren’t disclosed. Industry: Digital and print media.
Prisa (El País, etc.) – OpenAI (Mar 2024): OpenAI partnered with Spain’s Prisa Media, owner of El País, AS, Cinco Días, and others (OpenAI). This enables ChatGPT to train on and surface Prisa’s Spanish-language news content with attribution. Terms were not public, but OpenAI stated such publisher deals help provide high-quality, real-time information in multiple languages. Industry: Digital and print media.
Financial Times – OpenAI (Apr 29, 2024): OpenAI signed a licensing deal with the Financial Times to access its archives of news and feature articles (Verdict). The FT’s content will be used as training data for ChatGPT, and any FT article used as a source will be linked in ChatGPT’s answers. The deal is multi-year; financial details weren’t disclosed. Industry: Digital and print media.
News Corp (Wall Street Journal, etc.) – OpenAI (May 2024): OpenAI reached a five-year content deal with News Corp (owner of The Wall Street Journal, New York Post, and others) (The Guardian). This exclusive deal allows OpenAI to use News Corp’s journalism in its models, and it was reported to be valued at over $250 million over the 5-year term. Industry: Digital and print media.
The Atlantic – OpenAI (May 2024): The Atlantic entered a strategic multi-year content and product partnership with OpenAI (The Atlantic). OpenAI gains access to The Atlantic’s archives to train its models and feature Atlantic content in ChatGPT, with attribution and links back to the site. In exchange, The Atlantic’s team gets early access to OpenAI’s technology to develop AI-driven products. Terms were not disclosed. Industry: Digital and print media.
Vox Media – OpenAI (May 2024): Vox Media (owner of The Verge, New York Magazine, Vox.com, etc.) signed a content and product partnership with OpenAI (Reuters). OpenAI can train on Vox Media’s archives and integrate that content into ChatGPT, while Vox will leverage OpenAI’s tech to build new AI-powered products for its audiences. No public financial details. Industry: Digital and print media.
Condé Nast – OpenAI (Aug 2024): OpenAI announced a multi-year deal with Condé Nast to license content from its flagship brands (The New Yorker, Vogue, GQ, Wired, etc.) (Reuters). OpenAI’s products (ChatGPT and the prototype “SearchGPT”) will be able to display and use Condé Nast content for training and answering queries. Financial terms were not disclosed. Industry: Digital and print media.
Time Magazine – OpenAI (2024): A month before the Condé Nast deal, it was revealed that Time was also among the publishers OpenAI partnered with in 2023–2024 (Reuters). The deal grants OpenAI access to Time’s content archive for model training. Industry: Digital and print media.
Axios – OpenAI (Jan 2025): Axios entered a three-year licensing and tech partnership with OpenAI (News generative AI deals revealed: Who is suing, who is signing?). OpenAI is providing funding that Axios will use to launch four new local news bureaus, and all Axios journalists get access to OpenAI’s enterprise tools. In return, Axios’s journalism will be used to train OpenAI’s models and will appear in ChatGPT with attribution and links. Financial specifics were not disclosed, but the commitment suggests a multi-million dollar funding. Industry: Digital and print media.
Guardian – OpenAI (2024): The Guardian (UK) has reportedly signed a deal with OpenAI (exact date undisclosed) to license its news content for AI training (Press Gazette). The Guardian’s deal is one of several between publishers and OpenAI who are forging partnerships instead of pursing legal action. No public details on term or payment. Industry: Digital and print media.
Schibsted – OpenAI (2024): Schibsted, a major Norwegian media group (publishes Aftenposten, VG, etc.), also inked a content licensing agreement with OpenAI (Schibsted). Financial details were not publicly announced, but the deal presumably includes a multi-year access to Schibsted’s news archives for model training. Industry: Digital and print media.
Future plc – OpenAI (2024): Future plc (UK-based publisher of tech, gaming, and specialty magazines/websites) signed a deal with OpenAI to license its content (Press Gazette). No specifics released, but likely involves OpenAI using Future’s articles (from sites like TechRadar, PC Gamer, etc.) in training data. Industry: Digital and print media.
Hearst Magazines – OpenAI (2024): Hearst’s magazine division (e.g. Cosmopolitan, Esquire, Popular Mechanics) reportedly partnered with OpenAI to license content (Hearst). Terms undisclosed, but Hearst president indicated that the partnership will enhance both companies offerings, indicating that the arrangement is not purely financial. Industry: Digital and print media.
Dotdash Meredith – OpenAI (late 2024): Digital publisher Dotdash Meredith (owner of People, Better Homes & Gardens, etc.) is also listed among OpenAI’s content partners (Axios). This likely allows OpenAI to train on its lifestyle and news content. Not publicly announced in detail; presumed multi-year, undisclosed payment. Industry: Digital and print media.
Google – Associated Press (Jan 2025): In Google’s first deal with a news publisher for AI, it signed an agreement with AP to provide a real-time news feed for Google’s generative AI (Gemini chatbot) (AP News). AP will supply up-to-date news wire content to help train and improve Gemini’s answers, ensuring timely and accurate information. Google didn’t disclose payment details. The deal builds on a longstanding Google-AP relationship in news licensing. Industry: Digital and print media.
Mistral AI – Agence France-Presse (Jan 2025): Prolific Paris-based startup Mistral AI struck a multi-year content partnership with AFP (Sifted). AFP will provide all of its text news stories (about 2,300 per day in multiple languages) to Mistral’s new chatbot “Le Chat” for training and real-time answers. The goal is to improve the factual accuracy of Mistral’s model with trusted news. Financial terms and contract length were not disclosed . Industry: Digital and print media.
Reuters – Meta (October 2024): Thomson Reuters (Reuters News) confirmed it has licensed news content to train LLMs (Axios). In its May 2024 6-K, Reuters said generative-AI related content licensing drove a 15% revenue increase in early 2024. Separately, Meta revealed a partnership whereby its AI assistant can provide news answers with summaries and links from Reuters content. Details of the Meta deal (likely 2023) were not publicly disclosed, but it is a non-exclusive license for news data. Industry: Digital and print media.
Perplexity AI – Multiple News Publishers (2023): Perplexity, an AI search chatbot, launched a revenue-sharing program with publishers in 2023. Participants include The Independent (UK), Los Angeles Times, Lee Enterprises (US local newspapers), Adweek, World History Encyclopedia, Texas Tribune, Fortune, Time, Der Spiegel and others (Press Gazette). These publishers license their content to Perplexity, which uses it to answer user queries with citations. In return, the publishers will share in advertising revenue when their content is shown, get visibility into how their content is used, and receive other benefits. Most of these deals are non-exclusive and involve no upfront payment but a revenue split model. Industry: Digital and print media.
Prorata.ai – News Publishers (2024): Prorata, an AI attribution platform, secured licensing deals to train on content from publishers like DMG Media (Daily Mail), The Guardian, Sky News, Prospect magazine, Fortune, Axel Springer, Financial Times, and The Atlantic (Press Gazette). These agreements, mostly in 2024, allow Prorata’s AI to use the publishers’ archives with permission. Details on each deal’s value are not public; they appear to be partnership licenses similar to Perplexity’s (possibly with revenue-sharing or subscription fees). Industry: News/media. Data: News articles and magazine content. Industry: Digital and print media

Editor Note: OpenAI has disclosed that by late 2024 it had partnered with at least 20 news organizations globally for licensed training data (Reuters). Its offers to news publishers have been reported at $1–5 million per year for content access (The Verge), though marquee deals like News Corp’s far exceed that. Meanwhile, some major publishers (e.g. NY Times) have chosen litigation over licensing, underscoring the range of responses to AI firms’ approaches (AP News) (Press Gazette).

Social Media and Community Data Deals

Reddit – Google & Others (2023): Reddit, a platform hosting vast user discussions, signed data licensing deals with several companies (including Google and PR firm Cision) to provide API access to Reddit’s historical and ongoing content (CBS News). These deals are reportedly worth about $60 million per year in aggregate for Reddit. Reddit’s IPO filing (Feb 2024) revealed it had secured data licensing contracts valued at $203 million total, with $66.4 million to be recognized as 2024 revenue. Contracts are multi-year and likely non-exclusive, covering Reddit comments and posts for uses like sentiment analysis and training LLMs. Industry: Social media. Data: User-generated text (posts, comments).

In it's October 30th 10-K, Reddit reported that the aggregate amount of its long term contracts, consisting primarily of long term data licensing contracts exceeding one year is $294.8 million—but it's still early days—reddit explained, stating that they are "in the early stages of [its] data licensing efforts.
X (Twitter) – Various (2023): X (formerly Twitter) offers Firehose API access licensing, which provides real-time feeds of all public tweets for AI training and analytics. In 2023, X set the price at $42,000 per month (roughly $0.5M/year) for limited access or about $2.5 million per year for the full firehose (Mashable). Enterprise clients (including AI model developers) can pay for this firehose to train language models on social media data. One known client is the financial firm Bloomberg, which licensed Twitter data to build a finance sentiment model (announced 2023). IBM also reportedly licensed Twitter’s dataset in 2023 to train business AI models rather than rely on scraping (an alternative to OpenAI, which lost free access) (AP News). Industry: Social media. Data: Social network posts (tweets).
Stack Overflow – Google (March 2024): Stack Overflow (SO), a Q&A platform for developers, entered a strategic partnership with Google Cloud to integrate Google’s Gemini AI with SO’s knowledge base (ZDNET). The deal lets Google train its Gemini LLM on 15+ years of SO’s question-and-answer data, enhancing coding solutions in Google’s AI products. In turn, Stack Overflow benefits by embedding Google’s AI into its site and tools for developers. Terms were not disclosed (likely a partnership rather than a direct purchase), and the data use is almost certainly non-exclusive. A similar deal was inked between Stack Overflow and OpenAI, which generated substantial backlash and infuriated contributors, who took pains to corrupt the dataset - vibe coders beware. Industry: Online community (developer). Data: Crowdsourced Q&A (programming knowledge).
Yelp – Perplexity AI (Mar 2024): Yelp licensed its database of business reviews and ratings to Perplexity AI’s search chatbot (Verge). This allows Perplexity to provide answers about local restaurants and services with Yelp-sourced information. It’s unclear if this 2024 deal introduced new terms beyond Yelp’s existing API programs, but Yelps 2023 10-K highlights that Yelp’s “other” revenue grew from $21 M in 2020 to $47 M in 2023, possibly due in part to generative AI licensing (~$25 M jump). Industry: Local reviews platform. Data: User reviews, business info.
LinkedIn – OpenAI (Presumed/Unpublicized): Microsoft-owned LinkedIn did not announce specific licensing to OpenAI, but as part of Microsoft’s partnership with OpenAI, LinkedIn’s public data (profiles, job postings) was made available to train certain enterprise GPT models. Microsoft has rights to LinkedIn data since acquisition, so this is an internal licensing. No external press release; inferred from Microsoft’s AI use cases.
Discord: Social platforms like Discord have not publicly licensed data, but OpenAI’s models were known to be trained on some public Discord forum data. In response, in 2023 Discord updated policies for opting out of data scraping.

Editor Note: The above examples illustrate how social and community data, which was often freely scraped in the past, is increasingly monetized via formal licensing as AI demand grows.

Visual Media and Creative Content Deals

Shutterstock – OpenAI (2021 & July 2023): Shutterstock, a stock image and footage provider, first partnered with OpenAI in 2021 to provide a library of images for training DALL·E (Music Business Worldwide). In July 2023, they expanded this into a six-year deal giving OpenAI broad license to Shutterstock’s images, videos, and music clips for model training. In exchange, Shutterstock integrates OpenAI’s generative AI (like DALL·E) into its platform and launched a Contributor Fund to compensate artists whose works train the AI (Shutterstock). Shutterstock’s CFO stated the initial deals with Big Tech firms ranged from $25–50 million each, and most were later expanded (Reuters). This implies OpenAI (and others, below) paid in that range. Industry: Stock media. Data: Stock images, videos, music.
OpenAI – Shutterstock Contributors (2023): As part of OpenAI’s arrangement with Shutterstock, a Contributor Fund was set up to pay Shutterstock’s artists and photographers whose images are used in AI training. This is notable as a usage-based royalty model. (While not a separate “deal,” it’s a term within the Shutterstock-OpenAI deal that addresses usage rights and compensation.)
Shutterstock – Google, Meta, Amazon, Apple (late 2022 – 2023): In the wake of ChatGPT’s debut, Shutterstock struck licensing agreements with Google, Meta, Amazon, and Apple to supply hundreds of millions of its stock images, videos, and audio for AI training (Reuters). Each of these deals was reportedly $25–50 million initially, with some expanded later (Apple’s deal in particular, previously undisclosed, was revealed by Reuters). These are non-exclusive multi-year licenses for each tech giant’s internal model training. Industry: Stock media. Data: Images, videos, music.
Shutterstock – NVIDIA, LG, etc. (2023): Shutterstock also entered collaborations with NVIDIA, LG, and other firms to develop generative AI tools, likely involving licensing Shutterstock 3D and image data (Reuters). Specific terms weren’t given, but these partnerships indicate Shutterstock leveraging its library across the AI industry. Industry: Stock media / Tech. Data: Images (and possibly 3D models).
Getty Images – Stability AI (lawsuit 2023): Getty Images, another stock image leader, did not initially license its content to Stability AI’s Stable Diffusion; instead, it sued Stability in early 2023 for allegedly scraping 12 million Getty photos without permission (Reuters). However, Getty’s earnings later revealed it has engaged in “data licensing” deals with other partners. In the first half of 2024, Getty reported in its June 30, 2024 10-Q that it had earned $3.2 million (2.6% of revenue) in its “Other” segment primarily from data licensing. Getty’s CEO Craig Peters noted small data licensing deals in Q2 and Q3 2024 with a longstanding partner, aligned with the interest of the business and the interest of its creators (Insider Monkey). (He also said Getty passed on many deals not in its long-term interest). Those partner(s) weren’t named, but one known collaboration is Getty Images – NVIDIA (2023): Getty partnered with NVIDIA to train Generative AI by Getty Images, a tool using Getty’s fully licensed content (Verge). This NVIDIA partnership likely involved a licensing deal (terms undisclosed) where Getty’s library was used to develop a custom generative model offered to Getty’s clients (Generative AI by Getty Images). Industry: Stock media. Data: Professional photos, illustrations, video.
Getty Images – Shutterstock Merger (Jan 2025): As of Jan 2025, Getty and Shutterstock have merged into a single company (Getty Images). This consolidation may affect future data licensing strategies as the combined entity controls a huge portion of commercial imagery.
Freepik/EyeEm – Unnamed Tech Giants (2023): Freepik, which acquired photo platform EyeEm, licensed the majority of its 200 million image archive to two large tech companies in 2023 (Reuters). The pricing was stated at $0.02–$0.04 per image, implying each deal is worth on the order of $4–8 million. At least two deals (roughly $6 M each) were signed, and Freepik’s CEO noted five more similar deals were in the pipeline. The buyers were not identified; presumably companies developing image-generative AI. Licenses are likely non-exclusive bulk data sales. Industry: Stock photos. Data: Photographs.
Photobucket – (2024): Photobucket, an image hosting site, revealed in 2024 that it was in talks with multiple tech companies to license its enormous library of 13 billion photos and videos (Reuters). CEO Ted Leonard indicated proposed rates between $0.05 and $1.00 per photo (and >$1 per video). Even at the low end, a full library deal could exceed $650 million, though partial or subset deals are likely. As of mid-2024 no specific deal was publicly closed. Photobucket’s archive, which is significantly larger than Shutterstock’s, makes it highly appealing for foundational AI model training, but may be hampered by the scale of its annotation requirements. Industry: Photo storage/hosting. Data: User photos and videos.
Adobe – NVIDIA (2023): Adobe, instead of licensing to external models, trained its own Firefly generative AI on its stock content (which artists agreed to license). Per NVIDIA, in 2023 Adobe partnered with NVIDIA to integrate Firefly into NVIDIA’s Picasso cloud AI service (NVIDIA). This partnership could be seen as Adobe licensing its curated dataset (Adobe Stock images) via a collaboration with NVIDIA. Terms weren’t disclosed, as it’s more of a product integration than a raw data sale. Industry: Stock media. Data: Stock images (licensed from contributors).
Symphonic Distribution – Musical AI (Aug 2023): In the music industry, Symphonic Distribution (an independent music distributor) partnered with startup Musical AI to license its entire music catalog for AI training (Music Ally). Through this opt-in program, Symphonic’s artists and labels can agree to have their songs used to train generative AI music models in exchange for compensation. This deal marked one of the first licensed music datasets for AI, aiming to be a “properly-licensed” alternative to unauthorized AI music scraping. Musical AI (which raised seed funding in 2023) will train models on opted-in tracks to create AI-generated music. Financial details weren’t public; likely a revenue share or per-track fee to rights holders. Industry: Music. Data: Audio recordings (music).
Universal Music Group – Google (Possibly in 2023): UMG and Google confirmed in mid-2023 they were exploring licensing agreements for artists’ voices and music to train AI that can generate “deepfake” songs legally (Billboard). As of 2024, no finalized deal has been announced, but this could involve major payments to artists/labels for use of their vocals in AI. It highlights the trend toward licensing in the music sector to preempt copyright issues.

Editor Note: While content deals have been aplenty, there continues to be a knowledge vacuum when it comes to price transparency and stability. It is expected that higher quality image datasets will outperform hyper-growth datasets, as foundational models consolidate and the market turns to fine-tuning opensource models.

Academic and Book Publisher Deals

Taylor & Francis (Informa) – Microsoft (July 2023): Academic publisher Taylor & Francis (part of Informa PLC) revealed it made a deal to sell access to its research publications for AI training. One confirmed partnership was with Microsoft in 2023, reportedly worth $10 million (The Bookseller). Informa’s mid-2023 financial report later indicated T&F expected to earn £58 M (~$75 M) in 2023 from licensing content to AI firms (Inside Higher Ed). This suggests multiple deals: the $10M Microsoft deal plus at least one other major AI customer (undisclosed) and additional pipeline deals. These agreements likely grant non-exclusive access to large swaths of academic journals and textbooks for training LLMs, with provisions to protect authors’ rights (no verbatim reproduction in outputs, etc.). Industry: Academic publishing. Data: Scholarly journals and books.

Wiley – Multiple AI Companies (2023–2024): John Wiley & Sons confirmed it has entered at least two agreements to license its published content for AI training (The Bookseller). By early 2024, Wiley anticipated $44 million in revenue from AI licensing deals (Inside Higher Ed). The company has emphasized author compensation and copyright protection in these arrangements. (Partners weren’t named; they could include big tech or AI startups. Wiley likely licensed large datasets of scientific journals, textbooks, and databases.) Industry: Academic/professional publishing. Data: Research articles, books.
Oxford University Press – AI Partnerships (2024): OUP acknowledged in mid-2024 that it is “actively working with companies developing LLMs” to license content for “responsible development” of AI (The Bookseller). While no specific deal was named, OUP’s efforts are toward agreements similar to its peers (Inside Higher Ed). (This might involve Oxford’s vast catalog of academic journals and perhaps the Oxford English Dictionary data). Industry: Academic publishing. Data: Journals, reference works.
Cambridge University Press – Opt-in Initiative (2024): CUP had not publicly confirmed a signed deal by 2024, but it announced an “opt-in” system for authors regarding AI uses (The Bookseller). This indicates CUP is preparing or negotiating licensing arrangements and wants to secure author permissions. Industry: Academic publishing. Data: Academic journals, books.
Elsevier – (Exploratory): Elsevier (RELX Group) was rumored to be in talks with AI firms in 2023, given its valuable scientific content, though specific deals haven’t been disclosed. (RELX’s LexisNexis division, however, is known to license legal text data for AI—e.g. to startups training legal AI.)
HarperCollins – “Large Tech Company” (2023): HarperCollins, a major book publisher, made a deal with an unnamed AI technology company in 2024 to license a selection of its books for training AI models (404Media). Authors were given the choice to opt in their works to this program or decline. Leaked details from agents indicated HarperCollins would pay authors $2,500 per book title included in the training dataset. While the AI partner was not officially named; industry insiders speculated it could be OpenAI or maybe Google. The contract length and usage terms were not public, but presumably the AI firm gets a non-exclusive license to the text of those books for model training, and authors receive a one-time fee. Industry: Book publishing. Data: Books (full text).

Editor Note: Smaller academic publishers (e.g. Taylor & Francis’ parent Informa also licenses corporate report content; Springer Nature and IEEE have likely explored AI deals as well). By 2025, the trend is clear: publishers can profit handsomely by monetizing their archives for AI training, and many have quietly struck deals, sometimes to the indignation of authors amidst ethical concerns.

Other Enterprise Data Licensing Deals

Thomson Reuters – OpenAI/Microsoft (2023): Reuters’ parent company struck deals to license its news and possibly financial data for generative AI training (Reuters). In May 2024, Thomson Reuters reported a 15% YoY increase in revenue primarily due to “generative AI-related content licensing” (May 2024 6K). This likely refers to agreements supplying news feeds (and perhaps legal or financial info via Westlaw/Refinitiv) to AI developers. While clients weren’t named, OpenAI and Microsoft are logical partners, given Reuters content now appears in Bing Chat and other AI systems. Industry: News/Financial data. Data: Newswire text (and potentially financial/legal datasets).
Bloomberg – GPT Developer (2022): Bloomberg LP did not license data externally; instead it used its proprietary financial news and data to train Bloomberg GPT, a 50-billion parameter finance-specific LLM (launched 2023). This “deal” was internal (Bloomberg is both data owner and model builder), so no external contract, but it underscores the value of exclusive data. Bloomberg’s approach contrasts with others who sold data rights.
IBM – Multiple AI Firms (2023–2024): IBM has vast enterprise datasets (e.g. The Weather Company’s weather data, IBM’s code and customer support logs, etc.). In its Q2 2024 earnings call, IBM disclosed multi-year licensing agreements to supply enterprise data to several AI companies, which significantly boosted its AI-related revenue (2Q 2024 Earnings). Details are scant due to confidentiality, but these could include licensing proprietary datasets (weather, industry research, or software knowledge bases) to model developers. For instance, IBM’s CEO mentioned IBM would pursue deals to provide data in domains like customer service and IT automation to AI startups. Industry: IT/Analytics. Data: Various (weather, technical, business data).
C3.ai – Various (2024): C3.ai, an enterprise AI software provider, reported new data licensing deals in its FY2024 Q3 earnings. These deals contributed to an 18% YoY revenue growth. C3.ai partners with industries like energy, defense, and manufacturing; one known partner is Baker Hughes (oil industry data). The licensed data likely includes sensor data, supply chain info, and other corporate datasets that C3 can use to train domain-specific AI models. Industry: Enterprise AI. Data: Industrial IoT and enterprise datasets.
NVIDIA – NeMo Toolkit Data (2023): NVIDIA, while known for hardware, also curates datasets for its NeMo™ large language model services. In late 2023, NVIDIA reported a boost in data center revenue partly due to licensing agreements for its “AI-ready” training data via the NVIDIA NeMo Retriever product. This indicates NVIDIA packaged certain valuable datasets (perhaps code, or synthetic data, or scraped and cleaned text corpora) and licensed them to enterprise customers building models, contributing to record data center sales. (These could be bundled deals of NVIDIA hardware + data.) Notably, subsequent NVIDIA reports did not break out this data-licensing contribution and seems less inclined over the past recenty months to make NeMo into a consumer product - instead focusing on its core GPU business. Industry: AI infrastructure. Data: Curated text/image datasets for training.
Planet Labs – Various (2020–2024): Satellite imagery company Planet Labs generates daily Earth images. Its business model is data licensing via subscriptions and API access. By 2024, Planet emphasized (through its 2024 10-Q) delivering “AI-ready” satellite datasets, leveraging its archive for machine learning applications. Clients include agriculture, government, and climate AI projects. While not a single deal, Planet’s revenue (over $180M in 2024) largely comes from multi-year data licenses to companies (some likely training geospatial AI on this data). Industry: Satellite imagery. Data: Earth observation images (vision data).
SoundHound AI – Semiconductor Company (2023): Voice AI firm SoundHound disclosed a one-time voice data licensing deal with a U.S. semiconductor company in 2023, yielding $3.6 million in revenue. per its 23'-24' annual report. The deal provided SoundHound’s extensive voice/audio dataset (speech recordings, voice commands, etc.) to the chipmaker, likely to improve on-chip AI models for speech recognition. It was non-recurring and characterized as a licensing of voice data for AI, completed in FY2023. (The unnamed chip company could be Qualcomm, NVIDIA, AMD, or Intel, all of whom develop AI speech capabilities.) Industry: AI speech / Semiconductors. Data: Voice recordings and conversational audio.
Tempus – Multiple (2024): Tempus (NASDAQ: TEM) is a precision medicine tech company with a large library of clinical and molecular data. In its August 2024 8-K it reported 40% YoY growth in data licensing revenue, reaching $3.7 M in the first half of 2024. This implies deals where Tempus licenses anonymized patient data, genomic sequences, or clinical records to pharmaceutical or AI companies for model training (e.g. AI drug discovery or medical LLMs). Tempus’s data licensing is a growing revenue stream as part of its overall 32% revenue growth. Industry: Healthcare/Biotech. Data: Genomic and clinical data.
Automattic (WordPress/Tumblr): Automattic, which runs WordPress.com and Tumblr, hosts billions of user-generated blog posts. In 2023, it was reported to be considering data partnerships for AI. Matt Mullenweg (CEO) indicated they might allow AI firms to train on WordPress content under certain licenses (WordPress’s default is open unless blocked by robots.txt). While no formal deals were public, this is a space to watch as blogging platforms monetize their content archives for AI.
Syndigate – AI Companies (2023): Syndigate, a content syndication service, aggregates news from the Middle East and globally. By late 2023, it began licensing its multilingual news feeds to AI model developers (no public announcements, but listed by industry analysts). Likely clients are AI firms needing non-English news data.
Others: Numerous other data licensing arrangements were reported: e.g. Wolters Kluwer in legal/tax content; Relativity in legal documents; Experian in consumer credit data for AI credit models; Twitter (pre-Musk) – Academic institutions (like licensing the “Decahose” tweet stream for research); Meta – Common Crawl (though Common Crawl is free, Meta paid some data brokers for specialty datasets); Figure-OpenAI (though this agreement was later terminated).

Editor Note: Each of the above deals reflects the nascent market for high-quality, proprietary data to feed AI systems. As AI applications (like large language models, generative image and audio models, etc.) proliferate, we see content-rich companies (news agencies, publishers, platform providers) negotiating compensation and usage terms for their data. Deals range from direct content fees (lump-sum or annual payments) to partnerships with revenue sharing or technical collaboration. Contract lengths are typically multi-year (3–6 years is common) and can be exclusive or non-exclusive. Usage restrictions usually aim to ensure attribution in AI outputs and prevent the AI from simply republishing content verbatim without credit. The values vary widely – from low millions per year for magazine/newspaper archives up to hundreds of millions for major media conglomerates or large image datasets. You might not know it yet, but you've got a valuable dataset that's just waiting to be monetized!

The Émet Protocol solves a critical challenge in solving the world's data availability problem by enabling the trustless exchange of data for payment between buyers and sellers of all types. So far, virtually all of the data licensing deals that have occurred between data rights holders and GenAI models have occurred on a 1:1 basis, making it difficult for small model developers to compete and supplement their foundational models with high-quality data - we're aiming to change that.

November 7, 2024

The Evolution of Data Infrastructure: From Lakes to Marketplaces

November 7, 2024

The Evolution of Data Infrastructure: From Lakes to Marketplaces

November 7, 2024

The Evolution of Data Infrastructure: From Lakes to Marketplaces

November 12, 2024

The AI Data Crisis: How Permissionless Data Protocols Could Lead to AI Innovation.

November 12, 2024

The AI Data Crisis: How Permissionless Data Protocols Could Lead to AI Innovation.

November 12, 2024

Mar 23, 2025

Mar 23, 2025

Freeman Lewin

Freeman Lewin

The Evolution of Data Infrastructure: From Lakes to Marketplaces

The Evolution of Data Infrastructure: From Lakes to Marketplaces

The Evolution of Data Infrastructure: From Lakes to Marketplaces

The AI Data Crisis: How Permissionless Data Protocols Could Lead to AI Innovation.

The AI Data Crisis: How Permissionless Data Protocols Could Lead to AI Innovation.

The AI Data Crisis: How Permissionless Data Protocols Could Lead to AI Innovation.

LinkedIn

Twitter / X

About Us

Contact

Projects

Terms of Use

Privacy Policy

LinkedIn

Twitter / X

About Us

Contact

Projects

Terms of Use

Privacy Policy

LinkedIn

Twitter / X

About Us

Contact

Projects

Terms of Use

Privacy Policy

Contact Us