The AI Data Crisis: Why Permissionless Data Protocols Could Lead to AI Innovation
According to Statista, as of January 2024, English was the lingua franca of the internet, with 52.1% of websites using it. Spanish ranked second at 5.5% of web content, while German followed at 4.8%.
So it makes sense that the vast majority of AI models are trained to output English or other Latin scripts. But what happens when we need to develop specialized models for languages with inconsistencies in their dialects, like Arabic, Hebrew, and the world's second most spoken language, Chinese? Dialectal variations across different regions add a layer of complexity that is difficult to account for unless the data is properly "treated" and labeled appropriately - which brings us to the issue at hand.
The Hidden Crisis in AI
The AI industry is facing a data consent crisis. Not just a crisis of quantity (or, as I like to call it, a "data availability crisis") - though that's coming, with some predicting we'll "run out" of quality training data by 2026 - but a crisis of quality, consent, and completeness. We are building AI models with compromised materials - and no matter how brilliant our architectural plans are, our foundations are becoming unstable.
The Consent Conundrum
Consider the recent lawsuit by The New York Times against OpenAI, or Scarlett Johansson's battle over AI voice cloning. These aren't just legal skirmishes - they're symptoms of a deeper problem. The industry's "ask forgiveness, not permission" approach to training data is backfiring spectacularly, and the same is true for its obsession with synthetic data. As LinkedIn quietly adds AI training options to profile settings and Reddit sells user conversations, we're watching the erosion of trust in real time.
Beyond Bias: The Quality Question
But consent is just the tip of the iceberg. Most enterprises we talk to about their data needs share what was once a startling confession: much of their training data has been essentially unvalidated, scraped from websites with minimal quality checks. It's like trying to teach a child using textbooks where half the pages are filled with typos and misconceptions. The same is true for the growing corpus of AI-developed internet infrastructure, which relies on community-developed resources like Stack Exchange without considering the implications of community sabotage.
The Synthetic Data Trap
While synthetic data has its place, it's becoming a crutch - one that is quickly degrading the quality of language model outputs. Consider the development of a churn prediction model for a gaming company. Synthetic data would completely miss the complex network of player relationships that drive actual churn behavior. These "network effects" simply can't be synthesized effectively. And don't take our word for it - a recent study published in Nature found that the indiscriminate use of model-generated content in training causes irreversible defects in the resulting models. There is a high price for free data.
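The churn example above can be made concrete with a toy simulation. In the sketch below (an illustration of the general point, not data from any real gaming company), churn is "contagious" within friend clusters. Synthetic data drawn from the overall churn rate alone reproduces the marginal statistics but erases the network effect a churn model would actually need to learn:

```python
import random
from collections import defaultdict

random.seed(42)

# Toy social graph: players grouped into small friend clusters.
# Real churn is contagious: when a cluster takes a "shock"
# (e.g. a group of friends quits together), everyone in it is
# far more likely to churn - a simple network effect.
N_CLUSTERS, CLUSTER_SIZE = 200, 5

def simulate_real():
    rows = []
    for c in range(N_CLUSTERS):
        shocked = random.random() < 0.3          # cluster-level shock
        for _ in range(CLUSTER_SIZE):
            churned = random.random() < (0.8 if shocked else 0.05)
            rows.append((c, churned))
    return rows

real = simulate_real()
overall_rate = sum(ch for _, ch in real) / len(real)

# "Synthetic" data sampled from the marginal churn rate alone:
# each player is drawn independently, so clusters carry no signal.
synthetic = [(c, random.random() < overall_rate)
             for c in range(N_CLUSTERS) for _ in range(CLUSTER_SIZE)]

def churn_given_friend_churned(rows):
    """P(a player churned | another member of their cluster churned)."""
    by_cluster = defaultdict(list)
    for c, ch in rows:
        by_cluster[c].append(ch)
    cond = []
    for members in by_cluster.values():
        for i, ch in enumerate(members):
            if any(members[:i] + members[i + 1:]):
                cond.append(ch)
    return sum(cond) / len(cond)

print(f"marginal churn rate:              {overall_rate:.2f}")
print(f"real   P(churn | friend churned): {churn_given_friend_churned(real):.2f}")
print(f"synth  P(churn | friend churned): {churn_given_friend_churned(synthetic):.2f}")
```

In the real data the conditional probability is far above the marginal rate; in the synthetic data the two are nearly identical, so any model trained on it learns that friends' churn is uninformative - exactly the wrong lesson.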
A Path Forward: Permissionless Data Licensing Protocols
This is where permissionless data licensing protocols come into play. Think of them as the missing infrastructure layer for AI training data. In a permissionless environment like the Émet Fair Data Protocol (the "Émet Protocol"), anyone - even you - can upload, embed, encrypt, and license data; on the other side of the equation, buyers can search, preview, and purchase that same data. These protocols introduce a framework for fair, consensual, and quality-assured data exchange in a scalable environment, leveraging blockchain-based smart contracts to control the execution process and on-chain data storage to ensure data availability. In many cases, the consent conundrum is addressed through composable legal contracts that protect both data providers and buyers, and that connect to the on-chain smart contracts to effectuate payment. As for the quality question: by making it easier for buyers to purchase or license data, and by creating an open market for it, the cream will naturally rise to the top, with respected data suppliers commanding a higher price over time.
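The listing-then-licensing flow described above can be sketched in a few lines. This is a deliberately simplified, in-memory model of what a protocol's smart contracts might do - every name here (`DataListing`, `LicenseRegistry`, and so on) is an illustrative assumption, not the Émet Protocol's actual API:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class DataListing:
    provider: str
    content_hash: str   # hash of the (encrypted) dataset, anchoring it on-chain
    price: int          # price in protocol tokens
    license_terms: str  # pointer to a composable legal contract

@dataclass
class LicenseRegistry:
    """Toy stand-in for the protocol's on-chain state."""
    listings: dict = field(default_factory=dict)
    grants: list = field(default_factory=list)   # append-only license ledger
    balances: dict = field(default_factory=dict)

    def list_data(self, provider, data_bytes, price, terms):
        # Anyone may list: hash the payload and publish the listing.
        h = hashlib.sha256(data_bytes).hexdigest()
        self.listings[h] = DataListing(provider, h, price, terms)
        return h

    def purchase(self, buyer, content_hash):
        listing = self.listings[content_hash]
        if self.balances.get(buyer, 0) < listing.price:
            raise ValueError("insufficient funds")
        # In a real smart contract these steps execute atomically:
        # payment moves to the provider and the license grant is recorded.
        self.balances[buyer] -= listing.price
        self.balances[listing.provider] = (
            self.balances.get(listing.provider, 0) + listing.price)
        self.grants.append((buyer, content_hash, listing.license_terms))
        return listing.license_terms

registry = LicenseRegistry()
registry.balances["buyer-1"] = 100
h = registry.list_data("provider-1", b"dialect corpus v1", 40, "license-terms-v1")
terms = registry.purchase("buyer-1", h)
```

The key property the sketch tries to capture is that listing is permissionless while payment and the license grant are coupled in one step, so neither party has to trust the other before value changes hands.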
The Economics of Better Data Availability and Permissionless Systems
Permissionless data systems have an added economic benefit for the ecosystem: they are catalyzing entirely new models of data monetization, from individuals selling their professional expertise and social media engagement patterns to communities collectively licensing their interaction and dialect data. These emerging markets are creating novel revenue streams that simply weren't possible, or profitable, in traditional data brokerage systems. And as these monetization pathways mature, they will begin to drive a virtuous cycle of increased data availability and quality, which in turn will dramatically reduce costs for AI developers. Small teams and startups that once faced prohibitive pricing for training data will be able to access, at the click of a button, diverse, high-quality datasets at a fraction of traditional costs. This democratization of data access is particularly powerful for specialized AI applications and domain-specific models, where data has historically been scarce or expensive. One can imagine a single engineer in Brooklyn using Aramaic, Yiddish, and Hebrew datasets to develop a whole new compendium of Hasidic texts that crosses sectarian divides - or a single developer in Egypt shipping a multi-dialect Arabic LLM.
The net result of a permissionless data licensing protocol is a more vibrant AI ecosystem where innovation isn't limited to well-funded tech giants, but extends to independent developers and niche applications that can now compete on more equal footing.
Let's build!