In the rapidly evolving landscape of artificial intelligence (AI), the race has taken a new turn, transforming into a sprint for data acquisition. Cutting-edge AI models, capable of remarkable feats such as passing the U.S. bar exam and generating human-like text, are pushing the boundaries of what the technology can do. To improve further, these systems increasingly rely on diverse and sophisticated datasets, including images and scientific papers. Such data, however, is harder to come by: it is scarcer and more expensive to obtain.

The efficacy of AI software is closely tied to the quality of the datasets used for training. Social media posts are plentiful on the internet, but they often carry biases or prejudices, and online images are frequently low-resolution. Microsoft's experience with biased outputs from an AI model trained on Twitter posts serves as a cautionary tale. AI companies are therefore seeking more reliable sources, turning their attention to scientific papers and books written by seasoned authors, despite the difficulty of locating such material.
According to data categorized by researchers at Epoch, an estimated 17 trillion high-quality words are freely accessible on the internet, compared with a far larger but lower-quality pool of up to 71 quadrillion words. If AI models keep consuming information at the current pace, projections suggest, they could exhaust the superior sources by 2026.

In response to the looming scarcity, developers are exploring the use of AI to generate bespoke data for specific models. Numerous projects already rely on synthetic content, often obtained from data-generating services such as Mostly AI. American Express, for instance, uses such data to detect unusual fraud patterns, while Alphabet's Waymo trains its self-driving software on fabricated scenarios. Gartner forecasts that 60% of AI data will be synthetic by 2024, a substantial leap from the 1% recorded in 2021.
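To make the idea of synthetic training data concrete, the sketch below generates fake transaction records with a small share of deliberately anomalous, fraud-like entries. It is a minimal illustration only: the column names, distributions, and the 2% fraud rate are assumptions for demonstration, not details of any vendor's actual pipeline.

```python
# Minimal sketch of synthetic-data generation for model training.
# All column names, distributions, and the 2% fraud rate are
# illustrative assumptions, not details from a real system.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def synthesize_transactions(n: int, fraud_rate: float = 0.02) -> pd.DataFrame:
    """Generate n synthetic card transactions, a small share of which
    are deliberately anomalous (fraud-like) records."""
    is_fraud = rng.random(n) < fraud_rate
    amounts = np.where(
        is_fraud,
        rng.lognormal(mean=7.0, sigma=1.0, size=n),   # larger, rarer amounts
        rng.lognormal(mean=3.5, sigma=0.8, size=n),   # typical spending
    )
    hours = np.where(
        is_fraud,
        rng.integers(0, 6, size=n),                   # odd hours of activity
        rng.integers(7, 23, size=n),                  # daytime activity
    )
    return pd.DataFrame({
        "amount": amounts.round(2),
        "hour_of_day": hours,
        "merchant_id": rng.integers(1, 500, size=n),
        "is_fraud": is_fraud.astype(int),
    })

# Synthetic records like these can augment scarce real fraud examples
# before a classifier is trained.
df = synthesize_transactions(100_000)
print(df["is_fraud"].mean())  # roughly 0.02
```

The appeal of this approach is that rare events (here, fraud) can be over-sampled at will without exposing real customer data, which is one reason services like Mostly AI have found customers in finance.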
Nevertheless, AI models' hunger for real-world information persists, particularly for the vast repositories held by major publishers and offline sources. This presents a potential boon for companies like RELX, owner of The Lancet and the LexisNexis legal database. Shares in RELX have surged more than 30% over the past year, reflecting the increasing demand for its AI software. Similarly, News Corp, publisher of the Wall Street Journal and the Times, is negotiating content deals with AI developers, anticipating a lucrative revenue stream.

Despite the promise of richer datasets, striking deals with large publishers and offline repositories will impose additional costs on AI companies. They already spend roughly 15% of their revenue on sorting and cleaning data, according to estimates by the venture capital firm Andreessen Horowitz, and further financial strain lies ahead. Royalty payments to content creators, on top of escalating computing and cloud storage costs, will thin profit margins further. Facing legal challenges from the likes of Warner Music Group and Getty Images, AI companies, including industry giants such as OpenAI, are being compelled to settle up and pay the price for their voracious data consumption. The data dash, it seems, will not come cheap.