Voltaire Staff

Data drying up for AI firms with increasing use of anti-crawl measures: Report



The data available for training AI is drying up, as many of the most important web sources have restricted the use of their content, according to a study published this week by the Data Provenance Initiative, an MIT-led research group.

 

The study looked at 14,000 web domains that were commonly used for training AI models and found an "emerging crisis in consent." 


The researchers found that in three data sets, C4, RefinedWeb, and Dolma, 5 per cent of all data, and 25 per cent of data from the highest-quality sources, has been restricted. Most of these restrictions were imposed through the Robots Exclusion Protocol, the long-standing robots.txt convention that tells automated crawlers which parts of a site they may visit.
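The Robots Exclusion Protocol works through a plain-text robots.txt file served at a site's root. A minimal sketch of the kind of rules publishers have been adding is below; the domain is a placeholder, but the crawler names are real ones (GPTBot is OpenAI's training crawler, CCBot is Common Crawl's, and Google-Extended is Google's opt-out token for AI training):

```
# robots.txt served at https://example.com/robots.txt (example.com is a placeholder)
# Block AI-training crawlers while leaving ordinary crawlers alone.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers, such as regular search-engine bots, remain allowed.
User-agent: *
Allow: /
```

Note that robots.txt is purely advisory: compliant crawlers honour it, but nothing in the protocol technically prevents a non-compliant bot from fetching the pages anyway.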

 

"We're seeing a rapid decline in consent to use data across the web that will have ramifications not just for AI companies, but for researchers, academics, and noncommercial entities," Shayne Longpre, the study's lead author, told New York Times

 

Data is crucial for AI systems, which are fed images, text, and videos during training. Generative AI tools like ChatGPT, Google's Gemini, and Anthropic's Claude learn from data to write, code, and create images and videos. The more high-quality data these models are fed, the better their outputs.

 

For years, AI developers collected data easily, but the boom in the industry in recent years has led to tension with the owners of that data. Some publishers have created paywalls and changed their terms of service to limit the use of their data for generative AI purposes, while others have blocked the automated web crawlers used by companies such as OpenAI, Anthropic, and Google.

 

Reddit and Stack Overflow have started charging AI companies for data access. A few publishers have also taken legal action against AI companies. 

 

Of late, OpenAI, Google, and Meta have gone to extreme lengths to obtain data, including transcribing YouTube videos and bending their own data policies. Some companies have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, to gain access to their data.


All the same, one researcher argued that the big companies already have all the data they need, and that the current fencing off of data amounts to bolting the barn door after the horse has left.


That researcher was Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organisation.


"Major tech companies already have all of the data," she said. "Changing the license on the data doesn't retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers," she told NYT.






