
Reddit Implements New Measures to Protect Against AI Data Scraping

25 June 2024 | Paikan Begzad

Summary

Reddit has announced significant updates to its Robots Exclusion Protocol (robots.txt file) to prevent unauthorized data scraping by AI crawlers. Traditionally, the robots.txt file tells web bots which parts of a site they may crawl, mainly to facilitate search engine indexing. With the rise of AI, however, websites are being scraped extensively to train AI models, often without attribution or permission.
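For illustration, a robots.txt file in the spirit of Reddit's update might look like the following sketch (the directives are illustrative, not Reddit's actual file): a wildcard rule disallows all crawlers, while a named crawler with an agreement is explicitly allowed.

    # Block all crawlers from the entire site by default
    User-agent: *
    Disallow: /

    # Hypothetical exception for an approved search crawler
    User-agent: Googlebot
    Allow: /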

In addition to updating the robots.txt file, Reddit will continue to rate-limit and block unknown bots and crawlers that do not comply with its Public Content Policy or lack an agreement with the platform. According to Reddit, these changes should not affect most users or good-faith actors such as researchers and organizations like the Internet Archive. Instead, they aim to discourage AI companies from using Reddit content to train their large language models without authorization.
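Rate limiting itself is a standard server-side technique. As a rough sketch of the idea (not Reddit's implementation; the window size and request budget below are assumed values), a fixed-window limiter keyed by user agent could look like this in Python:

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60   # length of each counting window (assumed value)
    MAX_REQUESTS = 100    # hypothetical per-window request budget

    _counts = defaultdict(int)

    def allow_request(user_agent: str) -> bool:
        """Return True if this client is still within its per-window budget."""
        window = int(time.time()) // WINDOW_SECONDS
        _counts[(user_agent, window)] += 1
        return _counts[(user_agent, window)] <= MAX_REQUESTS

Requests beyond the budget would be throttled or rejected; real deployments typically key on IP ranges and behavioral signals rather than the easily spoofed user-agent string.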

This move follows a recent investigation by Wired, which revealed that AI-powered search startup Perplexity had been scraping websites despite being disallowed in their robots.txt files. Perplexity's CEO responded that the robots.txt file is not legally binding. Reddit's new policies signal to AI companies that they must negotiate agreements to access Reddit's data. For example, Reddit has a $60 million deal with Google, allowing the tech giant to train its AI models on Reddit content.
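The episode underscores that robots.txt compliance is voluntary: the check happens on the crawler's side. A well-behaved crawler consults the file before fetching, for example with Python's standard urllib.robotparser (the user-agent string below is hypothetical):

    from urllib import robotparser

    # Fetch and parse the site's robots.txt
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.reddit.com/robots.txt")
    rp.read()

    # Ask whether this crawler may fetch a given URL
    if rp.can_fetch("ExampleCrawler/1.0", "https://www.reddit.com/r/all/"):
        print("Allowed to crawl")
    else:
        print("Disallowed by robots.txt")

Nothing in the protocol prevents a crawler from skipping this check, which is why Reddit pairs the file with server-side rate limiting and blocking.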

Reddit emphasized that all entities accessing its content must comply with its policies designed to protect user data. The changes build on Reddit's recent policy updates, which set out guidelines on how commercial entities can access and use Reddit's data.