Cloudflare is fighting back against AI web scrapers


published 5 July 2024

Cloudflare has announced a new tool to help internet users block AI web scrapers and crawlers, as firms flood the net with bots to glean content to train their models.

The feature, described as an ‘easy button’, will allow users to block AI bots and web crawlers with a single click, and is available for all Cloudflare customers, including those on its free tier.

In a blog post launching the feature, Cloudflare said the popularity of generative AI has caused a sharp increase in demand for content to train models, and it wants to “help preserve a safe Internet for content creators”.

Last year, Cloudflare announced users would have the ability to manage AI crawlers that “behave well” with new bot categories. These are bots that follow the robots.txt file and don’t use unlicensed content to train their models or run inference for retrieval-augmented generation (RAG) systems using web data.

Cloudflare found the vast majority (85%) of its customers preferred to block AI crawlers from their websites, and it has now added a simpler way for users to do this.

To enable the feature, navigate to the Security > Bots section of the Cloudflare dashboard and click the toggle labeled AI Scrapers and Crawlers.

Cloudflare said it will update the tool over time as it sees new fingerprints of misbehaving bots scraping the web for model training.

To guarantee it stays on top of AI crawler activity on the web, Cloudflare surveyed the traffic across its network to gauge which bots are the worst offenders.

Cloudflare found the top four AI crawlers by activity were ByteDance’s Bytespider, Amazon’s Amazonbot, Anthropic’s ClaudeBot, and OpenAI’s GPTBot, noting that Bytespider leads not only in number of requests but also in both the extent of its crawling and the frequency with which it is blocked.

AI bots accessed two-fifths of the top one million internet properties

In the blog post, Cloudflare noted recent news of some of the major hyperscalers trying to get their hands on as much internet data as possible to gain a competitive edge in a booming market.

Google, for example, signed an AI content licensing agreement with Reddit to get access to user-generated content, reportedly worth around $60 million per year.

OpenAI got into hot water after it was accused of using Scarlett Johansson’s voice in its new GPT-4o multimodal model.

As companies scramble to collect ever more data, the internet will likely continue to see a flood of AI bots.

In June, AI bots accessed around 39% of the top one million internet properties using Cloudflare, but notably only 2.98% of these domains took action to block or challenge those requests.

Cloudflare said it has observed website operators completely blocking access to AI crawlers using robots.txt, but the blocks rely on the bot operator adhering to the Robots Exclusion Protocol, which they often don’t.
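To illustrate why such blocks depend on cooperation, the sketch below uses Python’s standard urllib.robotparser module to evaluate a robots.txt file that disallows the four crawlers named earlier in this article. The file contents, the example.com URL, and the is_allowed helper are assumptions made for illustration and do not come from Cloudflare’s post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows four AI crawlers site-wide
# and leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: Bytespider
User-agent: ClaudeBot
User-agent: Amazonbot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Bytespider", "Mozilla/5.0"):
        print(agent, "allowed" if is_allowed(agent) else "disallowed")
```

This check runs on the crawler’s side. A bot that never performs it, or ignores the result, is not stopped by the file at all, which is the weakness Cloudflare describes.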

Unfortunately, the firm noted it has observed bot operators trying to appear as though they are a real browser by using spoofed user agents, but stated its machine learning model has been able to catch this activity so far.
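A minimal sketch of why spoofing defeats simple filtering: the check below flags a request only if its User-Agent header contains a known crawler name. The token list is drawn from the crawlers named in this article, while both header strings and the function name are made up for illustration. This is not how Cloudflare’s machine learning model works; it only shows the naive approach that spoofing gets around.

```python
# Crawler names taken from the article; matching is case-insensitive.
AI_CRAWLER_TOKENS = ("bytespider", "amazonbot", "claudebot", "gptbot")


def looks_like_ai_crawler(user_agent: str) -> bool:
    """Naive check: does the User-Agent header name a known AI crawler?"""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


# Illustrative header values (assumptions, not captured traffic).
honest_ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"
spoofed_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(looks_like_ai_crawler(honest_ua))   # crawler identifies itself
    print(looks_like_ai_crawler(spoofed_ua))  # same bot posing as a browser
```

Because the header is entirely under the client’s control, a scraper that sends the second string passes this filter, which is why detection has to rely on signals other than the declared user agent.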

Each such bot is assigned a score reflecting that it has been identified as a ‘likely bot’, and Cloudflare said it would continually update its detection using its global signals.

Enterprise Bot Management customers can flag suspicious activity by submitting a False Negative Feedback Loop report. Cloudflare has also set up a reporting tool where any customer can report an AI bot that’s scraping their site without permission.

Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.
