UPDATED 16:52 EDT / JULY 03 2024

Cloudflare rolls out feature for blocking AI companies’ web scrapers

Cloudflare Inc. today debuted a new no-code feature for preventing artificial intelligence developers from scraping website content.

The capability is available as part of the company’s flagship CDN, or content delivery network. The platform is used by a sizable percentage of the world’s websites to speed up page loading times for users. According to Cloudflare, the new scraping prevention feature is available in both the free and paid tiers of its CDN.

Many AI companies use content from the public web to train their large language models. OpenAI, Google LLC and several other market players enable website operators to opt out of scraping. However, not all LLM developers provide such an option, which is the issue that Cloudflare hopes to address with its scraping prevention tool.
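The article does not say how the opt-out works. In practice it is usually expressed through a site's robots.txt file, using crawler tokens such as OpenAI's `GPTBot` and Google's `Google-Extended`. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a file. The robots.txt contents, the `example.com` URL and the `may_crawl` helper are illustrative assumptions, not anything Cloudflare or the article describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: opts out of two AI crawlers,
# leaves the site open to everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("GPTBot", "Google-Extended", "Mozilla/5.0"):
        print(agent, may_crawl(agent, page))
```

A check like this only works when the crawler chooses to run it and identifies itself honestly, which is why an opt-out does nothing against developers that ignore the file.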

The feature uses AI to detect automated content extraction attempts. According to Cloudflare, its software can spot bots that scrape content for LLM training projects even when they attempt to avoid detection. 

“Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent,” Cloudflare engineers wrote in a blog post today. “We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot.”

One of the crawlers that Cloudflare managed to detect is a bot that collects content for Perplexity AI Inc., a well-funded search engine startup. Last month, Wired reported that the manner in which the bot scrapes websites makes its requests appear as regular user traffic. As a result, website operators have struggled to block Perplexity AI from using their content.

Cloudflare assigns every website visit that its platform processes a score from 1 to 99. The lower the number, the greater the likelihood that the request was generated by a bot. According to the company, requests made by the bot that collects content for Perplexity AI consistently receive a score under 30.
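As a rough illustration of how a site operator might act on such a score, here is a minimal sketch. It assumes the score arrives as an integer from 1 to 99. The cutoff of 30 is borrowed from the Perplexity figure above and is not a documented Cloudflare setting, and the `is_likely_bot` helper is hypothetical.

```python
# Assumption: the cutoff mirrors the article's "under 30" observation;
# it is not an official Cloudflare threshold.
BOT_SCORE_CUTOFF = 30


def is_likely_bot(score: int, cutoff: int = BOT_SCORE_CUTOFF) -> bool:
    """Treat a request as probable bot traffic when its score is below cutoff.

    Lower scores mean a higher likelihood of automation, per the article.
    """
    if not 1 <= score <= 99:
        raise ValueError(f"score must be between 1 and 99, got {score}")
    return score < cutoff


if __name__ == "__main__":
    for score in (5, 29, 30, 85):
        verdict = "likely bot" if is_likely_bot(score) else "likely human"
        print(score, verdict)
```

Where to set the cutoff is a policy choice. A lower value blocks fewer legitimate visitors but lets more automated traffic through.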

“When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint,” Cloudflare’s engineers detailed. “For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint.”

Cloudflare will update the feature over time to address changes in AI scraping bots’ technical fingerprints and the emergence of new crawlers. As part of the initiative, the company is rolling out a tool that will enable website operators to report any new bots they may encounter. 

Image: Cloudflare
