Why Cloudflare can't stop Google from scraping sites to feed its AI Models
Today: why Cloudflare can't stop Google's scraping for AI, SemiAnalysis's report on Meta's compute-to-talent reinvention, Windsurf's OpenAI split, and the jailbroken GROK-4 system prompt.
Read time: 8 min
Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
In today's Edition (12-July-2025):
Why Cloudflare can't stop Google from scraping sites to feed its AI Models
SemiAnalysis published a detailed report unpacking Meta's unprecedented reinvention from Compute to Talent in the pursuit of Superintelligence
Byte-Size Briefs:
OpenAI's Windsurf deal is off, and Windsurf's CEO is going to Google.
A new, super-hard math benchmark, FrontierMath Tier 4, has been released, and o4-mini (high) ranks #1 with only 6.3% accuracy.
The GROK-4 and GROK-4-HEAVY system prompts were jailbroken, revealing defaults for chain-of-thought and ensemble modules.
Why Cloudflare can't stop Google from scraping sites to feed its AI Models
Background
Last week Cloudflare, the firm that guards and routes website traffic, launched a permission-based tool that lets clients automatically shut out AI companies that want their data, a move that could shake up how publishers compete in the rush to build AI.
But this blocking strategy will not work as expected with Google. Let's see why.
Cloudflare is what's called a content delivery network, or CDN. It helps businesses deliver online content and applications faster by caching the data closer to end-users.
What are AI crawlers? AI crawlers are automated bots designed to extract large quantities of data from websites, databases, and other sources of information to train large language models from the likes of OpenAI and Google.
So now, every new web domain that signs up with Cloudflare will be asked whether it wants to allow or block AI crawlers. At least 16% of the world's internet traffic gets routed through Cloudflare, one of the world's largest content delivery networks.
Cloudflare can shut the door on almost every known AI crawler, but not on Google's. Google uses the exact same Googlebot process to build its search index and to harvest text for Gemini, its large language model. If Cloudflare blocked those requests at the edge, every site behind Cloudflare would vanish from Google Search, wiping out search-driven readership and the ad revenue that comes with it. Google's crawler simply does too many jobs at once, and Cloudflare has no safe way to tell which request fuels classic search and which one feeds Gemini.
How Cloudflare will block most AI crawlers
Cloudflare fingerprints bots like GPTBot, ClaudeBot, Meta-ExternalAgent, and many others, then lets site owners flip a single switch to keep them out. The system matches user-agent strings, IP blocks, and behavioral patterns, so requests from OpenAI or Anthropic can be dropped without touching legitimate traffic. Since July 2024 more than 1M zones have enabled that feature and traffic from the former top offender Bytespider has fallen by 71.45%.
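To make the mechanism concrete, here is a minimal Python sketch of the user-agent layer of that fingerprinting. The bot names are the ones mentioned in this article; the real system also weighs IP blocks and behavioral signals:

# Simplified sketch: drop requests whose User-Agent matches a known AI crawler.
AI_CRAWLERS = {"gptbot", "claudebot", "meta-externalagent", "bytespider"}

def should_block(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_CRAWLERS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)"))     # True: blocked
print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False: allowed,
# yet this one header could be a Search fetch or a Gemini fetch. Nothing in the
# request says which, and that ambiguity is the whole Google problem below.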
Why Google traffic is a special case
Shared identity. Googlebot does the crawling for ranking pages and for collecting training text. The HTTP headers and IP space look identical, so an edge service cannot split one purpose from the other in real time.
Business dependence. Cloudflare's customers rely on Google referrals; internal metrics show Google sends roughly 1 referral click for every 14 HTML fetches it performs. Losing those clicks would demolish many publishers' revenue.
No separate crawler for AI Overviews. Google bundles its new AI answers into regular Search infrastructure and has refused to expose a dedicated bot, despite pressure from Cloudflare CEO Matthew Prince on X and in interviews.
Robots.txt and the Google-Extended gap
Google's official workaround is a special token in robots.txt called Google-Extended. When a site writes
User-agent: Google-Extended
Disallow: /
Google promises not to use the site for Gemini training, but the crawler still enters as Googlebot, and enforcement is voluntary. Cloudflare can help owners generate that file automatically, yet it still can't enforce the rule once the request reaches the server. In short, robots.txt is a handshake, not a gate.
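For concreteness, a fuller robots.txt that keeps classic search indexing while opting out of Gemini training would look like the sketch below. It remains a request, not an enforcement mechanism:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /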
Google traffic is too valuable to risk
Cloudflare's own telemetry shows AI crawlers like GPTBot send almost 0 traffic back to the pages they scrape, while Google at least sends some readers. By June 2025 the crawl-to-referral ratio for Google sat at 14:1, but OpenAI hit 1,700:1 and Anthropic 73,000:1. Publishers will tolerate Google's imbalance because losing every Google visitor would be worse.
What site owners can still do
Use Cloudflare's managed robots.txt to insert the Google-Extended rule automatically, so Gemini training is discouraged without hurting SEO.
Add nosnippet or data-nosnippet markup if you also want to keep paragraphs out of AI Overviews, though this may reduce visibility in featured snippets (a markup sketch follows this list).
Monitor the AI Audit dashboard to see whether Google obeys your preferences and to spot any unverified bots that might ignore them.
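For reference, the snippet controls from the second item look like this in a page's HTML. The surrounding text is illustrative; nosnippet and data-nosnippet themselves are documented Google directives:

<!-- Page-level: keep all of this page's text out of snippets and AI Overviews -->
<meta name="robots" content="nosnippet">

<!-- Element-level: exclude only the marked passage -->
<p>Free teaser text. <span data-nosnippet>This sentence stays out of snippets.</span></p>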
Where the fight is headed
Cloudflare CEO Prince says he is "encouraged" that Google will eventually split out a new crawler or a header flag so Cloudflare can block AI usage without harming search indexing. For now, though, Cloudflare's hands are tied: the only technical lever that would let it distinguish Google's AI scrape from its search crawl does not exist. Until Google creates that separation, Cloudflare has to let the traffic through. That is why, despite its aggressive stance against OpenAI, Anthropic, and others, Cloudflare cannot currently block Google from scraping sites for AI.
SemiAnalysis published a detailed report unpacking Meta's unprecedented reinvention from Compute to Talent in the pursuit of Superintelligence
Meta is throwing $30B at data, 1GW tents, and lavish salaries to reboot Llama Behemoth. Meta is also quietly building one of the world's largest AI training clusters in Ohio. Within their infrastructure organization they are calling this cluster Prometheus.
Prometheus in Ohio draws 1GW today, while Hyperion in Louisiana targets 2GW by 2027; both are packed with Nvidia racks and powered by on-site gas turbines.
On a technical level, they believe the major contributors to the failed Behemoth run were as follows:
Chunked attention
Expert choice routing
Pretraining data quality
Scaling strategy and coordination
Chunked attention: Engineers split each long prompt into fixed blocks so the attention matrix stayed small. Tokens inside one block never looked outside, creating "blind spots" that erased dependencies across distant sentences. This shortcut kept memory low, yet it wrecked tasks that need global context, so perplexity and downstream scores stalled.
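A tiny NumPy sketch (my illustration, not Meta's code) shows how a block-diagonal mask creates those blind spots:

import numpy as np

def chunked_attention_mask(seq_len, chunk):
    # Token i may attend to token j only if both fall in the same chunk.
    ids = np.arange(seq_len) // chunk
    return ids[:, None] == ids[None, :]

# 8 tokens, chunk size 4: the zero blocks are the blind spots, meaning no
# attention path ever connects positions 0-3 with positions 4-7.
print(chunked_attention_mask(8, 4).astype(int))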
Expert choice routing: The team used a sparse Mixture-of-Experts where specialists pick the tokens they will process. If the router skews traffic, some experts overflow while others idle, leaving parameters under-trained and step time uneven. During inference the model also jumped between many GPU partitions, adding network latency that wiped out the theoretical speed gain.
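Here is a minimal NumPy sketch of the expert-choice idea (again my illustration): each expert grabs its top-scoring tokens, so a skewed router can leave some tokens picked by no expert at all, and those tokens go under-trained:

import numpy as np

def expert_choice(scores, capacity):
    # scores: (num_tokens, num_experts) router affinities. Each expert
    # independently keeps its `capacity` highest-scoring tokens.
    picks = {e: np.argsort(scores[:, e])[-capacity:]
             for e in range(scores.shape[1])}
    chosen = set(np.concatenate(list(picks.values())).tolist())
    dropped = set(range(scores.shape[0])) - chosen  # tokens no expert processes
    return picks, dropped

rng = np.random.default_rng(0)
picks, dropped = expert_choice(rng.random((8, 2)), capacity=3)
print(picks, "tokens seen by no expert:", dropped)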
Pretraining data quality: The crawl contained heavy duplication, near-duplicates, and label leakage. Such contamination teaches the network to memorise rather than reason, hurting generalisation and making evaluation look better than reality. Cleaning late in the pipeline meant bad shards still slipped through, so the model learned fuzzy or conflicting patterns.
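To give a sense of what "cleaning" means here, below is a hedged sketch of the simplest possible step, exact-duplicate removal by hashing normalized text. Near-duplicate and leakage detection are much harder and are where late cleaning hurts most:

import hashlib

def drop_exact_dupes(docs):
    # Hash whitespace-normalized, lowercased text; keep first occurrence only.
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(drop_exact_dupes(["The cat sat.", "the  cat   sat.", "A different doc."]))
# -> ['The cat sat.', 'A different doc.']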
Scaling strategy and coordination: Schedules assumed perfect hardware delivery and linear scaling-law gains. In practice GPU supply shocks, mismatched software versions, and pipeline back-pressure caused idle clusters and rushed patches. These hiccups amplified the other three issues, locking the run into a lower performance bracket while burning budget.
Meta ditched slow H-shape datacenter builds for prefab tents that spin up in months; no diesel, just smart throttling on scorching days.
The old "H-shape" datacenter was a big concrete box that took years to finish. Meta scrapped that blueprint and went with lightweight prefab structures the team casually calls tents. Each tent is shipped in finished modules, dropped on a pad, wired up, and pumping heat from GPUs in a few months, not years.
Because the goal is speed, Meta leaves out the usual diesel backup generators. Power comes straight from nearby substations, plus on-site gas turbines at the larger campuses. When the grid is strained on especially hot days, Meta's software will slow or pause some training jobs rather than fire up diesel engines.
So this sums up Meta's shift: trade slower, fortress-style buildings for quick-build tents, ditch bulky diesel backups, and rely on smart throttling to keep the GPUs humming while chasing bigger language models.
Byte-Size Briefs
OpenAI's Windsurf deal is off, and Windsurf's CEO is going to Google.
Windsurf's $3B sale to OpenAI collapsed, and Google DeepMind scooped up CEO Varun Mohan, co-founder Douglas Chen, and part of the team. Google takes no equity, just a non-exclusive license to Windsurf's engine. The 250-person startup keeps its name and keeps selling its tools.
A reverse-acquihire works like a talent magnet, where the lab hires the brains and leases the tech, so regulators see no merger and founders still get paid.
The hot prize is agentic coding. Windsurf's system turns plain English specs into production-ready code, then tests, refactors, and ships it with minimal human nudges.
DeepMind wants that loop inside Gemini so the model can plan, write, and check code in one flow. Mohan's crew brings data pipelines, evaluation harnesses, and safety rails already tuned on years of enterprise use.
Windsurf keeps most engineers and its paying customers. It now competes on the open market, but Google's license means its own invention may power Gemini features first. The shuffle shows talent portability matters more than ownership in today's AI race.
A new, super-hard math benchmark, FrontierMath Tier 4, has been released, and o4-mini (high) ranks #1 with only 6.3% accuracy.
It contains several hundred unpublished, expert-level mathematics problems that take specialists hours to days to solve. Difficulty Tiers 1-3 cover undergraduate through early graduate level problems, while Tier 4 is research-level mathematics.
The GROK-4 and GROK-4-HEAVY system prompts were jailbroken, revealing defaults for chain-of-thought and ensemble modules.
That's a wrap for today, see you all tomorrow.