Why Cloudflare can't stop Google from scraping sites to feed its AI Models
Today: why Cloudflare can't stop Google's scraping for AI, SemiAnalysis's report on Meta's compute-to-talent reinvention, Windsurf's OpenAI split, and the jailbroken GROK-4 system prompt.
Read time: 8 min
Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
In today's Edition (12-July-2025):
Why Cloudflare can't stop Google from scraping sites to feed its AI Models
SemiAnalysis published a detailed report unpacking Meta's unprecedented reinvention from Compute to Talent in the pursuit of Superintelligence
Byte-Size Briefs:
OpenAI's Windsurf deal is off, and Windsurf's CEO is going to Google.
A new, super-hard math benchmark, FrontierMath Tier 4, has been released, and o4-mini (high) ranks #1 with only 6.3% accuracy.
The GROK-4 and GROK-4-HEAVY system prompts were jailbroken, revealing defaults for chain-of-thought and ensemble modules.
Why Cloudflare can't stop Google from scraping sites to feed its AI Models
Background
Last week Cloudflare, the firm that guards and routes website traffic, launched a permission-based tool that lets clients automatically shut out AI companies that want their data, a move that could shake up how publishers compete in the rush to build AI.
But this blocking strategy will not work as expected with Google. Let's see why.
Cloudflare is what's called a content delivery network, or CDN. It helps businesses deliver online content and applications faster by caching the data closer to end-users.
What are AI crawlers? AI crawlers are automated bots designed to extract large quantities of data from websites, databases, and other sources of information to train large language models from the likes of OpenAI and Google.
So now, every new web domain that signs up with Cloudflare will be asked whether it wants to allow or block AI crawlers. At least 16% of the world's internet traffic gets routed through Cloudflare, one of the world's largest content delivery networks.
Cloudflare can shut the door on almost every known AI crawler, but not on Google's. Google uses the exact same Googlebot process to build its search index and to harvest text for Gemini, its large language model. If Cloudflare blocked those requests at the edge, every site behind Cloudflare would vanish from Google Search, wiping out search-driven readership and the ad revenue that comes with it. Google's crawler simply does too many jobs at once, and Cloudflare has no safe way to tell which request fuels classic search and which one feeds Gemini.
How Cloudflare will block most AI crawlers
Cloudflare fingerprints bots like GPTBot, ClaudeBot, Meta-ExternalAgent, and many others, then lets site owners flip a single switch to keep them out. The system matches user-agent strings, IP blocks, and behavioral patterns, so requests from OpenAI or Anthropic can be dropped without touching legitimate traffic. Since July 2024 more than 1M zones have enabled that feature and traffic from the former top offender Bytespider has fallen by 71.45%.
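To make the mechanism concrete, here is a minimal Python sketch of the user-agent layer of that fingerprinting. The bot names are the ones mentioned in this article; the real system also weighs IP blocks and behavioral signals:

# Simplified sketch: drop requests whose User-Agent matches a known AI crawler.
AI_CRAWLERS = {"gptbot", "claudebot", "meta-externalagent", "bytespider"}

def should_block(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in AI_CRAWLERS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)"))     # True: blocked
print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False: allowed,
# yet this one header could be a Search fetch or a Gemini fetch. Nothing in the
# request says which, and that ambiguity is the whole Google problem below.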
Why Google traffic is a special case
Shared identity. Googlebot does the crawling for ranking pages and for collecting training text. The HTTP headers and IP space look identical, so an edge service cannot split one purpose from the other in real time.
Business dependence. Cloudflare's customers rely on Google referrals; internal metrics show Google sends roughly 1 referral click for every 14 HTML fetches it performs. Losing those clicks would demolish many publishers' revenue.
No separate crawler for AI Overviews. Google bundles its new AI answers into regular Search infrastructure and has refused to expose a dedicated bot, despite pressure from Cloudflare CEO Matthew Prince on X and in interviews.
Robots.txt and the Google-Extended gap
Google's official workaround is a special token in robots.txt called Google-Extended. When a site writes
User-agent: Google-Extended
Disallow: /
Google promises not to use the site for Gemini training, but the crawler still enters as Googlebot, and enforcement is voluntary. Cloudflare can help owners generate that file automatically, yet it still can't enforce the rule once the request reaches the server. In short, robots.txt is a handshake, not a gate.
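For concreteness, a fuller robots.txt that keeps classic search indexing while opting out of Gemini training would look like the sketch below. It remains a request, not an enforcement mechanism:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /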
Google traffic is too valuable to risk
Cloudflare's own telemetry shows AI crawlers like GPTBot send almost 0 traffic back to the pages they scrape, while Google at least sends some readers. By June 2025 the crawl-to-referral ratio for Google sat at 14:1, but OpenAI hit 1,700:1 and Anthropic 73,000:1. Publishers will tolerate Google's imbalance because losing every Google visitor would be worse.
What site owners can still do
Use Cloudflare's managed robots.txt to insert the Google-Extended rule automatically, so Gemini training is discouraged without hurting SEO.
Add nosnippet or data-nosnippet markup if you also want to keep paragraphs out of AI Overviews, though this may reduce visibility in featured snippets (a markup sketch follows this list).
Monitor the AI Audit dashboard to see whether Google obeys your preferences and to spot any unverified bots that might ignore them.
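For reference, the snippet controls from the second item look like this in a page's HTML. The surrounding text is illustrative; nosnippet and data-nosnippet themselves are documented Google directives:

<!-- Page-level: keep all of this page's text out of snippets and AI Overviews -->
<meta name="robots" content="nosnippet">

<!-- Element-level: exclude only the marked passage -->
<p>Free teaser text. <span data-nosnippet>This sentence stays out of snippets.</span></p>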
Where the fight is headed
Cloudflare CEO Prince says he is "encouraged" that Google will eventually split out a new crawler or a header flag so Cloudflare can block AI usage without harming search indexing. For now, though, Cloudflare's hands are tied: the only technical lever that would let it distinguish Google's AI scrape from its search crawl does not exist. Until Google creates that separation, Cloudflare has to let the traffic through. That is why, despite its aggressive stance against OpenAI, Anthropic, and others, Cloudflare cannot currently block Google from scraping sites for AI.
SemiAnalysis published a detailed report unpacking Meta's unprecedented reinvention from Compute to Talent in the pursuit of Superintelligence
Meta is throwing $30B at data, 1GW tents, and lavish salaries to reboot Llama Behemoth. Meta is also quietly building one of the world's largest AI training clusters in Ohio. Within their infrastructure organization they are calling this cluster Prometheus.
Prometheus in Ohio draws 1GW today, while Hyperion in Louisiana targets 2GW by 2027; both are packed with Nvidia racks and powered by on-site gas turbines.
On a technical level, they believe the major contributors to the failed Behemoth run were as follows:
Chunked attention
Expert choice routing
Pretraining data quality
Scaling strategy and coordination
Chunked attention: Engineers split each long prompt into fixed blocks so the attention matrix stayed small. Tokens inside one block never looked outside, creating "blind spots" that erased dependencies across distant sentences. This shortcut kept memory low, yet it wrecked tasks that need global context, so perplexity and downstream scores stalled.
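A tiny NumPy sketch (my illustration, not Meta's code) shows how a block-diagonal mask creates those blind spots:

import numpy as np

def chunked_attention_mask(seq_len, chunk):
    # Token i may attend to token j only if both fall in the same chunk.
    ids = np.arange(seq_len) // chunk
    return ids[:, None] == ids[None, :]

# 8 tokens, chunk size 4: the zero blocks are the blind spots, meaning no
# attention path ever connects positions 0-3 with positions 4-7.
print(chunked_attention_mask(8, 4).astype(int))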
Expert choice routing: The team used a sparse Mixture-of-Experts where specialists pick the tokens they will process. If the router skews traffic, some experts overflow while others idle, leaving parameters under-trained and step time uneven. During inference the model also jumped between many GPU partitions, adding network latency that wiped out the theoretical speed gain.
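Here is a minimal NumPy sketch of the expert-choice idea (again my illustration): each expert grabs its top-scoring tokens, so a skewed router can leave some tokens picked by no expert at all, and those tokens go under-trained:

import numpy as np

def expert_choice(scores, capacity):
    # scores: (num_tokens, num_experts) router affinities. Each expert
    # independently keeps its `capacity` highest-scoring tokens.
    picks = {e: np.argsort(scores[:, e])[-capacity:]
             for e in range(scores.shape[1])}
    chosen = set(np.concatenate(list(picks.values())).tolist())
    dropped = set(range(scores.shape[0])) - chosen  # tokens no expert processes
    return picks, dropped

rng = np.random.default_rng(0)
picks, dropped = expert_choice(rng.random((8, 2)), capacity=3)
print(picks, "tokens seen by no expert:", dropped)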
Pretraining data quality: The crawl contained heavy duplication, near-duplicates, and label leakage. Such contamination teaches the network to memorise rather than reason, hurting generalisation and making evaluation look better than reality. Cleaning late in the pipeline meant bad shards still slipped through, so the model learned fuzzy or conflicting patterns.
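To give a sense of what "cleaning" means here, below is a hedged sketch of the simplest possible step, exact-duplicate removal by hashing normalized text. Near-duplicate and leakage detection are much harder and are where late cleaning hurts most:

import hashlib

def drop_exact_dupes(docs):
    # Hash whitespace-normalized, lowercased text; keep first occurrence only.
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(drop_exact_dupes(["The cat sat.", "the  cat   sat.", "A different doc."]))
# -> ['The cat sat.', 'A different doc.']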
Scaling strategy and coordination: Schedules assumed perfect hardware delivery and linear scaling-law gains. In practice GPU supply shocks, mismatched software versions, and pipeline back-pressure caused idle clusters and rushed patches. These hiccups amplified the other three issues, locking the run into a lower performance bracket while burning budget.
Meta ditched slow H-shape datacenter builds for prefab tents that spin up in months; no diesel, just smart throttling on scorching days.
The old "H-shape" datacenter was a big concrete box that took years to finish. Meta scrapped that blueprint and went with lightweight prefab structures the team casually calls tents. Each tent is shipped in finished modules, dropped on a pad, wired up, and pumping heat from GPUs in a few months, not years.
Because the goal is speed, Meta leaves out the usual diesel backup generators. Power comes straight from nearby substations, plus on-site gas turbines at the larger campuses. When the grid is strained on especially hot days, Meta's software will slow or pause some training jobs rather than fire up diesel engines.
So this sums up Meta's shift: trade slower, fortress-style buildings for quick-build tents, ditch bulky diesel backups, and rely on smart throttling to keep the GPUs humming while chasing bigger language models.
Byte-Size Briefs
OpenAI's Windsurf deal is off, and Windsurf's CEO is going to Google.
Windsurf's $3B sale to OpenAI collapsed, and Google DeepMind scooped up CEO Varun Mohan, co-founder Douglas Chen, and part of the team. Google takes no equity, just a non-exclusive license to Windsurf's engine. The 250-person startup keeps its name and keeps selling its tools.
A reverse-acquihire works like a talent magnet, where the lab hires the brains and leases the tech, so regulators see no merger and founders still get paid.
The hot prize is agentic coding. Windsurf's system turns plain English specs into production-ready code, then tests, refactors, and ships it with minimal human nudges.
DeepMind wants that loop inside Gemini so the model can plan, write, and check code in one flow. Mohan's crew brings data pipelines, evaluation harnesses, and safety rails already tuned on years of enterprise use.
Windsurf keeps most engineers and its paying customers. It now competes on the open market, but Google's license means its own invention may power Gemini features first. The shuffle shows talent portability matters more than ownership in today's AI race.
A new, super-hard math benchmark, FrontierMath Tier 4, has been released, and o4-mini (high) ranks #1 with only 6.3% accuracy.
It contains several hundred unpublished, expert-level mathematics problems that take specialists hours to days to solve. Difficulty Tiers 1-3 cover undergraduate through early graduate level problems, while Tier 4 is research-level mathematics.
The GROK-4 and GROK-4-HEAVY system prompts were jailbroken, revealing defaults for chain-of-thought and ensemble modules.
That's a wrap for today, see you all tomorrow.