2026-04-30 / 5 min read | AISEOWeb

Your Website Is Probably Invisible to AI and You Don't Even Know It

I built a site, optimised it for AI crawlers, then discovered Cloudflare was silently blocking every single one. Here is what I found and how to fix it.

So I just went through the process of building and deploying a new website from scratch. I did everything right. Structured data, sitemap, RSS feed, meta tags, OG images, the lot. I even created llms.txt and llms-full.txt files specifically for AI crawlers so that language models could understand my site content without having to parse HTML.

Then I ran an audit and discovered that none of it mattered. Because Cloudflare was silently blocking every single AI crawler from accessing my site.

What was actually happening

When I checked my robots.txt, I expected to see the simple file I'd written. Three lines: allow everything, and here's the sitemap. Instead I got a wall of Cloudflare-managed rules I never asked for:

```
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Every major AI crawler was being blocked. ClaudeBot, GPTBot, Google-Extended, CCBot, Amazonbot, Bytespider, even Applebot-Extended. Cloudflare had injected its own managed robots.txt on top of mine and the default setting blocks all AI bots.

The thing is, I had already gone into the Cloudflare dashboard and disabled the AI bot protection. But a separate setting, is_robots_txt_managed, was still set to true, and that's the one that injects the managed robots.txt. Disabling AI bot protection and disabling the managed robots.txt are two different things. If you only do one, your site is still invisible to AI.

Why this matters more than you think

We're at a point where AI models are becoming a significant source of traffic and discovery. When someone asks Claude or ChatGPT or Perplexity about a topic and your content is relevant, you want to show up. If AI crawlers can't access your site then your content doesn't exist in their training data or their retrieval systems. You're invisible.

And it's not just about AI chatbots. Google uses Google-Extended for AI features in search. If that's blocked, you're potentially missing out on AI-generated search summaries that could drive traffic to your site.

The full list of what you actually need

Based on what I learned setting this up, here's everything that matters for AI SEO right now:

robots.txt needs to explicitly allow all crawlers. Sounds obvious but check what's actually being served, not what's in your file. Cloudflare, Vercel and other CDNs might be injecting their own rules.
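
The check itself is one command. Something like this, with your own domain and repo path in place of the placeholders:

```
# What does the CDN actually serve? (example.com is a placeholder)
curl -s https://example.com/robots.txt

# Optionally diff it against the file in your repo (path is a placeholder)
curl -s https://example.com/robots.txt | diff - static/robots.txt
```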

llms.txt is a standard that's emerging for AI crawlers. It's a plain text file at the root of your site that gives a concise summary of what your site is about, who you are and what content is available. Think of it like a README for AI. I also serve an llms-full.txt that includes the full text of every blog post so AI models can ingest the content directly without parsing HTML.
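
There's no single enforced format yet, but a minimal llms.txt along the lines I'm describing could look like this (titles, summaries and URLs are placeholders):

```
# AISEOWeb

> Personal blog about AI SEO, web infrastructure and making content discoverable.

## Posts

- [Your Website Is Probably Invisible to AI](https://example.com/blog/invisible-to-ai): How Cloudflare's managed robots.txt silently blocks AI crawlers
- [Another Post](https://example.com/blog/another-post): One-line summary of the post
```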

Structured data using JSON-LD tells both search engines and AI systems exactly what your content is. I use Person schema for my profile, WebSite schema for the site itself and BlogPosting schema on every blog post with the headline, date, author and keywords. This is how AI systems understand the relationships between your content.
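
As a concrete sketch, the BlogPosting markup on a post looks something like this (the values here are illustrative):

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Your Website Is Probably Invisible to AI",
  "datePublished": "2026-04-30",
  "author": { "@type": "Person", "name": "Author Name" },
  "keywords": "AI SEO, robots.txt, Cloudflare"
}
</script>
```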

Sitemap at /sitemap.xml with all your pages, their priorities and change frequencies. Both traditional search engines and AI crawlers use this.
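
A minimal entry, for reference (URL and values are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/invisible-to-ai</loc>
    <lastmod>2026-04-30</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```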

RSS feed at /feed.xml is still relevant. AI aggregation services and research tools pull from RSS feeds. If you don't have one you're missing a distribution channel.
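
If you're adding one from scratch, the RSS 2.0 shape is simple (placeholder values again):

```
<rss version="2.0">
  <channel>
    <title>AISEOWeb</title>
    <link>https://example.com</link>
    <description>Posts about AI SEO and web infrastructure</description>
    <item>
      <title>Your Website Is Probably Invisible to AI</title>
      <link>https://example.com/blog/invisible-to-ai</link>
      <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
```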

Meta tags need to be comprehensive. Not just a title and description but OG tags with images, Twitter card markup, canonical URLs and max-snippet:-1 so search engines can show as much of your content as they want in snippets.
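
In the head of each page, that means something like this (URLs and descriptions are placeholders):

```
<link rel="canonical" href="https://example.com/blog/invisible-to-ai" />
<meta name="robots" content="index, follow, max-snippet:-1" />
<meta property="og:title" content="Your Website Is Probably Invisible to AI" />
<meta property="og:description" content="How Cloudflare silently blocks AI crawlers" />
<meta property="og:image" content="https://example.com/og/invisible-to-ai.png" />
<meta name="twitter:card" content="summary_large_image" />
```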

Security headers matter too. X-Content-Type-Options, X-Frame-Options, Referrer-Policy. These don't directly affect SEO but they affect trust signals and some crawlers look at them.
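
The headers themselves are one-liners. These are common baseline values, not the only valid ones:

```
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
```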

The Cloudflare fix specifically

If you're on Cloudflare here's exactly what you need to do. Go to the API or dashboard and set:

  • ai_bots_protection to disabled
  • content_bots_protection to disabled
  • crawler_protection to disabled
  • is_robots_txt_managed to false

That last one is the killer. You can disable all the bot protection you want but if the managed robots.txt is still active it will override your own file and block everything anyway.

Via the API it looks like this:

```
curl -X PUT "https://api.cloudflare.com/client/v4/zones/{zone_id}/bot_management" \
  -H "Authorization: Bearer {token}" \
  -d '{"ai_bots_protection":"disabled","is_robots_txt_managed":false}'
```

After I made this change my robots.txt went from a wall of disallow rules to the three lines I actually wrote. And suddenly my content was accessible to every AI system out there.
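
Those three lines, for reference (the domain is a placeholder):

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```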

Server-side rendering matters too

One thing I caught during my audit was that my blog content wasn't appearing in the HTML source. It was only rendering after JavaScript loaded on the client side. For a human visitor that's fine because the page loads and the content appears. But for a crawler that doesn't execute JavaScript, which is most of them, the page looks empty.

I fixed this by making sure all content is server-side rendered. The animations still trigger on mount, but the actual text, the blog posts, the about page content, it's all in the initial HTML response. If you're using a framework like SvelteKit or Next.js, make sure your content isn't gated behind client-side rendering.
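
An easy way to verify this is to fetch the page the way a non-JavaScript crawler would and count occurrences of your content in the raw HTML. Zero means it's client-rendered (URL and search string are placeholders):

```
curl -s https://example.com/blog/invisible-to-ai | grep -c "some phrase from the post body"
```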

The difference this makes

Before fixing these settings my site was effectively invisible to AI. After fixing them every AI crawler can access every page, read the structured data, pull from the RSS feed and ingest the full content through llms.txt.

The content I'm writing now will be discoverable by AI systems. When someone asks about NixOS and AI agents, about using Claude Code for infrastructure, about GPL firmware requests, my posts have a chance of being referenced. Before the fix they had zero chance.

If you've got a website and you haven't checked what your robots.txt actually serves versus what you think it serves, do it now. You might be surprised.