A developer-focused guide to the technical foundations of Generative Engine Optimization. Schema markup, crawlability, rendering, structured data, and the infrastructure that powers AI citability.
Most GEO advice is written for marketers. It covers messaging, positioning, and content strategy. That's important, but it skips the technical layer that determines whether AI systems can even access and understand your content in the first place. If your site blocks AI crawlers, renders content only via client-side JavaScript, or lacks structured data, no amount of brilliant content strategy will get you cited.
This guide is for developers and technical teams. It covers the infrastructure, markup, rendering, and technical decisions that directly affect whether AI search engines can find, access, parse, and cite your content. Think of it as the engineering foundation that everything else is built on.
To optimize for AI search, you first need to understand how these systems actually retrieve content. There are three primary access patterns.
AI companies operate their own crawlers. OpenAI uses GPTBot, Anthropic uses ClaudeBot, Google uses Google-Extended for AI training and Googlebot for AI Overviews, and Perplexity uses PerplexityBot. These crawlers fetch your pages, extract content, and use it for training data, real-time retrieval, or both.
Each crawler has different capabilities. Some execute JavaScript, others don't. Some respect robots.txt directives, others may not check specific crawl-delay rules. Understanding which crawlers matter for your audience determines where you focus technical optimization.
ChatGPT retrieves information through Bing's search index. Gemini uses Google's index. This means your traditional search ranking directly affects your AI visibility. If your pages don't rank in Bing, ChatGPT's browsing mode won't find them. If they don't rank in Google, Gemini and AI Overviews won't surface them. Traditional SEO technical foundations (crawlability, indexation, page speed) remain critical because they feed the retrieval layer of AI systems.
AI systems also consume structured data, particularly JSON-LD schema markup, knowledge graphs, and API endpoints. This is the most underutilized access pattern. While most AI systems can extract information from unstructured HTML, structured data gives them machine-readable facts with explicit relationships and types. It's the difference between an AI inferring that something is a product price versus knowing it definitively.
Your robots.txt file is the first gate AI crawlers encounter. Getting it wrong means complete invisibility to that platform.
For maximum AI visibility, allow all major AI crawlers:
# Allow AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Amazonbot
Allow: /
# Block specific paths you don't want crawled
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /internal/
Common mistakes to check:
- Disallow: / under User-agent: * blocks all AI crawlers unless they have their own, more specific user-agent groups (like the ones above).
- Blocking GPTBot means ChatGPT's browsing mode can't access your content for real-time retrieval.
- Blocking Google-Extended prevents Google from using your content for AI features but does not affect regular search indexing (that's controlled by Googlebot).

If you want to allow retrieval but block training, some platforms respect this distinction. Google-Extended specifically controls AI training usage, while Googlebot controls search indexing and AI Overviews. Most other AI crawlers don't offer this granularity: blocking GPTBot blocks both training and retrieval for ChatGPT. For most platforms, the block-or-allow decision is binary.
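These mistakes are easy to catch automatically. As a sketch, Python's standard-library urllib.robotparser can evaluate a robots.txt against any user agent; the file content below is a hypothetical example, not your actual configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot gets its own group allowing everything,
# while the default group blocks /admin/ for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

def crawler_can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this file, crawler_can_fetch(ROBOTS_TXT, "GPTBot", "/blog/post") returns True, while a generic bot is still blocked from /admin/ paths by the default group.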
How your application renders content is one of the most impactful technical decisions for AI visibility. The core issue: many AI crawlers do not execute JavaScript, or execute it with limited capability.
If your content is rendered entirely via client-side JavaScript (typical of single-page applications built with React, Vue, or Angular without SSR), AI crawlers may see an empty or near-empty page. Googlebot executes JavaScript reasonably well, but GPTBot, ClaudeBot, and PerplexityBot have varying and often limited JavaScript execution capabilities. Relying on CSR for important content is the single biggest technical risk for AI visibility.
SSR delivers fully rendered HTML in the initial response. Every crawler, regardless of JavaScript capability, receives complete content. If you're using Next.js, Nuxt, SvelteKit, or similar frameworks, ensure your content pages use SSR or SSG rather than client-only rendering.
For Next.js specifically:
- Use getServerSideProps or Server Components (App Router) for dynamic content.
- Use getStaticProps / generateStaticParams for content that can be pre-rendered at build time.

SSG pre-renders pages at build time, producing static HTML files. This is the most reliable option for AI crawlability: the content exists as plain HTML before any crawler arrives. For content-heavy sites (blogs, guides, documentation), SSG is ideal. The tradeoff is that content updates require a rebuild, but incremental static regeneration (ISR) in frameworks like Next.js mitigates this.
To verify what AI crawlers see:
- Run curl -A "GPTBot" https://yoursite.com/page to see the raw HTML response.
- Test with tools like fetch or wget that don't execute JavaScript.

Schema markup gives AI systems structured, machine-readable information about your content. While AI can extract information from unstructured HTML, schema markup removes ambiguity and makes extraction reliable. It's the difference between a crawler inferring context and knowing it explicitly.
Organization schema establishes your brand entity. It tells AI systems who you are, where you're located, and how to find you across platforms.
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Your Brand",
"url": "https://yourbrand.com",
"logo": "https://yourbrand.com/logo.png",
"sameAs": [
"https://linkedin.com/company/yourbrand",
"https://twitter.com/yourbrand",
"https://github.com/yourbrand"
],
"description": "One-sentence description of what your company does.",
"foundingDate": "2023"
}
The sameAs property is particularly important for AI systems because it links your entity across platforms, helping AI build a coherent understanding of your brand.
Article schema for blog posts and content pages tells AI what the content is about, who wrote it, and when it was published and updated.
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Your Article Title",
"author": {
"@type": "Person",
"name": "Author Name",
"url": "https://yourbrand.com/team/author"
},
"datePublished": "2026-03-15",
"dateModified": "2026-04-10",
"publisher": {
"@type": "Organization",
"name": "Your Brand"
}
}
The dateModified field is important for AI retrieval systems that factor in content freshness.
FAQ schema provides explicit question-answer pairs that AI can extract directly.
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is your product?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Clear, concise answer here."
}
}
]
}
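If your FAQ content lives in one place, the markup can be generated rather than hand-written, which keeps the visible page and the JSON-LD in sync. A minimal sketch; the faq_jsonld helper is our own, not a library function:

```python
import json

def faq_jsonld(pairs) -> str:
    """Build FAQPage JSON-LD from a list of (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }, indent=2)
```

Embed the returned string in a script tag of type application/ld+json at render time.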
Product schema for product and pricing pages gives AI structured information about what you sell, how much it costs, and how it's rated.
HowTo schema for tutorial and guide content provides step-by-step instructions in a structured format AI can follow and cite.
- Embed schema as a <script type="application/ld+json"> tag in the page head or body.
- Use @id references to link related entities.
- Validate with Google's Rich Results Test (https://search.google.com/test/rich-results) and Schema.org's validator. Invalid schema is worse than no schema because it sends confusing signals.

AI crawlers, like search engine crawlers, have limited time and resources for each site. Page performance directly affects how much of your content gets crawled and indexed.
While AI crawlers don't measure Core Web Vitals the way Google does for ranking, performance still matters:
- Load non-critical scripts with async or defer.
- Maintain an XML sitemap with accurate lastmod dates. AI crawlers and the search engines they rely on use sitemaps to discover and prioritize content. Include only canonical, indexable URLs.
- Use rel="canonical" to indicate the authoritative version of each page.

An emerging practice is providing content in machine-readable formats beyond standard web pages.
The llms.txt convention (similar to robots.txt) is an emerging standard for providing AI systems with a structured overview of your site's content. While not yet universally adopted by AI platforms, it's a forward-looking practice. The file typically goes at your domain root and contains a markdown-formatted description of your site's key content areas, products, and documentation.
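What such a file might look like, following the proposed llms.txt format of an H1 title, a blockquote summary, and H2 sections of annotated links (the names and URLs below are illustrative placeholders):

```markdown
# Your Brand

> One-sentence description of what your company does.

## Docs

- [Getting started](https://yourbrand.com/docs/getting-started): Installation and first steps
- [API reference](https://yourbrand.com/docs/api): Endpoints and authentication

## Product

- [Pricing](https://yourbrand.com/pricing): Plans and billing details
```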
Some AI systems can consume API endpoints directly. Providing a clean, well-documented public API for your product information, documentation, or content catalog makes it easier for AI systems to access accurate, up-to-date information. This is particularly relevant for SaaS products where pricing, features, and documentation change frequently.
Several common security configurations inadvertently block AI crawlers. Aggressive bot protection at the CDN or WAF level, blanket rate limiting, and geo-blocking rules can all return 403 or 429 responses to AI user agents even when your robots.txt allows them. If you intend to permit an AI crawler, confirm that your CDN, WAF, and rate-limiting rules don't silently block it.
Technical GEO requires ongoing monitoring. Things break: deployments change rendering behavior, security rules get updated, new crawlers emerge.
Monitor your server access logs for AI crawler activity. Look for which AI user agents are visiting, which pages they request, how often they return, and whether they receive errors (4xx/5xx responses) instead of content.
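A first pass at this signal is simply counting requests per AI user agent in your access logs. The sketch below uses a naive substring match against combined-format log lines (the helper and the sample lines are our own, and production logs will need more careful parsing):

```python
from collections import Counter

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Amazonbot")

def ai_crawler_hits(log_lines) -> Counter:
    """Count requests per AI crawler by matching the user-agent field."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:  # naive: assumes bot names only appear in the UA field
                hits[bot] += 1
                break
    return hits
```

Run it over a day's logs and compare over time: a crawler that was visiting and suddenly stops is an early warning that something in your stack started blocking it.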
Build AI crawlability checks into your CI/CD pipeline:
- Verify robots.txt hasn't been modified to block AI crawlers.
- Confirm key content pages return complete HTML without JavaScript execution.

Use BabyPenguin to track whether your technical improvements translate into actual AI visibility. Technical accessibility is a prerequisite, not a guarantee. Monitoring citations across ChatGPT, Gemini, Perplexity, and other platforms tells you whether the full pipeline, from crawling to citation, is working.
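One concrete CI gate, as a sketch: parse the deployed robots.txt with Python's standard-library robotparser and fail the build if any AI crawler you care about has lost access. Fetching the file over HTTP is left out here, and the crawler list mirrors the user agents discussed earlier:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot")

def blocked_ai_crawlers(robots_txt: str, path: str = "/") -> list:
    """Return the AI crawlers that may NOT fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, path)]
```

In CI, fetch https://yoursite.com/robots.txt, pass its text to blocked_ai_crawlers, and fail the pipeline if the returned list is non-empty.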
Run this checklist quarterly or after any significant technical change:

- robots.txt allows the AI crawlers you want (GPTBot, ClaudeBot, Google-Extended, PerplexityBot).
- Key pages return complete HTML without JavaScript execution (test with curl and an AI user agent).
- Schema markup (Organization, Article, FAQ, and Product where relevant) is present and validates.
- The XML sitemap is current, with accurate lastmod dates and only canonical, indexable URLs.
- Server logs show AI crawler activity, and those requests return 200s rather than 403s or 429s.