A developer-focused guide to the technical foundations of Generative Engine Optimization. Schema markup, crawlability, rendering, structured data, and the infrastructure that powers AI citability.
Most GEO advice is written for marketers. It covers messaging, positioning, and content strategy. That's important, but it skips the technical layer that determines whether AI systems can even access and understand your content in the first place. If your site blocks AI crawlers, renders content only via client-side JavaScript, or lacks structured data, no amount of brilliant content strategy will get you cited.
This guide is for developers and technical teams. It covers the infrastructure, markup, rendering, and technical decisions that directly affect whether AI search engines can find, access, parse, and cite your content. Think of it as the engineering foundation that everything else is built on.
To optimize for AI search, you first need to understand how these systems actually retrieve content. There are three primary access patterns.
AI companies operate their own crawlers. OpenAI uses GPTBot, Anthropic uses ClaudeBot, Google uses Google-Extended for AI training and Googlebot for AI Overviews, and Perplexity uses PerplexityBot. These crawlers fetch your pages, extract content, and use it for training data, real-time retrieval, or both.
Each crawler has different capabilities. Some execute JavaScript, others don't. Some respect robots.txt directives, others may not check specific crawl-delay rules. Understanding which crawlers matter for your audience determines where you focus technical optimization.
ChatGPT retrieves information through Bing's search index. Gemini uses Google's index. This means your traditional search ranking directly affects your AI visibility. If your pages don't rank in Bing, ChatGPT's browsing mode won't find them. If they don't rank in Google, Gemini and AI Overviews won't surface them. Traditional SEO technical foundations (crawlability, indexation, page speed) remain critical because they feed the retrieval layer of AI systems.
AI systems also consume structured data, particularly JSON-LD schema markup, knowledge graphs, and API endpoints. This is the most underutilized access pattern. While most AI systems can extract information from unstructured HTML, structured data gives them machine-readable facts with explicit relationships and types. It's the difference between an AI inferring that something is a product price versus knowing it definitively.
Your robots.txt file is the first gate AI crawlers encounter. Getting it wrong means complete invisibility to that platform.
For maximum AI visibility, allow all major AI crawlers:
# Allow AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Amazonbot
Allow: /
# Block specific paths you don't want crawled
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /internal/
Common mistakes to check:
- Disallow: / under User-agent: * blocks all AI crawlers unless they have their own, more specific user-agent groups (like the ones above).
- Blocking GPTBot means ChatGPT's browsing mode can't access your content for real-time retrieval.
- Blocking Google-Extended prevents Google from using your content for AI features but does not affect regular search indexing (that's controlled by Googlebot).

If you want to allow retrieval but block training, some platforms respect this distinction. Google-Extended specifically controls AI training usage, while Googlebot controls search indexing and AI Overviews. Most other AI crawlers don't offer this granularity: blocking GPTBot blocks both training and retrieval for ChatGPT. For most platforms, the block-or-allow decision is binary.
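These mistakes are easy to catch automatically. As a sketch, Python's standard-library urllib.robotparser can evaluate a robots.txt against any user agent; the file content below is a hypothetical example, not your actual configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot gets its own group allowing everything,
# while the default group blocks /admin/ for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

def crawler_can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this file, crawler_can_fetch(ROBOTS_TXT, "GPTBot", "/blog/post") returns True, while a generic bot is still blocked from /admin/ paths by the default group.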
How your application renders content is one of the most impactful technical decisions for AI visibility. The core issue: many AI crawlers do not execute JavaScript, or execute it with limited capability.
If your content is rendered entirely via client-side JavaScript (typical of single-page applications built with React, Vue, or Angular without SSR), AI crawlers may see an empty or near-empty page. Googlebot executes JavaScript reasonably well, but GPTBot, ClaudeBot, and PerplexityBot have varying and often limited JavaScript execution capabilities. Relying on CSR for important content is the single biggest technical risk for AI visibility.
SSR delivers fully rendered HTML in the initial response. Every crawler, regardless of JavaScript capability, receives complete content. If you're using Next.js, Nuxt, SvelteKit, or similar frameworks, ensure your content pages use SSR or SSG rather than client-only rendering.
For Next.js specifically:
- Use getServerSideProps or Server Components (App Router) for dynamic content.
- Use getStaticProps / generateStaticParams for content that can be pre-rendered at build time.

SSG pre-renders pages at build time, producing static HTML files. This is the most reliable option for AI crawlability: the content exists as plain HTML before any crawler arrives. For content-heavy sites (blogs, guides, documentation), SSG is ideal. The tradeoff is that content updates require a rebuild, but incremental static regeneration (ISR) in frameworks like Next.js mitigates this.
To verify what AI crawlers see:
- Run curl -A "GPTBot" https://yoursite.com/page to see the raw HTML response.
- Test with tools like fetch or wget that don't execute JavaScript.

Schema markup gives AI systems structured, machine-readable information about your content. While AI can extract information from unstructured HTML, schema markup removes ambiguity and makes extraction reliable. It's the difference between a crawler inferring context and knowing it explicitly.
Organization schema establishes your brand entity. It tells AI systems who you are, where you're located, and how to find you across platforms.
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Your Brand",
"url": "https://yourbrand.com",
"logo": "https://yourbrand.com/logo.png",
"sameAs": [
"https://linkedin.com/company/yourbrand",
"https://twitter.com/yourbrand",
"https://github.com/yourbrand"
],
"description": "One-sentence description of what your company does.",
"foundingDate": "2023"
}
The sameAs property is particularly important for AI systems because it links your entity across platforms, helping AI build a coherent understanding of your brand.
Article schema for blog posts and content pages tells AI what the content is about, who wrote it, and when it was published and updated.
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Your Article Title",
"author": {
"@type": "Person",
"name": "Author Name",
"url": "https://yourbrand.com/team/author"
},
"datePublished": "2026-03-15",
"dateModified": "2026-04-10",
"publisher": {
"@type": "Organization",
"name": "Your Brand"
}
}
The dateModified field is important for AI retrieval systems that factor in content freshness.
FAQ schema provides explicit question-answer pairs that AI can extract directly.
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is your product?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Clear, concise answer here."
}
}
]
}
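If your FAQ content lives in one place, the markup can be generated rather than hand-written, which keeps the visible page and the JSON-LD in sync. A minimal sketch; the faq_jsonld helper is our own, not a library function:

```python
import json

def faq_jsonld(pairs) -> str:
    """Build FAQPage JSON-LD from a list of (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }, indent=2)
```

Embed the returned string in a script tag of type application/ld+json at render time.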
Product schema for product and pricing pages gives AI structured information about what you sell, how much it costs, and how it's rated.
HowTo schema for tutorial and guide content provides step-by-step instructions in a structured format AI can follow and cite.
- Embed schema as a <script type="application/ld+json"> tag in the page head or body.
- Use @id references to link related entities.
- Validate with Google's Rich Results Test (https://search.google.com/test/rich-results) and Schema.org's validator. Invalid schema is worse than no schema because it sends confusing signals.

AI crawlers, like search engine crawlers, have limited time and resources for each site. Page performance directly affects how much of your content gets crawled and indexed.
While AI crawlers don't measure Core Web Vitals the way Google does for ranking, performance still matters:
- Load non-critical scripts with async or defer.
- Maintain an XML sitemap with accurate lastmod dates. AI crawlers and the search engines they rely on use sitemaps to discover and prioritize content. Include only canonical, indexable URLs.
- Use rel="canonical" to indicate the authoritative version of each page.

An emerging practice is providing content in machine-readable formats beyond standard web pages.
The llms.txt convention (similar to robots.txt) is an emerging standard for providing AI systems with a structured overview of your site's content. While not yet universally adopted by AI platforms, it's a forward-looking practice. The file typically goes at your domain root and contains a markdown-formatted description of your site's key content areas, products, and documentation.
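What such a file might look like, following the proposed llms.txt format of an H1 title, a blockquote summary, and H2 sections of annotated links (the names and URLs below are illustrative placeholders):

```markdown
# Your Brand

> One-sentence description of what your company does.

## Docs

- [Getting started](https://yourbrand.com/docs/getting-started): Installation and first steps
- [API reference](https://yourbrand.com/docs/api): Endpoints and authentication

## Product

- [Pricing](https://yourbrand.com/pricing): Plans and billing details
```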
Some AI systems can consume API endpoints directly. Providing a clean, well-documented public API for your product information, documentation, or content catalog makes it easier for AI systems to access accurate, up-to-date information. This is particularly relevant for SaaS products where pricing, features, and documentation change frequently.
Several common security configurations inadvertently block AI crawlers. Aggressive bot protection at the CDN or WAF level, blanket rate limiting, and geo-blocking rules can all return 403 or 429 responses to AI user agents even when your robots.txt allows them. If you intend to permit an AI crawler, confirm that your CDN, WAF, and rate-limiting rules don't silently block it.
Technical GEO requires ongoing monitoring. Things break: deployments change rendering behavior, security rules get updated, new crawlers emerge.
Monitor your server access logs for AI crawler activity. Look for which AI user agents are visiting, which pages they request, how often they return, and whether they receive errors (4xx/5xx responses) instead of content.
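A first pass at this signal is simply counting requests per AI user agent in your access logs. The sketch below uses a naive substring match against combined-format log lines (the helper and the sample lines are our own, and production logs will need more careful parsing):

```python
from collections import Counter

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Amazonbot")

def ai_crawler_hits(log_lines) -> Counter:
    """Count requests per AI crawler by matching the user-agent field."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:  # naive: assumes bot names only appear in the UA field
                hits[bot] += 1
                break
    return hits
```

Run it over a day's logs and compare over time: a crawler that was visiting and suddenly stops is an early warning that something in your stack started blocking it.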
Build AI crawlability checks into your CI/CD pipeline:
- Verify robots.txt hasn't been modified to block AI crawlers.
- Confirm key content pages return complete HTML without JavaScript execution.

Use BabyPenguin to track whether your technical improvements translate into actual AI visibility. Technical accessibility is a prerequisite, not a guarantee. Monitoring citations across ChatGPT, Gemini, Perplexity, and other platforms tells you whether the full pipeline, from crawling to citation, is working.
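One concrete CI gate, as a sketch: parse the deployed robots.txt with Python's standard-library robotparser and fail the build if any AI crawler you care about has lost access. Fetching the file over HTTP is left out here, and the crawler list mirrors the user agents discussed earlier:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot")

def blocked_ai_crawlers(robots_txt: str, path: str = "/") -> list:
    """Return the AI crawlers that may NOT fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, path)]
```

In CI, fetch https://yoursite.com/robots.txt, pass its text to blocked_ai_crawlers, and fail the pipeline if the returned list is non-empty.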
Run this checklist quarterly or after any significant technical change:

- robots.txt allows the AI crawlers you want (GPTBot, ClaudeBot, Google-Extended, PerplexityBot).
- Key pages return complete HTML without JavaScript execution (test with curl and an AI user agent).
- Schema markup (Organization, Article, FAQ, and Product where relevant) is present and validates.
- The XML sitemap is current, with accurate lastmod dates and only canonical, indexable URLs.
- Server logs show AI crawler activity, and those requests return 200s rather than 403s or 429s.