Does Self-Promotional Content Get Cited by AI? A Data Study

April 12, 2026 · 10 min read

There's a persistent belief in content marketing that the best way to get AI to talk about your brand is to publish more brand-owned content: more product pages, more feature comparisons, more company blog posts that weave your product name into every third paragraph. The data says this strategy is almost entirely wrong. When BabyPenguin analyzed citation patterns across AI engines, the same finding surfaced repeatedly: content that reads like an advertisement is systematically deprioritized, and the brands that appear most often in AI-generated answers are the ones that taught AI something useful, not the ones that sold the hardest.

This isn't intuition. It's a measurable, reproducible pattern across multiple independent datasets. And understanding it is probably the highest-leverage insight in generative engine optimization right now, because most brands are still doing the opposite of what the data recommends.

What the Citation Data Actually Shows

Research by Omniscient Digital, analyzing more than 23,000 citations across major AI platforms, found that product pages earn only 12% of branded query citations, even when the query is specifically about that brand. Meanwhile, reviews, listicles, and forum discussions collectively account for 57% of citations on those same queries. Think about what that means: when someone asks ChatGPT or Gemini about your product, the answer is more than four times as likely to draw from a third-party review site or a Reddit thread as from your own product page.
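The "more than four times" figure is simply the ratio of the two citation shares quoted above. A minimal check, assuming both percentages are measured over the same set of branded queries (the variable names are illustrative, not taken from the study):

```python
# Citation shares on branded queries, as quoted from the Omniscient Digital figures.
product_page_share_pct = 12   # product pages
third_party_share_pct = 57    # reviews, listicles, and forum discussions combined

# How many times more often a citation comes from third-party content
# than from the brand's own product page.
ratio = third_party_share_pct / product_page_share_pct
print(f"Third-party content is cited {ratio:.2f}x as often as product pages")
```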

BabyPenguin's tracking data shows consistent patterns across categories. Owned product pages and feature pages appear as AI citations at a much lower rate than informational blog posts, data studies, knowledge base articles, and how-to guides, even when the owned pages are highly optimized for traditional SEO. The promotional signal seems to be recognized and discounted by AI retrieval systems almost immediately.

The Princeton GEO paper provides a theoretical grounding for why this happens. AI language models are trained on enormous corpora of web text, and those corpora contain a strong pattern: promotional content tends to be less accurate, less specific, and less useful than informational content. Models learn to weight sources by their informational density. A page that says "Our project management tool is the best in class with industry-leading features" contains almost zero informational signal. A page that explains how a specific project management approach reduces context-switching for distributed teams contains substantial informational signal. The latter gets cited; the former doesn't.

Why AI Engines Distrust Self-Promotional Pages

It's worth being precise about what "self-promotional" means in this context, because many content teams genuinely believe their content is informational when AI retrieval systems are classifying it as promotional. Several signals push a page toward the promotional category in AI evaluation:

  • First-person brand voice throughout: Content that says "we," "our product," and "our solution" repeatedly signals to AI that the source has a commercial stake in the claims it makes.
  • Feature lists without context: A page that lists product capabilities without explaining the problem each capability solves, or without providing any data on outcomes, reads as a sales sheet rather than a knowledge resource.
  • High product mention density: When your brand or product name appears in every paragraph, the informational-to-promotional ratio drops sharply. AI models pick up on this ratio.
  • Absence of external references: Informational content cites other sources, links to research, and acknowledges the broader landscape. Promotional content is self-contained and self-referential.
  • CTA-heavy structure: Pages that direct readers toward demos, trials, and purchases repeatedly throughout the text are structurally promotional, even if some of the surrounding content is informational.
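Several of the signals above can be roughly counted in a draft before it is published. The sketch below is an editorial self-check only: it is an assumption about how you might measure your own pages, not a description of how any AI engine classifies content. The function name, the regular expressions, and the CTA phrase list are all hypothetical, and the "feature lists without context" signal is left out because it needs human judgment.

```python
import re

# Hypothetical patterns; tune them to your own house style.
FIRST_PERSON = re.compile(r"\b(?:we|our|ours)\b", re.IGNORECASE)
CTA = re.compile(
    r"\b(?:book a demo|start (?:your )?free trial|buy now|sign up|get started)\b",
    re.IGNORECASE,
)
URL_DOMAIN = re.compile(r"https?://([^/\s]+)")


def promotional_signals(text, brand, own_domain):
    """Count rough promotional signals in a page's plain text."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    # Share of paragraphs that mention the brand at least once.
    brand_paragraphs = sum(1 for p in paragraphs if brand.lower() in p.lower())

    # Links whose domain is not the brand's own domain count as external references.
    external_links = [
        d for d in URL_DOMAIN.findall(text)
        if not d.lower().endswith(own_domain.lower())
    ]

    return {
        "first_person_per_100_words": 100 * len(FIRST_PERSON.findall(text)) / max(len(words), 1),
        "brand_paragraph_share": brand_paragraphs / max(len(paragraphs), 1),
        "cta_count": len(CTA.findall(text)),
        "external_link_count": len(external_links),
    }


# Made-up sample text, purely to show the input and output shape.
sample = (
    "Our tool is the best. Book a demo with Acme today.\n\n"
    "We built Acme for teams. Start your free trial.\n\n"
    "See https://example.org/study for context."
)
signals = promotional_signals(sample, brand="Acme", own_domain="acme.com")
print(signals)
```

High first-person and CTA counts alongside few external links would suggest a page sits toward the promotional end. Where you draw the line is a judgment call that the numbers alone cannot make.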

Previsible's study of 5,000 prompts found that AI models consistently prefer sources that demonstrate expertise without an obvious commercial agenda. When content quality was controlled for, third-party sources were cited at significantly higher rates than first-party sources making equivalent claims. The credibility discount applied to owned content is real and measurable.

The Third-Party Advantage: Why G2, Capterra, and Reddit Win

If you track where your brand appears in AI-generated answers using a tool like BabyPenguin, you'll almost certainly find that mentions coming from third-party review sites, comparison pages, and forum discussions vastly outnumber mentions coming from your own domain. This isn't a failure of your SEO. It's how AI citation logic works.

G2, Capterra, and TrustRadius have three properties that AI engines value enormously. First, they aggregate opinions from many different users, which makes individual promotional bias statistically unlikely. Second, they provide structured comparative data (feature tables, rating breakdowns, user segments) that is highly informational and easy for AI to extract. Third, they have established trust signals across millions of training data points. When AI models learned from the web, they saw these platforms cited repeatedly as reliable comparative sources. That pattern is now baked in.

Reddit and Quora provide a different kind of trust signal: authentic peer discourse. A thread where someone asks "which CRM is actually worth it for a 10-person team" and receives 47 replies from real users contains enormous informational value for an AI trying to synthesize a recommendation. The lack of a commercial agenda is precisely what makes it valuable. AI systems have learned that forum discussions, whatever their other limitations, are rarely written to sell something.

This is why your review site presence isn't a nice-to-have in a GEO strategy; it's a primary citation driver. See how to audit how AI models talk about your brand for a framework for assessing where your current mentions are coming from and which third-party sources are missing your brand entirely.

What Brand-Owned Content Does Get Cited

This is where the data gets more nuanced, and where the practical opportunity lies. Not all brand-owned content is penalized equally. BabyPenguin's citation data reveals a clear hierarchy within owned content:

  1. Original data studies and research reports are the most-cited category of brand-owned content by a significant margin. When a company publishes a genuine analysis of their dataset, even if the dataset comes from their product, the informational value is high enough to overcome the promotional discount. The key word is "genuine": cherry-picked data assembled to support a predetermined conclusion is recognized as such and performs poorly.
  2. Comprehensive how-to guides and technical documentation perform well because their informational density is very high. A 3,000-word guide to implementing a specific workflow has almost no promotional signal and enormous informational signal. AI engines cite this type of content frequently.
  3. Knowledge base and help center articles are among the most-cited brand-owned pages, despite being largely invisible in traditional marketing metrics. A help article explaining how to set up a specific integration or troubleshoot a specific error has a purely informational structure. It contains no promotional language. It answers a precise question. These characteristics make it highly citable.
  4. Informational blog posts that happen to be published by a brand can perform well if they genuinely educate rather than sell. The differentiator is whether the post would be useful to someone who has no intention of buying the product. If the answer is yes, the post has a reasonable chance of being cited. If the answer is no (that is, if the "educational" content only makes sense in the context of purchasing or using the product), it will be treated as promotional.

Product pages, feature pages, pricing pages, and landing pages consistently appear at the bottom of brand-owned content citation rates. This doesn't mean these pages are unimportant; they serve critical conversion functions. But they're not GEO assets. They're sales assets. Treating them as the same thing is a strategic error.

The Promotional Spectrum: How to Audit Your Own Content

One practical exercise that surfaces this issue quickly: take your top 20 published pages and score each one on a simple promotional spectrum from 1 (purely educational, no commercial agenda visible) to 5 (explicitly promotional, commercial intent in every section). Then cross-reference this score against BabyPenguin citation data for each page, or against manual prompt testing across ChatGPT, Gemini, and Grok.

The correlation is almost always stark. Pages scoring 4-5 on the promotional spectrum appear in AI answers rarely or never. Pages scoring 1-2 appear regularly. Pages in the 3 range (the mixed content that tries to educate and sell simultaneously) perform inconsistently depending on which sections AI engines pull from.
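The cross-referencing step can be done in a spreadsheet, but a short script keeps it repeatable. The sketch below assumes you have already assigned each page a 1-5 promotional score and collected a citation count for it, whether from a tracking tool or from manual prompt testing. The field names, the band labels, and every number in the example are placeholders chosen to show the input shape, not measured results.

```python
from collections import defaultdict
from statistics import mean


def band(score):
    """Map a 1-5 promotional score onto the three bands discussed above."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if score <= 2:
        return "educational (1-2)"
    if score == 3:
        return "mixed (3)"
    return "promotional (4-5)"


def citations_by_band(pages):
    """Average citation count per promotional band."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[band(page["score"])].append(page["citations"])
    return {name: mean(values) for name, values in grouped.items()}


# Placeholder audit rows: replace with your own scores and citation counts.
audit = [
    {"url": "/guides/example-how-to", "score": 1, "citations": 4},
    {"url": "/blog/example-explainer", "score": 2, "citations": 2},
    {"url": "/blog/example-mixed-post", "score": 3, "citations": 1},
    {"url": "/product/example-features", "score": 5, "citations": 0},
]

summary = citations_by_band(audit)
for name, average in summary.items():
    print(f"{name}: {average} average citations per page")
```

With only 20 pages, the band averages are a directional signal rather than a statistical result, so treat large gaps as prompts for a closer look at individual pages.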

This audit has a second value beyond citation analysis: it helps you identify which existing pages could be refactored to improve their informational density. A blog post that currently scores 4 because it mentions the product in every section might score 2 if the promotional references were concentrated in a single closing section and the body of the post was purely educational. That structural change can dramatically improve citability without requiring new content creation.

The Practical Reframe: Educate to Get Cited, Sell Elsewhere

The strategic insight from this data is clean: your most citable content is your most educational content. Your knowledge base articles, data studies, methodology guides, and technical how-tos are your GEO assets. Your product pages and landing pages are your conversion assets. Both categories matter, but they serve different functions and should be evaluated on different metrics.

This reframe has real implications for content investment. Many brands spend the majority of their content budget on bottom-funnel assets (product comparisons, use case pages, ROI calculators) that have low citability. Shifting investment toward genuinely informational content may look counterintuitive against a traditional conversion-focused content strategy, but it's exactly what the citation data recommends.

It also has implications for how you write content that is nominally informational but currently over-promotional. Several tractable editorial changes move content down the promotional spectrum and up the citability ranking: auditing for product mention density, removing CTAs from the body of educational posts, adding external citations and data references, and separating the "here is what we know" sections from the "here is why you should buy from us" sections.

For more on what content structures perform best in AI retrieval, see how to structure content that AI models quote and answer-first writing that LLMs love.

What This Means for Your GEO Strategy

The data study finding, that self-promotional content is systematically undercited by AI engines, points toward a counterintuitive but compelling strategic conclusion. The brands that will win in AI search are the ones that build genuine knowledge resources, not the ones that produce the most polished promotional content. This is a meaningful shift from the traditional content marketing paradigm, where brand-owned channels are the primary vehicle for brand messaging.

In AI search, your most powerful brand-building assets are often the pages that make no mention of your product at all (the data studies, the explainer guides, the methodology breakdowns), because these are the pages that AI engines trust enough to quote. When AI quotes your educational content and readers discover it came from your brand, the trust transfer is enormous. You didn't tell them you were an expert. An independent AI synthesis told them. That's a fundamentally different and more credible form of brand building.

The brands already tracking which of their content types get cited, and which don't, have a significant advantage in optimizing this mix. BabyPenguin tracks brand citations and source appearances across ChatGPT, Gemini, and Grok at the page level, so you can see exactly which of your owned pages are earning AI citations and which are invisible, and build your content strategy around what the data actually shows.