The three pipeline families every AI engine mixes
Licensed feeds, direct web crawl, third-party aggregators. Every engine uses all three — the mix determines where you need to focus.
There is no single source of truth inside an AI engine's retrieval stack. Every engine blends three families of data, and the relative weights of those families are what make optimising for one engine different from optimising for another. Understanding the three families before you look at specific engines is the fastest way to build an accurate mental model.
- Licensed feeds
- Structured catalog data delivered through a formal partnership. Examples: the Agentic Commerce Protocol feed (populated from your Shopify product data), Google Merchant Center, Bing Shopping feed, Amazon's catalog API. Tightly controlled, high fidelity.
- Direct web crawl
- Classic web crawl of your public pages. Reads HTML, schema.org markup, llms.txt, ai-sitemap.xml. Each engine identifies itself with its own user-agent or control token (GPTBot, ClaudeBot, PerplexityBot, Bingbot, YouBot, xAI's crawler; Google-Extended is a robots.txt control token rather than a separate crawler).
- Third-party aggregators
- Content about your products published elsewhere. Reddit threads, Trustpilot reviews, Wikipedia entries, independent review sites, YouTube reviews, X posts. Used for corroboration and sentiment.
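Because the direct-crawl family depends on each crawler being allowed in, a quick robots.txt check is a useful first step. The sketch below uses Python's standard `urllib.robotparser`; the sample robots.txt and the example.com URL are hypothetical, and the token list is taken from the crawlers named above (Google-Extended is a control token, so a block there governs use of content rather than a separate fetcher).

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the crawler list above; extend as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bingbot", "YouBot"]


def crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per crawler token, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt: blocks one AI crawler, keeps /cart private for all.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart
"""

if __name__ == "__main__":
    report = crawler_access(SAMPLE_ROBOTS, "https://example.com/products/widget")
    for agent, allowed in report.items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live store, fetch its robots.txt and pass the text to `crawler_access`, once per representative URL (a product page, a buyer guide, the home page).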
9 distinct product-data sources we track across the six mainstream AI engines
Surfient retrieval research panel, April 2026 — derived from 2,400 tracked citations across ChatGPT, Gemini, Perplexity, Claude, Copilot, You.com.
Engine-by-engine: where each one actually pulls from
The six mainstream engines have meaningfully different source mixes. Optimising only for ChatGPT's mix leaves Gemini and Claude partially unaddressed.
Every engine publicly discloses some fraction of its retrieval approach, but none discloses the full mix. The engine-by-engine picture below is assembled from public documentation, OpenAI / Google / Anthropic engineering talks, and empirical attribution research across thousands of tracked citations.
ChatGPT (OpenAI)
- Primary source
- Agentic Commerce Protocol feed for transactional prompts.
- Secondary source
- Bing editorial index for informational and comparison prompts.
- Tertiary source
- Direct GPTBot crawl for long-form content and buyer guides.
- Third-party weight
- Medium — Reddit and Trustpilot referenced but not dominant.
Gemini and Google AI Overviews
- Primary source
- Google Merchant Center product feed.
- Secondary source
- Google web index — your organic rankings carry over into AI Overviews.
- Tertiary source
- Shopping tab results and Google Shopping graph entities.
- Third-party weight
- High — Google's Knowledge Graph pulls from Wikipedia, Wikidata, and licensed data partners extensively.
Perplexity
- Primary source
- Direct PerplexityBot crawl of your site — HTML, schema, llms.txt.
- Secondary source
- Perplexity's own web index, built from crawl + curated editorial sources.
- Tertiary source
- Shopify Catalog API integration for enrolled merchants.
- Third-party weight
- Very high — Reddit, Trustpilot, independent reviews, and YouTube are weighted heavily.
Claude (Anthropic)
- Primary source
- Direct ClaudeBot crawl of your public pages.
- Secondary source
- Anthropic's web-retrieval layer when the assistant is allowed web tools.
- Tertiary source
- No dedicated commerce feed integration as of April 2026.
- Third-party weight
- Medium — Claude reads Reddit and Trustpilot when encountered but does not weight them as aggressively as Perplexity.
Copilot (Microsoft)
- Primary source
- Bing Shopping feed and Bing editorial index.
- Secondary source
- Bingbot direct crawl of your storefront.
- Tertiary source
- Microsoft Shopping Graph for categorised product data.
- Third-party weight
- Medium-low: Reddit is read, but creator content is referenced less often than on ChatGPT or Perplexity.
You.com
- Primary source
- Direct YouBot crawl with heavy weighting on passage-level extraction.
- Secondary source
- Curated editorial sources and fresh-news index.
- Tertiary source
- No dedicated commerce feed; relies on schema and HTML content.
- Third-party weight
- High — cross-source corroboration is central to the citation ranker.
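The engine profiles above can be kept as a small lookup table so that prioritisation questions become one-line queries. This is a minimal sketch: the primary sources and third-party weights are transcribed from this section, while the `EngineMix` type, the ordinal ordering of the weight labels, and the helper names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ordinal scale for the "third-party weight" labels, lowest first.
WEIGHT_ORDER = ["Medium-low", "Medium", "High", "Very high"]


@dataclass(frozen=True)
class EngineMix:
    engine: str
    primary: str
    third_party_weight: str


# Values transcribed from the engine-by-engine lists in this section.
MIXES = [
    EngineMix("ChatGPT", "Agentic Commerce Protocol feed", "Medium"),
    EngineMix("Gemini", "Google Merchant Center feed", "High"),
    EngineMix("Perplexity", "PerplexityBot crawl", "Very high"),
    EngineMix("Claude", "ClaudeBot crawl", "Medium"),
    EngineMix("Copilot", "Bing Shopping feed and Bing editorial index", "Medium-low"),
    EngineMix("You.com", "YouBot crawl", "High"),
]


def engines_with_weight_at_least(threshold: str) -> list[str]:
    """Engines whose third-party weight is at or above the given label."""
    floor = WEIGHT_ORDER.index(threshold)
    return [m.engine for m in MIXES
            if WEIGHT_ORDER.index(m.third_party_weight) >= floor]


def feed_first_engines() -> list[str]:
    """Engines whose primary source is a licensed feed rather than a crawl."""
    return [m.engine for m in MIXES if "feed" in m.primary.lower()]


if __name__ == "__main__":
    print("Corroboration matters most on:", engines_with_weight_at_least("High"))
    print("Feed quality matters most on:", feed_first_engines())
```

Read this way, feed hygiene is the first lever for ChatGPT, Gemini, and Copilot, while third-party corroboration carries the most weight on Gemini, Perplexity, and You.com.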
What merchants control directly versus indirectly
Six of the nine sources are directly controllable. Three can only be influenced. Knowing the difference changes how you invest time.
Merchant influence is not evenly distributed across the nine sources. Some you edit from Shopify Admin in minutes; others you shape over months through community and PR work. Ranking your investments against this taxonomy is the single best way to avoid wasting a quarter on the wrong levers.
Direct control (ship this week)
- Agentic Commerce Protocol feed — driven by your Shopify product data. Auto-enrolled but only as good as your titles, GTINs, availability, and images.
- Google Merchant Center feed — direct upload or auto-sync via the Google & YouTube app. Most Gemini and Google AI Overviews citations depend on this feed being clean.
- Bing Shopping feed: upload in Microsoft Merchant Center or via Shopify's Bing app. A primary source for Copilot; ChatGPT's informational pathway draws on the Bing editorial index rather than on this feed directly.
- On-site content — product descriptions, FAQ sections, blog posts. You write these, the crawlers read them.
- Schema.org markup — Product, FAQPage, AggregateRating, BreadcrumbList. Pure technical work — ship once, benefit everywhere.
- llms.txt and ai-sitemap.xml — curated signal files every engine reads to some extent.
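Since the feeds are only as good as the titles, GTINs, availability, and images behind them, a pre-upload check on each product record is cheap insurance. The sketch below is illustrative: the field names (`title`, `gtin`, `availability`, `image_url`) are hypothetical and should be mapped to whatever your export uses. The check-digit rule is the standard GTIN-13 one, with weights alternating 1 and 3 across the first twelve digits.

```python
# Hypothetical field names; map these to the columns in your own export.
REQUIRED_FIELDS = ("title", "gtin", "availability", "image_url")


def gtin13_is_valid(gtin: str) -> bool:
    """Validate a GTIN-13 by recomputing its check digit."""
    if len(gtin) != 13 or not (gtin.isascii() and gtin.isdigit()):
        return False
    digits = [int(c) for c in gtin]
    # Weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def feed_issues(product: dict) -> list[str]:
    """List the problems that would weaken this record in a product feed."""
    issues = [f"missing {field}" for field in REQUIRED_FIELDS
              if not product.get(field)]
    gtin = product.get("gtin")
    if gtin and not gtin13_is_valid(str(gtin)):
        issues.append("gtin fails GTIN-13 check digit")
    return issues


if __name__ == "__main__":
    sample = {
        "title": "Widget",
        "gtin": "4006381333932",  # last digit is deliberately wrong
        "availability": "in stock",
        "image_url": "",
    }
    for issue in feed_issues(sample):
        print(issue)
```

Running a check like this over the full catalog before each sync gives a worklist of records to fix instead of a rejected or silently degraded feed.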
Indirect influence (ship quarterly, not weekly)
- Reddit / community threads — you cannot write these; you can only encourage authentic discussion by participating, answering questions, and shipping products that generate organic conversation.
- Trustpilot / independent reviews — solicit reviews through post-purchase flows, respond to every review (positive and negative), maintain a complete profile.
- Editorial and creator coverage — traditional PR and influencer relationships. Slow to compound but disproportionately powerful for Gemini's Knowledge Graph and Perplexity's corroboration signals.
Why schema.org is the universal layer that feeds every engine
Schema is the one signal every engine reads, regardless of which pipeline family it prefers. Complete schema is the highest-ROI cross-engine move.
If you only have time to ship one thing across all six engines, ship complete schema.org markup. Product schema on every PDP, FAQPage schema on product and buyer-guide pages, BreadcrumbList on every non-root page, Organization on the root. Every engine reads this markup in some form — ChatGPT via Bing, Gemini via Google index, Perplexity and Claude via direct crawl, Copilot via Bing, You.com via its own crawler. A store with complete schema benefits in six places simultaneously from one build.
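One way to verify the coverage described above is to read the JSON-LD blocks out of a rendered page and compare the `@type` values against what that page kind should carry. The sketch below uses only the Python standard library. The page-kind labels (`pdp`, `buyer_guide`, `root`) and the expectations table are assumptions derived from the paragraph above, and the check covers JSON-LD only, not microdata or RDFa.

```python
import json
from html.parser import HTMLParser

# Expected schema types per page kind; the keys are hypothetical labels.
EXPECTED = {
    "pdp": {"Product", "FAQPage", "BreadcrumbList"},
    "buyer_guide": {"FAQPage", "BreadcrumbList"},
    "root": {"Organization"},
}


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def schema_types(html: str) -> set[str]:
    """Return every @type declared at the top level or inside @graph."""
    collector = _JsonLdCollector()
    collector.feed(html)
    found = set()
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a malformed block is treated as absent
        if isinstance(doc, dict):
            nodes = doc.get("@graph", [doc])
        elif isinstance(doc, list):
            nodes = doc
        else:
            nodes = []
        for node in nodes:
            if not isinstance(node, dict):
                continue
            declared = node.get("@type")
            if isinstance(declared, str):
                found.add(declared)
            elif isinstance(declared, list):
                found.update(t for t in declared if isinstance(t, str))
    return found


def missing_types(html: str, page_kind: str) -> set[str]:
    """Schema types expected for this page kind but not found in the HTML."""
    return EXPECTED[page_kind] - schema_types(html)


# Hypothetical product page that forgot its FAQPage block.
SAMPLE_PDP = """<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Product", "name": "Widget"}</script>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "BreadcrumbList", "itemListElement": []}</script>
</head><body></body></html>"""

if __name__ == "__main__":
    print(sorted(missing_types(SAMPLE_PDP, "pdp")))
```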
The minimum viable schema stack for commerce
- Product — name, description, image, brand, sku, gtin13, offers (price, availability, priceCurrency), aggregateRating (when reviews exist).
- FAQPage — 6-8 question-answer pairs per major PDP and per buyer guide.
- BreadcrumbList — renders your navigation path for trust and context.
- Organization — name, url, sameAs (LinkedIn, Crunchbase, Wikipedia if applicable), contactPoint, logo. E-E-A-T foundation.
- HowTo — when you publish procedural content (setup guides, care instructions, style tutorials).
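As an illustration of the Product entry in the stack above, the following sketch builds the JSON-LD from plain values and only emits `aggregateRating` when reviews exist. The function name, its parameters, and the sample values are hypothetical. The property names (`gtin13`, `offers`, `priceCurrency`, `availability`, `aggregateRating`) are schema.org vocabulary.

```python
import json


def product_jsonld(name, description, image, brand, sku, gtin13,
                   price, currency, in_stock,
                   rating=None, review_count=0) -> str:
    """Return a schema.org Product as a JSON-LD string."""
    availability = "InStock" if in_stock else "OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        "image": image,
        "brand": {"@type": "Brand", "name": brand},
        "sku": sku,
        "gtin13": gtin13,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }
    # Only declare a rating when real reviews back it up.
    if rating is not None and review_count > 0:
        data["aggregateRating"] = {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # Hypothetical product values for illustration only.
    print(product_jsonld(
        name="Widget",
        description="A sample widget.",
        image="https://example.com/widget.jpg",
        brand="Example Co",
        sku="W-001",
        gtin13="4006381333931",
        price=19.9,
        currency="EUR",
        in_stock=True,
    ))
```

The resulting string goes inside a `<script type="application/ld+json">` element in the page head. In a Shopify theme the same structure would normally be rendered from product data in the template rather than generated offline.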
Third-party sources and why merchants should not chase them directly
Reddit, Trustpilot, Wikipedia, and creator content all feed AI retrievers. Gaming them is a losing strategy; shipping products worth talking about is the durable play.
Third-party sources are the part of the map merchants have the least control over and the most temptation to game. Fake Reddit posts, pay-for-Trustpilot-review operators, mass-edited Wikipedia entries — all of these are known anti-patterns and all of them are detected and penalised by modern retrieval quality layers. The durable play is the opposite: ship products worth discussing, then make participation in authentic conversation part of your operational rhythm.
1. Reddit. Create an authenticated company account, disclose affiliation, answer questions in category subreddits, never vote-brigade. Reddit's moderation of inauthentic commercial activity is strong; AI retrievers detect the same patterns.
2. Trustpilot. Ship a post-purchase review-request email that does not gate positive reviews. Respond to 100% of reviews within 14 days. Maintain a complete profile with photos and contact details.
3. Wikipedia. For most stores there is almost nothing to do here, because Wikipedia's notability thresholds are high. If your brand crosses the threshold (major press coverage, significant revenue, category-defining moment), work with a specialist writer who understands Wikipedia's conflict-of-interest policies.
4. Creator and editorial content. Traditional PR, but prioritise creators whose audience overlaps your buyer. A single genuinely enthusiastic creator review is worth more than 10 paid placements for AI corroboration.
“The worst thing you can do to your AI visibility is fake the corroboration layer. The second worst is ignore it entirely. The path in between is slow, authentic participation — and that is what actually compounds.”