
Where AI engines get product data from

Every AI answer about a product has a provenance. Tracing that provenance is how merchants decide which data pipelines to prioritise. This is the map — nine sources, six engines, and the merchant actions that control each one.

Nora Kimura with Hiren Bhuva

AI Retrieval Researcher

11 min

The three pipeline families every AI engine mixes

Licensed feeds, direct web crawl, third-party aggregators. Every engine uses all three — the mix determines where you need to focus.

There is no single source of truth inside an AI engine's retrieval stack. Every engine blends three families of data, and the relative weights of those families are what make optimising for one engine different from optimising for another. Understanding the three families before you look at specific engines is the fastest way to build an accurate mental model.

Licensed feeds
Structured catalog data delivered through a formal partnership. Examples: the Agentic Commerce Protocol feed populated from your Shopify product data, Google Merchant Center, Bing Shopping feed, Amazon's catalog API. Tightly controlled, high fidelity.
Direct web crawl
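Classic web crawl of your public pages. Reads HTML, schema.org markup, llms.txt, ai-sitemap.xml. Every engine runs its own crawler or honours its own robots.txt token (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bingbot, YouBot, xAI crawler). Note that Google-Extended is a robots.txt control token rather than a separate crawler; the fetching itself is done by Google's regular crawlers.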
Third-party aggregators
Content about your products published elsewhere. Reddit threads, Trustpilot reviews, Wikipedia entries, independent review sites, YouTube reviews, X posts. Used for corroboration and sentiment.
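
Because the direct-crawl family only works if the crawlers are allowed in, a quick robots.txt check is a reasonable first diagnostic. The sketch below uses Python's standard `urllib.robotparser` to test a robots.txt body against the user-agent tokens named above. The sample robots.txt, the store URL, and the helper name `crawl_access` are illustrative assumptions, and the xAI crawler is left out because this guide does not give its token.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this guide (xAI omitted: its token is not given here).
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bingbot", "YouBot"]


def crawl_access(robots_txt: str, url: str, agents=AI_AGENTS) -> dict[str, bool]:
    """Return, per user-agent token, whether robots_txt allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt: blocks GPTBot everywhere, blocks /cart for everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart
"""

report = crawl_access(SAMPLE_ROBOTS, "https://example-store.test/products/blue-mug")
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

In practice you would pass the text of your own robots.txt and a representative product URL. A blocked token here means that engine's direct-crawl pathway cannot see the page, whatever the quality of the content.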

9 distinct product-data sources we track across the six mainstream AI engines

Surfient retrieval research panel, April 2026 — derived from 2,400 tracked citations across ChatGPT, Gemini, Perplexity, Claude, Copilot, You.com.

Figure (step flow): The four-step arc this guide walks through. Each numbered card maps to a section below. Step 1: The three pipeline families every AI engine mixes. Step 2: Engine-by-engine: where each one actually pulls from. Step 3: What merchants control directly versus indirectly. Step 4: Why schema.org is the universal layer that feeds every engine.

Engine-by-engine: where each one actually pulls from

The six mainstream engines have meaningfully different source mixes. Optimising only for ChatGPT's mix leaves Gemini and Claude partially unaddressed.

Every engine publicly discloses some fraction of its retrieval approach, but none discloses the full mix. The engine-by-engine picture below is assembled from public documentation, OpenAI / Google / Anthropic engineering talks, and empirical attribution research across thousands of tracked citations.

ChatGPT (OpenAI)

Primary source
Agentic Commerce Protocol feed for transactional prompts.
Secondary source
Bing editorial index for informational and comparison prompts.
Tertiary source
Direct GPTBot crawl for long-form content and buyer guides.
Third-party weight
Medium — Reddit and Trustpilot referenced but not dominant.

Gemini and Google AI Overviews

Primary source
Google Merchant Center product feed.
Secondary source
Google web index — your organic rankings carry over into AI Overviews.
Tertiary source
Shopping tab results and Google Shopping graph entities.
Third-party weight
High — Google's Knowledge Graph pulls from Wikipedia, Wikidata, and licensed data partners extensively.

Perplexity

Primary source
Direct PerplexityBot crawl of your site — HTML, schema, llms.txt.
Secondary source
Perplexity's own web index, built from crawl + curated editorial sources.
Tertiary source
Shopify Catalog API integration for enrolled merchants.
Third-party weight
Very high — Reddit, Trustpilot, independent reviews, and YouTube are weighted heavily.

Claude (Anthropic)

Primary source
Direct ClaudeBot crawl of your public pages.
Secondary source
Anthropic's web-retrieval layer when the assistant is allowed web tools.
Tertiary source
No dedicated commerce feed integration as of April 2026.
Third-party weight
Medium — Claude reads Reddit and Trustpilot when encountered but does not weight them as aggressively as Perplexity.

Copilot (Microsoft)

Primary source
Bing Shopping feed and Bing editorial index.
Secondary source
Bingbot direct crawl of your storefront.
Tertiary source
Microsoft Shopping Graph for categorised product data.
Third-party weight
Medium-low — Reddit is read, but creator content is referenced less often than on ChatGPT or Perplexity.

You.com

Primary source
Direct YouBot crawl with heavy weighting on passage-level extraction.
Secondary source
Curated editorial sources and fresh-news index.
Tertiary source
No dedicated commerce feed; relies on schema and HTML content.
Third-party weight
High — cross-source corroboration is central to the citation ranker.
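
One way to work with the engine-by-engine picture above is to encode it as data and query it. The sketch below transcribes the six source mixes into a dictionary and answers "which engines depend on this source, and at what tier?". The structure, the shortened source labels, and the helper names are assumptions made for illustration; a tertiary value of `None` stands for the "no dedicated commerce feed" entries.

```python
TIERS = ("primary", "secondary", "tertiary")

# Transcribed from the engine sections above; labels are shortened for matching.
ENGINE_SOURCES = {
    "ChatGPT": {"primary": "Agentic Commerce Protocol feed",
                "secondary": "Bing editorial index",
                "tertiary": "GPTBot crawl", "third_party": "medium"},
    "Gemini / AI Overviews": {"primary": "Google Merchant Center feed",
                              "secondary": "Google web index",
                              "tertiary": "Shopping tab and Shopping graph",
                              "third_party": "high"},
    "Perplexity": {"primary": "PerplexityBot crawl",
                   "secondary": "Perplexity web index",
                   "tertiary": "Shopify Catalog API", "third_party": "very high"},
    "Claude": {"primary": "ClaudeBot crawl",
               "secondary": "Anthropic web-retrieval layer",
               "tertiary": None, "third_party": "medium"},
    "Copilot": {"primary": "Bing Shopping feed and Bing editorial index",
                "secondary": "Bingbot crawl",
                "tertiary": "Microsoft Shopping Graph", "third_party": "medium-low"},
    "You.com": {"primary": "YouBot crawl",
                "secondary": "Curated editorial and fresh-news index",
                "tertiary": None, "third_party": "high"},
}


def engines_using(term: str) -> dict[str, str]:
    """Map each engine to the highest tier whose source label mentions term."""
    term = term.lower()
    hits = {}
    for engine, mix in ENGINE_SOURCES.items():
        for tier in TIERS:
            source = mix.get(tier)
            if source and term in source.lower():
                hits[engine] = tier
                break
    return hits


def engines_without_commerce_feed() -> list[str]:
    """Engines whose tertiary slot is 'no dedicated commerce feed' (None here)."""
    return [engine for engine, mix in ENGINE_SOURCES.items() if mix["tertiary"] is None]


print(engines_using("bing"))
print(engines_without_commerce_feed())
```

Querying for "bing" surfaces ChatGPT (secondary) and Copilot (primary), which is why a clean Bing presence pays off on two engines at once.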

What merchants control directly versus indirectly

Six of the nine sources are directly controllable. Three can only be influenced. Knowing the difference changes how you invest time.

Merchant influence is not evenly distributed across the nine sources. Some you edit from Shopify Admin in minutes; others you shape over months through community and PR work. Ranking your investments against this taxonomy is the single best way to avoid wasting a quarter on the wrong levers.

Direct control (ship this week)

  • Agentic Commerce Protocol feed — driven by your Shopify product data. Auto-enrolled but only as good as your titles, GTINs, availability, and images.
  • Google Merchant Center feed — direct upload or auto-sync via the Google & YouTube app. Most Gemini and Google AI Overviews citations depend on this feed being clean.
  • Bing Shopping feed — upload in Microsoft Merchant Center or via Shopify's Bing app. It feeds Copilot's primary pathway and, through Bing, ChatGPT's informational pathway.
  • On-site content — product descriptions, FAQ sections, blog posts. You write these, the crawlers read them.
  • Schema.org markup — Product, FAQPage, AggregateRating, BreadcrumbList. Pure technical work — ship once, benefit everywhere.
  • llms.txt and ai-sitemap.xml — curated signal files every engine reads to some extent.
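
For the last item in the list above, here is a minimal sketch of generating an llms.txt file. It assumes the layout from the community llms.txt proposal (an H1 title, a blockquote summary, then H2 sections containing Markdown link lists). The store name, URLs, and the function name `build_llms_txt` are hypothetical. ai-sitemap.xml is not covered because this guide does not describe its format.

```python
def build_llms_txt(store_name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render an llms.txt body.

    sections maps a heading to a list of (link title, URL, short note) tuples.
    """
    lines = [f"# {store_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Exactly one trailing newline, no trailing blank lines.
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical store data, for illustration only.
print(build_llms_txt(
    "Example Store",
    "Handmade ceramic mugs, shipped from Lisbon.",
    {
        "Products": [
            ("Blue Mug", "https://example-store.test/products/blue-mug", "stoneware, 350 ml"),
        ],
        "Buyer guides": [
            ("Choosing a mug", "https://example-store.test/guides/choosing-a-mug", "materials compared"),
        ],
    },
))
```

Serve the result at the site root as `/llms.txt`, and regenerate it whenever the curated set of pages changes.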

Indirect influence (ship quarterly, not weekly)

  • Reddit / community threads — you cannot write these; you can only encourage authentic discussion by participating, answering questions, and shipping products that generate organic conversation.
  • Trustpilot / independent reviews — solicit reviews through post-purchase flows, respond to every review (positive and negative), maintain a complete profile.
  • Editorial and creator coverage — traditional PR and influencer relationships. Slow to compound but disproportionately powerful for Gemini's Knowledge Graph and Perplexity's corroboration signals.
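
The taxonomy above is small enough to keep as a checklist in code, which makes it harder to lose track of a source when planning a quarter. The sketch below records the nine sources with the control level and cadence this section assigns them. The `Source` class and field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    control: str  # "direct" or "indirect", as classified in this section


# Cadence labels taken from the two list headings above.
CADENCE = {"direct": "ship this week", "indirect": "ship quarterly"}

SOURCES = [
    Source("Agentic Commerce Protocol feed", "direct"),
    Source("Google Merchant Center feed", "direct"),
    Source("Bing Shopping feed", "direct"),
    Source("On-site content", "direct"),
    Source("Schema.org markup", "direct"),
    Source("llms.txt and ai-sitemap.xml", "direct"),
    Source("Reddit / community threads", "indirect"),
    Source("Trustpilot / independent reviews", "indirect"),
    Source("Editorial and creator coverage", "indirect"),
]


def by_control(sources: list[Source]) -> dict[str, list[str]]:
    """Group source names by control level, preserving list order."""
    groups: dict[str, list[str]] = {"direct": [], "indirect": []}
    for source in sources:
        groups[source.control].append(source.name)
    return groups


for control, names in by_control(SOURCES).items():
    print(f"{control} ({CADENCE[control]}):")
    for name in names:
        print(f"  - {name}")
```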

Why schema.org is the universal layer that feeds every engine

Schema is the one signal every engine reads, regardless of which pipeline family it prefers. Complete schema is the highest-ROI cross-engine move.

If you only have time to ship one thing across all six engines, ship complete schema.org markup. Product schema on every PDP, FAQPage schema on product and buyer-guide pages, BreadcrumbList on every non-root page, Organization on the root. Every engine reads this markup in some form — ChatGPT via Bing, Gemini via Google index, Perplexity and Claude via direct crawl, Copilot via Bing, You.com via its own crawler. A store with complete schema benefits in six places simultaneously from one build.
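
To see which of these types a page already exposes, you can extract its JSON-LD blocks and list the `@type` values. The sketch below uses only Python's standard library. It inspects top-level objects and `@graph` members, not nested types such as offers, and the required set for a product page is taken from the paragraph above. The class and function names are assumptions for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON-LD payloads from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped, not fatal


def schema_types(html: str) -> set[str]:
    """Return @type values found at top level or inside @graph."""
    collector = JsonLdCollector()
    collector.feed(html)
    found, stack = set(), list(collector.blocks)
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            stack.extend(node)
        elif isinstance(node, dict):
            node_type = node.get("@type")
            if isinstance(node_type, str):
                found.add(node_type)
            elif isinstance(node_type, list):
                found.update(t for t in node_type if isinstance(t, str))
            if node.get("@graph") is not None:
                stack.append(node["@graph"])
    return found


# Types this section asks for on a product detail page.
REQUIRED_ON_PDP = {"Product", "FAQPage", "BreadcrumbList"}


def missing_pdp_types(html: str) -> set[str]:
    return REQUIRED_ON_PDP - schema_types(html)
```

Feed it the rendered HTML of a product page. An empty result from `missing_pdp_types` means all three types are present, though it says nothing about whether their fields are complete.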

The minimum viable schema stack for commerce

  • Product — name, description, image, brand, sku, gtin13, offers (price, availability, priceCurrency), aggregateRating (when reviews exist).
  • FAQPage — 6-8 question-answer pairs per major PDP and per buyer guide.
  • BreadcrumbList — renders your navigation path for trust and context.
  • Organization — name, url, sameAs (LinkedIn, Crunchbase, Wikipedia if applicable), contactPoint, logo. E-E-A-T foundation.
  • HowTo — when you publish procedural content (setup guides, care instructions, style tutorials).
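
As a concrete starting point, the sketch below builds Product and FAQPage JSON-LD with the fields listed above, and rejects a `gtin13` whose check digit is wrong (the standard GTIN-13 checksum). The function names, the sample product, and the choice to raise on a bad GTIN are assumptions for illustration. Treat the output as a draft to run through a structured-data validator, not as a guaranteed-complete payload.

```python
import json


def valid_gtin13(code: str) -> bool:
    """Check length, digits, and the GTIN-13 check digit (weights 1,3,1,3,...)."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def product_jsonld(*, name, description, image, brand, sku, gtin13,
                   price, currency, in_stock, rating=None, review_count=0):
    if not valid_gtin13(gtin13):
        raise ValueError(f"invalid GTIN-13: {gtin13!r}")
    availability = "InStock" if in_stock else "OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        "image": image,
        "brand": {"@type": "Brand", "name": brand},
        "sku": sku,
        "gtin13": gtin13,
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }
    # aggregateRating is only emitted when reviews exist, as noted above.
    if rating is not None and review_count > 0:
        data["aggregateRating"] = {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "reviewCount": review_count,
        }
    return data


def faq_jsonld(pairs):
    """pairs is a list of (question, answer) strings."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {"@type": "Question", "name": question,
             "acceptedAnswer": {"@type": "Answer", "text": answer}}
            for question, answer in pairs
        ],
    }


# Hypothetical product, for illustration only.
example = product_jsonld(
    name="Blue Mug", description="Stoneware mug, 350 ml.",
    image="https://example-store.test/img/blue-mug.jpg", brand="Example Store",
    sku="MUG-BLUE-350", gtin13="4006381333931",
    price=24, currency="EUR", in_stock=True,
)
print(json.dumps(example, indent=2))
```

Embed the serialised object in the page inside a `<script type="application/ld+json">` tag.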

Third-party sources and why merchants should not chase them directly

Reddit, Trustpilot, Wikipedia, and creator content all feed AI retrievers. Gaming these is a losing game; shipping products worth talking about is the durable play.

Third-party sources are the part of the map merchants have the least control over and the most temptation to game. Fake Reddit posts, pay-for-Trustpilot-review operators, mass-edited Wikipedia entries — all of these are known anti-patterns and all of them are detected and penalised by modern retrieval quality layers. The durable play is the opposite: ship products worth discussing, then make participation in authentic conversation part of your operational rhythm.

  1. Reddit. Create an authenticated company account, disclose affiliation, answer questions in category subreddits, never vote-brigade. Reddit's moderation of inauthentic commercial activity is strong; AI retrievers detect the same patterns.
  2. Trustpilot. Ship a post-purchase review-request email that does not gate positive reviews. Respond to 100% of reviews within 14 days. Maintain a complete profile with photos and contact details.
  3. Wikipedia. Almost nothing for most stores — Wikipedia notability thresholds are high. If your brand crosses the threshold (major press coverage, significant revenue, category-defining moment), work with a specialist writer who understands Wikipedia's conflict-of-interest policies.
  4. Creator and editorial content. Traditional PR, but prioritise creators whose audience overlaps your buyer. A single genuinely enthusiastic creator review is worth more than 10 paid placements for AI corroboration.

"The worst thing you can do to your AI visibility is fake the corroboration layer. The second worst is ignore it entirely. The path in between is slow, authentic participation — and that is what actually compounds."
Nora Kimura, AI Retrieval Researcher

Frequently asked questions

Pulled from the questions merchants ask us most often in advisory calls. Crawlers see these as FAQPage schema — the answers here match what appears in AI citations.

  • Not directly. ChatGPT draws from the Agentic Commerce Protocol feed, populated from your Shopify product data (for transactional prompts), and the Bing editorial index (for informational prompts). Your Google Merchant Center feed flows into Gemini and Google AI Overviews, not ChatGPT. Keep the two feeds at parity: on each engine, competitors win wherever their feed is cleaner than yours.
