Surfient Research — 2026
How many Shopify storefronts actually ship the technical surface that AI answer engines need? A public-data scan of 1,000 Shopify stores across 11 verticals, scoring llms.txt, ai-sitemap.xml, NDJSON product feeds, FAQPage density, Product JSON-LD, and AI-bot robots.txt allowance.
Target sample: 1,000 stores · 11 verticals · public data only
Scan in progress — numbers below are pilot-audit estimates
Final 1,000-store scan completes by 2026-05-31. This page updates automatically with the final values + Wilson 95% CI.
Headline findings
Stores with llms.txt
~6%
Estimated share of Shopify storefronts that publish a working llms.txt at the apex domain. Pending live data.
Pilot estimate (n=120) · final CI pending
Stores with ai-sitemap.xml
~2%
Estimated share with a separate ai-sitemap.xml. The metric is intentionally distinct from sitemap.xml — only the AI-specific feed counts.
Pilot estimate (n=120) · final CI pending
Stores with FAQPage schema
~38%
Estimated share of homepages that emit at least one FAQPage JSON-LD block. FAQPage blocks with fewer than 3 entries are not counted (too thin for citation).
Pilot estimate (n=120) · final CI pending
Stores allowing GPTBot + ClaudeBot
~71%
Estimated share whose robots.txt does NOT have `Disallow: /` for the major AI crawlers. Coverage varies sharply by vertical.
Pilot estimate (n=120) · final CI pending
Stores with Product JSON-LD
~84%
Estimated share with a Product JSON-LD block on at least one sampled product page. This is the strongest baseline because most Shopify themes ship Product JSON-LD by default.
Pilot estimate (n=120) · final CI pending
Stores with zero AI-specific signals
~92%
Estimated share missing all three AI-specific signals (llms.txt, ai-sitemap.xml, NDJSON product feed). This figure underpins the marketing claim that 'most Shopify stores are invisible to AI answer engines'.
Pilot estimate (n=120) · final CI pending
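The FAQPage metric above can be sketched as a small check over homepage HTML. This is an illustrative sketch, not the scanner's actual code: the function names (`extractJsonLd`, `faqEntryCounts`, `hasScorableFaq`) are hypothetical, the regex-based extraction is a simplification of real HTML parsing, and it assumes the 3-entry threshold applies per FAQPage block.

```typescript
// Hypothetical sketch of the FAQPage check; names are illustrative only.
const MIN_FAQ_ENTRIES = 3; // blocks below this are treated as too thin

// Pull the parsed contents of every <script type="application/ld+json"> tag.
// Assumption: a regex is good enough for a sketch; malformed JSON is skipped.
function extractJsonLd(html: string): unknown[] {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const docs: unknown[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    try {
      docs.push(JSON.parse(m[1] ?? ""));
    } catch {
      // Ignore blocks that are not valid JSON.
    }
  }
  return docs;
}

// Return the mainEntity count of every FAQPage node, including nodes
// nested in top-level arrays or an @graph container.
function faqEntryCounts(node: unknown): number[] {
  if (Array.isArray(node)) {
    let all: number[] = [];
    for (const item of node) all = all.concat(faqEntryCounts(item));
    return all;
  }
  if (typeof node !== "object" || node === null) return [];
  const obj = node as Record<string, unknown>;
  const counts: number[] = [];
  const type = obj["@type"];
  const isFaq =
    type === "FAQPage" || (Array.isArray(type) && type.indexOf("FAQPage") !== -1);
  if (isFaq) {
    const entries = obj["mainEntity"];
    counts.push(Array.isArray(entries) ? entries.length : entries ? 1 : 0);
  }
  return counts.concat(faqEntryCounts(obj["@graph"]));
}

// A homepage passes if any single FAQPage block has at least 3 entries.
function hasScorableFaq(html: string): boolean {
  for (const doc of extractJsonLd(html)) {
    for (const count of faqEntryCounts(doc)) {
      if (count >= MIN_FAQ_ENTRIES) return true;
    }
  }
  return false;
}
```

Because no JavaScript is executed during the scan, a check like this only sees JSON-LD present in the initial HTML response.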
What the funnel looks like
The five-stage funnel below is the AI-attribution lens we use to interpret every adoption gap in this scan. Stores that ship both llms.txt and FAQPage schema tend to move up the entire funnel, typically seeing 2-4× more AI referral visits within 90 days of fixing the technical baseline. Stores missing both rarely surface past the first stage.
[Figure: five-stage AI-attribution funnel. Sample citation labels shown in the widget: “cited on 'best base layer for ski touring'”, “cited on 'merino vs synthetic base layers'”, “cited on 'surfient product reviews'”, “not cited this window”, “cited on 'sustainable outdoor brands'”.]
Methodology
Step 01
1,000 Shopify storefronts sampled from BuiltWith's top-ranked Shopify properties, Shopify's 'Featured stores' + 'Built for Shopify' awards lists, and a manually curated tail of mid-market and SMB stores. Verticals balanced to 11 categories (apparel, beauty, food/bev, home, tech, jewelry, fitness, pet, accessories, sustainability, other).
Step 02
Every probe is a single HTTP GET against a public URL. No authentication. No JavaScript execution. No bypass of robots.txt. No PII captured. The scanner identifies as `Surfient-Research/1.0` with a contact URL.
Step 03
1 request per second per host with 250-750ms jitter. The full 1,000-store scan takes ~3 hours of wall-clock time with no concurrent fetches against the same host. Resumable via checkpoint file.
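The pacing rule can be sketched as follows. This is a minimal illustration with hypothetical names (`nextDelayMs`, `estimateWallClockHours`), and it makes two assumptions the text does not confirm: the 250-750ms jitter is added on top of the 1-second base delay, and all requests run serially. Under those assumptions, seven probes per store (the seven listed in Step 04) across 1,000 stores comes to about 2.9 hours, which is consistent with the ~3 hours quoted.

```typescript
// Hypothetical pacing helpers; not the actual scanner code.
const BASE_DELAY_MS = 1000; // 1 request per second per host
const JITTER_MIN_MS = 250;
const JITTER_MAX_MS = 750;

// Delay to wait before the next request to the same host.
// Assumption: jitter is additive on top of the base delay.
// `rand` is injectable so the function can be tested deterministically.
function nextDelayMs(rand: () => number = Math.random): number {
  const jitter = JITTER_MIN_MS + rand() * (JITTER_MAX_MS - JITTER_MIN_MS);
  return Math.round(BASE_DELAY_MS + jitter);
}

// Rough wall-clock estimate for a fully serial scan, using the mean delay.
function estimateWallClockHours(stores: number, probesPerStore: number): number {
  const meanDelayMs = BASE_DELAY_MS + (JITTER_MIN_MS + JITTER_MAX_MS) / 2;
  return (stores * probesPerStore * meanDelayMs) / 3600000;
}

// 1,000 stores x 7 probes x 1.5 s mean delay
const hours = estimateWallClockHours(1000, 7);
```

Injecting the random source keeps the delay logic testable without waiting on real timers.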
Step 04
robots.txt (parsed for User-agent: GPTBot/ClaudeBot/PerplexityBot/Google-Extended), llms.txt (200 + non-empty), ai-sitemap.xml (200 + application/xml), products.ndjson (200 + application/x-ndjson), sitemap.xml (200 + non-empty), homepage (FAQPage JSON-LD entry count), one sample product page (Product JSON-LD present/absent).
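The robots.txt probe can be illustrated with a simplified parser. This is a sketch under stated assumptions, not the scanner's implementation: the names `parseRobots` and `allowsBot` are hypothetical, a bot counts as blocked only when its applicable group contains a literal `Disallow: /`, `Allow` overrides are ignored, and a bot without its own group falls back to the `*` group.

```typescript
// Hypothetical, simplified robots.txt check; names are illustrative only.
type RobotsGroup = { agents: string[]; disallow: string[] };

// Split robots.txt into groups: consecutive User-agent lines share one
// group, followed by the rules that apply to them.
function parseRobots(text: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      if (!current || !lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      if (field === "disallow" && current) current.disallow.push(value);
      lastWasAgent = false;
    }
  }
  return groups;
}

// True unless the group that applies to `bot` contains `Disallow: /`.
// Assumption: fall back to the wildcard group when the bot is not named.
function allowsBot(robotsTxt: string, bot: string): boolean {
  const groups = parseRobots(robotsTxt);
  const name = bot.toLowerCase();
  const specific = groups.filter((g) => g.agents.indexOf(name) !== -1);
  const applicable =
    specific.length > 0 ? specific : groups.filter((g) => g.agents.indexOf("*") !== -1);
  return !applicable.some((g) => g.disallow.indexOf("/") !== -1);
}

const sampleRobots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /cart\n";
const gptAllowed = allowsBot(sampleRobots, "GPTBot");       // false
const claudeAllowed = allowsBot(sampleRobots, "ClaudeBot"); // true
```

Whether a wildcard `Disallow: /` should count against a bot is a scoring choice; this sketch counts it, and the published metric may differ.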
Step 05
Headline percentages are reported with Wilson 95% confidence intervals in the live data layer. Vertical breakdowns require n≥30 in the vertical to be reported; smaller verticals are aggregated into 'other'.
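For reference, the Wilson score interval for a proportion can be computed as below. This is a standalone sketch of the standard formula; the function name `wilson95` is hypothetical and the live data layer may compute it differently.

```typescript
// Wilson score interval at 95% confidence for `successes` out of `n` trials.
// Hypothetical helper name; the formula itself is the standard Wilson interval.
function wilson95(successes: number, n: number): { low: number; high: number } {
  if (n <= 0) throw new Error("n must be positive");
  const z = 1.959964; // two-sided 95% normal quantile
  const z2 = z * z;
  const p = successes / n;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { low: center - half, high: center + half };
}

// Illustrative input only: 7 of 120 is a hypothetical count, not a scan result.
const pilotCi = wilson95(7, 120);
```

With the hypothetical input of 7 passing stores out of 120, the interval runs from roughly 3% to 12%, which shows how wide a pilot-sized estimate is and why the headline figures are marked as pending.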
Step 06
Scanner source: `scripts/geo-adoption-scan.ts`. Seed file: `scripts/data/shopify-scan-seed.csv`. Raw scan output: `var/scan-results-latest.json` (committed alongside the report at publish time). Anyone can re-run the scan with `pnpm scan:geo` on a public host and reproduce the numbers within sampling variance.
Vertical breakdown
Vertical-level breakdowns require n ≥ 30 stores per vertical to be reported. The 1,000-store sample stratifies the seed list to hit n ≥ 60 for the eight largest verticals. Final values will populate this table when M32b ships.
| Vertical | Stores | llms.txt | FAQPage (any) |
|---|---|---|---|
| apparel | — | pending | pending |
| beauty | — | pending | pending |
| food-bev | — | pending | pending |
| home | — | pending | pending |
| tech | — | pending | pending |
| jewelry | — | pending | pending |
| fitness | — | pending | pending |
| pet | — | pending | pending |
| accessories | — | pending | pending |
| sustainability | — | pending | pending |
| other | — | pending | pending |
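The n ≥ 30 reporting rule for this table can be sketched as a small aggregation step. The function name `aggregateSmallVerticals` and the store counts in the example are hypothetical, and the sketch does not address what happens if 'other' itself ends up below the threshold, which the methodology does not specify.

```typescript
// Hypothetical sketch of the reporting rule; not the scanner's code.
const MIN_VERTICAL_N = 30;

// Fold every vertical with fewer than `minN` stores into 'other'.
function aggregateSmallVerticals(
  counts: Record<string, number>,
  minN: number = MIN_VERTICAL_N
): Record<string, number> {
  const out: Record<string, number> = {};
  let other = counts["other"] ?? 0;
  for (const vertical of Object.keys(counts)) {
    if (vertical === "other") continue;
    const n = counts[vertical] ?? 0;
    if (n >= minN) {
      out[vertical] = n; // large enough to report on its own
    } else {
      other += n; // too small: aggregate into 'other'
    }
  }
  if (other > 0) out["other"] = other;
  return out;
}

// Illustrative counts only, not scan data.
const reported = aggregateSmallVerticals({ apparel: 60, pet: 12, jewelry: 29, other: 5 });
// reported is { apparel: 60, other: 46 }
```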
For press + analysts
Press kit will publish once the 1,000-store scan completes (end-May 2026). Charts use the same data layer as the page above, so the press kit and the page never disagree.
Frequently asked
AI answer engines (ChatGPT, Perplexity, Gemini, Claude, Copilot, Google AI Overviews) increasingly intercept shopping queries before they reach Google's web index. Stores without the technical surface AI engines look for — llms.txt, structured product feeds, FAQPage schema, AI-bot-friendly robots.txt — get cited rarely or not at all, even when their products are the best answer.
Read next
Pillar — long form
The canonical pillar: 3,500 words on the GEO surface, the 5 endpoints AI crawlers read, content patterns that get cited verbatim, and the Shopify-native tactics that compound.
Pillar — quick-start
One hour per day for seven days to ship the technical surface AI engines need. Pairs cleanly with this report: read the report, then ship the fixes.