AI crawler traffic in your Shopify logs

If you've never tailed your access logs, you're missing the single earliest signal of GEO health. This is the 280-merchant share chart, the pipeline to capture it, and the alerts that catch a broken llms.txt before it costs you citations.

Harry Parker
Co-founder, Onviqa Inc. · Surfient
TL;DR
  • Eight user agents account for ~99% of AI crawler traffic on Shopify stores in Q1 2026. GPTBot alone is 34% of bot hits in our 280-merchant panel, with a median 89 requests/day per store.
  • Your access logs are the earliest GEO signal — new crawlers show up here 2–4 weeks before they show up in citations. A broken llms.txt shows up here same-day.
  • Ship the three-stage pipeline (S3 sync every 5 min → nightly MariaDB parse → /admin/crawlers dashboard). End-to-end latency under 10 minutes. Slack alert on zero-bot gaps > 48 hours.

Most Shopify merchants have never opened an access log. In 2026 that's an expensive habit. AI crawlers are the earliest GEO signal you have — they show up in your logs two to four weeks before they show up in citations, and a broken llms.txt shows up same-day. This is what you should be looking at, and the pipeline to look at it without tailing a terminal at 2 AM.

Who's actually crawling Shopify stores

We took a 280-merchant Shopify panel (a mix of Basic, Plus, and Advanced) and ran their OpenLiteSpeed and nginx access logs through a single regex for thirty days in Q1 2026. The share isn't what most merchants guess. Classic Googlebot is no longer the dominant crawler. In raw hits, GPTBot (34%) is more than four times bigger than classic Googlebot (8%).

Bar chart of AI crawler share across a 280-merchant Shopify panel. GPTBot 34%, ClaudeBot 22%, PerplexityBot 18%, Google-Extended 11%, classic Googlebot 8%, Bingbot 4%, Applebot-Extended 2%, others 1%.
Figure 1 — AI crawler share across 280 merchants. GPTBot leads, classic Googlebot is demoted, and the AI fetch ratio (GPT+Claude+Perplexity+Google-Extended) is 85% of all bot traffic.

Two observations worth internalising. First, the four AI crawlers (GPT, Claude, Perplexity, Google-Extended) together account for 85% of bot traffic. If your robots.txt accidentally blocks any one of them, you lose between an eighth (Google-Extended, 11 of the 85 points) and two-fifths (GPTBot, 34 of 85) of your AI discovery surface, about a quarter on average. Second, the fetch targets differ: GPTBot prefers /llms-full.txt, ClaudeBot goes to product pages, PerplexityBot hits /llms.txt. That's not noise: it's the three engines grounding on different signals, and it's why shipping both files matters.
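The robots.txt risk above is easy to check mechanically. The following is a minimal sketch, not a full Robots Exclusion Protocol parser: it only detects a site-wide `Disallow: /` and ignores Allow overrides and partial-path rules. The function name `blockedAiAgents` and the `AI_AGENTS` list are illustrative assumptions, with tokens taken from Figure 1.

```typescript
// Agent tokens from the article's Figure 1; extend the list as needed.
const AI_AGENTS: string[] = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

// Returns the agents for which robots.txt disallows the whole site.
// Simplification (assumption): only "Disallow: /" counts as blocked;
// Allow overrides and partial-path rules are ignored.
function blockedAiAgents(robotsTxt: string, agents: string[] = AI_AGENTS): string[] {
  const seen = new Set<string>();    // every user-agent token that has its own group
  const blocked = new Set<string>(); // tokens whose group contains "Disallow: /"
  let group: string[] = [];          // user-agents of the group being read
  let inRules = false;               // true once the current group has a rule line

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after rule lines starts a new group.
      if (inRules) { group = []; inRules = false; }
      group.push(value.toLowerCase());
      seen.add(value.toLowerCase());
    } else if (field === "disallow" || field === "allow") {
      inRules = true;
      if (field === "disallow" && value === "/") {
        for (const ua of group) blocked.add(ua);
      }
    }
  }

  // An agent with its own group follows that group; otherwise it follows "*".
  return agents.filter((agent) => {
    const token = agent.toLowerCase();
    return seen.has(token) ? blocked.has(token) : blocked.has("*");
  });
}
```

Running this against the live robots.txt after every theme update or app install would surface an accidental block before the crawler share drops.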

The user-agent strings you should recognise

Verbatim, these are the UAs to match on. Strings change occasionally (OpenAI shipped a new GPTBot version in late 2025); treat this list as a starting regex and update when a crawler publishes a new token.

  • GPTBot Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot). OpenAI's web crawler. Grounds ChatGPT answers.
  • ChatGPT-User (compatible; ChatGPT-User/1.0; +https://openai.com/bot). On-demand fetches when a ChatGPT conversation follows a link. Different from GPTBot.
  • ClaudeBot Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected]). Grounds Claude answers and computer-use mode.
  • PerplexityBot Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/2.0). Feeds Perplexity's answer engine.
  • Google-Extended — identified by UA token Google-Extended. Distinct from Googlebot. Controls whether Gemini and AI Overviews ingest the page.
  • Applebot-Extended (UA token Applebot-Extended). Controls whether Siri and Apple Intelligence ground on the site. Small share today, growing.
  • Amazonbot (compatible; Amazonbot/0.1; +https://developer.amazon.com/amazonbot). Crawls for Rufus and Alexa grounding.
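As a starting point, the list above can be turned into a small classifier. This is a sketch under assumptions: the names `KNOWN_BOTS` and `classifyBot` are illustrative, the patterns match on the product token only (so a version bump does not break them), and Googlebot and bingbot are added from Figure 1 rather than from the list above.

```typescript
// Label plus pattern for each crawler. Tokens are matched case-insensitively.
const KNOWN_BOTS: Array<[string, RegExp]> = [
  ["GPTBot", /GPTBot/i],
  ["ChatGPT-User", /ChatGPT-User/i],
  ["ClaudeBot", /ClaudeBot/i],
  ["PerplexityBot", /PerplexityBot/i],
  ["Google-Extended", /Google-Extended/i],
  ["Applebot-Extended", /Applebot-Extended/i],
  ["Amazonbot", /Amazonbot/i],
  ["Googlebot", /Googlebot/i],
  ["Bingbot", /bingbot/i],
];

// Returns the label of the first matching crawler, or null for anything else.
function classifyBot(ua: string): string | null {
  for (const [label, pattern] of KNOWN_BOTS) {
    if (pattern.test(ua)) return label;
  }
  return null;
}
```

Returning null rather than a catch-all label keeps unknown user agents distinguishable, which matters if you later want to alert on them.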

The three-stage pipeline

You don't need a data warehouse to track this. You need three cron jobs, a MariaDB table, and a Next.js admin page. Total implementation time on Surfient's side was about a day; the nightly regeneration runs for pennies on the original Lightsail instance.

Three-stage pipeline for turning raw access logs into an AI crawler dashboard. Stage one: s3cmd sync every five minutes. Stage two: nightly MariaDB parse with a known-bots regex. Stage three: materialized view feeding an admin dashboard with gauges and Slack alerts.
Figure 2 — three stages, one cron job, under ten minutes end-to-end. The materialized view is the only expensive-ish step and it only runs nightly.

Stage 1 — ship logs off the box

Do not parse live logs on your web server. You'll compete for disk I/O with real traffic. Instead, rotate logs hourly and sync the rotated files to S3 every five minutes with s3cmd sync --skip-existing. For Shopify stores not on self-hosted infra, pipe your Cloudflare logs to a Worker that writes to R2 or S3 — same pattern, different origin.

Stage 2 — nightly parse into MariaDB

A Node worker runs once a day (say 03:00 local), streams the previous day's logs from S3, matches against the known-bots regex, and inserts one row per hit into a crawler_hits table. Schema: timestamp, bot, ua, status, path, bytes. Index on (bot, timestamp) and (timestamp). Median insert volume across our panel is ~4,000 rows/day/store — trivial.
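The parse step might look like the sketch below. It assumes the logs are in the common "combined" access-log format, which the article does not confirm, and the names `CrawlerHit`, `parseHit`, and `clfToIso` are illustrative. Lines that are malformed or that match no known bot token return null, so the caller inserts only bot hits.

```typescript
// One row of the crawler_hits table described above.
interface CrawlerHit {
  timestamp: string; // ISO 8601, normalised to UTC
  bot: string;       // token as it appears in the UA, e.g. "GPTBot"
  ua: string;
  status: number;
  path: string;
  bytes: number;
}

const BOT_TOKENS =
  /(GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended|Amazonbot|Googlebot|bingbot)/i;

// Assumed layout: host ident user [time] "METHOD path proto" status bytes "referer" "ua"
const LOG_LINE =
  /^\S+ \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"$/;

const CLF_TIME =
  /^(\d{2})\/([A-Za-z]{3})\/(\d{4}):(\d{2}:\d{2}:\d{2}) ([+-]\d{2})(\d{2})$/;

const MONTHS: Record<string, string> = {
  Jan: "01", Feb: "02", Mar: "03", Apr: "04", May: "05", Jun: "06",
  Jul: "07", Aug: "08", Sep: "09", Oct: "10", Nov: "11", Dec: "12",
};

// Converts "10/Feb/2026:03:14:07 +0000" to an ISO 8601 UTC string.
function clfToIso(ts: string): string | null {
  const m = CLF_TIME.exec(ts);
  if (!m) return null;
  const [, dd = "", mon = "", yyyy = "", time = "", offH = "", offM = ""] = m;
  const mm = MONTHS[mon];
  if (!mm) return null;
  const d = new Date(`${yyyy}-${mm}-${dd}T${time}${offH}:${offM}`);
  return Number.isNaN(d.getTime()) ? null : d.toISOString();
}

// Parses one log line; returns null for malformed lines and non-bot traffic.
function parseHit(line: string): CrawlerHit | null {
  const m = LOG_LINE.exec(line.trim());
  if (!m) return null;
  const [, ts = "", path = "", status = "", bytes = "", ua = ""] = m;
  const bot = BOT_TOKENS.exec(ua);
  const timestamp = clfToIso(ts);
  if (!bot || !timestamp) return null;
  return {
    timestamp,
    bot: bot[1] ?? "",
    ua,
    status: Number(status),
    path,
    bytes: bytes === "-" ? 0 : Number(bytes), // "-" means no body was sent
  };
}
```

Storing the timestamp in UTC keeps the (bot, timestamp) index comparable across stores whose servers log in different time zones.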

Stage 3 — materialized view + admin page

MariaDB has no native materialized views, so in practice this is either a plain view or a summary table refreshed by a scheduled INSERT INTO crawler_daily_agg. Either way, it rolls up the hits by bot and path. Your admin page reads that rollup and renders a gauge, a 7-day trend line, a top-10 path table, and a zero-bot alert. If GPTBot drops to zero for 48 hours, a Slack webhook fires. If any unknown UA exceeds 100 req/day, another webhook fires.
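The rollup logic itself is small. The sketch below does it in application code rather than SQL, purely to show the shape of the aggregate; the names `DailyAgg`, `rollupDaily`, and `topPaths` are illustrative assumptions, and timestamps are assumed to be ISO 8601 strings in UTC.

```typescript
interface HitRow { timestamp: string; bot: string; path: string; }
interface DailyAgg { day: string; bot: string; path: string; hits: number; }

// Plain string comparison, independent of locale.
function cmp(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Groups hits by (day, bot, path) and counts them.
function rollupDaily(hits: HitRow[]): DailyAgg[] {
  const counts = new Map<string, DailyAgg>();
  for (const h of hits) {
    const day = h.timestamp.slice(0, 10); // "YYYY-MM-DD" from an ISO timestamp
    const key = `${day}|${h.bot}|${h.path}`;
    const row = counts.get(key);
    if (row) row.hits += 1;
    else counts.set(key, { day, bot: h.bot, path: h.path, hits: 1 });
  }
  // Sort by day, then bot, then busiest path first.
  return Array.from(counts.values()).sort(
    (a, b) => cmp(a.day, b.day) || cmp(a.bot, b.bot) || b.hits - a.hits
  );
}

// Sums a bot's hits per path across all days and returns the top n paths.
function topPaths(rows: DailyAgg[], bot: string, n = 10): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const r of rows) {
    if (r.bot === bot) totals.set(r.path, (totals.get(r.path) ?? 0) + r.hits);
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]).slice(0, n);
}
```

The 7-day trend line and the top-10 path table both read from the same daily rows, so the raw hits table never needs to be scanned at page-render time.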

Alerts that actually earn their keep

Not all alerts are useful. The three that have justified their noise for our panel:

  • Zero hits from a major AI crawler for 48 hours. In 90% of cases this is a robots.txt mistake introduced by a theme update or app install. The fix is minutes; the citation loss from sleeping on it is weeks.
  • 4xx spike to /llms.txt or /llms-full.txt. You broke the route. Agents that see 4xx three times in a row back off for days before retrying. Fix inside the next crawl window (~4 hours).
  • Unknown UA > 100 req/day. Scraper, affiliate bot, or new legit crawler. Rate-limit at Cloudflare until you've classified it. Legit AI crawlers will retry on a polite back-off; scrapers won't.
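The three rules above can be evaluated in one pass over recent hits. The sketch below is one possible reading, with several assumptions: `evaluateAlerts` and the alert label strings are hypothetical names, a "4xx spike" is taken to mean three or more 4xx responses on either file within the last four hours (the article gives no threshold), and unknown traffic is represented by `bot: null`. Sending the Slack webhook is left to the caller.

```typescript
interface AlertHit {
  timestamp: string;  // ISO 8601
  bot: string | null; // null means the UA matched no known crawler
  ua: string;
  status: number;
  path: string;
}

const MAJOR_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];
const LLMS_PATHS = ["/llms.txt", "/llms-full.txt"];
const HOUR_MS = 60 * 60 * 1000;

function evaluateAlerts(hits: AlertHit[], now: Date): string[] {
  const nowMs = now.getTime();
  const age = (h: AlertHit): number => nowMs - Date.parse(h.timestamp);
  const alerts: string[] = [];

  // Rule 1: a major AI crawler with no hits in the last 48 hours.
  for (const bot of MAJOR_BOTS) {
    const seenRecently = hits.some((h) => h.bot === bot && age(h) <= 48 * HOUR_MS);
    if (!seenRecently) alerts.push(`zero-hits:${bot}`);
  }

  // Rule 2: 4xx responses on the llms files (assumed threshold: 3 within 4 hours).
  const broken = hits.filter(
    (h) =>
      LLMS_PATHS.indexOf(h.path) >= 0 &&
      h.status >= 400 && h.status < 500 &&
      age(h) <= 4 * HOUR_MS
  );
  if (broken.length >= 3) alerts.push("llms-4xx");

  // Rule 3: any single unknown UA above 100 requests in the last 24 hours.
  const unknown = new Map<string, number>();
  for (const h of hits) {
    if (h.bot === null && age(h) <= 24 * HOUR_MS) {
      unknown.set(h.ua, (unknown.get(h.ua) ?? 0) + 1);
    }
  }
  unknown.forEach((count, ua) => {
    if (count > 100) alerts.push(`unknown-ua:${ua}`);
  });

  return alerts;
}
```

Note that rule 3 needs unknown user agents to be stored, so a parse stage that keeps only known-bot rows would have to log unmatched UAs separately.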

Analytics hygiene — the biggest own-goal

The number one mistake we see is merchants quoting conversion rate without excluding bot sessions. Across our 280-merchant panel, stores see 4,000+ bot sessions/day each. Plausible and PostHog both have bot-exclusion filters; GA4 excludes known bots and spiders automatically, with no toggle to manage. Enable the filters that are optional, document the setup in your analytics runbook, and never let a stakeholder report a conversion rate computed on unfiltered data.
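To see how much unfiltered bot sessions distort the number, here is a small worked example. The 4,000 bot sessions come from the panel figure above; the 1,000 human sessions and 20 orders are hypothetical values chosen only for illustration, and `conversionRate` is an illustrative helper.

```typescript
// Conversion rate = orders / sessions; returns 0 when there are no sessions.
function conversionRate(orders: number, sessions: number): number {
  return sessions > 0 ? orders / sessions : 0;
}

const humanSessions = 1000; // hypothetical
const botSessions = 4000;   // panel figure quoted above
const orders = 20;          // hypothetical; bots do not place orders

const filtered = conversionRate(orders, humanSessions);                 // 0.02  (2.0%)
const unfiltered = conversionRate(orders, humanSessions + botSessions); // 0.004 (0.4%)
```

Under these assumed numbers the unfiltered rate understates real performance by a factor of five, which is enough to reverse most week-over-week conclusions.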

Tags: crawlers, GPTBot, ClaudeBot, logs, observability, Shopify

Sources & further reading

  1. Surfient Q1 2026 crawler-share panel
    Surfient Research, 2026-03-31

Harry has led SEO and e-commerce engineering for over 12 years and has been shipping Shopify software since Onviqa was founded in 2014. He writes about where commerce is headed when shoppers stop typing queries and start asking assistants.
