Frequently asked questions

Do I need to self-host to track this? I'm on Shopify Basic.

No. If Shopify is your origin, point your domain through Cloudflare and use Logpush to stream access logs to R2 or S3. The filter and aggregate stages work the same way — you're just swapping the ingest source. The one caveat: Shopify's native access logs are not exposed to merchants, so Cloudflare (or another reverse proxy) is the only way to capture UA strings reliably at the Basic plan tier.

Is it OK to block GPTBot / ClaudeBot / PerplexityBot?

Technically yes; commercially, usually no. Blocking these bots removes your site from the grounding pool for ChatGPT, Claude, and Perplexity respectively. If a meaningful slice of your shoppers is asking those engines for product recommendations in your category, blocking is expensive. The only blocks we recommend are short-term rate limits during incidents, never permanent Disallow rules for these user agents in robots.txt.

What's the difference between Google-Extended and classic Googlebot?

Classic Googlebot crawls for the blue-link index and also provides grounding data for AI Overviews. Google-Extended is a separate opt-out token: if you want to stay in the blue-link index but NOT contribute training data to Gemini, you disallow Google-Extended in robots.txt and leave Googlebot alone. Most merchants should allow both. The split exists for publishers who want search visibility without training contribution.
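One way to confirm that a robots.txt really implements this split is to parse it and check each token. The sketch below is a smoke test, not a full robots.txt evaluator: the function name isFullyDisallowed and the sample file are our own illustration, it only detects a blanket "Disallow: /", and it ignores Allow overrides and path patterns. A token with no group of its own falls back to the "*" group.

```typescript
// Returns true when `token` is blocked site-wide by a blanket "Disallow: /".
// Assumption: Allow overrides and partial path rules are deliberately ignored.
function isFullyDisallowed(robotsTxt: string, token: string): boolean {
  const blockedByAgent = new Map<string, boolean>();
  let agents: string[] = []; // user agents of the group being read
  let inRules = false;       // true once the current group has seen a rule

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) {
        agents = [];
        inRules = false;
      }
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!blockedByAgent.has(agent)) blockedByAgent.set(agent, false);
    } else if (field === "disallow" || field === "allow") {
      inRules = true;
      if (field === "disallow" && value === "/") {
        for (const a of agents) blockedByAgent.set(a, true);
      }
    }
  }

  const own = blockedByAgent.get(token.toLowerCase());
  if (own !== undefined) return own;
  return blockedByAgent.get("*") ?? false; // fall back to the wildcard group
}

// Hypothetical robots.txt: stay in search, opt out of Google-Extended.
const sampleRobots = [
  "User-agent: *",
  "Disallow: /checkout",
  "",
  "User-agent: Google-Extended",
  "Disallow: /",
].join("\n");

const extendedBlocked = isFullyDisallowed(sampleRobots, "Google-Extended"); // true
const googlebotBlocked = isFullyDisallowed(sampleRobots, "Googlebot");      // false
```

Running the same check for GPTBot, ClaudeBot, and PerplexityBot after every theme update or app install is a cheap guard against accidental blocks.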

How long before a new crawler shows up in my logs after an engine launches?

Historically 48–72 hours from the announcement of a new crawler UA. When OpenAI shipped ChatGPT-User in mid-2023, it appeared in our panel logs inside three days. When Anthropic shipped ClaudeBot, it was faster — under 36 hours. If a major engine ships a new crawler and you don't see it in your logs within a week, either your robots.txt is blocking it or your log sampling is missing it.

Do I need to store logs forever?

No. We keep 90 days of raw logs in S3 Infrequent Access (pennies per store per month) and 18 months of aggregates in MariaDB. Raw logs are useful for the occasional 'what actually happened on [date]?' debugging question. Aggregates are what you report against. Anything longer than 90 days raw is usually wasted storage — the questions that need that history almost always need aggregates, not raw lines.

AI crawler traffic in your Shopify logs

Most Shopify merchants have never opened an access log. In 2026 that's an expensive habit. AI crawlers are the earliest GEO signal you have — they show up in your logs two to four weeks before they show up in citations, and a broken llms.txt shows up same-day. This is what you should be looking at, and the pipeline to look at it without tailing a terminal at 2 AM.

Who's actually crawling Shopify stores

We took a 280-merchant Shopify panel (a mix of Basic, Plus, and Advanced) and ran their OpenLiteSpeed and nginx access logs through a single regex for thirty days in Q1 2026. The share isn't what most merchants guess. Classic Googlebot is no longer the dominant crawler. In terms of raw hits, GPTBot is more than four times bigger (34% against 8%).

Figure 1. AI crawler share across 280 merchants (bar chart): GPTBot 34%, ClaudeBot 22%, PerplexityBot 18%, Google-Extended 11%, classic Googlebot 8%, Bingbot 4%, Applebot-Extended 2%, others 1%. GPTBot leads, classic Googlebot is demoted, and the AI fetch ratio (GPT+Claude+Perplexity+Google-Extended) is 85% of all bot traffic.

Two observations worth internalising. First, the four AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) together account for 85% of bot traffic. If your robots.txt accidentally blocks any one of them, you lose between 13% (Google-Extended) and 40% (GPTBot) of your AI crawler hits. Second, the fetch targets differ: GPTBot prefers /llms-full.txt, ClaudeBot goes to product pages, PerplexityBot hits /llms.txt. That's not noise: it's the three engines grounding on different signals, and it's why shipping both files matters.
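The arithmetic behind the first observation is easy to reproduce from the Figure 1 shares. The snippet below is a small sketch: the names SHARE, aiFetchRatio, and lossIfBlocked are ours, and the loss figure assumes a blocked crawler's hits simply disappear rather than being replaced by another engine.

```typescript
// Percentage share of bot hits per crawler, copied from Figure 1.
const SHARE: Record<string, number> = {
  GPTBot: 34,
  ClaudeBot: 22,
  PerplexityBot: 18,
  "Google-Extended": 11,
  Googlebot: 8,
  Bingbot: 4,
  "Applebot-Extended": 2,
  others: 1,
};

// The four crawlers the article counts towards the AI fetch ratio.
const AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

// Sum of the AI crawler shares, in percentage points.
function aiFetchRatio(): number {
  return AI_CRAWLERS.reduce((sum, bot) => sum + (SHARE[bot] ?? 0), 0);
}

// Fraction of AI crawler hits lost if one crawler is blocked outright.
function lossIfBlocked(bot: string): number {
  return (SHARE[bot] ?? 0) / aiFetchRatio();
}

const ratio = aiFetchRatio();                            // 85
const gptLoss = lossIfBlocked("GPTBot");                 // 34 / 85 = 0.40
const extendedLoss = lossIfBlocked("Google-Extended");   // 11 / 85, about 0.13
```

The spread between the smallest and largest loss is why an accidental GPTBot block is the most costly of the four.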

The user-agent strings you should recognise

Verbatim, these are the UAs to match on. Strings change occasionally (OpenAI shipped a new GPTBot version in late 2025); treat this list as a starting regex and update when a crawler publishes a new token.

GPTBot: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot). OpenAI's web crawler. Grounds ChatGPT answers.
ChatGPT-User: (compatible; ChatGPT-User/1.0; +https://openai.com/bot). On-demand fetches when a ChatGPT conversation follows a link. Different from GPTBot.
ClaudeBot: Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected]). Grounds Claude answers and computer-use mode.
PerplexityBot: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/2.0). Feeds Perplexity's answer engine.
Google-Extended: identified by UA token Google-Extended. Distinct from Googlebot. Controls whether Gemini uses the page; AI Overviews are grounded through classic Googlebot, not this token.
Applebot-Extended: identified by UA token Applebot-Extended. Controls whether Siri and Apple Intelligence ground on the site. Small share today, growing.
Amazonbot: (compatible; Amazonbot/0.1; +https://developer.amazon.com/amazonbot). Crawls for Rufus and Alexa grounding.
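Because version numbers and surrounding boilerplate change, it is usually safer to match on the product token than on the full string. Here is a minimal sketch of a starting regex built from the tokens listed above, plus Googlebot and Bingbot from the panel data; KNOWN_BOTS and classifyBot are illustrative names, not part of any library.

```typescript
// Product tokens to match. Extend this list when a crawler publishes a new one.
const KNOWN_BOTS: readonly string[] = [
  "GPTBot",
  "ChatGPT-User",
  "ClaudeBot",
  "PerplexityBot",
  "Google-Extended",
  "Applebot-Extended",
  "Amazonbot",
  "Googlebot",
  "Bingbot",
];

// Escape regex metacharacters so every token is matched literally.
function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// One case-insensitive alternation over all tokens.
const KNOWN_BOTS_RE = new RegExp(KNOWN_BOTS.map(escapeRegex).join("|"), "i");

// Returns the canonical token for a user-agent string, or null for
// anything that does not match a known crawler.
function classifyBot(ua: string): string | null {
  const m = KNOWN_BOTS_RE.exec(ua);
  if (m === null) return null;
  const hit = (m[0] ?? "").toLowerCase();
  return KNOWN_BOTS.find((b) => b.toLowerCase() === hit) ?? null;
}
```

Anything that returns null is either a human browser or an unclassified agent, which is the population the unknown-UA alert later in the article watches.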

The three-stage pipeline

You don't need a data warehouse to track this. You need three cron jobs, a MariaDB table, and a Next.js admin page. Total implementation time on Surfient's side was about a day; the regen runs for pennies on the original Lightsail instance.

Figure 2. Three-stage pipeline for turning raw access logs into an AI crawler dashboard. Stage one: s3cmd sync every five minutes. Stage two: nightly MariaDB parse with a known-bots regex. Stage three: materialized view feeding an admin dashboard with gauges and Slack alerts. Three stages, three cron jobs, under ten minutes end-to-end. The materialized view is the only expensive-ish step and it only runs nightly.

Stage 1 — ship logs off the box

Do not parse live logs on your web server. You'll compete for disk I/O with real traffic. Instead, rotate logs hourly and sync the rotated files to S3 every five minutes with s3cmd sync --skip-existing. For Shopify stores not on self-hosted infra, pipe your Cloudflare logs to a Worker that writes to R2 or S3 — same pattern, different origin.

Stage 2 — nightly parse into MariaDB

A Node worker runs once a day (say 03:00 local), streams the previous day's logs from S3, matches against the known-bots regex, and inserts one row per hit into a crawler_hits table. Schema: timestamp, bot, ua, status, path, bytes. Index on (bot, timestamp) and (timestamp). Median insert volume across our panel is ~4,000 rows/day/store — trivial.
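The parse step can be sketched as a pure function that turns one log line into a crawler_hits row. This assumes the nginx "combined" log format; OpenLiteSpeed or Cloudflare output may need a different LINE_RE. The S3 streaming and the MariaDB insert are left out, and the names CrawlerHit, LINE_RE, and parseLine are ours.

```typescript
// One row of the crawler_hits table described above.
interface CrawlerHit {
  timestamp: string; // raw log timestamp, e.g. "10/Mar/2026:03:14:07 +0000"
  bot: string;
  ua: string;
  status: number;
  path: string;
  bytes: number;
}

// Assumed layout: ip ident user [time] "METHOD path PROTO" status bytes "referer" "ua"
const LINE_RE =
  /^\S+ \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"$/;

// Known-bots regex; case-sensitive so the match is already the canonical token.
const BOT_RE =
  /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended|Amazonbot|Googlebot|Bingbot/;

// Returns a row for known-bot hits, or null for malformed lines and non-bot traffic.
function parseLine(line: string): CrawlerHit | null {
  const m = LINE_RE.exec(line.trim());
  if (m === null) return null;

  const ua = m[5] ?? "";
  const bot = BOT_RE.exec(ua);
  if (bot === null) return null;

  return {
    timestamp: m[1] ?? "",
    bot: bot[0] ?? "",
    ua,
    status: Number(m[3]),
    path: m[2] ?? "",
    bytes: m[4] === "-" ? 0 : Number(m[4]), // "-" means no body was sent
  };
}
```

Keeping the parser free of I/O makes it easy to unit-test against a handful of real lines before pointing it at a full day of logs.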

Stage 3 — materialized view + admin page

MariaDB has no native materialized views, so "materialized view" here means either a plain view or a scheduled INSERT INTO crawler_daily_agg that rolls up the hits by bot and path. Your admin page reads that table and renders a gauge, a 7-day trend line, a top-10 path table, and a zero-bot alert. If GPTBot drops to zero for 48 hours, a Slack webhook fires. If any unknown UA exceeds 100 req/day, another webhook fires.
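To make the rollup concrete, here is the same aggregation expressed in memory rather than in SQL. It is only a sketch of the logic: in production the grouping would normally run inside MariaDB, and the Hit and DailyAgg shapes and both function names are assumptions for illustration.

```typescript
interface Hit {
  day: string; // e.g. "2026-03-10"
  bot: string;
  path: string;
}

interface DailyAgg extends Hit {
  hits: number;
}

// Group raw hits into one row per (day, bot, path), counting occurrences.
function rollUp(hits: readonly Hit[]): DailyAgg[] {
  const byKey = new Map<string, DailyAgg>();
  for (const h of hits) {
    const key = [h.day, h.bot, h.path].join("|");
    const row = byKey.get(key);
    if (row !== undefined) {
      row.hits += 1;
    } else {
      byKey.set(key, { day: h.day, bot: h.bot, path: h.path, hits: 1 });
    }
  }
  const out: DailyAgg[] = [];
  byKey.forEach((row) => out.push(row));
  return out;
}

// Top-N paths for one bot across all days in the aggregate table.
function topPaths(
  aggs: readonly DailyAgg[],
  bot: string,
  n = 10,
): Array<{ path: string; hits: number }> {
  const totals = new Map<string, number>();
  for (const a of aggs) {
    if (a.bot === bot) totals.set(a.path, (totals.get(a.path) ?? 0) + a.hits);
  }
  const rows: Array<{ path: string; hits: number }> = [];
  totals.forEach((hits, path) => rows.push({ path, hits }));
  return rows.sort((x, y) => y.hits - x.hits).slice(0, n);
}
```

The admin page's top-10 path table corresponds to topPaths with the default n, and the 7-day trend line is the same aggregate summed per day instead of per path.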

Alerts that actually earn their keep

Not all alerts are useful. The three that have justified their noise for our panel:

Zero hits from a major AI crawler for 48 hours. In 90% of cases this is a robots.txt mistake introduced by a theme update or app install. The fix is minutes; the citation loss from sleeping on it is weeks.
4xx spike to /llms.txt or /llms-full.txt. You broke the route. Agents that see 4xx three times in a row back off for days before retrying. Fix inside the next crawl window (~4 hours).
Unknown UA > 100 req/day. Scraper, affiliate bot, or new legit crawler. Rate-limit at Cloudflare until you've classified it. Legit AI crawlers will retry on a polite back-off; scrapers won't.
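The three rules above reduce to a few comparisons once the aggregates exist. The sketch below assumes a pre-computed stats object (the AlertInput shape is hypothetical), treats GPTBot, ClaudeBot, and PerplexityBot as the "major" crawlers, and uses three 4xx responses as the spike threshold. The article fixes only the 100 req/day limit; the other two choices are ours. Posting to the Slack webhook is left out.

```typescript
interface AlertInput {
  hitsLast48h: Record<string, number>;     // known bot -> hits in the last 48 hours
  llms4xxToday: number;                    // 4xx responses on /llms.txt or /llms-full.txt
  unknownUaToday: Record<string, number>;  // unclassified UA -> requests today
}

const MAJOR_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]; // assumption
const LLMS_4XX_LIMIT = 3;                                   // assumption
const UNKNOWN_UA_LIMIT = 100;                               // from the text

// Returns one alert key per rule that fired, in rule order.
function evaluateAlerts(s: AlertInput): string[] {
  const alerts: string[] = [];

  // Rule 1: a major AI crawler has gone silent for 48 hours.
  for (const bot of MAJOR_BOTS) {
    if ((s.hitsLast48h[bot] ?? 0) === 0) alerts.push(`zero-hits:${bot}`);
  }

  // Rule 2: the llms routes are returning client errors.
  if (s.llms4xxToday >= LLMS_4XX_LIMIT) alerts.push("llms-4xx-spike");

  // Rule 3: an unclassified user agent is above the daily limit.
  for (const ua of Object.keys(s.unknownUaToday)) {
    if ((s.unknownUaToday[ua] ?? 0) > UNKNOWN_UA_LIMIT) alerts.push(`unknown-ua:${ua}`);
  }

  return alerts;
}
```

Returning plain string keys keeps the function testable; the caller decides which keys map to which Slack channel.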

Analytics hygiene — the biggest own-goal

The number one mistake we see is merchants quoting conversion rate without excluding bot sessions. Across our 280-merchant panel, bots generate 4,000+ sessions per day per store. Plausible and PostHog both have bot-exclusion filters, and GA4 excludes known bots and spiders automatically rather than through a toggle. Enable the filters where they are optional, document the setting in your analytics runbook, and never let a stakeholder report a conversion rate computed on unfiltered data.
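A toy calculation shows how much unfiltered bot sessions can move the number. The Session shape and the conversionRate helper are hypothetical; a real analytics tool applies its own bot list, which will differ from the short regex used here.

```typescript
interface Session {
  ua: string;
  converted: boolean;
}

// Crawler tokens from this article; not an exhaustive bot list.
const BOT_RE = /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Googlebot|Bingbot|Amazonbot/;

// Conversion rate as a fraction, optionally dropping bot sessions first.
function conversionRate(sessions: readonly Session[], excludeBots: boolean): number {
  const pool: readonly Session[] = excludeBots
    ? sessions.filter((s) => !BOT_RE.test(s.ua))
    : sessions;
  if (pool.length === 0) return 0;
  return pool.filter((s) => s.converted).length / pool.length;
}

// Two human sessions (one converts) and two crawler sessions.
const sessions: Session[] = [
  { ua: "Mozilla/5.0 (Macintosh) Safari/605.1.15", converted: true },
  { ua: "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0", converted: false },
  { ua: "Mozilla/5.0 (compatible; GPTBot/1.2)", converted: false },
  { ua: "Mozilla/5.0 (compatible; ClaudeBot/1.0)", converted: false },
];

const unfiltered = conversionRate(sessions, false); // 1 / 4 = 0.25
const filtered = conversionRate(sessions, true);    // 1 / 2 = 0.5
```

In this made-up sample the reported rate halves when bots are left in, which is the kind of gap a runbook entry should prevent.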

Tags: crawlers, GPTBot, ClaudeBot, logs, observability, Shopify
