Most Shopify merchants have never opened an access log. In 2026 that's an expensive habit. AI crawlers are the earliest GEO signal you have — they show up in your logs two to four weeks before they show up in citations, and a broken llms.txt shows up same-day. This is what you should be looking at, and the pipeline to look at it without tailing a terminal at 2 AM.
Who's actually crawling Shopify stores
We took a 280-merchant Shopify panel (a mix of Basic, Plus, and Advanced) and ran their OpenLiteSpeed and nginx access logs through a single regex for thirty days in Q1 2026. The share isn't what most merchants guess. Classic Googlebot is no longer the dominant crawler: by raw hits, GPTBot is three times its size.

Two observations worth internalising. First, the four AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) together account for 85% of bot traffic. If your robots.txt accidentally blocks any one of them, you lose roughly a quarter of your AI discovery surface, more or less depending on which one it is. Second, the fetch targets differ: GPTBot prefers /llms-full.txt, ClaudeBot goes to product pages, PerplexityBot hits /llms.txt. That's not noise. It's the three engines grounding on different signals, and it's why shipping both files matters.
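A quick way to catch the accidental block is to test your robots.txt against each crawler token. Here's a minimal sketch in TypeScript, assuming you only care about the blunt case (a `Disallow: /` with no `Allow` override in the group that applies to the bot). The function name `isFullyBlocked` is ours, and this is not a full robots.txt parser.

```typescript
// Returns true when the group that applies to `botToken` (or the "*" group
// as a fallback) contains "Disallow: /" and no non-empty Allow rule.
// Simplified on purpose: path-level matching and wildcards are ignored.
function isFullyBlocked(robotsTxt: string, botToken: string): boolean {
  const groups = new Map<string, string[]>(); // agent token -> rules
  let agents: string[] = [];
  let inRules = false;

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      // A user-agent line after rules starts a new group.
      if (inRules) {
        agents = [];
        inRules = false;
      }
      const agent = value.toLowerCase();
      agents.push(agent);
      if (!groups.has(agent)) groups.set(agent, []);
    } else if (field === "disallow" || field === "allow") {
      inRules = true;
      for (const agent of agents) {
        groups.get(agent)?.push(`${field}:${value}`);
      }
    }
  }

  const rules = groups.get(botToken.toLowerCase()) ?? groups.get("*") ?? [];
  const hasAllow = rules.some(
    (r) => r.startsWith("allow:") && r.length > "allow:".length
  );
  return rules.includes("disallow:/") && !hasAllow;
}
```

Run it once per crawler token after every theme update or app install. A `true` for any of the four is the mistake described above.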
The user-agent strings you should recognise
Verbatim, these are the UAs to match on. Strings change occasionally (OpenAI shipped a new GPTBot version in late 2025); treat this list as a starting regex and update when a crawler publishes a new token.
- GPTBot: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot). OpenAI's web crawler. Grounds ChatGPT answers.
- ChatGPT-User: (compatible; ChatGPT-User/1.0; +https://openai.com/bot). On-demand fetches when a ChatGPT conversation follows a link. Different from GPTBot.
- ClaudeBot: Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com). Grounds Claude answers and computer-use mode.
- PerplexityBot: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/2.0). Feeds Perplexity's answer engine.
- Google-Extended: identified by UA token Google-Extended. Distinct from Googlebot. Controls whether Gemini and AI Overviews ingest the page.
- Applebot-Extended: Applebot-Extended. Controls whether Siri and Apple Intelligence ground on the site. Small share today, growing.
- Amazonbot: (compatible; Amazonbot/0.1; +https://developer.amazon.com/amazonbot). Crawls for Rufus and Alexa grounding.
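The list above maps naturally onto a small lookup table. Here's a sketch of a starting classifier in TypeScript. The tokens come from the list; the names `BOT_TOKENS` and `classifyBot` are ours. It matches on the product token rather than the full string, so a version bump doesn't break it.

```typescript
// Token table built from the user-agent list above.
// Extend it when a crawler publishes a new token.
const BOT_TOKENS: ReadonlyArray<readonly [string, RegExp]> = [
  ["GPTBot", /GPTBot\//i],
  ["ChatGPT-User", /ChatGPT-User\//i],
  ["ClaudeBot", /ClaudeBot\//i],
  ["PerplexityBot", /PerplexityBot\//i],
  ["Google-Extended", /Google-Extended/i],
  ["Applebot-Extended", /Applebot-Extended/i],
  ["Amazonbot", /Amazonbot\//i],
];

// Returns the canonical bot name, or null for anything unrecognised.
function classifyBot(userAgent: string): string | null {
  for (const [botName, pattern] of BOT_TOKENS) {
    if (pattern.test(userAgent)) return botName;
  }
  return null;
}
```

Anything that comes back `null` and isn't an ordinary browser is a candidate for the unknown-UA bucket.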
The three-stage pipeline
You don't need a data warehouse to track this. You need three cron jobs, a MariaDB table, and a Next.js admin page. Total implementation time on Surfient's side was about a day; the pipeline runs for pennies on the original Lightsail instance.

Stage 1 — ship logs off the box
Do not parse live logs on your web server. You'll compete for disk I/O with real traffic. Instead, rotate logs hourly and sync the rotated files to S3 every five minutes with s3cmd sync --skip-existing. For Shopify stores not on self-hosted infra, pipe your Cloudflare logs to a Worker that writes to R2 or S3 — same pattern, different origin.
Stage 2 — nightly parse into MariaDB
A Node worker runs once a day (say 03:00 local), streams the previous day's logs from S3, matches against the known-bots regex, and inserts one row per hit into a crawler_hits table. Schema: timestamp, bot, ua, status, path, bytes. Index on (bot, timestamp) and (timestamp). Median insert volume across our panel is ~4,000 rows/day/store — trivial.
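The parse step can be sketched as one pure function that turns a log line into a row shaped like the schema above. This assumes the combined log format (the nginx default); the regex and the names `CrawlerHit` and `parseHit` are illustrative, not the worker we run. The timestamp is kept as the raw log string, so converting it to a DATETIME before insert is left to the caller.

```typescript
// One row of crawler_hits: timestamp, bot, ua, status, path, bytes.
interface CrawlerHit {
  timestamp: string; // raw log timestamp, e.g. "10/Jan/2026:03:14:07 +0000"
  bot: string;
  ua: string;
  status: number;
  path: string;
  bytes: number;
}

// Combined log format: ip ident user [time] "METHOD path proto" status bytes "referer" "ua"
const COMBINED_LOG =
  /^\S+ \S+ \S+ \[([^\]]+)\] "(?:\S+) (\S+)[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"/;

const KNOWN_BOTS =
  /(GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended|Amazonbot)/;

// Returns a row for known-bot hits, or null for malformed lines and other traffic.
function parseHit(line: string): CrawlerHit | null {
  const m = COMBINED_LOG.exec(line);
  if (m === null) return null;
  const [, timestamp = "", path = "", status = "", bytes = "", ua = ""] = m;

  const botMatch = KNOWN_BOTS.exec(ua);
  if (botMatch === null) return null;

  return {
    timestamp,
    bot: botMatch[1] ?? "unknown",
    ua,
    status: Number(status),
    path,
    bytes: bytes === "-" ? 0 : Number(bytes), // "-" means no body was sent
  };
}
```

Because the function has no I/O, the nightly worker can stream lines from S3 through it and batch the non-null results into a single multi-row insert.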
Stage 3 — materialized view + admin page
MariaDB has no native materialized views, so use either a plain view or a scheduled INSERT INTO crawler_daily_agg that rolls up the hits by bot and path. Your admin page reads that rollup and renders a gauge, a 7-day trend line, a top-10 path table, and a zero-bot alert. If GPTBot drops to zero for 48 hours, a Slack webhook fires. If any unknown UA exceeds 100 req/day, another webhook fires.
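In application code, the rollup and the zero-bot check look roughly like this. It's a sketch under assumptions: the `DailyHit` shape, the `day|bot|path` key, and both function names are ours, and in production the rollup would more likely be a GROUP BY in MariaDB than a loop in Node.

```typescript
interface DailyHit {
  day: string; // e.g. "2026-01-10"
  bot: string;
  path: string;
}

// Counts hits per (day, bot, path), keyed as "day|bot|path".
function rollupHits(hits: DailyHit[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const h of hits) {
    const key = `${h.day}|${h.bot}|${h.path}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Returns the watched bots with no hit inside the threshold window.
// A bot that has never been seen counts as silent.
function silentBots(
  lastSeen: Map<string, Date>,
  watched: string[],
  now: Date,
  thresholdHours = 48
): string[] {
  const cutoff = now.getTime() - thresholdHours * 60 * 60 * 1000;
  return watched.filter((bot) => {
    const seen = lastSeen.get(bot);
    return seen === undefined || seen.getTime() < cutoff;
  });
}
```

Whatever `silentBots` returns is the payload for the Slack webhook. An empty array means nothing to send.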
Alerts that actually earn their keep
Not all alerts are useful. The three that have justified their noise for our panel:
- Zero hits from a major AI crawler for 48 hours. In 90% of cases this is a robots.txt mistake introduced by a theme update or app install. The fix is minutes; the citation loss from sleeping on it is weeks.
- 4xx spike to /llms.txt or /llms-full.txt. You broke the route. Agents that see 4xx three times in a row back off for days before retrying. Fix inside the next crawl window (~4 hours).
- Unknown UA > 100 req/day. Scraper, affiliate bot, or new legit crawler. Rate-limit at Cloudflare until you've classified it. Legit AI crawlers will retry on a polite back-off; scrapers won't.
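The second and third alerts reduce to two small checks over one day of parsed requests. A minimal sketch follows. The `LoggedRequest` shape and the function names are hypothetical, and the default spike threshold of 3 is our assumption (borrowed from the three-strikes back-off above); the 100 req/day limit is the one from the list.

```typescript
interface LoggedRequest {
  ua: string;
  path: string;
  status: number;
  knownBot: boolean; // true when the UA matched the known-bots regex
}

const LLMS_PATHS = new Set<string>(["/llms.txt", "/llms-full.txt"]);

// True when 4xx responses on the llms files reach the threshold.
function llms4xxSpike(requests: LoggedRequest[], threshold = 3): boolean {
  const failures = requests.filter(
    (r) => LLMS_PATHS.has(r.path) && r.status >= 400 && r.status < 500
  );
  return failures.length >= threshold;
}

// Returns unknown user agents that exceeded `limit` requests in the batch.
function noisyUnknownAgents(requests: LoggedRequest[], limit = 100): string[] {
  const perUa = new Map<string, number>();
  for (const r of requests) {
    if (r.knownBot) continue;
    perUa.set(r.ua, (perUa.get(r.ua) ?? 0) + 1);
  }
  return Array.from(perUa.entries())
    .filter(([, count]) => count > limit)
    .map(([ua]) => ua);
}
```

Pass one day of rows at a time. Filtering out ordinary browser traffic before calling `noisyUnknownAgents` is left to the caller, since this sketch treats every non-bot UA as unknown.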
Analytics hygiene — the biggest own-goal
The number one mistake we see is merchants quoting conversion rate without excluding bot sessions. Across our 280-merchant panel, that's 4,000+ bot sessions per day per store. Plausible and PostHog both have bot-exclusion filters; GA4 excludes known bots and spiders automatically. Confirm the filters are on, document it in your analytics runbook, and never let a stakeholder report a conversion rate computed on unfiltered data.
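To see how much the filter matters, here's a small worked example in TypeScript. The session counts are hypothetical, chosen only to make the arithmetic obvious; `StoreSession` and `conversionRate` are illustrative names.

```typescript
interface StoreSession {
  isBot: boolean;
  converted: boolean;
}

// Conversion rate = converted sessions / sessions in the pool.
// Returns 0 for an empty pool rather than dividing by zero.
function conversionRate(sessions: StoreSession[], excludeBots: boolean): number {
  const pool = excludeBots ? sessions.filter((s) => !s.isBot) : sessions;
  if (pool.length === 0) return 0;
  return pool.filter((s) => s.converted).length / pool.length;
}

// Hypothetical day: 1,000 human sessions with 30 orders, plus 4,000 bot sessions.
const exampleDay: StoreSession[] = [
  ...Array.from({ length: 30 }, () => ({ isBot: false, converted: true })),
  ...Array.from({ length: 970 }, () => ({ isBot: false, converted: false })),
  ...Array.from({ length: 4000 }, () => ({ isBot: true, converted: false })),
];

const unfilteredRate = conversionRate(exampleDay, false); // 30 / 5000 = 0.6%
const filteredRate = conversionRate(exampleDay, true); // 30 / 1000 = 3%
```

Same store, same orders, and the reported rate differs by a factor of five depending on whether bots are in the denominator.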