Frequently asked questions

How big should a prompt library be to start?

For a mid-sized Shopify brand, 60-120 prompts is the right starting scale. That’s small enough to maintain the weekly cycle manually in 2-3 hours per week, and large enough for your citation share numbers to stabilise (with fewer than 40 prompts, week-over-week metrics get noisy). Grow to 150-200 over the first six months as your measurement tooling matures. Above 300, you need automation or the cycle breaks.

We sell across 12 product categories. Should we have one library or twelve?

One library with a category column and filterable views. Twelve libraries means twelve Discover sessions, twelve Sample passes, and twelve Respond meetings, which quickly exceeds a GEO lead’s time budget. One library with category filters means the cycle runs once per week, the GEO lead can rotate deep focus across 2-3 categories per week, and the content team sees a unified priority queue. We’ve tried both shapes, and the unified library with filters wins at scale.

How do we handle prompts that fire on multiple engines differently (cited on Perplexity, not on ChatGPT)?

One row in the library, with the engine column listing the primary engine where you expect it to fire, and a separate per-engine presence matrix that lives in the measurement tooling. The library tracks the prompt and its target page; the measurement tooling tracks presence and rank per engine per sample. When you diagnose a cross-engine divergence, you’re answering two different questions (what the prompt is about, and where it is cited), so the data lives in two places. Collapsing them into one row creates illegible spreadsheets at any scale.
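The two-places split can be sketched as two record types joined by a prompt ID. This is a minimal illustration, not a description of any particular tool: the names LibraryRow, PresenceSample, prompt_id, and divergent_engines are all hypothetical, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryRow:
    """One row in the prompt library: what the prompt is about."""
    prompt_id: str
    prompt: str
    primary_engine: str
    target_page: str


@dataclass(frozen=True)
class PresenceSample:
    """One cell of the presence matrix: where the prompt is cited."""
    prompt_id: str
    engine: str
    sampled_on: str          # ISO date of the sampling pass
    cited: bool
    rank: Optional[int]      # citation slot, None when not cited


def divergent_engines(samples, prompt_id):
    """For one sampling pass, return (engines citing, engines not citing)."""
    mine = [s for s in samples if s.prompt_id == prompt_id]
    cited = sorted(s.engine for s in mine if s.cited)
    missing = sorted(s.engine for s in mine if not s.cited)
    return cited, missing


row = LibraryRow("p-001", "best wool rug for a hallway with two cats",
                 "Perplexity", "https://shop.example/collections/wool-rugs")
samples = [
    PresenceSample("p-001", "Perplexity", "2025-01-07", True, 2),
    PresenceSample("p-001", "ChatGPT", "2025-01-07", False, None),
]
print(divergent_engines(samples, row.prompt_id))
```

The library row never changes when an engine drops the citation; only the samples do, which keeps the library readable while the matrix grows.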

What tools do we need to maintain this cycle? Is a Notion table enough?

For the first 6 weeks, yes — a Notion or Airtable database with filter views is completely sufficient. The sampling is manual: a VA runs each prompt across the four engines on Tuesday and logs presence/rank in a separate sheet. The cost is 8-10 hours of VA time per week at 120-prompt scale. Around week 6-8, you’ll want to automate the Sample stage because the manual error rate becomes visible in your deltas. That’s when merchants typically buy Surfient or a similar tool. Discover and Respond stay manual forever because they’re judgement work, not measurement work.

Our SEO team is resistant — they’ve built their careers on keyword research. How do we manage the change?

Don’t position it as replacement; position it as augmentation. The SEO team still owns the keyword list for conventional SERP work (which still matters, especially on Google non-AI surfaces). The GEO team owns the prompt library for AI retrieval work. After 6-8 weeks, the prompt library produces enough clear wins that the SEO team will want to converge their thinking — but let that happen organically rather than top-down. The one non-negotiable: the prompt library’s Friday Respond session must drive sprint priority at least equally with the keyword-list-driven priorities, or the library never becomes real work.

The prompt library that replaces your keyword list

The keyword list was the best tool we had when search was a single-turn typed query. It’s a bad fit for AI retrieval, where a single buyer session spans three to five turns, crosses engines, and resolves through corroboration across multiple sources. The replacement artefact is a prompt library — a living document with seven columns per row and a weekly maintenance cycle. This post covers the schema, the lifecycle, and the common failure modes in the transition.

Why keyword lists fail the AI retrieval test

A keyword list carries four columns: keyword, volume, difficulty, cost-per-click. That schema is sufficient when your goal is to rank one URL for one query on one engine. None of those assumptions hold in AI retrieval. The “query” is a multi-turn conversation. The “URL” is one of several sources the engine synthesises. The “engine” is plural — Perplexity, AI Mode, ChatGPT, Claude all behave differently. And the “rank” is a probability of appearing in a citation shortlist, not a visible position 1-10.

The second problem is more insidious: keyword lists implicitly encode that volume is the right prioritisation signal. In AI retrieval, volume is noisy. A 300-volume comparison prompt with 82% purchase intent outperforms a 12,000-volume informational prompt with 5% intent by roughly 40x in attributable pipeline per citation. Volume-driven prioritisation misses where the money actually flows.
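A small sketch of how the two prioritisation signals disagree, using only the volume and intent figures quoted above. The field names are illustrative. Note that the intent ratio on its own comes to roughly 16x; the roughly 40x pipeline-per-citation figure depends on attribution data that is not reproduced here, so the code does not attempt to derive it.

```python
# The two example prompts from the paragraph above (field names are illustrative).
prompts = [
    {"prompt": "comparison prompt", "volume": 300, "purchase_intent": 0.82},
    {"prompt": "informational prompt", "volume": 12_000, "purchase_intent": 0.05},
]

# Keyword-list habit: sort by volume, highest first.
by_volume = sorted(prompts, key=lambda p: p["volume"], reverse=True)

# Prompt-library habit: sort by purchase intent, highest first.
by_intent = sorted(prompts, key=lambda p: p["purchase_intent"], reverse=True)

# Ratio of intent rates only; this is not the pipeline-per-citation ratio.
intent_ratio = prompts[0]["purchase_intent"] / prompts[1]["purchase_intent"]

print("top by volume:", by_volume[0]["prompt"])
print("top by intent:", by_intent[0]["prompt"])
print("intent ratio:", round(intent_ratio, 1))  # about 16.4
```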

Figure: two-column comparison of the keyword list schema and the prompt library schema. Left (red): the traditional keyword list with four columns (keyword, monthly volume, difficulty, CPC) and six reasons it fails for AI retrieval: no intent signal, no turn position, no buyer stage, no engine specificity, no corroboration map, no citation target. Right (cyan): the prompt library with seven columns (prompt, intent, buyer stage, engine, turn position, corroboration needs, citation target page) and five reasons it works for AI retrieval. The left side is volume-driven; the right side is driven by intent, turn, and stage, and carries the context AI retrieval actually uses.

The seven columns of a prompt library

Our working prompt library template has seven columns because each one drives a different downstream decision. Drop any column and you lose optionality on something meaningful.

Prompt. The natural-language question exactly as a buyer would phrase it. Not keyword-ified. Full sentences with the buyer’s context (“under $800”, “for kids’ room”, “with two cats and a toddler”). These details matter because engines work from the full phrasing of the question, context included, not from a keyword stem.

Intent. Research / comparison / high-intent shortlist / post-purchase / objection. Intent decides what kind of page wins the citation. Research prompts win on long-form guides; shortlist prompts win on collection pages with AggregateRating; objection prompts win on FAQPage content.

Buyer stage. Where in the funnel the buyer is when this prompt fires. Awareness, consideration, comparison, decision, post-purchase. A single prompt can fire at multiple stages, but most have a dominant one.

Engine. Which engine(s) this prompt is most likely to fire on. Commerce prompts cluster on Perplexity Shopping and AI Mode; research prompts cluster on ChatGPT and Claude. Engine column tells you where to measure first.

Turn position. Turn 1 / turn 2 / turn 3+. Turn 1 opens the session (“best wool rug for X”). Turn 2 narrows (“between Y and Z, which is better for…”). Turn 3 handles objections (“is it safe for…”). Different page types win different turns.

Corroboration needs. Where else must the answer appear for the engine to trust it? Reddit thread, buying guide, review platform, expert publication. This column drives off-site content work and PR.

Citation target page. The specific URL on your site that should win the citation for this prompt. Usually a PDP, collection, FAQ, or long-form guide. Making this explicit enables measurement: you can check whether the engine cited the intended page or a different one, and adjust.
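The seven columns can be sketched as a typed record. This is one possible shape, not a canonical template: the class and field names are assumptions, the enum values are taken from the column descriptions above, and the six-word minimum in problems() is a crude, hypothetical heuristic for telling a buyer sentence from a keyword phrase.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    RESEARCH = "research"
    COMPARISON = "comparison"
    SHORTLIST = "high-intent shortlist"
    POST_PURCHASE = "post-purchase"
    OBJECTION = "objection"


class BuyerStage(Enum):
    AWARENESS = "awareness"
    CONSIDERATION = "consideration"
    COMPARISON = "comparison"
    DECISION = "decision"
    POST_PURCHASE = "post-purchase"


@dataclass
class PromptRow:
    prompt: str                      # full buyer sentence, not a keyword phrase
    intent: Intent
    buyer_stage: BuyerStage          # the dominant stage for this prompt
    engines: list[str]               # where to measure first
    turn_position: int               # 1, 2, or 3 (3 stands for "3+")
    corroboration_needs: list[str]   # e.g. "Reddit thread", "buying guide"
    citation_target_page: str        # one specific URL

    def problems(self) -> list[str]:
        """Return schema-discipline issues; an empty list means the row is usable."""
        issues = []
        if len(self.prompt.split()) < 6:  # assumed threshold, tune to taste
            issues.append("prompt reads like a keyword phrase")
        if not self.citation_target_page.startswith("https://"):
            issues.append("citation target must be one specific URL")
        if self.turn_position not in (1, 2, 3):
            issues.append("turn position must be 1, 2, or 3")
        if not self.engines:
            issues.append("at least one engine is required")
        return issues


good = PromptRow(
    prompt="best wool rug 8x10 for high-traffic hallway with two cats",
    intent=Intent.SHORTLIST,
    buyer_stage=BuyerStage.DECISION,
    engines=["Perplexity", "AI Mode"],
    turn_position=1,
    corroboration_needs=["Reddit thread", "buying guide"],
    citation_target_page="https://shop.example/collections/wool-rugs",
)
bad = PromptRow("wool rug 8x10 pets", Intent.RESEARCH, BuyerStage.AWARENESS,
                [], 4, [], "any of our PDPs")
print(good.problems())
print(bad.problems())
```

Running a check like this at Discover time is one way to keep rows from drifting back toward keyword phrases and vague targets.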

The four-stage weekly lifecycle

A prompt library isn’t a one-off artefact — it’s a living document maintained on a weekly cadence. Treated as a one-off, it becomes a keyword list wearing a better schema. Treated as a living cycle, it becomes the operational nerve centre of your GEO programme.

Figure: the four-stage weekly prompt-library lifecycle. Stage 01, Discover (Monday AM, 1 hour): sources include sales call transcripts, support tickets, Reddit, and customer interviews; output is 5-12 new prompt rows. Stage 02, Sample (Tuesday, 2 hours): runs prompts across Perplexity, Google AI Mode, ChatGPT, and Claude; output is presence and rank per prompt per engine. Stage 03, Measure (Wednesday-Thursday, 2 hours): computes citation share, week-over-week deltas, competitor comparison, and stage and turn segmentation; output is a dashboard update and anomaly alerts. Stage 04, Respond (Friday, 1 hour): prioritises new citation-target pages, schema upgrades, corroboration work, and prompt retirements; output is the next-sprint queue. About 6 hours of team time per week at 186-prompt scale.

Monday — Discover

Surface new candidate prompts from the places your buyers ask real questions. Sales call transcripts are the richest source: the questions prospects ask in discovery calls map directly to turn-1 and turn-2 prompts they’d put to an AI. Support tickets surface objection prompts (turn-3). Reddit subreddits in your category surface the comparison prompts your buyers are actually posting. The discipline: add 5-12 new prompts per week; fill all seven columns on each; don’t add “obvious” prompts that don’t match a real buyer voice.

Tuesday — Sample

Run every prompt in the library across the four primary engines in a consistent order, capturing presence (cited or not) and rank (which citation slot if cited). Tooling ranges from manual (a VA with a script and a spreadsheet) to fully automated (Surfient runs this continuously and surfaces the deltas). Either way, the output is a fresh presence matrix.
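A sketch of what one Sample pass records, whether a person or a tool fills it in. Here fetch_citations is a hypothetical stand-in for however citations get collected (a VA pasting URLs, or a tool export); it is not a real API. It is assumed to return the cited URLs in citation-slot order for one engine and one prompt.

```python
ENGINES = ["Perplexity", "AI Mode", "ChatGPT", "Claude"]


def sample_prompt(prompt, target_page, fetch_citations, engines=ENGINES):
    """Run one prompt across all engines in a fixed order.

    Returns {engine: {"cited": bool, "rank": int or None}}, where rank is the
    1-based citation slot of target_page, or None when it was not cited.
    """
    results = {}
    for engine in engines:
        cited_urls = fetch_citations(engine, prompt)
        if target_page in cited_urls:
            rank = cited_urls.index(target_page) + 1
        else:
            rank = None
        results[engine] = {"cited": rank is not None, "rank": rank}
    return results


# Fake data standing in for a real sampling run.
FAKE_ANSWERS = {
    ("Perplexity", "q1"): ["https://other.example/guide",
                           "https://shop.example/collections/wool-rugs"],
    ("ChatGPT", "q1"): ["https://other.example/guide"],
}


def fake_fetch(engine, prompt):
    return FAKE_ANSWERS.get((engine, prompt), [])


matrix_row = sample_prompt("q1", "https://shop.example/collections/wool-rugs",
                           fake_fetch)
print(matrix_row)
```

Keeping the engine order fixed and the output shape identical every week is what makes the following week's diff meaningful.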

Wed-Thu — Measure

Aggregate into citation share numbers at three granularities: overall, per-engine, per-intent-stage. Diff against last week and flag anything shifting by 5 points or more. Compare against your top 3 category competitors on the same cohort. Summarise findings in a dashboard update that goes to GEO lead, content lead, and head of merchandising.
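The per-engine aggregation and the 5-point flag can be sketched as follows. One assumption to call out: citation share is taken here to mean the percentage of sampled prompts on which the site was cited, which the text above does not define precisely. The function names are illustrative.

```python
from collections import defaultdict


def citation_share(samples):
    """samples: iterable of (engine, cited) pairs.

    Returns {engine: percentage of samples where the site was cited}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for engine, cited in samples:
        totals[engine] += 1
        hits[engine] += int(cited)
    return {engine: 100.0 * hits[engine] / totals[engine] for engine in totals}


def flag_shifts(this_week, last_week, threshold=5.0):
    """Return {engine: delta in points} for shifts of threshold or more."""
    flagged = {}
    for engine, share in this_week.items():
        if engine not in last_week:
            continue  # no baseline yet, nothing to diff against
        delta = share - last_week[engine]
        if abs(delta) >= threshold:
            flagged[engine] = delta
    return flagged


samples = [
    ("Perplexity", True), ("Perplexity", True),
    ("Perplexity", False), ("Perplexity", False),
    ("ChatGPT", True), ("ChatGPT", False),
    ("ChatGPT", False), ("ChatGPT", False),
]
this_week = citation_share(samples)
last_week = {"Perplexity": 40.0, "ChatGPT": 27.0}
print(this_week)
print(flag_shifts(this_week, last_week))
```

The same two functions apply to the other granularities by changing the grouping key from engine to intent or buyer stage.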

Friday — Respond

Turn findings into sprint tickets. Prompts with strong intent but weak citation share get new content work. Pages that are winning citations but only on high-cost engines get prioritised for schema work. Prompts that have lost citation share get corroboration tickets (new Reddit engagement, guide placements, PR outreach). Prompts that have been dead for 30+ days get retired to archive. The output is a clean, prioritised queue for content and dev to pick up next week.
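The triage rules above can be written down as a small decision function so the Friday session starts from a pre-sorted list. This is a sketch under stated assumptions: the field names, the 20-point cut-off for "weak" share, the reuse of a 5-point drop for "lost" share, and the order of precedence are all illustrative choices. The schema-work rule for high-cost engines is left out because it needs per-engine cost data not described here.

```python
def triage(row, weak_share=20.0, loss_threshold=5.0, dead_days=30):
    """Map one prompt's weekly numbers to a ticket type.

    row keys (all hypothetical): days_at_zero, share_delta (points vs last
    week), high_intent (bool), citation_share (percent).
    """
    if row["days_at_zero"] >= dead_days:
        return "retire"
    if row["share_delta"] <= -loss_threshold:
        return "corroboration"
    if row["high_intent"] and row["citation_share"] < weak_share:
        return "content"
    return "no action"


queue = [
    {"prompt": "A", "days_at_zero": 45, "share_delta": 0.0,
     "high_intent": True, "citation_share": 0.0},
    {"prompt": "B", "days_at_zero": 0, "share_delta": -8.0,
     "high_intent": False, "citation_share": 30.0},
    {"prompt": "C", "days_at_zero": 3, "share_delta": 0.0,
     "high_intent": True, "citation_share": 10.0},
    {"prompt": "D", "days_at_zero": 0, "share_delta": 1.0,
     "high_intent": True, "citation_share": 55.0},
]
tickets = {row["prompt"]: triage(row) for row in queue}
print(tickets)
```

The function only proposes a ticket type; deciding which tickets actually enter the sprint stays judgement work for the session itself.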

Common failure modes in the transition

Failure mode #1: keyword list with renamed columns. Team renames “keyword” to “prompt” and “volume” to “priority score” but populates the new columns with the same logic. Result: same behaviour, new spreadsheet. Cure: enforce the schema discipline that prompts must be full sentences with buyer context, not keyword phrases.

Failure mode #2: skipping the Respond stage. Teams measure diligently but never convert findings into content or dev sprint work. The prompt library becomes a dashboard rather than an operational artefact. Cure: treat the Friday Respond session as the most important meeting of the week; if nothing changes in the next sprint, the library isn’t working.

Failure mode #3: library bloat. Every prompt ever discovered stays forever, even when citation share is persistently zero and nothing has changed in the market. Library grows to 900+ rows and the weekly cadence becomes unsustainable. Cure: retire prompts aggressively; archive anything at zero citation share for 30+ days with no active content fix scheduled.

Failure mode #4: single-engine sampling. Team tracks Perplexity only because it’s the easiest to automate. Misses shifts on AI Mode or ChatGPT that would change priority. Cure: all four primary engines every week, even if the tooling is manual on two of them initially.
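Two of the cures above are mechanical enough to check automatically: the archive rule for library bloat and the four-engine rule for sampling coverage. The sketch below assumes each row carries a zero_since date (None while the prompt is still being cited) and a fix_scheduled flag; both field names are hypothetical.

```python
from datetime import date

PRIMARY_ENGINES = {"Perplexity", "AI Mode", "ChatGPT", "Claude"}


def archive_candidates(rows, today, min_days=30):
    """Prompts at zero citation share for min_days or more with no fix scheduled."""
    out = []
    for row in rows:
        zero_since = row["zero_since"]
        if zero_since is None or row["fix_scheduled"]:
            continue
        if (today - zero_since).days >= min_days:
            out.append(row["prompt"])
    return out


def missing_engines(sampled_engines):
    """Primary engines that were skipped in this week's Sample pass."""
    return sorted(PRIMARY_ENGINES - set(sampled_engines))


today = date(2025, 3, 31)
rows = [
    {"prompt": "stale", "zero_since": date(2025, 2, 1), "fix_scheduled": False},
    {"prompt": "recent", "zero_since": date(2025, 3, 20), "fix_scheduled": False},
    {"prompt": "in progress", "zero_since": date(2025, 1, 1), "fix_scheduled": True},
    {"prompt": "healthy", "zero_since": None, "fix_scheduled": False},
]
print(archive_candidates(rows, today))
print(missing_engines(["Perplexity"]))
```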

Kill the volume column. It’s an attractor for old habits. Replace with intent + buyer stage. If you miss a traffic metric, add estimated-monthly-traffic to the citation-target page, not the prompt.
Every prompt needs a citation-target page. “Any of our PDPs” isn’t a citation target. Pick the specific URL you want the engine to cite, and measure against it.
Full sentences, not phrases. “best wool rug 8x10 for high-traffic hallway with two cats” is a prompt. “wool rug 8x10 pets” is a keyword phrase. They behave totally differently in AI retrieval.
Include buyer voice prompts from calls and support. Not just your marketing team’s imagined prompts. Customer language reliably out-performs internal guesses.
Treat the Friday Respond session as the point of the cycle. If nothing changes in sprint planning, the library is decorative and should be killed rather than maintained.

Closing — the artefact shapes the work

The deepest cost of keeping a keyword list after moving to GEO isn’t the list itself — it’s that the artefact shapes the conversations your team has. Weekly meetings organised around keyword volume debate whether to target a 12K-volume keyword or a 3K-volume keyword. Weekly meetings organised around a prompt library debate whether to add a FAQ to a collection page or to seed a Reddit thread with buyer language. Different artefact, different strategy surface, different outcomes.

Tags: prompts, keywords, geo, process, measurement

