
Measuring AI Search Visibility: Why One-Shot Checks Lie

Most AI search visibility tools report one number from one run. The number is unstable. Here is the statistical case for measuring visibility as a distribution, with confidence intervals and a stability score.

Adrian Veller | June 11, 2026 | 10 min read

Ask ChatGPT the same brand question ten times and you will get ten different answers. Some will name your brand. Some will not. The order will shift. The cited sources will change. This is not a bug — it is the defining property of the medium. Yet most AI search visibility dashboards still report a single number, sampled from a single run, as if it were ground truth. That number is, in the strict statistical sense, a guess.

This is the gap a recent St. Gallen research paper (Schulte, Bleeker, & Kaufmann, 2026) makes precise. Their central claim — and the one this article builds on — is that AI search visibility is a distribution, not a point. A defensible measurement program needs two dimensions: how often a brand appears, and how stable that frequency is. Most of the industry reports only the first, and even then reports it once.

What is wrong with measuring AI search visibility once?

The short answer: large language model outputs are stochastic. Same prompt, same model, same day — different responses. Variance is built into how these systems are sampled. A single measurement is therefore one observation drawn from a hidden distribution whose shape you do not know.

In a traditional SEO world, this was less of a problem. Ranking on Google for a given query was a deterministic-feeling outcome. You crawled, you saw a position, you logged it. The position changed slowly. With generative engines, the equivalent of "position" is "did the model name us, and where in the answer." That outcome flickers between calls.

The practical consequence is brutal. If you check a query once and your brand was mentioned, you might log "100% mention rate for that query." Run it twenty more times and the true rate could be 35%. Or 80%. The single check is compatible with both. Reporting the single check is not measurement. It is sampling without statistics.

This is where most AI visibility tools break. They run a prompt, parse the answer, store a row, and call it tracking. The dashboards built on that data look smooth. They are not.

Why AI search visibility is a distribution, not a snapshot

The cleanest way to think about this is to forget the language of "rankings" entirely. For any given prompt, your brand has a true probability p of being mentioned by a given model. You cannot observe p directly. You can only sample it. Each query run is a Bernoulli trial — present, or not — and across many trials the share of positive samples gives you an estimate p̂ (read: "p-hat") of the true rate.

A single trial gives you no idea how close p̂ is to p. Twenty trials start to narrow the range. Thirty plus, with reasonable assumptions, give you a confidence interval you can report without embarrassment. The math is undergraduate-level — binomial proportion confidence intervals — but the implication is what matters: every visibility number deserves a band, not a point.
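To make the band concrete, here is a minimal sketch of a 95% Wilson score interval for a mention rate. The function name and the example counts are illustrative assumptions, not something the paper prescribes.

```python
import math

def wilson_interval(mentions: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = mentions / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts: one run with a mention vs. thirty runs with twelve mentions.
print(wilson_interval(1, 1))
print(wilson_interval(12, 30))
```

With these made-up counts, a single positive run leaves an interval of roughly 21% to 100%, while twelve mentions in thirty runs narrow it to roughly 25% to 58%.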

Where it appears in the answer is even more variable. Was your brand the first vendor named? The third? Buried in a footnote citation? That is a categorical distribution, not a binary one. Measuring it correctly means tracking the full position frequency, not just whether you showed up.
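A position frequency can be tallied in a few lines. This sketch assumes each run has already been parsed into an integer position (1 = first vendor named) or `None` when the brand is absent. That encoding and the sample log are hypothetical.

```python
from collections import Counter
from typing import Optional

def position_distribution(positions: list[Optional[int]]) -> dict[str, float]:
    """Share of runs at each mention position; None means the brand was absent."""
    if not positions:
        raise ValueError("need at least one run")
    counts = Counter("absent" if p is None else f"position_{p}" for p in positions)
    total = len(positions)
    # most_common() orders the result from most to least frequent outcome.
    return {label: count / total for label, count in counts.most_common()}

# Hypothetical log of ten runs for one prompt on one model.
runs = [1, 3, None, 1, 2, None, 1, None, 3, 1]
print(position_distribution(runs))
```

Keeping "absent" as its own category means the shares always sum to one, so the mention rate and the position breakdown come from the same table.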

The St. Gallen team formalized this across four sectors — automotive, banking, sports retail, food retail — and showed that variance is not a quirk of a few prompts. It is structural. The same finding holds when you swap providers. Variance patterns differ across models — retrieval-anchored engines look different from pure generative ones — but every model produces a distribution. None produces a point. Measuring on just one model is therefore as misleading as measuring on just one run; the entire model panel needs its own sampling discipline.

The two axes that matter: frequency and stability

Once you accept that visibility is a distribution, the right way to summarize it is two numbers, not one.

Frequency — p̂, the share of runs that mention your brand. This is the headline KPI most people already track.
Stability — how tight the distribution is around p̂. Operationally, the half-width of the 95% confidence interval, or a coefficient of variation across repeated batches.

These two combine into a simple two-by-two map, with frequency on one axis and stability on the other.

The reliable visibility quadrant (high frequency, high stability) is the only state you can defensibly report to a CMO. The lucky hit quadrant (high frequency, low stability) is where most one-shot dashboards live: a high number that will not survive the next run. The unstable ghost (low frequency, low stability) is the scariest case: occasional appearances that look like signal in a single sample but average to noise. Invisible-but-stable (low frequency, high stability) is at least honest.

This frame is also useful for prioritization. A prompt where you are stable-and-low is a content gap. A prompt where you are unstable-and-high is a defense problem — your appearance is real, but the model is uncertain about you, which usually points to weak entity signals or thin source coverage. The action plan is different in each case.
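The quadrant assignment can be sketched as a small classifier. The cutoffs below (0.5 for frequency, 0.7 for a stability value scaled to 0–1) are placeholder assumptions to tune per program. Reading "lucky hit" as high frequency with low stability, and "unstable ghost" as low on both, is an interpretation of the quadrant names.

```python
def classify_quadrant(frequency: float, stability: float,
                      freq_cutoff: float = 0.5, stab_cutoff: float = 0.7) -> str:
    """Map a (frequency, stability) pair, both in [0, 1], to a quadrant label.

    The cutoffs are hypothetical defaults, not values from the paper.
    """
    high_freq = frequency >= freq_cutoff
    high_stab = stability >= stab_cutoff
    if high_freq and high_stab:
        return "reliable visibility"
    if high_freq:
        return "lucky hit"
    if high_stab:
        return "invisible but stable"
    return "unstable ghost"

print(classify_quadrant(0.8, 0.9))
print(classify_quadrant(0.8, 0.3))
```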

How many measurements do you actually need?

The honest answer depends on how precise your decisions need to be. The useful answer is: about thirty runs per prompt is the minimum that lets you report a confidence interval without cringing.

The arithmetic is straightforward. For a binomial proportion at a 95% confidence level, the half-width of the interval at p̂ = 0.5 (the worst case) is roughly 1.96 × 0.5/√n, or about 1/√n (the Wilson interval comes out slightly narrower at small n). Thirty runs put your half-width near ±18 percentage points. A hundred runs tighten it to ±10. A thousand runs get you to ±3. Thirty is not a luxury. It is the floor.
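The figures above can be checked with a short calculation. This sketch uses the plain normal approximation at p = 0.5 (an assumption made for simplicity; a Wilson interval would be slightly narrower at small n), and the helper names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal

def worst_case_half_width(n: int, z: float = Z_95) -> float:
    """Normal-approximation half-width at p = 0.5: z * sqrt(0.25 / n)."""
    return z * math.sqrt(0.25 / n)

def runs_needed(target_half_width: float, z: float = Z_95) -> int:
    """Smallest n whose worst-case half-width is at or below the target."""
    return math.ceil((z * 0.5 / target_half_width) ** 2)

for n in (30, 100, 1000):
    print(n, round(100 * worst_case_half_width(n), 1), "percentage points")

print(runs_needed(0.10))  # runs for a half-width of 10 percentage points
```

Inverting the formula is the useful direction in planning: pick the half-width a decision can tolerate, then read off the number of runs.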

This sounds expensive until you do the math against the alternative. If you are tracking 50 priority prompts across 5 models with one run each, you are paying for 250 calls and reporting numbers that could be off by 50 percentage points either way. Run the same 50 prompts 30 times each on 3 models and you are at 4,500 calls — 18× more — but you are now reporting numbers a board can act on. The waste is in the small budget, not the large one.

A few practical refinements that the paper does not spell out but matter in production:

Time the runs apart, not in a tight burst. Some platforms cache responses for short windows; thirty runs in three minutes can give you near-zero variance for the wrong reason.
Rotate prompt phrasing variants if your real-world question has natural paraphrases. A model's behavior on "best CRM for small teams" and "what CRM should a small team use" is different, and pretending they are the same prompt understates real-world variance.
Track session vs. fresh chat behavior separately. Some platforms behave differently when prior turns prime the context.

For a working baseline: thirty runs, fresh sessions, one per minute, across the prompts you actually intend to defend in front of a leadership team.
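A sampling loop for that baseline might look like the sketch below. `ask_model` is a hypothetical callable you supply: it is assumed to open a fresh session on every call and return the answer text. The case-insensitive substring match is a deliberately naive stand-in for real brand detection.

```python
import time
from typing import Callable

def collect_runs(prompt: str,
                 ask_model: Callable[[str], str],
                 brand: str,
                 runs: int = 30,
                 spacing_seconds: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> list[dict]:
    """Run one prompt repeatedly, one call per run, spaced apart in time."""
    records = []
    for i in range(runs):
        answer = ask_model(prompt)  # hypothetical: must not reuse prior context
        records.append({
            "run": i + 1,
            "timestamp": time.time(),
            "answer": answer,  # keep the full text for later position parsing
            "mentioned": brand.lower() in answer.lower(),  # naive match
        })
        if i < runs - 1:
            sleep(spacing_seconds)  # spread the runs out to avoid cached replies
    return records
```

Passing `sleep` as a parameter keeps the spacing testable: a dry run can substitute a no-op instead of waiting thirty minutes.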

What this means for how you report AI search visibility

The reporting format changes in three concrete ways.

Every visibility KPI gets a confidence band. Brand mention rate, citation rate, position frequency, share of voice — none of them belong on a slide as a single number anymore. The format is p̂ ± half-width, or "32% (95% CI: 24–41%)." If your tooling cannot produce that band, your tooling is not measuring; it is sampling.

Every report carries a stability score alongside the headline. The simplest version is the half-width of the CI. A more analyst-friendly version is a 0–100 stability score derived from coefficient of variation across batches. Either way, the second number prevents the lucky-hit failure mode.
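One possible way to turn batch-level variation into a 0–100 score is sketched here. The mapping (100 × (1 − CV), clipped at zero) is an illustrative assumption rather than a standard or the paper's definition. The batch rates are made-up inputs, each one the mention rate from a separate batch of runs.

```python
import statistics

def stability_score(batch_rates: list[float]) -> float:
    """Hypothetical 0-100 score from the coefficient of variation across batches."""
    if len(batch_rates) < 2:
        raise ValueError("need at least two batches")
    mean = statistics.mean(batch_rates)
    if mean == 0:
        # Never mentioned in any batch: invisible, but perfectly consistent.
        return 100.0
    cv = statistics.stdev(batch_rates) / mean  # sample standard deviation
    return 100.0 * max(0.0, 1.0 - cv)

print(stability_score([0.38, 0.42, 0.40]))  # tight batches
print(stability_score([0.10, 0.70, 0.40]))  # same mean, wide spread
```

Both example series average a 40% mention rate, yet the first scores far higher than the second. That separation is what the second number is for.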

Trends compare distributions, not points. When you say "our visibility went up week-over-week," the question to answer is whether the new distribution is meaningfully different from last week's, not whether last week's single measurement is lower than this week's single measurement. A two-proportion z-test or a non-overlapping CI test is the right bar.
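A week-over-week comparison can be sketched with a pooled two-proportion z-test. The function name and counts are illustrative. The test assumes independent runs and samples large enough for the normal approximation to be reasonable.

```python
import math

def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both samples all-present or all-absent: no difference
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical: 10 of 30 mentions last week, 14 of 30 this week.
z, p = two_proportion_z_test(10, 30, 14, 30)
print(round(z, 2), round(p, 3))
```

In this made-up case the mention rate rose from 33% to 47%, but the p-value is well above 0.05, so thirty runs per week cannot distinguish the change from noise.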

If you are already running a KPI dashboard, our companion piece on AI search visibility metrics and KPIs is where this methodology gets wired into actual KPI definitions. If you are still picking the platform that will run the measurements, the AI visibility tools buyer's guide compares vendors on whether they actually report intervals or only sample once.

A measurement protocol you can run starting Monday

You do not need to wait for new tooling to start doing this right. The minimum viable protocol fits on one page.

Pick 20–50 prompts that matter. Real buyer questions, written in natural language, including competitor comparisons and category-defining questions. Not keywords. Questions.
Pick 3–5 models that matter to your buyers. Cover the engines your audience actually queries — typically a mix of conversational and retrieval-anchored systems. One model is not a benchmark, it is a hostage.
Run each prompt 30 times per model, fresh sessions, spaced over a few hours. Log the full answer text, the cited sources, and the position of your brand mention.
Compute three things per prompt-model pair: mention rate p̂, the 95% CI half-width, and the most common position when you are mentioned.
Report headline, band, and stability score together. No single numbers. Ever.
Re-run the protocol on a fixed cadence — weekly for active campaigns, monthly for steady-state monitoring — and compare distributions, not points.
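The three per-pair numbers in the protocol can be computed together. This sketch assumes each prompt-model pair has been reduced to a list of positions, with `None` for runs where the brand was absent, and uses the Wilson interval's half-width as the band. The field names are hypothetical.

```python
import math
from collections import Counter
from typing import Optional

def summarize_pair(positions: list[Optional[int]], z: float = 1.96) -> dict:
    """Mention rate, 95% Wilson half-width, and modal position for one pair."""
    n = len(positions)
    if n == 0:
        raise ValueError("no runs logged for this prompt-model pair")
    hits = [p for p in positions if p is not None]
    p_hat = len(hits) / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    # Most common position among runs that mentioned the brand, if any did.
    top_position = Counter(hits).most_common(1)[0][0] if hits else None
    return {"runs": n, "mention_rate": p_hat,
            "ci_half_width": half, "top_position": top_position}

# Hypothetical thirty runs: 8 first-place mentions, 4 second-place, 18 absent.
print(summarize_pair([1] * 8 + [2] * 4 + [None] * 18))
```

Note that a Wilson interval is centered slightly toward 50% rather than exactly on p̂, so a report that prints "p̂ ± half-width" should say which interval the half-width comes from.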

This is more rigorous than what most agencies ship today. It is also dramatically cheaper than buying a third dashboard that gives you the same wrong answer in a prettier wrapper. The methodology is the moat. The tools are downstream.

FAQ: AI search visibility measurement

How many runs are enough to call a measurement reliable?

For a 95% CI half-width of ±10 percentage points, plan for around 100 runs per prompt. ±18 percentage points needs ~30. Below 30 you are not reporting a measurement, you are reporting an anecdote.

Does this apply to every model equally?

Variance differs across models. Models that expose temperature controls (and where you have set temperature low) are more stable. Conversational models with retrieval components tend to be more stable than pure generative models because the retrieval anchors the response. Always measure variance per model rather than assuming.

Is one measurement ever enough?

For an executive snapshot of "are we anywhere in this answer at all," a one-shot check is fine as a directional smoke test. For any decision involving budget, content investment, or competitive comparison, no.

How does this change competitor share-of-voice reporting?

Competitor share of voice should be reported as a distribution per competitor, not as a stack of points. The ranking among competitors can flip across runs. A leadership-grade competitive readout reports each competitor as a band, with a non-overlap test before claiming one is ahead.

Where does this leave click-through and traffic data?

Untouched, and still secondary. Click-through measures the engagement layer, not the visibility layer. The point of distribution-based visibility measurement is to give you a defensible KPI for the layer where users see your brand without ever clicking — which is most of AI search.

References

Schulte, A., Bleeker, J., & Kaufmann, R. (2026). Don't Measure Once: Measuring Visibility in AI Search (GEO). University of St. Gallen. arXiv:2604.07585

Author

Adrian Veller

Adrian Veller designs the measurement frameworks behind GEO programs at scale. His work brings statistical rigor — distributions, confidence intervals, stability — to a field still treating one-shot prompts as ground truth.