
# Prompt Sensitivity Bench

An open benchmark measuring how prompt wording changes model outputs across the model size spectrum.

The largest measured effect so far is specificity: on 1-4B local models, moving from a vague prompt to a task statement plus an input/output spec raised the pass rate from 8% to 82%.
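As a hypothetical illustration only (the benchmark's actual task wording lives in the task definitions, not here), the gap between a vague prompt and a task-plus-spec prompt might look like:

```python
# Hypothetical prompt pair -- not the benchmark's real prompts.
vague = "Fix this CSV."

specified = """Task: Deduplicate rows in a CSV file.
Input: CSV text with a header row on the first line.
Output: The same CSV with exact duplicate rows removed,
header preserved, remaining rows in their original order."""

# The spec'd prompt names the task, the input format, and the expected
# output format -- the added structure the specificity finding credits.
```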

:octicons-graph-16: Findings :octicons-beaker-16: Methodology :octicons-database-16: Data :octicons-git-pull-request-16: Contribute

## What This Is

This is a versioned benchmark with derived prompt sensitivity findings. Each finding links to the public summary, the task definitions, and the command needed to reproduce or extend the measurement on your own run archive.

## What This Is Not

This is not a capability leaderboard. LMArena, Artificial Analysis, HELM, LiveBench, and coding leaderboards already measure model strength. This project measures how much model behavior changes when the prompt changes.

This is not a product page. It is not an evaluation framework. It is not a benchmark submission tracker.

## Current Findings

| Finding | Headline |
| --- | --- |
| Specificity | Vague prompts fail; input/output specs create the largest measured jump. |
| Complexity | More prompt detail can hurt small models. |
| Filler words | Text humans treat as filler can be structure for small models. |
| Format preference | XML, Markdown, and plain text were indistinguishable in delimiter-only coding tests. |
| k=1 trap | Single-shot measurements can reverse conclusions on boundary models. |
| DeMorgan inversion | One phrasing caused a deterministic logic inversion on llama3.1:8b. |
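For background on the last finding: the two phrasings in a De Morgan task are logically equivalent, so a model that answers them differently has inverted the logic rather than resolved an ambiguity. A minimal sketch, assuming the task probes `not (a and b)` versus `(not a) or (not b)` (the exact prompt pair is in the finding page):

```python
from itertools import product

def demorgan_holds() -> bool:
    """Check both De Morgan identities over every truth assignment."""
    return all(
        (not (a and b)) == ((not a) or (not b))
        and (not (a or b)) == ((not a) and (not b))
        for a, b in product([False, True], repeat=2)
    )

# Equivalent phrasings should get equivalent answers; a deterministic
# flip between them is an inversion of the logic, not of the semantics.
print(demorgan_holds())  # → True
```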