```text
██╗  ██╗ █████╗ ██╗   ██╗███████╗████████╗ █████╗  ██████╗██╗  ██╗
██║  ██║██╔══██╗╚██╗ ██╔╝██╔════╝╚══██╔══╝██╔══██╗██╔════╝██║ ██╔╝
███████║███████║ ╚████╔╝ ███████╗   ██║   ███████║██║     █████╔╝
██╔══██║██╔══██║  ╚██╔╝  ╚════██║   ██║   ██╔══██║██║     ██╔═██╗
██║  ██║██║  ██║   ██║   ███████║   ██║   ██║  ██║╚██████╗██║  ██╗
╚═╝  ╚═╝╚═╝  ╚═╝   ╚═╝   ╚══════╝   ╚═╝   ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝
```
A comprehensive, professional-grade Needle-in-a-Haystack (NIAH) evaluation framework for benchmarking LLM long-context retrieval and reasoning capabilities.
The "Needle in a Haystack" test family evaluates whether a large language model can retrieve or reason about specific information embedded within a long document. HaystackBench implements 8 task types covering everything from simple retrieval to complex multi-hop reasoning.
- 8 task types: S-RT, M-RT, M-RS, ATC, Counting, Key-Value, Conflicting Needles, Distractor Needles
- 7 LLM providers: Anthropic, OpenAI, Google, Mistral, Cohere, Groq, Ollama (local)
- Web UI: Full SvelteKit frontend with interactive heatmaps, real-time progress, and cell-level drill-down
- CLI: Complete `haystackbench` command for power users and CI/CD
- Publication-quality charts: 2D heatmaps, radar charts, line charts, HTML reports
- Multi-model comparison: Run multiple models in a single experiment
- Smart caching: Avoid re-running identical cells
- Checkpoint/resume: Never lose progress on interrupted runs
- Cost estimation: Know your bill before you run — including which cells will be skipped for models with smaller context windows
- Context window guardrails: Cells that exceed a model's advertised context limit are skipped gracefully rather than failing mid-run
- Large context support: Presets up to 1M tokens; tested with Gemini 2.5, Claude 3.7, and other long-context models
- Custom models: Register any provider/model not in the built-in list with its own context window and pricing
- Custom needle library: Define reusable needle/expected-answer pairs in the Settings UI and select them in the experiment wizard
- Multi-needle depth strategies: `uniform`, `centered`, `random`, or `bookends` placement for M-RT/M-RS tasks; redundant depth sweeps are automatically collapsed for depth-independent strategies
- YAML configs: Define experiments declaratively
- No cloud required: Everything runs locally; your data stays on your machine
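The multi-needle depth strategies listed above can be illustrated with a small sketch. This is a hypothetical reimplementation, not the HaystackBench API — the function name and the 5% spacing/margin constants are illustrative assumptions:

```python
import random

def needle_depths(strategy: str, n: int) -> list[float]:
    """Return n depth fractions (0.0 = document start, 1.0 = end).

    Illustrative sketch of the four placement strategies; constants
    (edge margin, cluster spacing) are assumptions, not the library's.
    """
    if strategy == "uniform":
        # Evenly spaced through the document, avoiding the exact edges.
        return [(i + 1) / (n + 1) for i in range(n)]
    if strategy == "centered":
        # Clustered around the middle, 5% apart (assumed spacing).
        return [0.5 + (i - (n - 1) / 2) * 0.05 for i in range(n)]
    if strategy == "random":
        return sorted(random.random() for _ in range(n))
    if strategy == "bookends":
        # Alternate between the start and the end of the document.
        return [0.05 if i % 2 == 0 else 0.95 for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that only `centered`-style placement actually varies with a depth sweep; `uniform` and `bookends` produce the same layout at every depth, which is why depth-independent strategies can have their redundant sweep cells collapsed.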
```shell
pip install haystackbench
# or with all providers:
pip install "haystackbench[all]"
```

```shell
haystackbench setup
# or via environment variable:
export ANTHROPIC_API_KEY=sk-ant-...
```

```shell
# Quick S-RT test with Claude (3 lengths × 5 depths ≈ 15 API calls)
haystackbench run s-rt \
  --provider anthropic \
  --model claude-3-5-haiku-20241022 \
  --preset quick

# View results
haystackbench results list
haystackbench results plot <experiment_id>
haystackbench results export <experiment_id> --format html
```

```python
import asyncio

from haystackbench import TestRunner, get_preset
from haystackbench.config.schema import ModelConfig
from haystackbench.providers import get_provider
from haystackbench.storage import ResultStore, ResponseCache

async def main():
    config = get_preset("quick")
    config.models = [ModelConfig(provider="anthropic", model="claude-3-5-haiku-20241022")]

    store = ResultStore()
    await store.init_db()

    runner = TestRunner(storage=store, cache=ResponseCache())
    provider = get_provider("anthropic")

    result = await runner.run_experiment(config, {"anthropic": provider})
    print(f"Overall accuracy: {result.overall_accuracy:.1%}")

asyncio.run(main())
```

```shell
haystackbench serve  # Opens at http://127.0.0.1:8080
```

| Task | Description | Output |
|---|---|---|
| S-RT | Single-Needle Retrieval | 2D heatmap (accuracy × context × depth) |
| M-RT | Multi-Needle Retrieval (N needles) | Per-needle accuracy + aggregate |
| M-RS | Multi-Needle Reasoning (sum, max, etc.) | Reasoning accuracy |
| ATC | Ancestral Trace Challenge (multi-hop) | Accuracy by hop depth |
| Counting | Count occurrences of a pattern | Numeric match |
| Key-Value | Retrieve value for a specific key | Exact match |
| Conflicting | Two contradictory facts — which wins? | Primacy/recency bias |
| Distractor | Near-miss facts surround the target | Retrieval precision |
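The output columns above boil down to three match styles. A hypothetical scorer sketch (the `score` function and its method names are illustrative; HaystackBench's actual scoring logic may differ):

```python
import re

def score(response: str, expected: str, method: str) -> bool:
    """Illustrative scorer: numeric match, exact match, or substring match."""
    if method == "numeric":
        # Compare the first number found in the response to the expected value
        # (used for Counting-style tasks).
        found = re.findall(r"-?\d+(?:\.\d+)?", response)
        return bool(found) and float(found[0]) == float(expected)
    if method == "exact":
        # Whitespace-trimmed exact match (Key-Value-style tasks).
        return response.strip() == expected
    # Default: case-insensitive substring match (retrieval tasks).
    return expected.lower() in response.lower()
```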
| Provider | Notable Models | Max Context |
|---|---|---|
| Anthropic | claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6, claude-3-7-sonnet | 200K |
| OpenAI | gpt-5, gpt-5-mini, gpt-4.1, o3, o4-mini | 128K–1M |
| Google | gemini-2.5-flash-lite, gemini-2.5-flash, gemini-2.5-pro | 1M |
| Mistral | mistral-large, mistral-small, devstral, codestral | 128K |
| Cohere | command-a, command-r-plus, command-r | 128K |
| Groq | llama-4-maverick, kimi-k2, qwen3-32b, llama-3.3-70b | 128K |
| Ollama | Any local model | Varies |
| Custom | Any model via provider API | User-defined |
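Because max context varies so widely across the providers above, the context-window guardrail matters: cells whose prompt would exceed a model's advertised limit are skipped up front rather than failing mid-run. A rough sketch of that planning step — function name, tuple shapes, and the overhead constant are assumptions, not the HaystackBench internals:

```python
PROMPT_OVERHEAD = 512  # assumed tokens reserved for instructions + answer

def plan_cells(context_lengths: list[int], depths: list[float],
               model_max_tokens: int) -> tuple[list[tuple[int, float]],
                                               list[tuple[int, float]]]:
    """Split the (length, depth) grid into runnable and skipped cells."""
    runnable, skipped = [], []
    for length in context_lengths:
        for depth in depths:
            cell = (length, depth)
            if length + PROMPT_OVERHEAD > model_max_tokens:
                skipped.append(cell)   # e.g. a 1M-token cell on a 200K model
            else:
                runnable.append(cell)
    return runnable, skipped
```

Cost estimation can then price only the runnable cells, which is how the estimate can report which cells a smaller-context model will skip.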
```yaml
# experiments/my_experiment.yaml
name: "Claude vs Gemini — Multi-Needle Long Context"
task_type: m_rs
context_lengths: [8192, 32768, 131072, 1000000]
depth_percents: [0.1, 0.3, 0.5, 0.7, 0.9]
num_needles: [3, 5, 10]
depth_distribution: centered  # uniform | centered | random | bookends
trials_per_cell: 2
needle_config:
  needle_type: synthetic_numeric
  reasoning_type: sum
haystack_config:
  source: paul_graham
models:
  - provider: anthropic
    model: claude-sonnet-4-6
  - provider: google
    model: gemini-2.5-flash
scoring:
  method: auto  # auto-selects numeric or substring based on needle type
```

```shell
haystackbench run --config experiments/my_experiment.yaml
```

The primary output is a 2D heatmap where:
- X-axis: context length (longer = harder)
- Y-axis: document depth (where the needle is placed)
- Color: accuracy (green = 1.0, red = 0.0)
Common patterns:
- "Lost in the Middle": Dark band in the center rows — model misses information at mid-context depths
- "Context Cliff": Green columns → sudden red columns — hard cutoff in effective context
- "Recency Bias": Bottom rows brighter than top rows
| Feature | HaystackBench | gkamradt NIAH | NeedleBench | RULER |
|---|---|---|---|---|
| Task types | 8 | 1 | 6 | 5 |
| Web UI | ✓ | ✗ | ✗ | ✗ |
| Multi-provider | 7+ custom | 2 | 2 | 2 |
| Cost estimation | ✓ | ✗ | ✗ | ✗ |
| Context guardrails | ✓ | ✗ | ✗ | ✗ |
| Checkpoint/resume | ✓ | ✗ | ✗ | ✗ |
| YAML config | ✓ | ✗ | ✗ | ✗ |
| HTML reports | ✓ | ✗ | ✗ | ✗ |
| Local models | ✓ | ✗ | ✗ | ✗ |
| Custom needle library | ✓ | ✗ | ✗ | ✗ |
```bibtex
@software{haystackbench2025,
  title = {HaystackBench: A Comprehensive NIAH Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/marmutapp/needlehaystack},
}
```

MIT — see LICENSE