HaystackBench

██╗  ██╗ █████╗ ██╗   ██╗███████╗████████╗ █████╗  ██████╗██╗  ██╗
██║  ██║██╔══██╗╚██╗ ██╔╝██╔════╝╚══██╔══╝██╔══██╗██╔════╝██║ ██╔╝
███████║███████║ ╚████╔╝ ███████╗   ██║   ███████║██║     █████╔╝
██╔══██║██╔══██║  ╚██╔╝  ╚════██║   ██║   ██╔══██║██║     ██╔═██╗
██║  ██║██║  ██║   ██║   ███████║   ██║   ██║  ██║╚██████╗██║  ██╗
╚═╝  ╚═╝╚═╝  ╚═╝   ╚═╝   ╚══════╝   ╚═╝   ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝
     ██████╗ ███████╗███╗   ██╗ ██████╗██╗  ██╗
     ██╔══██╗██╔════╝████╗  ██║██╔════╝██║  ██║
     ██████╔╝█████╗  ██╔██╗ ██║██║     ███████║
     ██╔══██╗██╔══╝  ██║╚██╗██║██║     ██╔══██║
     ██████╔╝███████╗██║ ╚████║╚██████╗██║  ██║
     ╚═════╝ ╚══════╝╚═╝  ╚═══╝ ╚═════╝╚═╝  ╚═╝

PyPI · Python 3.10+ · License: MIT · Tests · Coverage

A comprehensive, professional-grade Needle in the Haystack (NIAH) evaluation framework for benchmarking LLM long-context retrieval and reasoning capabilities.


What is NIAH?

The "Needle in a Haystack" test family evaluates whether a large language model can retrieve or reason about specific information embedded within a long document. HaystackBench implements 8 task types covering everything from simple retrieval to complex multi-hop reasoning.

Features

  • 8 task types: S-RT, M-RT, M-RS, ATC, Counting, Key-Value, Conflicting Needles, Distractor Needles
  • 7 LLM providers: Anthropic, OpenAI, Google, Mistral, Cohere, Groq, Ollama (local)
  • Web UI: Full SvelteKit frontend with interactive heatmaps, real-time progress, and cell-level drill-down
  • CLI: Complete haystackbench command for power users and CI/CD
  • Publication-quality charts: 2D heatmaps, radar charts, line charts, HTML reports
  • Multi-model comparison: Run multiple models in a single experiment
  • Smart caching: Avoid re-running identical cells
  • Checkpoint/resume: Never lose progress on interrupted runs
  • Cost estimation: Know your bill before you run — including which cells will be skipped for models with smaller context windows
  • Context window guardrails: Cells that exceed a model's advertised context limit are skipped gracefully rather than failing mid-run
  • Large context support: Presets up to 1M tokens; tested with Gemini 2.5, Claude 3.7, and other long-context models
  • Custom models: Register any provider/model not in the built-in list with its own context window and pricing
  • Custom needle library: Define reusable needle/expected-answer pairs in the Settings UI and select them in the experiment wizard
  • Multi-needle depth strategies: uniform, centered, random, or bookends placement for M-RT/M-RS tasks; redundant depth sweeps are automatically collapsed for depth-independent strategies
  • YAML configs: Define experiments declaratively
  • No cloud required: Everything runs locally; your data stays on your machine
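To make the multi-needle depth strategies concrete, here is an illustrative sketch of how the four strategies might map N needles to depth fractions. This is not HaystackBench's actual implementation; the function name and exact placement math are assumptions. It also shows why depth sweeps collapse for depth-independent strategies: the needle positions are fully determined by the strategy, so sweeping the depth axis would repeat identical cells.

```python
import random

def needle_depths(n: int, strategy: str, seed: int = 0) -> list[float]:
    """Illustrative depth fractions (0.0 = document start, 1.0 = end)
    for placing n needles under each strategy. Not the library's code."""
    if strategy == "uniform":
        # Evenly spaced, avoiding the exact edges of the document.
        return [(i + 1) / (n + 1) for i in range(n)]
    if strategy == "centered":
        # Cluster all needles in the middle fifth of the document.
        return [0.4 + 0.2 * (i + 1) / (n + 1) for i in range(n)]
    if strategy == "random":
        rng = random.Random(seed)  # seeded so trials are reproducible
        return sorted(rng.random() for _ in range(n))
    if strategy == "bookends":
        # Alternate needles between the very start and the very end.
        return [0.05 if i % 2 == 0 else 0.95 for i in range(n)]
    raise ValueError(f"unknown strategy: {strategy}")

print(needle_depths(3, "uniform"))   # [0.25, 0.5, 0.75]
```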

Quick Start

Install

pip install haystackbench
# or with all providers:
pip install "haystackbench[all]"

Add your API key

haystackbench setup
# or via environment variable:
export ANTHROPIC_API_KEY=sk-ant-...

Run your first test

# Quick S-RT test with Claude (3 lengths × 5 depths ≈ 15 API calls)
haystackbench run s-rt \
  --provider anthropic \
  --model claude-3-5-haiku-20241022 \
  --preset quick

# View results
haystackbench results list
haystackbench results plot <experiment_id>
haystackbench results export <experiment_id> --format html
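The `--preset quick` comment above counts 3 lengths × 5 depths = 15 calls. As a back-of-the-envelope sketch of how such a call and token budget is derived (the grid values here are assumptions for illustration, not the preset's actual lengths; the built-in cost estimator also accounts for caching and skipped cells):

```python
def estimate_calls_and_tokens(context_lengths, depth_percents, trials_per_cell=1):
    """Rough upper bound on API calls and prompt tokens for one model
    on an S-RT grid. Illustrative sketch, not the library's estimator."""
    calls = len(context_lengths) * len(depth_percents) * trials_per_cell
    # Every cell at a given context length sends roughly that many prompt tokens.
    prompt_tokens = sum(context_lengths) * len(depth_percents) * trials_per_cell
    return calls, prompt_tokens

# Hypothetical "quick"-style grid (the preset's real lengths may differ):
calls, tokens = estimate_calls_and_tokens(
    [4096, 16384, 65536], [0.1, 0.3, 0.5, 0.7, 0.9]
)
print(calls, tokens)  # 15 430080
```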

Python API

import asyncio
from haystackbench import TestRunner, get_preset
from haystackbench.config.schema import ModelConfig
from haystackbench.providers import get_provider
from haystackbench.storage import ResultStore, ResponseCache

async def main():
    # Start from the built-in "quick" preset and test a single model.
    config = get_preset("quick")
    config.models = [ModelConfig(provider="anthropic", model="claude-3-5-haiku-20241022")]

    # Results are persisted locally; the response cache skips identical cells.
    store = ResultStore()
    await store.init_db()
    runner = TestRunner(storage=store, cache=ResponseCache())

    # run_experiment takes the config plus a provider-name -> provider map.
    provider = get_provider("anthropic")
    result = await runner.run_experiment(config, {"anthropic": provider})

    print(f"Overall accuracy: {result.overall_accuracy:.1%}")

asyncio.run(main())

Web UI

haystackbench serve  # Opens at http://127.0.0.1:8080

Task Types

Task         Description                               Output
S-RT         Single-Needle Retrieval                   2D heatmap (accuracy × context × depth)
M-RT         Multi-Needle Retrieval (N needles)        Per-needle accuracy + aggregate
M-RS         Multi-Needle Reasoning (sum, max, etc.)   Reasoning accuracy
ATC          Ancestral Trace Challenge (multi-hop)     Accuracy by hop depth
Counting     Count occurrences of a pattern            Numeric match
Key-Value    Retrieve the value for a specific key     Exact match
Conflicting  Two contradictory facts; which wins?      Primacy/recency bias
Distractor   Near-miss facts surround the target       Retrieval precision
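As an illustration of how a single M-RS "sum" cell might be constructed and scored, here is a minimal sketch. The helper names, the needle sentence template, and the scoring regex are assumptions, not HaystackBench's actual generator or scorer:

```python
import re

def build_mrs_sum_cell(haystack: str, values: list[int], depths: list[float]):
    """Insert one numeric needle sentence per value at the given depth
    fractions and return (document, expected_sum). Illustrative only."""
    # Compute all positions on the original text, then insert deepest-first
    # so earlier insertions don't invalidate the remaining offsets.
    spots = sorted(
        ((int(len(haystack) * d), v) for v, d in zip(values, depths)),
        reverse=True,
    )
    doc = haystack
    for pos, value in spots:
        doc = doc[:pos] + f" The secret number is {value}. " + doc[pos:]
    return doc, sum(values)

def score_numeric(response: str, expected: int) -> bool:
    """Numeric scoring: is the expected total among the integers in the reply?"""
    return str(expected) in re.findall(r"-?\d+", response)

doc, expected = build_mrs_sum_cell("lorem ipsum " * 50, [7, 11, 23], [0.2, 0.5, 0.8])
print(expected)  # 41
print(score_numeric("The three numbers sum to 41.", expected))  # True
```

Extracting integers with a regex (rather than substring matching on the raw reply) avoids crediting answers like "410" that merely contain the expected digits.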

Supported Providers

Provider   Notable Models                                                       Max Context
Anthropic  claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6,                200K
           claude-3-7-sonnet
OpenAI     gpt-5, gpt-5-mini, gpt-4.1, o3, o4-mini                              128K–1M
Google     gemini-2.5-flash-lite, gemini-2.5-flash, gemini-2.5-pro              1M
Mistral    mistral-large, mistral-small, devstral, codestral                    128K
Cohere     command-a, command-r-plus, command-r                                 128K
Groq       llama-4-maverick, kimi-k2, qwen3-32b, llama-3.3-70b                  128K
Ollama     Any local model                                                      Varies
Custom     Any model via provider API                                           User-defined

YAML Config

# experiments/my_experiment.yaml
name: "Claude vs Gemini — Multi-Needle Long Context"
task_type: m_rs
context_lengths: [8192, 32768, 131072, 1000000]
depth_percents: [0.1, 0.3, 0.5, 0.7, 0.9]
num_needles: [3, 5, 10]
depth_distribution: centered   # uniform | centered | random | bookends
trials_per_cell: 2
needle_config:
  needle_type: synthetic_numeric
  reasoning_type: sum
haystack_config:
  source: paul_graham
models:
  - provider: anthropic
    model: claude-sonnet-4-6
  - provider: google
    model: gemini-2.5-flash
scoring:
  method: auto   # auto-selects numeric or substring based on needle type

Run it with:

haystackbench run --config experiments/my_experiment.yaml
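The grid in this config expands multiplicatively. A quick sanity check of how many model calls it implies, assuming one API call per trial and no cache hits:

```python
# Cell grid from the YAML config above.
context_lengths = [8192, 32768, 131072, 1000000]
depth_percents  = [0.1, 0.3, 0.5, 0.7, 0.9]
num_needles     = [3, 5, 10]
trials_per_cell = 2
models          = ["claude-sonnet-4-6", "gemini-2.5-flash"]

calls = (len(context_lengths) * len(depth_percents)
         * len(num_needles) * trials_per_cell * len(models))
print(calls)  # 240
```

In practice the count is lower: the context window guardrails would skip the 1,000,000-token cells for the 200K-context Anthropic model rather than attempting them.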

Interpreting Results

The primary output is a 2D heatmap where:

  • X-axis: context length (longer = harder)
  • Y-axis: document depth (where the needle is placed)
  • Color: accuracy (green = 1.0, red = 0.0)

Common patterns:

  • "Lost in the Middle": Dark band in the center rows — model misses information at mid-context depths
  • "Context Cliff": Green columns → sudden red columns — hard cutoff in effective context
  • "Recency Bias": Bottom rows brighter than top rows
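These patterns can also be checked programmatically from an exported accuracy grid. A minimal sketch for the "Lost in the Middle" case, assuming a plain list-of-lists heatmap with depth rows and context-length columns (this helper is not part of HaystackBench):

```python
def middle_dip(heatmap: list[list[float]]) -> float:
    """Mean accuracy of the edge depth rows minus the mean of the middle
    rows. A clearly positive value suggests a 'lost in the middle'
    pattern. Rows = depths (top = shallow), columns = context lengths."""
    n = len(heatmap)
    row_means = [sum(row) / len(row) for row in heatmap]
    edge = (row_means[0] + row_means[-1]) / 2
    middle = sum(row_means[1:-1]) / (n - 2)
    return edge - middle

# 5 depth rows x 3 context columns, with a dark mid-depth band:
hm = [
    [1.0, 1.0, 0.9],   # needle near the start
    [0.9, 0.8, 0.6],
    [0.6, 0.5, 0.3],   # mid-depth: accuracy collapses
    [0.9, 0.8, 0.7],
    [1.0, 1.0, 0.9],   # needle near the end
]
print(round(middle_dip(hm), 3))
```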

Comparison with Other Tools

Feature                 HaystackBench   gkamradt NIAH   NeedleBench   RULER
Task types              8               1               6             5
Web UI                  ✓
Multi-provider          7 + custom      2               2             2
Cost estimation         ✓
Context guardrails      ✓
Checkpoint/resume       ✓
YAML config             ✓
HTML reports            ✓
Local models            ✓
Custom needle library   ✓

Documentation

Citation

@software{haystackbench2025,
  title  = {HaystackBench: A Comprehensive NIAH Evaluation Framework for LLMs},
  year   = {2025},
  url    = {https://github.com/marmutapp/needlehaystack},
}

License

MIT — see LICENSE
