do-web-doc-resolver

Resolve queries or URLs into compact, LLM-ready markdown using an intelligent, low-cost cascade

Overview

This project implements a v4 cascade resolver with Python core, Rust CLI, and web UI that prioritizes free and low-cost data sources.

Overview

This project implements a v4 cascade resolver with Python core, Rust CLI, and web UI that prioritizes free and low-cost data sources:

Query Resolution Cascade

Semantic Cache - Multi-layer cache (URL, Query, Provider)
Exa MCP - FREE search via Model Context Protocol (no API key required!)
Exa SDK - Token-efficient query resolution using highlights (low-cost)
Tavily - Comprehensive search (configurable)
Serper - Google search via Serper API (2500 free credits)
DuckDuckGo - Free search, always available (no API key)
Mistral - AI-powered fallback when other methods fail

URL Resolution Cascade

Semantic Cache - Instant retrieval for known URLs
llms.txt / Jina Reader - Parallel fast-path probes for structured documentation
Firecrawl - Deep extraction (requires API key)
Direct HTTP fetch - Basic content extraction (free)
Mistral browser - AI-powered fallback when other methods fail
DuckDuckGo - Free search fallback (no API key)

Features

Three Interfaces: Python library, Rust CLI (do-wdr), and Next.js web UI
Execution Profiles: free, balanced, fast, and quality modes
Telemetry & Metrics: Detailed per-provider latency and cost tracking
Content Compaction: Intelligent boilerplate removal and deduplication
AI Synthesis: Cohesive research answers synthesized using Mistral
Parallel Probes: Concurrent fast-path provider checks for lower latency
Link Validation: Automated async HTTP status checks for returned links
Bias Scoring: Quality ranking based on domain trust and heuristics
Document & OCR: Support for PDF/DOCX via Docling and images via OCR
Semantic Cache: Feature-gated similarity cache (Turso/libsql-backed)
Circuit Breakers: Per-provider failure tracking with cooldowns
Routing Memory: Per-domain provider success/latency learning
Web UI: Browser-based resolver with dark mode, PWA support, and help/FAQ
Result Canonicalization: Normalizes and deduplicates search hits with card-based previews and per-result actions

Installation

Python (Primary)

git clone https://github.com/d-oit/do-web-doc-resolver.git
cd do-web-doc-resolver
pip install -r requirements.txt

Rust CLI

cd cli && cargo build --release
# Binary: cli/target/release/do-wdr

Web UI

cd web && npm install && npx playwright install chromium

Git Hooks

./scripts/setup-hooks.sh
cp scripts/pre-commit-hook.sh .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit

Configuration

All API keys are optional. The resolver works without any keys by using free providers (Exa MCP, llms.txt, DuckDuckGo).

Variable	Provider	Notes
`EXA_API_KEY`	Exa SDK	Optional - Exa MCP is free and runs first
`TAVILY_API_KEY`	Tavily	Optional - comprehensive search
`SERPER_API_KEY`	Serper	Optional - Google search (2500 free credits)
`FIRECRAWL_API_KEY`	Firecrawl	Optional - deep extraction
`MISTRAL_API_KEY`	Mistral	Optional - AI-powered fallback

Setting API Keys

# Linux/macOS
export EXA_API_KEY="your-exa-key"
export TAVILY_API_KEY="your-tavily-key"
export SERPER_API_KEY="your-serper-key"
export FIRECRAWL_API_KEY="your-firecrawl-key"
export MISTRAL_API_KEY="your-mistral-key"

# Windows (PowerShell)
$env:EXA_API_KEY="your-exa-key"
$env:TAVILY_API_KEY="your-tavily-key"
$env:SERPER_API_KEY="your-serper-key"
$env:FIRECRAWL_API_KEY="your-firecrawl-key"
$env:MISTRAL_API_KEY="your-mistral-key"

Rust CLI supports config.toml or DO_WDR_* env vars. See .agents/skills/do-web-doc-resolver/references/CONFIG.md for full reference.

Usage

Python API

Basic Usage (No API Keys)

from scripts.resolve import resolve

# Resolve a URL (uses llms.txt + free fallbacks)
result = resolve("https://example.com")

# Resolve a query (uses Exa MCP - free!)
result = resolve("latest AI research papers")

Skip Specific Providers

from scripts.resolve import resolve

result = resolve("query", skip_providers={"exa_mcp", "exa"})

Use a Specific Provider Directly

from scripts.resolve import resolve_direct, ProviderType

result = resolve_direct("https://example.com", ProviderType.JINA)
result = resolve_direct("python tutorials", ProviderType.EXA_MCP)
result = resolve_direct("latest news", ProviderType.DUCKDUCKGO)

Available providers:

URL providers: llms_txt, jina, firecrawl, direct_fetch, mistral_browser, duckduckgo
Query providers: exa_mcp, exa, tavily, serper, duckduckgo, mistral_websearch

Custom Provider Order

from scripts.resolve import resolve_with_order, ProviderType

result = resolve_with_order(
    "https://example.com",
    [ProviderType.LLMS_TXT, ProviderType.JINA, ProviderType.DIRECT_FETCH]
)

result = resolve_with_order(
    "python tutorials",
    [ProviderType.EXA_MCP, ProviderType.DUCKDUCKGO]
)

Python CLI

# Resolve a URL
python -m scripts.cli "https://example.com"

# Resolve a query
python -m scripts.cli "machine learning tutorials"

# With options
python -m scripts.cli "query" --max-chars 8000 --json --log-level INFO

# Skip providers
python -m scripts.cli "query" --skip exa_mcp --skip exa

# Use a specific provider
python -m scripts.cli "https://example.com" --provider jina

# Custom provider order
python -m scripts.cli "https://example.com" --providers-order "llms_txt,jina,direct_fetch"

Rust CLI (`do-wdr`)

# Build the CLI
cd cli && cargo build --release

# Resolve a URL or query
./target/release/do-wdr resolve "https://example.com"
./target/release/do-wdr resolve "machine learning tutorials"

# Output as JSON
./target/release/do-wdr resolve "query" --json

# Save to file
./target/release/do-wdr resolve "https://example.com" -o result.md

# Use execution profile
./target/release/do-wdr resolve "query" --profile free    # No paid providers
./target/release/do-wdr resolve "query" --profile fast    # Low latency
./target/release/do-wdr resolve "query" --profile quality # Best results

# Skip providers
./target/release/do-wdr resolve "query" --skip exa_mcp,exa

# AI synthesis from multiple providers
./target/release/do-wdr resolve "query" --synthesize

# Metrics output
./target/release/do-wdr resolve "query" --metrics-json

# List available providers
./target/release/do-wdr providers

# Show current config
./target/release/do-wdr config

Web UI

# Development
cd web && npm run dev
# Open http://localhost:3000

# Production
# https://web-eight-ivory-29.vercel.app

The web UI provides a browser-based interface with:

Collapsible configuration sidebar (profile, providers, advanced options)
Profile-based provider selection with visual active indicators
Collapsible API keys section for paid providers
All UI state persisted to localStorage (sidebar, profile, providers, options)
Text input for URLs or search queries
Card-based result display with per-link copy/open actions plus a raw markdown toggle
Inline toggles for deep research, skip cache, and max character presets
Dark mode support
Help & FAQ page (/help)
PWA support for installable app

How It Works

Query Resolution Cascade

Query Input
    |
    v
+-------------------------------------------------------------+
| 1. Exa MCP Search (FREE - no API key required!)             |
|    - Uses Model Context Protocol at https://mcp.exa.ai/mcp  |
|    - JSON-RPC 2.0 over HTTP POST                            |
|    - Rate limit handling: 30s cooldown                      |
|    - On error: continue to next provider                    |
+-------------------------------------------------------------+
    | (fail)
    v
+-------------------------------------------------------------+
| 2. Exa SDK Search (if EXA_API_KEY set)                      |
|    - Uses highlights for token-efficient results            |
|    - Rate limit handling: 60s cooldown                      |
|    - On error: continue to next provider                    |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 3. Tavily Search (if TAVILY_API_KEY set)                    |
|    - Comprehensive search results                           |
|    - Rate limit handling: 60s cooldown                      |
|    - On error: continue to next provider                    |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 4. Serper Search (if SERPER_API_KEY set)                    |
|    - Google search results via Serper API                   |
|    - 2500 free credits                                      |
|    - Rate limit handling: 60s cooldown                      |
|    - On error: continue to next provider                    |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 5. DuckDuckGo Search (FREE - no API key required!)          |
|    - Completely free, no authentication needed              |
|    - Rate limit handling: 30s cooldown                      |
|    - Always available as fallback                           |
+-------------------------------------------------------------+
    | (fail)
    v
+-------------------------------------------------------------+
| 6. Mistral Web Search (if MISTRAL_API_KEY set)              |
|    - Uses Mistral chat API with web search                  |
|    - Free tier available                                    |
|    - Rate limit handling: 60s cooldown                      |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 7. Return Error                                             |
|    - source: "none"                                         |
|    - error: "No resolution method available"                |
+-------------------------------------------------------------+

URL Resolution Cascade

URL Input
    |
    v
+-------------------------------------------------------------+
| 1. Check for llms.txt                                       |
|    - Probe: https://origin/llms.txt                         |
|    - If found: return structured documentation              |
|    - FREE - no API key required                             |
|    - Cached per origin (1-hour TTL)                         |
+-------------------------------------------------------------+
    | (not found)
    v
+-------------------------------------------------------------+
| 2. Jina Reader (FREE - https://r.jina.ai/<url>)             |
|    - No API key required, 20 RPM free tier                  |
|    - Returns clean markdown for any public URL              |
|    - Rate limit handling: 60s cooldown                      |
+-------------------------------------------------------------+
    | (fail)
    v
+-------------------------------------------------------------+
| 3. Firecrawl Extraction (if FIRECRAWL_API_KEY set)          |
|    - Deep content extraction with markdown output           |
|    - Rate limit handling: 60s cooldown                      |
|    - On rate limit/quota: fallback to next provider         |
|    - On auth error: return None                             |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 4. Direct HTTP Fetch                                        |
|    - Basic content extraction from HTML                     |
|    - FREE - no API key required                             |
+-------------------------------------------------------------+
    | (fail)
    v
+-------------------------------------------------------------+
| 5. Mistral Browser (if MISTRAL_API_KEY set)                 |
|    - Uses Mistral agent with web browsing capability        |
|    - Free tier available                                    |
|    - Rate limit handling: 60s cooldown                      |
+-------------------------------------------------------------+
    | (fail/unavailable)
    v
+-------------------------------------------------------------+
| 6. DuckDuckGo Search (fallback)                             |
|    - Search for the URL as a query                          |
|    - FREE - no API key required                             |
+-------------------------------------------------------------+
    | (fail)
    v
+-------------------------------------------------------------+
| 7. Return Error                                             |
|    - source: "none"                                         |
|    - error: "No resolution method available"                |
+-------------------------------------------------------------+

Error Handling & Self-Learning

The resolver automatically handles:

Rate Limits: Detects 429 errors and falls back to next source
No Credits: Catches "no credits" errors and uses free alternatives
Network Errors: Graceful degradation through the cascade
Invalid Responses: Validates content before returning
Missing API Keys: Skips paid services when keys not configured
Circuit Breakers: Per-provider failure tracking with cooldowns
Negative Cache: TTL-based cache of failed provider/target pairs
Routing Memory: Per-domain provider success rate learning

Testing

Python

# Unit tests (no API keys needed)
python -m pytest tests/ -v -m "not live"

# Live integration tests (requires API keys)
python -m pytest tests/ -m live -v

# With coverage
python -m pytest --cov=scripts tests/

Rust CLI

cd cli && cargo test

# Lint
cd cli && cargo clippy -- -D warnings && cargo fmt --check

Web UI

cd web && npx playwright test --project=desktop

# With UI
cd web && npx playwright test --ui

Quality Gate (All Checks)

./scripts/quality_gate.sh

Repository Structure

do-web-doc-resolver/
├── AGENTS.md              # Agent instructions
├── README.md              # This file
├── scripts/
│   ├── resolve.py         # Main Python resolver
│   ├── quality_gate.sh    # Pre-commit quality checks
│   └── setup-hooks.sh     # Git hook installer
├── cli/                   # Rust CLI (do-wdr binary)
│   ├── Cargo.toml
│   └── src/
│       ├── main.rs        # Entry point
│       ├── cli.rs         # Clap CLI definition
│       ├── config.rs      # Config loading (TOML + env)
│       ├── resolver/       # Cascade orchestrator
│       ├── providers/     # 13 provider modules
│       └── ...            # quality, metrics, synthesis, etc.
├── web/                   # Next.js web UI (Vercel)
│   ├── app/
│   │   ├── page.tsx       # Resolver form
│   │   └── help/page.tsx  # Help & FAQ
│   ├── tests/e2e/         # Playwright E2E tests
│   └── vercel.json        # Deployment config
├── tests/                 # Python test suite
├── .agents/skills/        # Canonical skill definitions
│   └── do-web-doc-resolver/
│       ├── SKILL.md       # Main skill file
│       └── references/    # Detailed reference docs
│           ├── CASCADE.md  # Full cascade decision trees
│           ├── PROVIDERS.md # Provider API details
│           ├── CLI.md      # CLI usage reference
│           ├── RUST_CLI.md # Rust CLI architecture
│           ├── TESTING.md  # Test structure
│           └── CONFIG.md   # Config reference
└── .github/workflows/     # CI/CD pipelines

Related Files

AGENTS.md - Agent instructions
.agents/skills/do-web-doc-resolver/ - Skill definition and reference docs
scripts/resolve.py - Python resolver source
cli/src/ - Rust CLI source

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Add tests for new functionality
Run ./scripts/quality_gate.sh before submitting
Submit a pull request

Assets & Screenshots

Screenshots and visual assets are stored in assets/screenshots/. See assets/README.md for details.

Capturing Screenshots

# Capture for current version
./scripts/capture/capture-release.sh

# Capture for specific version
./scripts/capture/capture-release.sh 1.0.0

# Capture resolution flow
./scripts/capture/capture-flow.sh resolve

# Capture responsive views
./scripts/capture/capture-responsive.sh

License

MIT License - see LICENSE file for details.

Support

For issues, questions, or feature requests, please open an issue.

Note: This project prioritizes cost-efficiency and graceful degradation. It works perfectly fine with zero API keys configured, using only free sources (Exa MCP, llms.txt, DuckDuckGo). API keys enhance functionality but are not required. The Rust CLI and Web UI provide fast native and browser-based interfaces to the same resolver core.

Name		Name	Last commit message	Last commit date
Latest commit History 334 Commits
.agents/skills		.agents/skills
.blackbox		.blackbox
.claude		.claude
.github		.github
.opencode		.opencode
agents-docs		agents-docs
assets		assets
cli		cli
plans		plans
samples		samples
scripts		scripts
tests		tests
web		web
.gitignore		.gitignore
.python-version		.python-version
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
skills-lock.json		skills-lock.json

Folders and files

Latest commit

History

Repository files navigation

do-web-doc-resolver

Overview

Overview

Query Resolution Cascade

URL Resolution Cascade

Features

Installation

Python (Primary)

Rust CLI

Web UI

Git Hooks

Configuration

Setting API Keys

Usage

Python API

Basic Usage (No API Keys)

Skip Specific Providers

Use a Specific Provider Directly

Custom Provider Order

Python CLI

Rust CLI (do-wdr)

Web UI

How It Works

Query Resolution Cascade

URL Resolution Cascade

Error Handling & Self-Learning

Testing

Python

Rust CLI

Web UI

Quality Gate (All Checks)

Repository Structure

Related Files

Contributing

Assets & Screenshots

Capturing Screenshots

License

Support

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 6

Uh oh!

Contributors

Uh oh!

Languages

Rust CLI (`do-wdr`)