I tried safishamsi/graphify on a private codebase. Notes from the run.
Short answer — yes, but the bit that actually helped was the thing graphify doesn't advertise.
The headline number ("fewer tokens per Claude query") is real in graphify's own benchmark. I haven't used graphify for enough days to show you a week-over-week drop in my own Claude bill, so I'll say that instead of pretending.
```shell
pip install graphifyy && graphify install
# then inside Claude Code:
/graphify .
```

→ 4 minutes → 445 nodes · 446 edges · 118 communities
→ outputs: graph.html · graph.json · GRAPH_REPORT.md · Obsidian vault (563 notes)
→ local only — no Neo4j, no vector DB, no extra API keys
Force-layout settling. Every color is one Leiden community.
Same graph inside Obsidian's native graph view. Labels stripped in this crop — the actual vault has one note per graph node + one overview per community + a graph.canvas with structured groups.
Graphify's report flagged a "surprising connection" — two of my own docs looked too similar to each other.
I traced it. Found 6 different value-prop framings across 6 strategy docs. No two agreed.
Fix:
→ wrote one canonical MOAT.md
→ banner-linked the other six to it
→ ran graphify --update
→ duplicate semantic edges dropped to zero
20 minutes. Real bug, gone. Before I'd even looked at the benchmark.
This is the use case no viral post is talking about. Graphify as a ruthless editor for your own strategy docs.
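For the curious: graphify's similarity internals are opaque to me, but the kind of near-duplicate check that fires on two too-similar docs can be sketched with plain bag-of-words cosine similarity. Function names, the 0.8 threshold, and the toy docs below are all mine, not graphify's:

```python
# A dependency-free sketch of a near-duplicate doc check: bag-of-words
# cosine similarity over every doc pair. This is an illustration of the
# idea, not graphify's actual (embedding-based) implementation.
import math
import re
from collections import Counter
from itertools import combinations

def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    vecs = {name: vectorize(text) for name, text in docs.items()}
    return [(x, y, round(cosine(vecs[x], vecs[y]), 3))
            for x, y in combinations(sorted(vecs), 2)
            if cosine(vecs[x], vecs[y]) >= threshold]

# Toy corpus: two value-prop framings that say the same thing, one that doesn't.
docs = {
    "MOAT.md": "our moat is the graph our moat is local",
    "strategy1.md": "our moat is the graph our moat is local first",
    "vision.md": "ship fast iterate weekly talk to users",
}
print(near_duplicates(docs))  # [('MOAT.md', 'strategy1.md', 0.968)]
```

One canonical doc plus banner links is exactly the fix that collapses pairs like that.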
Graphify's published claim: 71.5× fewer tokens per query.
I ran 20 generic engineering questions against the graph. 19 matched. Two baselines:
| Baseline compared against | typical ratio | worst query | best query |
|---|---|---|---|
| graphify's default baseline | ~70× | 8.5× | 549× |
| my full corpus size (126k words) | ~400× | 48× | 3,120× |
My ~70× lines up with graphify's 71.5× almost exactly. Their number replicates.
The ~400× comes from comparing against the full corpus word count, which is closer to what Claude Code actually reads when it dumps files into context.
I'm not using the 3,120× peak as the headline — that's one tight question, not a real distribution.
Raw data: benchmark_results.json
Environment + graph SHA-256: env.json
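The arithmetic behind both baselines is just a ratio of token estimates. A sketch of the per-query math (variable names and the sample numbers are illustrative; the 1.33 words-to-tokens multiplier is the heuristic graphify documents):

```python
# Sketch of the savings ratio behind both baselines. Variable names are
# mine; the 1.33 words-to-tokens multiplier is graphify's documented heuristic.
WORDS_TO_TOKENS = 1.33

def token_estimate(words: int) -> float:
    return words * WORDS_TO_TOKENS

def savings_ratio(baseline_words: int, subgraph_words: int) -> float:
    # The multiplier cancels, so this is really a word ratio; keeping it
    # explicit mirrors how the reported token numbers are derived.
    return token_estimate(baseline_words) / token_estimate(subgraph_words)

corpus_words = 126_000    # my full private corpus (the second baseline)
subgraph_words = 315      # illustrative subgraph size for one query
print(round(savings_ratio(corpus_words, subgraph_words)))  # 400
```

Swapping `corpus_words` for graphify's smaller default baseline is the whole difference between the ~70× row and the ~400× row.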
Run it yourself:
```shell
python3 bench.py /path/to/your/graphify-out/graph.json
```

Didn't test:
→ answer quality — I measured tokens, not whether the graph-scoped answers are as good as full-file answers
→ week-over-week savings in my daily Claude Code work — I built the graph, I haven't yet lived with it long enough to show you a real bill
When I run that last experiment, I'll add the receipt here.
The "didn't test" list was too long. Went and ran them. Findings:
Ran graphify + my bench harness on 2 public repos:
| Corpus | Files | Words | Default median | Measured median |
|---|---|---|---|---|
| vercel/commerce (small Next.js) | 67 | 10k | 252× | 349× |
| tj/commander.js (CLI library) | 179 | 86k | 22× | 87× |
| my private corpus | 187 | 126k | 67× | 379× |
Graphify's 71.5× claim sits in the middle of a 22×–252× range on default baseline.
Pattern: app code with cross-file imports (commerce, mine) gets higher ratios. Library code with tight internal structure (commander) gets lower. The headline number is not universal — it's corpus-shape-dependent.
Raw data: cross_corpus_results.json
Ran OpenAI's cl100k_base encoder (a reasonable public proxy — Anthropic ships its own counter, but cl100k_base is the closest off-the-shelf tokenizer for a directional check) on the actual subgraph text graphify would send for 15 of my queries. Compared against graphify's words × 1.33 heuristic.
→ median |drift| — 6.1% (heuristic is roughly right)
→ mean |drift| — 22.3% (pulled up by 4 outliers)
→ max |drift| — 74% (on tight-scope questions with short-label nodes)
Direction is mixed. Heuristic often overcounts on short subgraphs (real tokens lower → real ratios even higher on those queries). Undercounts on broader subgraphs by ~5-13%. Takeaway: the "~15% drift" claim is true for the median, breaks down at the tails.
Raw data: tiktoken_drift.json
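The drift metric itself is tiny. A dependency-free sketch; swap the stand-in token counts for real `len(tiktoken.get_encoding("cl100k_base").encode(text))` values to reproduce the check. The sample texts and counts below are illustrative, not my benchmark data:

```python
# Sketch of the drift metric: heuristic estimate (words * 1.33) vs a real
# tokenizer count. Stand-in counts keep this runnable without tiktoken.
import statistics

def heuristic_tokens(text: str) -> float:
    # graphify's documented words * 1.33 estimate
    return len(text.split()) * 1.33

def drift(real_tokens: int, est_tokens: float) -> float:
    # signed relative error of the heuristic against the real count
    return (est_tokens - real_tokens) / real_tokens

# (text, real token count) pairs; illustrative stand-ins, not my data.
samples = [
    ("short label node", 2),                              # heuristic overcounts
    ("a broader subgraph with many plain sentences", 9),  # nearly spot-on
]
drifts = [abs(drift(real, heuristic_tokens(text))) for text, real in samples]
print(round(statistics.median(drifts), 3))  # 0.515 for these two toy samples
```

The two samples deliberately mirror the pattern above: short-label nodes blow the heuristic up, plain prose barely moves it.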
Took tj/commander.js (179 files), saved a baseline manifest, then synthetically:
→ added 1 new file
→ modified 1 existing file
→ deleted 1 existing file
Ran detect_incremental. It correctly flagged all 3 changes by hash-diff against the saved manifest (2 under files_flagged_as_changed for the add + modify, plus 1 under deleted_detected — summed = 3). --update would re-extract only those 3 files (1.7% of the corpus), not the full 179.
Raw data: staleness_test.json
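The manifest idea is simple enough to sketch from scratch. This is my reimplementation of the concept (hash every tracked file, diff against a previously saved path-to-sha256 map), not graphify's actual code:

```python
# My reimplementation of the manifest idea behind detect_incremental.
# Names and the extension list are mine, not graphify's.
import hashlib
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_manifest(root: Path, exts=(".py", ".md", ".js")) -> dict:
    return {str(p.relative_to(root)): sha256(p)
            for p in root.rglob("*") if p.suffix in exts}

def diff_manifests(old: dict, new: dict):
    changed = [p for p, h in new.items() if old.get(p) != h]  # added + modified
    deleted = [p for p in old if p not in new]
    return changed, deleted

# Replaying the synthetic test above: one add, one modify, one delete.
root = Path(tempfile.mkdtemp())
(root / "a.py").write_text("print('a')")
(root / "b.md").write_text("# b")
old = build_manifest(root)
(root / "a.py").write_text("print('a, edited')")  # modify
(root / "c.js").write_text("// new file")         # add
(root / "b.md").unlink()                          # delete
changed, deleted = diff_manifests(old, build_manifest(root))
print(len(changed), len(deleted))  # 2 1, matching the 2 + 1 split above
```

Hash-diffing is why `--update` can re-extract only the touched files instead of the whole corpus.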
Took tj/commander.js, checked out the oldest of 10 recent commits, saved a baseline manifest, then replayed each commit forward running detect_incremental at every step.
Result across 10 commits:
→ mean: 0.1 files flagged as changed per commit
→ 1 commit flagged 1 file (a README tweak — the only commit that touched a tracked file type)
→ the other 9 were dependabot bumps to package.json / lockfiles / CI configs — correctly ignored because graphify's tracked types are code, docs, papers, images, and video files
So running --update after every commit is basically free on typical dev cycles — graphify only sees the commits that matter. Zero false positives, zero misses.
Raw data: longterm_drift.json
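That "correctly ignored" behavior reduces to an extension filter over the paths a commit touched. A sketch, where the tracked-extension set is my guess at the shape of graphify's list, not the real list:

```python
# Sketch of the commit filter: keep only paths whose extension is a
# graphify-style tracked type (code, docs, papers, images, video).
# This TRACKED set is my illustration, not graphify's actual list.
from pathlib import PurePosixPath

TRACKED = {".py", ".js", ".ts", ".md", ".pdf", ".png", ".jpg", ".mp4"}

def relevant_paths(touched: list) -> list:
    return [p for p in touched if PurePosixPath(p).suffix.lower() in TRACKED]

# A dependabot-style commit touches nothing tracked, so --update is a no-op:
print(relevant_paths(["package.json", "package-lock.json", ".github/workflows/ci.yml"]))  # []
# The README tweak is the one commit that triggers re-extraction:
print(relevant_paths(["Readme.md"]))  # ['Readme.md']
```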
Graphify supports 13 agentic CLIs (per upstream). I only had gemini installed, so I tested it end-to-end:
```shell
graphify gemini install     # writes skill + hook
graphify gemini uninstall   # removes cleanly
```

→ ~/.gemini/skills/graphify/SKILL.md — byte-identical to ~/.claude/skills/graphify/SKILL.md (same 54,664 bytes)
→ hook differs per platform API — Claude gets PreToolUse matching Glob|Grep; Gemini gets BeforeTool matching read_file|list_directory
→ project-level integration doc: Claude → CLAUDE.md; Gemini → GEMINI.md
Same skill logic across backends. I couldn't run an end-to-end query on Gemini (no API key handy), but the install protocol is consistent and verifiable.
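To re-verify the byte-identical claim on your own machine, a hash comparison is enough. The paths are the ones graphify writes (listed above); the checker script itself is mine:

```python
# Re-check the byte-identical skill claim after running both
# `graphify install` and `graphify gemini install`.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

claude_skill = Path.home() / ".claude/skills/graphify/SKILL.md"
gemini_skill = Path.home() / ".gemini/skills/graphify/SKILL.md"

if claude_skill.exists() and gemini_skill.exists():
    print("identical:", digest(claude_skill) == digest(gemini_skill))
    print("bytes:", claude_skill.stat().st_size)  # 54,664 on my machine
else:
    print("run both installs first")
```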
Untested by me: Cursor, Codex, Aider, OpenCode, Droid, Trae, Antigravity, Hermes.
Raw data: non_claude_backends.json
Would need a proper experiment: same questions, Claude run twice — once with full-file context, once with graph-scoped subgraph — scored against a rubric. That's a separate afternoon + API budget. Left for a future update.
→ answer quality (see above — the one I still owe) → week-over-week savings in my daily Claude Code work
README.md — this
LAUNCH_POST.md — LinkedIn / X drafts
bench.py — reproduction harness (20 blind queries)
env.json — environment + graph.json SHA-256
benchmark_results.json — raw distribution, both baselines
cross_corpus_results.json — commander.js + vercel/commerce runs
tiktoken_drift.json — per-query heuristic vs real tokens
staleness_test.json — --update synthetic change detection
longterm_drift.json — 10-commit replay of tj/commander.js
non_claude_backends.json — graphify on Gemini CLI verified; claude/gemini diff
assets/ — 4 inline SVGs + cluster GIF + Obsidian still
```shell
pip install graphifyy && graphify install
```

Then `/graphify .` inside Claude Code.
Upstream: safishamsi/graphify — MIT, ~27k stars, local only.
Tool is theirs, not mine. I ran the experiment and wrote this up.
MIT — for the notes, prose, assets, and bench harness in this repo.

