I tried safishamsi/graphify on a private codebase. Notes from the run.
Short answer — yes, but the bit that actually helped was the thing graphify doesn't advertise.
The headline number ("fewer tokens per Claude query") is real in graphify's own benchmark. I haven't used graphify for enough days to show you a week-over-week drop in my own Claude bill, so I'll say that instead of pretending.
```shell
pip install graphifyy && graphify install
# then inside Claude Code:
/graphify .
```

→ 4 minutes → 445 nodes · 446 edges · 118 communities
→ outputs: graph.html · graph.json · GRAPH_REPORT.md · Obsidian vault (563 notes)
→ local only — no Neo4j, no vector DB, no extra API keys
Force-layout settling. Every color is one Leiden community.
Same graph inside Obsidian's native graph view. Labels stripped in this crop — the actual vault has one note per graph node + one overview per community + a graph.canvas with structured groups.
Graphify's report flagged a "surprising connection" — two of my own docs looked too similar to each other.
I traced it. Found 6 different value-prop framings across 6 strategy docs. No two agreed.
Fix:
→ wrote one canonical MOAT.md
→ banner-linked the other six to it
→ ran graphify --update
→ duplicate semantic edges dropped to zero
20 minutes. Real bug, gone. Before I'd even looked at the benchmark.
This is the use case no viral post is talking about. Graphify as a ruthless editor for your own strategy docs.
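For the curious: graphify's similarity internals are opaque to me, but the kind of near-duplicate check that fires on two too-similar docs can be sketched with plain bag-of-words cosine similarity. Function names, the 0.8 threshold, and the toy docs below are all mine, not graphify's:

```python
# A dependency-free sketch of a near-duplicate doc check: bag-of-words
# cosine similarity over every doc pair. This is an illustration of the
# idea, not graphify's actual (embedding-based) implementation.
import math
import re
from collections import Counter
from itertools import combinations

def vectorize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    vecs = {name: vectorize(text) for name, text in docs.items()}
    return [(x, y, round(cosine(vecs[x], vecs[y]), 3))
            for x, y in combinations(sorted(vecs), 2)
            if cosine(vecs[x], vecs[y]) >= threshold]

# Toy corpus: two value-prop framings that say the same thing, one that doesn't.
docs = {
    "MOAT.md": "our moat is the graph our moat is local",
    "strategy1.md": "our moat is the graph our moat is local first",
    "vision.md": "ship fast iterate weekly talk to users",
}
print(near_duplicates(docs))  # [('MOAT.md', 'strategy1.md', 0.968)]
```

One canonical doc plus banner links is exactly the fix that collapses pairs like that.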
Graphify's published claim: 71.5× fewer tokens per query.
I ran 20 generic engineering questions against the graph. 19 matched. Two baselines:
| Baseline compared against | typical ratio | worst query | best query |
|---|---|---|---|
| graphify's default baseline | ~70× | 8.5× | 549× |
| my full corpus size (126k words) | ~400× | 48× | 3,120× |
My ~70× lines up with graphify's 71.5× almost exactly. Their number replicates.
The ~400× comes from comparing against the full corpus word count, which is closer to what Claude Code actually reads when it dumps files into context.
I'm not using the 3,120× peak as the headline — that's one tight question, not a real distribution.
Raw data: benchmark_results.json
Environment + graph SHA-256: env.json
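The arithmetic behind both baselines is just a ratio of token estimates. A sketch of the per-query math (variable names and the sample numbers are illustrative; the 1.33 words-to-tokens multiplier is the heuristic graphify documents):

```python
# Sketch of the savings ratio behind both baselines. Variable names are
# mine; the 1.33 words-to-tokens multiplier is graphify's documented heuristic.
WORDS_TO_TOKENS = 1.33

def token_estimate(words: int) -> float:
    return words * WORDS_TO_TOKENS

def savings_ratio(baseline_words: int, subgraph_words: int) -> float:
    # The multiplier cancels, so this is really a word ratio; keeping it
    # explicit mirrors how the reported token numbers are derived.
    return token_estimate(baseline_words) / token_estimate(subgraph_words)

corpus_words = 126_000    # my full private corpus (the second baseline)
subgraph_words = 315      # illustrative subgraph size for one query
print(round(savings_ratio(corpus_words, subgraph_words)))  # 400
```

Swapping `corpus_words` for graphify's smaller default baseline is the whole difference between the ~70× row and the ~400× row.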
Run it yourself:
```shell
python3 bench.py /path/to/your/graphify-out/graph.json
```

Didn't test:
→ answer quality — I measured tokens, not whether the graph-scoped answers are as good as full-file answers
→ week-over-week savings in my daily Claude Code work — I built the graph, I haven't yet lived with it long enough to show you a real bill
When I run that last experiment, I'll add the receipt here.
The "didn't test" list was too long. Went and ran them. Findings:
Ran graphify + my bench harness on 2 public repos:
| Corpus | Files | Words | Default median | Measured median |
|---|---|---|---|---|
| vercel/commerce (small Next.js) | 67 | 10k | 252× | 349× |
| tj/commander.js (CLI library) | 179 | 86k | 22× | 87× |
| my private corpus | 187 | 126k | 67× | 379× |
Graphify's 71.5× claim sits in the middle of a 22×–252× range on default baseline.
Pattern: app code with cross-file imports (commerce, mine) gets higher ratios. Library code with tight internal structure (commander) gets lower. The headline number is not universal — it's corpus-shape-dependent.
Raw data: cross_corpus_results.json
Ran OpenAI's cl100k_base encoder (a reasonable public proxy — Anthropic ships its own counter, but cl100k_base is the closest off-the-shelf tokenizer for a directional check) on the actual subgraph text graphify would send for 15 of my queries. Compared against graphify's words × 1.33 heuristic.
→ median |drift| — 6.1% (heuristic is roughly right)
→ mean |drift| — 22.3% (pulled up by 4 outliers)
→ max |drift| — 74% (on tight-scope questions with short-label nodes)
Direction is mixed. Heuristic often overcounts on short subgraphs (real tokens lower → real ratios even higher on those queries). Undercounts on broader subgraphs by ~5-13%. Takeaway: the "~15% drift" claim is true for the median, breaks down at the tails.
Raw data: tiktoken_drift.json
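The drift metric itself is tiny. A dependency-free sketch; swap the stand-in token counts for real `len(tiktoken.get_encoding("cl100k_base").encode(text))` values to reproduce the check. The sample texts and counts below are illustrative, not my benchmark data:

```python
# Sketch of the drift metric: heuristic estimate (words * 1.33) vs a real
# tokenizer count. Stand-in counts keep this runnable without tiktoken.
import statistics

def heuristic_tokens(text: str) -> float:
    # graphify's documented words * 1.33 estimate
    return len(text.split()) * 1.33

def drift(real_tokens: int, est_tokens: float) -> float:
    # signed relative error of the heuristic against the real count
    return (est_tokens - real_tokens) / real_tokens

# (text, real token count) pairs; illustrative stand-ins, not my data.
samples = [
    ("short label node", 2),                              # heuristic overcounts
    ("a broader subgraph with many plain sentences", 9),  # nearly spot-on
]
drifts = [abs(drift(real, heuristic_tokens(text))) for text, real in samples]
print(round(statistics.median(drifts), 3))  # 0.515 for these two toy samples
```

The two samples deliberately mirror the pattern above: short-label nodes blow the heuristic up, plain prose barely moves it.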
Took tj/commander.js (179 files), saved a baseline manifest, then synthetically:
→ added 1 new file
→ modified 1 existing file
→ deleted 1 existing file
Ran detect_incremental. It correctly flagged all 3 changes by hash-diff against the saved manifest (2 under files_flagged_as_changed for the add + modify, plus 1 under deleted_detected — summed = 3). --update would re-extract only those 3 files (1.7% of the corpus), not the full 179.
Raw data: staleness_test.json
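The manifest idea is simple enough to sketch from scratch. This is my reimplementation of the concept (hash every tracked file, diff against a previously saved path-to-sha256 map), not graphify's actual code:

```python
# My reimplementation of the manifest idea behind detect_incremental.
# Names and the extension list are mine, not graphify's.
import hashlib
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_manifest(root: Path, exts=(".py", ".md", ".js")) -> dict:
    return {str(p.relative_to(root)): sha256(p)
            for p in root.rglob("*") if p.suffix in exts}

def diff_manifests(old: dict, new: dict):
    changed = [p for p, h in new.items() if old.get(p) != h]  # added + modified
    deleted = [p for p in old if p not in new]
    return changed, deleted

# Replaying the synthetic test above: one add, one modify, one delete.
root = Path(tempfile.mkdtemp())
(root / "a.py").write_text("print('a')")
(root / "b.md").write_text("# b")
old = build_manifest(root)
(root / "a.py").write_text("print('a, edited')")  # modify
(root / "c.js").write_text("// new file")         # add
(root / "b.md").unlink()                          # delete
changed, deleted = diff_manifests(old, build_manifest(root))
print(len(changed), len(deleted))  # 2 1, matching the 2 + 1 split above
```

Hash-diffing is why `--update` can re-extract only the touched files instead of the whole corpus.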
Took tj/commander.js, checked out the oldest of 10 recent commits, saved a baseline manifest, then replayed each commit forward running detect_incremental at every step.
Result across 10 commits:
→ mean: 0.1 files flagged as changed per commit
→ 1 commit flagged 1 file (a README tweak — the only commit that touched a tracked file type)
→ the other 9 were dependabot bumps to package.json / lockfiles / CI configs — correctly ignored because graphify's tracked types are code, docs, papers, images, and video files
So running --update after every commit is basically free on typical dev cycles — graphify only sees the commits that matter. Zero false positives, zero misses.
Raw data: longterm_drift.json
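That "correctly ignored" behavior reduces to an extension filter over the paths a commit touched. A sketch, where the tracked-extension set is my guess at the shape of graphify's list, not the real list:

```python
# Sketch of the commit filter: keep only paths whose extension is a
# graphify-style tracked type (code, docs, papers, images, video).
# This TRACKED set is my illustration, not graphify's actual list.
from pathlib import PurePosixPath

TRACKED = {".py", ".js", ".ts", ".md", ".pdf", ".png", ".jpg", ".mp4"}

def relevant_paths(touched: list) -> list:
    return [p for p in touched if PurePosixPath(p).suffix.lower() in TRACKED]

# A dependabot-style commit touches nothing tracked, so --update is a no-op:
print(relevant_paths(["package.json", "package-lock.json", ".github/workflows/ci.yml"]))  # []
# The README tweak is the one commit that triggers re-extraction:
print(relevant_paths(["Readme.md"]))  # ['Readme.md']
```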
Graphify supports 13 agentic CLIs (per upstream). I only had gemini installed, so I tested it end-to-end:
```shell
graphify gemini install     # writes skill + hook
graphify gemini uninstall   # removes cleanly
```

→ ~/.gemini/skills/graphify/SKILL.md — byte-identical to ~/.claude/skills/graphify/SKILL.md (same 54,664 bytes)
→ hook differs per platform API — Claude gets PreToolUse matching Glob|Grep; Gemini gets BeforeTool matching read_file|list_directory
→ project-level integration doc: Claude → CLAUDE.md; Gemini → GEMINI.md
Same skill logic across backends. I couldn't run an end-to-end query on Gemini (no API key handy), but the install protocol is consistent and verifiable.
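To re-verify the byte-identical claim on your own machine, a hash comparison is enough. The paths are the ones graphify writes (listed above); the checker script itself is mine:

```python
# Re-check the byte-identical skill claim after running both
# `graphify install` and `graphify gemini install`.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

claude_skill = Path.home() / ".claude/skills/graphify/SKILL.md"
gemini_skill = Path.home() / ".gemini/skills/graphify/SKILL.md"

if claude_skill.exists() and gemini_skill.exists():
    print("identical:", digest(claude_skill) == digest(gemini_skill))
    print("bytes:", claude_skill.stat().st_size)  # 54,664 on my machine
else:
    print("run both installs first")
```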
Untested by me: Cursor, Codex, Aider, OpenCode, Droid, Trae, Antigravity, Hermes.
Raw data: non_claude_backends.json
Would need a proper experiment: same questions, Claude run twice — once with full-file context, once with graph-scoped subgraph — scored against a rubric. That's a separate afternoon + API budget. Left for a future update.
→ answer quality (see above — the one I still owe) → week-over-week savings in my daily Claude Code work
README.md — this
LAUNCH_POST.md — LinkedIn / X drafts
bench.py — reproduction harness (20 blind queries)
env.json — environment + graph.json SHA-256
benchmark_results.json — raw distribution, both baselines
cross_corpus_results.json — commander.js + vercel/commerce runs
tiktoken_drift.json — per-query heuristic vs real tokens
staleness_test.json — --update synthetic change detection
longterm_drift.json — 10-commit replay of tj/commander.js
non_claude_backends.json — graphify on Gemini CLI verified; claude/gemini diff
assets/ — 4 inline SVGs + cluster GIF + Obsidian still
```shell
pip install graphifyy && graphify install
```

Then `/graphify .` inside Claude Code.
Upstream: safishamsi/graphify — MIT, ~27k stars, local only.
Tool is theirs, not mine. I ran the experiment and wrote this up.
MIT — for the notes, prose, assets, and bench harness in this repo.

