
graphify-notes

Graphify: source crystallizing into a knowledge graph

I tried safishamsi/graphify on a private codebase. Notes from the run.


Did it help me?

What I ran, what the tool reports, what I didn't test

Short answer — yes, but the bit that actually helped was the thing graphify doesn't advertise.

The headline number ("fewer tokens per Claude query") is real in graphify's own benchmark. I haven't used graphify enough days to show you a week-over-week drop in my own Claude bill. I'll say that instead of pretending.


What it ran

Detect → AST + LLM extract → cluster → outputs

pip install graphifyy && graphify install
# then inside Claude Code:
/graphify .

→ 4 minutes
→ 445 nodes · 446 edges · 118 communities
→ graph.html · graph.json · GRAPH_REPORT.md · Obsidian vault (563 notes)
→ local only — no Neo4j, no vector DB, no extra API keys
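Those output counts are easy to sanity-check from graph.json itself. A minimal sketch, assuming a hypothetical schema (a top-level `nodes` list where each node carries a `community` id, plus an `edges` list) — graphify's real layout may differ, so the stand-in graph below is illustrative only:

```python
from collections import Counter

# Tiny stand-in graph with the assumed shape (not graphify's real schema).
graph = {
    "nodes": [
        {"id": "a", "community": 0},
        {"id": "b", "community": 0},
        {"id": "c", "community": 1},
    ],
    "edges": [{"source": "a", "target": "b"}, {"source": "b", "target": "c"}],
}

def summarize(g):
    """Return (node count, edge count, community count) for a graph dict."""
    communities = Counter(n["community"] for n in g["nodes"])
    return len(g["nodes"]), len(g["edges"]), len(communities)

print(summarize(graph))  # (3, 2, 2)
```

On a real run you would `json.load` the generated graph.json instead of the inline dict and expect something like (445, 446, 118).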


What it looks like

Force-layout animation — nodes self-organizing into 118 communities

Force-layout settling. Every color is one Leiden community.

Obsidian graph view of the 445-node knowledge graph, 118 communities color-coded

Same graph inside Obsidian's native graph view. Labels stripped in this crop — the actual vault has one note per graph node + one overview per community + a graph.canvas with structured groups.


The bit I didn't expect

6 conflicting docs collapsed into 1 canonical

Graphify's report flagged a "surprising connection" — two of my own docs looked too similar to each other.

I traced it. Found 6 different value-prop framings across 6 strategy docs. No two agreed.

Fix:

→ wrote one canonical MOAT.md
→ banner-linked the other six to it
→ ran graphify --update
→ duplicate semantic edges dropped to zero

20 minutes. Real bug, gone. Before I'd even looked at the benchmark.

This is the use case no viral post is talking about. Graphify as a ruthless editor for your own strategy docs.
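The kind of check behind that "surprising connection" flag can be sketched in a few lines. Everything here is hypothetical — the file names, the `kind`/`weight` edge fields, and the 0.9 threshold are my illustration, not graphify's internals:

```python
# Hypothetical edge records; graphify's real report logic is its own.
edges = [
    {"source": "MOAT.md", "target": "strategy_v2.md", "kind": "semantic", "weight": 0.93},
    {"source": "MOAT.md", "target": "pitch.md", "kind": "semantic", "weight": 0.91},
    {"source": "MOAT.md", "target": "api.md", "kind": "imports", "weight": 1.0},
]

def near_duplicates(edges, threshold=0.9):
    """Semantic edges above a similarity threshold: candidate doc overlaps."""
    return [
        (e["source"], e["target"])
        for e in edges
        if e["kind"] == "semantic" and e["weight"] >= threshold
    ]

print(near_duplicates(edges))
# [('MOAT.md', 'strategy_v2.md'), ('MOAT.md', 'pitch.md')]
```

Once the canonical doc exists and the others point at it, re-running this check should return an empty list — which is exactly what "duplicate semantic edges dropped to zero" means.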


The token number

Graphify's published claim: 71.5× fewer tokens per query.

I ran 20 generic engineering questions against the graph. 19 matched. Two baselines:

| Compared against | typical | worst | best |
| --- | --- | --- | --- |
| graphify's default baseline | ~70× | 8.5× | 549× |
| my full corpus size (126k words) | ~400× | 48× | 3,120× |

My ~70× lines up with graphify's 71.5× almost exactly. Their number replicates.

The ~400× comes from comparing against the full corpus word count, which is closer to what Claude Code actually reads when it dumps files into context.

I'm not using the 3,120× peak as the headline — that's one tight question, not a real distribution.

Raw data: benchmark_results.json
Environment + graph SHA-256: env.json

Run it yourself:

python3 bench.py /path/to/your/graphify-out/graph.json
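The core arithmetic bench.py performs is just a per-query ratio plus a median. A minimal sketch with made-up numbers — the real harness and its field names may differ:

```python
from statistics import median

# Hypothetical per-query token counts, not real benchmark data.
# ratio = tokens the baseline would spend / tokens the graph-scoped path spends
queries = [
    {"baseline_tokens": 84_000, "subgraph_tokens": 1_200},
    {"baseline_tokens": 60_000, "subgraph_tokens": 900},
    {"baseline_tokens": 120_000, "subgraph_tokens": 1_500},
]

ratios = [q["baseline_tokens"] / q["subgraph_tokens"] for q in queries]
print(f"median reduction: {median(ratios):.0f}×")  # median reduction: 70×
```

The median is the right headline statistic here precisely because of the fat right tail — one 3,120× outlier would wreck a mean.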

What I haven't tested

→ answer quality — I measured tokens, not whether the graph-scoped answers are as good as full-file answers
→ week-over-week savings in my daily Claude Code work — I built the graph, I haven't yet lived with it long enough to show you a real bill

When I run that last experiment, I'll add the receipt here.


Update 2026-04-15 — I went and tested three of the four

The "didn't test" list was too long. Went and ran them. Findings:

1. Other codebases — ratios vary a lot

Ran graphify + my bench harness on 2 public repos:

| Corpus | Files | Words | Default median | Measured median |
| --- | --- | --- | --- | --- |
| vercel/commerce (small Next.js) | 67 | 10k | 252× | 349× |
| tj/commander.js (CLI library) | 179 | 86k | 22× | 87× |
| my private corpus | 187 | 126k | 67× | 379× |

Graphify's 71.5× claim sits in the middle of a 22×–252× range on default baseline.

Pattern: app code with cross-file imports (commerce, mine) gets higher ratios. Library code with tight internal structure (commander) gets lower. The headline number is not universal — it's corpus-shape-dependent.

Raw data: cross_corpus_results.json

2. Tiktoken drift — median 6%, tails 74%

Ran OpenAI's cl100k_base encoder (a reasonable public proxy — Anthropic ships its own counter, but cl100k_base is the closest off-the-shelf tokenizer for a directional check) on the actual subgraph text graphify would send for 15 of my queries. Compared against graphify's words × 1.33 heuristic.

→ median |drift| — 6.1% (heuristic is roughly right)
→ mean |drift| — 22.3% (pulled up by 4 outliers)
→ max |drift| — 74% (on tight-scope questions with short-label nodes)

Direction is mixed. Heuristic often overcounts on short subgraphs (real tokens lower → real ratios even higher on those queries). Undercounts on broader subgraphs by ~5-13%. Takeaway: the "~15% drift" claim is true for the median, breaks down at the tails.
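The drift computation itself is trivial; the signal is in where it blows up. A sketch of the comparison, with made-up "real" counts — only the words × 1.33 heuristic is taken from the writeup:

```python
# The heuristic from the writeup: estimated tokens ≈ words × 1.33.
def heuristic_tokens(word_count):
    return word_count * 1.33

def drift(estimated, real):
    """Relative error of the estimate against a real tokenizer count."""
    return abs(estimated - real) / real

# Illustrative values only, not measured data:
# a broad subgraph tracks the heuristic closely...
print(f"{drift(heuristic_tokens(1000), 1250):.1%}")  # 6.4%
# ...a short-label subgraph (dense identifiers, few words) does not.
print(f"{drift(heuristic_tokens(40), 92):.1%}")      # 42.2%
```

Short identifier-heavy text tokenizes much worse per word than prose, which is why the tails diverge while the median holds.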

Raw data: tiktoken_drift.json

3. Graph staleness (--update) — works as advertised

Took tj/commander.js (179 files), saved a baseline manifest, then synthetically:
→ added 1 new file
→ modified 1 existing file
→ deleted 1 existing file

Ran detect_incremental. It correctly flagged all 3 changes by hash-diff against the saved manifest (2 under files_flagged_as_changed for the add + modify, plus 1 under deleted_detected — summed = 3). --update would re-extract only those 3 files (1.7% of the corpus), not the full 179.
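Hash-diffing against a saved manifest is a small, well-defined operation. A sketch under my own naming — graphify's real manifest format and internals aren't something I've read, so `diff_manifests` and the `{path: sha256}` shape are assumptions:

```python
# Illustrative staleness check: diff two {path: sha256} manifests.
def diff_manifests(old, new):
    """Return (added, modified, deleted) paths between two hash manifests."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, modified, deleted

old = {"a.js": "h1", "b.js": "h2", "c.js": "h3"}
new = {"a.js": "h1", "b.js": "h2x", "d.js": "h4"}  # b modified, c deleted, d added
print(diff_manifests(old, new))
# (['d.js'], ['b.js'], ['c.js'])
```

An incremental update then only re-extracts the added + modified paths — which is why 3 flagged files out of 179 means 1.7% of the work.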

Raw data: staleness_test.json

4. Long-term drift — 10-commit replay on commander.js

Took tj/commander.js, checked out the oldest of 10 recent commits, saved a baseline manifest, then replayed each commit forward running detect_incremental at every step.

Result across 10 commits:
→ mean: 0.1 files flagged as changed per commit
→ 1 commit flagged 1 file (a README tweak — the only commit that touched a tracked file type)
→ the other 9 were dependabot bumps to package.json / lockfiles / CI configs — correctly ignored because graphify's tracked types are code, docs, papers, images, and video files

So running --update after every commit is basically free on typical dev cycles — graphify only sees the commits that matter. Zero false positives, zero misses.
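The tracked-type filter is the whole trick here. A sketch of the predicate — the extension sets below are my guesses at "code, docs, papers, images, video", not graphify's actual lists:

```python
from pathlib import Path

# Illustrative extension sets; graphify's real tracked lists may differ.
TRACKED = {
    ".py", ".js", ".ts",       # code
    ".md", ".rst",             # docs
    ".pdf",                    # papers
    ".png", ".jpg", ".svg",    # images
    ".mp4", ".mov",            # video
}

def is_tracked(path):
    """True if a change-detection pass would care about this file."""
    return Path(path).suffix.lower() in TRACKED

commit_files = ["package.json", ".github/workflows/ci.yml", "Readme.md"]
print([f for f in commit_files if is_tracked(f)])  # ['Readme.md']
```

Run over the 10-commit replay, a filter like this drops the nine dependabot commits to zero flagged files before any hashing happens.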

Raw data: longterm_drift.json

5. Non-Claude backends — Gemini CLI verified

Graphify supports 13 agentic CLIs (per upstream). I only had gemini installed, so I tested it end-to-end:

graphify gemini install   # writes skill + hook
graphify gemini uninstall # removes cleanly

→ ~/.gemini/skills/graphify/SKILL.md is byte-identical to ~/.claude/skills/graphify/SKILL.md (same 54,664 bytes)
→ hook differs per platform API — Claude gets PreToolUse matching Glob|Grep; Gemini gets BeforeTool matching read_file|list_directory
→ project-level integration doc: Claude → CLAUDE.md; Gemini → GEMINI.md

Same skill logic across backends. I couldn't run an end-to-end query on Gemini (no API key handy), but the install protocol is consistent and verifiable.

Untested by me: Cursor, Codex, Aider, OpenCode, Droid, Trae, Antigravity, Hermes.

Raw data: non_claude_backends.json

6. Answer quality — still NOT tested

Would need a proper experiment: same questions, Claude run twice — once with full-file context, once with graph-scoped subgraph — scored against a rubric. That's a separate afternoon + API budget. Left for a future update.


Still-open items

→ answer quality (see above — the one I still owe)
→ week-over-week savings in my daily Claude Code work


In this repo

README.md                 — this
LAUNCH_POST.md            — LinkedIn / X drafts
bench.py                  — reproduction harness (20 blind queries)
env.json                  — environment + graph.json SHA-256
benchmark_results.json    — raw distribution, both baselines
cross_corpus_results.json — commander.js + vercel/commerce runs
tiktoken_drift.json       — per-query heuristic vs real tokens
staleness_test.json       — --update synthetic change detection
longterm_drift.json       — 10-commit replay of tj/commander.js
non_claude_backends.json  — graphify on Gemini CLI verified; claude/gemini diff
assets/                   — 4 inline SVGs + cluster GIF + Obsidian still

Install (upstream)

pip install graphifyy && graphify install

Then /graphify . inside Claude Code.

Upstream: safishamsi/graphify — MIT, ~27k stars, local only.

Credit + license

Tool is theirs, not mine. I ran the experiment and wrote this up.

MIT — for the notes, prose, assets, and bench harness in this repo.
