# Obsidian Curator

A Python CLI tool that uses local LLM inference (Ollama) to automatically curate Obsidian vault notes with intelligent tag management and semantic clustering.

## Features

- Automated Frontmatter Generation: Generates/updates YAML frontmatter with tags, category, and summary
- Semantic Tag Clustering: Consolidates fragmented tags using embeddings and hierarchical clustering
- Intelligent Tag Matching: 7-step tag resolution pipeline with fuzzy and semantic matching
- Path-First Categorization: Uses directory structure as the primary category signal
- Safety First: Always creates timestamped backups before modifications
- Dry-Run Mode: Preview changes without modifying files
- Privacy-Preserving: Uses local Ollama for LLM inference
- Interactive Tag Review: Review and approve/reject proposed tags
- Tag Analytics: Detailed statistics and consolidation reports
Before Curator:
- 952 unique tags, 76.5% used only once
- Poor tag reuse (23.5% reuse rate)
- Tags don't connect related notes
After Curator (Phase 2):
- 17 consolidated tags
- 98.2% reduction in tag fragmentation
- 80%+ tag reuse rate
- Meaningful connections between notes
## Table of Contents

- Installation
- Quick Start
- Two-Phase Workflow
- Commands Reference
- Configuration
- How It Works
- Examples
- Troubleshooting
## Installation

Prerequisites:

- Python 3.10+ installed
- Ollama running locally with the required models:

```bash
ollama pull qwen3:14b
ollama pull nomic-embed-text
```
```bash
# 1. Clone this repository
git clone https://github.com/yourusername/obsidian-curator.git
cd obsidian-curator

# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure your vault
cp config.example.yaml config.yaml
cp tags_registry.example.json tags_registry.json

# 5. Edit config.yaml to set your vault path
nano config.yaml  # or your preferred editor
```

## Quick Start

Always start with dry-run mode to preview changes:
```bash
# Activate virtual environment
source venv/bin/activate

# Run with dry-run (safe - no files modified)
python -m curator run --dry-run --max-files 10

# Review the results
cat curator_log.csv
python -m curator show-pending-tags
```

After reviewing dry-run results:

```bash
# Run with actual file modifications
python -m curator run --no-dry-run --max-files 50
```

## Two-Phase Workflow

The curator uses a two-phase approach to consolidate tags effectively:
### Phase 1

Collect all tags that the LLM generates, without any tag ontology:

```bash
# 1. Run curator to collect tags
python -m curator run --dry-run --max-files 500

# 2. This creates new_tags_proposed.csv with all proposed tags
# Result: may generate 900+ unique tags with heavy fragmentation
```

### Phase 2

Build a consolidated tag ontology and re-run:
```bash
# 1. Build consolidated tag ontology using semantic clustering
python -m curator build-ontology --min-freq 3 --target 80 --dry-run

# Review the consolidation report, then apply it:
python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

# 2. Backup Phase 1 data
mv curator_log.csv curator_log_phase1.csv
mv new_tags_proposed.csv new_tags_proposed_phase1.csv

# 3. Re-run curator with consolidated ontology
python -m curator run --dry-run --max-files 500

# 4. Review and apply
python -m curator run --no-dry-run --max-files 500
```

Result: the LLM now sees 50-100 canonical tags and prioritizes reusing them, dramatically reducing tag fragmentation.
## Commands Reference

### `curator run`

```
python -m curator run [OPTIONS]

Options:
  --dry-run / --no-dry-run  Preview changes (default: --dry-run)
  --max-files N             Maximum files to process
  -c, --config PATH         Config file path (default: config.yaml)
```

Examples:

```bash
python -m curator run --dry-run --max-files 10
python -m curator run --no-dry-run --max-files 50
python -m curator run --config my-config.yaml
```

### `curator build-ontology`

```
python -m curator build-ontology [OPTIONS]

Options:
  --min-freq N              Minimum tag occurrences to include (default: 3)
  --target N                Target number of canonical tags (default: 80)
  --similarity FLOAT        Cosine similarity threshold (default: 0.75)
  --dry-run / --no-dry-run  Preview without saving (default: --dry-run)
```

Examples:

```bash
# Preview consolidation
python -m curator build-ontology --min-freq 3 --target 80 --dry-run

# Apply consolidation
python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

# More aggressive clustering (fewer tags)
python -m curator build-ontology --target 50 --similarity 0.70
```

### `curator review-tags`

```
python -m curator review-tags [OPTIONS]
```
```
Options:
  --batch N                  Review N tags before pausing (default: 20)
  --sort {count,alpha,date}  Sort order (default: count)
```

Examples:

```bash
python -m curator review-tags
python -m curator review-tags --batch 50 --sort alpha
```

### `curator show-pending-tags`

```bash
python -m curator show-pending-tags
```

### `curator dump-registry`

```bash
python -m curator dump-registry
```

## Configuration

Edit `config.yaml`:

```yaml
vault:
  path: "/path/to/your/obsidian/vault"
  include_extensions: [".md"]
  exclude_dirs:
    - ".git"
    - ".obsidian"
    - ".trash"

tag_policy:
  fuzzy_threshold: 0.90        # Levenshtein ratio for fuzzy matching
  semantic_threshold: 0.85     # Cosine similarity for semantic matching
  auto_accept_new_tags: false  # Require human review for new tags
  blocked_tags:
    - "note"
    - "notes"
    - "draft"
    - "summary"

processing:
  max_files_per_run: 50
  selection_mode: "modified"   # "all" | "modified" | "random"
  modified_since_days: 30      # For mode="modified"
  dry_run: true                # Safety first!

frontmatter:
  merge_existing_tags: true    # Union of old + new tags
  overwrite_summary: false     # Keep existing summary
  overwrite_category: false    # Keep existing category
  skip_if_status:
    - "done"
    - "frozen"
    - "manual"
```

## How It Works

### Tag Resolution Pipeline

1. Normalize → Lowercase, replace spaces with hyphens
2. Check Blocked → Reject generic tags like "note", "idea"
3. Resolve Alias → Map synonyms (e.g., "ml" → "machine-learning")
4. Exact Match → Check if tag exists in canonical registry
5. Fuzzy Match → Find similar tags (>90% Levenshtein similarity)
6. Semantic Match → Use embeddings to find related tags (>0.85 cosine similarity)
7. New Tag → Auto-accept or propose for human review
### Path-First Categorization

Priority:

1. Path pattern match (e.g., "Resources/" → "Resources")
2. LLM suggestion (if valid category)
3. Default category from config
### Tag Ontology Building

The `build-ontology` command uses:

- Embeddings: Ollama `nomic-embed-text` to vectorize tag names
- Clustering: scikit-learn agglomerative clustering with cosine similarity
- Representative Selection: Highest-frequency tag per cluster becomes canonical
- Alias Creation: Other cluster members become aliases
## Examples

### Dry run

```
$ python -m curator run --dry-run --max-files 5

DRY RUN MODE - No files will be modified

Processing: research-notes.md... ✓
Processing: meeting-minutes.md... ✓
Processing: project-ideas.md... ✓

Summary:
  Files Processed: 3
  New Tags Proposed: 12
  Success Rate: 100%
```

### Building the tag ontology

```
$ python -m curator build-ontology --min-freq 3 --target 80 --dry-run

Building Tag Ontology

Step 1: Loading proposed tags from CSV...
  Loaded 83 tags with frequency >= 3
  Total unique tags: 952

Step 2: Computing embeddings...
  Computing embeddings: 100% (83/83)

Step 3: Clustering tags...
  Created 80 clusters

Tag Ontology Consolidation Report
  Tags before: 952
  Tags after: 80
  Reduction: 91.6%

Top 20 Tag Clusters by Frequency:
┌──────────────────────┬───────────┬──────────────┬────────────────────┐
│ Representative       │ Frequency │ Cluster Size │ Members            │
├──────────────────────┼───────────┼──────────────┼────────────────────┤
│ statistical-analysis │ 13        │ 2            │ data-analysis, ... │
│ data-visualization   │ 12        │ 1            │ data-visualization │
│ anova                │ 10        │ 1            │ anova              │
└──────────────────────┴───────────┴──────────────┴────────────────────┘
```

### Interactive tag review

```
$ python -m curator review-tags --batch 10

Tag Review Session

Tag: machine-learning
  Variations: machine learning, ML, ml
  Occurrences: 47
  Example: ~/Atlas/ML/supervised-learning.md

[A]ccept / [R]eject / [S]kip / [Q]uit: a
✓ Accepted

Summary:
  Accepted: 15
  Rejected: 3
  Skipped: 2
```

## Project Structure

```
obsidian-curator/
├── curator/
│   ├── __init__.py
│   ├── main.py              # CLI entry point (Typer commands)
│   ├── schemas.py           # Pydantic models for validation
│   ├── gatekeeper.py        # Tag governance and resolution
│   ├── analyzer.py          # LLM interaction (Ollama)
│   └── executor.py          # File I/O and backup management
├── config.example.yaml      # Example configuration
├── config.yaml              # Your configuration (gitignored)
├── tags_registry.example.json
├── tags_registry.json       # Tag ontology (gitignored)
├── requirements.txt
├── CLAUDE.md                # Development guidelines
└── README.md                # This file
```
## Safety Features

- ✅ Timestamped Backups: Created in `.backup/` before any modification
- ✅ Dry-Run Default: Must explicitly use `--no-dry-run` to modify files
- ✅ Comprehensive Logging: All changes logged to `curator_log.csv`
- ✅ Skip Protection: Won't modify files with status "done", "frozen", or "manual"
- ✅ Merge Mode: Preserves existing tags while adding new ones
- ✅ Validation: Pydantic schemas ensure data integrity
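Merge Mode's union behavior can be sketched as an order-preserving union. This is an illustrative sketch, not the curator's actual implementation:

```python
def merge_tags(existing: list[str], proposed: list[str]) -> list[str]:
    """Union of old + new tags: keep original order, drop duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for tag in existing + proposed:
        if tag not in seen:
            seen.add(tag)
            merged.append(tag)
    return merged

# Existing tags stay first; only genuinely new tags are appended.
print(merge_tags(["statistics", "anova"], ["anova", "statistical-analysis"]))
```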
## Troubleshooting

If Ollama isn't responding:

```bash
# Check if Ollama is running
curl http://localhost:11434/

# If not, start Ollama
ollama serve
```

For import errors, make sure you're in the virtual environment:

```bash
source venv/bin/activate
which python  # Should show a path inside venv/
```

If no files are processed, check:

- The vault path in `config.yaml` is correct
- Files aren't excluded by `exclude_dirs`
- Files don't have a skipped status in frontmatter
- `max_files_per_run` isn't set too low
If LLM output is truncated:

- Check `num_predict` in `analyzer.py` (should be >= 2000)
- Try a different LLM model
- Reduce `max_input_chars` to give the LLM more tokens for output

If processing is slow:

- Use `--max-files` to limit batch size
- Check that Ollama is using GPU acceleration
- Consider using a smaller LLM model for testing
Typical processing speed (on RTX 5070 Ti, 16GB VRAM):
- LLM Analysis: 20-30 seconds per file
- Tag Resolution: <1 second per tag
- Embedding Generation: ~0.5 seconds per tag
- Overall: ~40-60 files per hour
Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
See CLAUDE.md for:
- Detailed architecture notes
- Implementation guidelines
- Testing strategies
- Common pitfalls to avoid
MIT License - See LICENSE file for details
- Built with Ollama for local LLM inference
- Uses sentence-transformers for semantic similarity
- Powered by Typer for CLI
- Styled with Rich for beautiful terminal output
⚠️ Always use `--dry-run` first!

💡 Tip: Start with a small subset of files (`--max-files 10`) to test configuration before processing your entire vault.