# Obsidian Curator

A Python CLI tool that uses local LLM inference (Ollama) to automatically curate Obsidian vault notes with intelligent tag management and semantic clustering.

## Features

- Automated Frontmatter Generation: Generates/updates YAML frontmatter with tags, category, and summary
- Semantic Tag Clustering: Consolidates fragmented tags using embeddings and hierarchical clustering
- Intelligent Tag Matching: 7-step tag resolution pipeline with fuzzy and semantic matching
- Path-First Categorization: Uses directory structure as the primary category signal
- Safety First: Always creates timestamped backups before modifications
- Dry-Run Mode: Preview changes without modifying files
- Privacy-Preserving: Uses local Ollama for LLM inference
- Interactive Tag Review: Review and approve/reject proposed tags
- Tag Analytics: Detailed statistics and consolidation reports
Before Curator:
- 952 unique tags, 76.5% used only once
- Poor tag reuse (23.5% reuse rate)
- Tags don't connect related notes
After Curator (Phase 2):
- 17 consolidated tags
- 98.2% reduction in tag fragmentation
- 80%+ tag reuse rate
- Meaningful connections between notes
## Table of Contents

- Installation
- Quick Start
- Two-Phase Workflow
- Commands Reference
- Configuration
- How It Works
- Examples
- Troubleshooting
## Installation

Prerequisites:

- Python 3.10+ installed
- Ollama running locally with the required models:

```bash
ollama pull qwen3:14b
ollama pull nomic-embed-text
```
```bash
# 1. Clone this repository
git clone https://github.com/yourusername/obsidian-curator.git
cd obsidian-curator

# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure your vault
cp config.example.yaml config.yaml
cp tags_registry.example.json tags_registry.json

# 5. Edit config.yaml to set your vault path
nano config.yaml  # or your preferred editor
```

## Quick Start

Always start with dry-run mode to preview changes:
```bash
# Activate virtual environment
source venv/bin/activate

# Run with dry-run (safe - no files modified)
python -m curator run --dry-run --max-files 10

# Review the results
cat curator_log.csv
python -m curator show-pending-tags
```

After reviewing dry-run results:

```bash
# Run with actual file modifications
python -m curator run --no-dry-run --max-files 50
```

## Two-Phase Workflow

The curator uses a two-phase approach to consolidate tags effectively:
### Phase 1

Collect all tags that the LLM generates, without any tag ontology:

```bash
# 1. Run curator to collect tags
python -m curator run --dry-run --max-files 500

# 2. This creates new_tags_proposed.csv with all proposed tags
# Result: may generate 900+ unique tags with heavy fragmentation
```

### Phase 2

Build a consolidated tag ontology and re-run:
```bash
# 1. Build consolidated tag ontology using semantic clustering
python -m curator build-ontology --min-freq 3 --target 80 --dry-run

# Review the consolidation report, then apply it:
python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

# 2. Backup Phase 1 data
mv curator_log.csv curator_log_phase1.csv
mv new_tags_proposed.csv new_tags_proposed_phase1.csv

# 3. Re-run curator with consolidated ontology
python -m curator run --dry-run --max-files 500

# 4. Review and apply
python -m curator run --no-dry-run --max-files 500
```

Result: the LLM now sees 50-100 canonical tags and prioritizes reusing them, dramatically reducing tag fragmentation.
## Commands Reference

### `curator run`

```
python -m curator run [OPTIONS]

Options:
  --dry-run / --no-dry-run  Preview changes (default: --dry-run)
  --max-files N             Maximum files to process
  -c, --config PATH         Config file path (default: config.yaml)
```

Examples:

```bash
python -m curator run --dry-run --max-files 10
python -m curator run --no-dry-run --max-files 50
python -m curator run --config my-config.yaml
```

### `curator build-ontology`

```
python -m curator build-ontology [OPTIONS]

Options:
  --min-freq N              Minimum tag occurrences to include (default: 3)
  --target N                Target number of canonical tags (default: 80)
  --similarity FLOAT        Cosine similarity threshold (default: 0.75)
  --dry-run / --no-dry-run  Preview without saving (default: --dry-run)
```

Examples:

```bash
# Preview consolidation
python -m curator build-ontology --min-freq 3 --target 80 --dry-run

# Apply consolidation
python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

# More aggressive clustering (fewer tags)
python -m curator build-ontology --target 50 --similarity 0.70
```

### `curator review-tags`

```
python -m curator review-tags [OPTIONS]
```
```
Options:
  --batch N                  Review N tags before pausing (default: 20)
  --sort {count,alpha,date}  Sort order (default: count)
```

Examples:

```bash
python -m curator review-tags
python -m curator review-tags --batch 50 --sort alpha
```

### `curator show-pending-tags`

```bash
python -m curator show-pending-tags
```

### `curator dump-registry`

```bash
python -m curator dump-registry
```

## Configuration

Edit `config.yaml`:

```yaml
vault:
  path: "/path/to/your/obsidian/vault"
  include_extensions: [".md"]
  exclude_dirs:
    - ".git"
    - ".obsidian"
    - ".trash"

tag_policy:
  fuzzy_threshold: 0.90        # Levenshtein ratio for fuzzy matching
  semantic_threshold: 0.85     # Cosine similarity for semantic matching
  auto_accept_new_tags: false  # Require human review for new tags
  blocked_tags:
    - "note"
    - "notes"
    - "draft"
    - "summary"

processing:
  max_files_per_run: 50
  selection_mode: "modified"   # "all" | "modified" | "random"
  modified_since_days: 30      # For mode="modified"
  dry_run: true                # Safety first!

frontmatter:
  merge_existing_tags: true    # Union of old + new tags
  overwrite_summary: false     # Keep existing summary
  overwrite_category: false    # Keep existing category
  skip_if_status:
    - "done"
    - "frozen"
    - "manual"
```

## How It Works

### Tag Resolution Pipeline

1. Normalize → Lowercase, replace spaces with hyphens
2. Check Blocked → Reject generic tags like "note", "idea"
3. Resolve Alias → Map synonyms (e.g., "ml" → "machine-learning")
4. Exact Match → Check if tag exists in canonical registry
5. Fuzzy Match → Find similar tags (>90% Levenshtein similarity)
6. Semantic Match → Use embeddings to find related tags (>0.85 cosine similarity)
7. New Tag → Auto-accept or propose for human review
### Path-First Categorization

Priority:

1. Path pattern match (e.g., "Resources/" → "Resources")
2. LLM suggestion (if valid category)
3. Default category from config
### Tag Ontology Building

The `build-ontology` command uses:

- Embeddings: Ollama `nomic-embed-text` to vectorize tag names
- Clustering: scikit-learn agglomerative clustering with cosine similarity
- Representative Selection: Highest-frequency tag per cluster becomes canonical
- Alias Creation: Other cluster members become aliases
## Examples

### Dry run

```
$ python -m curator run --dry-run --max-files 5

DRY RUN MODE - No files will be modified

Processing: research-notes.md... ✓
Processing: meeting-minutes.md... ✓
Processing: project-ideas.md... ✓

Summary:
  Files Processed: 3
  New Tags Proposed: 12
  Success Rate: 100%
```

### Building the tag ontology

```
$ python -m curator build-ontology --min-freq 3 --target 80 --dry-run

Building Tag Ontology

Step 1: Loading proposed tags from CSV...
  Loaded 83 tags with frequency >= 3
  Total unique tags: 952

Step 2: Computing embeddings...
  Computing embeddings: 100% (83/83)

Step 3: Clustering tags...
  Created 80 clusters

Tag Ontology Consolidation Report
  Tags before: 952
  Tags after: 80
  Reduction: 91.6%

Top 20 Tag Clusters by Frequency:
┌──────────────────────┬───────────┬──────────────┬────────────────────┐
│ Representative       │ Frequency │ Cluster Size │ Members            │
├──────────────────────┼───────────┼──────────────┼────────────────────┤
│ statistical-analysis │ 13        │ 2            │ data-analysis, ... │
│ data-visualization   │ 12        │ 1            │ data-visualization │
│ anova                │ 10        │ 1            │ anova              │
└──────────────────────┴───────────┴──────────────┴────────────────────┘
```

### Interactive tag review

```
$ python -m curator review-tags --batch 10

Tag Review Session

Tag: machine-learning
  Variations: machine learning, ML, ml
  Occurrences: 47
  Example: ~/Atlas/ML/supervised-learning.md

[A]ccept / [R]eject / [S]kip / [Q]uit: a
✓ Accepted

Summary:
  Accepted: 15
  Rejected: 3
  Skipped: 2
```

## Project Structure

```
obsidian-curator/
├── curator/
│   ├── __init__.py
│   ├── main.py              # CLI entry point (Typer commands)
│   ├── schemas.py           # Pydantic models for validation
│   ├── gatekeeper.py        # Tag governance and resolution
│   ├── analyzer.py          # LLM interaction (Ollama)
│   └── executor.py          # File I/O and backup management
├── config.example.yaml      # Example configuration
├── config.yaml              # Your configuration (gitignored)
├── tags_registry.example.json
├── tags_registry.json       # Tag ontology (gitignored)
├── requirements.txt
├── CLAUDE.md                # Development guidelines
└── README.md                # This file
```
## Safety Features

- ✅ Timestamped Backups: Created in `.backup/` before any modification
- ✅ Dry-Run Default: Must explicitly use `--no-dry-run` to modify files
- ✅ Comprehensive Logging: All changes logged to `curator_log.csv`
- ✅ Skip Protection: Won't modify files with status "done", "frozen", or "manual"
- ✅ Merge Mode: Preserves existing tags while adding new ones
- ✅ Validation: Pydantic schemas ensure data integrity
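Merge Mode's union behavior can be sketched as an order-preserving union. This is an illustrative sketch, not the curator's actual implementation:

```python
def merge_tags(existing: list[str], proposed: list[str]) -> list[str]:
    """Union of old + new tags: keep original order, drop duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for tag in existing + proposed:
        if tag not in seen:
            seen.add(tag)
            merged.append(tag)
    return merged

# Existing tags stay first; only genuinely new tags are appended.
print(merge_tags(["statistics", "anova"], ["anova", "statistical-analysis"]))
```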
## Troubleshooting

If Ollama isn't responding:

```bash
# Check if Ollama is running
curl http://localhost:11434/

# If not, start Ollama
ollama serve
```

For import errors, make sure you're in the virtual environment:

```bash
source venv/bin/activate
which python  # Should show a path inside venv/
```

If no files are processed, check:

- The vault path in `config.yaml` is correct
- Files aren't excluded by `exclude_dirs`
- Files don't have a skipped status in frontmatter
- `max_files_per_run` isn't set too low
If LLM output is truncated:

- Check `num_predict` in `analyzer.py` (should be >= 2000)
- Try a different LLM model
- Reduce `max_input_chars` to give the LLM more tokens for output

If processing is slow:

- Use `--max-files` to limit batch size
- Check that Ollama is using GPU acceleration
- Consider using a smaller LLM model for testing
Typical processing speed (on RTX 5070 Ti, 16GB VRAM):
- LLM Analysis: 20-30 seconds per file
- Tag Resolution: <1 second per tag
- Embedding Generation: ~0.5 seconds per tag
- Overall: ~40-60 files per hour
Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
See CLAUDE.md for:
- Detailed architecture notes
- Implementation guidelines
- Testing strategies
- Common pitfalls to avoid
MIT License - See LICENSE file for details
- Built with Ollama for local LLM inference
- Uses sentence-transformers for semantic similarity
- Powered by Typer for CLI
- Styled with Rich for beautiful terminal output
⚠️ Always use `--dry-run` first!

💡 Tip: Start with a small subset of files (`--max-files 10`) to test configuration before processing your entire vault.