
πŸ›οΈ Obsidian Local Curator

A Python CLI tool that uses local LLM inference (Ollama) to automatically curate Obsidian vault notes with intelligent tag management and semantic clustering.

Python 3.10+ Β· MIT License

✨ Features

  • πŸ€– Automated Frontmatter Generation: Generates/updates YAML frontmatter with tags, category, and summary
  • 🏷️ Semantic Tag Clustering: Consolidates fragmented tags using embeddings and hierarchical clustering
  • πŸ” Intelligent Tag Matching: 7-step tag resolution pipeline with fuzzy and semantic matching
  • πŸ“‚ Path-First Categorization: Uses directory structure as primary category signal
  • πŸ”’ Safety First: Always creates timestamped backups before modifications
  • πŸ‘οΈ Dry-Run Mode: Preview changes without modifying files
  • πŸ” Privacy-Preserving: Uses local Ollama for LLM inference
  • πŸ“Š Interactive Tag Review: Review and approve/reject proposed tags
  • πŸ“ˆ Tag Analytics: Detailed statistics and consolidation reports

🎯 Problem Solved

Before Curator:

  • 952 unique tags, 76.5% used only once
  • Poor tag reuse (23.5% reuse rate)
  • Tags don't connect related notes

After Curator (Phase 2):

  • 17 consolidated tags
  • 98.2% reduction in tag fragmentation
  • 80%+ tag reuse rate
  • Meaningful connections between notes

πŸ“‹ Table of Contents

  β€’ πŸš€ Installation
  β€’ 🎬 Quick Start
  β€’ πŸ”„ Two-Phase Workflow
  β€’ πŸ“š Commands Reference
  β€’ βš™οΈ Configuration
  β€’ πŸ”§ How It Works
  β€’ πŸ“Š Examples
  β€’ πŸ“ Project Structure
  β€’ πŸ›‘οΈ Safety Features
  β€’ πŸ› Troubleshooting
  β€’ πŸ“ˆ Performance
  β€’ 🀝 Contributing
  β€’ πŸ“ Development
  β€’ πŸ“„ License

πŸš€ Installation

Prerequisites

  1. Python 3.10+ installed
  2. Ollama running locally with required models:
    ollama pull qwen3:14b
    ollama pull nomic-embed-text

Setup

# 1. Clone this repository
git clone https://github.com/yourusername/obsidian-curator.git
cd obsidian-curator

# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure your vault
cp config.example.yaml config.yaml
cp tags_registry.example.json tags_registry.json

# 5. Edit config.yaml to set your vault path
nano config.yaml  # or your preferred editor

🎬 Quick Start

First-Time Users: Dry-Run

Always start with dry-run mode to preview changes:

# Activate virtual environment
source venv/bin/activate

# Run with dry-run (safe - no files modified)
python -m curator run --dry-run --max-files 10

# Review the results
cat curator_log.csv
python -m curator show-pending-tags

Running for Real

After reviewing dry-run results:

# Run with actual file modifications
python -m curator run --no-dry-run --max-files 50

πŸ”„ Two-Phase Workflow

The curator uses a two-phase workflow to consolidate tags effectively:

Phase 1: Tag Collection

Collect all tags that the LLM generates without any tag ontology:

# 1. Run curator to collect tags
python -m curator run --dry-run --max-files 500

# 2. This creates new_tags_proposed.csv with all proposed tags
# Result: May generate 900+ unique tags with heavy fragmentation

Phase 2: Ontology Consolidation

Build a consolidated tag ontology and re-run:

# 1. Build consolidated tag ontology using semantic clustering
python -m curator build-ontology --min-freq 3 --target 80 --dry-run

# Review the consolidation report, then apply it:
python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

# 2. Backup Phase 1 data
mv curator_log.csv curator_log_phase1.csv
mv new_tags_proposed.csv new_tags_proposed_phase1.csv

# 3. Re-run curator with consolidated ontology
python -m curator run --dry-run --max-files 500

# 4. Review and apply
python -m curator run --no-dry-run --max-files 500

Result: LLM now sees 50-100 canonical tags and prioritizes reusing them, dramatically reducing tag fragmentation.

πŸ“š Commands Reference

Main Commands

run - Process Notes

python -m curator run [OPTIONS]

Options:
  --dry-run / --no-dry-run    Preview changes (default: --dry-run)
  --max-files N               Maximum files to process
  -c, --config PATH           Config file path (default: config.yaml)

Examples:
  python -m curator run --dry-run --max-files 10
  python -m curator run --no-dry-run --max-files 50
  python -m curator run --config my-config.yaml

build-ontology - Consolidate Tags

python -m curator build-ontology [OPTIONS]

Options:
  --min-freq N                Minimum tag occurrences to include (default: 3)
  --target N                  Target number of canonical tags (default: 80)
  --similarity FLOAT          Cosine similarity threshold (default: 0.75)
  --dry-run / --no-dry-run    Preview without saving (default: --dry-run)

Examples:
  # Preview consolidation
  python -m curator build-ontology --min-freq 3 --target 80 --dry-run

  # Apply consolidation
  python -m curator build-ontology --min-freq 3 --target 80 --no-dry-run

  # More aggressive clustering (fewer tags)
  python -m curator build-ontology --target 50 --similarity 0.70

review-tags - Interactive Tag Review

python -m curator review-tags [OPTIONS]

Options:
  --batch N                   Review N tags before pausing (default: 20)
  --sort {count,alpha,date}   Sort order (default: count)

Examples:
  python -m curator review-tags
  python -m curator review-tags --batch 50 --sort alpha

show-pending-tags - View Proposed Tags

python -m curator show-pending-tags

dump-registry - View Tag Registry

python -m curator dump-registry

βš™οΈ Configuration

Key Configuration Sections

Vault Settings

vault:
  path: "/path/to/your/obsidian/vault"
  include_extensions: [".md"]
  exclude_dirs:
    - ".git"
    - ".obsidian"
    - ".trash"

Tag Policy

tag_policy:
  fuzzy_threshold: 0.90        # Levenshtein ratio for fuzzy matching
  semantic_threshold: 0.85     # Cosine similarity for semantic matching
  auto_accept_new_tags: false  # Require human review for new tags
  blocked_tags:
    - "note"
    - "notes"
    - "draft"
    - "summary"
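
As a rough illustration of what the fuzzy_threshold setting controls, a Levenshtein-style similarity ratio can be approximated with Python's standard difflib (the tool's actual implementation may use a dedicated Levenshtein library; fuzzy_match here is a hypothetical helper):

```python
from __future__ import annotations
from difflib import SequenceMatcher

def fuzzy_match(candidate: str, canonical_tags: list[str], threshold: float = 0.90) -> str | None:
    """Return the closest canonical tag if its similarity ratio clears the threshold."""
    best_tag, best_score = None, 0.0
    for tag in canonical_tags:
        score = SequenceMatcher(None, candidate, tag).ratio()
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= threshold else None

print(fuzzy_match("machine-learnin", ["machine-learning", "statistics"]))
# β†’ machine-learning (a near-typo clears the 0.90 threshold)
```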

Processing Control

processing:
  max_files_per_run: 50
  selection_mode: "modified"   # "all" | "modified" | "random"
  modified_since_days: 30      # For mode="modified"
  dry_run: true                # Safety first!

Frontmatter Behavior

frontmatter:
  merge_existing_tags: true      # Union of old + new tags
  overwrite_summary: false       # Keep existing summary
  overwrite_category: false      # Keep existing category
  skip_if_status:
    - "done"
    - "frozen"
    - "manual"
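
The merge semantics above can be sketched as follows (merge_frontmatter is a hypothetical helper, not the tool's actual code):

```python
def merge_frontmatter(existing: dict, generated: dict, cfg: dict) -> dict:
    """Apply the merge rules: union tags, keep or overwrite summary/category."""
    merged = dict(existing)
    if cfg.get("merge_existing_tags", True):
        # Union of old + new tags, preserving order and dropping duplicates
        merged["tags"] = list(dict.fromkeys(existing.get("tags", []) + generated.get("tags", [])))
    else:
        merged["tags"] = generated.get("tags", [])
    for field, flag in (("summary", "overwrite_summary"), ("category", "overwrite_category")):
        # Only overwrite an existing value when the config flag says so
        if cfg.get(flag, False) or field not in existing:
            merged[field] = generated.get(field, existing.get(field))
    return merged

merged = merge_frontmatter(
    {"tags": ["stats"], "summary": "old"},
    {"tags": ["stats", "anova"], "summary": "new", "category": "Resources"},
    {"merge_existing_tags": True, "overwrite_summary": False},
)
print(merged)  # tags are unioned; the existing summary is kept
```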

πŸ”§ How It Works

Tag Resolution Pipeline (7 Steps)

1. Normalize      β†’ Lowercase, replace spaces with hyphens
2. Check Blocked  β†’ Reject generic tags like "note", "idea"
3. Resolve Alias  β†’ Map synonyms (e.g., "ml" β†’ "machine-learning")
4. Exact Match    β†’ Check if tag exists in canonical registry
5. Fuzzy Match    β†’ Find similar tags (>90% Levenshtein similarity)
6. Semantic Match β†’ Use embeddings to find related tags (>0.85 cosine similarity)
7. New Tag        β†’ Auto-accept or propose for human review
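
A heavily simplified sketch of this pipeline (the registry, aliases, and blocked list are toy data; the real resolver consults embeddings in step 6, which is stubbed out here):

```python
from __future__ import annotations
import re
from difflib import SequenceMatcher

BLOCKED = {"note", "notes", "draft", "summary", "idea"}
ALIASES = {"ml": "machine-learning"}
CANONICAL = {"machine-learning", "statistical-analysis"}

def resolve_tag(raw: str, fuzzy_threshold: float = 0.90) -> tuple[str, str] | None:
    # 1. Normalize: lowercase, replace spaces with hyphens
    tag = re.sub(r"\s+", "-", raw.strip().lower())
    # 2. Reject blocked generic tags
    if tag in BLOCKED:
        return None
    # 3. Resolve known aliases
    tag = ALIASES.get(tag, tag)
    # 4. Exact match against the canonical registry
    if tag in CANONICAL:
        return tag, "exact"
    # 5. Fuzzy match (Levenshtein-style ratio)
    for known in CANONICAL:
        if SequenceMatcher(None, tag, known).ratio() >= fuzzy_threshold:
            return known, "fuzzy"
    # 6. Semantic match via embeddings would go here (omitted in this sketch)
    # 7. Otherwise, propose as a new tag for review
    return tag, "new"

print(resolve_tag("Machine Learning"))  # β†’ ('machine-learning', 'exact')
```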

Category Determination (Path-First Strategy)

Priority:
1. Path pattern match (e.g., "Resources/" β†’ "Resources")
2. LLM suggestion (if valid category)
3. Default category from config
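
The priority order above can be sketched like this (the path-pattern map and category names are hypothetical; the real mapping comes from config.yaml):

```python
from __future__ import annotations
from pathlib import Path

# Hypothetical path-pattern map; the real mapping lives in config.yaml
PATH_CATEGORIES = {"Resources": "Resources", "Projects": "Projects", "Atlas": "Atlas"}
VALID_CATEGORIES = set(PATH_CATEGORIES.values())

def determine_category(note_path: str, llm_suggestion: str | None, default: str = "Inbox") -> str:
    # 1. Path pattern match has highest priority
    for part in Path(note_path).parts:
        if part in PATH_CATEGORIES:
            return PATH_CATEGORIES[part]
    # 2. Fall back to the LLM suggestion if it is a valid category
    if llm_suggestion in VALID_CATEGORIES:
        return llm_suggestion
    # 3. Otherwise use the configured default
    return default

print(determine_category("Resources/stats/anova.md", "Projects"))  # path wins β†’ 'Resources'
```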

Semantic Tag Clustering

The build-ontology command uses:

  1. Embeddings: Ollama nomic-embed-text to vectorize tag names
  2. Clustering: Scikit-learn agglomerative clustering with cosine similarity
  3. Representative Selection: Highest frequency tag per cluster becomes canonical
  4. Alias Creation: Other cluster members become aliases
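
The representative/alias logic can be shown with a self-contained sketch. The real tool uses Ollama embeddings and scikit-learn's agglomerative clustering; this stand-in substitutes toy vectors and a greedy single-pass threshold grouping:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy (embedding, frequency) pairs standing in for nomic-embed-text vectors
tags = {
    "data-analysis":        ([1.0, 0.1], 8),
    "statistical-analysis": ([0.9, 0.2], 13),
    "anova":                ([0.1, 1.0], 10),
}

def cluster_tags(tags: dict, threshold: float = 0.75) -> dict:
    """Greedy threshold clustering (agglomerative clustering is the real method)."""
    clusters: list[list[str]] = []
    for name, (vec, _) in tags.items():
        for cluster in clusters:
            if cosine(vec, tags[cluster[0]][0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    ontology = {}
    for cluster in clusters:
        # Representative: highest-frequency member; the rest become aliases
        rep = max(cluster, key=lambda t: tags[t][1])
        ontology[rep] = [t for t in cluster if t != rep]
    return ontology

print(cluster_tags(tags))  # β†’ {'statistical-analysis': ['data-analysis'], 'anova': []}
```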

πŸ“Š Examples

Example 1: Basic Dry-Run

$ python -m curator run --dry-run --max-files 5
DRY RUN MODE - No files will be modified
Processing: research-notes.md... βœ“
Processing: meeting-minutes.md... βœ“
Processing: project-ideas.md... βœ“

Summary:
  Files Processed: 3
  New Tags Proposed: 12
  Success Rate: 100%

Example 2: Build Ontology Output

$ python -m curator build-ontology --min-freq 3 --target 80 --dry-run

Building Tag Ontology

Step 1: Loading proposed tags from CSV...
Loaded 83 tags with frequency >= 3
Total unique tags: 952

Step 2: Computing embeddings...
Computing embeddings: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 83/83

Step 3: Clustering tags...
Created 80 clusters

Tag Ontology Consolidation Report

Tags before: 952
Tags after: 80
Reduction: 91.6%

Top 20 Tag Clusters by Frequency:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Representative            β”‚ Frequency β”‚ Cluster Size β”‚ Members             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ statistical-analysis      β”‚        13 β”‚            2 β”‚ data-analysis, ...  β”‚
β”‚ data-visualization        β”‚        12 β”‚            1 β”‚ data-visualization  β”‚
β”‚ anova                     β”‚        10 β”‚            1 β”‚ anova               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example 3: Interactive Tag Review

$ python -m curator review-tags --batch 10

Tag Review Session

Tag: machine-learning
  Variations: machine learning, ML, ml
  Occurrences: 47
  Example: ~/Atlas/ML/supervised-learning.md

[A]ccept / [R]eject / [S]kip / [Q]uit: a
βœ“ Accepted

Summary:
  Accepted: 15
  Rejected: 3
  Skipped: 2

πŸ“ Project Structure

obsidian-curator/
β”œβ”€β”€ curator/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py           # CLI entry point (Typer commands)
β”‚   β”œβ”€β”€ schemas.py        # Pydantic models for validation
β”‚   β”œβ”€β”€ gatekeeper.py     # Tag governance and resolution
β”‚   β”œβ”€β”€ analyzer.py       # LLM interaction (Ollama)
β”‚   └── executor.py       # File I/O and backup management
β”œβ”€β”€ config.example.yaml   # Example configuration
β”œβ”€β”€ config.yaml           # Your configuration (gitignored)
β”œβ”€β”€ tags_registry.example.json
β”œβ”€β”€ tags_registry.json    # Tag ontology (gitignored)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ CLAUDE.md            # Development guidelines
└── README.md            # This file

πŸ›‘οΈ Safety Features

  • βœ… Timestamped Backups: Created in .backup/ before any modification
  • βœ… Dry-Run Default: Must explicitly use --no-dry-run to modify files
  • βœ… Comprehensive Logging: All changes logged to curator_log.csv
  • βœ… Skip Protection: Won't modify files with status "done", "frozen", or "manual"
  • βœ… Merge Mode: Preserves existing tags while adding new ones
  • βœ… Validation: Pydantic schemas ensure data integrity

πŸ› Troubleshooting

Ollama Not Responding

# Check if Ollama is running
curl http://localhost:11434/

# If not, start Ollama
ollama serve

Import Errors

Make sure you're in the virtual environment:

source venv/bin/activate
which python  # Should show path in venv/

No Files Processed

Check:

  • Vault path in config.yaml is correct
  • Files aren't excluded by exclude_dirs
  • Files don't have a skipped status in frontmatter
  • max_files_per_run isn't set too low

JSON Extraction Errors

If LLM output is truncated:

  • Check num_predict in analyzer.py (should be >= 2000)
  • Try a different LLM model
  • Reduce max_input_chars to give LLM more tokens for output

Slow Performance

  • Use --max-files to limit batch size
  • Check Ollama is using GPU acceleration
  • Consider using a smaller LLM model for testing

πŸ“ˆ Performance

Typical processing speed (on RTX 5070 Ti, 16GB VRAM):

  • LLM Analysis: 20-30 seconds per file
  • Tag Resolution: <1 second per tag
  • Embedding Generation: ~0.5 seconds per tag
  • Overall: ~40-60 files per hour

🀝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

πŸ“ Development

See CLAUDE.md for:

  • Detailed architecture notes
  • Implementation guidelines
  • Testing strategies
  • Common pitfalls to avoid

πŸ“„ License

MIT License - See LICENSE file for details

πŸ™ Acknowledgments


⚠️ Note: This tool modifies your Obsidian vault. Always backup your vault and test with --dry-run first!

πŸ’‘ Tip: Start with a small subset of files (--max-files 10) to test configuration before processing your entire vault.
