# ArchiveBox TODOs

This directory contains detailed design documentation for major ArchiveBox systems.

## Active Design Documents

### [Lazy Filesystem Migration System](./TODO_fs_migrations.md)

**Problem**: `archivebox init` on 1TB+ collections takes hours/days scanning and migrating everything upfront.

**Solution**: O(1) init + lazy migration on save() + background worker + single-pass streaming update.

**Key Features**:
- O(1) init regardless of collection size
- Lazy migration happens automatically on `Snapshot.save()`
- Single streaming O(n) pass for `archivebox update`
- Atomic cp + verify + rm (safe to interrupt)
- Intelligent merging of index.json ↔ DB data
- Migration from flat structure to organized extractor subdirectories
- Backwards-compatible symlinks

**Status**: Design complete, ready for implementation

---

### [Hook Architecture & Background Hooks](./TODO_hook_architecture.md)

**Problem**: Need unified hook system for all models + support for long-running background extractors.

**Solution**: JSONL-based hook system with background hook support via `.bg.` suffix.

**Key Features**:
- Unified `Model.run()` pattern for Crawl, Dependency, Snapshot, ArchiveResult
- Hooks emit JSONL: `{type: 'ModelName', ...}`
- Generic `run_hook()` parser (doesn't know about specific models)
- Background hooks run concurrently without blocking
- Split `output` into `output_str` (human) and `output_json` (structured)
- New fields: `output_files`, `output_size`, `output_mimetypes`

**Status**: Phases 1-3 in progress, Phases 4-7 planned

---

## Implementation Order

1. **Filesystem Migration** (TODO_fs_migrations.md)
   - Database migration for `fs_version` field
   - `Snapshot.save()` with migration chain
   - Migration methods: `_migrate_fs_from_0_7_0_to_0_8_0()`, `_migrate_fs_from_0_8_0_to_0_9_0()`
   - `Snapshot.output_dir` property that derives path from `fs_version`
   - Simplify `archivebox init` to O(1)
   - Single-pass streaming `archivebox update`
   - Intelligent `reconcile_index_json()` merging
   - Runtime assertions and `archivebox doctor` checks

2. **Hook Architecture** (TODO_hook_architecture.md)
   - Phase 1: Database migration for new ArchiveResult fields
   - Phase 2: Update hooks to emit clean JSONL
   - Phase 3: Generic `run_hook()` implementation
   - Phase 4: Plugin audit and standardization
   - Phase 5: Update `run_hook()` for background support
   - Phase 6: Update `ArchiveResult.run()`
   - Phase 7: Background hook finalization

---

## Design Principles

Both systems follow these principles:

- ✅ **Never load all snapshots into memory** - Use `.iterator()` everywhere
- ✅ **Atomic operations** - Transactions protect DB, idempotent copies protect FS
- ✅ **Resumable** - Safe to kill and restart anytime
- ✅ **Correct by default** - Runtime assertions catch migration issues
- ✅ **Simple > Complex** - Avoid over-engineering, keep it predictable

---

## Related Files

- `CLAUDE.md` - Development guide and test suite documentation
- `.claude/CLAUDE.md` - User's global instructions (git workflow, DB connections)
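
---

The "atomic cp + verify + rm" step listed under the filesystem migration's key features can be illustrated with a minimal sketch. This is not the implementation from `TODO_fs_migrations.md`: the helper name `migrate_dir` and its return convention are hypothetical, and it assumes a byte-for-byte comparison is an acceptable verification step. The idea is that the source is only deleted after every file has a verified copy, so an interrupted run can simply be repeated.

```python
import filecmp
import shutil
from pathlib import Path


def migrate_dir(src: Path, dst: Path) -> bool:
    """Copy src to dst, verify the copy, then remove src.

    Hypothetical helper for illustration only. Returns True when dst holds
    the migrated data and src is gone, False if verification failed.
    """
    if not src.exists():
        # Nothing left to copy: either an earlier run finished, or there was
        # never any data. Report whether the destination is in place.
        return dst.exists()

    # Step 1 (cp): copy everything. dirs_exist_ok lets a run that was
    # interrupted mid-copy resume by overwriting partial files.
    shutil.copytree(src, dst, dirs_exist_ok=True)

    # Step 2 (verify): every source file must have a byte-identical copy.
    for src_file in src.rglob("*"):
        if src_file.is_file():
            dst_file = dst / src_file.relative_to(src)
            if not dst_file.is_file() or not filecmp.cmp(src_file, dst_file, shallow=False):
                return False  # leave src untouched so nothing is lost

    # Step 3 (rm): only now is it safe to delete the original.
    shutil.rmtree(src)
    return True
```

Because the copy is idempotent and the delete happens last, calling `migrate_dir` again after a crash at any point either finishes the job or reports that it was already done.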
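
---

The hook architecture's "generic `run_hook()` parser (doesn't know about specific models)" can be sketched in a similar spirit. The function below is an assumption-laden illustration, not the parser from `TODO_hook_architecture.md`: the name `parse_hook_output`, the choice to ignore non-JSON lines as log output, and the sample fields `status` and `url` are all hypothetical. It only relies on what the design states, namely that hooks emit JSONL records carrying a `type` key.

```python
import json
from collections import defaultdict


def parse_hook_output(stdout: str) -> dict[str, list[dict]]:
    """Group JSONL records from a hook's stdout by their 'type' field.

    Hypothetical helper: it never inspects model-specific fields, so new
    record types need no parser changes.
    """
    records: dict[str, list[dict]] = defaultdict(list)
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # assumed to be plain log output, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip rather than fail the whole hook
        if isinstance(record, dict) and isinstance(record.get("type"), str):
            records[record["type"]].append(record)
    return dict(records)


if __name__ == "__main__":
    sample = "\n".join([
        "fetching page...",
        '{"type": "ArchiveResult", "status": "succeeded", "output_str": "saved"}',
        '{"type": "Snapshot", "url": "https://example.com"}',
    ])
    for model_name, items in parse_hook_output(sample).items():
        print(model_name, len(items))
```

The caller would then decide what to do with each group (for example, dispatching on the model name), which keeps the parser itself free of any knowledge about Crawl, Snapshot, or ArchiveResult.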