ArchiveBox TODOs
This directory contains detailed design documentation for major ArchiveBox systems.
Active Design Documents
Lazy Filesystem Migration System
Problem: archivebox init on 1TB+ collections takes hours or days because it scans and migrates everything upfront.
Solution: O(1) init + lazy migration on save() + background worker + single-pass streaming update.
Key Features:
- O(1) init regardless of collection size
- Lazy migration happens automatically on Snapshot.save()
- Single streaming O(n) pass for archivebox update
- Atomic cp + verify + rm (safe to interrupt)
- Intelligent merging of index.json ↔ DB data
- Migration from flat structure to organized extractor subdirectories
- Backwards-compatible symlinks
Status: Design complete, ready for implementation
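The "atomic cp + verify + rm" step listed above could look roughly like the following. This is a minimal sketch, not the design's actual implementation: the helper name migrate_dir and its arguments are hypothetical, and byte-for-byte comparison is only one possible way to verify a copy.

```python
import filecmp
import shutil
from pathlib import Path


def migrate_dir(src: Path, dst: Path) -> None:
    """Copy src to dst, verify the copy, then remove src (hypothetical helper)."""
    if not src.exists():
        # Already migrated, or nothing to do: calling again after an interrupt is a no-op.
        return
    # cp: copy into dst, tolerating a partial copy left behind by an interrupted run.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # verify: every file under src must exist under dst with identical contents.
    for src_file in src.rglob("*"):
        if src_file.is_file():
            dst_file = dst / src_file.relative_to(src)
            if not (dst_file.is_file() and filecmp.cmp(src_file, dst_file, shallow=False)):
                raise RuntimeError(f"verification failed for {dst_file}")
    # rm: delete the source only after the copy is known to be complete.
    shutil.rmtree(src)
```

Because the source is removed last, killing the process at any point leaves either the intact original or a verified copy, and rerunning the function resumes safely.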
Hook Architecture & Background Hooks
Problem: Need unified hook system for all models + support for long-running background extractors.
Solution: JSONL-based hook system with background hook support via .bg. suffix.
Key Features:
- Unified Model.run() pattern for Crawl, Dependency, Snapshot, ArchiveResult
- Hooks emit JSONL: {type: 'ModelName', ...}
- Generic run_hook() parser (doesn't know about specific models)
- Background hooks run concurrently without blocking
- Split output into output_str (human) and output_json (structured)
- New fields: output_files, output_size, output_mimetypes
Status: Phases 1-3 in progress, Phases 4-7 planned
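To illustrate what a model-agnostic JSONL parser might do, here is a minimal sketch. The function name parse_hook_output is hypothetical (it is not the real run_hook()), the sample records are invented for illustration, and strict JSON with double-quoted keys is assumed even though the shorthand above is written as {type: 'ModelName', ...}.

```python
import json
from typing import Any, Dict, Iterator


def parse_hook_output(stdout: str) -> Iterator[Dict[str, Any]]:
    """Yield every JSON object that carries a 'type' key (hypothetical helper)."""
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and plain log output
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole hook
        if isinstance(record, dict) and "type" in record:
            yield record


# Sample hook output: one log line followed by two JSONL records (made-up values).
example = "\n".join([
    "starting extractor...",
    '{"type": "ArchiveResult", "output_str": "saved 1 file", "output_size": 2048}',
    '{"type": "Snapshot"}',
])
records = list(parse_hook_output(example))
```

The parser never inspects the value of type, so routing each record to Crawl, Dependency, Snapshot, or ArchiveResult can stay with the caller.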
Implementation Order
- Filesystem Migration (TODO_fs_migrations.md)
  - Database migration for fs_version field
  - Snapshot.save() with migration chain
  - Migration methods: _migrate_fs_from_0_7_0_to_0_8_0(), _migrate_fs_from_0_8_0_to_0_9_0()
  - Snapshot.output_dir property that derives path from fs_version
  - Simplify archivebox init to O(1)
  - Single-pass streaming archivebox update
  - Intelligent reconcile_index_json() merging
  - Runtime assertions and archivebox doctor checks
- Hook Architecture (TODO_hook_architecture.md)
  - Phase 1: Database migration for new ArchiveResult fields
  - Phase 2: Update hooks to emit clean JSONL
  - Phase 3: Generic run_hook() implementation
  - Phase 4: Plugin audit and standardization
  - Phase 5: Update run_hook() for background support
  - Phase 6: Update ArchiveResult.run()
  - Phase 7: Background hook finalization
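The migration chain in the Filesystem Migration item can be pictured as a walk through an ordered list of versions, calling one method per step. The sketch below is an assumption about how such a chain might be wired, not the planned code: FS_VERSIONS, migration_method_names, and FakeSnapshot are hypothetical, and only the two method names are taken from the list above.

```python
from typing import List

# Ordered filesystem versions, taken from the migration method names above.
FS_VERSIONS: List[str] = ["0.7.0", "0.8.0", "0.9.0"]


def migration_method_names(current: str, target: str) -> List[str]:
    """Return the migration method names to call, in order (hypothetical helper)."""
    start, end = FS_VERSIONS.index(current), FS_VERSIONS.index(target)
    names = []
    for old, new in zip(FS_VERSIONS[start:end], FS_VERSIONS[start + 1:end + 1]):
        names.append(f"_migrate_fs_from_{old.replace('.', '_')}_to_{new.replace('.', '_')}")
    return names


class FakeSnapshot:
    """Stand-in for the real model, used only to show the dispatch."""

    def __init__(self, fs_version: str) -> None:
        self.fs_version = fs_version
        self.log: List[str] = []

    def _migrate_fs_from_0_7_0_to_0_8_0(self) -> None:
        self.log.append("0.7.0 -> 0.8.0")

    def _migrate_fs_from_0_8_0_to_0_9_0(self) -> None:
        self.log.append("0.8.0 -> 0.9.0")

    def save(self, target: str = "0.9.0") -> None:
        # Run each pending step in order, then record the new version.
        for name in migration_method_names(self.fs_version, target):
            getattr(self, name)()
        self.fs_version = target
```

A snapshot that is already at the target version gets an empty chain, so repeated save() calls do no extra filesystem work.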
Design Principles
Both systems follow these principles:
✅ Never load all snapshots into memory - Use .iterator() everywhere
✅ Atomic operations - Transactions protect DB, idempotent copies protect FS
✅ Resumable - Safe to kill and restart anytime
✅ Correct by default - Runtime assertions catch migration issues
✅ Simple > Complex - Avoid over-engineering, keep it predictable
Related Files
- CLAUDE.md - Development guide and test suite documentation
- .claude/CLAUDE.md - User's global instructions (git workflow, DB connections)