ArchiveBox TODOs

This directory contains detailed design documentation for major ArchiveBox systems.

Active Design Documents

Lazy Filesystem Migration System

Problem: archivebox init on 1TB+ collections takes hours or days because it scans and migrates everything upfront.

Solution: O(1) init + lazy migration on save() + background worker + single-pass streaming update.

Key Features:

O(1) init regardless of collection size
Lazy migration happens automatically on Snapshot.save()
Single streaming O(n) pass for archivebox update
Atomic cp + verify + rm (safe to interrupt)
Intelligent merging of index.json ↔ DB data
Migration from flat structure to organized extractor subdirectories
Backwards-compatible symlinks

Status: Design complete, ready for implementation
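
The "atomic cp + verify + rm" step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name migrate_dir and the directory arguments are hypothetical, and the real design document may verify copies differently (for example by checksum rather than byte comparison).

```python
import filecmp
import shutil
from pathlib import Path


def migrate_dir(old_dir: Path, new_dir: Path) -> None:
    """Copy old_dir to new_dir, verify the copy, then remove old_dir.

    Hypothetical helper: the original is deleted only after every file
    has been verified, so an interrupted run can simply be repeated.
    """
    if not old_dir.exists():
        return  # already migrated, nothing to do

    new_dir.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets a re-run copy over a partial result from an earlier attempt
    shutil.copytree(old_dir, new_dir, dirs_exist_ok=True)

    # Verify: every original file must exist in the copy with identical bytes
    for src in old_dir.rglob("*"):
        if src.is_file():
            dst = new_dir / src.relative_to(old_dir)
            if not dst.is_file() or not filecmp.cmp(src, dst, shallow=False):
                raise RuntimeError(f"verification failed for {src}")

    # Only now is it safe to remove the original
    shutil.rmtree(old_dir)
```

Because the delete happens last, killing the process at any point leaves either the untouched original or the original plus a partial copy, and calling migrate_dir again finishes the job.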

Hook Architecture & Background Hooks

Problem: Need unified hook system for all models + support for long-running background extractors.

Solution: JSONL-based hook system with background hook support via .bg. suffix.

Key Features:

Unified Model.run() pattern for Crawl, Dependency, Snapshot, ArchiveResult
Hooks emit JSONL: {"type": "ModelName", ...}
Generic run_hook() parser (doesn't know about specific models)
Background hooks run concurrently without blocking
Split output into output_str (human) and output_json (structured)
New fields: output_files, output_size, output_mimetypes

Status: Phases 1-3 in progress, Phases 4-7 planned
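
A model-agnostic parser for hook output might be sketched as follows. This is a minimal illustration, not the actual run_hook() implementation: the function names parse_hook_output and is_background_hook are hypothetical, and it assumes each record is one strict JSON object per line with a "type" key, with any other stdout lines treated as plain log output.

```python
import json
from typing import Any, Dict, Iterator


def parse_hook_output(stdout: str) -> Iterator[Dict[str, Any]]:
    """Yield every JSONL record in hook stdout that carries a 'type' key.

    Hypothetical helper: non-JSON lines are skipped, so the parser needs
    no knowledge of specific models.
    """
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain log output, not a record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line, ignore it
        if isinstance(record, dict) and "type" in record:
            yield record


def is_background_hook(filename: str) -> bool:
    """Assumed convention: a hook is a background hook if its name contains '.bg.'."""
    return ".bg." in filename
```

The caller can then dispatch on record["type"] to decide which model a record belongs to, which keeps the parser itself generic.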

Implementation Order

1. Filesystem Migration (TODO_fs_migrations.md)
- Database migration for fs_version field
- Snapshot.save() with migration chain
- Migration methods: _migrate_fs_from_0_7_0_to_0_8_0(), _migrate_fs_from_0_8_0_to_0_9_0()
- Snapshot.output_dir property that derives path from fs_version
- Simplify archivebox init to O(1)
- Single-pass streaming archivebox update
- Intelligent reconcile_index_json() merging
- Runtime assertions and archivebox doctor checks

2. Hook Architecture (TODO_hook_architecture.md)
- Phase 1: Database migration for new ArchiveResult fields
- Phase 2: Update hooks to emit clean JSONL
- Phase 3: Generic run_hook() implementation
- Phase 4: Plugin audit and standardization
- Phase 5: Update run_hook() for background support
- Phase 6: Update ArchiveResult.run()
- Phase 7: Background hook finalization
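
The migration chain run from Snapshot.save() could be sketched like this. The class SnapshotSketch is a hypothetical stand-in that needs no Django, the version list is an assumption inferred from the migration method names above, and the method bodies only log what a real migration would do.

```python
class SnapshotSketch:
    """Hypothetical stand-in for the Snapshot model (no Django required)."""

    # Assumed ordered list of filesystem layout versions
    FS_VERSIONS = ["0.7.0", "0.8.0", "0.9.0"]

    def __init__(self, fs_version: str) -> None:
        self.fs_version = fs_version
        self.log = []  # records which steps ran, for illustration only

    def _migrate_fs_from_0_7_0_to_0_8_0(self) -> None:
        self.log.append("0.7.0->0.8.0")  # a real method would move files here

    def _migrate_fs_from_0_8_0_to_0_9_0(self) -> None:
        self.log.append("0.8.0->0.9.0")

    def migrate_fs(self) -> None:
        """Apply each pending migration step in order, one version at a time."""
        start = self.FS_VERSIONS.index(self.fs_version)
        steps = zip(self.FS_VERSIONS[start:], self.FS_VERSIONS[start + 1:])
        for old, new in steps:
            name = f"_migrate_fs_from_{old.replace('.', '_')}_to_{new.replace('.', '_')}"
            getattr(self, name)()
            # Record progress after every step so an interrupted run resumes here
            self.fs_version = new

    def save(self) -> None:
        self.migrate_fs()  # lazy migration happens on save
        # a real model would call super().save() at this point
```

Looking methods up by name means adding a new layout version only requires appending to FS_VERSIONS and defining one more _migrate_fs_from_X_to_Y method.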

Design Principles

Both systems follow these principles:

✅ Never load all snapshots into memory - Use .iterator() everywhere
✅ Atomic operations - Transactions protect DB, idempotent copies protect FS
✅ Resumable - Safe to kill and restart anytime
✅ Correct by default - Runtime assertions catch migration issues
✅ Simple > Complex - Avoid over-engineering, keep it predictable
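
The "never load all snapshots into memory" principle refers to Django's QuerySet.iterator(). To keep the example runnable without a Django project, the sketch below shows the same streaming idea with the standard-library sqlite3 module. The function iter_snapshots and the snapshot table with id and url columns are hypothetical.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_snapshots(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[Tuple[int, str]]:
    """Stream (id, url) rows in fixed-size batches instead of fetching them all.

    Hypothetical helper: at most batch_size rows are held in memory at once.
    """
    cursor = conn.execute("SELECT id, url FROM snapshot ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows
```

Memory use stays bounded by batch_size rather than by collection size, which is what makes a single O(n) pass practical on very large archives.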

CLAUDE.md - Development guide and test suite documentation
.claude/CLAUDE.md - User's global instructions (git workflow, DB connections)
