mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2025-12-27 22:26:00 +00:00
# ArchiveBox TODOs

This directory contains detailed design documentation for major ArchiveBox systems.

## Active Design Documents
### [Lazy Filesystem Migration System](./TODO_fs_migrations.md)
**Problem**: `archivebox init` on 1TB+ collections takes hours/days scanning and migrating everything upfront.

**Solution**: O(1) init + lazy migration on `save()` + background worker + single-pass streaming update.

**Key Features**:
- O(1) init regardless of collection size
- Lazy migration happens automatically on `Snapshot.save()`
- Single streaming O(n) pass for `archivebox update`
- Atomic cp + verify + rm (safe to interrupt)
- Intelligent merging of index.json ↔ DB data
- Migration from flat structure to organized extractor subdirectories
- Backwards-compatible symlinks

**Status**: Design complete, ready for implementation

---
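As a rough illustration of the "cp + verify + rm" feature listed for the Lazy Filesystem Migration System, the sketch below copies a directory, checks that every source file arrived intact, and only then deletes the source. The helper names `move_dir_safely` and `_sha256`, and the choice of SHA-256 for verification, are assumptions made for this example and are not taken from `TODO_fs_migrations.md`.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large archives are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    return {
        str(path.relative_to(root)): _sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def move_dir_safely(src: Path, dst: Path) -> None:
    """Copy src to dst, verify the copy, then remove src (hypothetical helper)."""
    if not src.exists():
        # A previous run already completed; nothing left to do.
        return
    # dirs_exist_ok=True lets an interrupted copy simply be repeated.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    src_digests = _file_digests(src)
    dst_digests = _file_digests(dst)
    if any(dst_digests.get(rel) != digest for rel, digest in src_digests.items()):
        raise RuntimeError(f"verification failed, keeping source: {src}")
    # The source is deleted only after the destination has been verified.
    shutil.rmtree(src)
```

Because the source directory survives until verification succeeds, killing the process at any point leaves either the original data or a verified copy on disk, and calling the function again picks up where it left off.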
### [Hook Architecture & Background Hooks](./TODO_hook_architecture.md)
**Problem**: Need unified hook system for all models + support for long-running background extractors.

**Solution**: JSONL-based hook system with background hook support via `.bg.` suffix.

**Key Features**:
- Unified `Model.run()` pattern for Crawl, Dependency, Snapshot, ArchiveResult
- Hooks emit JSONL: `{type: 'ModelName', ...}`
- Generic `run_hook()` parser (doesn't know about specific models)
- Background hooks run concurrently without blocking
- Split `output` into `output_str` (human) and `output_json` (structured)
- New fields: `output_files`, `output_size`, `output_mimetypes`

**Status**: Phases 1-3 in progress, Phases 4-7 planned

---
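To make the idea of a model-agnostic JSONL parser from the Hook Architecture section concrete, here is a minimal sketch that groups records from a hook's stdout by their `type` field. The function name `parse_hook_output` is hypothetical, the real `run_hook()` signature is not given in this document, and treating non-JSON lines as ignorable log output is an assumption.

```python
import json
from collections import defaultdict


def parse_hook_output(stdout: str) -> dict[str, list[dict]]:
    """Group JSONL records by their 'type' key (hypothetical helper).

    The parser never inspects which model a record belongs to; it only
    requires a JSON object with a 'type' field.
    """
    records = defaultdict(list)
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            # Assumed behaviour: plain log lines are skipped.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Malformed JSON is skipped rather than aborting the whole hook.
            continue
        if isinstance(record, dict) and "type" in record:
            records[record["type"]].append(record)
    return dict(records)


# Example input: the field values are made up for illustration.
sample = "\n".join([
    "starting extractor...",
    '{"type": "ArchiveResult", "output_str": "saved page"}',
    '{"type": "Dependency", "name": "wget"}',
])
grouped = parse_hook_output(sample)
```

Callers would then route `grouped["ArchiveResult"]`, `grouped["Dependency"]`, and so on to whichever model handles that type, which keeps the parser itself free of model-specific logic.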
## Implementation Order

1. **Filesystem Migration** (TODO_fs_migrations.md)
   - Database migration for `fs_version` field
   - `Snapshot.save()` with migration chain
   - Migration methods: `_migrate_fs_from_0_7_0_to_0_8_0()`, `_migrate_fs_from_0_8_0_to_0_9_0()`
   - `Snapshot.output_dir` property that derives path from `fs_version`
   - Simplify `archivebox init` to O(1)
   - Single-pass streaming `archivebox update`
   - Intelligent `reconcile_index_json()` merging
   - Runtime assertions and `archivebox doctor` checks

2. **Hook Architecture** (TODO_hook_architecture.md)
   - Phase 1: Database migration for new ArchiveResult fields
   - Phase 2: Update hooks to emit clean JSONL
   - Phase 3: Generic `run_hook()` implementation
   - Phase 4: Plugin audit and standardization
   - Phase 5: Update `run_hook()` for background support
   - Phase 6: Update `ArchiveResult.run()`
   - Phase 7: Background hook finalization

---
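The "migration chain" in the filesystem migration step can be pictured as walking an ordered list of versions and calling the matching `_migrate_fs_from_X_to_Y()` method for each hop. The sketch below assumes this dispatch-by-name approach; `FS_VERSIONS`, `FakeSnapshot`, and `migrate_fs()` are hypothetical stand-ins, and the real `Snapshot` is a Django model whose `save()` is not reproduced here.

```python
FS_VERSIONS = ["0.7.0", "0.8.0", "0.9.0"]  # assumed ordering, oldest first


class FakeSnapshot:
    """Stand-in for the real Snapshot model (hypothetical, no Django needed)."""

    def __init__(self, fs_version: str):
        self.fs_version = fs_version
        self.log = []  # records which steps ran, for demonstration only

    def _migrate_fs_from_0_7_0_to_0_8_0(self):
        self.log.append("0.7.0->0.8.0")

    def _migrate_fs_from_0_8_0_to_0_9_0(self):
        self.log.append("0.8.0->0.9.0")

    def migrate_fs(self, target: str = FS_VERSIONS[-1]) -> None:
        """Run each migration step in order until fs_version equals target."""
        start = FS_VERSIONS.index(self.fs_version)
        end = FS_VERSIONS.index(target)
        steps = zip(FS_VERSIONS[start:end], FS_VERSIONS[start + 1:end + 1])
        for old, new in steps:
            name = f"_migrate_fs_from_{old.replace('.', '_')}_to_{new.replace('.', '_')}"
            getattr(self, name)()
            # Record progress after every hop so a restart resumes from here.
            self.fs_version = new


snapshot = FakeSnapshot("0.7.0")
snapshot.migrate_fs()
```

A snapshot that is already at the target version produces an empty step list, so calling `migrate_fs()` repeatedly is harmless.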
## Design Principles

Both systems follow these principles:

✅ **Never load all snapshots into memory** - Use `.iterator()` everywhere
✅ **Atomic operations** - Transactions protect DB, idempotent copies protect FS
✅ **Resumable** - Safe to kill and restart anytime
✅ **Correct by default** - Runtime assertions catch migration issues
✅ **Simple > Complex** - Avoid over-engineering, keep it predictable

---
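The "never load all snapshots into memory" principle refers to Django's `QuerySet.iterator()`. To show the same idea without a Django project, the sketch below streams rows from SQLite in fixed-size batches using only the standard library. The `snapshot` table, its columns, and the `iter_rows` helper are illustrative assumptions, not ArchiveBox's actual schema or code.

```python
import sqlite3
from typing import Iterator


def iter_rows(conn: sqlite3.Connection, query: str, batch_size: int = 1000) -> Iterator[tuple]:
    """Yield rows one at a time while holding at most batch_size rows in memory."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


# Build a throwaway in-memory table to stream from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO snapshot (url) VALUES (?)",
    ((f"https://example.com/{i}",) for i in range(2500)),
)

# Count rows without ever materialising the full result set as a list.
total = sum(1 for _ in iter_rows(conn, "SELECT id, url FROM snapshot", batch_size=500))
```

Memory use stays bounded by `batch_size` regardless of how many rows the table holds, which is the property that matters for very large collections.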
## Related Files

- `CLAUDE.md` - Development guide and test suite documentation
- `.claude/CLAUDE.md` - User's global instructions (git workflow, DB connections)