# `atlas-extractor`: V0 Implementation

Python implementation of Stages 1-3 of the email-extractor pipeline spec (`../email-extractor.md`). Stages 4-7 (threading, entity extraction, intent classification, canonical-JSON normalization) live in sibling modules to be added.

## Why only 1-3 in V0?

Stages 1-3 are the *unsexy but critical* foundation. Threading and intent classification are easier (well-understood techniques + LLM); fetch / decode / dequote is where most "AI email tools" silently break and produce garbage downstream. We invest here first.

## Install

```bash
cd mcp-tools/email-extractor
pip install -e .[test]
```

## Usage

### Single .eml file (smoke test)

```bash
atlas-extract eml \
  --input tests/fixtures/sample_thread.eml \
  --state-dir /tmp/atlas-state
```

### Directory of .txt or .eml files (e.g., the boss-distiller demo INPUT/)

```bash
atlas-extract dir \
  --input-dir ../../skills/claw-boss-distiller/demo/INPUT \
  --state-dir /tmp/atlas-state
```

### Real IMAP account

```bash
ATLAS_IMAP_PASSWORD='app-specific-password' atlas-extract imap \
  --host imap.gmail.com \
  --user wang@us-saas.cn \
  --folder INBOX --folder Sent \
  --state-dir ./state \
  --since-days 365
```

The `--since-days` flag bounds the cold start. Subsequent runs are incremental via persisted `last_uid` per (account, folder).
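To illustrate the incremental behaviour, here is a minimal sketch of a `last_uid` checkpoint keyed by (account, folder). It is not the extractor's actual code: the file name `checkpoints.json` and the helper names `load_last_uid`, `save_last_uid`, and `select_new` are assumptions made for this example, and the real layout under `--state-dir` may differ.

```python
import json
from pathlib import Path


def _checkpoint_path(state_dir: Path) -> Path:
    # Hypothetical file name; the real state layout may differ.
    return state_dir / "checkpoints.json"


def load_last_uid(state_dir: Path, account: str, folder: str) -> int:
    """Return the highest UID already processed, or 0 on a cold start."""
    path = _checkpoint_path(state_dir)
    if not path.exists():
        return 0
    data = json.loads(path.read_text(encoding="utf-8"))
    return int(data.get(f"{account}/{folder}", 0))


def save_last_uid(state_dir: Path, account: str, folder: str, uid: int) -> None:
    """Persist the checkpoint, never moving it backwards."""
    path = _checkpoint_path(state_dir)
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    key = f"{account}/{folder}"
    data[key] = max(int(data.get(key, 0)), uid)
    state_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")


def select_new(uids: list[int], last_uid: int) -> list[int]:
    """Keep only UIDs strictly greater than the checkpoint, ascending."""
    return sorted(u for u in uids if u > last_uid)
```

A run would call `load_last_uid`, fetch only what `select_new` returns, and call `save_last_uid` after each message is written, so an interrupted run resumes where it stopped. This relies on IMAP UIDs increasing within a folder; a production version would presumably also need to notice a changed `UIDVALIDITY` and restart that folder.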
## Output

Each message produces a JSON file under `state-dir/extracted/YYYY-MM/.json`:

```json
{
  "msg_id": "demo-001@us-saas.cn",
  "account": "test",
  "folder": "local",
  "uid": "1",
  "internal_date": "2026-04-22T01:14:03+00:00",
  "subject": "PRJ-001 客户A 改版 — 第三次问",
  "from": {"name": "王", "email": "wang@us-saas.cn"},
  "to": [{"name": "张三", "email": "zhangsan@us-saas.cn"}],
  "cc": [{"name": "李四", "email": "lisi@us-saas.cn"}],
  "in_reply_to": null,
  "references": [],
  "body_text_clean": "张三，\n\nPRJ-001 我上次问已经过去 6 天了 ...",
  "body_text_full_chars": 312,
  "body_text_clean_chars": 134,
  "dequote": {
    "strategies_used": [
      "marker:^On\\s.+?wrote:\\s*$",
      "signature_sep_dashdash",
      "disclaimer:本邮件(及其附件)?(包含|含有)?(保密"
    ],
    "chars_stripped": 178
  },
  "_extraction": {
    "stages_complete": [1, 2, 3],
    "extractor_version": "0.1.0",
    ...
  }
}
```

`body_text_clean` is the contract handed to Stages 4-7 (threading, entities, intent, canonical normalization).

## Run tests

```bash
pytest -q
```

The test suite verifies three critical guarantees:

1. Decode pulls headers and body correctly from the canonical fixture.
2. Dequote strips the English `On X wrote:` marker, the signature block, and the disclaimer.
3. Real content is preserved (no over-aggressive stripping).

## Performance targets (per spec)

- ≥ 200 msgs/min on a single worker (V0 acceptable)
- Dequote precision ≥ 92% on a manually-labeled 100-message eval set (TBD)

## Failure modes (how each error gets handled)

| Failure | Behavior |
|---------|----------|
| MIME parse exception | Log to `state/extracted/_failed/__.error`, continue |
| Charset undetectable | Fall through to `gb18030` → `latin-1` → `utf-8` with `errors="replace"` |
| HTML-only message with broken HTML | `readability` falls back to `html2text` raw |
| Quote-stripping leaves < 8 chars | Marked `low_signal_clean` in run summary; not skipped |
| IMAP rate limit / quota | Exponential backoff in `imap-tools` (built in); checkpoint via `last_uid` |

## What's NOT in this V0

- **Stage 4 (threading)**: header-first + subject-fuzzy fallback; comes next
- **Stage 5 (entity extraction)**: spaCy + regex packs; comes next
- **Stage 6 (intent classification)**: LLM few-shot with 30-sample reference; comes next
- **Stage 7 (canonical normalization)**: structured Email JSON contract; comes next
- **Gmail API / Exchange**: only IMAP in V0; same interface, different fetcher
- **MCP server wrapping**: V0 is pure CLI; MCP is a V0.5 refactor

## Roadmap to V0.5

1. Add `thread.py`: RFC headers first, fall back to subject+participant similarity
2. Add `entities.py`: spaCy zh+en NER + regex (dates, amounts, project keywords)
3. Add `intent.py`: LLM few-shot classifier with body-hash cache
4. Add `normalize.py`: canonical Email JSON output (matching the spec in `../email-extractor.md`)
5. Add 100-message eval set + accuracy harness
6. Wrap as MCP server with the same CLI underneath
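
For readers who want a feel for the Stage 3 dequote contract before opening the source, the following is a rough sketch rather than the shipped implementation. It covers only two of the strategies named in the sample output (the `On X wrote:` marker and the `-- ` signature separator) and leaves disclaimer patterns out. The function name `dequote`, the constants, and the returned dict shape are assumptions for this example; the 8-character threshold is the one from the failure-modes table.

```python
import re

# Marker pattern copied from the sample `strategies_used` output.
REPLY_MARKER = re.compile(r"^On\s.+?wrote:\s*$", re.MULTILINE)
# Conventional signature separator: a line that is exactly "-- " or "--".
SIGNATURE_SEP = re.compile(r"^-- ?$", re.MULTILINE)
LOW_SIGNAL_THRESHOLD = 8  # "Quote-stripping leaves < 8 chars"


def dequote(body: str) -> dict:
    """Cut the body at the earliest reply marker or signature separator."""
    cut = len(body)
    strategies = []  # every pattern that matched, in a fixed order
    for name, pattern in (
        ("marker:" + REPLY_MARKER.pattern, REPLY_MARKER),
        ("signature_sep_dashdash", SIGNATURE_SEP),
    ):
        match = pattern.search(body)
        if match:
            strategies.append(name)
            cut = min(cut, match.start())
    clean = body[:cut].rstrip()
    return {
        "body_text_clean": clean,
        "strategies_used": strategies,
        "chars_stripped": len(body) - len(clean),
        # Flagged for the run summary, but the message is still kept.
        "low_signal_clean": len(clean) < LOW_SIGNAL_THRESHOLD,
    }
```

Cutting at the earliest match is a deliberately conservative choice for a sketch: a body with no marker or separator is returned unchanged apart from trailing whitespace, which mirrors the "real content is preserved" guarantee. Interleaved (inline) replies would need a line-by-line approach instead.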