# MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; **extracting useful structure out of email is a 2-week job, and the rest of Atlas falls apart without it.** This doc specifies the 7-stage extraction pipeline, the canonical `Email` object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).

---

## Why a dedicated tool

Raw email is a hostile data source:

- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, and OOO replies polluting the signal
- 8+ languages mixed in one thread (Chinese / English / Japanese + tech jargon)
- Senders using the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drifting across replies (`Re: Re: 项目 → 客户A 改版进度跟进`)

Atlas's downstream skills (`claw-project-tracker` etc.) assume **clean, normalized, deduplicated, intent-tagged Email objects**. The extractor is the bridge from MIME chaos to that contract.

---

## 7-Stage Pipeline

```
[Stage 1: Fetch]     IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]    MIME parse, charset detection, HTML→text (readability)
        ↓
[Stage 3: Dequote]   strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]    group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]  extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]    classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize] emit canonical Email JSON → state/extracted/.json
```

### Stage 1 — Fetch

| Backend | Lib | Notes |
|---------|-----|-------|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental sync |
| Exchange / O365 | `exchangelib` or the MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: `raw_mime` bytes + envelope (account, folder, uid, internal_date)

**Configuration:**

- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, and mailing-list folders.
- Date range: configurable (default for the first V0 run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; back off on `OVERQUOTA` / `429`

### Stage 2 — Decode

- Parse MIME with the stdlib `email` package (Python) or `mailparser` (Node)
- Detect charset with a fallback chain: declared charset → `chardet` sniff → `utf-8` with `errors="replace"`
- HTML body → plain text via `readability-lxml` (preserves structure) or `html2text`
- Inline images: keep the `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata; do NOT treat them as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)

Output adds: `body_text`, `body_html`, `language`, `attachments_meta`

### Stage 3 — Dequote (the unglamorous but critical step)

Most emails carry a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.

**Strategies (combine, fall through):**

1. **Marker patterns** (regex):
   - `^On .* wrote:$` (English)
   - `^.* 写道:$` / `^.* 于 \d{4}年.*写道:$` (Chinese "... wrote:" variants)
   - `^------ (转发|原始)邮件 ------` (forwarded / original mail) / `------ Forwarded message ------`
   - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
   - `^>+ ` (RFC quoted lines)
2.
   **Signature blocks**: detect the `--\s*$` separator, or a trailing block with phone/title patterns
3. **Disclaimer footers**: regex for `本邮件包含保密信息` ("this email contains confidential information"), `CONFIDENTIAL`, etc.
4. **Library helper**: vendor `EmailReplyParser` (Python, or its Node port) as a baseline, then layer our patterns on top

**Result:** `body_text_clean` — only the new content the sender wrote in this message.

### Stage 4 — Thread

Goal: group all messages of one conversation under a shared `thread_id`.

| Method | Strength | Weakness |
|--------|----------|----------|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches the Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; use only as a tiebreaker |

Persist `thread_id` per message; threads are first-class — `claw-project-tracker` clusters at thread level, not message level.
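The header-first threading with a subject-plus-participants fallback can be sketched as follows. `normalize_subject` and `assign_thread` are our own illustrative names, the input dict shape is assumed, and the embedding tiebreaker is omitted:

```python
import re

# Reply/forward prefixes to strip when normalizing subjects (English + Chinese).
_PREFIX_RE = re.compile(r"^\s*(re|fwd?|回复|答复|转发)\s*[::]\s*", re.IGNORECASE)

def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes repeatedly, collapse whitespace, lowercase."""
    prev = None
    while prev != subject:
        prev = subject
        subject = _PREFIX_RE.sub("", subject)
    return " ".join(subject.split()).lower()

def assign_thread(msg: dict, by_msg_id: dict, by_subject_key: dict) -> str:
    """Header-first threading; fall through to normalized subject + participants."""
    # 1. Header chain: any referenced Message-ID we've already threaded wins.
    for ref in (msg.get("in_reply_to") or []) + (msg.get("references") or []):
        if ref in by_msg_id:
            return by_msg_id[ref]
    # 2. Fallback: normalized subject plus an overlapping participant set.
    key = normalize_subject(msg.get("subject", ""))
    participants = {a.lower() for a in msg.get("from_to_cc", [])}
    for thread_id, seen_participants in by_subject_key.get(key, []):
        if participants & seen_participants:
            return thread_id
    # 3. No match: this message starts a new thread.
    return "thr-" + msg["msg_id"]
```

Header matching runs first because it is cheap and near-deterministic; the subject key only comes into play when Outlook has dropped the `References` chain.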
### Stage 5 — Entity Extraction

Per cleaned message, extract:

| Entity | Method |
|--------|--------|
| **People** | from / to / cc parsed addresses → normalize to `(name, email)` tuples; fuzzy-merge identities (`zhang san` ≡ `张三`) using a maintained alias map under `state/people/aliases.json` |
| **Internal vs external** | `email_domain ∈ company_domains` → internal; else external (= candidate customer) |
| **Organization (customer)** | external email domain → lookup in `state/customers/domain_map.json`; new domain → create a candidate `customers/UNCLASSIFIED-.json` for boss confirmation |
| **Dates** | `dateparser` lib (multi-language) for "下周三" ("next Wednesday") / "by EOM" / "Mar 15" |
| **Amounts** | regex for `¥1,200` / `$50K` / `30 万` / `200万元` |
| **Project keywords** | (a) seed list from the boss; (b) noun phrases via `spacy` zh + en models; clustered across the thread |
| **Action verbs** | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |

### Stage 6 — Intent Classification

8 intents (one mutually exclusive primary, plus any number of secondary tags):

| Intent | Examples |
|--------|----------|
| `催办` (urge) | "麻烦 ASAP" ("ASAP please") / "deadline 已过" ("the deadline has passed") |
| `决策` (decide) | "我同意 / 不同意 / 选 A" ("I agree / disagree / pick A") |
| `转交` (delegate) | "请张三跟一下" ("Zhang San, please follow up") / "+张三" |
| `询问` (ask) | "进展如何" / "有更新吗" ("how's progress / any updates?") |
| `抱怨` (complain) | "再不给答复就..." ("if we don't get an answer soon...") / "为什么这么慢" ("why is this so slow") |
| `表扬` (praise) | "辛苦了 / 做得不错" ("thanks for the effort / well done") |
| `通知` (inform) | "FYI / 同步一下" ("FYI / just syncing") |
| `闲聊` (smalltalk) | greetings, pleasantries |

**Method:** few-shot LLM classification with a 30-example reference set (seeded from the boss's own emails). Cache results by the hash of `body_text_clean` to avoid re-classifying duplicates.
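The cache-by-hash step can be sketched like this. The cache is in-memory here; a real version would persist the entries under `state/`, and `classify_fn` stands in for the few-shot LLM call:

```python
import hashlib

# In-memory cache keyed by content hash; a hypothetical persistent version
# would write these entries under state/ instead.
_INTENT_CACHE: dict = {}

def classify_with_cache(body_text_clean: str, classify_fn) -> dict:
    """Return a cached intent result, paying for classify_fn only on a cache miss."""
    key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
    if key not in _INTENT_CACHE:
        _INTENT_CACHE[key] = classify_fn(body_text_clean)
    return _INTENT_CACHE[key]
```

Because the key is `body_text_clean` (after dequoting), identical content that recurs across forwards and re-sends never triggers a second LLM call.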
### Stage 7 — Normalize → Canonical Email JSON

Final output is stored as `state/extracted/YYYY-MM//.json`:

```json
{
  "msg_id": "CAH+...@mail.gmail.com",
  "thread_id": "thr-2026-04-12-abc123",
  "internal_date": "2026-04-22T14:33:00+08:00",
  "from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
  "to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
  "cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
  "subject_normalized": "客户A 官网改版 进度跟进",
  "language": "zh-CN",
  "body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
  "entities": {
    "dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
    "amounts": [],
    "project_keywords": ["官网改版"],
    "internal_people": ["李四", "Boss"],
    "external_people": ["客户A 王总"],
    "customer_id_candidate": "CUST-clientco"
  },
  "intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
  "attachments": [],
  "extraction_version": "v0.1",
  "extracted_at": "2026-05-09T07:30:12Z",
  "rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```

This is the contract. `claw-project-tracker`, `claw-people-observer`, and `claw-customer-radar` consume only this — never raw MIME.
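Since this JSON is the contract, consumers can cheaply reject malformed records before clustering. A minimal structural check, with field names taken from the schema above and the helper name being our own:

```python
REQUIRED_FIELDS = {
    "msg_id", "thread_id", "internal_date", "from", "to", "cc",
    "subject_normalized", "language", "body_text_clean",
    "entities", "intent", "attachments", "extraction_version",
    "extracted_at", "rule_audit",
}

def validate_email_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record honors the contract."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    intent = record.get("intent")
    # Confidence is a probability-like score, so it must sit in [0, 1].
    if isinstance(intent, dict) and not 0.0 <= intent.get("confidence", 0.0) <= 1.0:
        problems.append("intent.confidence out of [0, 1]")
    return problems
```

A full JSON Schema would be stricter, but a check like this is enough to route broken records into the failure quarantine instead of corrupting thread clusters.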
---

## Failure Handling

| Failure | Recovery |
|---------|----------|
| MIME parse fails | Log to `state/extracted/_failed/`, continue with the next message |
| Charset undetectable | Mark `body_text_clean = ""`, intent = `unknown`, surface in the unclustered queue |
| Thread headers missing | Fall through to the subject + participant strategy |
| Customer domain unknown | Create an `UNCLASSIFIED-` candidate; boss confirms in week 1 |
| Person alias collision | Surface in `state/people/_to_merge.json` for the boss |
| Intent confidence < 0.6 | Default to `通知` (inform), mark `low_confidence: true` |
| Rate-limit hit | Exponential backoff; resume on the next heartbeat |

---

## Performance Targets

| Metric | V0 target |
|--------|-----------|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over a 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg, incl. LLM calls |

---

## Reuse vs Build

| Component | Approach |
|-----------|----------|
| IMAP / Gmail / Exchange auth + fetch | **Reuse** — `imap-tools`, `google-api-python-client`, `exchangelib` |
| MIME parse | **Reuse** — stdlib `email` |
| HTML→text | **Reuse** — `readability-lxml` |
| Quote stripping | **Reuse + extend** — `EmailReplyParser` baseline + our regex packs |
| Language detection | **Reuse** — `fasttext-langdetect` |
| Date parsing | **Reuse** — `dateparser` |
| Entity extraction (NER) | **Reuse** — `spacy` zh + en models |
| Intent classification | **Build (LLM few-shot)** — small custom prompt, cached by body hash |
| Threading | **Build** — header-first, custom fallbacks |
| Alias merging | **Build** — boss-curated `aliases.json` |

**Estimate:** 5–7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.
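The rate-limit row in Failure Handling amounts to capped exponential backoff around the fetch call. A sketch with illustrative retry values; the `RateLimitError` type is our own, mapped from `OVERQUOTA` / `429` responses:

```python
import random
import time

class RateLimitError(Exception):
    """Our own wrapper for OVERQUOTA / HTTP 429 responses from the mail backend."""

def with_backoff(fetch_fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry fetch_fn on rate limits, doubling the delay each attempt (plus jitter)."""
    for attempt in range(max_attempts):
        try:
            return fetch_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up; the next heartbeat resumes from the persisted last_uid
            # Exponential delay with a little jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Re-raising on the final attempt matters: the worker should fail visibly and let incremental sync (the persisted `last_uid`) make the retry on the next heartbeat cheap.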
---

## V0 Deliverable Checklist

- [ ] IMAP fetcher with incremental UID sync
- [ ] MIME → clean text pipeline (stages 2–3) at ≥ 92% dequote precision
- [ ] Threading at ≥ 95% accuracy on a 100-thread eval set
- [ ] Entity extraction (people / dates / amounts / project keywords)
- [ ] Intent classifier with a 30-shot reference set
- [ ] Canonical `Email` JSON writer
- [ ] `state/people/aliases.json` and `state/customers/domain_map.json` seed formats
- [ ] Failure quarantine bucket
- [ ] CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill
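The `atlas-extract --since` entry point from the checklist can be a thin `argparse` wrapper; the flag name comes from the checklist, everything else is a sketch:

```python
import argparse
from datetime import date

def parse_args(argv=None) -> argparse.Namespace:
    """Parse `atlas-extract --since YYYY-MM-DD` for an ad-hoc backfill run."""
    parser = argparse.ArgumentParser(prog="atlas-extract")
    parser.add_argument(
        "--since",
        type=date.fromisoformat,  # rejects anything that is not YYYY-MM-DD
        required=True,
        help="re-extract messages with internal_date on or after this day",
    )
    return parser.parse_args(argv)
```

`date.fromisoformat` doubles as input validation: `atlas-extract --since 2026/01/01` exits with a usage error instead of silently backfilling nothing.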