assistant-claw/mcp-tools/email-extractor.md

# MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; **extracting useful structure out of email is a 2-week job and the rest of Atlas falls apart without it.**

This doc specifies the 7-stage extraction pipeline, the canonical `Email` object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).

---

## Why a dedicated tool

Raw email is a hostile data source:

- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
- Multiple languages mixed in one thread (Chinese / English / Japanese, plus tech jargon)
- Senders use the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drift across replies (`Re: Re: 项目 → 客户A 改版进度跟进`)

Atlas's downstream skills (`claw-project-tracker` etc.) assume **clean, normalized, deduplicated, intent-tagged Email objects**. The extractor is the bridge from MIME chaos to that contract.

---

## 7-Stage Pipeline

```
[Stage 1: Fetch]      IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]     MIME parse, charset, HTML→text (readability)
        ↓
[Stage 3: Dequote]    strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]     group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]   extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]     classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize]  emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json
```

### Stage 1 — Fetch

| Backend | Lib | Notes |
|---------|-----|-------|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental |
| Exchange / O365 | `exchangelib` or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: `raw_mime` bytes + envelope (account, folder, uid, internal_date)

**Configuration:**
- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, mailing-list folders.
- Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; backoff on `OVERQUOTA` / `429`
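
The UID-based incremental sync can be sketched as pure bookkeeping, independent of which IMAP library does the fetching. This is a minimal sketch, not the final fetcher: `plan_sync`, `commit_sync`, and the shape of the `state` dict are hypothetical names introduced here. It also tracks `UIDVALIDITY`, because IMAP servers may renumber a folder, in which case a stored `last_uid` is no longer meaningful.

```python
def plan_sync(state: dict, folder: str, server_uidvalidity: int):
    """Return (last_uid, IMAP search criterion) for the next incremental fetch."""
    saved = state.get(folder, {})
    if saved.get("uidvalidity") != server_uidvalidity:
        last_uid = 0  # folder was renumbered (or never synced): start over
    else:
        last_uid = saved.get("last_uid", 0)
    return last_uid, f"UID {last_uid + 1}:*"


def commit_sync(state: dict, folder: str, server_uidvalidity: int,
                fetched_uids: list, last_uid: int) -> list:
    """Record progress and return only the UIDs that are actually new."""
    # A "UID n:*" search returns the highest existing message even when its
    # UID is below n, so anything <= last_uid is filtered out here.
    new_uids = sorted(u for u in fetched_uids if u > last_uid)
    state[folder] = {
        "uidvalidity": server_uidvalidity,
        "last_uid": max([last_uid, *new_uids]),
    }
    return new_uids
```

In practice the `state` dict would be loaded from and saved to a per-account JSON file between runs; that persistence is left out of the sketch.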

### Stage 2 — Decode

- Parse MIME with stdlib `email` (Python) or `mailparser` (Node)
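- Detect charset; fallback chain: declared → `chardet` sniff → `utf-8` with `errors="replace"`
- HTML body → plain text via `html2text`, optionally after `readability-lxml` has isolated the main content (`readability-lxml` returns cleaned HTML rather than text, so it needs a text conversion step after it)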
- Inline images: keep `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata, do NOT treat as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)

Output adds: `body_text`, `body_html`, `language`, `attachments_meta`
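
The MIME parse and charset fallback chain could look roughly like the following. This is a sketch using only the stdlib `email` package; `decode_part` and `extract_body` are hypothetical helper names, and `chardet` is treated as optional so the block still runs without it.

```python
from email import message_from_bytes, policy


def decode_part(part) -> str:
    """Decode one MIME part: declared charset, then chardet sniff, then utf-8/replace."""
    payload = part.get_payload(decode=True) or b""  # undoes base64 / quoted-printable
    candidates = []
    declared = part.get_content_charset()
    if declared:
        candidates.append(declared)
    try:
        import chardet  # optional third-party sniffing step
        guess = chardet.detect(payload).get("encoding")
        if guess:
            candidates.append(guess)
    except ImportError:
        pass
    for charset in candidates:
        try:
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or wrong charset: try the next candidate
    return payload.decode("utf-8", errors="replace")


def extract_body(raw_mime: bytes) -> dict:
    """Return the plain-text and HTML bodies of a raw message as str."""
    msg = message_from_bytes(raw_mime, policy=policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    return {
        "body_text": decode_part(plain) if plain is not None else "",
        "body_html": decode_part(html) if html is not None else "",
    }
```

HTML-to-text conversion, language detection, and attachment metadata would be layered on top of `extract_body`; they are omitted here to keep the sketch dependency-free.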

### Stage 3 — Dequote (the unglamorous but critical step)

Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.

**Strategies (combine, fall through):**

1. **Marker patterns** (regex):
   - `^On .* wrote:$` (English)
   - `^.* 写道：$` / `^.* 于 \d{4}年.*写道：$` (Chinese)
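   - `^-{2,}\s*(转发|原始)邮件\s*-{2,}` / `^-{2,}\s*Forwarded message\s*-{2,}` (forwarded / original-message banners; dash counts vary by client, e.g. Gmail emits `---------- Forwarded message ---------`)
   - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
   - `^>+` (`>`-prefixed quoted lines; the space after `>` is not always present)
2. **Signature blocks**: detect the `^--\s*$` separator line, or a trailing block with phone/title patterns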
3. **Disclaimer footers**: regex for `本邮件包含保密信息`, `CONFIDENTIAL`, etc.
4. **Library helper**: vendor `EmailReplyParser` (Python or Node port) as a baseline, then layer our patterns on top

**Result:** `body_text_clean` — only the new content the sender wrote in this message.
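
A line-based pass over the marker patterns might look like this. It is a simplified sketch of strategies 1 and 2 only (no disclaimer handling, no `EmailReplyParser` baseline), and `dequote` and `CUT_PATTERNS` are hypothetical names. The dash-count patterns are deliberately loose, which is an assumption about what real clients emit.

```python
import re

# A line matching any of these starts the quoted history or the signature:
# that line and everything below it is dropped.
CUT_PATTERNS = [
    re.compile(r"^On .* wrote:$"),                        # English reply header
    re.compile(r"^.* 写道：$"),                            # Chinese reply header
    re.compile(r"^-{2,}\s*(转发|原始)邮件\s*-{2,}"),        # forwarded / original banner
    re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}"),
    re.compile(r"^--$"),                                  # "-- " signature separator, after strip
]


def dequote(body_text: str) -> str:
    """Keep only the text the sender wrote in this message."""
    lines = body_text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if any(p.match(stripped) for p in CUT_PATTERNS):
            break
        # Outlook block header: "From:" immediately followed by "Sent:".
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if stripped.startswith("From:") and nxt.startswith("Sent:"):
            break
        if stripped.startswith(">"):
            continue  # quoted line interleaved with new text
        kept.append(line)
    return "\n".join(kept).strip()
```

Cutting at the first marker is aggressive: inline replies written below a quoted header would be lost, which is one reason to treat this as a fallback layered with a library baseline rather than the whole strategy.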

### Stage 4 — Thread

Goal: group all messages of one conversation into a `thread_id`.

| Method | Strength | Weakness |
|--------|----------|----------|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; only as tiebreaker |

Persist `thread_id` per message; threads are first-class — `claw-project-tracker` clusters at thread level, not message level.
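
The first two rows of the table (header-first, then normalized subject plus participant overlap) can be sketched as below. This is illustrative only: `assign_threads`, the message dict keys, and the `thr-N` ID format are assumptions made for the example, and the embedding tiebreaker is not shown. Messages are assumed to arrive oldest first.

```python
import re
from itertools import count

# Reply/forward prefixes in English and Chinese, with ASCII or full-width colon.
PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|回复|转发)\s*[:：]\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return PREFIX.sub("", subject).strip().lower()


def assign_threads(messages: list) -> dict:
    """Map message_id -> (thread_id, method). Expects messages sorted oldest first."""
    by_msg_id = {}    # Message-ID -> thread_id
    by_subject = {}   # normalized subject -> {thread_id: participants seen so far}
    counter = count(1)
    assigned = {}
    for m in messages:
        # 1) Header-based: any known parent decides the thread.
        parents = [m.get("in_reply_to"), *m.get("references", [])]
        thread_id = next((by_msg_id[p] for p in parents if p in by_msg_id), None)
        method = "header" if thread_id else None
        subject = normalize_subject(m["subject"])
        people = set(m["participants"])
        # 2) Fallback: same normalized subject and at least one shared participant.
        if thread_id is None:
            for tid, seen in by_subject.get(subject, {}).items():
                if people & seen:
                    thread_id, method = tid, "subject+participants"
                    break
        # 3) Otherwise start a new thread.
        if thread_id is None:
            thread_id, method = f"thr-{next(counter)}", "new"
        by_msg_id[m["message_id"]] = thread_id
        by_subject.setdefault(subject, {}).setdefault(thread_id, set()).update(people)
        assigned[m["message_id"]] = (thread_id, method)
    return assigned
```

Returning the method alongside the ID makes it straightforward to record which strategy fired for each message, for instance in an audit field.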

### Stage 5 — Entity Extraction

Per cleaned message, extract:

| Entity | Method |
|--------|--------|
| **People** | from / to / cc parsed addresses → normalize to `(name, email)` tuples; fuzzy-merge identities (`zhang san <zhang@a.com>` ≡ `张三 <zhang.san@a-corp.cn>`) using a maintained alias map under `state/people/aliases.json` |
| **Internal vs external** | `email_domain ∈ company_domains` → internal; else external (= candidate customer) |
| **Organization (customer)** | external email domain → lookup in `state/customers/domain_map.json`; new domain → create candidate `customers/UNCLASSIFIED-<domain>.json` for boss confirmation |
| **Dates** | `dateparser` lib (multi-language) for "下周三" (next Wednesday) / "Mar 15"; business shorthand such as "by EOM" may need a small custom rule layer on top |
| **Amounts** | regex for `¥1,200` / `$50K` / `30 万` / `200万元` |
| **Project keywords** | (a) seed list from boss; (b) noun phrases via `spacy` zh+en models; cluster across thread |
| **Action verbs** | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |
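
For the **Amounts** row, a regex covering the four example shapes could be sketched as follows. `extract_amounts` is a hypothetical helper; mapping `¥` to CNY and `$` to USD is an assumption (both symbols are ambiguous in real mail), and bare numbers without a symbol or a 万/亿/元 unit are deliberately ignored so that dates and counts are not picked up.

```python
import re

MULTIPLIER = {"k": 1e3, "m": 1e6, "万": 1e4, "亿": 1e8}
CURRENCY = {"¥": "CNY", "$": "USD"}  # assumption: symbols are ambiguous in practice

AMOUNT = re.compile(
    # Symbol form: ¥1,200 / $50K
    r"(?P<sym>[¥$])\s*(?P<n1>\d[\d,]*(?:\.\d+)?)\s*(?P<u1>[kKmM]\b|万|亿)?"
    # CJK unit form: 30 万 / 200万元 / 500元
    r"|(?P<n2>\d[\d,]*(?:\.\d+)?)\s*(?:(?P<u2>万|亿)(?P<y1>元)?|(?P<y2>元))"
)


def extract_amounts(text: str) -> list:
    found = []
    for m in AMOUNT.finditer(text):
        number = m.group("n1") or m.group("n2")
        unit = (m.group("u1") or m.group("u2") or "").lower()
        value = float(number.replace(",", "")) * MULTIPLIER.get(unit, 1)
        if m.group("sym"):
            currency = CURRENCY[m.group("sym")]
        elif m.group("y1") or m.group("y2"):
            currency = "CNY"  # 元 written out explicitly
        else:
            currency = None   # e.g. "30 万" with no currency stated
        found.append({"text": m.group(0).strip(), "value": value, "currency": currency})
    return found
```

Keeping the matched `text` next to the normalized `value` mirrors how the `dates` entities carry both the original phrase and the parsed form.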

### Stage 6 — Intent Classification

8 intents. Each message gets exactly one primary intent (mutually exclusive) plus zero or more secondary intents:

| Intent | Examples |
|--------|----------|
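| `催办` (urge) | "麻烦 ASAP" (please, ASAP) / "deadline 已过" (the deadline has passed) |
| `决策` (decide) | "我同意 / 不同意 / 选 A" (I agree / I disagree / go with A) |
| `转交` (delegate) | "请张三跟一下" (Zhang San, please follow up on this) / "+张三" (+Zhang San) |
| `询问` (ask) | "进展如何" (how is it going) / "有更新吗" (any updates?) |
| `抱怨` (complain) | "再不给答复就..." (if we don't get an answer soon...) / "为什么这么慢" (why is this so slow) |
| `表扬` (praise) | "辛苦了 / 做得不错" (thanks for the hard work / nicely done) |
| `通知` (inform) | "FYI / 同步一下" (FYI / just syncing you up) |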
| `闲聊` (smalltalk) | greetings, pleasantries |

**Method:** few-shot LLM classification with 30-example reference set (seeded from the boss's own emails). Cache by `body_text_clean` hash to avoid re-classifying duplicates.
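
The caching and prompt assembly around the LLM call might be structured like this. It is a sketch under stated assumptions: `llm_call` is a hypothetical injected callable that takes a prompt string and returns a dict with `primary`, `secondary`, and `confidence`; no specific LLM client is implied. The 0.6 threshold and the `通知` default come from the Failure Handling table; treating an out-of-vocabulary label the same way is an added assumption.

```python
import hashlib

INTENTS = ("催办", "决策", "转交", "询问", "抱怨", "表扬", "通知", "闲聊")
LOW_CONFIDENCE = 0.6


def build_prompt(body_text_clean: str, examples: list) -> str:
    """Assemble a few-shot prompt from (text, label) reference examples."""
    shots = "\n\n".join(f"Email: {text}\nIntent: {label}" for text, label in examples)
    return (
        f"Classify the email into one primary intent from: {', '.join(INTENTS)}.\n\n"
        f"{shots}\n\nEmail: {body_text_clean}\nIntent:"
    )


class CachedIntentClassifier:
    def __init__(self, llm_call, examples):
        self._llm_call = llm_call    # hypothetical: prompt str -> result dict
        self._examples = examples    # the reference set of (text, label) pairs
        self._cache = {}             # sha256(body_text_clean) -> result

    def classify(self, body_text_clean: str) -> dict:
        key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
        if key not in self._cache:
            result = dict(self._llm_call(build_prompt(body_text_clean, self._examples)))
            # Unknown label or weak confidence: fall back to the neutral intent.
            if result.get("primary") not in INTENTS or result.get("confidence", 0.0) < LOW_CONFIDENCE:
                result.update(primary="通知", low_confidence=True)
            self._cache[key] = result
        return self._cache[key]
```

An in-memory dict is enough to show the idea; a persistent cache keyed by the same hash would be needed for the cache to survive across runs.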

### Stage 7 — Normalize → Canonical Email JSON

Final output stored as `state/extracted/YYYY-MM/<thread_id>/<msg_id>.json`:

```json
{
  "msg_id": "CAH+...@mail.gmail.com",
  "thread_id": "thr-2026-04-12-abc123",
  "internal_date": "2026-04-22T14:33:00+08:00",
  "from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
  "to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
  "cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
  "subject_normalized": "客户A 官网改版 进度跟进",
  "language": "zh-CN",
  "body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
  "entities": {
    "dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
    "amounts": [],
    "project_keywords": ["官网改版"],
    "internal_people": ["李四", "Boss"],
    "external_people": ["客户A 王总"],
    "customer_id_candidate": "CUST-clientco"
  },
  "intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
  "attachments": [],
  "extraction_version": "v0.1",
  "extracted_at": "2026-05-09T07:30:12Z",
  "rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```

This is the contract. `claw-project-tracker`, `claw-people-observer`, `claw-customer-radar` consume only this — never raw MIME.
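
A writer for the storage path above could be sketched as follows. Two details are assumptions rather than part of the spec: the `YYYY-MM` bucket is derived from `internal_date`, and the `msg_id` is percent-encoded for the filename because Message-IDs can contain characters such as `/` that are not safe in paths. `write_email` is a hypothetical name.

```python
import json
from datetime import datetime
from pathlib import Path
from urllib.parse import quote


def write_email(record: dict, root: str = "state/extracted") -> Path:
    """Write one canonical Email record to <root>/YYYY-MM/<thread_id>/<msg_id>.json."""
    month = datetime.fromisoformat(record["internal_date"]).strftime("%Y-%m")
    safe_id = quote(record["msg_id"], safe="")  # "/" "+" "@" become %XX escapes
    path = Path(root) / month / record["thread_id"] / f"{safe_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first, then rename, so readers never see a partial file.
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)
    return path
```

`ensure_ascii=False` keeps Chinese text readable in the stored JSON instead of escaping it to `\uXXXX` sequences.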

---

## Failure Handling

| Failure | Recovery |
|---------|---------|
| MIME parse fails | Log to `state/extracted/_failed/`, continue with next |
| Charset undetectable | Mark `body_text_clean = ""`, set intent to the sentinel `unknown` (outside the 8 Stage 6 intents), surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create `UNCLASSIFIED-<domain>` candidate; boss confirms in week-1 |
| Person alias collision | Surface in `state/people/_to_merge.json` for boss |
| Intent confidence < 0.6 | Default to `通知`, mark `low_confidence: true` |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |
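
The rate-limit row can be illustrated with a small backoff helper. This is a sketch: `RateLimited`, `backoff_delays`, and `call_with_backoff` are hypothetical names, and the base, cap, and attempt count are placeholder values rather than tuned settings.

```python
import random
import time


class RateLimited(Exception):
    """Raised by a fetch function when the server answers OVERQUOTA / 429."""


def backoff_delays(base=1.0, cap=60.0, attempts=6, jitter=0.0):
    """Yield exponentially growing delays, capped, with optional random jitter."""
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay + random.uniform(0, jitter * delay)


def call_with_backoff(fetch, sleep=time.sleep, **backoff_kwargs):
    """Call fetch() until it succeeds or the delay schedule is exhausted."""
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    # Give up for now; the caller is expected to resume on the next heartbeat.
    raise RateLimited("still rate-limited after all retries")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a non-zero `jitter` spreads retries out when several accounts hit the limit at once.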

---

## Performance Targets

| Metric | V0 target |
|--------|-----------|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |
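
As a sanity check on how the throughput and latency rows fit together, Little's law gives the number of messages a worker must keep in flight. This reads the 2 sec/msg figure as wall-clock time per message, which is an assumption; Stage 6 cache hits would lower the real number.

$$
N = \lambda \cdot W = \frac{200\ \text{msgs}}{60\ \text{s}} \times 2\ \text{s} \approx 6.7
$$

So a single worker would need roughly 7 messages in flight at once (for example, concurrent LLM calls) to meet both targets simultaneously.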

---

## Reuse vs Build

| Component | Approach |
|-----------|----------|
| IMAP / Gmail / Exchange auth + fetch | **Reuse** — `imap-tools`, `google-api-python-client`, `exchangelib` |
| MIME parse | **Reuse** — stdlib `email` |
| HTML→text | **Reuse** — `readability-lxml` |
| Quote stripping | **Reuse + extend** — `EmailReplyParser` baseline + our regex packs |
| Language detection | **Reuse** — `fasttext-langdetect` |
| Date parsing | **Reuse** — `dateparser` |
| Entity extraction (NER) | **Reuse** — `spacy` zh + en models |
| Intent classification | **Build (LLM few-shot)** — small custom prompt, cache by body hash |
| Threading | **Build** — header-first, custom fallbacks |
| Alias merging | **Build** — boss-curated `aliases.json` |

**Estimate:** 5–7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.

---

## V0 Deliverable Checklist

- [ ] IMAP fetcher with incremental UID sync
- [ ] MIME → clean text pipeline (stages 2–3) at ≥ 92% dequote precision
- [ ] Threading at ≥ 95% accuracy on a 100-thread eval set
- [ ] Entity extraction (people / dates / amounts / project keywords)
- [ ] Intent classifier with 30-shot reference set
- [ ] Canonical `Email` JSON writer
- [ ] `state/people/aliases.json` and `state/customers/domain_map.json` seed format
- [ ] Failure quarantine bucket
- [ ] CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill
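
The argument handling for the backfill CLI might start from something like this. Only the `--since YYYY-MM-DD` flag comes from the checklist; making it optional (with omission meaning "since last sync") and the `parse_args` helper are assumptions for the sketch.

```python
import argparse
from datetime import date


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        prog="atlas-extract",
        description="Run the email extraction pipeline for ad-hoc backfill.",
    )
    parser.add_argument(
        "--since",
        type=date.fromisoformat,  # rejects anything that is not YYYY-MM-DD
        default=None,
        metavar="YYYY-MM-DD",
        help="extract messages on or after this date (default: since last sync)",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"backfill since: {args.since or 'last sync'}")
```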