assistant-claw/atlas/mcp-tools/email-extractor.md
Vega (Atlas scaffolding) ce9f27320a Add Atlas profile under atlas/ — boss-perspective project execution radar
This adds the full Atlas (总助 Claw / 老板视角项目执行雷达) scaffolding as a
sibling profile to the existing Vega general-purpose assistant. All Atlas content
lives under atlas/ to keep the existing top-level skeleton intact.

What's included:

- atlas/IDENTITY.md, SOUL.md, USER.md, AGENTS.md, MEMORY.md, BOOTSTRAP.md,
  HEARTBEAT.md, TOOLS.md (+ zh-CN mirrors) — full OpenClaw 8-piece set
  matching the zero-cca convention
- atlas/skills/ — 6 sub-skills with frontmatter:
  claw-email-parser / claw-project-tracker / claw-people-observer /
  claw-customer-radar / claw-boss-distiller / claw-report-writer
- atlas/skills/claw-boss-distiller/ — adapter notes for nuwa-skill, 5-layer
  boss_skill seed template (23 rules across Expression DNA / Mental Models /
  Decision Heuristics / Anti-Patterns / Honest Boundaries), and a complete
  synthetic distillation demo (10 input emails -> validated 5-layer output)
- atlas/mcp-tools/email-extractor/ — Python implementation of stages 1-3
  (fetch + decode + dequote), 7 pytest tests passing, CLI: atlas-extract
- atlas/state-schemas/ — formal JSON schemas for project / person / customer
  cards with the no-employee-rating hard constraint baked in
- atlas/client-deck/ — 2-page client-facing pitch document
- autopilots/atlas-*.yaml — 5 autopilot configs (daily / weekly / monthly /
  quarterly + andon event-triggered) for a future Multica-side scheduler

Notes:

- nuwa-skill (MIT, https://github.com/alchaincyf/nuwa-skill) NOT vendored;
  fetch at deploy time via instructions in
  atlas/skills/claw-boss-distiller/upstream/README.md
- Vega-side prompts/skills/tools/autopilots/docs scaffold left untouched
- Top-level README.md updated with a brief Atlas pointer; rest preserved
2026-05-09 17:00:29 +08:00

# MCP Tool: email-extractor
The most underestimated component of Atlas. "Connecting to email" is a 2-day job; **extracting useful structure out of email is a 2-week job, and the rest of Atlas falls apart without it.**
This doc specifies the 7-stage extraction pipeline, the canonical `Email` object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).

---
## Why a dedicated tool
Raw email is a hostile data source:
- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
- 8+ languages mixed in one thread (Chinese / English / Japanese + tech jargon)
- Senders use the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drift across replies (`Re: Re: 项目 → 客户A 改版进度跟进`)
Atlas's downstream skills (`claw-project-tracker` etc.) assume **clean, normalized, deduplicated, intent-tagged Email objects**. The extractor is the bridge from MIME chaos to that contract.

---
## 7-Stage Pipeline
```
[Stage 1: Fetch] IMAP / Gmail API / Exchange → raw MIME bytes
[Stage 2: Decode] MIME parse, charset, HTML→text (readability)
[Stage 3: Dequote] strip quoted replies + signatures + disclaimers
[Stage 4: Thread] group by Message-ID / In-Reply-To / References / subject-fuzzy
[Stage 5: Entities] extract people, orgs, dates, amounts, project keywords
[Stage 6: Intent] classify into 8 categories (催办 / 决策 / 转交 / ...)
[Stage 7: Normalize] emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json
```
### Stage 1 — Fetch
| Backend | Lib | Notes |
|---------|-----|-------|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental |
| Exchange / O365 | `exchangelib` or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |
Output: `raw_mime` bytes + envelope (account, folder, uid, internal_date)
**Configuration:**
- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, mailing-list folders.
- Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; back off on `OVERQUOTA` / `429`
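
The UID-based incremental sync can be sketched without touching the network. This is a minimal illustration, not the shipped fetcher: the state-file layout and the helper names (`load_last_uids`, `uid_search_criteria`, `filter_new`, `save_last_uid`) are assumptions made for this example.

```python
import json
from pathlib import Path


def load_last_uids(state_file: Path) -> dict[str, int]:
    """Return {folder: last_uid}; empty on the first run."""
    if state_file.exists():
        return json.loads(state_file.read_text(encoding="utf-8"))
    return {}


def uid_search_criteria(last_uid: int) -> str:
    """IMAP UID range asking for everything newer than last_uid."""
    return f"UID {last_uid + 1}:*"


def filter_new(uids: list[int], last_uid: int) -> list[int]:
    # "n:*" always matches the highest existing UID, even when it is below n,
    # so the server may return one message that was already synced.
    return sorted(u for u in uids if u > last_uid)


def save_last_uid(state_file: Path, folder: str, uid: int) -> None:
    """Persist the high-water mark for one folder (never moves backwards)."""
    state = load_last_uids(state_file)
    state[folder] = max(uid, state.get(folder, 0))
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

UIDs are only meaningful within one `UIDVALIDITY` value for a folder, so a real fetcher would presumably store that value next to `last_uid` and fall back to a full resync when it changes.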
### Stage 2 — Decode
- Parse MIME with stdlib `email` (Python) or `mailparser` (Node)
- Detect charset; fallback chain: declared → `chardet` sniff → `utf-8` with errors=replace
- HTML body → plain text via `readability-lxml` (preserves structure) or `html2text`
- Inline images: keep `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata, do NOT treat as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)
Output adds: `body_text`, `body_html`, `language`, `attachments_meta`
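
The charset fallback chain above could look roughly like the sketch below. It uses only the stdlib `email` package plus an optional `chardet` import; the function names `decode_bytes` and `extract_body_text` are illustrative, and HTML-to-text conversion is deliberately left out.

```python
import email
from email import policy
from typing import Optional

try:  # chardet is optional here; the sketch still runs without it
    import chardet
except ImportError:
    chardet = None


def decode_bytes(payload: bytes, declared: Optional[str]) -> str:
    """Fallback chain: declared charset -> chardet sniff -> utf-8 with replacement."""
    if declared:
        try:
            return payload.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name, or the declaration was wrong
    if chardet is not None:
        guess = chardet.detect(payload).get("encoding")
        if guess:
            try:
                return payload.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    return payload.decode("utf-8", errors="replace")


def extract_body_text(raw_mime: bytes) -> str:
    """Pick the best text part (plain preferred over HTML) and decode it."""
    msg = email.message_from_bytes(raw_mime, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:  # e.g. a calendar-only or attachment-only message
        return ""
    payload = part.get_payload(decode=True) or b""
    return decode_bytes(payload, part.get_content_charset())
```

When only an HTML part exists, the returned string is still markup; the HTML → text step would run on it afterwards.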
### Stage 3 — Dequote (the unglamorous but critical step)
Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.
**Strategies (combine, fall through):**
1. **Marker patterns** (regex):
   - `^On .* wrote:$` (English)
   - `^.* 写道:$` / `^.* 于 \d{4}年.*写道:$` (Chinese)
   - `^------ (转发|原始)邮件 ------` / `------ Forwarded message ------`
   - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
   - `^>+ ` (RFC quoted lines)
2. **Signature blocks**: detect `--\s*$` separator, or trailing block with phone/title patterns
3. **Disclaimer footers**: regex for `本邮件包含保密信息`, `CONFIDENTIAL`, etc.
4. **Library helper**: vendor `EmailReplyParser` (Python or Node port) as a baseline, then layer our patterns on top
**Result:** `body_text_clean` — only the new content the sender wrote in this message.
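
A minimal sketch of strategies 1 and 2 combined, using the marker patterns listed above. The function name `dequote` and the order of operations are assumptions for illustration; the quoted-line pattern is relaxed slightly to `^>` so that bare `>` lines are also dropped.

```python
import re

# Everything from the first matching marker onward is treated as quoted history.
CUT_MARKERS = [
    re.compile(r"^On .* wrote:$", re.M),
    re.compile(r"^.* 写道:$", re.M),
    re.compile(r"^.* 于 \d{4}年.*写道:$", re.M),
    re.compile(r"^------ (转发|原始)邮件 ------", re.M),
    re.compile(r"^------ Forwarded message ------", re.M),
    re.compile(r"^From: .*\nSent: .*\nTo: .*", re.M),
]
QUOTED_LINE = re.compile(r"^>")
SIGNATURE = re.compile(r"^--\s*$", re.M)


def dequote(body: str) -> str:
    """Return only the text the sender wrote in this message."""
    text = body.replace("\r\n", "\n")
    cut = len(text)
    for pattern in CUT_MARKERS:
        match = pattern.search(text)
        if match:
            cut = min(cut, match.start())
    text = text[:cut]
    # Drop any remaining inline-quoted lines.
    text = "\n".join(ln for ln in text.split("\n") if not QUOTED_LINE.match(ln))
    # Cut at the signature separator, if one is present.
    sig = SIGNATURE.search(text)
    if sig:
        text = text[: sig.start()]
    return text.strip()
```

Some clients wrap the "On ... wrote:" attribution across two lines, which these single-line patterns miss; that is one reason to keep a library baseline underneath the regex pack.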
### Stage 4 — Thread
Goal: group all messages of one conversation into a `thread_id`.
| Method | Strength | Weakness |
|--------|----------|----------|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; only as tiebreaker |
Persist `thread_id` per message; threads are first-class — `claw-project-tracker` clusters at thread level, not message level.
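
The first two methods in the table can be sketched as header-first grouping with a subject-plus-participants fallback. The message dict keys (`message_id`, `in_reply_to`, `references`, `subject`, `participants`) and the function names are hypothetical, and the embedding tiebreaker is omitted.

```python
import re

REPLY_PREFIX = re.compile(r"^\s*(?:re|fwd?|回复|转发)\s*[::]\s*", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip stacked reply/forward prefixes, collapse whitespace, lowercase."""
    previous = None
    while previous != subject:
        previous = subject
        subject = REPLY_PREFIX.sub("", subject)
    return " ".join(subject.split()).lower()


def assign_threads(messages: list[dict]) -> dict[str, str]:
    """Map each message_id to a thread root id using a small union-find."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    seen_by_subject: dict[str, list[tuple[str, set]]] = {}
    for msg in messages:
        mid = msg["message_id"]
        find(mid)
        refs = list(msg.get("references") or [])
        if msg.get("in_reply_to"):
            refs.append(msg["in_reply_to"])
        people = set(msg.get("participants") or [])
        key = normalize_subject(msg.get("subject", ""))
        if refs:
            for ref in refs:  # header method: most reliable
                union(ref, mid)
        else:
            # Fallback: same normalized subject and overlapping participants.
            for other_mid, other_people in seen_by_subject.get(key, []):
                if people & other_people:
                    union(other_mid, mid)
                    break
        seen_by_subject.setdefault(key, []).append((mid, people))
    return {m["message_id"]: find(m["message_id"]) for m in messages}
```

The root id returned here is just the first message seen in each group; deriving a stable `thread_id` string from it is left open.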
### Stage 5 — Entity Extraction
Per cleaned message, extract:
| Entity | Method |
|--------|--------|
| **People** | from / to / cc parsed addresses → normalize to `(name, email)` tuples; fuzzy-merge identities (`zhang san <zhang@a.com>` ≡ `张三 <zhang.san@a-corp.cn>`) using a maintained alias map under `state/people/aliases.json` |
| **Internal vs external** | `email_domain ∈ company_domains` → internal; else external (= candidate customer) |
| **Organization (customer)** | external email domain → lookup in `state/customers/domain_map.json`; new domain → create candidate `customers/UNCLASSIFIED-<domain>.json` for boss confirmation |
| **Dates** | `dateparser` lib (multi-language) for "下周三" / "by EOM" / "Mar 15" |
| **Amounts** | regex for `¥1,200` / `$50K` / `30 万` / `200万元` |
| **Project keywords** | (a) seed list from boss; (b) noun phrases via `spacy` zh+en models; cluster across thread |
| **Action verbs** | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |
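
Two of the rows above, amounts and internal-vs-external, are simple enough to sketch with the stdlib. The regex, the multiplier table, and the currency mapping (`¥` read as CNY, `$` as USD) are assumptions for this example; a real deployment would need to decide how to treat `¥` in Japanese threads.

```python
import re
from decimal import Decimal
from email.utils import parseaddr

AMOUNT = re.compile(
    r"(?P<cur>[¥$])?\s*(?P<num>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<unit>[Kk]\b|万|亿)?\s*(?P<yuan>元)?"
)
MULTIPLIER = {"k": Decimal(1_000), "万": Decimal(10_000), "亿": Decimal(100_000_000)}


def extract_amounts(text: str) -> list[tuple[Decimal, object]]:
    """Return (value, currency) pairs; currency is None when it cannot be inferred."""
    found = []
    for m in AMOUNT.finditer(text):
        cur, unit, yuan = m["cur"], m["unit"], m["yuan"]
        if not (cur or unit or yuan):
            continue  # bare number, e.g. a day of the month
        value = Decimal(m["num"].replace(",", ""))
        if unit:
            value *= MULTIPLIER[unit.lower()]
        if cur == "$":
            currency = "USD"
        elif cur == "¥" or yuan or unit in ("万", "亿"):
            currency = "CNY"
        else:
            currency = None
        found.append((value, currency))
    return found


def is_internal(address: str, company_domains: set[str]) -> bool:
    """True when the address's domain is one of the company's own domains."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower() in company_domains
```

Using `Decimal` rather than `float` keeps values such as `30 万` exact when they are later summed or compared.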
### Stage 6 — Intent Classification
8 intents; each message gets exactly one primary intent and zero or more secondary intents:
| Intent | Examples |
|--------|----------|
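| `催办` (urge) | "麻烦 ASAP" (please, ASAP) / "deadline 已过" (the deadline has passed) |
| `决策` (decide) | "我同意 / 不同意 / 选 A" (I agree / disagree / pick A) |
| `转交` (delegate) | "请张三跟一下" (please have Zhang San follow up) / "+张三" (+Zhang San) |
| `询问` (ask) | "进展如何" (how is it going) / "有更新吗" (any update?) |
| `抱怨` (complain) | "再不给答复就..." (if we don't get an answer soon, we'll...) / "为什么这么慢" (why is this so slow) |
| `表扬` (praise) | "辛苦了 / 做得不错" (thanks for the hard work / well done) |
| `通知` (inform) | "FYI / 同步一下" (FYI / just syncing up) |
| `闲聊` (smalltalk) | greetings, pleasantries |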
**Method:** few-shot LLM classification with 30-example reference set (seeded from the boss's own emails). Cache by `body_text_clean` hash to avoid re-classifying duplicates.
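
The caching step can be sketched independently of any particular LLM client. Here the classifier is an injected callable, and `IntentCache` is a hypothetical name; the 0.6 threshold and the `通知` fallback come from the Failure Handling table in this doc.

```python
import hashlib
from typing import Callable

LOW_CONFIDENCE = 0.6       # threshold from the Failure Handling table
FALLBACK_INTENT = "通知"    # default primary intent when confidence is low


class IntentCache:
    """Memoize classifier results by the SHA-256 of body_text_clean."""

    def __init__(self, classify: Callable[[str], dict]):
        self._classify = classify  # e.g. a wrapper around the few-shot LLM call
        self._cache: dict[str, dict] = {}

    def get(self, body_text_clean: str) -> dict:
        key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
        if key not in self._cache:
            result = dict(self._classify(body_text_clean))
            if result.get("confidence", 0.0) < LOW_CONFIDENCE:
                result["primary"] = FALLBACK_INTENT
                result["low_confidence"] = True
            self._cache[key] = result
        return self._cache[key]
```

This cache is in-memory only; persisting it between runs (for example keyed by the same hash on disk) would be a separate design decision.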
### Stage 7 — Normalize → Canonical Email JSON
Final output stored as `state/extracted/YYYY-MM/<thread_id>/<msg_id>.json`:
```json
{
"msg_id": "CAH+...@mail.gmail.com",
"thread_id": "thr-2026-04-12-abc123",
"internal_date": "2026-04-22T14:33:00+08:00",
"from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
"to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
"cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
"subject_normalized": "客户A 官网改版 进度跟进",
"language": "zh-CN",
"body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
"entities": {
"dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
"amounts": [],
"project_keywords": ["官网改版"],
"internal_people": ["李四", "Boss"],
"external_people": ["客户A 王总"],
"customer_id_candidate": "CUST-clientco"
},
"intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
"attachments": [],
"extraction_version": "v0.1",
"extracted_at": "2026-05-09T07:30:12Z",
"rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```
This is the contract. `claw-project-tracker`, `claw-people-observer`, `claw-customer-radar` consume only this — never raw MIME.
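
A minimal sketch of the writer, assuming the path layout stated above. The helper names, the `msg_id` sanitization rule, and the write-then-rename step are assumptions for illustration, not the documented behavior.

```python
import json
import os
import re
import tempfile
from datetime import datetime
from pathlib import Path


def output_path(root: Path, email_obj: dict) -> Path:
    """Build <root>/YYYY-MM/<thread_id>/<msg_id>.json from the canonical object."""
    month = datetime.fromisoformat(email_obj["internal_date"]).strftime("%Y-%m")
    # Assumption: replace characters that are unsafe in file names with "_".
    safe_id = re.sub(r"[^A-Za-z0-9._@+-]", "_", email_obj["msg_id"])
    return root / month / email_obj["thread_id"] / f"{safe_id}.json"


def write_email(root: Path, email_obj: dict) -> Path:
    """Write via a temp file and rename, so readers never see a partial file."""
    path = output_path(root, email_obj)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(email_obj, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)
    return path
```

`ensure_ascii=False` keeps Chinese text readable in the stored JSON instead of escaping it to `\uXXXX` sequences.
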

---
## Failure Handling
| Failure | Recovery |
|---------|---------|
| MIME parse fails | Log to `state/extracted/_failed/`, continue with next |
| Charset undetectable | Mark `body_text_clean = ""`, intent = `unknown`, surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create `UNCLASSIFIED-<domain>` candidate; boss confirms in week-1 |
| Person alias collision | Surface in `state/people/_to_merge.json` for boss |
| Intent confidence < 0.6 | Default to `通知`, mark `low_confidence: true` |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |
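
The rate-limit row can be sketched as a small retry wrapper. `RateLimited` is a hypothetical exception standing in for whatever the backend raises on `OVERQUOTA` / `429`, and the default delays are placeholders rather than tuned values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker for OVERQUOTA / 429 responses."""


def with_backoff(
    call: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with exponentially growing, capped delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up; the next heartbeat resumes from persisted state
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the wrapper testable without real waiting; adding random jitter to the delay would be a reasonable extension when several workers share one mailbox.
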

---
## Performance Targets
| Metric | V0 target |
|--------|-----------|
| Extraction throughput | 200 msgs/min on a single worker |
| Stage 3 dequote precision | 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | 95% (vs human-labeled) |
| Stage 5 entity recall (people) | 98% |
| Stage 6 intent accuracy | 80% top-1, 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |
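
The top-1 / top-3 intent targets presuppose that the classifier can return a ranked list of candidate intents, which is an assumption here. Under that assumption, the metric could be computed with a helper like this (`top_k_accuracy` is an illustrative name):

```python
def top_k_accuracy(ranked_predictions: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of messages whose gold intent appears in the first k predictions."""
    if not gold or len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, gold) if truth in ranked[:k])
    return hits / len(gold)
```

With `k=1` the same function gives plain accuracy, so one labeled eval set can serve both targets.
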

---
## Reuse vs Build
| Component | Approach |
|-----------|----------|
| IMAP / Gmail / Exchange auth + fetch | **Reuse** `imap-tools`, `google-api-python-client`, `exchangelib` |
| MIME parse | **Reuse** stdlib `email` |
| HTML → text | **Reuse** `readability-lxml` |
| Quote stripping | **Reuse + extend** `EmailReplyParser` baseline + our regex packs |
| Language detection | **Reuse** `fasttext-langdetect` |
| Date parsing | **Reuse** `dateparser` |
| Entity extraction (NER) | **Reuse** `spacy` zh + en models |
| Intent classification | **Build (LLM few-shot)** small custom prompt, cache by body hash |
| Threading | **Build** header-first, custom fallbacks |
| Alias merging | **Build** boss-curated `aliases.json` |
**Estimate:** 5-7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.

---
## V0 Deliverable Checklist
- [ ] IMAP fetcher with incremental UID sync
- [ ] MIME → clean text pipeline (stages 2-3) at 92% dequote precision
- [ ] Threading at 95% accuracy on a 100-thread eval set
- [ ] Entity extraction (people / dates / amounts / project keywords)
- [ ] Intent classifier with 30-shot reference set
- [ ] Canonical `Email` JSON writer
- [ ] `state/people/aliases.json` and `state/customers/domain_map.json` seed format
- [ ] Failure quarantine bucket
- [ ] CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill
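
The `--since` flag could be parsed along these lines. This is a sketch of the argument handling only; `build_parser` is a hypothetical name and no other flags are assumed.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="atlas-extract",
        description="Ad-hoc backfill of extracted Email objects.",
    )
    parser.add_argument(
        "--since",
        type=date.fromisoformat,  # argparse reports a usage error on invalid dates
        metavar="YYYY-MM-DD",
        default=None,
        help="only process messages on or after this date",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"backfilling since {args.since}")
```

Leaving `--since` unset (`None`) is a natural place to fall back to the "since last sync" default described under Stage 1.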