assistant-claw/atlas/mcp-tools/email-extractor.md
Vega (Atlas scaffolding) ce9f27320a Add Atlas profile under atlas/ — boss-perspective project execution radar
This adds the full Atlas (总助 Claw / 老板视角项目执行雷达) scaffolding as a
sibling profile to the existing Vega general-purpose assistant. All Atlas content
lives under atlas/ to keep the existing top-level skeleton intact.

What's included:

- atlas/IDENTITY.md, SOUL.md, USER.md, AGENTS.md, MEMORY.md, BOOTSTRAP.md,
  HEARTBEAT.md, TOOLS.md (+ zh-CN mirrors) — full OpenClaw 8-piece set
  matching the zero-cca convention
- atlas/skills/ — 6 sub-skills with frontmatter:
  claw-email-parser / claw-project-tracker / claw-people-observer /
  claw-customer-radar / claw-boss-distiller / claw-report-writer
- atlas/skills/claw-boss-distiller/ — adapter notes for nuwa-skill, 5-layer
  boss_skill seed template (23 rules across Expression DNA / Mental Models /
  Decision Heuristics / Anti-Patterns / Honest Boundaries), and a complete
  synthetic distillation demo (10 input emails -> validated 5-layer output)
- atlas/mcp-tools/email-extractor/ — Python implementation of stages 1-3
  (fetch + decode + dequote), 7 pytest tests passing, CLI: atlas-extract
- atlas/state-schemas/ — formal JSON schemas for project / person / customer
  cards with the no-employee-rating hard constraint baked in
- atlas/client-deck/ — 2-page client-facing pitch document
- autopilots/atlas-*.yaml — 5 autopilot configs (daily / weekly / monthly /
  quarterly + andon event-triggered) for a future Multica-side scheduler

Notes:

- nuwa-skill (MIT, https://github.com/alchaincyf/nuwa-skill) NOT vendored;
  fetch at deploy time via instructions in
  atlas/skills/claw-boss-distiller/upstream/README.md
- Vega-side prompts/skills/tools/autopilots/docs scaffold left untouched
- Top-level README.md updated with a brief Atlas pointer; rest preserved
2026-05-09 17:00:29 +08:00


MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; extracting useful structure out of email is a 2-week job and the rest of Atlas falls apart without it.

This doc specifies the 7-stage extraction pipeline, the canonical Email object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).


Why a dedicated tool

Raw email is a hostile data source:

  • HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
  • Quoted reply chains stacking 10+ deep, each with a different signature block
  • Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
  • 8+ languages mixed in one thread (Chinese / English / Japanese + tech jargon)
  • Senders use the same name with different addresses (zhang@a.com vs zhang.san@a-corp.cn)
  • Subject lines drift across replies (Re: Re: 项目 ("project") → 客户A 改版进度跟进 ("Client A redesign progress follow-up"))

Atlas's downstream skills (claw-project-tracker etc.) assume clean, normalized, deduplicated, intent-tagged Email objects. The extractor is the bridge from MIME chaos to that contract.


7-Stage Pipeline

[Stage 1: Fetch]      IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]     MIME parse, charset, HTML→text (readability)
        ↓
[Stage 3: Dequote]    strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]     group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]   extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]     classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize]  emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json

Stage 1 — Fetch

| Backend | Lib | Notes |
|---|---|---|
| IMAP | imap-tools (Python) or node-imap | Use UID-based incremental sync; persist last_uid per folder |
| Gmail API | google-api-python-client | OAuth2; use historyId for incremental sync |
| Exchange / O365 | exchangelib or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: raw_mime bytes + envelope (account, folder, uid, internal_date)

Configuration:

  • Folders to scan: INBOX, Sent, optionally Drafts. Exclude Spam, Trash, mailing-list folders.
  • Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
  • Rate limit: respect server limits; backoff on OVERQUOTA / 429
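The incremental UID sync can be sketched as below. This is a minimal illustration, not the shipped fetcher: the state file layout and every helper name are hypothetical, and the IMAP call itself is left out so the snippet runs without a server. One detail worth encoding is that an IMAP `UID n:*` range always matches the highest-UID message, even when that UID is below `n`, so fetched UIDs need filtering against `last_uid`.

```python
import json
from pathlib import Path


def load_last_uid(state_file: Path, account: str, folder: str) -> int:
    """Return the last synced UID for (account, folder), or 0 on a first run."""
    if not state_file.exists():
        return 0
    state = json.loads(state_file.read_text(encoding="utf-8"))
    return state.get(account, {}).get(folder, 0)


def save_last_uid(state_file: Path, account: str, folder: str, uid: int) -> None:
    """Persist the highest UID processed so the next run can resume from it."""
    state = {}
    if state_file.exists():
        state = json.loads(state_file.read_text(encoding="utf-8"))
    state.setdefault(account, {})[folder] = uid
    state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")


def uid_search_criteria(last_uid: int) -> str:
    """Build the IMAP UID range covering everything after last_uid."""
    return f"UID {last_uid + 1}:*"


def new_uids(fetched_uids, last_uid: int):
    """Drop UIDs already seen; 'n:*' can return the current max even if it is < n."""
    return sorted(u for u in fetched_uids if u > last_uid)
```

Keeping one counter per (account, folder) pair matches the "persist last_uid per folder" note; a production version would also need to watch UIDVALIDITY, which this sketch ignores.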

Stage 2 — Decode

  • Parse MIME with stdlib email (Python) or mailparser (Node)
  • Detect charset; fallback chain: declared → chardet sniff → utf-8 with errors=replace
  • HTML body → plain text via readability-lxml (preserves structure) or html2text
  • Inline images: keep cid: reference for later attachment OCR (V1)
  • Calendar invites (text/calendar): extract event metadata, do NOT treat as conversation
  • Detect language per body part with fasttext-langdetect (multilingual support)

Output adds: body_text, body_html, language, attachments_meta
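The charset fallback chain can be illustrated with the stdlib `email` package. This is a sketch under assumptions: the function names are hypothetical, `chardet` is treated as an optional dependency, and HTML conversion, attachments, and language detection are omitted.

```python
from email import message_from_bytes, policy
from typing import Optional


def decode_part(payload: bytes, declared: Optional[str]) -> str:
    """Fallback chain: declared charset -> chardet sniff -> utf-8 with replacement."""
    if declared:
        try:
            return payload.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset label; fall through
    try:
        import chardet  # optional third-party sniffing
        guess = chardet.detect(payload).get("encoding")
        if guess:
            try:
                return payload.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    except ImportError:
        pass
    return payload.decode("utf-8", errors="replace")


def extract_body_text(raw_mime: bytes) -> str:
    """Return the text/plain body of a raw MIME message, or '' if there is none."""
    msg = message_from_bytes(raw_mime, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    payload = part.get_payload(decode=True) or b""  # undoes base64 / quoted-printable
    return decode_part(payload, part.get_content_charset())
```

Decoding from the raw payload bytes, rather than relying on the parser's own text decoding, keeps the fallback order explicit and easy to audit.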

Stage 3 — Dequote (the unglamorous but critical step)

Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.

Strategies (combine, fall through):

  1. Marker patterns (regex):
    • ^On .* wrote:$ (English)
    • ^.* 写道:$ / ^.* 于 \d{4}年.*写道:$ (Chinese)
    • ^------ (转发|原始)邮件 ------ / ------ Forwarded message ------
    • ^From: .*\nSent: .*\nTo: .* (Outlook block headers)
    • ^>+ (RFC quoted lines)
  2. Signature blocks: detect --\s*$ separator, or trailing block with phone/title patterns
  3. Disclaimer footers: regex for 本邮件包含保密信息 ("this email contains confidential information"), CONFIDENTIAL, etc.
  4. Library helper: vendor EmailReplyParser (Python or Node port) as a baseline, then layer our patterns on top

Result: body_text_clean — only the new content the sender wrote in this message.
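A minimal sketch of the marker-based strategies follows. It is not the EmailReplyParser baseline and not the shipped regex pack; the patterns are simplified versions of the ones listed above and `dequote` is a hypothetical name. It cuts at the first reply header, forward banner, signature separator, or disclaimer, and drops `>`-quoted lines.

```python
import re

# Everything from the first line matching one of these onward is treated as history.
CUT_MARKERS = [
    re.compile(r"^On .* wrote:$"),                      # English reply header
    re.compile(r"^.*写道[::]$"),                         # Chinese reply header
    re.compile(r"^-{2,}\s*(转发|原始)邮件\s*-{2,}"),       # Chinese forward / original banner
    re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}"),  # English forward banner
    re.compile(r"^--$"),                                # signature separator (after rstrip)
    re.compile(r"本邮件包含保密信息|^CONFIDENTIAL\b"),     # disclaimer footers
]


def _is_outlook_header(lines, i):
    """Outlook block header: a 'From:' line immediately followed by 'Sent:'."""
    return (
        lines[i].startswith("From: ")
        and i + 1 < len(lines)
        and lines[i + 1].startswith("Sent: ")
    )


def dequote(body_text: str) -> str:
    """Keep only the new content written in this message."""
    lines = [line.rstrip() for line in body_text.splitlines()]
    kept = []
    for i, line in enumerate(lines):
        if any(p.search(line) for p in CUT_MARKERS) or _is_outlook_header(lines, i):
            break  # the rest is quoted history, signature, or disclaimer
        if line.lstrip().startswith(">"):
            continue  # RFC-style quoted line
        kept.append(line)
    return "\n".join(kept).strip()
```

Cutting at the first marker is deliberately aggressive; it assumes top-posted replies and would lose inline answers written below a quoted block.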

Stage 4 — Thread

Goal: group all messages of one conversation into a thread_id.

| Method | Strength | Weakness |
|---|---|---|
| Message-ID + In-Reply-To + References headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip Re: / Fwd: / 回复: / 转发: prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over body_text_clean[:500] | Catches subject drift | Expensive; only as tiebreaker |

Persist thread_id per message; threads are first-class — claw-project-tracker clusters at thread level, not message level.
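The header-first grouping with a subject and participant fallback can be sketched with a small union-find. The input field names (`in_reply_to`, `references`, `subject`, `participants`) are assumptions for illustration, the embedding tiebreaker is left out, and the returned thread key is simply a representative message id rather than a `thr-...` identifier.

```python
import re

# Strip any run of reply/forward prefixes, ASCII or full-width colon.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|回复|转发)\s*[::]\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return _PREFIX.sub("", subject).strip()


def assign_threads(messages):
    """Map each msg_id to a representative id shared by its whole thread."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    by_subject = {}  # normalized subject -> [(msg_id, participants)]
    for m in messages:
        mid = m["msg_id"]
        find(mid)
        linked = [r for r in [m.get("in_reply_to"), *m.get("references", [])] if r]
        for ref in linked:
            union(ref, mid)  # header evidence always wins
        people = set(m.get("participants", []))
        key = normalize_subject(m.get("subject", ""))
        if not linked:
            # Fallback: same normalized subject and at least one shared participant.
            for other_id, other_people in by_subject.get(key, []):
                if people & other_people:
                    union(other_id, mid)
                    break
        by_subject.setdefault(key, []).append((mid, people))
    return {m["msg_id"]: find(m["msg_id"]) for m in messages}
```

Messages are assumed to arrive in date order, so a header-less reply can only attach to a conversation that has already been seen.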

Stage 5 — Entity Extraction

Per cleaned message, extract:

| Entity | Method |
|---|---|
| People | from / to / cc parsed addresses → normalize to (name, email) tuples; fuzzy-merge identities (zhang san <zhang@a.com> and 张三 <zhang.san@a-corp.cn>) using a maintained alias map under state/people/aliases.json |
| Internal vs external | email_domain ∈ company_domains → internal; else external (= candidate customer) |
| Organization (customer) | external email domain → lookup in state/customers/domain_map.json; new domain → create candidate customers/UNCLASSIFIED-<domain>.json for boss confirmation |
| Dates | dateparser lib (multi-language) for "下周三" / "by EOM" / "Mar 15" |
| Amounts | regex for ¥1,200 / $50K / 30 万 / 200万元 |
| Project keywords | (a) seed list from boss; (b) noun phrases via spacy zh+en models; cluster across thread |
| Action verbs | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |
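As one concrete case from the table, the amount patterns (¥1,200 / $50K / 30 万 / 200万元) could be matched roughly as follows. This is a sketch, not the project's regex pack: the output field names are hypothetical, and treating ¥ as CNY is an assumption since the same symbol is used for JPY.

```python
import re

_AMOUNT = re.compile(
    r"(?P<cur>[¥$])?\s*"
    r"(?P<num>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[kKmM](?![A-Za-z])|万|亿)?\s*"  # K/M must not start a word like "km"
    r"(?P<yuan>元)?"
)
_MULTIPLIER = {"k": 1e3, "m": 1e6, "万": 1e4, "亿": 1e8}


def extract_amounts(text: str):
    """Return amounts with a numeric value; bare numbers (dates, counts) are skipped."""
    results = []
    for m in _AMOUNT.finditer(text):
        cur, unit, yuan = m.group("cur"), m.group("unit"), m.group("yuan")
        if not (cur or unit or yuan):
            continue  # a plain number such as "15" in "Mar 15"
        value = float(m.group("num").replace(",", ""))
        if unit:
            value *= _MULTIPLIER[unit.lower()]
        if cur == "$":
            currency = "USD"
        elif cur == "¥" or yuan:
            currency = "CNY"  # assumption: ¥ means CNY in this mailbox
        else:
            currency = None   # "30 万" alone does not name a currency
        results.append({"text": m.group(0).strip(), "value": value, "currency": currency})
    return results
```

Requiring a currency symbol, a magnitude unit, or 元 is what keeps dates and counts out of the result.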

Stage 6 — Intent Classification

8 intents (mutually exclusive primary + multiple secondary):

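| Intent | Examples |
|---|---|
| 催办 (urge) | "麻烦 ASAP" (please, ASAP) / "deadline 已过" (deadline has passed) |
| 决策 (decide) | "我同意 / 不同意 / 选 A" (I agree / disagree / choose A) |
| 转交 (delegate) | "请张三跟一下" (please have Zhang San follow up) / "+张三" (adding Zhang San) |
| 询问 (ask) | "进展如何" (how is it going?) / "有更新吗" (any updates?) |
| 抱怨 (complain) | "再不给答复就..." (if there is still no reply, then...) / "为什么这么慢" (why is this so slow?) |
| 表扬 (praise) | "辛苦了 / 做得不错" (thanks for the hard work / well done) |
| 通知 (inform) | "FYI / 同步一下" (FYI / syncing you up) |
| 闲聊 (smalltalk) | greetings, pleasantries |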

Method: few-shot LLM classification with 30-example reference set (seeded from the boss's own emails). Cache by body_text_clean hash to avoid re-classifying duplicates.
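The caching step might look like the sketch below. The classifier itself is injected as a plain callable, so nothing here assumes a particular LLM client; `CachedIntentClassifier` and `body_hash` are hypothetical names, and the cache is in-memory only.

```python
import hashlib
from typing import Callable, Dict


def body_hash(body_text_clean: str) -> str:
    """Stable cache key for a cleaned body."""
    return hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()


class CachedIntentClassifier:
    """Wrap any classify(text) -> dict callable with a hash-keyed cache."""

    def __init__(self, classify: Callable[[str], dict]):
        self._classify = classify
        self._cache: Dict[str, dict] = {}

    def __call__(self, body_text_clean: str) -> dict:
        key = body_hash(body_text_clean)
        if key not in self._cache:
            # Only bodies not seen before reach the (expensive) classifier.
            self._cache[key] = self._classify(body_text_clean)
        return self._cache[key]
```

Because the key is computed after dequoting, identical replies such as repeated "收到" acknowledgements cost one classification in total.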

Stage 7 — Normalize → Canonical Email JSON

Final output stored as state/extracted/YYYY-MM/<thread_id>/<msg_id>.json:

{
  "msg_id": "CAH+...@mail.gmail.com",
  "thread_id": "thr-2026-04-12-abc123",
  "internal_date": "2026-04-22T14:33:00+08:00",
  "from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
  "to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
  "cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
  "subject_normalized": "客户A 官网改版 进度跟进",
  "language": "zh-CN",
  "body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
  "entities": {
    "dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
    "amounts": [],
    "project_keywords": ["官网改版"],
    "internal_people": ["李四", "Boss"],
    "external_people": ["客户A 王总"],
    "customer_id_candidate": "CUST-clientco"
  },
  "intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
  "attachments": [],
  "extraction_version": "v0.1",
  "extracted_at": "2026-05-09T07:30:12Z",
  "rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}

This is the contract. claw-project-tracker, claw-people-observer, claw-customer-radar consume only this — never raw MIME.
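A writer for this layout could be sketched as follows. Two points are assumptions rather than part of the spec: the `YYYY-MM` directory is taken from `internal_date`, and characters in `msg_id` that are unsafe in file names are replaced with `_`.

```python
import json
import re
from pathlib import Path


def output_path(root: Path, email: dict) -> Path:
    """Build <root>/YYYY-MM/<thread_id>/<msg_id>.json for a canonical Email dict."""
    month = email["internal_date"][:7]  # "2026-04-22T14:33:00+08:00" -> "2026-04"
    safe_id = re.sub(r"[^A-Za-z0-9._@+-]", "_", email["msg_id"])
    return root / month / email["thread_id"] / f"{safe_id}.json"


def write_email(root: Path, email: dict) -> Path:
    """Serialize the Email object, keeping non-ASCII text readable on disk."""
    path = output_path(root, email)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(email, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Writing with `ensure_ascii=False` keeps Chinese bodies and intent labels legible when the files are inspected by hand.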


Failure Handling

| Failure | Recovery |
|---|---|
| MIME parse fails | Log to state/extracted/_failed/, continue with next |
| Charset undetectable | Mark body_text_clean = "", intent = unknown, surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create UNCLASSIFIED-<domain> candidate; boss confirms in week-1 |
| Person alias collision | Surface in state/people/_to_merge.json for boss |
| Intent confidence < 0.6 | Default to 通知 (inform), mark low_confidence: true |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |
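Two of these recovery rules are mechanical enough to sketch. The names below are hypothetical, placing `low_confidence` inside the intent object is an assumption (the table does not say where the flag lives), and the backoff base and cap are illustrative values.

```python
LOW_CONFIDENCE_THRESHOLD = 0.6
DEFAULT_INTENT = "通知"  # inform


def apply_low_confidence_fallback(intent: dict) -> dict:
    """Below the threshold, fall back to the default primary intent and flag it."""
    if intent.get("confidence", 0.0) >= LOW_CONFIDENCE_THRESHOLD:
        return intent
    return {**intent, "primary": DEFAULT_INTENT, "low_confidence": True}


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

The fallback returns a new dict rather than mutating its input, so the original classifier output stays available for auditing.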

Performance Targets

| Metric | V0 target |
|---|---|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |
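For the intent targets, top-1 and top-3 accuracy could be computed as below. This sketch assumes the classifier can return a ranked list of intents per message, which is not something the spec states; `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k: int) -> float:
    """Fraction of messages whose gold intent appears in the first k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels) if gold in ranked[:k]
    )
    return hits / len(gold_labels)
```

Running it with k=1 and k=3 over the same labeled sample gives the two numbers to compare against the 80% and 95% targets.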

Reuse vs Build

| Component | Approach |
|---|---|
| IMAP / Gmail / Exchange auth + fetch | Reuse: imap-tools, google-api-python-client, exchangelib |
| MIME parse | Reuse: stdlib email |
| HTML→text | Reuse: readability-lxml |
| Quote stripping | Reuse + extend: EmailReplyParser baseline + our regex packs |
| Language detection | Reuse: fasttext-langdetect |
| Date parsing | Reuse: dateparser |
| Entity extraction (NER) | Reuse: spacy zh + en models |
| Intent classification | Build (LLM few-shot): small custom prompt, cache by body hash |
| Threading | Build: header-first, custom fallbacks |
| Alias merging | Build: boss-curated aliases.json |

Estimate: 5-7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.


V0 Deliverable Checklist

  • IMAP fetcher with incremental UID sync
  • MIME → clean text pipeline (stages 2-3) at ≥ 92% dequote precision
  • Threading at ≥ 95% accuracy on a 100-thread eval set
  • Entity extraction (people / dates / amounts / project keywords)
  • Intent classifier with 30-shot reference set
  • Canonical Email JSON writer
  • state/people/aliases.json and state/customers/domain_map.json seed format
  • Failure quarantine bucket
  • CLI: atlas-extract --since YYYY-MM-DD for ad-hoc backfill
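The `--since` flag could be parsed as in the sketch below. It is not the actual `atlas-extract` entry point; only `--since` comes from the checklist, and treating an omitted flag as "since last sync" is an assumption.

```python
import argparse
from datetime import date


def parse_since(value: str) -> date:
    """Validate the backfill start date."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="atlas-extract",
        description="Ad-hoc backfill of extracted Email objects.",
    )
    parser.add_argument(
        "--since",
        type=parse_since,
        default=None,  # assumption: None means "resume from last sync"
        help="only process messages on or after this date (YYYY-MM-DD)",
    )
    return parser
```

Usage would then be `build_parser().parse_args(["--since", "2026-01-15"])`, which yields a namespace whose `since` attribute is a `datetime.date`.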