This repo IS Atlas (总助 Claw, the boss's-eye-view project execution
radar). The earlier two-profile framing (Atlas + Vega placeholder) was a
misread — Vega is the agent persona answering Multica issues, not the
product. Vega has no relationship to assistant-claw the product.

Changes:
- Move atlas/* to top-level (git mv preserves history)
- Remove empty Vega placeholders prompts/.gitkeep, tools/.gitkeep
- Delete atlas/ wrapper directory (now empty)
- Update path references in INTEGRATION-hermes.md, scripts/mirror-...sh,
  docs/decisions/0001-mirror-nuwa-skill.md
- Rewrite README.md as Atlas-only, remove dual-profile language

After this commit:
- Top-level OpenClaw 8 files (IDENTITY/SOUL/USER/AGENTS/TOOLS/MEMORY/
  BOOTSTRAP/HEARTBEAT + CLAUDE symlink + zh-CN mirrors)
- skills/{6 sub-skills + DESCRIPTION + README}
- mcp-tools/{spec + Python implementation}
- state-schemas/{project, person, customer + README}
- autopilots/{5 atlas-*.yaml}
- client-deck/, docs/decisions/, scripts/

The ~/.hermes/skills/atlas/ destination convention is preserved (atlas as
a skill namespace on the operator's machine, distinct from source path).

# MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; **extracting useful structure out of email is a 2-week job and the rest of Atlas falls apart without it.**

This doc specifies the 7-stage extraction pipeline, the canonical `Email` object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).

---

## Why a dedicated tool

Raw email is a hostile data source:

- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
- 8+ languages mixed in one thread (Chinese / English / Japanese + tech jargon)
- Senders use the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drift across replies (`Re: Re: 项目 → 客户A 改版进度跟进`)

Atlas's downstream skills (`claw-project-tracker` etc.) assume **clean, normalized, deduplicated, intent-tagged Email objects**. The extractor is the bridge from MIME chaos to that contract.

---

## 7-Stage Pipeline

```
[Stage 1: Fetch]      IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]     MIME parse, charset, HTML→text (readability)
        ↓
[Stage 3: Dequote]    strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]     group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]   extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]     classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize]  emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json
```

### Stage 1 — Fetch

| Backend | Lib | Notes |
|---------|-----|-------|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental |
| Exchange / O365 | `exchangelib` or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: `raw_mime` bytes + envelope (account, folder, uid, internal_date)

**Configuration:**

- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, mailing-list folders.
- Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; backoff on `OVERQUOTA` / `429`

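
The incremental-sync bookkeeping for the IMAP backend can be sketched independently of the client library. This is a minimal illustration under stated assumptions, not the Atlas implementation: the state-file layout, the `account/folder` key, and every function name are hypothetical. The sketch stores `UIDVALIDITY` next to `last_uid`, since IMAP UIDs only stay meaningful while that value is unchanged, and it filters the search result because a `n:*` UID range always matches the highest existing message even when its UID is below `n`.

```python
import json
from pathlib import Path


def load_state(path):
    """Load {"account/folder": {"uidvalidity": int, "last_uid": int}}; empty on first run."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def resume_uid(state, key, uidvalidity):
    """Return the last synced UID, or 0 if never synced or UIDVALIDITY changed."""
    entry = state.get(key)
    if entry is None or entry["uidvalidity"] != uidvalidity:
        return 0  # stored UIDs are no longer valid: resync the folder
    return entry["last_uid"]


def uid_range(last_uid):
    """UID set covering everything after last_uid, e.g. '43:*'."""
    return f"{last_uid + 1}:*"


def keep_new(uids, last_uid):
    """Drop UIDs at or below last_uid (the server may echo the newest old message)."""
    return sorted(u for u in uids if u > last_uid)


def record_sync(state, key, uidvalidity, synced_uids, path):
    """Persist the new high-water mark after a batch has been written."""
    if synced_uids:
        state[key] = {"uidvalidity": uidvalidity, "last_uid": max(synced_uids)}
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

A fetcher would pass `uid_range(...)` to whichever client it uses (for example as the argument of an IMAP `UID SEARCH UID` command) and call `record_sync` only after the batch is safely on disk, so a crash re-fetches messages rather than skipping them.
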
### Stage 2 — Decode

- Parse MIME with stdlib `email` (Python) or `mailparser` (Node)
- Detect charset; fallback chain: declared → `chardet` sniff → `utf-8` with `errors="replace"`
- HTML body → plain text via `readability-lxml` (keeps the main content's structure; its output is cleaned HTML, so a text pass still follows) or `html2text`
- Inline images: keep `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata, do NOT treat as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)

Output adds: `body_text`, `body_html`, `language`, `attachments_meta`

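
The MIME parse and the charset fallback chain can be sketched with the standard library alone. This is a hedged illustration: the function names and the returned dict keys (`is_calendar` in particular) are assumptions, `chardet` is treated as optional so the sketch also runs without it, and HTML-to-text conversion and language detection are left out.

```python
from email import message_from_bytes, policy

try:
    import chardet  # optional third-party sniffer; the chain skips it if missing
except ImportError:
    chardet = None


def decode_bytes(payload, declared):
    """Decode with the fallback chain: declared -> chardet sniff -> utf-8/replace.

    Returns (text, strategy) so the strategy can be recorded for auditing.
    """
    if declared:
        try:
            return payload.decode(declared), "declared"
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset label: fall through
    if chardet is not None:
        guess = chardet.detect(payload).get("encoding")
        if guess:
            try:
                return payload.decode(guess), "chardet"
            except (LookupError, UnicodeDecodeError):
                pass
    return payload.decode("utf-8", errors="replace"), "utf-8-replace"


def extract_bodies(raw_mime):
    """Walk the MIME tree and keep the first text/plain and text/html parts."""
    msg = message_from_bytes(raw_mime, policy=policy.default)
    out = {"body_text": "", "body_html": "", "is_calendar": False}
    for part in msg.walk():
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        ctype = part.get_content_type()
        if ctype == "text/calendar":
            out["is_calendar"] = True  # invite metadata, not conversation
            continue
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""  # undo base64 / quoted-printable
        text, _strategy = decode_bytes(payload, part.get_content_charset())
        key = "body_text" if ctype == "text/plain" else "body_html"
        if not out[key]:
            out[key] = text
    return out
```
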
### Stage 3 — Dequote (the unglamorous but critical step)

Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.

**Strategies (combine, fall through):**

1. **Marker patterns** (regex):
   - `^On .* wrote:$` (English)
   - `^.* 写道:$` / `^.* 于 \d{4}年.*写道:$` (Chinese "... wrote:")
   - `^------ (转发|原始)邮件 ------` (Chinese "forwarded / original message") / `------ Forwarded message ------`
   - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
   - `^>+ ` (RFC quoted lines)
2. **Signature blocks**: detect `--\s*$` separator, or trailing block with phone/title patterns
3. **Disclaimer footers**: regex for `本邮件包含保密信息` ("this email contains confidential information"), `CONFIDENTIAL`, etc.
4. **Library helper**: vendor `EmailReplyParser` (Python or Node port) as a baseline, then layer our patterns on top

**Result:** `body_text_clean` — only the new content the sender wrote in this message.

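
The first three strategies can be sketched as a single line-oriented pass. This is a rough illustration rather than the planned implementation: it skips the `EmailReplyParser` baseline, slightly loosens some of the patterns listed above (dash counts, full-width colon, optional space after `>`), and the returned strategy string is only an assumed way to feed `rule_audit.dequote_strategy`.

```python
import re

# Single-line markers that open a quoted history; everything after is dropped.
CUT_MARKERS = [
    re.compile(r"^On .* wrote:$"),
    re.compile(r"^.*写道[:：]$"),                      # loosened: ASCII or full-width colon
    re.compile(r"^-{2,} ?(转发|原始)邮件 ?-{2,}"),      # loosened dash count
    re.compile(r"^-{2,} ?Forwarded message ?-{2,}"),
]
OUTLOOK_BLOCK = re.compile(r"^From: .*\nSent: .*\nTo: .*", re.MULTILINE)
QUOTED_LINE = re.compile(r"^>+")
SIGNATURE_SEP = re.compile(r"^--\s*$")
DISCLAIMER = re.compile(r"本邮件包含保密信息|CONFIDENTIAL")


def dequote(body_text):
    """Return (body_text_clean, strategy), e.g. ("...", "marker+signature")."""
    applied = []
    block = OUTLOOK_BLOCK.search(body_text)
    if block:
        body_text = body_text[: block.start()]
        applied.append("marker")
    kept = []
    for line in body_text.splitlines():
        if any(p.match(line.strip()) for p in CUT_MARKERS):
            applied.append("marker")
            break
        if SIGNATURE_SEP.match(line):
            applied.append("signature")
            break
        if DISCLAIMER.search(line):
            applied.append("disclaimer")
            break
        if QUOTED_LINE.match(line):
            applied.append("marker")
            continue  # drop the quoted line, keep scanning for inline replies
        kept.append(line)
    strategy = "+".join(dict.fromkeys(applied))  # de-duplicate, keep order
    return "\n".join(kept).strip(), strategy
```

Cutting at the first signature or disclaimer hit is deliberately aggressive; a message that legitimately mentions `CONFIDENTIAL` mid-body would be truncated, which is the kind of case the 100-message precision eval is meant to catch.
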
### Stage 4 — Thread

Goal: group all messages of one conversation into a `thread_id`.

| Method | Strength | Weakness |
|--------|----------|----------|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; only as tiebreaker |

Persist `thread_id` per message; threads are first-class — `claw-project-tracker` clusters at thread level, not message level.

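
The header-first strategy with the subject + participant fallback can be sketched as follows. The input dict shape, the `thr-0001` id format, and the function names are assumptions made for illustration; the embedding tiebreaker is omitted, and messages are assumed to arrive in date order.

```python
import re

# Strips stacked reply/forward prefixes such as "Re: RE: 回复: ".
PREFIX = re.compile(r"^\s*((re|fwd?|回复|转发)\s*[:：]\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Remove reply/forward prefixes, collapse whitespace, lowercase."""
    return re.sub(r"\s+", " ", PREFIX.sub("", subject)).strip().lower()


def assign_threads(messages):
    """Return ({msg_id: thread_id}, {msg_id: method}) for date-ordered messages."""
    thread_of = {}   # msg_id -> thread_id
    threads = {}     # thread_id -> {"subject": str, "participants": set}
    method_of = {}
    for m in messages:
        subject = normalize_subject(m["subject"])
        people = set(m["participants"])
        parents = [m.get("in_reply_to"), *m.get("references", [])]
        # 1) Header links to a message we have already threaded.
        tid = next((thread_of[p] for p in parents if p in thread_of), None)
        method = "header"
        if tid is None:
            # 2) Same normalized subject and at least one shared participant.
            tid = next(
                (t for t, info in threads.items()
                 if info["subject"] == subject and info["participants"] & people),
                None,
            )
            method = "subject+participants"
        if tid is None:
            # 3) Nothing matched: open a new thread.
            tid = f"thr-{len(threads) + 1:04d}"
            threads[tid] = {"subject": subject, "participants": set()}
            method = "new"
        threads[tid]["participants"] |= people
        thread_of[m["msg_id"]] = tid
        method_of[m["msg_id"]] = method
    return thread_of, method_of
```

Recording the method per message keeps the fallback path auditable, in the same spirit as the `thread_method` field of `rule_audit`.
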
### Stage 5 — Entity Extraction

Per cleaned message, extract:

| Entity | Method |
|--------|--------|
| **People** | from / to / cc parsed addresses → normalize to `(name, email)` tuples; fuzzy-merge identities (`zhang san <zhang@a.com>` ≡ `张三 <zhang.san@a-corp.cn>`) using a maintained alias map under `state/people/aliases.json` |
| **Internal vs external** | `email_domain ∈ company_domains` → internal; else external (= candidate customer) |
| **Organization (customer)** | external email domain → lookup in `state/customers/domain_map.json`; new domain → create candidate `customers/UNCLASSIFIED-<domain>.json` for boss confirmation |
| **Dates** | `dateparser` lib (multi-language) for "下周三" (next Wednesday) / "by EOM" / "Mar 15" |
| **Amounts** | regex for `¥1,200` / `$50K` / `30 万` / `200万元` |
| **Project keywords** | (a) seed list from boss; (b) noun phrases via `spacy` zh+en models; cluster across thread |
| **Action verbs** | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |

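
Two of the cheaper rows in the table, amounts and the internal/external split, can be sketched with the standard library. Everything here is an assumption for illustration: the flat `email → canonical name` alias mapping (the real `aliases.json` format is not specified), the function names, and the currency guesses (`$` read as USD; `¥`, `万`, and `元` read as CNY).

```python
import re
from email.utils import parseaddr

AMOUNT = re.compile(
    r"(?P<cur>[¥$])?\s*(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>[Kk]|万)?\s*(?P<yuan>元)?"
)


def parse_amounts(text):
    """Find amounts like ¥1,200 / $50K / 30 万 / 200万元 and normalize the value."""
    found = []
    for m in AMOUNT.finditer(text):
        cur, unit, yuan = m["cur"], m["unit"], m["yuan"]
        if not (cur or unit == "万" or yuan):
            continue  # a bare number (year, phone digits) is not an amount
        value = float(m["num"].replace(",", ""))
        if unit == "万":
            value *= 10_000
        elif unit in ("K", "k"):
            value *= 1_000
        currency = "USD" if cur == "$" else "CNY"  # assumption, see lead-in
        found.append({"text": m.group(0).strip(), "value": value, "currency": currency})
    return found


def classify_person(header_value, company_domains, aliases):
    """Parse one address, apply the alias map, and tag internal vs external."""
    name, addr = parseaddr(header_value)
    addr = addr.lower()
    domain = addr.rpartition("@")[2]
    return {
        "name": aliases.get(addr, name or addr),  # canonical name if we know one
        "email": addr,
        "internal": domain in company_domains,
    }
```

For example, `classify_person("zhang san <zhang@a.com>", {"us.com"}, aliases)` and the same call for `张三 <zhang.san@a-corp.cn>` resolve to one canonical name as long as both addresses appear in the alias map; addresses missing from the map keep their display name and would need the merge queue.
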
### Stage 6 — Intent Classification

8 intents (exactly one primary per message, plus zero or more secondary):

| Intent | Examples |
|--------|----------|
| `催办` (urge) | "麻烦 ASAP" (please do this ASAP) / "deadline 已过" (the deadline has passed) |
| `决策` (decide) | "我同意 / 不同意 / 选 A" (I agree / disagree / pick A) |
| `转交` (delegate) | "请张三跟一下" (please have Zhang San follow up) / "+张三" |
| `询问` (ask) | "进展如何" (how is it going) / "有更新吗" (any updates?) |
| `抱怨` (complain) | "再不给答复就..." (if I don't get an answer soon, I will...) / "为什么这么慢" (why is this so slow) |
| `表扬` (praise) | "辛苦了 / 做得不错" (thanks for the hard work / nicely done) |
| `通知` (inform) | "FYI / 同步一下" (FYI / syncing you up) |
| `闲聊` (smalltalk) | greetings, pleasantries |

**Method:** few-shot LLM classification with a 30-example reference set (seeded from the boss's own emails). Cache by `body_text_clean` hash to avoid re-classifying duplicates.

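
The caching rule is independent of which model does the classification, so it can be sketched around any callable. In this illustration the class and function names are hypothetical, SHA-256 is an assumed choice of hash, the cache is in-memory only, and the 0.6 confidence floor mirrors the rule stated under Failure Handling.

```python
import hashlib

INTENTS = ("催办", "决策", "转交", "询问", "抱怨", "表扬", "通知", "闲聊")


class CachedIntentClassifier:
    """Wrap a classifier callable and memoize results by body_text_clean hash."""

    def __init__(self, classify):
        # `classify` takes the cleaned body and returns
        # {"primary": str, "secondary": [str], "confidence": float}.
        self._classify = classify
        self._cache = {}

    def __call__(self, body_text_clean):
        key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._classify(body_text_clean)  # the only model call
        return dict(self._cache[key])  # copy so callers cannot mutate the cache entry


def apply_confidence_floor(result, floor=0.6):
    """Below the floor, fall back to 通知 and flag the result as low confidence."""
    if result["confidence"] < floor:
        return {**result, "primary": "通知", "low_confidence": True}
    return result
```

A persistent cache (for example a JSON or SQLite file keyed by the same hash) would be needed for the saving to survive across runs; that storage choice is left open here.
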
### Stage 7 — Normalize → Canonical Email JSON

Final output stored as `state/extracted/YYYY-MM/<thread_id>/<msg_id>.json`:

```json
{
  "msg_id": "CAH+...@mail.gmail.com",
  "thread_id": "thr-2026-04-12-abc123",
  "internal_date": "2026-04-22T14:33:00+08:00",
  "from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
  "to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
  "cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
  "subject_normalized": "客户A 官网改版 进度跟进",
  "language": "zh-CN",
  "body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
  "entities": {
    "dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
    "amounts": [],
    "project_keywords": ["官网改版"],
    "internal_people": ["李四", "Boss"],
    "external_people": ["客户A 王总"],
    "customer_id_candidate": "CUST-clientco"
  },
  "intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
  "attachments": [],
  "extraction_version": "v0.1",
  "extracted_at": "2026-05-09T07:30:12Z",
  "rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```

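
Writing a record to that path layout can be sketched as below. Several details are assumptions rather than part of the spec: the month bucket is taken from `internal_date` in its own UTC offset, the `msg_id` is percent-encoded so that characters such as `/` cannot escape the directory, and the write goes through a temporary file so readers never see a half-written record.

```python
import json
from datetime import datetime
from pathlib import Path
from urllib.parse import quote


def extracted_path(root, email):
    """Build <root>/YYYY-MM/<thread_id>/<msg_id>.json for one canonical record."""
    month = datetime.fromisoformat(email["internal_date"]).strftime("%Y-%m")
    safe_id = quote(email["msg_id"], safe="@+.-_")  # '/' becomes %2F, etc.
    return Path(root) / month / email["thread_id"] / f"{safe_id}.json"


def write_email(root, email):
    """Serialize the record as UTF-8 JSON and move it into place in one step."""
    path = extracted_path(root, email)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(email, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename over the final name
    return path
```

`ensure_ascii=False` keeps Chinese text readable in the stored file instead of escaping it to `\uXXXX` sequences.
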
The canonical `Email` JSON is the contract. `claw-project-tracker`, `claw-people-observer`, `claw-customer-radar` consume only this — never raw MIME.

---

## Failure Handling

| Failure | Recovery |
|---------|---------|
| MIME parse fails | Log to `state/extracted/_failed/`, continue with next |
| Charset undetectable | Mark `body_text_clean = ""`, intent = `unknown`, surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create `UNCLASSIFIED-<domain>` candidate; boss confirms in week-1 |
| Person alias collision | Surface in `state/people/_to_merge.json` for boss |
| Intent confidence < 0.6 | Default to `通知`, mark `low_confidence: true` |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |

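
The "log and continue" and "exponential backoff" rows can be sketched as two small helpers. This is an illustration under assumptions: the quarantine file naming (`<folder>-<uid>.eml` plus a sidecar error file), the backoff parameters, and the function names are all hypothetical, and `extract` stands in for the real per-message pipeline.

```python
import json
from pathlib import Path


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=6):
    """Yield capped exponential delays in seconds: base, base*factor, ..."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def quarantine(root, envelope, raw_mime, error):
    """Keep the raw message and the error under <root>/_failed/ for later replay."""
    failed = Path(root) / "_failed"
    failed.mkdir(parents=True, exist_ok=True)
    stem = f"{envelope['folder']}-{envelope['uid']}".replace("/", "_")
    (failed / f"{stem}.eml").write_bytes(raw_mime)
    (failed / f"{stem}.error.json").write_text(
        json.dumps({"type": type(error).__name__, "message": str(error)}),
        encoding="utf-8",
    )
    return failed / f"{stem}.eml"


def process_batch(items, extract, root):
    """Run `extract` per message; one bad message never stops the batch."""
    ok = failed = 0
    for envelope, raw_mime in items:
        try:
            extract(envelope, raw_mime)
            ok += 1
        except Exception as exc:  # broad on purpose: quarantine, then move on
            quarantine(root, envelope, raw_mime, exc)
            failed += 1
    return ok, failed
```
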

---

## Performance Targets

| Metric | V0 target |
|--------|-----------|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |

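
Read together, the throughput and latency rows imply some concurrency inside a worker. As a back-of-the-envelope check using Little's law (average in-flight work = arrival rate × average latency), and assuming "single worker" means one process that may overlap LLM calls: 200 msgs/min at up to 2 s per message works out to roughly 7 messages in flight, unless cache hits pull the average latency well below 2 s. The helper name below is hypothetical.

```python
import math


def required_in_flight(msgs_per_min, avg_latency_s):
    """Little's law: average number of messages being processed at once."""
    return msgs_per_min / 60.0 * avg_latency_s


# Targets from the table: 200 msgs/min and the 2 s/msg latency ceiling.
slots = math.ceil(required_in_flight(200, 2.0))

# Latency a strictly sequential worker would need to hit 200 msgs/min.
sequential_budget_s = 60.0 / 200
```
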

---

## Reuse vs Build

| Component | Approach |
|-----------|----------|
| IMAP / Gmail / Exchange auth + fetch | **Reuse** — `imap-tools`, `google-api-python-client`, `exchangelib` |
| MIME parse | **Reuse** — stdlib `email` |
| HTML→text | **Reuse** — `readability-lxml` |
| Quote stripping | **Reuse + extend** — `EmailReplyParser` baseline + our regex packs |
| Language detection | **Reuse** — `fasttext-langdetect` |
| Date parsing | **Reuse** — `dateparser` |
| Entity extraction (NER) | **Reuse** — `spacy` zh + en models |
| Intent classification | **Build (LLM few-shot)** — small custom prompt, cache by body hash |
| Threading | **Build** — header-first, custom fallbacks |
| Alias merging | **Build** — boss-curated `aliases.json` |

**Estimate:** 5–7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.

---

## V0 Deliverable Checklist

- [ ] IMAP fetcher with incremental UID sync
- [ ] MIME → clean text pipeline (stages 2–3) at ≥ 92% dequote precision
- [ ] Threading at ≥ 95% accuracy on a 100-thread eval set
- [ ] Entity extraction (people / dates / amounts / project keywords)
- [ ] Intent classifier with 30-shot reference set
- [ ] Canonical `Email` JSON writer
- [ ] `state/people/aliases.json` and `state/customers/domain_map.json` seed format
- [ ] Failure quarantine bucket
- [ ] CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill

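
The backfill CLI in the last item could start from an argument parser like the one below. Only the command name and the `--since YYYY-MM-DD` flag come from the checklist; making the flag required, the function names, and the placeholder body of `main` are assumptions.

```python
import argparse
from datetime import date


def parse_args(argv=None):
    """Parse `atlas-extract --since YYYY-MM-DD`; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(
        prog="atlas-extract",
        description="Ad-hoc backfill of extracted Email records.",
    )
    parser.add_argument(
        "--since",
        type=date.fromisoformat,  # argparse reports a usage error on a bad date
        required=True,
        metavar="YYYY-MM-DD",
        help="extract messages received on or after this date",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real command would run stages 1-7 from this date onward.
    print(f"backfilling messages since {args.since.isoformat()}")
    return 0
```

Wired up as a console-script entry point, `atlas-extract --since 2025-05-09` would call `main()`, which matches the 12-month first-run window described under Stage 1.
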