This repo IS Atlas (总助 Claw, "chief-assistant Claw": the boss's-eye project
execution radar). The earlier two-profile framing (Atlas + a Vega placeholder)
was a misread — Vega is the agent persona that answers Multica issues, not the
product. Vega has no relationship to the assistant-claw product.
Changes:
- Move atlas/* to top-level (git mv preserves history)
- Remove empty Vega placeholders prompts/.gitkeep, tools/.gitkeep
- Delete atlas/ wrapper directory (now empty)
- Update path references in INTEGRATION-hermes.md, scripts/mirror-...sh,
docs/decisions/0001-mirror-nuwa-skill.md
- Rewrite README.md as Atlas-only, remove dual-profile language
After this commit:
- 8 top-level OpenClaw files (IDENTITY/SOUL/USER/AGENTS/TOOLS/MEMORY/
BOOTSTRAP/HEARTBEAT + CLAUDE symlink + zh-CN mirrors)
- skills/{6 sub-skills + DESCRIPTION + README}
- mcp-tools/{spec + Python implementation}
- state-schemas/{project, person, customer + README}
- autopilots/{5 atlas-*.yaml}
- client-deck/, docs/decisions/, scripts/
The ~/.hermes/skills/atlas/ destination convention is preserved (atlas as
a skill namespace on the operator's machine, distinct from the source path).
# MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; extracting useful structure out of email is a 2-week job, and the rest of Atlas falls apart without it.

This doc specifies the 7-stage extraction pipeline, the canonical Email object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).
## Why a dedicated tool
Raw email is a hostile data source:
- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
- 8+ languages mixed in one thread (中/英/日 + tech jargon)
- Senders use the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drift across replies (`Re: Re: 项目` → `客户A 改版进度跟进`)
Atlas's downstream skills (claw-project-tracker etc.) assume clean, normalized, deduplicated, intent-tagged Email objects. The extractor is the bridge from MIME chaos to that contract.
## 7-Stage Pipeline

```
[Stage 1: Fetch]      IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]     MIME parse, charset, HTML→text (readability)
        ↓
[Stage 3: Dequote]    strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]     group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]   extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]     classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize]  emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json
```
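The overall shape can be sketched as a fold over per-stage functions that each enrich a record dict. The names here are illustrative, not the real module API:

```python
from typing import Callable, Iterable

def run_pipeline(raw_messages: Iterable[bytes],
                 stages: list[Callable[[dict], dict]]) -> list[dict]:
    """Run each raw MIME message through the stage functions in order.

    Each stage receives the accumulated record and returns an enriched copy;
    the final record is the canonical Email object emitted by Stage 7.
    """
    results = []
    for raw in raw_messages:
        record = {"raw_mime": raw}
        for stage in stages:
            record = stage(record)
        results.append(record)
    return results
```

Keeping stages as plain `dict -> dict` functions makes each one testable in isolation and lets failed messages be quarantined per stage.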
### Stage 1 — Fetch

| Backend | Lib | Notes |
|---|---|---|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental sync |
| Exchange / O365 | `exchangelib` or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: `raw_mime` bytes + envelope (`account`, `folder`, `uid`, `internal_date`)
Configuration:
- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, and mailing-list folders.
- Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; back off on `OVERQUOTA` / 429
### Stage 2 — Decode

- Parse MIME with stdlib `email` (Python) or `mailparser` (Node)
- Detect charset; fallback chain: declared → `chardet` sniff → `utf-8` with `errors="replace"`
- HTML body → plain text via `readability-lxml` (preserves structure) or `html2text`
- Inline images: keep the `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata, do NOT treat as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)

Output adds: `body_text`, `body_html`, `language`, `attachments_meta`
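A decode sketch using only the stdlib `email` package (the charset fallback chain, readability conversion, and language detection are omitted; `decode_mime` is an illustrative name):

```python
import email
from email import policy

def decode_mime(raw: bytes) -> dict:
    """Parse raw MIME bytes into subject, text/html bodies, and attachment metadata."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    body_text, body_html = "", ""
    attachments_meta = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only; real content lives in leaf parts
        ctype = part.get_content_type()
        if part.get_content_disposition() == "attachment":
            attachments_meta.append(
                {"filename": part.get_filename(), "content_type": ctype})
        elif ctype == "text/plain" and not body_text:
            body_text = part.get_content()
        elif ctype == "text/html" and not body_html:
            body_html = part.get_content()
    return {
        "subject": str(msg["Subject"] or ""),
        "body_text": body_text,
        "body_html": body_html,
        "attachments_meta": attachments_meta,
    }
```

With `policy.default`, charset decoding of declared encodings is handled by the stdlib; the `chardet` sniff only needs to kick in when the declared charset is missing or wrong.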
### Stage 3 — Dequote (the unglamorous but critical step)
Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.
Strategies (combine, fall through):
- Marker patterns (regex):
  - `^On .* wrote:$` (English)
  - `^.* 写道:$` / `^.* 于 \d{4}年.*写道:$` (Chinese "... wrote:")
  - `^------ (转发|原始)邮件 ------` / `------ Forwarded message ------` (forwarded / original mail)
  - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
  - `^>+` (RFC quoted lines)
- Signature blocks: detect the `--\s*$` separator, or a trailing block with phone/title patterns
- Disclaimer footers: regex for `本邮件包含保密信息` ("this email contains confidential information"), `CONFIDENTIAL`, etc.
- Library helper: vendor `EmailReplyParser` (Python, or its Node port) as a baseline, then layer our patterns on top
Result: `body_text_clean` — only the new content the sender wrote in this message.
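A simplified fall-through sketch, using only a subset of the marker patterns above (the real pack, plus the vendored `EmailReplyParser` baseline, would layer on top):

```python
import re

# Subset of the marker pack; the Outlook multi-line "From:/Sent:/To:" block
# and the disclaimer footers are handled by additional patterns in practice.
QUOTE_MARKERS = [
    re.compile(r"^On .+ wrote:$"),    # English reply header
    re.compile(r"^.*写道[::]$"),      # Chinese reply header ("... wrote:")
    re.compile(r"^-{3,}\s*(Forwarded message|原始邮件|转发邮件)"),
    re.compile(r"^>"),                # RFC quoted line
]
SIGNATURE_SEP = re.compile(r"^--\s*$", re.MULTILINE)

def dequote(body_text: str) -> str:
    """Keep only the new content: cut at the first quote marker, then drop the signature."""
    kept = []
    for line in body_text.splitlines():
        if any(p.match(line) for p in QUOTE_MARKERS):
            break  # everything from the first marker on is quoted history
        kept.append(line)
    text = "\n".join(kept)
    m = SIGNATURE_SEP.search(text)
    if m:
        text = text[: m.start()]  # trailing signature block
    return text.strip()
```

Cutting at the *first* marker is deliberately aggressive: a false positive loses a little new content, but a false negative poisons clustering for the whole thread.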
### Stage 4 — Thread

Goal: group all messages of one conversation into a `thread_id`.

| Method | Strength | Weakness |
|---|---|---|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; use only as a tiebreaker |
Persist thread_id per message; threads are first-class — claw-project-tracker clusters at thread level, not message level.
### Stage 5 — Entity Extraction
Per cleaned message, extract:
| Entity | Method |
|---|---|
| People | from / to / cc parsed addresses → normalize to (name, email) tuples; fuzzy-merge identities (zhang san <zhang@a.com> ≡ 张三 <zhang.san@a-corp.cn>) using a maintained alias map under state/people/aliases.json |
| Internal vs external | email_domain ∈ company_domains → internal; else external (= candidate customer) |
| Organization (customer) | external email domain → lookup in state/customers/domain_map.json; new domain → create candidate customers/UNCLASSIFIED-<domain>.json for boss confirmation |
| Dates | dateparser lib (multi-language) for "下周三" / "by EOM" / "Mar 15" |
| Amounts | regex for ¥1,200 / $50K / 30 万 / 200万元 |
| Project keywords | (a) seed list from boss; (b) noun phrases via spacy zh+en models; cluster across thread |
| Action verbs | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |
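The amount row, for example, might be a single regex covering the doc's four sample formats (a sketch; a production pattern would also handle ranges, units, and full-width digits):

```python
import re

# Covers the doc's examples: ¥1,200 / $50K / 30 万 / 200万元
AMOUNT = re.compile(
    r"[¥$€]\s?\d{1,3}(?:,\d{3})*(?:\.\d+)?\s?[KkMm]?"  # currency-symbol forms
    r"|\d+(?:\.\d+)?\s*万元?"                            # Chinese 万 / 万元 forms
)

def extract_amounts(text: str) -> list[str]:
    """Return all amount-like spans, left to right."""
    return AMOUNT.findall(text)
```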
### Stage 6 — Intent Classification

8 intents (one mutually exclusive primary + any number of secondary):

| Intent | Examples |
|---|---|
| 催办 (urge) | "麻烦 ASAP" / "deadline 已过" (deadline has passed) |
| 决策 (decide) | "我同意 / 不同意 / 选 A" (I agree / disagree / pick A) |
| 转交 (delegate) | "请张三跟一下" (Zhang San, please follow up) / "+张三" |
| 询问 (ask) | "进展如何" / "有更新吗" (how is it going / any updates) |
| 抱怨 (complain) | "再不给答复就..." (if we don't hear back soon...) / "为什么这么慢" (why so slow) |
| 表扬 (praise) | "辛苦了 / 做得不错" (thanks for the hard work / nice job) |
| 通知 (inform) | "FYI / 同步一下" (FYI / just syncing) |
| 闲聊 (smalltalk) | greetings, pleasantries |
Method: few-shot LLM classification with 30-example reference set (seeded from the boss's own emails). Cache by body_text_clean hash to avoid re-classifying duplicates.
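The hash-cache half might look like this; `llm_classify` is an injected stand-in for the actual few-shot LLM call, and the cache layout is an assumption:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def classify_intent_cached(body_text_clean: str, cache_dir: Path,
                           llm_classify: Callable[[str], dict]) -> dict:
    """Cache few-shot classification results keyed by hash of the cleaned body,
    so duplicate content (cross-folder, cross-account) is classified once."""
    key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = llm_classify(body_text_clean)  # the expensive few-shot call
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Hashing `body_text_clean` (not the raw body) is what makes the cache effective: Stage 3 has already stripped the quoted history that would otherwise make every copy unique.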
### Stage 7 — Normalize → Canonical Email JSON

Final output stored as `state/extracted/YYYY-MM/<thread_id>/<msg_id>.json`:

```json
{
"msg_id": "CAH+...@mail.gmail.com",
"thread_id": "thr-2026-04-12-abc123",
"internal_date": "2026-04-22T14:33:00+08:00",
"from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
"to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
"cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
"subject_normalized": "客户A 官网改版 进度跟进",
"language": "zh-CN",
"body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
"entities": {
"dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
"amounts": [],
"project_keywords": ["官网改版"],
"internal_people": ["李四", "Boss"],
"external_people": ["客户A 王总"],
"customer_id_candidate": "CUST-clientco"
},
"intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
"attachments": [],
"extraction_version": "v0.1",
"extracted_at": "2026-05-09T07:30:12Z",
"rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```
This is the contract. claw-project-tracker, claw-people-observer, claw-customer-radar consume only this — never raw MIME.
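Consumers can cheaply enforce the contract at read time. A sketch; the required-key set mirrors the example above, trimmed to fields every message plausibly must carry (an assumption, since the doc does not mark optionality):

```python
REQUIRED_KEYS = {
    "msg_id", "thread_id", "internal_date", "from", "to",
    "subject_normalized", "language", "body_text_clean",
    "entities", "intent", "extraction_version",
}

def validate_email_obj(obj: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the object is valid."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - obj.keys())]
    if "intent" in obj and obj["intent"].get("primary") is None:
        problems.append("intent.primary is required")
    return problems
```

Bumping `extraction_version` whenever the schema changes lets consumers reject (or re-request) objects written by an older extractor.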
## Failure Handling
| Failure | Recovery |
|---|---|
| MIME parse fails | Log to state/extracted/_failed/, continue with next |
| Charset undetectable | Mark body_text_clean = "", intent = unknown, surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create UNCLASSIFIED-<domain> candidate; boss confirms in week-1 |
| Person alias collision | Surface in state/people/_to_merge.json for boss |
| Intent confidence < 0.6 | Default to 通知, mark low_confidence: true |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |
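The rate-limit row can be implemented as full-jitter exponential backoff, a common pattern; the parameter values here are illustrative, not mandated by this doc:

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 60.0, retries: int = 5):
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2^attempt)].

    Jitter spreads retries out so many workers hitting OVERQUOTA/429
    at once do not re-stampede the server in lockstep.
    """
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

After the final delay is exhausted, the worker gives up and leaves the folder cursor untouched so the next heartbeat resumes from the same `last_uid`.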
## Performance Targets
| Metric | V0 target |
|---|---|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |
## Reuse vs Build
| Component | Approach |
|---|---|
| IMAP / Gmail / Exchange auth + fetch | Reuse — imap-tools, google-api-python-client, exchangelib |
| MIME parse | Reuse — stdlib email |
| HTML→text | Reuse — readability-lxml |
| Quote stripping | Reuse + extend — EmailReplyParser baseline + our regex packs |
| Language detection | Reuse — fasttext-langdetect |
| Date parsing | Reuse — dateparser |
| Entity extraction (NER) | Reuse — spacy zh + en models |
| Intent classification | Build (LLM few-shot) — small custom prompt, cache by body hash |
| Threading | Build — header-first, custom fallbacks |
| Alias merging | Build — boss-curated aliases.json |
Estimate: 5–7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.
## V0 Deliverable Checklist
- IMAP fetcher with incremental UID sync
- MIME → clean text pipeline (stages 2–3) at ≥ 92% dequote precision
- Threading at ≥ 95% accuracy on a 100-thread eval set
- Entity extraction (people / dates / amounts / project keywords)
- Intent classifier with 30-shot reference set
- Canonical `Email` JSON writer
- `state/people/aliases.json` and `state/customers/domain_map.json` seed formats
- Failure quarantine bucket
- CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill