This repo IS Atlas (总助 Claw, "chief-assistant Claw": the boss's-eye project
execution radar). The earlier two-profile framing (Atlas + a Vega placeholder)
was a misread — Vega is the agent persona that answers Multica issues, not the
product. Vega has no relationship to the assistant-claw product.
Changes:
- Move atlas/* to top-level (git mv preserves history)
- Remove empty Vega placeholders prompts/.gitkeep, tools/.gitkeep
- Delete atlas/ wrapper directory (now empty)
- Update path references in INTEGRATION-hermes.md, scripts/mirror-...sh,
docs/decisions/0001-mirror-nuwa-skill.md
- Rewrite README.md as Atlas-only, remove dual-profile language
After this commit:
- 8 top-level OpenClaw files (IDENTITY/SOUL/USER/AGENTS/TOOLS/MEMORY/
BOOTSTRAP/HEARTBEAT + CLAUDE symlink + zh-CN mirrors)
- skills/{6 sub-skills + DESCRIPTION + README}
- mcp-tools/{spec + Python implementation}
- state-schemas/{project, person, customer + README}
- autopilots/{5 atlas-*.yaml}
- client-deck/, docs/decisions/, scripts/
The ~/.hermes/skills/atlas/ destination convention is preserved (atlas as
a skill namespace on the operator's machine, distinct from the source path).
# MCP Tool: email-extractor

The most underestimated component of Atlas. "Connecting to email" is a 2-day job; extracting useful structure out of email is a 2-week job, and the rest of Atlas falls apart without it.

This doc specifies the 7-stage extraction pipeline, the canonical Email object schema, and the open-source libraries we lean on (so we don't reinvent IMAP / MIME / language detection).
## Why a dedicated tool
Raw email is a hostile data source:
- HTML wrapped in CSS, inline images, base64 attachments, multiple MIME parts
- Quoted reply chains stacking 10+ deep, each with a different signature block
- Auto-forwards, mailing lists, calendar invites, OOO replies polluting the signal
- 8+ languages mixed in one thread (中/英/日 + tech jargon)
- Senders use the same name with different addresses (`zhang@a.com` vs `zhang.san@a-corp.cn`)
- Subject lines drift across replies (`Re: Re: 项目` → `客户A 改版进度跟进`)
Atlas's downstream skills (claw-project-tracker etc.) assume clean, normalized, deduplicated, intent-tagged Email objects. The extractor is the bridge from MIME chaos to that contract.
## 7-Stage Pipeline

```
[Stage 1: Fetch]      IMAP / Gmail API / Exchange → raw MIME bytes
        ↓
[Stage 2: Decode]     MIME parse, charset, HTML→text (readability)
        ↓
[Stage 3: Dequote]    strip quoted replies + signatures + disclaimers
        ↓
[Stage 4: Thread]     group by Message-ID / In-Reply-To / References / subject-fuzzy
        ↓
[Stage 5: Entities]   extract people, orgs, dates, amounts, project keywords
        ↓
[Stage 6: Intent]     classify into 8 categories (催办 / 决策 / 转交 / ...)
        ↓
[Stage 7: Normalize]  emit canonical Email JSON → state/extracted/YYYY-MM/<thread_id>/<msg_id>.json
```
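The overall shape can be sketched as a fold over per-stage functions that each enrich a record dict. The names here are illustrative, not the real module API:

```python
from typing import Callable, Iterable

def run_pipeline(raw_messages: Iterable[bytes],
                 stages: list[Callable[[dict], dict]]) -> list[dict]:
    """Run each raw MIME message through the stage functions in order.

    Each stage receives the accumulated record and returns an enriched copy;
    the final record is the canonical Email object emitted by Stage 7.
    """
    results = []
    for raw in raw_messages:
        record = {"raw_mime": raw}
        for stage in stages:
            record = stage(record)
        results.append(record)
    return results
```

Keeping stages as plain `dict -> dict` functions makes each one testable in isolation and lets failed messages be quarantined per stage.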
### Stage 1 — Fetch

| Backend | Lib | Notes |
|---|---|---|
| IMAP | `imap-tools` (Python) or `node-imap` | Use UID-based incremental sync; persist `last_uid` per folder |
| Gmail API | `google-api-python-client` | OAuth2; use `historyId` for incremental sync |
| Exchange / O365 | `exchangelib` or MS Graph SDK | Modern auth (OAuth2); avoid legacy EWS basic auth |

Output: `raw_mime` bytes + envelope (`account`, `folder`, `uid`, `internal_date`)
Configuration:
- Folders to scan: `INBOX`, `Sent`, optionally `Drafts`. Exclude `Spam`, `Trash`, and mailing-list folders.
- Date range: configurable (default V0 first run = past 12 months; subsequent runs = since last sync)
- Rate limit: respect server limits; back off on `OVERQUOTA` / 429
### Stage 2 — Decode

- Parse MIME with stdlib `email` (Python) or `mailparser` (Node)
- Detect charset; fallback chain: declared → `chardet` sniff → `utf-8` with `errors="replace"`
- HTML body → plain text via `readability-lxml` (preserves structure) or `html2text`
- Inline images: keep the `cid:` reference for later attachment OCR (V1)
- Calendar invites (`text/calendar`): extract event metadata, do NOT treat as conversation
- Detect language per body part with `fasttext-langdetect` (multilingual support)

Output adds: `body_text`, `body_html`, `language`, `attachments_meta`
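A decode sketch using only the stdlib `email` package (the charset fallback chain, readability conversion, and language detection are omitted; `decode_mime` is an illustrative name):

```python
import email
from email import policy

def decode_mime(raw: bytes) -> dict:
    """Parse raw MIME bytes into subject, text/html bodies, and attachment metadata."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    body_text, body_html = "", ""
    attachments_meta = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only; real content lives in leaf parts
        ctype = part.get_content_type()
        if part.get_content_disposition() == "attachment":
            attachments_meta.append(
                {"filename": part.get_filename(), "content_type": ctype})
        elif ctype == "text/plain" and not body_text:
            body_text = part.get_content()
        elif ctype == "text/html" and not body_html:
            body_html = part.get_content()
    return {
        "subject": str(msg["Subject"] or ""),
        "body_text": body_text,
        "body_html": body_html,
        "attachments_meta": attachments_meta,
    }
```

With `policy.default`, charset decoding of declared encodings is handled by the stdlib; the `chardet` sniff only needs to kick in when the declared charset is missing or wrong.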
### Stage 3 — Dequote (the unglamorous but critical step)
Most emails contain a quoted history of the entire prior thread. If we don't strip it, every email looks like every other email and clustering is destroyed.
Strategies (combine, fall through):
- Marker patterns (regex):
  - `^On .* wrote:$` (English)
  - `^.* 写道:$` / `^.* 于 \d{4}年.*写道:$` (Chinese "... wrote:")
  - `^------ (转发|原始)邮件 ------` / `------ Forwarded message ------` (forwarded / original mail)
  - `^From: .*\nSent: .*\nTo: .*` (Outlook block headers)
  - `^>+` (RFC quoted lines)
- Signature blocks: detect the `--\s*$` separator, or a trailing block with phone/title patterns
- Disclaimer footers: regex for `本邮件包含保密信息` ("this email contains confidential information"), `CONFIDENTIAL`, etc.
- Library helper: vendor `EmailReplyParser` (Python, or its Node port) as a baseline, then layer our patterns on top
Result: `body_text_clean` — only the new content the sender wrote in this message.
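A simplified fall-through sketch, using only a subset of the marker patterns above (the real pack, plus the vendored `EmailReplyParser` baseline, would layer on top):

```python
import re

# Subset of the marker pack; the Outlook multi-line "From:/Sent:/To:" block
# and the disclaimer footers are handled by additional patterns in practice.
QUOTE_MARKERS = [
    re.compile(r"^On .+ wrote:$"),    # English reply header
    re.compile(r"^.*写道[::]$"),      # Chinese reply header ("... wrote:")
    re.compile(r"^-{3,}\s*(Forwarded message|原始邮件|转发邮件)"),
    re.compile(r"^>"),                # RFC quoted line
]
SIGNATURE_SEP = re.compile(r"^--\s*$", re.MULTILINE)

def dequote(body_text: str) -> str:
    """Keep only the new content: cut at the first quote marker, then drop the signature."""
    kept = []
    for line in body_text.splitlines():
        if any(p.match(line) for p in QUOTE_MARKERS):
            break  # everything from the first marker on is quoted history
        kept.append(line)
    text = "\n".join(kept)
    m = SIGNATURE_SEP.search(text)
    if m:
        text = text[: m.start()]  # trailing signature block
    return text.strip()
```

Cutting at the *first* marker is deliberately aggressive: a false positive loses a little new content, but a false negative poisons clustering for the whole thread.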
### Stage 4 — Thread

Goal: group all messages of one conversation into a `thread_id`.

| Method | Strength | Weakness |
|---|---|---|
| `Message-ID` + `In-Reply-To` + `References` headers | Most reliable | Outlook sometimes drops these |
| Normalized subject (strip `Re:` / `Fwd:` / `回复:` / `转发:` prefixes) + participant overlap | Catches Outlook gaps | Subject drift breaks it |
| Embedding similarity over `body_text_clean[:500]` | Catches subject drift | Expensive; use only as a tiebreaker |
Persist thread_id per message; threads are first-class — claw-project-tracker clusters at thread level, not message level.
### Stage 5 — Entity Extraction
Per cleaned message, extract:
| Entity | Method |
|---|---|
| People | from / to / cc parsed addresses → normalize to (name, email) tuples; fuzzy-merge identities (zhang san <zhang@a.com> ≡ 张三 <zhang.san@a-corp.cn>) using a maintained alias map under state/people/aliases.json |
| Internal vs external | email_domain ∈ company_domains → internal; else external (= candidate customer) |
| Organization (customer) | external email domain → lookup in state/customers/domain_map.json; new domain → create candidate customers/UNCLASSIFIED-<domain>.json for boss confirmation |
| Dates | dateparser lib (multi-language) for "下周三" / "by EOM" / "Mar 15" |
| Amounts | regex for ¥1,200 / $50K / 30 万 / 200万元 |
| Project keywords | (a) seed list from boss; (b) noun phrases via spacy zh+en models; cluster across thread |
| Action verbs | small classifier or regex set: 催 / urge / 等 / waiting / 决定 / decide / 转 / forward / 否决 / reject |
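The amount row, for example, might be a single regex covering the doc's four sample formats (a sketch; a production pattern would also handle ranges, units, and full-width digits):

```python
import re

# Covers the doc's examples: ¥1,200 / $50K / 30 万 / 200万元
AMOUNT = re.compile(
    r"[¥$€]\s?\d{1,3}(?:,\d{3})*(?:\.\d+)?\s?[KkMm]?"  # currency-symbol forms
    r"|\d+(?:\.\d+)?\s*万元?"                            # Chinese 万 / 万元 forms
)

def extract_amounts(text: str) -> list[str]:
    """Return all amount-like spans, left to right."""
    return AMOUNT.findall(text)
```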
### Stage 6 — Intent Classification

8 intents (one mutually exclusive primary + any number of secondary):

| Intent | Examples |
|---|---|
| 催办 (urge) | "麻烦 ASAP" / "deadline 已过" (deadline has passed) |
| 决策 (decide) | "我同意 / 不同意 / 选 A" (I agree / disagree / pick A) |
| 转交 (delegate) | "请张三跟一下" (Zhang San, please follow up) / "+张三" |
| 询问 (ask) | "进展如何" / "有更新吗" (how is it going / any updates) |
| 抱怨 (complain) | "再不给答复就..." (if we don't hear back soon...) / "为什么这么慢" (why so slow) |
| 表扬 (praise) | "辛苦了 / 做得不错" (thanks for the hard work / nice job) |
| 通知 (inform) | "FYI / 同步一下" (FYI / just syncing) |
| 闲聊 (smalltalk) | greetings, pleasantries |
Method: few-shot LLM classification with 30-example reference set (seeded from the boss's own emails). Cache by body_text_clean hash to avoid re-classifying duplicates.
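The hash-cache half might look like this; `llm_classify` is an injected stand-in for the actual few-shot LLM call, and the cache layout is an assumption:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

def classify_intent_cached(body_text_clean: str, cache_dir: Path,
                           llm_classify: Callable[[str], dict]) -> dict:
    """Cache few-shot classification results keyed by hash of the cleaned body,
    so duplicate content (cross-folder, cross-account) is classified once."""
    key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = llm_classify(body_text_clean)  # the expensive few-shot call
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Hashing `body_text_clean` (not the raw body) is what makes the cache effective: Stage 3 has already stripped the quoted history that would otherwise make every copy unique.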
### Stage 7 — Normalize → Canonical Email JSON

Final output stored as `state/extracted/YYYY-MM/<thread_id>/<msg_id>.json`:

```json
{
"msg_id": "CAH+...@mail.gmail.com",
"thread_id": "thr-2026-04-12-abc123",
"internal_date": "2026-04-22T14:33:00+08:00",
"from": {"name": "客户A 王总", "email": "wang@clientco.com", "internal": false},
"to": [{"name": "李四", "email": "lisi@us.com", "internal": true}],
"cc": [{"name": "Boss", "email": "boss@us.com", "internal": true}],
"subject_normalized": "客户A 官网改版 进度跟进",
"language": "zh-CN",
"body_text_clean": "再这样下去我就找别家了。这个礼拜必须给个准信。",
"entities": {
"dates": [{"text": "这个礼拜", "iso": "2026-04-26", "confidence": 0.85}],
"amounts": [],
"project_keywords": ["官网改版"],
"internal_people": ["李四", "Boss"],
"external_people": ["客户A 王总"],
"customer_id_candidate": "CUST-clientco"
},
"intent": {"primary": "抱怨", "secondary": ["催办"], "confidence": 0.91},
"attachments": [],
"extraction_version": "v0.1",
"extracted_at": "2026-05-09T07:30:12Z",
"rule_audit": {"dequote_strategy": "marker+signature", "thread_method": "header"}
}
```
This is the contract. claw-project-tracker, claw-people-observer, claw-customer-radar consume only this — never raw MIME.
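Consumers can cheaply enforce the contract at read time. A sketch; the required-key set mirrors the example above, trimmed to fields every message plausibly must carry (an assumption, since the doc does not mark optionality):

```python
REQUIRED_KEYS = {
    "msg_id", "thread_id", "internal_date", "from", "to",
    "subject_normalized", "language", "body_text_clean",
    "entities", "intent", "extraction_version",
}

def validate_email_obj(obj: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the object is valid."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - obj.keys())]
    if "intent" in obj and obj["intent"].get("primary") is None:
        problems.append("intent.primary is required")
    return problems
```

Bumping `extraction_version` whenever the schema changes lets consumers reject (or re-request) objects written by an older extractor.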
## Failure Handling
| Failure | Recovery |
|---|---|
| MIME parse fails | Log to state/extracted/_failed/, continue with next |
| Charset undetectable | Mark body_text_clean = "", intent = unknown, surface in unclustered queue |
| Thread headers missing | Fall through to subject+participant strategy |
| Customer domain unknown | Create UNCLASSIFIED-<domain> candidate; boss confirms in week-1 |
| Person alias collision | Surface in state/people/_to_merge.json for boss |
| Intent confidence < 0.6 | Default to 通知, mark low_confidence: true |
| Rate-limit hit | Exponential backoff; resume on next heartbeat |
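The rate-limit row can be implemented as full-jitter exponential backoff, a common pattern; the parameter values here are illustrative, not mandated by this doc:

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 60.0, retries: int = 5):
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2^attempt)].

    Jitter spreads retries out so many workers hitting OVERQUOTA/429
    at once do not re-stampede the server in lockstep.
    """
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

After the final delay is exhausted, the worker gives up and leaves the folder cursor untouched so the next heartbeat resumes from the same `last_uid`.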
## Performance Targets
| Metric | V0 target |
|---|---|
| Extraction throughput | ≥ 200 msgs/min on a single worker |
| Stage 3 dequote precision | ≥ 92% (manual eval over 100-message sample) |
| Stage 4 thread accuracy | ≥ 95% (vs human-labeled) |
| Stage 5 entity recall (people) | ≥ 98% |
| Stage 6 intent accuracy | ≥ 80% top-1, ≥ 95% top-3 |
| End-to-end latency | < 2 sec/msg avg incl. LLM calls |
## Reuse vs Build
| Component | Approach |
|---|---|
| IMAP / Gmail / Exchange auth + fetch | Reuse — imap-tools, google-api-python-client, exchangelib |
| MIME parse | Reuse — stdlib email |
| HTML→text | Reuse — readability-lxml |
| Quote stripping | Reuse + extend — EmailReplyParser baseline + our regex packs |
| Language detection | Reuse — fasttext-langdetect |
| Date parsing | Reuse — dateparser |
| Entity extraction (NER) | Reuse — spacy zh + en models |
| Intent classification | Build (LLM few-shot) — small custom prompt, cache by body hash |
| Threading | Build — header-first, custom fallbacks |
| Alias merging | Build — boss-curated aliases.json |
Estimate: 5–7 dev days to V0 for one mail backend (IMAP); +2 days each for Gmail / Exchange.
## V0 Deliverable Checklist
- IMAP fetcher with incremental UID sync
- MIME → clean text pipeline (stages 2–3) at ≥ 92% dequote precision
- Threading at ≥ 95% accuracy on a 100-thread eval set
- Entity extraction (people / dates / amounts / project keywords)
- Intent classifier with 30-shot reference set
- Canonical `Email` JSON writer
- `state/people/aliases.json` and `state/customers/domain_map.json` seed formats
- Failure quarantine bucket
- CLI: `atlas-extract --since YYYY-MM-DD` for ad-hoc backfill