Atlas refactor (commit bd0be97630): drop Vega framing, promote Atlas to repo root
This repo IS Atlas (Chief-Assistant Claw: the boss-view project-execution
radar). The earlier two-profile framing (Atlas + Vega placeholder) was a
misread: Vega is the agent persona that answers Multica issues, not the
product, and has no relationship to assistant-claw the product.

Changes:
- Move atlas/* to top-level (git mv preserves history)
- Remove empty Vega placeholders prompts/.gitkeep, tools/.gitkeep
- Delete atlas/ wrapper directory (now empty)
- Update path references in INTEGRATION-hermes.md, scripts/mirror-...sh,
  docs/decisions/0001-mirror-nuwa-skill.md
- Rewrite README.md as Atlas-only, remove dual-profile language

After this commit:
- Top-level: the 8 OpenClaw files (IDENTITY/SOUL/USER/AGENTS/TOOLS/MEMORY/
  BOOTSTRAP/HEARTBEAT) + CLAUDE symlink + zh-CN mirrors
- skills/{6 sub-skills + DESCRIPTION + README}
- mcp-tools/{spec + Python implementation}
- state-schemas/{project, person, customer + README}
- autopilots/{5 atlas-*.yaml}
- client-deck/, docs/decisions/, scripts/

The ~/.hermes/skills/atlas/ destination convention is preserved (atlas remains
a skill namespace on the operator's machine, distinct from the source path).
2026-05-09 17:54:18 +08:00

atlas-extractor — V0 Implementation

Python implementation of Stages 1-3 of the email-extractor pipeline spec (../email-extractor.md). Stages 4-7 (threading, entity extraction, intent classification, canonical-JSON normalization) will live in sibling modules, to be added.

Why only 1-3 in V0?

Stages 1-3 are the unsexy but critical foundation. Threading and intent classification are easier (well-understood techniques + LLM); fetch / decode / dequote is where most "AI email tools" silently break and produce garbage downstream. We invest here first.
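To make the stakes concrete, here is a minimal marker-based dequote sketch. The real Stage 3 uses a much larger marker pack; the patterns and function name here are illustrative only.

```python
import re

# Illustrative markers; the production pack covers many more reply/signature forms.
MARKERS = [
    re.compile(r"^On\s.+?wrote:\s*$", re.MULTILINE),  # English reply header
    re.compile(r"^--\s*$", re.MULTILINE),             # RFC 3676 signature separator
]

def dequote(body: str) -> str:
    """Cut the body at the earliest marker; keep only the new content above it."""
    cut = len(body)
    for marker in MARKERS:
        m = marker.search(body)
        if m:
            cut = min(cut, m.start())
    return body[:cut].rstrip()
```

The hard part is not this loop but the marker pack itself: localized reply headers, HTML-mangled separators, and corporate disclaimers, which is exactly where precision is won or lost.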

Install

cd mcp-tools/email-extractor
pip install -e '.[test]'   # quotes keep zsh from globbing the extras bracket

Usage

Single .eml file (smoke test)

atlas-extract eml \
  --input tests/fixtures/sample_thread.eml \
  --state-dir /tmp/atlas-state

Directory of .txt or .eml files (e.g., the boss-distiller demo INPUT/)

atlas-extract dir \
  --input-dir ../../skills/claw-boss-distiller/demo/INPUT \
  --state-dir /tmp/atlas-state

Real IMAP account

ATLAS_IMAP_PASSWORD='app-specific-password' atlas-extract imap \
  --host imap.gmail.com \
  --user wang@us-saas.cn \
  --folder INBOX --folder Sent \
  --state-dir ./state \
  --since-days 365

The --since-days flag bounds the cold start. Subsequent runs are incremental via a persisted last_uid per (account, folder).
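A minimal shape for that checkpoint might look like this; the checkpoints/ file layout and function names are assumptions, not the actual state format:

```python
import json
from pathlib import Path

def load_last_uid(state_dir: Path, account: str, folder: str) -> int:
    """Return the highest UID already fetched, or 0 on a cold start."""
    path = state_dir / "checkpoints" / f"{account}__{folder}.json"
    if not path.exists():
        return 0
    return json.loads(path.read_text())["last_uid"]

def save_last_uid(state_dir: Path, account: str, folder: str, uid: int) -> None:
    """Persist the checkpoint after each successful batch."""
    path = state_dir / "checkpoints" / f"{account}__{folder}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"last_uid": uid}))
```

The next IMAP run then fetches only UIDs above the stored value, so a crash mid-run costs at most one batch of re-fetching.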

Output

Each message produces a JSON file under state-dir/extracted/YYYY-MM/<msg_id>.json:

{
  "msg_id": "demo-001@us-saas.cn",
  "account": "test",
  "folder": "local",
  "uid": "1",
  "internal_date": "2026-04-22T01:14:03+00:00",
  "subject": "PRJ-001 客户A 改版 — 第三次问",
  "from": {"name": "王", "email": "wang@us-saas.cn"},
  "to": [{"name": "张三", "email": "zhangsan@us-saas.cn"}],
  "cc": [{"name": "李四", "email": "lisi@us-saas.cn"}],
  "in_reply_to": null,
  "references": [],
  "body_text_clean": "张三,\n\nPRJ-001 我上次问已经过去 6 天了 ...",
  "body_text_full_chars": 312,
  "body_text_clean_chars": 134,
  "dequote": {
    "strategies_used": [
      "marker:^On\\s.+?wrote:\\s*$",
      "signature_sep_dashdash",
      "disclaimer:本邮件(及其附件)?(包含|含有)?(保密"
    ],
    "chars_stripped": 178
  },
  "_extraction": { "stages_complete": [1, 2, 3], "extractor_version": "0.1.0", ... }
}

body_text_clean is the contract handed to Stages 4-7 (threading, entities, intent, canonical normalization).
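A downstream stage could consume that contract by walking the month-partitioned output directory. A sketch, with field names matching the example above but a made-up function name:

```python
import json
from pathlib import Path

def iter_clean_bodies(state_dir: str):
    """Yield (msg_id, body_text_clean) for every extracted message."""
    for path in sorted(Path(state_dir, "extracted").glob("*/*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        yield doc["msg_id"], doc["body_text_clean"]
```

Because each message is a standalone JSON file keyed by msg_id, Stages 4-7 can reprocess a single month (or a single message) without touching the fetch/decode layers.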

Run tests

pytest -q

The test suite verifies the 3 critical guarantees:

  1. Decode pulls headers + body correctly from the canonical fixture
  2. Dequote strips the English On X wrote: marker, signature block, and disclaimer
  3. Real content is preserved (no over-aggressive stripping)

Performance target (per spec)

  • ≥ 200 msgs/min on a single worker (V0 acceptable)
  • Dequote precision ≥ 92% on a manually-labeled 100-message eval set (TBD)

Failure modes (how each error gets handled)

Failure                                  Behavior
MIME parse exception                     Log to state/extracted/_failed/<account>__<uid>.error, continue
Charset undetectable                     Fall through gb18030 → latin-1 → utf-8 with errors="replace"
HTML-only message with broken HTML       readability falls back to raw html2text output
Quote-stripping leaves < 8 chars         Marked low_signal_clean in the run summary; not skipped
IMAP rate limit / quota                  Exponential backoff (built into imap-tools); checkpoint via last_uid

What's NOT in this V0

  • Stage 4 (threading): header-first + subject-fuzzy fallback; comes next
  • Stage 5 (entity extraction): spacy + regex packs; comes next
  • Stage 6 (intent classification): LLM few-shot with 30-sample reference; comes next
  • Stage 7 (canonical normalization): structured Email JSON contract; comes next
  • Gmail API / Exchange: only IMAP in V0; same interface, different fetcher
  • MCP server wrapping: V0 is pure CLI; MCP is a V0.5 refactor

Roadmap to V0.5

  1. Add thread.py — RFC headers first, fall back to subject+participant similarity
  2. Add entities.py — spacy zh+en NER + regex (dates, amounts, project keywords)
  3. Add intent.py — LLM few-shot classifier with body-hash cache
  4. Add normalize.py — canonical Email JSON output (matching the spec in ../email-extractor.md)
  5. Add 100-message eval set + accuracy harness
  6. Wrap as MCP server with the same CLI underneath
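As a taste of step 1, the header-first rule plus the subject normalization feeding the fuzzy fallback might start like this (a sketch, not the planned thread.py API):

```python
import re

def thread_key(msg: dict) -> str:
    """Header-first: root of References wins, then In-Reply-To, then own msg_id."""
    if msg["references"]:
        return msg["references"][0]   # oldest ancestor in the chain
    if msg["in_reply_to"]:
        return msg["in_reply_to"]
    return msg["msg_id"]              # thread root; subject-fuzzy fallback merges later

def normalize_subject(subject: str) -> str:
    """Strip Re:/Fwd:/回复:/转发: prefixes before subject-similarity matching."""
    return re.sub(r"^(?:(re|fwd?|回复|转发)\s*[::]\s*)+", "", subject.strip(),
                  flags=re.I)
```

Messages that share a thread_key are grouped immediately; only keyless singletons with matching normalized subjects and overlapping participants need the fuzzy pass.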