This adds the full Atlas (Executive Assistant Claw / boss's-view project execution radar) scaffolding as a sibling profile to the existing Vega general-purpose assistant. All Atlas content lives under atlas/ to keep the existing top-level skeleton intact.

What's included:
- atlas/IDENTITY.md, SOUL.md, USER.md, AGENTS.md, MEMORY.md, BOOTSTRAP.md, HEARTBEAT.md, TOOLS.md (+ zh-CN mirrors): full OpenClaw 8-piece set matching the zero-cca convention
- atlas/skills/: 6 sub-skills with frontmatter (claw-email-parser / claw-project-tracker / claw-people-observer / claw-customer-radar / claw-boss-distiller / claw-report-writer)
- atlas/skills/claw-boss-distiller/: adapter notes for nuwa-skill, 5-layer boss_skill seed template (23 rules across Expression DNA / Mental Models / Decision Heuristics / Anti-Patterns / Honest Boundaries), and a complete synthetic distillation demo (10 input emails -> validated 5-layer output)
- atlas/mcp-tools/email-extractor/: Python implementation of stages 1-3 (fetch + decode + dequote), 7 pytest tests passing, CLI: atlas-extract
- atlas/state-schemas/: formal JSON schemas for project / person / customer cards with the no-employee-rating hard constraint baked in
- atlas/client-deck/: 2-page client-facing pitch document
- autopilots/atlas-*.yaml: 5 autopilot configs (daily / weekly / monthly / quarterly + andon event-triggered) for a future Multica-side scheduler

Notes:
- nuwa-skill (MIT, https://github.com/alchaincyf/nuwa-skill) NOT vendored; fetch at deploy time via instructions in atlas/skills/claw-boss-distiller/upstream/README.md
- Vega-side prompts/skills/tools/autopilots/docs scaffold left untouched
- Top-level README.md updated with a brief Atlas pointer; rest preserved
atlas-extractor — V0 Implementation
Python implementation of Stages 1-3 (fetch, decode, dequote) of the
email-extractor pipeline spec (../email-extractor.md). Stages 4-7 (threading,
entity extraction, intent classification, canonical-JSON normalization) are
not implemented yet; they will be added as sibling modules.
Why only 1-3 in V0?
Stages 1-3 are the unsexy but critical foundation. Threading and intent classification are easier (well-understood techniques + LLM); fetch / decode / dequote is where most "AI email tools" silently break and produce garbage downstream. We invest here first.
Install
cd mcp-tools/email-extractor
pip install -e '.[test]'
Usage
Single .eml file (smoke test)
atlas-extract eml \
--input tests/fixtures/sample_thread.eml \
--state-dir /tmp/atlas-state
Directory of .txt or .eml files (e.g., the boss-distiller demo INPUT/)
atlas-extract dir \
--input-dir ../../skills/claw-boss-distiller/demo/INPUT \
--state-dir /tmp/atlas-state
Real IMAP account
ATLAS_IMAP_PASSWORD='app-specific-password' atlas-extract imap \
--host imap.gmail.com \
--user wang@us-saas.cn \
--folder INBOX --folder Sent \
--state-dir ./state \
--since-days 365
The --since-days flag bounds the cold start. Subsequent runs are incremental:
they resume from the last_uid persisted per (account, folder).
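The incremental behaviour can be pictured with a small sketch. This is not the extractor's actual code: the checkpoint file layout, the `account::folder` key format, and the function names `load_last_uid`, `save_last_uid`, and `select_new` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, List


def _key(account: str, folder: str) -> str:
    # Hypothetical key format; one checkpoint per (account, folder) pair.
    return f"{account}::{folder}"


def load_last_uid(state_file: Path, account: str, folder: str) -> int:
    """Return the last processed UID, or 0 on a cold start."""
    if not state_file.exists():
        return 0
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return int(data.get(_key(account, folder), 0))


def save_last_uid(state_file: Path, account: str, folder: str, uid: int) -> None:
    """Persist the checkpoint; it never moves backwards."""
    data = {}
    if state_file.exists():
        data = json.loads(state_file.read_text(encoding="utf-8"))
    key = _key(account, folder)
    data[key] = max(int(data.get(key, 0)), uid)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(data, indent=2), encoding="utf-8")


def select_new(uids: Iterable[int], last_uid: int) -> List[int]:
    """Keep only UIDs newer than the checkpoint, in ascending order."""
    return sorted(u for u in uids if u > last_uid)
```

Saving the checkpoint after each successfully written message, rather than once at the end of a run, would let an interrupted run resume without refetching.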
Output
Each message produces a JSON file under
state-dir/extracted/YYYY-MM/<msg_id>.json:
{
"msg_id": "demo-001@us-saas.cn",
"account": "test",
"folder": "local",
"uid": "1",
"internal_date": "2026-04-22T01:14:03+00:00",
"subject": "PRJ-001 客户A 改版 — 第三次问",
"from": {"name": "王", "email": "wang@us-saas.cn"},
"to": [{"name": "张三", "email": "zhangsan@us-saas.cn"}],
"cc": [{"name": "李四", "email": "lisi@us-saas.cn"}],
"in_reply_to": null,
"references": [],
"body_text_clean": "张三,\n\nPRJ-001 我上次问已经过去 6 天了 ...",
"body_text_full_chars": 312,
"body_text_clean_chars": 134,
"dequote": {
"strategies_used": [
"marker:^On\\s.+?wrote:\\s*$",
"signature_sep_dashdash",
"disclaimer:本邮件(及其附件)?(包含|含有)?(保密"
],
"chars_stripped": 178
},
"_extraction": { "stages_complete": [1, 2, 3], "extractor_version": "0.1.0", ... }
}
body_text_clean is the contract handed to Stages 4-7 (threading, entities,
intent, canonical normalization).
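A downstream stage could consume this contract roughly as follows. The sketch relies only on the directory layout and field names shown above; the helper name `iter_clean_bodies` is hypothetical, and treating a `3` in `stages_complete` as the signal that `body_text_clean` is usable is an assumption.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_clean_bodies(state_dir: Path) -> Iterator[Tuple[str, str]]:
    """Yield (msg_id, body_text_clean) for every extracted message."""
    extracted = state_dir / "extracted"
    # Only match YYYY-MM month directories, which skips _failed/.
    # Month names sort lexically in chronological order.
    pattern = "[0-9][0-9][0-9][0-9]-[0-9][0-9]/*.json"
    for path in sorted(extracted.glob(pattern)):
        record = json.loads(path.read_text(encoding="utf-8"))
        stages = record.get("_extraction", {}).get("stages_complete", [])
        if 3 not in stages:
            # Assumption: dequote (stage 3) must have run for the body to count.
            continue
        yield record["msg_id"], record["body_text_clean"]
```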
Run tests
pytest -q
The test suite verifies the 3 critical guarantees:
- Decode pulls headers + body correctly from the canonical fixture
- Dequote strips the English "On X wrote:" marker, signature block, and disclaimer
- Real content is preserved (no over-aggressive stripping)
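To make the dequote guarantees concrete, here is a minimal sketch of two of the strategies named in the sample output: the `On ... wrote:` marker regex and the `-- ` signature separator. It is not the shipped implementation. The function name `dequote` and the rule "cut everything from the first match onward" are assumptions, and the disclaimer strategy is left out because its pattern is truncated in the sample output.

```python
import re
from typing import List, Tuple

# Marker regex as it appears in the sample output's strategies_used.
REPLY_MARKER = re.compile(r"^On\s.+?wrote:\s*$", re.MULTILINE)
# Conventional signature separator: a line holding "-- " (or just "--").
SIGNATURE_SEP = re.compile(r"^-- ?$", re.MULTILINE)


def dequote(body: str) -> Tuple[str, List[str]]:
    """Return (clean_body, strategies_used) for a plain-text body."""
    used: List[str] = []
    match = REPLY_MARKER.search(body)
    if match:
        # Assumes top-posting: everything after the marker is quoted text.
        body = body[: match.start()]
        used.append("marker:" + REPLY_MARKER.pattern)
    match = SIGNATURE_SEP.search(body)
    if match:
        body = body[: match.start()]
        used.append("signature_sep_dashdash")
    return body.rstrip(), used
```

A test in the spirit of the three guarantees would assert both that the quoted reply and signature are gone and that the author's own sentences survive unchanged.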
Performance target (per spec)
- ≥ 200 msgs/min on a single worker (V0 acceptable)
- Dequote precision ≥ 92% on a manually-labeled 100-message eval set (TBD)
Failure modes (how each error gets handled)
| Failure | Behavior |
|---|---|
| MIME parse exception | Log to state/extracted/_failed/<account>__<uid>.error, continue |
| Charset undetectable | Fall through to gb18030 → latin-1 → utf-8 with errors="replace" |
| HTML-only message with broken HTML | readability falls back to html2text raw |
| Quote-stripping leaves < 8 chars | Marked low_signal_clean in run summary; not skipped |
| IMAP rate limit / quota | Exponential backoff in imap-tools (built in); checkpoint via last_uid |
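The charset fallback row can be sketched as below. This follows the order given in the table and is not taken from the extractor's source; the function name `decode_undetected` and the returned codec label are hypothetical. One thing the sketch makes visible: Python's `latin-1` codec accepts every byte value, so the final `utf-8` step with `errors="replace"` is only reached if the chain is reordered.

```python
from typing import Tuple

# Strict attempts, in the order listed in the failure-mode table.
FALLBACK_CHAIN = ("gb18030", "latin-1")


def decode_undetected(raw: bytes) -> Tuple[str, str]:
    """Decode bytes whose charset could not be detected.

    Returns (text, codec_label) so callers can log which fallback was used.
    """
    for codec in FALLBACK_CHAIN:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, substitutes U+FFFD for undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8-replace"
```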
What's NOT in this V0
- Stage 4 (threading): header-first + subject-fuzzy fallback; comes next
- Stage 5 (entity extraction): spacy + regex packs; comes next
- Stage 6 (intent classification): LLM few-shot with 30-sample reference; comes next
- Stage 7 (canonical normalization): structured Email JSON contract; comes next
- Gmail API / Exchange: only IMAP in V0; same interface, different fetcher
- MCP server wrapping: V0 is pure CLI; MCP is a V0.5 refactor
Roadmap to V0.5
- Add thread.py: RFC headers first, fall back to subject+participant similarity
- Add entities.py: spacy zh+en NER + regex (dates, amounts, project keywords)
- Add intent.py: LLM few-shot classifier with body-hash cache
- Add normalize.py: canonical Email JSON output (matching the spec in ../email-extractor.md)
- Add 100-message eval set + accuracy harness
- Wrap as MCP server with the same CLI underneath
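As a rough picture of what the planned thread.py could do, the sketch below assigns each message to a thread root using the `in_reply_to` and `references` fields first, then falls back to subject and participants. Everything here is an assumption: the module does not exist yet, the names `assign_threads`, `normalize_subject`, and `participants` are hypothetical, and the fallback is simplified to an exact match on the prefix-stripped subject plus at least one shared address, where the roadmap calls for a fuzzier similarity measure.

```python
import re
from typing import Dict, List, Set

# Reply/forward prefixes to ignore when comparing subjects (illustrative list).
_PREFIX = re.compile(r"^\s*((re|fwd|fw|回复|答复|转发)\s*[:：]\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return _PREFIX.sub("", subject).strip().lower()


def participants(msg: dict) -> Set[str]:
    people = [msg["from"]] + msg.get("to", []) + msg.get("cc", [])
    return {p["email"].lower() for p in people}


def assign_threads(messages: List[dict]) -> Dict[str, str]:
    """Map msg_id -> thread root msg_id. Input must be in chronological order."""
    thread_of: Dict[str, str] = {}
    for msg in messages:
        root = None
        # 1) Header-first: any known parent decides the thread.
        parents = list(msg.get("references") or [])
        if msg.get("in_reply_to"):
            parents.append(msg["in_reply_to"])
        for parent in parents:
            if parent in thread_of:
                root = thread_of[parent]
                break
        # 2) Fallback: same normalized subject and overlapping participants.
        if root is None:
            subject = normalize_subject(msg["subject"])
            who = participants(msg)
            for earlier in messages:
                if earlier["msg_id"] == msg["msg_id"]:
                    break  # only look at messages that came before this one
                if normalize_subject(earlier["subject"]) == subject and who & participants(earlier):
                    root = thread_of[earlier["msg_id"]]
                    break
        thread_of[msg["msg_id"]] = root or msg["msg_id"]
    return thread_of
```

Trying headers first keeps the fallback from merging unrelated conversations that merely share a generic subject line, which is the main risk of subject-based threading.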