This repo IS Atlas (总助 Claw, a boss's-eye-view project execution
radar). The earlier two-profile framing (Atlas + Vega placeholder) was
a misread: Vega is the agent persona answering Multica issues, not the
product. Vega has no relationship to assistant-claw, the product.

Changes:
- Move atlas/* to top-level (git mv preserves history)
- Remove empty Vega placeholders prompts/.gitkeep, tools/.gitkeep
- Delete atlas/ wrapper directory (now empty)
- Update path references in INTEGRATION-hermes.md, scripts/mirror-...sh,
  docs/decisions/0001-mirror-nuwa-skill.md
- Rewrite README.md as Atlas-only, remove dual-profile language

After this commit:
- Top-level OpenClaw 8 files (IDENTITY/SOUL/USER/AGENTS/TOOLS/MEMORY/
  BOOTSTRAP/HEARTBEAT + CLAUDE symlink + zh-CN mirrors)
- skills/{6 sub-skills + DESCRIPTION + README}
- mcp-tools/{spec + Python implementation}
- state-schemas/{project, person, customer + README}
- autopilots/{5 atlas-*.yaml}
- client-deck/, docs/decisions/, scripts/

The ~/.hermes/skills/atlas/ destination convention is preserved (atlas
is a skill namespace on the operator's machine, distinct from the
source path).
# `atlas-extractor` — V0 Implementation

Python implementation of Stages 1-3 of the email-extractor pipeline spec
(`../email-extractor.md`). Stages 4-7 (threading, entity extraction, intent
classification, canonical-JSON normalization) will live in sibling modules
that have not been added yet.

## Why only 1-3 in V0?

Stages 1-3 are the *unsexy but critical* foundation. Threading and intent
classification are easier (well-understood techniques + LLM); fetch / decode /
dequote is where most "AI email tools" silently break and produce garbage
downstream. We invest here first.

## Install

```bash
cd mcp-tools/email-extractor
# Quoting keeps shells such as zsh from treating [test] as a glob pattern.
pip install -e '.[test]'
```

## Usage

### Single .eml file (smoke test)

```bash
atlas-extract eml \
  --input tests/fixtures/sample_thread.eml \
  --state-dir /tmp/atlas-state
```

### Directory of .txt or .eml files (e.g., the boss-distiller demo INPUT/)

```bash
atlas-extract dir \
  --input-dir ../../skills/claw-boss-distiller/demo/INPUT \
  --state-dir /tmp/atlas-state
```

### Real IMAP account

```bash
ATLAS_IMAP_PASSWORD='app-specific-password' atlas-extract imap \
  --host imap.gmail.com \
  --user wang@us-saas.cn \
  --folder INBOX --folder Sent \
  --state-dir ./state \
  --since-days 365
```

The `--since-days` flag bounds the cold start. Subsequent runs are incremental:
the extractor persists a `last_uid` per (account, folder) and resumes from it.

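The incremental behaviour can be pictured with a small checkpoint store. This is a minimal sketch, not the extractor's actual code: the class name `UidCheckpoint`, the file name `checkpoints.json`, and the helper `search_criteria` are all hypothetical, and the real on-disk layout may differ.

```python
import json
from pathlib import Path


class UidCheckpoint:
    """Persist the highest processed IMAP UID per (account, folder)."""

    def __init__(self, state_dir: str) -> None:
        # Hypothetical file name; the real extractor may store this elsewhere.
        self.path = Path(state_dir) / "checkpoints.json"
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(account: str, folder: str) -> str:
        return f"{account}::{folder}"

    def last_uid(self, account: str, folder: str) -> int:
        # 0 means "never synced", i.e. the cold start bounded by --since-days.
        return self.data.get(self._key(account, folder), 0)

    def advance(self, account: str, folder: str, uid: int) -> None:
        key = self._key(account, folder)
        if uid > self.data.get(key, 0):  # never move the checkpoint backwards
            self.data[key] = uid
            self.path.parent.mkdir(parents=True, exist_ok=True)
            tmp = self.path.with_suffix(".tmp")
            tmp.write_text(json.dumps(self.data))
            tmp.replace(self.path)  # rename, so a crash cannot leave a half-written file


def search_criteria(last_uid: int) -> str:
    """IMAP UID range for the next incremental fetch.

    Note: per the IMAP spec, "N:*" always matches the newest message even
    when its UID is below N, so callers still need to drop UIDs <= last_uid.
    """
    return f"UID {last_uid + 1}:*"
```

Keying on the (account, folder) pair matters because IMAP UIDs are only unique within a single mailbox, so `INBOX` and `Sent` need separate counters.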
## Output

Each message produces a JSON file under
`<state-dir>/extracted/YYYY-MM/<msg_id>.json`:

```json
{
  "msg_id": "demo-001@us-saas.cn",
  "account": "test",
  "folder": "local",
  "uid": "1",
  "internal_date": "2026-04-22T01:14:03+00:00",
  "subject": "PRJ-001 客户A 改版 — 第三次问",
  "from": {"name": "王", "email": "wang@us-saas.cn"},
  "to": [{"name": "张三", "email": "zhangsan@us-saas.cn"}],
  "cc": [{"name": "李四", "email": "lisi@us-saas.cn"}],
  "in_reply_to": null,
  "references": [],
  "body_text_clean": "张三,\n\nPRJ-001 我上次问已经过去 6 天了 ...",
  "body_text_full_chars": 312,
  "body_text_clean_chars": 134,
  "dequote": {
    "strategies_used": [
      "marker:^On\\s.+?wrote:\\s*$",
      "signature_sep_dashdash",
      "disclaimer:本邮件(及其附件)?(包含|含有)?(保密"
    ],
    "chars_stripped": 178
  },
  "_extraction": { "stages_complete": [1, 2, 3], "extractor_version": "0.1.0", ... }
}
```

`body_text_clean` is the contract handed to Stages 4-7 (threading, entities,
intent, canonical normalization).

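A downstream stage could consume this contract roughly as follows. The function name `iter_clean_bodies` is hypothetical, and the sketch assumes only the directory layout and field names shown above.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_clean_bodies(state_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (msg_id, body_text_clean) for records that finished Stages 1-3."""
    root = Path(state_dir) / "extracted"
    # Only YYYY-MM month directories are matched, so _failed/ is skipped.
    for path in sorted(root.glob("[0-9][0-9][0-9][0-9]-[0-9][0-9]/*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        stages = record.get("_extraction", {}).get("stages_complete", [])
        if all(stage in stages for stage in (1, 2, 3)):
            yield record["msg_id"], record["body_text_clean"]
```

Checking `stages_complete` before reading `body_text_clean` keeps a later stage from running on a record that was only partially extracted.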
## Run tests

```bash
pytest -q
```

The test suite verifies the three critical guarantees:

1. Decode pulls headers + body correctly from the canonical fixture
2. Dequote strips the English `On X wrote:` marker, signature block, and disclaimer
3. Real content is preserved (no over-aggressive stripping)

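Guarantees 2 and 3 can be illustrated with a toy dequoter. This is a sketch under stated assumptions, not the shipped implementation: it cuts the body at the earliest match of any pattern, the `DISCLAIMER` regex is a simplified stand-in for the real one, and the function name `dequote` and the `low_signal_clean` key are used here only for illustration.

```python
import re

# Illustrative patterns only; the real pattern set ships with the extractor.
QUOTE_MARKER = re.compile(r"^On\s.+?wrote:\s*$", re.MULTILINE)
SIGNATURE_SEP = re.compile(r"^-- ?$", re.MULTILINE)
DISCLAIMER = re.compile(r"^本邮件.*保密", re.MULTILINE)  # simplified stand-in


def dequote(body: str) -> dict:
    """Cut the body at the earliest quote marker, signature separator, or disclaimer."""
    cut = len(body)
    used = []
    for name, pattern in (
        ("marker", QUOTE_MARKER),
        ("signature_sep_dashdash", SIGNATURE_SEP),
        ("disclaimer", DISCLAIMER),
    ):
        match = pattern.search(body)
        if match:
            used.append(name)  # records every strategy that matched
            cut = min(cut, match.start())
    clean = body[:cut].rstrip()
    return {
        "body_text_clean": clean,
        "strategies_used": used,
        "chars_stripped": len(body) - len(clean),
        # Mirrors the "< 8 chars" rule: flag the message, do not drop it.
        "low_signal_clean": len(clean) < 8,
    }
```

Cutting at the earliest match is the simplest policy and works for top-posted replies; interleaved or bottom-posted replies would need a line-by-line approach instead.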
## Performance target (per spec)

- ≥ 200 msgs/min on a single worker (V0 acceptable)
- Dequote precision ≥ 92% on a manually-labeled 100-message eval set (TBD)

## Failure modes (how each error gets handled)

| Failure | Behavior |
|---------|----------|
| MIME parse exception | Log to `state/extracted/_failed/<account>__<uid>.error`, continue |
| Charset undetectable | Fall through to `gb18030` → `latin-1` → `utf-8` with `errors="replace"` |
| HTML-only message with broken HTML | `readability` falls back to `html2text` raw |
| Quote-stripping leaves < 8 chars | Marked `low_signal_clean` in run summary; not skipped |
| IMAP rate limit / quota | Exponential backoff in `imap-tools` (built in); checkpoint via `last_uid` |

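The charset row can be sketched as a simple fallback chain. The function name `decode_fallback` is hypothetical, and the sketch assumes any declared or detected charset has already been tried and failed.

```python
from typing import Tuple


def decode_fallback(raw: bytes) -> Tuple[str, str]:
    """Return (text, charset_used), trying strict decodes before a lossy one."""
    for charset in ("gb18030", "latin-1"):
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this line is a safety net
    # that mirrors the documented chain rather than a path taken in practice.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Trying `gb18030` first suits Chinese-language mail, since it is a superset of GBK and GB2312; bytes that are not valid `gb18030` then fall through to `latin-1`, which never raises.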
## What's NOT in this V0

- **Stage 4 (threading)**: header-first + subject-fuzzy fallback; comes next
- **Stage 5 (entity extraction)**: spacy + regex packs; comes next
- **Stage 6 (intent classification)**: LLM few-shot with 30-sample reference; comes next
- **Stage 7 (canonical normalization)**: structured Email JSON contract; comes next
- **Gmail API / Exchange**: only IMAP in V0; same interface, different fetcher
- **MCP server wrapping**: V0 is pure CLI; MCP is a V0.5 refactor

## Roadmap to V0.5

1. Add `thread.py`: RFC headers first, fall back to subject+participant similarity
2. Add `entities.py`: spacy zh+en NER + regex (dates, amounts, project keywords)
3. Add `intent.py`: LLM few-shot classifier with body-hash cache
4. Add `normalize.py`: canonical Email JSON output (matching the spec in `../email-extractor.md`)
5. Add 100-message eval set + accuracy harness
6. Wrap as MCP server with the same CLI underneath
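
The body-hash cache planned for `intent.py` might look roughly like the sketch below. Nothing here exists in the repo yet: `IntentCache`, the injected `classify` callable, and the intent labels in the usage example are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


class IntentCache:
    """Memoize classifier results by SHA-256 of the cleaned body."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify  # e.g. a wrapper around the LLM call
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times the classifier was actually invoked

    def intent(self, body_text_clean: str) -> str:
        key = hashlib.sha256(body_text_clean.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._classify(body_text_clean)
        return self._store[key]


# Usage with a stand-in classifier instead of a real LLM call.
cache = IntentCache(lambda body: "follow_up" if "?" in body else "fyi")
cache.intent("Any update?")
cache.intent("Any update?")  # served from the cache; classifier not called again
```

Hashing `body_text_clean` rather than the raw body means two messages that differ only in quoted history or signature share one cache entry.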