File Ingestion (MTAAA)
Benchmarked against: Anthropic PDF support / batch processing. Spec: MTAAA v1.4 (38 Captain-approved decisions, 3,121 lines). Status: Implementation Phase (I1–I9 acceptance criteria defined).
SuperPortia uses MTAAA (Multi-dimensional Taxonomy & Auto-classification Architecture) to ingest, route, and classify all content (text, code, images, audio/video, and data) through a unified intake point.
Ingestion journey
Follow the lifecycle or jump to what you need.
Caller sources
Every ingestion starts with an agent calling ingest_fragment(). Content can arrive from any of these channels:
| Channel | Source | Example |
|---|---|---|
| manual | Agent directly ingests via CLI/App | 小… ingests a spec document |
| api | External system calls the ingest API | Webhook pushes a PDF |
| patrol | Intelligence Officer patrol results | Daily tech news sweep |
| email_forward | Email forwarded to intake address | User forwards an article |
| cron | Scheduled batch ingestion | Nightly backup summary |
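Whatever the channel, the entry point is the same tool call. Below is a minimal sketch of what a manual-channel call might look like; the parameter names mirror the intake state fields but are assumptions, not the tool's confirmed signature:

```python
import hashlib

# Hypothetical call shape for ingest_fragment(); parameter names are assumed
# from the IntakeState schema, and the intake_id format is illustrative.
def ingest_fragment(raw_content: bytes, file_name: str,
                    input_channel: str = "manual",
                    caller_agent: str = "", caller_ship: str = "") -> str:
    """Stage content in the Entry Dock and return a new intake_id."""
    digest = hashlib.sha256(raw_content).hexdigest()[:12]
    return f"intake-{digest}"

intake_id = ingest_fragment(b"# Spec v1.4 ...", "spec.md",
                            input_channel="manual", caller_ship="SS1")
```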
Intake state schema
Every item passing through UBI carries this state:

```python
from typing import TypedDict

class IntakeState(TypedDict):
    intake_id: str         # "intake-2026-0304-001"
    file_name: str         # "report.pdf"
    file_extension: str    # ".pdf"
    detected_mime: str     # "application/pdf"
    raw_content: bytes     # File content
    input_channel: str     # "manual" | "api" | "patrol" | ...
    caller_agent: str      # "Mac CLI 小…"
    caller_ship: str       # "SS1"
    parent_intake_id: str  # For multipart splits
```
Architecture overview
All content enters through a single intake point (UBI, the Universal Brain Intake), gets routed by content type, and is processed by one of five specialized handlers.
Key design principle: UB is a Knowledge Base, not a universal database. It stores knowledge (decisions, specs, insights, analysis) in text form. Code goes to Git, images to R2, data to the filesystem; each store is optimized for its use case. This follows the industry-standard Polyglot Persistence pattern.
Entry Dock
The Entry Dock (UB 碼頭, "dock") is a staging area where all ingested content lands before classification. It corresponds to the entries table in D1.
```
ingest_fragment() → Worker.js /brain/ingest → entries table (Dock)
        ↓
Cron job (every 5 minutes)
        ↓
Text Subgraph classifies
        ↓
/ub/promote → classified_entries (production)
```
Why a staging area?
| Reason | Detail |
|---|---|
| Decoupling | Ingestion is instant; no waiting for LLM classification |
| Batch efficiency | Cron grabs multiple entries at once, amortizing LLM calls |
| Error isolation | A classification failure doesn't block ingestion |
| Quality gate | Content can be reviewed before promotion |
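The decoupling and batch-efficiency rows can be sketched end to end. This toy model uses sqlite3 in place of D1; the table and column names follow the Dock fields below, and the batch size is an assumption:

```python
import sqlite3

# Toy Entry Dock: ingestion inserts rows instantly; a cron-style batch grab
# later claims several pending entries at once for one classification round.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (entry_id TEXT PRIMARY KEY, "
             "content TEXT, status TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?, 'pending')",
                 [(f"ub-{i:03d}", f"doc {i}") for i in range(7)])

def grab_batch(conn, limit=5):
    """Claim up to `limit` pending entries and mark them processing."""
    rows = conn.execute("SELECT entry_id, content FROM entries "
                        "WHERE status = 'pending' LIMIT ?", (limit,)).fetchall()
    conn.executemany("UPDATE entries SET status = 'processing' "
                     "WHERE entry_id = ?", [(eid,) for eid, _ in rows])
    return rows

batch = grab_batch(conn)  # one LLM round now covers several entries at once
```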
Dock fields
| Field | Purpose |
|---|---|
| entry_id | Unique ID (e.g., ub-abc123) |
| title | Auto-generated or caller-provided |
| content | Raw content text |
| source_ship | Which ship ingested (SS1, SS2, SS3) |
| agent_id | Which agent ingested |
| created_at | Ingestion timestamp |
| status | pending / processing / classified / failed |
Three-dimension classification
Every text entry in UB gets classified along three orthogonal dimensions:
| Dimension | Question | Values | Method |
|---|---|---|---|
| Topic | What is it about? | Multi-value (1–N) from 5,353 Topic CV paths | LLM + Progressive Disclosure |
| Type | What kind of content? | Single value from 36 types | LLM with definitions |
| Lifecycle | How long to keep? | 4 values | Rule engine (no LLM) |
Topic – Amazon Books as knowledge taxonomy
Key insight: Amazon started as a bookstore. Book categorization is knowledge categorization. The Amazon Browse Tree Guide has 4,883 knowledge-oriented paths covering everything from AI & Machine Learning to Ornithology to Investing > Technical Analysis.
Topic is multi-value: a digital entry isn't a physical book on one shelf. An internal spec can simultaneously belong to:
- SP > Universal Brain > Taxonomy Design (project positioning)
- Books > Textbooks > Library & Information Science (academic domain)
SP.* paths serve two roles:
| Role | Purpose | Example |
|---|---|---|
| Role A – Domain extension | Fast-moving domains where Amazon Books lags | Books > ... > AI & ML > SP: Agentic AI |
| Role B – Internal namespace | Company-specific docs with no Amazon Books equivalent | SP > Engineering Governance > Code Standards |
Type – 36 text knowledge types
Based on Dublin Core + Schema.org, filtered to knowledge-only types:

```
Text Document
├── Specification             ├── Intel Report
├── SOP                       ├── Market Sentiment Report
├── Decision Record           ├── Trade Journal
├── Correction Record         ├── Meeting Notes
├── Session Handoff           ├── Transcript
├── Status Report             ├── Tutorial
├── Gap Analysis              ├── Reference Guide
├── Research Summary          ├── FAQ
├── Article                   ├── Glossary
├── Newsletter                ├── Checklist
├── Correspondence            ├── Annotation
├── Institutional Analysis    ├── Troubleshooting Guide
├── Operations Log            ├── Architecture Document
├── Framework Document        ├── Release Note
├── Personal Note             ├── Legal / Contract
├── Observation Record        ├── Academic Paper
├── Review / Critique         ├── Book / eBook
└── Presentation              └── Reference Table
```
Each type has a definition and confused-pair hints so the LLM doesn't misclassify similar types (e.g., "Research Summary" reports findings while "Framework Document" describes a reusable method).
Lifecycle – rule engine, not LLM
Lifecycle is determined by a lookup table (data/lifecycle_rules.csv), not an LLM:
| Lifecycle | Meaning | Example triggers |
|---|---|---|
| versioned | Superseded by newer versions | Specs, SOPs, architecture docs |
| persistent | Permanently relevant | Decisions, corrections, principles |
| ephemeral | Short-term relevance | Status reports, session handoffs |
| time-bounded | Expires on a date | Market intel, event-tied analysis |
Text Subgraph – classification pipeline
鍋爺 (guō yé, "Boiler Grandpa"): text entries come in raw and leave properly classified.
The text handler is the only handler that requires LLM-based classification. It's built as a LangGraph agentic pipeline with 5 nodes:
Orange nodes = LLM calls (5 total per entry). Green nodes = rule-based (zero cost).
Node 0 – PK Awareness Check
The LLM self-assesses whether it understands the content before classifying:
- HIGH → proceed directly
- MEDIUM → search UB for context first, then proceed with enriched state
- LOW → mark needs_review = true, best-effort classify, flag for human
This is why 鍋爺 is agentic rather than a static script: it needs to assess its own knowledge and conditionally search UB.
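The three confidence branches can be sketched as a plain routing function; the function and key names here are illustrative, not the actual LangGraph node API:

```python
# Sketch of the conditional edge after Node 0. `search_ub` stands in for the
# UB search tool; keys like "needs_review" mirror the text above.
def route_pk_awareness(confidence: str, search_ub=lambda query: []):
    """Decide what Node 0's self-assessment means for the rest of the run."""
    if confidence == "HIGH":
        return {"action": "classify"}                        # proceed directly
    if confidence == "MEDIUM":
        return {"action": "classify",
                "context": search_ub("related UB entries")}  # enrich state first
    return {"action": "classify", "needs_review": True}      # LOW: flag for human
```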
Node 1 – PK Context Injection
Keyword matching (no LLM) against the PK Context Table (data/pk_context.csv). When an entry mentions a technology SuperPortia uses (e.g., "LangGraph", "Cloudflare D1"), inject the corresponding SP.* Topic path into the candidate list.
This ensures internal technologies always get their SP.* project path alongside the Amazon Books academic path.
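A minimal sketch of the keyword match, assuming pk_context.csv has keyword and sp_path columns; the real column names and SP.* paths may differ:

```python
import csv, io

# Stand-in for data/pk_context.csv; both paths below are hypothetical examples.
PK_CONTEXT_CSV = """keyword,sp_path
LangGraph,SP > Engineering > Agent Frameworks
Cloudflare D1,SP > Infrastructure > Cloudflare D1
"""

def inject_pk_context(text: str) -> list[str]:
    """Return SP.* Topic paths for every known technology the entry mentions."""
    rows = csv.DictReader(io.StringIO(PK_CONTEXT_CSV))
    return [r["sp_path"] for r in rows if r["keyword"].lower() in text.lower()]
```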
Node 2 – Topic PD Drill-down
The full Topic CV has 5,353 paths, far too many for a single LLM prompt. Progressive Disclosure drills down in 3 steps:
- L1 Selection → choose from ~32 top-level categories
- L2 Refinement → choose from ~11 subcategories under the selected L1
- L3 Refinement → choose from ~15 options under the selected L2
Each step is one LLM call. If the LLM returns UNMATCHED at L1, the entry is flagged and the CV Growth Flow kicks in.
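The three-step drill-down can be sketched as a loop over a nested CV dict; llm_choose stands in for one LLM call per level, and the sample tree is hypothetical:

```python
# Sketch of Node 2: walk the CV tree level by level, offering only the
# children of the previously chosen node at each step.
def pd_drilldown(cv_tree, llm_choose):
    path, level = [], cv_tree
    for _ in range(3):                 # L1 -> L2 -> L3
        pick = llm_choose(list(level.keys()))  # one LLM call per level
        if pick == "UNMATCHED":
            return None                # flag entry, trigger CV Growth Flow
        path.append(pick)
        level = level.get(pick) or {}
        if not level:
            break                      # leaf reached early
    return " > ".join(path)

cv = {"Computers & Technology": {"AI & Machine Learning": {"SP: Agentic AI": {}}}}
```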
Node 3 – Type Selection
Select from 36 types. The prompt includes definitions and confused-pair hints from data/type_definitions.csv:
- Specification: A formal document defining requirements, interfaces, or standards
- Framework Document: Describes a reusable method, pattern, or conceptual framework
  ⚠️ confused_with: Research Summary (reports findings, not patterns)
Node 4 – Lifecycle Rule Engine
Pure lookup, no LLM. Rules are externalized in data/lifecycle_rules.csv:
| input_channel | type_pattern | lifecycle |
|---|---|---|
| * | Specification | versioned |
| * | Decision Record | persistent |
| * | Status Report | ephemeral |
| patrol | Intel Report | time-bounded |
| email_forward | * | ephemeral |
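A first-match sketch of the rule engine over the rows above. The real precedence between wildcard and specific rows isn't specified here, so this version simply takes the first matching row in CSV order:

```python
import csv, io

# The rows from the table above, in the same order.
RULES_CSV = """input_channel,type_pattern,lifecycle
*,Specification,versioned
*,Decision Record,persistent
*,Status Report,ephemeral
patrol,Intel Report,time-bounded
email_forward,*,ephemeral
"""

def lifecycle_for(channel, type_):
    """Return the lifecycle of the first rule matching channel and type."""
    for row in csv.DictReader(io.StringIO(RULES_CSV)):
        if (row["input_channel"] in ("*", channel)
                and row["type_pattern"] in ("*", type_)):
            return row["lifecycle"]
    return None  # no rule matched: leave for manual review
```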
Cost model
| Node | LLM calls | Input tokens | Output tokens |
|---|---|---|---|
| 0: PK Awareness | 1 | ~300 | ~10 |
| 1: PK Context | 0 (regex) | โ | โ |
| 2: Topic PD (L1→L2→L3) | 3 | ~1,300 | ~90 |
| 3: Type Selection | 1 | ~1,200 | ~30 |
| 4: Lifecycle | 0 (rule engine) | โ | โ |
| Total | 5 | ~2,800 | ~130 |
Per entry: ~$0.0002 (DeepSeek V3). Batch of 327 entries: ~$0.07.
The LLM is hot-swappable: changing the model is a config change only. The CV anchors consistency, not the LLM.
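A quick sanity check on the batch figure, taking the per-entry cost from the text as given (actual per-token DeepSeek pricing is not reproduced here):

```python
# ~$0.0002/entry * 327 entries in the example batch.
PER_ENTRY_USD = 0.0002
batch_cost = 327 * PER_ENTRY_USD
rounded = round(batch_cost, 2)  # approximately 0.07
```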
Cross-handler feedback
When non-text handlers extract text from their content (OCR from images, transcripts from audio), that extracted text is sent back to the Text Subgraph via the Entry Dock for UB classification.
The UB entry's references field links back to the original non-text artifact via UID, enabling cross-system traceability.
Promote to production
After the Text Subgraph classifies an entry in the Dock, it must be promoted to the production area (classified_entries table) to become searchable.
```
Entry Dock (entries) → /ub/promote → UB production (classified_entries)
                                     status: "classified" + Vectorize embedding
```
| Stage | Table | Searchable? | Purpose |
|---|---|---|---|
| Dock | entries | No (staging only) | Raw landing, awaiting classification |
| Production | classified_entries | Yes (search_brain finds it) | Production knowledge base |
The promote step:
1. Copies the classified entry from entries to classified_entries
2. Triggers Vectorize to generate the semantic embedding (Gemini embedding-001, 768d)
3. Updates the Dock entry status to promoted
Entries that fail classification remain in the Dock with status: failed for manual review.
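The three promote steps can be sketched with sqlite3 standing in for D1 and a stub for the Vectorize/Gemini embedding call; all function and table shapes are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (entry_id TEXT, content TEXT, status TEXT)")
conn.execute("CREATE TABLE classified_entries "
             "(entry_id TEXT, content TEXT, embedded INTEGER)")
conn.execute("INSERT INTO entries VALUES ('ub-a623', 'decision text', 'classified')")

def embed(text):
    return [0.0] * 768            # stand-in for Gemini embedding-001 (768d)

def promote(conn, entry_id):
    row = conn.execute("SELECT content FROM entries WHERE entry_id = ? "
                       "AND status = 'classified'", (entry_id,)).fetchone()
    if row is None:
        return                    # only classified Dock entries are promotable
    vector = embed(row[0])        # step 2: generate the semantic embedding
    conn.execute("INSERT INTO classified_entries VALUES (?, ?, ?)",
                 (entry_id, row[0], len(vector) == 768))   # step 1: copy over
    conn.execute("UPDATE entries SET status = 'promoted' WHERE entry_id = ?",
                 (entry_id,))                              # step 3: mark Dock row

promote(conn, "ub-a623")
```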
Output format
After classification, the Text Subgraph writes to D1:

```json
{
  "entry_id": "ub-a623b20d3159",
  "intake_id": "intake-2026-0304-001",
  "taxonomy_topics": [
    "Books > Computers & Technology > AI & Machine Learning",
    "SP > Universal Brain > Taxonomy Design"
  ],
  "taxonomy_type": "Text Document > Decision Record",
  "lifecycle": "persistent",
  "classification_meta": {
    "pk_confidence": "HIGH",
    "needs_review": false,
    "llm": "deepseek-v3",
    "pipeline_version": "1.0",
    "source_handler": "router",
    "caller_agent": "Mac CLI 小…",
    "caller_ship": "SS1"
  }
}
```
Every entry traces back to the original UBI intake via intake_id, enabling end-to-end audit.
Error handling
| Condition | Behavior |
|---|---|
| PK confidence = LOW | needs_review = true, best-effort classify, flag for human |
| Topic L1 = UNMATCHED | Flag entry, trigger CV Growth Flow |
| Type = UNMATCHED | Flag for human review |
| DeepSeek API timeout | RetryPolicy: 3 attempts, exponential backoff |
| All LLM nodes fail | Entry stays unclassified, logged for manual processing |
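The retry row can be sketched as a small wrapper; call_llm is a placeholder for the DeepSeek request, and the backoff base is an assumption:

```python
import time

# Minimal retry wrapper: 3 attempts with exponential backoff, matching the
# RetryPolicy row above. On total failure the exception propagates, so the
# entry stays unclassified and is logged for manual processing.
def with_retry(call_llm, attempts=3, base_delay=1.0):
    for i in range(attempts):
        try:
            return call_llm()
        except TimeoutError:
            if i == attempts - 1:
                raise                          # all attempts exhausted
            time.sleep(base_delay * 2 ** i)    # 1s, 2s, ...
```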
CV Growth Flow
When 鍋爺 encounters content that doesn't match any CV path, the entry is flagged and the CV Growth Flow is triggered. CV changes are code changes (in build_cv.py), reviewed via git diff; there is no runtime CV mutation.
Five handlers at a glance
| Handler | Chapter | Intelligence | Destination | Lifecycle |
|---|---|---|---|---|
| Text Subgraph | §2.4 | AI agent (LLM) | UB (D1 + Vectorize) | Rule engine |
| Code | §2.7 | Rule-based | Git repos | Always versioned |
| Image | §2.8 | AI agent (Vision) | R2 / Image DB | By image type |
| Audio/Video | §2.9 | Whisper API + rules | R2 / AV DB | By content type |
| Data | §2.10 | Rule-based | Filesystem | By data category |
For UBI Router architecture details (how content is routed to these handlers), see UBI Router Architecture.
Controlled Vocabulary sources
| Source | Role | Paths | Priority |
|---|---|---|---|
| Amazon Books | Primary knowledge classification | 4,883 | 1st |
| SP.* Role A | Fast-moving domain extensions | 16 | 2nd |
| SP.* Role B | Internal project namespace | 36 | 2nd |
| 博客來 | Chinese book categories | 262 | 3rd |
| arXiv | Academic depth | 151 | 4th |
| Google PT | Supplementary (knowledge only) | 5 | 5th |
| Dublin Core / Schema.org | Type dimension (36 types) | 36 | – |
| Total | | 5,389 | |
CV is stored as a local CSV file (data/cv.csv), generated by build_cv.py. No cloud dependency for classification lookups.
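A sketch of how the L1 options for Progressive Disclosure could be derived from that CSV; the column names (source, path) are assumptions about build_cv.py's output format:

```python
import csv, io

# Stand-in for data/cv.csv with a few hypothetical rows; the real file holds
# all 5,389 rows across the sources listed above.
CV_CSV = """source,path
amazon_books,Books > Computers & Technology > AI & Machine Learning
sp_role_a,Books > Computers & Technology > AI & Machine Learning > SP: Agentic AI
sp_role_b,SP > Universal Brain > Taxonomy Design
"""

def l1_categories(csv_text):
    """Top-level options offered at the Progressive Disclosure L1 step."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {r["path"].split(" > ")[0] for r in rows}
```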
Implementation phases
| Gate | Scope | Iteration |
|---|---|---|
| G1 | Controlled Vocabulary rebuild (build_cv.py with Amazon Books separation) | I1 |
| G2 | Three-dimension D1 schema migration | I2 |
| G3 | Text Subgraph 5-node pipeline (text_subgraph) | I2 |
| G4 | Type definitions + confused-pair CSV | I3 |
| G5 | Lifecycle rules CSV (all 5 handlers) | I4–I5 |
| G6 | Code Handler (§2.7) | I6 |
| G7 | Image, AV, Data Handlers (§2.8–§2.10) | I7–I9 |
Key decisions
| # | Decision | Source |
|---|---|---|
| 4 | LLM = DeepSeek V3, not Haiku (3.6x cheaper) | ub-06272134b21d |
| 12 | Topic is multi-value (1–N) | ub-60c0e80489d9 |
| 16 | 鍋爺 = agentic (LangGraph), needs UB search | spec v1.4 |
| 17 | Type CV needs definitions for LLM accuracy | spec v1.4 |
| 25 | Lifecycle = CSV lookup table, not LLM | spec v1.4 |
| 27 | Type CV: 63 → 36 (non-knowledge types removed) | spec v1.4 |
| 33 | UBI Router Pattern with 5 handlers | spec v1.4 |
| 34 | No n8n/Flowise; LangGraph covers all needs | spec v1.4 |
The complete MTAAA v1.4 spec (3,121 lines, 38 decisions, 41 changelog items) is maintained at docs/MTAAA-Spec-v1.4-DRAFT.md and ingested to UB.
Related pages
| Page | Relationship |
|---|---|
| UB Governance | The rules that MTAAA enforces: language policy, tag format, quality checklist, freshness management |
| UBI Router Architecture | How content is routed to the 5 handlers before reaching Entry Dock |
| EGS Spec | MTAAA implements EGS Chapter 9 (UB Governance) |
| Ingest Fragment | The MCP tool that triggers ingestion into Entry Dock |
| Search Brain | Retrieves classified entries from UB production (classified_entries) |