
File Ingestion (MTAAA)

Benchmarked against: Anthropic PDF support / Batch processing
Spec: MTAAA v1.4 (38 Captain-approved decisions, 3,121 lines)
Status: Implementation Phase (I1–I9 acceptance criteria defined)

SuperPortia uses MTAAA (Multi-dimensional Taxonomy & Auto-classification Architecture) to ingest, route, and classify all content (text, code, images, audio/video, and data) through a unified intake point.


Ingestion journey

Follow the lifecycle or jump to what you need.


Caller sources

Every ingestion starts with an agent calling ingest_fragment(). Content can arrive from any of these channels:

| Channel | Source | Example |
|---|---|---|
| manual | Agent directly ingests via CLI/App | 小克 ingests a spec document |
| api | External system calls the ingest API | Webhook pushes a PDF |
| patrol | Intelligence Officer patrol results | Daily tech news sweep |
| email_forward | Email forwarded to intake address | User forwards an article |
| cron | Scheduled batch ingestion | Nightly backup summary |
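Because input_channel is a closed set, callers can be validated before anything lands in the Dock. A minimal sketch (the constant and function names here are illustrative, not part of the spec):

```python
# Validate input_channel against the five channels listed above.
# VALID_CHANNELS and check_channel are illustrative names, not spec API.
VALID_CHANNELS = {"manual", "api", "patrol", "email_forward", "cron"}

def check_channel(channel: str) -> str:
    """Reject an ingestion request whose channel is not recognized."""
    if channel not in VALID_CHANNELS:
        raise ValueError(f"unknown input_channel: {channel!r}")
    return channel
```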

Intake state schema

Every item passing through UBI carries this state:

```python
from typing import TypedDict

class IntakeState(TypedDict):
    intake_id: str         # "intake-2026-0304-001"
    file_name: str         # "report.pdf"
    file_extension: str    # ".pdf"
    detected_mime: str     # "application/pdf"
    raw_content: bytes     # File content
    input_channel: str     # "manual" | "api" | "patrol" | ...
    caller_agent: str      # "Mac CLI 小克"
    caller_ship: str       # "SS1"
    parent_intake_id: str  # For multipart splits
```

Architecture overview

All content enters through a single intake point (UBI, the Universal Brain Intake), gets routed by content type, and is processed by one of five specialized handlers.

Key design principle: UB is a Knowledge Base, not a universal database. It stores knowledge (decisions, specs, insights, analysis) in text form. Code goes to Git, images to R2, data to the filesystem, each optimized for its use case. This follows the industry-standard Polyglot Persistence pattern.


Entry Dock

The Entry Dock (UB 碼頭) is a staging area where all ingested content lands before classification. It corresponds to the entries table in D1.

```
ingest_fragment() → Worker.js /brain/ingest → entries table (Dock)
        ↓
Cron job (every 5 minutes)
        ↓
Text Subgraph classifies
        ↓
/ub/promote → classified_entries (正區)
```

Why a staging area?

| Reason | Detail |
|---|---|
| Decoupling | Ingestion is instant; no waiting for LLM classification |
| Batch efficiency | Cron grabs multiple entries at once, amortizing LLM calls |
| Error isolation | A classification failure doesn't block ingestion |
| Quality gate | Content can be reviewed before promotion |

Dock fields

| Field | Purpose |
|---|---|
| entry_id | Unique ID (e.g., ub-abc123) |
| title | Auto-generated or caller-provided |
| content | Raw content text |
| source_ship | Which ship ingested (SS1, SS2, SS3) |
| agent_id | Which agent ingested |
| created_at | Ingestion timestamp |
| status | pending / processing / classified / failed |

Three-dimension classification

Every text entry in UB gets classified along three orthogonal dimensions:

| Dimension | Question | Values | Method |
|---|---|---|---|
| Topic | What is it about? | Multi-value (1–N) from 5,353 CV paths | LLM + Progressive Disclosure |
| Type | What kind of content? | Single value from 36 types | LLM with definitions |
| Lifecycle | How long to keep? | 4 values | Rule engine (no LLM) |

Topic: Amazon Books as knowledge taxonomy

Key insight: Amazon started as a bookstore. Book categorization is knowledge categorization. The Amazon Browse Tree Guide has 4,883 knowledge-oriented paths covering everything from AI & Machine Learning to Ornithology to Investing > Technical Analysis.

Topic is multi-value: a digital entry isn't a physical book on one shelf. An internal spec can simultaneously belong to:

  • SP > Universal Brain > Taxonomy Design (project positioning)
  • Books > Textbooks > Library & Information Science (academic domain)

SP.* paths serve two roles:

| Role | Purpose | Example |
|---|---|---|
| Role A: Domain extension | Fast-moving domains where Amazon Books lags | Books > ... > AI & ML > SP: Agentic AI |
| Role B: Internal namespace | Company-specific docs with no Amazon Books equivalent | SP > Engineering Governance > Code Standards |

Type: 36 text knowledge types

Based on Dublin Core + Schema.org, filtered to knowledge-only types:

```
Text Document
├── Specification            ├── Intel Report
├── SOP                      ├── Market Sentiment Report
├── Decision Record          ├── Trade Journal
├── Correction Record        ├── Meeting Notes
├── Session Handoff          ├── Transcript
├── Status Report            ├── Tutorial
├── Gap Analysis             ├── Reference Guide
├── Research Summary         ├── FAQ
├── Article                  ├── Glossary
├── Newsletter               ├── Checklist
├── Correspondence           ├── Annotation
├── Institutional Analysis   ├── Troubleshooting Guide
├── Operations Log           ├── Architecture Document
├── Framework Document       ├── Release Note
├── Personal Note            ├── Legal / Contract
├── Observation Record       ├── Academic Paper
├── Review / Critique        ├── Book / eBook
└── Presentation             └── Reference Table
```

Each type has a definition and confused-pair hints so the LLM doesn't misclassify similar types (e.g., "Research Summary" reports findings while "Framework Document" describes a reusable method).

Lifecycle: rule engine, not LLM

Lifecycle is determined by a lookup table (data/lifecycle_rules.csv), not LLM:

| Lifecycle | Meaning | Example triggers |
|---|---|---|
| versioned | Superseded by newer versions | Specs, SOPs, architecture docs |
| persistent | Permanently relevant | Decisions, corrections, principles |
| ephemeral | Short-term relevance | Status reports, session handoffs |
| time-bounded | Expires on a date | Market intel, event-tied analysis |

Text Subgraph: Classification pipeline

鍋爺 (guō yé, "Boiler Grandpa"): text entries come in raw, leave properly classified.

The text handler is the only handler that requires LLM-based classification. It's built as a LangGraph agentic pipeline with 5 nodes:

Orange nodes = LLM calls (5 total per entry). Green nodes = rule-based (zero cost).

Node 0: PK Awareness Check

The LLM self-assesses whether it understands the content before classifying:

  • HIGH: proceed directly
  • MEDIUM: search UB for context first, then proceed with enriched state
  • LOW: mark needs_review = true, best-effort classify, flag for human

This is why 鍋爺 is agentic, not a static script: it needs to assess its own knowledge and conditionally search UB.
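The three outcomes amount to a conditional edge after the awareness check. A minimal sketch of that routing logic, assuming state keys pk_confidence and needs_review, with illustrative node names search_ub and classify (not the spec's identifiers):

```python
# Route on the LLM's self-assessed confidence. The node names
# "classify" and "search_ub" are illustrative, not the spec's IDs.
def route_pk_awareness(state: dict) -> str:
    confidence = state.get("pk_confidence", "LOW")
    if confidence == "HIGH":
        return "classify"              # proceed directly
    if confidence == "MEDIUM":
        return "search_ub"             # enrich state with UB context first
    state["needs_review"] = True       # LOW: best-effort, flag for human
    return "classify"
```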

Node 1: PK Context Injection

Keyword matching (no LLM) against the PK Context Table (data/pk_context.csv). When an entry mentions a technology SuperPortia uses (e.g., "LangGraph", "Cloudflare D1"), inject the corresponding SP.* Topic path into the candidate list.

This ensures internal technologies always get their SP.* project path alongside the Amazon Books academic path.
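A sketch of the keyword pass, with two illustrative rows standing in for data/pk_context.csv (the real file's keywords and SP.* paths may differ):

```python
# Each row maps a technology keyword to the SP.* Topic path to inject.
# These two rows are illustrative examples, not the real CSV contents.
PK_CONTEXT = [
    ("langgraph", "SP > Universal Brain > Agent Pipeline"),
    ("cloudflare d1", "SP > Universal Brain > Storage"),
]

def inject_pk_context(text: str) -> list[str]:
    """Return SP.* candidate paths for every keyword found in the entry."""
    lowered = text.lower()
    return [path for keyword, path in PK_CONTEXT if keyword in lowered]
```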

Node 2: Topic PD Drill-down

The full Topic CV has 5,353 paths, far too many for a single LLM prompt. Progressive Disclosure drills down in 3 steps:

  1. L1 Selection: choose from ~32 top-level categories
  2. L2 Refinement: choose from ~11 subcategories under the selected L1
  3. L3 Refinement: choose from ~15 options under the selected L2

Each step is one LLM call. If the LLM returns UNMATCHED at L1, the entry is flagged and the CV Growth Flow kicks in.

Node 3: Type Selection

Select from 36 types. The prompt includes definitions and confused-pair hints from data/type_definitions.csv:

- Specification: A formal document defining requirements, interfaces, or standards
- Framework Document: Describes a reusable method, pattern, or conceptual framework
  ⚠️ confused_with: Research Summary (reports findings, not patterns)

Node 4: Lifecycle Rule Engine

Pure lookup, no LLM. Rules are externalized in data/lifecycle_rules.csv:

| input_channel | type_pattern | lifecycle |
|---|---|---|
| * | Specification | versioned |
| * | Decision Record | persistent |
| * | Status Report | ephemeral |
| patrol | Intel Report | time-bounded |
| email_forward | * | ephemeral |

Cost model

| Node | LLM calls | Input tokens | Output tokens |
|---|---|---|---|
| 0: PK Awareness | 1 | ~300 | ~10 |
| 1: PK Context | 0 (regex) | – | – |
| 2: Topic PD (L1→L2→L3) | 3 | ~1,300 | ~90 |
| 3: Type Selection | 1 | ~1,200 | ~30 |
| 4: Lifecycle | 0 (rule engine) | – | – |
| Total | 5 | ~2,800 | ~130 |

Per entry: ~$0.0002 (DeepSeek V3)
Batch of 327 entries: ~$0.07

The LLM is hot-swappable: changing the model is a config change only. The CV anchors consistency, not the LLM.
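The batch figure follows directly from the per-entry figure:

```python
# Cross-check: ~$0.0002/entry over a 327-entry batch.
per_entry_usd = 0.0002
batch_entries = 327
print(round(per_entry_usd * batch_entries, 2))  # 0.07, the quoted batch cost
```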


Cross-handler feedback

When non-text handlers extract text from their content (OCR from images, transcripts from audio), that extracted text is sent back to the Text Subgraph via Entry Dock for UB classification:

The UB entry's references field links back to the original non-text artifact via UID, enabling cross-system traceability.


Promote to production

After the Text Subgraph classifies an entry in the Dock, it must be promoted to the production area (classified_entries table) to become searchable.

```
Entry Dock (entries)  →  /ub/promote  →  UB 正區 (classified_entries)
status: "classified"                     + Vectorize embedding
```

| Stage | Table | Searchable? | Purpose |
|---|---|---|---|
| Dock | entries | No (staging only) | Raw landing, awaiting classification |
| 正區 | classified_entries | Yes (search_brain finds it) | Production knowledge base |

The promote step:

  1. Copies the classified entry from entries to classified_entries
  2. Triggers Vectorize to generate the semantic embedding (Gemini embedding-001, 768d)
  3. Updates the Dock entry status to promoted

Entries that fail classification remain in the Dock with status: failed for manual review.
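The three promote steps can be sketched with in-memory dicts standing in for the two D1 tables; embed stands in for the Gemini embedding-001 call (a 768-d vector). This is a stand-in for the /ub/promote endpoint, not its implementation:

```python
# Sketch of /ub/promote using dicts in place of the D1 tables.
# `embed` is a stand-in for the Gemini embedding-001 call (768-d).
def promote(dock: dict, production: dict, entry_id: str, embed) -> None:
    entry = dock[entry_id]
    # 1. Copy the classified entry into the production table,
    # 2. attaching the semantic embedding generated by Vectorize.
    production[entry_id] = {**entry, "embedding": embed(entry["content"])}
    # 3. Mark the Dock row as promoted (it is kept for audit).
    entry["status"] = "promoted"
```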


Output format

After classification, the Text Subgraph writes to D1:

```json
{
  "entry_id": "ub-a623b20d3159",
  "intake_id": "intake-2026-0304-001",
  "taxonomy_topics": [
    "Books > Computers & Technology > AI & Machine Learning",
    "SP > Universal Brain > Taxonomy Design"
  ],
  "taxonomy_type": "Text Document > Decision Record",
  "lifecycle": "persistent",
  "classification_meta": {
    "pk_confidence": "HIGH",
    "needs_review": false,
    "llm": "deepseek-v3",
    "pipeline_version": "1.0",
    "source_handler": "router",
    "caller_agent": "Mac CLI 小克",
    "caller_ship": "SS1"
  }
}
```

Every entry traces back to the original UBI intake via intake_id, enabling end-to-end audit.


Error handling

| Condition | Behavior |
|---|---|
| PK confidence = LOW | needs_review = true, best-effort classify, flag for human |
| Topic L1 = UNMATCHED | Flag entry, trigger CV Growth Flow |
| Type = UNMATCHED | Flag for human review |
| DeepSeek API timeout | RetryPolicy: 3 attempts, exponential backoff |
| All LLM nodes fail | Entry stays unclassified, logged for manual processing |

CV Growth Flow

When 鍋爺 encounters content that doesn't match any CV path:

CV changes are code changes (in build_cv.py), reviewed via git diff. No runtime CV mutation.


Five handlers at a glance

| Handler | Chapter | Intelligence | Destination | Lifecycle |
|---|---|---|---|---|
| 🍳 Text Subgraph | §2.4 | AI agent (LLM) | UB (D1 + Vectorize) | Rule engine |
| ⌨️ Code | §2.7 | Rule-based | Git repos | Always versioned |
| 🖼️ Image | §2.8 | AI agent (Vision) | R2 / Image DB | By image type |
| 🎙️ Audio/Video | §2.9 | Whisper API + rules | R2 / AV DB | By content type |
| 📊 Data | §2.10 | Rule-based | Filesystem | By data category |

For UBI Router architecture details (how content is routed to these handlers), see UBI Router Architecture.


Controlled Vocabulary sources

| Source | Role | Paths | Priority |
|---|---|---|---|
| Amazon Books | Primary knowledge classification | 4,883 | 1st |
| SP.* Role A | Fast-moving domain extensions | 16 | 2nd |
| SP.* Role B | Internal project namespace | 36 | 2nd |
| 博客來 | Chinese book categories | 262 | 3rd |
| arXiv | Academic depth | 151 | 4th |
| Google PT | Supplementary (knowledge only) | 5 | 5th |
| Dublin Core / Schema.org | Type dimension (36 types) | 36 | – |
| Total | | 5,389 | |

CV is stored as a local CSV file (data/cv.csv), generated by build_cv.py. No cloud dependency for classification lookups.
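A classification lookup then reduces to reading one local file. A sketch, assuming cv.csv carries at least a path column (an illustrative column name; the actual schema is whatever build_cv.py emits):

```python
import csv

# Load all CV paths from the local CSV; no network call involved.
# The "path" column name is an assumption about build_cv.py's output.
def load_cv_paths(csv_path: str = "data/cv.csv") -> list[str]:
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["path"] for row in csv.DictReader(f)]
```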


Implementation phases

| Gate | Scope | Status |
|---|---|---|
| G1 | Controlled Vocabulary rebuild (build_cv.py with Amazon Books separation) | I1 |
| G2 | Three-dimension D1 schema migration | I2 |
| G3 | Text Subgraph 5-node pipeline (text_subgraph) | I2 |
| G4 | Type definitions + confused-pair CSV | I3 |
| G5 | Lifecycle rules CSV (all 5 handlers) | I4–I5 |
| G6 | Code Handler (§2.7) | I6 |
| G7 | Image, AV, Data Handlers (§2.8–§2.10) | I7–I9 |

Key decisions

| # | Decision | Source |
|---|---|---|
| 4 | LLM = DeepSeek V3, not Haiku (3.6x cheaper) | ub-06272134b21d |
| 12 | Topic is multi-value (1–N) | ub-60c0e80489d9 |
| 16 | 鍋爺 = agentic (LangGraph), needs UB search | spec v1.4 |
| 17 | Type CV needs definitions for LLM accuracy | spec v1.4 |
| 25 | Lifecycle = CSV lookup table, not LLM | spec v1.4 |
| 27 | Type CV: 63 → 36 (non-knowledge types removed) | spec v1.4 |
| 33 | UBI Router Pattern with 5 handlers | spec v1.4 |
| 34 | No n8n/Flowise; LangGraph covers all needs | spec v1.4 |
Full Specification

The complete MTAAA v1.4 spec (3,121 lines, 38 decisions, 41 changelog items) is maintained at docs/MTAAA-Spec-v1.4-DRAFT.md and ingested to UB.


| Page | Relationship |
|---|---|
| UB Governance | The rules that MTAAA enforces: language policy, tag format, quality checklist, freshness management |
| UBI Router Architecture | How content is routed to the 5 handlers before reaching Entry Dock |
| EGS Spec | MTAAA implements EGS Chapter 9 (UB Governance) |
| Ingest Fragment | The MCP tool that triggers ingestion into Entry Dock |
| Search Brain | Retrieves classified entries from UB 正區 |