
File Ingestion (MTAAA)

Benchmarked against: Anthropic PDF support / Batch processing
Spec: MTAAA v1.4 (38 Captain-approved decisions, 3,121 lines)
Status: Implementation Phase (I1–I9 acceptance criteria defined)

SuperPortia uses MTAAA (Multi-dimensional Taxonomy & Auto-classification Architecture) to ingest, route, and classify all content (text, code, images, audio/video, and data) through a unified intake point.


Ingestion journey

Follow the lifecycle or jump to what you need.


Caller sources

Every ingestion starts with an agent calling ingest_fragment(). Content can arrive from any of these channels:

| Channel | Source | Example |
|---|---|---|
| manual | Agent directly ingests via CLI/App | 小克 ingests a spec document |
| api | External system calls the ingest API | Webhook pushes a PDF |
| patrol | Intelligence Officer patrol results | Daily tech news sweep |
| email_forward | Email forwarded to intake address | User forwards an article |
| cron | Scheduled batch ingestion | Nightly backup summary |
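Because input_channel is a closed set, callers can be validated before anything lands in the Dock. A minimal sketch (the constant and function names here are illustrative, not part of the spec):

```python
# Validate input_channel against the five channels listed above.
# VALID_CHANNELS and check_channel are illustrative names, not spec API.
VALID_CHANNELS = {"manual", "api", "patrol", "email_forward", "cron"}

def check_channel(channel: str) -> str:
    """Reject an ingestion request whose channel is not recognized."""
    if channel not in VALID_CHANNELS:
        raise ValueError(f"unknown input_channel: {channel!r}")
    return channel
```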

Intake state schema

Every item passing through UBI carries this state:

```python
from typing import TypedDict

class IntakeState(TypedDict):
    intake_id: str         # "intake-2026-0304-001"
    file_name: str         # "report.pdf"
    file_extension: str    # ".pdf"
    detected_mime: str     # "application/pdf"
    raw_content: bytes     # File content
    input_channel: str     # "manual" | "api" | "patrol" | ...
    caller_agent: str      # "Mac CLI 小克"
    caller_ship: str       # "SS1"
    parent_intake_id: str  # For multipart splits
```

Architecture overview

All content enters through a single intake point (UBI, the Universal Brain Intake), gets routed by content type, and is processed by one of five specialized handlers.

Key design principle: UB is a Knowledge Base, not a universal database. It stores knowledge (decisions, specs, insights, analysis) in text form. Code goes to Git, images to R2, data to the filesystem, each optimized for its use case. This follows the industry-standard Polyglot Persistence pattern.


Entry Dock

The Entry Dock (UB 碼頭) is a staging area where all ingested content lands before classification. It corresponds to the entries table in D1.

```
ingest_fragment() → Worker.js /brain/ingest → entries table (Dock)
        ↓
Cron job (every 5 minutes)
        ↓
Text Subgraph classifies
        ↓
/ub/promote → classified_entries (正區)
```

Why a staging area?

| Reason | Detail |
|---|---|
| Decoupling | Ingestion is instant; no waiting for LLM classification |
| Batch efficiency | Cron grabs multiple entries at once, amortizing LLM calls |
| Error isolation | A classification failure doesn't block ingestion |
| Quality gate | Content can be reviewed before promotion |

Dock fields

| Field | Purpose |
|---|---|
| entry_id | Unique ID (e.g., ub-abc123) |
| title | Auto-generated or caller-provided |
| content | Raw content text |
| source_ship | Which ship ingested (SS1, SS2, SS3) |
| agent_id | Which agent ingested |
| created_at | Ingestion timestamp |
| status | pending / processing / classified / failed |

Three-dimension classification

Every text entry in UB gets classified along three orthogonal dimensions:

| Dimension | Question | Values | Method |
|---|---|---|---|
| Topic | What is it about? | Multi-value (1–N) from 5,353 CV paths | LLM + Progressive Disclosure |
| Type | What kind of content? | Single value from 36 types | LLM with definitions |
| Lifecycle | How long to keep? | 4 values | Rule engine (no LLM) |

Topic: Amazon Books as knowledge taxonomy

Key insight: Amazon started as a bookstore. Book categorization is knowledge categorization. The Amazon Browse Tree Guide has 4,883 knowledge-oriented paths covering everything from AI & Machine Learning to Ornithology to Investing > Technical Analysis.

Topic is multi-value: a digital entry isn't a physical book on one shelf. An internal spec can simultaneously belong to:

  • SP > Universal Brain > Taxonomy Design (project positioning)
  • Books > Textbooks > Library & Information Science (academic domain)

SP.* paths serve two roles:

| Role | Purpose | Example |
|---|---|---|
| Role A: Domain extension | Fast-moving domains where Amazon Books lags | Books > ... > AI & ML > SP: Agentic AI |
| Role B: Internal namespace | Company-specific docs with no Amazon Books equivalent | SP > Engineering Governance > Code Standards |

Type: 36 text knowledge types

Based on Dublin Core + Schema.org, filtered to knowledge-only types:

```
Text Document
├── Specification            ├── Intel Report
├── SOP                      ├── Market Sentiment Report
├── Decision Record          ├── Trade Journal
├── Correction Record        ├── Meeting Notes
├── Session Handoff          ├── Transcript
├── Status Report            ├── Tutorial
├── Gap Analysis             ├── Reference Guide
├── Research Summary         ├── FAQ
├── Article                  ├── Glossary
├── Newsletter               ├── Checklist
├── Correspondence           ├── Annotation
├── Institutional Analysis   ├── Troubleshooting Guide
├── Operations Log           ├── Architecture Document
├── Framework Document       ├── Release Note
├── Personal Note            ├── Legal / Contract
├── Observation Record       ├── Academic Paper
├── Review / Critique        ├── Book / eBook
└── Presentation             └── Reference Table
```

Each type has a definition and confused-pair hints so the LLM doesn't misclassify similar types (e.g., "Research Summary" reports findings while "Framework Document" describes a reusable method).

Lifecycle: rule engine, not LLM

Lifecycle is determined by a lookup table (data/lifecycle_rules.csv), not LLM:

| Lifecycle | Meaning | Example triggers |
|---|---|---|
| versioned | Superseded by newer versions | Specs, SOPs, architecture docs |
| persistent | Permanently relevant | Decisions, corrections, principles |
| ephemeral | Short-term relevance | Status reports, session handoffs |
| time-bounded | Expires on a date | Market intel, event-tied analysis |

Text Subgraph: Classification pipeline

鍋爺 (guō yé, "Boiler Grandpa"): text entries come in raw, leave properly classified.

The text handler is the only handler that requires LLM-based classification. It's built as a LangGraph agentic pipeline with 5 nodes:

Orange nodes = LLM calls (5 total per entry). Green nodes = rule-based (zero cost).

Node 0: PK Awareness Check

The LLM self-assesses whether it understands the content before classifying:

  • HIGH: proceed directly
  • MEDIUM: search UB for context first, then proceed with enriched state
  • LOW: mark needs_review = true, best-effort classify, flag for human

This is why 鍋爺 is agentic, not a static script: it needs to assess its own knowledge and conditionally search UB.
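The three outcomes amount to a conditional edge after the awareness check. A minimal sketch of that routing logic, assuming state keys pk_confidence and needs_review, with illustrative node names search_ub and classify (not the spec's identifiers):

```python
# Route on the LLM's self-assessed confidence. The node names
# "classify" and "search_ub" are illustrative, not the spec's IDs.
def route_pk_awareness(state: dict) -> str:
    confidence = state.get("pk_confidence", "LOW")
    if confidence == "HIGH":
        return "classify"              # proceed directly
    if confidence == "MEDIUM":
        return "search_ub"             # enrich state with UB context first
    state["needs_review"] = True       # LOW: best-effort, flag for human
    return "classify"
```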

Node 1: PK Context Injection

Keyword matching (no LLM) against the PK Context Table (data/pk_context.csv). When an entry mentions a technology SuperPortia uses (e.g., "LangGraph", "Cloudflare D1"), inject the corresponding SP.* Topic path into the candidate list.

This ensures internal technologies always get their SP.* project path alongside the Amazon Books academic path.
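A sketch of the keyword pass, with two illustrative rows standing in for data/pk_context.csv (the real file's keywords and SP.* paths may differ):

```python
# Each row maps a technology keyword to the SP.* Topic path to inject.
# These two rows are illustrative examples, not the real CSV contents.
PK_CONTEXT = [
    ("langgraph", "SP > Universal Brain > Agent Pipeline"),
    ("cloudflare d1", "SP > Universal Brain > Storage"),
]

def inject_pk_context(text: str) -> list[str]:
    """Return SP.* candidate paths for every keyword found in the entry."""
    lowered = text.lower()
    return [path for keyword, path in PK_CONTEXT if keyword in lowered]
```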

Node 2: Topic PD Drill-down

The full Topic CV has 5,353 paths, far too many for a single LLM prompt. Progressive Disclosure drills down in 3 steps:

  1. L1 Selection: choose from ~32 top-level categories
  2. L2 Refinement: choose from ~11 subcategories under the selected L1
  3. L3 Refinement: choose from ~15 options under the selected L2

Each step is one LLM call. If the LLM returns UNMATCHED at L1, the entry is flagged and the CV Growth Flow kicks in.

Node 3: Type Selection

Select from 36 types. The prompt includes definitions and confused-pair hints from data/type_definitions.csv:

- Specification: A formal document defining requirements, interfaces, or standards
- Framework Document: Describes a reusable method, pattern, or conceptual framework
  ⚠️ confused_with: Research Summary (reports findings, not patterns)

Node 4: Lifecycle Rule Engine

Pure lookup, no LLM. Rules are externalized in data/lifecycle_rules.csv:

| input_channel | type_pattern | lifecycle |
|---|---|---|
| * | Specification | versioned |
| * | Decision Record | persistent |
| * | Status Report | ephemeral |
| patrol | Intel Report | time-bounded |
| email_forward | * | ephemeral |

Cost model

| Node | LLM calls | Input tokens | Output tokens |
|---|---|---|---|
| 0: PK Awareness | 1 | ~300 | ~10 |
| 1: PK Context | 0 (regex) | – | – |
| 2: Topic PD (L1→L2→L3) | 3 | ~1,300 | ~90 |
| 3: Type Selection | 1 | ~1,200 | ~30 |
| 4: Lifecycle | 0 (rule engine) | – | – |
| Total | 5 | ~2,800 | ~130 |

Per entry: ~$0.0002 (DeepSeek V3)
Batch of 327 entries: ~$0.07

The LLM is hot-swappable: changing the model is a config change only. The CV anchors consistency, not the LLM.
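The batch figure follows directly from the per-entry figure:

```python
# Cross-check: ~$0.0002/entry over a 327-entry batch.
per_entry_usd = 0.0002
batch_entries = 327
print(round(per_entry_usd * batch_entries, 2))  # 0.07, the quoted batch cost
```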


Cross-handler feedback

When non-text handlers extract text from their content (OCR from images, transcripts from audio), that extracted text is sent back to the Text Subgraph via Entry Dock for UB classification:

The UB entry's references field links back to the original non-text artifact via UID, enabling cross-system traceability.


Promote to production

After the Text Subgraph classifies an entry in the Dock, it must be promoted to the production area (classified_entries table) to become searchable.

```
Entry Dock (entries)  →  /ub/promote  →  UB 正區 (classified_entries)
status: "classified"                     + Vectorize embedding
```

| Stage | Table | Searchable? | Purpose |
|---|---|---|---|
| Dock | entries | No (staging only) | Raw landing, awaiting classification |
| 正區 | classified_entries | Yes (search_brain finds it) | Production knowledge base |

The promote step:

  1. Copies the classified entry from entries to classified_entries
  2. Triggers Vectorize to generate the semantic embedding (Gemini embedding-001, 768d)
  3. Updates the Dock entry status to promoted

Entries that fail classification remain in the Dock with status: failed for manual review.
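The three promote steps can be sketched with in-memory dicts standing in for the two D1 tables; embed stands in for the Gemini embedding-001 call (a 768-d vector). This is a stand-in for the /ub/promote endpoint, not its implementation:

```python
# Sketch of /ub/promote using dicts in place of the D1 tables.
# `embed` is a stand-in for the Gemini embedding-001 call (768-d).
def promote(dock: dict, production: dict, entry_id: str, embed) -> None:
    entry = dock[entry_id]
    # 1. Copy the classified entry into the production table,
    # 2. attaching the semantic embedding generated by Vectorize.
    production[entry_id] = {**entry, "embedding": embed(entry["content"])}
    # 3. Mark the Dock row as promoted (it is kept for audit).
    entry["status"] = "promoted"
```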


Output format

After classification, the Text Subgraph writes to D1:

```json
{
  "entry_id": "ub-a623b20d3159",
  "intake_id": "intake-2026-0304-001",
  "taxonomy_topics": [
    "Books > Computers & Technology > AI & Machine Learning",
    "SP > Universal Brain > Taxonomy Design"
  ],
  "taxonomy_type": "Text Document > Decision Record",
  "lifecycle": "persistent",
  "classification_meta": {
    "pk_confidence": "HIGH",
    "needs_review": false,
    "llm": "deepseek-v3",
    "pipeline_version": "1.0",
    "source_handler": "router",
    "caller_agent": "Mac CLI 小克",
    "caller_ship": "SS1"
  }
}
```

Every entry traces back to the original UBI intake via intake_id, enabling end-to-end audit.


Error handling

| Condition | Behavior |
|---|---|
| PK confidence = LOW | needs_review = true, best-effort classify, flag for human |
| Topic L1 = UNMATCHED | Flag entry, trigger CV Growth Flow |
| Type = UNMATCHED | Flag for human review |
| DeepSeek API timeout | RetryPolicy: 3 attempts, exponential backoff |
| All LLM nodes fail | Entry stays unclassified, logged for manual processing |

CV Growth Flow

When 鍋爺 encounters content that doesn't match any CV path:

CV changes are code changes (in build_cv.py), reviewed via git diff. No runtime CV mutation.


Five handlers at a glance

| Handler | Chapter | Intelligence | Destination | Lifecycle |
|---|---|---|---|---|
| 🍳 Text Subgraph | §2.4 | AI agent (LLM) | UB (D1 + Vectorize) | Rule engine |
| ⌨️ Code | §2.7 | Rule-based | Git repos | Always versioned |
| 🖼️ Image | §2.8 | AI agent (Vision) | R2 / Image DB | By image type |
| 🎙️ Audio/Video | §2.9 | Whisper API + rules | R2 / AV DB | By content type |
| 📊 Data | §2.10 | Rule-based | Filesystem | By data category |

For UBI Router architecture details (how content is routed to these handlers), see UBI Router Architecture.


Controlled Vocabulary sources

| Source | Role | Paths | Priority |
|---|---|---|---|
| Amazon Books | Primary knowledge classification | 4,883 | 1st |
| SP.* Role A | Fast-moving domain extensions | 16 | 2nd |
| SP.* Role B | Internal project namespace | 36 | 2nd |
| 博客來 | Chinese book categories | 262 | 3rd |
| arXiv | Academic depth | 151 | 4th |
| Google PT | Supplementary (knowledge only) | 5 | 5th |
| Dublin Core / Schema.org | Type dimension (36 types) | 36 | – |
| Total | | 5,389 | |

CV is stored as a local CSV file (data/cv.csv), generated by build_cv.py. No cloud dependency for classification lookups.
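A classification lookup then reduces to reading one local file. A sketch, assuming cv.csv carries at least a path column (an illustrative column name; the actual schema is whatever build_cv.py emits):

```python
import csv

# Load all CV paths from the local CSV; no network call involved.
# The "path" column name is an assumption about build_cv.py's output.
def load_cv_paths(csv_path: str = "data/cv.csv") -> list[str]:
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row["path"] for row in csv.DictReader(f)]
```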


Implementation phases

| Gate | Scope | Status |
|---|---|---|
| G1 | Controlled Vocabulary rebuild (build_cv.py with Amazon Books separation) | I1 |
| G2 | Three-dimension D1 schema migration | I2 |
| G3 | Text Subgraph 5-node pipeline (text_subgraph) | I2 |
| G4 | Type definitions + confused-pair CSV | I3 |
| G5 | Lifecycle rules CSV (all 5 handlers) | I4–I5 |
| G6 | Code Handler (§2.7) | I6 |
| G7 | Image, AV, Data Handlers (§2.8–§2.10) | I7–I9 |

Key decisions

| # | Decision | Source |
|---|---|---|
| 4 | LLM = DeepSeek V3, not Haiku (3.6x cheaper) | ub-06272134b21d |
| 12 | Topic is multi-value (1–N) | ub-60c0e80489d9 |
| 16 | 鍋爺 = agentic (LangGraph), needs UB search | spec v1.4 |
| 17 | Type CV needs definitions for LLM accuracy | spec v1.4 |
| 25 | Lifecycle = CSV lookup table, not LLM | spec v1.4 |
| 27 | Type CV: 63 → 36 (non-knowledge types removed) | spec v1.4 |
| 33 | UBI Router Pattern with 5 handlers | spec v1.4 |
| 34 | No n8n/Flowise; LangGraph covers all needs | spec v1.4 |
Full Specification

The complete MTAAA v1.4 spec (3,121 lines, 38 decisions, 41 changelog items) is maintained at docs/MTAAA-Spec-v1.4-DRAFT.md and ingested to UB.


| Page | Relationship |
|---|---|
| UB Governance | The rules that MTAAA enforces: language policy, tag format, quality checklist, freshness management |
| UBI Router Architecture | How content is routed to the 5 handlers before reaching Entry Dock |
| EGS Spec | MTAAA implements EGS Chapter 9 (UB Governance) |
| Ingest Fragment | The MCP tool that triggers ingestion into Entry Dock |
| Search Brain | Retrieves classified entries from UB 正區 |