Controlled Vocabulary
Benchmarked against: Anthropic (no direct equivalent; SuperPortia innovation)
Source file: data/cv.csv
Spec: MTAAA v1.4, Section 2.3
The Controlled Vocabulary (CV) is the dictionary that constrains how the MTAAA pipeline classifies content. LLM classifiers may only choose from CV terms; freeform values are not allowed.
Why controlled vocabulary?
Without CV constraints, LLM classifiers would:
- Invent inconsistent category names ("AI", "ai", "Artificial Intelligence", "ML/AI")
- Create overlapping taxonomies
- Make search and filtering unreliable
The CV ensures every entry uses the same terminology, enabling reliable search, filtering, and reporting across the fleet.
Three dimensions
The CV defines valid values for each MTAAA classification dimension:
Topic dimension
Hierarchical subject matter classification (2 levels):
| Level 1 | Level 2 examples |
|---|---|
| AI Agents | Architecture, Prompt Engineering, Tool Use, Multi-agent |
| Infrastructure | Cloud, Database, Deployment, Monitoring |
| Trading | Strategy, Analysis, Risk Management, Automation |
| Photography | Workflow, Equipment, Post-processing |
| Birding | Species, Locations, Photography |
| Operations | Fleet Management, Cost, SRE, Backup |
| Development | Frontend, Backend, API, Testing |
| Knowledge Management | Classification, Search, Ingestion |
Type dimension
What kind of content this is:
| Type | Description | Example |
|---|---|---|
| Specification | Formal technical spec | EGS v1.2, MTAAA v1.4 |
| Decision Record | Captain-approved decision | Engine selection principle |
| Research | Investigation findings | LangGraph ecosystem analysis |
| Intel | External intelligence | Market research, tech updates |
| Session Record | Agent work session | Session handoff notes |
| SOP | Standard operating procedure | Pre-Flight Check steps |
| Configuration | System configuration | MCP server settings |
| Source Code | Code artifact | Python scripts, configs |
| Incident | Problem report and resolution | RCA for service outage |
| Guide | How-to documentation | Fleet setup walkthrough |
Lifecycle dimension
How current the content is expected to remain:
| Lifecycle | Duration | Example |
|---|---|---|
| versioned | Updated periodically | Specs (EGS v1.1 → v1.2) |
| persistent | Long-lived, rarely changes | Architecture decisions |
| ephemeral | Short-lived, time-sensitive | Intel reports, session records |
| archived | Historical, superseded | Deprecated specs |
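As a rough illustration of how one entry's three-dimensional classification could be checked against the CV, here is a minimal sketch. The `Classification` class, the `is_valid` function, and the in-memory sets are hypothetical names, not part of MTAAA, and only a subset of the terms from the tables above is included (the Level 2 column lists examples, so the real vocabulary may be larger).

```python
from dataclasses import dataclass

# Subset of CV terms copied from the tables above (not the full vocabulary).
TOPICS = {
    "AI Agents": {"Architecture", "Prompt Engineering", "Tool Use", "Multi-agent"},
    "Infrastructure": {"Cloud", "Database", "Deployment", "Monitoring"},
}
TYPES = {"Specification", "Decision Record", "Research", "Intel"}
LIFECYCLES = {"versioned", "persistent", "ephemeral", "archived"}


@dataclass(frozen=True)
class Classification:
    """Hypothetical record holding one value per classification dimension."""
    topic_l1: str
    topic_l2: str
    type: str
    lifecycle: str


def is_valid(c: Classification) -> bool:
    """Return True only if every dimension uses a term from the CV subset."""
    return (
        c.topic_l2 in TOPICS.get(c.topic_l1, set())
        and c.type in TYPES
        and c.lifecycle in LIFECYCLES
    )
```

A Level 2 term is accepted only under its own Level 1 parent, so "Cloud" under "AI Agents" would be rejected in this sketch.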
Tag controlled vocabulary
Beyond the 3D classification, tags use a controlled format:
Format rules
| Rule | Correct | Incorrect |
|---|---|---|
| Lowercase | decision | Decision |
| Hyphenated | captain-approved | captain_approved |
| No spaces | ai-agents | ai agents |
| No CamelCase | pre-flight-check | preFlightCheck |
| Max 8 tags | 8 tags per entry | 15 tags |
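The format rules above can be approximated with a single regular expression plus a count check. This is a sketch rather than the pipeline's actual validator: `tag_errors`, `TAG_PATTERN`, and `MAX_TAGS` are hypothetical names, and allowing digits inside tags is an assumption made so that date-style tags such as `2025-01` pass. Uppercase severity tags such as P0 would need a separate rule.

```python
import re

MAX_TAGS = 8
# Lowercase letters and digits, with single hyphens between segments.
# Allowing digits is an assumption; the rules above only show letters.
TAG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def tag_errors(tags: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the tags pass."""
    errors = []
    if len(tags) > MAX_TAGS:
        errors.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for tag in tags:
        if not TAG_PATTERN.fullmatch(tag):
            errors.append(f"invalid tag format: {tag!r}")
    return errors
```

Each "Incorrect" example from the table (`Decision`, `captain_approved`, `ai agents`, `preFlightCheck`) fails the pattern, while each "Correct" example passes.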
Mandatory tags by content type
| Content type | Required tags |
|---|---|
| Research / Intel | research, [domain], [YYYY-MM] |
| Decision Record | decision, [project], captain-approved |
| Incident / RCA | incident, rca, P0-P3 |
| Spec / Design | spec, [project], [version] |
| Session Record | session, [ship] |
| Correction | correction, [topic] |
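A check for mandatory tags might look like the following sketch. It only verifies the fixed tags and the `[YYYY-MM]` slot; placeholder slots such as `[domain]`, `[project]`, `[version]`, `[ship]`, and `[topic]`, as well as the incident severity tag, are not checked because their allowed values are not listed here. The function name, the dictionary, and the mapping from table rows to type names (for example "Spec / Design" to "Specification") are assumptions.

```python
import re

# Fixed (literal) tags required per content type, from the table above.
# Placeholder slots such as [domain] or [project] are not checked here.
REQUIRED_LITERALS = {
    "Research": {"research"},
    "Intel": {"research"},
    "Decision Record": {"decision", "captain-approved"},
    "Incident": {"incident", "rca"},
    "Specification": {"spec"},
    "Session Record": {"session"},
    "Correction": {"correction"},
}
YEAR_MONTH = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")


def missing_required(content_type: str, tags: list[str]) -> set[str]:
    """Return the required tags (or slots) that are absent from `tags`."""
    missing = REQUIRED_LITERALS.get(content_type, set()) - set(tags)
    if content_type in {"Research", "Intel"}:
        # Research and Intel entries also need a YYYY-MM date tag.
        if not any(YEAR_MONTH.fullmatch(t) for t in tags):
            missing.add("[YYYY-MM]")
    return missing
```

An empty result means every checkable requirement is met; an unknown content type is treated as having no required tags.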
Common tags
| Tag | Usage |
|---|---|
| intel | External research findings |
| architecture | System design decisions |
| cost | Cost-related decisions |
| perishable | Time-sensitive information |
| verified-YYYY-MM | Freshness verification date |
| session-handoff | Cross-session todo items |
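Because `verified-YYYY-MM` embeds a date, a consumer could extract it to judge freshness. The sketch below is illustrative only: `verified_month` is a hypothetical helper, and picking the latest date when several such tags are present is an assumption.

```python
import re
from typing import Optional

VERIFIED = re.compile(r"^verified-(\d{4})-(0[1-9]|1[0-2])$")


def verified_month(tags: list[str]) -> Optional[tuple[int, int]]:
    """Return the latest (year, month) found in verified-YYYY-MM tags, or None."""
    found = []
    for tag in tags:
        m = VERIFIED.fullmatch(tag)
        if m:
            found.append((int(m.group(1)), int(m.group(2))))
    return max(found) if found else None
```

Tags with an impossible month, such as `verified-2025-13`, are ignored rather than raising an error.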
CV management
Viewing the CV
The CV is stored as a CSV file at data/cv.csv:
dimension,level1,level2,description
topic,AI Agents,Architecture,Agent design and system architecture
topic,AI Agents,Prompt Engineering,System prompts and CLAUDE.md
type,Specification,,Formal technical specification document
type,Decision Record,,Captain-approved decision with rationale
lifecycle,versioned,,Updated periodically with version numbers
lifecycle,persistent,,Long-lived content that rarely changes
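Given the column layout shown above, the CV could be read with Python's standard `csv` module. This is a minimal sketch, not the pipeline's loader: `parse_cv` and the nested-dictionary shape it returns are hypothetical.

```python
import csv
from collections import defaultdict


def parse_cv(lines):
    """Group CV rows as {dimension: {level1: {level2, ...}}}.

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    cv = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(lines):
        terms = cv[row["dimension"]][row["level1"]]
        if row["level2"]:  # type and lifecycle rows leave level2 empty
            terms.add(row["level2"])
    return cv
```

With the real file, this might be called inside `with open("data/cv.csv", newline="", encoding="utf-8") as f:` as `parse_cv(f)`. Terms without a second level map to an empty set.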
Adding new CV terms
- Identify the gap — a concept that doesn't fit existing CV terms
- Propose the new term with dimension, level, and description
- Captain approval required for new CV terms (HITL)
- Update data/cv.csv
- Pipeline automatically uses updated CV
CV anti-patterns
| Anti-pattern | Problem | Solution |
|---|---|---|
| Too granular | 4+ levels deep | Max 2 levels |
| Too broad | "Technology" | Use specific Level 1 |
| Redundant | "AI" and "Artificial Intelligence" | Pick one canonical form |
| Project-specific | "SS1-mac-mini-config" | Use tags for projects |
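Part of the "Redundant" anti-pattern can be caught mechanically before a new term is proposed. The sketch below uses a hypothetical helper, `find_redundant`, that compares terms after stripping case, spaces, and punctuation. It will not catch an abbreviation versus its expansion ("AI" and "Artificial Intelligence"), which still needs human review.

```python
def find_redundant(existing: list[str], proposed: str) -> list[str]:
    """Return existing terms equal to `proposed` once case and separators are ignored."""

    def norm(term: str) -> str:
        # Keep only letters and digits, lowercased.
        return "".join(ch for ch in term.lower() if ch.isalnum())

    target = norm(proposed)
    return [term for term in existing if norm(term) == target]
```

A non-empty result suggests the proposed term duplicates an existing one and a canonical form should be picked instead.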
How the pipeline uses CV
The LLM classifier receives the full CV as context and must select from it. If the LLM attempts to use a term not in the CV, the schema_matcher node falls back to the closest valid term.
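How "closest" is measured is not specified here. One possible reading, shown purely as an assumption, is string similarity via the standard library's `difflib.get_close_matches`. The function `closest_cv_term` and the 0.6 cutoff are hypothetical, and the actual schema_matcher node may work differently (for example, with embeddings).

```python
import difflib


def closest_cv_term(candidate, valid_terms, cutoff=0.6):
    """Map a candidate label to the most similar valid term, or None if nothing is close."""
    if candidate in valid_terms:
        return candidate
    # Compare case-insensitively, then map back to the canonical spelling.
    lowered = {term.lower(): term for term in valid_terms}
    matches = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None
```

Returning `None` when nothing clears the cutoff lets the caller decide whether to reject the entry or route it for review, rather than forcing a poor match.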
Related pages
| Page | Relationship |
|---|---|
| Ingest Fragment API | Where CV is applied |
| UB Source Tracking | Tag governance |