Controlled Vocabulary

Benchmarked against: Anthropic (no direct equivalent; SuperPortia innovation)
Source file: data/cv.csv
Spec: MTAAA v1.4, Section 2.3

The Controlled Vocabulary (CV) is the dictionary that constrains how the MTAAA pipeline classifies content. LLM classifiers may only choose from CV terms; freeform values are not allowed.


Why controlled vocabulary?

Without CV constraints, LLM classifiers would:

  • Invent inconsistent category names ("AI", "ai", "Artificial Intelligence", "ML/AI")
  • Create overlapping taxonomies
  • Make search and filtering unreliable

The CV ensures every entry uses the same terminology, enabling reliable search, filtering, and reporting across the fleet.


Three dimensions

The CV defines valid values for each MTAAA classification dimension:

Topic dimension

Hierarchical subject matter classification (2 levels):

| Level 1 | Level 2 examples |
| --- | --- |
| AI Agents | Architecture, Prompt Engineering, Tool Use, Multi-agent |
| Infrastructure | Cloud, Database, Deployment, Monitoring |
| Trading | Strategy, Analysis, Risk Management, Automation |
| Photography | Workflow, Equipment, Post-processing |
| Birding | Species, Locations, Photography |
| Operations | Fleet Management, Cost, SRE, Backup |
| Development | Frontend, Backend, API, Testing |
| Knowledge Management | Classification, Search, Ingestion |
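As a rough illustration of the two-level rule, the table above can be modelled as a mapping from each Level 1 term to its Level 2 terms. This is a minimal sketch: the names `TOPICS` and `is_valid_topic` are hypothetical, and the Level 2 sets contain only the examples listed in the table, not necessarily the full contents of data/cv.csv.

```python
# Topic vocabulary copied from the table above.
# Assumption: the Level 2 sets hold only the listed examples.
TOPICS: dict[str, set[str]] = {
    "AI Agents": {"Architecture", "Prompt Engineering", "Tool Use", "Multi-agent"},
    "Infrastructure": {"Cloud", "Database", "Deployment", "Monitoring"},
    "Trading": {"Strategy", "Analysis", "Risk Management", "Automation"},
    "Photography": {"Workflow", "Equipment", "Post-processing"},
    "Birding": {"Species", "Locations", "Photography"},
    "Operations": {"Fleet Management", "Cost", "SRE", "Backup"},
    "Development": {"Frontend", "Backend", "API", "Testing"},
    "Knowledge Management": {"Classification", "Search", "Ingestion"},
}


def is_valid_topic(level1: str, level2: str) -> bool:
    """Return True only if level2 is registered under level1."""
    return level2 in TOPICS.get(level1, set())
```

Because Level 2 terms are scoped to their parent, "Photography" is accepted under "Birding" but is not a Level 2 term of "Photography" itself.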

Type dimension

What kind of content this is:

| Type | Description | Example |
| --- | --- | --- |
| Specification | Formal technical spec | EGS v1.2, MTAAA v1.4 |
| Decision Record | Captain-approved decision | Engine selection principle |
| Research | Investigation findings | LangGraph ecosystem analysis |
| Intel | External intelligence | Market research, tech updates |
| Session Record | Agent work session | Session handoff notes |
| SOP | Standard operating procedure | Pre-Flight Check steps |
| Configuration | System configuration | MCP server settings |
| Source Code | Code artifact | Python scripts, configs |
| Incident | Problem report and resolution | RCA for service outage |
| Guide | How-to documentation | Fleet setup walkthrough |

Lifecycle dimension

How current the content is expected to remain:

| Lifecycle | Duration | Example |
| --- | --- | --- |
| versioned | Updated periodically | Specs (EGS v1.1 → v1.2) |
| persistent | Long-lived, rarely changes | Architecture decisions |
| ephemeral | Short-lived, time-sensitive | Intel reports, session records |
| archived | Historical, superseded | Deprecated specs |
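One way to picture a complete three-dimensional classification is as a small record whose type and lifecycle fields are checked against the closed lists above. The sketch below is illustrative only: `Classification`, `Lifecycle`, and `CONTENT_TYPES` are hypothetical names, not part of the MTAAA pipeline, and the topic pair is assumed to be validated elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Lifecycle(str, Enum):
    """The four lifecycle values from the table above."""
    VERSIONED = "versioned"
    PERSISTENT = "persistent"
    EPHEMERAL = "ephemeral"
    ARCHIVED = "archived"


# The ten content types from the Type dimension table.
CONTENT_TYPES = frozenset({
    "Specification", "Decision Record", "Research", "Intel", "Session Record",
    "SOP", "Configuration", "Source Code", "Incident", "Guide",
})


@dataclass(frozen=True)
class Classification:
    topic: tuple[str, str]   # (Level 1, Level 2); not validated in this sketch
    content_type: str
    lifecycle: Lifecycle

    def __post_init__(self) -> None:
        if self.content_type not in CONTENT_TYPES:
            raise ValueError(f"unknown content type: {self.content_type!r}")
        # Lifecycle(...) raises ValueError for any value outside the enum.
        object.__setattr__(self, "lifecycle", Lifecycle(self.lifecycle))
```

With this shape, `Classification(("Development", "API"), "Specification", "versioned")` is accepted, while a freeform type such as "Blog Post" raises `ValueError`.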

Tag controlled vocabulary

Beyond the 3D classification, tags use a controlled format:

Format rules

| Rule | Correct | Incorrect |
| --- | --- | --- |
| Lowercase | decision | Decision |
| Hyphenated | captain-approved | captain_approved |
| No spaces | ai-agents | ai agents |
| No CamelCase | pre-flight-check | preFlightCheck |
| Max 8 tags | 8 tags per entry | 15 tags |
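The format rules above are simple enough to check mechanically. The following sketch assumes a tag is one or more lowercase alphanumeric words joined by single hyphens; allowing digits is an assumption made so that date-style tags can pass. `tag_errors` and `TAG_PATTERN` are hypothetical names. The lowercase rule is applied strictly here, so an uppercase priority tag would need an explicit exception; whether the real pipeline makes one is not stated in this page.

```python
import re

MAX_TAGS = 8
# Lowercase words (letters or digits) joined by single hyphens.
TAG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tag_errors(tags: list[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the tags pass."""
    errors = []
    if len(tags) > MAX_TAGS:
        errors.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    for tag in tags:
        # fullmatch rejects uppercase, underscores, spaces, and CamelCase.
        if not TAG_PATTERN.fullmatch(tag):
            errors.append(f"bad format: {tag!r}")
    return errors
```

For example, `tag_errors(["decision", "captain-approved"])` returns an empty list, while `tag_errors(["captain_approved"])` reports a format error.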

Mandatory tags by content type

| Content type | Required tags |
| --- | --- |
| Research / Intel | research, [domain], [YYYY-MM] |
| Decision Record | decision, [project], captain-approved |
| Incident / RCA | incident, rca, P0-P3 |
| Spec / Design | spec, [project], [version] |
| Session Record | session, [ship] |
| Correction | correction, [topic] |
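A checker for the mandatory tags could look roughly like the sketch below. It verifies only what can be checked without extra knowledge: the literal tags, a year-month tag for research entries, and a priority tag for incidents. Reading "P0-P3" as "exactly one of P0, P1, P2, P3" is an assumption, and placeholders such as [domain], [project], [version], [ship], and [topic] are not checked because their valid values are not listed here. All names in the code are hypothetical.

```python
import re

# Literal tags per row of the table above (placeholders in brackets omitted).
REQUIRED_LITERALS: dict[str, set[str]] = {
    "Research / Intel": {"research"},
    "Decision Record": {"decision", "captain-approved"},
    "Incident / RCA": {"incident", "rca"},
    "Spec / Design": {"spec"},
    "Session Record": {"session"},
    "Correction": {"correction"},
}
YEAR_MONTH = re.compile(r"\d{4}-(?:0[1-9]|1[0-2])")  # e.g. 2025-03
PRIORITIES = {"P0", "P1", "P2", "P3"}  # assumed meaning of "P0-P3"


def missing_tags(content_type: str, tags: list[str]) -> list[str]:
    """Return required tags that are absent; unknown content types return []."""
    present = set(tags)
    missing = sorted(REQUIRED_LITERALS.get(content_type, set()) - present)
    if content_type == "Research / Intel":
        if not any(YEAR_MONTH.fullmatch(tag) for tag in tags):
            missing.append("[YYYY-MM]")
    if content_type == "Incident / RCA" and not (present & PRIORITIES):
        missing.append("P0-P3")
    return missing
```

For instance, `missing_tags("Decision Record", ["decision"])` returns `["captain-approved"]`.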

Common tags

| Tag | Usage |
| --- | --- |
| intel | External research findings |
| architecture | System design decisions |
| cost | Cost-related decisions |
| perishable | Time-sensitive information |
| verified-YYYY-MM | Freshness verification date |
| session-handoff | Cross-session todo items |

CV management

Viewing the CV

The CV is stored as a CSV file at data/cv.csv:

```csv
dimension,level1,level2,description
topic,AI Agents,Architecture,Agent design and system architecture
topic,AI Agents,Prompt Engineering,System prompts and CLAUDE.md
type,Specification,,Formal technical specification document
type,Decision Record,,Captain-approved decision with rationale
lifecycle,versioned,,Updated periodically with version numbers
lifecycle,persistent,,Long-lived content that rarely changes
```
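Reading this file needs nothing beyond the standard library. The sketch below parses the sample rows shown above from an in-memory string so it runs on its own; `load_cv` is a hypothetical helper, and grouping terms by dimension is just one possible in-memory layout.

```python
import csv
import io
from collections import defaultdict

# The sample rows shown above, inlined so the example is self-contained.
SAMPLE = """\
dimension,level1,level2,description
topic,AI Agents,Architecture,Agent design and system architecture
topic,AI Agents,Prompt Engineering,System prompts and CLAUDE.md
type,Specification,,Formal technical specification document
type,Decision Record,,Captain-approved decision with rationale
lifecycle,versioned,,Updated periodically with version numbers
lifecycle,persistent,,Long-lived content that rarely changes
"""


def load_cv(fileobj) -> dict[str, set[tuple[str, str]]]:
    """Group (level1, level2) pairs by dimension; level2 is '' when unused."""
    cv = defaultdict(set)
    for row in csv.DictReader(fileobj):
        cv[row["dimension"]].add((row["level1"], row["level2"]))
    return dict(cv)


cv = load_cv(io.StringIO(SAMPLE))
```

To read the real file instead, the same function can be given a handle opened with `open("data/cv.csv", newline="", encoding="utf-8")`.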

Adding new CV terms

  1. Identify the gap: a concept that doesn't fit any existing CV term.
  2. Propose the new term with its dimension, level, and description.
  3. Get Captain approval, which is required for every new CV term (HITL).
  4. Update data/cv.csv.
  5. The pipeline uses the updated CV automatically.
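The approval gate in the steps above can be sketched as a guard around the update. This is an illustration under stated assumptions, not the project's actual tooling: `CvTerm` and `add_term` are hypothetical, approval is reduced to a boolean flag, duplicates are detected case-insensitively, and writing the result back to data/cv.csv is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CvTerm:
    dimension: str
    level1: str
    level2: str
    description: str


def _key(term: CvTerm) -> tuple[str, str, str]:
    # Case-insensitive identity, so "AI agents" and "AI Agents" collide.
    return (term.dimension, term.level1.casefold(), term.level2.casefold())


def add_term(existing: list[CvTerm], proposal: CvTerm,
             captain_approved: bool) -> list[CvTerm]:
    """Return a new term list including the proposal, or raise if not allowed."""
    if not captain_approved:
        raise PermissionError("new CV terms require Captain approval")
    if any(_key(term) == _key(proposal) for term in existing):
        raise ValueError("term already exists in the CV")
    return existing + [proposal]
```

Returning a new list rather than mutating the input keeps the current CV untouched if the proposal is rejected.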

CV anti-patterns

| Anti-pattern | Problem | Solution |
| --- | --- | --- |
| Too granular | 4+ levels deep | Max 2 levels |
| Too broad | "Technology" | Use specific Level 1 |
| Redundant | "AI" and "Artificial Intelligence" | Pick one canonical form |
| Project-specific | "SS1-mac-mini-config" | Use tags for projects |

How the pipeline uses CV

The LLM classifier receives the full CV as context and must select from it. If the LLM attempts to use a term not in the CV, the schema_matcher node falls back to the closest valid term.
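How schema_matcher decides which term is "closest" is not described here. As one possible interpretation, the sketch below uses plain string similarity from Python's standard `difflib` module; the function name `closest_term` is hypothetical, and the real node may use a different measure, such as embeddings.

```python
import difflib
from typing import Optional


def closest_term(candidate: str, valid_terms: list[str]) -> Optional[str]:
    """Map an out-of-vocabulary term to the most similar CV term.

    Assumption: similarity is difflib's character-based ratio,
    compared case-insensitively. Returns None only if valid_terms is empty.
    """
    by_folded = {term.casefold(): term for term in valid_terms}
    # cutoff=0.0 means some term is always returned when any exist.
    matches = difflib.get_close_matches(
        candidate.casefold(), list(by_folded), n=1, cutoff=0.0
    )
    return by_folded[matches[0]] if matches else None


TYPES = ["Specification", "Decision Record", "Research", "Intel", "Guide"]
fallback = closest_term("Specifcation", TYPES)
```

A zero cutoff mirrors the statement that the node always falls back to a valid term; a stricter cutoff would instead let very poor matches be rejected for review.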


| Page | Relationship |
| --- | --- |
| Ingest Fragment API | Where CV is applied |
| UB Source Tracking | Tag governance |