
SRE Status & Health

Benchmarked against: Anthropic - Claude Code Analytics API / Status page
Tools: sre_status, cloud_ub_health, factory_floor_status
Scope: All infrastructure components across the fleet

SRE (Site Reliability Engineering) status monitoring gives the Captain and agents a real-time view of system health. Every infrastructure component — Cloud UB, D1 database, Vectorize, MCP servers, and agent processes — can be checked through dedicated tools.


Health check tools

cloud_ub_health

Quick service health check for Cloud UB.

cloud_ub_health()

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-03-05T10:15:00Z",
  "services": {
    "d1": "ok",
    "vectorize": "ok",
    "r2": "ok"
  }
}
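A response shaped like the sample above can be reduced to the list of services that need attention. The following is a minimal sketch, not the tool's own logic: it assumes the response arrives as a JSON string and that any service value other than "ok" signals a problem. The helper name `unhealthy_services` is hypothetical.

```python
import json

# Sample payload copied from the response shown above.
SAMPLE_RESPONSE = """
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-03-05T10:15:00Z",
  "services": {"d1": "ok", "vectorize": "ok", "r2": "ok"}
}
"""

def unhealthy_services(raw: str) -> list[str]:
    """Return the names of services whose state is not "ok" (assumed convention)."""
    payload = json.loads(raw)
    services = payload.get("services", {})
    return sorted(name for name, state in services.items() if state != "ok")

if __name__ == "__main__":
    print(unhealthy_services(SAMPLE_RESPONSE))
```

An empty list means every listed service reported "ok"; a missing "services" key is also treated as nothing to report, which may be too lenient for real use.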

sre_status

Comprehensive infrastructure health check.

sre_status()

Response includes:

Check | What it verifies
D1 database | Connection, read/write capability, table existence
Vectorize | Index accessible, query capability
Agent registry | Online agents, last seen timestamps
Table row counts | entries, work_orders, agent_messages (volume metrics)
Failed entries | Count of ingestion failures

factory_floor_status

Real-time operational overview (who's doing what).

factory_floor_status()

Response includes:

Section | Data
Captain location | Current working location
Team status | Each agent: online/idle/working + current WO
Active WOs | All in_progress work orders
Awaiting review | WOs in review status
Blocked items | WOs in blocked status
Recent completions | Recently approved WOs

Infrastructure components

Cloud UB (Cloudflare)

Component | Health indicator | Failure impact
Worker | HTTP 200 on /health | All UB operations unavailable
D1 database | Query response time | Knowledge search, WO system down
Vectorize | Index query capability | Semantic search unavailable (keyword still works)
R2 | Bucket accessible | Backups unavailable

Local UBI (per-ship)

Component | Health indicator | Failure impact
MCP server process | stdio responsive | All local tools unavailable
Local SQLite | Query response | Local search unavailable
Local Qdrant | Vector query | Local semantic search unavailable
Python environment | Import success | MTAAA pipeline unavailable

Agent infrastructure

Component | Health indicator | Failure impact
Claude Code CLI | Process running | Agent offline
Claude Desktop App | App responsive | App agents offline
MCP connections | Tool calls succeed | Specific tool sets unavailable
Chrome extension | WebSocket connected | Browser automation unavailable

Monitoring patterns

Session start check

Every agent session begins with implicit health verification:

1. agent_heartbeat()  → Verifies Cloud UB connectivity
2. check_agent_mailbox() → Verifies messaging system
3. (Any tool call) → Verifies MCP server connectivity

If any of these checks fails, the agent should report the failure to the Captain before proceeding.
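The session start sequence can be sketched as a small runner that tries each check in order and collects failures to report. This is an illustration under stated assumptions: the two functions below are hypothetical stand-ins for the real agent_heartbeat() and check_agent_mailbox() tool calls, and a failed call is assumed to raise an exception.

```python
from typing import Callable

# Hypothetical stand-ins for the real tool calls; a real session would
# invoke agent_heartbeat() and check_agent_mailbox() through MCP instead.
def agent_heartbeat() -> str:
    return "ok"

def check_agent_mailbox() -> str:
    raise ConnectionError("mailbox unreachable")

def run_session_start_checks(
    checks: list[tuple[str, Callable[[], object]]],
) -> list[str]:
    """Run every check in order and collect a message for each one that raises."""
    failures = []
    for name, call in checks:
        try:
            call()
        except Exception as exc:  # assumption: a failed tool call raises
            failures.append(f"{name} failed: {exc}")
    return failures

if __name__ == "__main__":
    problems = run_session_start_checks([
        ("agent_heartbeat", agent_heartbeat),
        ("check_agent_mailbox", check_agent_mailbox),
    ])
    for line in problems:
        print("Report to Captain:", line)
```

The runner continues past the first failure so that a single report can cover every broken dependency, rather than stopping at the heartbeat.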

On-demand diagnostics

When something seems wrong:

# Quick check
cloud_ub_health()

# Detailed check
sre_status()

# Operational overview
factory_floor_status()

Incident detection

Symptom | Likely cause | Diagnostic
search_brain returns empty | D1 or Vectorize issue | sre_status()
WO creation fails | D1 write issue | cloud_ub_health()
Agent not appearing on floor | Heartbeat failed | Check MCP connection
Slow responses | Network or DB overload | Check response times
Tool call timeouts | MCP server crashed | Restart MCP server
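The symptom table above can also be encoded as a lookup so an agent can suggest a next step from a reported symptom. This is a minimal sketch: the function name `suggest_diagnostic` is hypothetical, and the fallback for unknown symptoms (quick check first, then the detailed check) is an assumption based on the on-demand diagnostics order.

```python
# Symptom -> (likely cause, diagnostic), copied from the incident detection table.
DIAGNOSTICS: dict[str, tuple[str, str]] = {
    "search_brain returns empty": ("D1 or Vectorize issue", "sre_status()"),
    "WO creation fails": ("D1 write issue", "cloud_ub_health()"),
    "Agent not appearing on floor": ("Heartbeat failed", "Check MCP connection"),
    "Slow responses": ("Network or DB overload", "Check response times"),
    "Tool call timeouts": ("MCP server crashed", "Restart MCP server"),
}

def suggest_diagnostic(symptom: str) -> str:
    """Return the likely cause and next step for a known symptom."""
    entry = DIAGNOSTICS.get(symptom)
    if entry is None:
        # Assumed fallback: run the quick check, then the detailed one.
        return "Unknown symptom. Next step: cloud_ub_health(), then sre_status()"
    cause, action = entry
    return f"Likely cause: {cause}. Next step: {action}"

if __name__ == "__main__":
    print(suggest_diagnostic("WO creation fails"))
```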

SRE history logging

The Cloud UB Worker supports history logging via query parameter:

GET /sre/status?log=true

When log=true is passed, the SRE status result is also written to D1 as a timestamped record, enabling historical trend analysis.
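For callers that build the request programmatically, the logging variant differs only in the query string. The sketch below constructs both URLs with the standard library; the host `cloud-ub.example.com` is a placeholder, not the real Worker address, and the helper name `sre_status_url` is hypothetical.

```python
from urllib.parse import urlencode, urlunsplit

def sre_status_url(host: str, log: bool = False) -> str:
    """Build the /sre/status URL, adding log=true when history logging is wanted."""
    query = urlencode({"log": "true"}) if log else ""
    return urlunsplit(("https", host, "/sre/status", query, ""))

if __name__ == "__main__":
    host = "cloud-ub.example.com"  # placeholder host (assumption)
    print(sre_status_url(host))
    print(sre_status_url(host, log=True))
```

Only the URL construction is shown here; authentication and the actual HTTP request are left out because the source does not describe them.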


Alerting (future work)

Feature | Status | Description
Health endpoint monitoring | Planned | Periodic /health checks with alerting
D1 row count trending | Planned | Alert on unexpected growth or shrinkage
Agent offline detection | Planned | Alert when expected agent misses heartbeat window
Error rate tracking | Planned | Alert on elevated error rates
Automated recovery | Planned | Auto-restart MCP servers on failure

These are inspection mirror gaps: capabilities we know we need but have not yet built.
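As an illustration of what agent offline detection might involve, the sketch below flags agents whose last seen timestamp is older than a heartbeat window. Nothing here reflects an existing implementation: the function name, the shape of the input mapping, and the 15-minute window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

def agents_past_window(
    last_seen: dict[str, datetime],
    now: datetime,
    window: timedelta,
) -> list[str]:
    """Return agents whose last heartbeat is older than the allowed window."""
    return sorted(name for name, seen in last_seen.items() if now - seen > window)

if __name__ == "__main__":
    now = datetime(2026, 3, 5, 10, 15, tzinfo=timezone.utc)
    # Hypothetical registry snapshot: agent name -> last seen timestamp.
    registry = {
        "agent-a": now - timedelta(minutes=2),
        "agent-b": now - timedelta(minutes=40),
    }
    print(agents_past_window(registry, now, window=timedelta(minutes=15)))
```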


Page | Relationship
Fleet Management | Operational overview
Data Residency | Infrastructure architecture
Backup & Recovery | What to do when things go wrong
Cloud UB MCP | Cloud UB server details