SRE Status & Health

Benchmarked against: Anthropic, Claude Code Analytics API / Status page
Tools: `sre_status`, `cloud_ub_health`, `factory_floor_status`
Scope: All infrastructure components across the fleet

SRE (Site Reliability Engineering) status monitoring gives the Captain and agents a real-time view of system health. Every infrastructure component — Cloud UB, D1 database, Vectorize, MCP servers, and agent processes — can be checked through dedicated tools.

Health check tools

`cloud_ub_health`

Quick service health check for Cloud UB.

cloud_ub_health()

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-03-05T10:15:00Z",
  "services": {
    "d1": "ok",
    "vectorize": "ok",
    "r2": "ok"
  }
}
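To show how a caller might act on this response, here is a minimal Python sketch that parses the sample payload and lists any service that is not reporting `ok`. The helper names `failing_services` and `is_healthy` are hypothetical, and treating every value other than `"ok"` or `"healthy"` as a problem is an assumption, since the page does not list the other possible values.

```python
import json

# Sample payload copied from the response shown above.
SAMPLE = """
{
  "status": "healthy",
  "version": "1.0.0",
  "timestamp": "2026-03-05T10:15:00Z",
  "services": {"d1": "ok", "vectorize": "ok", "r2": "ok"}
}
"""


def failing_services(payload: dict) -> list[str]:
    """Return the names of services whose state is anything other than "ok"."""
    services = payload.get("services", {})
    return sorted(name for name, state in services.items() if state != "ok")


def is_healthy(payload: dict) -> bool:
    """Assumption: healthy means status == "healthy" and no failing services."""
    return payload.get("status") == "healthy" and not failing_services(payload)


payload = json.loads(SAMPLE)
print(is_healthy(payload), failing_services(payload))  # True []
```

A caller could use `failing_services` to decide whether to escalate to `sre_status()`.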

`sre_status`

Comprehensive infrastructure health check.

sre_status()

Response includes:

Check	What it verifies
D1 database	Connection, read/write capability, table existence
Vectorize	Index accessible, query capability
Agent registry	Online agents, last seen timestamps
Table row counts	`entries`, `work_orders`, `agent_messages` (volume metrics)
Failed entries	Count of ingestion failures

`factory_floor_status`

Real-time operational overview (who's doing what).

factory_floor_status()

Response includes:

Section	Data
Captain location	Current working location
Team status	Each agent: online, idle, or working, plus its current work order (WO)
Active WOs	All in_progress work orders
Awaiting review	WOs in review status
Blocked items	WOs in blocked status
Recent completions	Recently approved WOs

Infrastructure components

Cloud UB (Cloudflare)

Component	Health indicator	Failure impact
Worker	HTTP 200 on `/health`	All UB operations unavailable
D1 database	Query response time	Knowledge search, WO system down
Vectorize	Index query capability	Semantic search unavailable (keyword still works)
R2	Bucket accessible	Backups unavailable

Local UBI (per-ship)

Component	Health indicator	Failure impact
MCP server process	stdio responsive	All local tools unavailable
Local SQLite	Query response	Local search unavailable
Local Qdrant	Vector query	Local semantic search unavailable
Python environment	Import success	MTAAA pipeline unavailable

Agent infrastructure

Component	Health indicator	Failure impact
Claude Code CLI	Process running	Agent offline
Claude Desktop App	App responsive	App agents offline
MCP connections	Tool calls succeed	Specific tool sets unavailable
Chrome extension	WebSocket connected	Browser automation unavailable

Monitoring patterns

Session start check

Every agent session begins with implicit health verification:

agent_heartbeat()  → Verifies Cloud UB connectivity
check_agent_mailbox()  → Verifies messaging system
(Any tool call)  → Verifies MCP server connectivity

If any of these fails, the agent should report to the Captain before proceeding.
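The session start sequence can be sketched as a small routine that runs each check and collects failures for the report to the Captain. This is an illustration only: `session_start_check` is a hypothetical helper, the two functions below are stand-ins for the real tool calls, and the sketch assumes a failed tool call surfaces as a raised exception.

```python
from typing import Callable


def session_start_check(checks: dict[str, Callable[[], object]]) -> list[str]:
    """Run each check in order and return one description per failed check."""
    failures = []
    for name, call in checks.items():
        try:
            call()
        except Exception as exc:  # assumption: a failed tool call raises
            failures.append(f"{name}: {exc}")
    return failures


# Stand-ins for the real tools; the second one simulates an outage.
def agent_heartbeat() -> dict:
    return {"ok": True}


def check_agent_mailbox() -> dict:
    raise ConnectionError("mailbox unreachable")


failures = session_start_check(
    {"agent_heartbeat": agent_heartbeat, "check_agent_mailbox": check_agent_mailbox}
)
if failures:
    print("Report to Captain before proceeding:", failures)
```

Collecting all failures before reporting, rather than stopping at the first one, gives the Captain the full picture in a single message.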

On-demand diagnostics

When something seems wrong:

# Quick check
cloud_ub_health()

# Detailed check
sre_status()

# Operational overview
factory_floor_status()
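When the symptom is slowness rather than an outright failure, it can help to time each diagnostic call. The sketch below wraps the three calls with a wall-clock timer. The `timed` helper is hypothetical, and the three functions are stubs standing in for the real tools, whose return values are not reproduced here.

```python
import time
from typing import Callable


def timed(call: Callable[[], object]) -> tuple[object, float]:
    """Run a call and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


# Stubs standing in for the real tools (placeholder return values).
def cloud_ub_health() -> dict:
    return {"status": "healthy"}


def sre_status() -> dict:
    return {}


def factory_floor_status() -> dict:
    return {}


# Quick check first, then the detailed check, then the operational overview.
for tool in (cloud_ub_health, sre_status, factory_floor_status):
    _, seconds = timed(tool)
    print(f"{tool.__name__}: {seconds:.3f}s")
```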

Incident detection

Symptom	Likely cause	Diagnostic
`search_brain` returns empty	D1 or Vectorize issue	`sre_status()`
WO creation fails	D1 write issue	`cloud_ub_health()`
Agent not appearing on floor	Heartbeat failed	Check MCP connection
Slow responses	Network or DB overload	Check response times
Tool call timeouts	MCP server crashed	Restart MCP server
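The table above can also be kept as a lookup that an agent consults when it sees a symptom. This is a minimal sketch: the `TRIAGE` mapping transcribes the rows above, while the `triage` function and its case-insensitive matching are illustrative assumptions, not part of the documented tooling.

```python
from typing import Optional

# Symptom -> (likely cause, diagnostic), transcribed from the table above.
TRIAGE = {
    "search_brain returns empty": ("D1 or Vectorize issue", "sre_status()"),
    "wo creation fails": ("D1 write issue", "cloud_ub_health()"),
    "agent not appearing on floor": ("Heartbeat failed", "Check MCP connection"),
    "slow responses": ("Network or DB overload", "Check response times"),
    "tool call timeouts": ("MCP server crashed", "Restart MCP server"),
}


def triage(symptom: str) -> Optional[tuple[str, str]]:
    """Return (likely cause, diagnostic) for a known symptom, else None."""
    return TRIAGE.get(symptom.strip().lower())


print(triage("WO creation fails"))  # ('D1 write issue', 'cloud_ub_health()')
```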

SRE history logging

The Cloud UB Worker supports history logging via a query parameter:

GET /sre/status?log=true

When `log=true` is passed, the SRE status result is also written to D1 as a timestamped record, which enables historical trend analysis.
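As an illustration, a client could build and call the logging URL as follows. Only the `/sre/status` path and the `log=true` parameter come from this page. The base URL `https://cloud-ub.example.com` is a placeholder, and the sketch assumes the endpoint returns JSON and needs no authentication headers, neither of which is stated here.

```python
import json
import urllib.request
from urllib.parse import urlencode, urljoin


def sre_status_url(base_url: str, log: bool = False) -> str:
    """Build the SRE status URL, adding log=true when history logging is wanted."""
    url = urljoin(base_url.rstrip("/") + "/", "sre/status")
    return f"{url}?{urlencode({'log': 'true'})}" if log else url


def fetch_sre_status(base_url: str, log: bool = False, timeout: float = 10.0) -> dict:
    """GET the status endpoint; assumes a JSON body and no auth headers."""
    with urllib.request.urlopen(sre_status_url(base_url, log), timeout=timeout) as resp:
        return json.load(resp)


# Placeholder host; replace with the real Worker URL.
print(sre_status_url("https://cloud-ub.example.com", log=True))
# https://cloud-ub.example.com/sre/status?log=true
```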

Alerting (future work)

Feature	Status	Description
Health endpoint monitoring	Planned	Periodic `/health` checks with alerting
D1 row count trending	Planned	Alert on unexpected growth or shrinkage
Agent offline detection	Planned	Alert when expected agent misses heartbeat window
Error rate tracking	Planned	Alert on elevated error rates
Automated recovery	Planned	Auto-restart MCP servers on failure

These are inspection mirror gaps: capabilities we know we need but have not built yet.
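As one example of how a planned item might work, agent offline detection could compare each agent's last seen timestamp against a heartbeat window. The sketch below is hypothetical throughout: the agent names, the ten-minute window, and the `offline_agents` helper are assumptions, and the timestamp format simply follows the ISO 8601 style of the `cloud_ub_health` sample response.

```python
from datetime import datetime, timedelta, timezone


def offline_agents(last_seen: dict[str, str], now: datetime, window: timedelta) -> list[str]:
    """Return agents whose last heartbeat is older than the allowed window."""
    stale = []
    for agent, stamp in last_seen.items():
        # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
        seen = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if now - seen > window:
            stale.append(agent)
    return sorted(stale)


# Hypothetical registry snapshot and heartbeat window.
now = datetime(2026, 3, 5, 10, 15, tzinfo=timezone.utc)
last_seen = {"agent-a": "2026-03-05T10:14:00Z", "agent-b": "2026-03-05T09:30:00Z"}
print(offline_agents(last_seen, now, timedelta(minutes=10)))  # ['agent-b']
```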

Related pages

Page	Relationship
Fleet Management	Operational overview
Data Residency	Infrastructure architecture
Backup & Recovery	What to do when things go wrong
Cloud UB MCP	Cloud UB server details
