SRE Status & Health
Benchmarked against: Anthropic - Claude Code Analytics API / Status page
Tools: sre_status, cloud_ub_health, factory_floor_status
Scope: All infrastructure components across the fleet
SRE (Site Reliability Engineering) status monitoring gives the Captain and agents a real-time view of system health. Every infrastructure component — Cloud UB, D1 database, Vectorize, MCP servers, and agent processes — can be checked through dedicated tools.
Health check tools
cloud_ub_health
Quick service health check for Cloud UB.
cloud_ub_health()
Response:
{
"status": "healthy",
"version": "1.0.0",
"timestamp": "2026-03-05T10:15:00Z",
"services": {
"d1": "ok",
"vectorize": "ok",
"r2": "ok"
}
}
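A caller can act on this response by listing every service whose status is not "ok". The following is a minimal sketch, assuming the response has already been parsed into a Python dict with the shape shown above. The helper name `degraded_services` and the "error" status value are hypothetical and used only for illustration; they are not part of Cloud UB.

```python
from typing import Any


def degraded_services(health: dict[str, Any]) -> list[str]:
    """Return the names of services whose status is not "ok", sorted by name.

    Assumes the response shape shown on this page: a "services" mapping of
    service name to status string. A missing "services" key yields [].
    """
    services = health.get("services", {})
    return sorted(name for name, state in services.items() if state != "ok")


# Sample response from this page, with one service changed to a
# hypothetical non-ok value for demonstration.
sample = {
    "status": "healthy",
    "version": "1.0.0",
    "timestamp": "2026-03-05T10:15:00Z",
    "services": {"d1": "ok", "vectorize": "error", "r2": "ok"},
}
print(degraded_services(sample))  # ['vectorize']
```

Returning a sorted list keeps the output stable, which helps if the result is logged or compared between runs.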
sre_status
Comprehensive infrastructure health check.
sre_status()
Response includes:
| Check | What it verifies |
|---|---|
| D1 database | Connection, read/write capability, table existence |
| Vectorize | Index accessible, query capability |
| Agent registry | Online agents, last seen timestamps |
| Table row counts | entries, work_orders, agent_messages — volume metrics |
| Failed entries | Count of ingestion failures |
factory_floor_status
Real-time operational overview (who's doing what).
factory_floor_status()
Response includes:
| Section | Data |
|---|---|
| Captain location | Current working location |
| Team status | Each agent: online/idle/working + current WO |
| Active WOs | All in_progress work orders |
| Awaiting review | WOs in review status |
| Blocked items | WOs in blocked status |
| Recent completions | Recently approved WOs |
Infrastructure components
Cloud UB (Cloudflare)
| Component | Health indicator | Failure impact |
|---|---|---|
| Worker | HTTP 200 on /health | All UB operations unavailable |
| D1 database | Query response time | Knowledge search, WO system down |
| Vectorize | Index query capability | Semantic search unavailable (keyword still works) |
| R2 | Bucket accessible | Backups unavailable |
Local UBI (per-ship)
| Component | Health indicator | Failure impact |
|---|---|---|
| MCP server process | stdio responsive | All local tools unavailable |
| Local SQLite | Query response | Local search unavailable |
| Local Qdrant | Vector query | Local semantic search unavailable |
| Python environment | Import success | MTAAA pipeline unavailable |
Agent infrastructure
| Component | Health indicator | Failure impact |
|---|---|---|
| Claude Code CLI | Process running | Agent offline |
| Claude Desktop App | App responsive | App agents offline |
| MCP connections | Tool calls succeed | Specific tool sets unavailable |
| Chrome extension | WebSocket connected | Browser automation unavailable |
Monitoring patterns
Session start check
Every agent session begins with implicit health verification:
1. agent_heartbeat() → Verifies Cloud UB connectivity
2. check_agent_mailbox() → Verifies messaging system
3. (Any tool call) → Verifies MCP server connectivity
If any of these checks fails, the agent should report to the Captain before proceeding.
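The session start sequence above can be sketched as a small runner that executes each check in order and collects failures to report. This is only an illustration: the runner `run_session_start_checks` and the stub functions are hypothetical stand-ins for the real tool calls, and it assumes a failed tool call surfaces as a Python exception.

```python
from typing import Callable


def run_session_start_checks(
    checks: list[tuple[str, Callable[[], object]]],
) -> list[str]:
    """Run each named check in order and return a list of failure messages.

    Every check is attempted even if an earlier one fails, so the report
    covers all of them.
    """
    failures = []
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # assumption: a failed tool call raises
            failures.append(f"{name}: {exc}")
    return failures


# Hypothetical stubs standing in for the real tool calls.
def heartbeat_stub() -> dict:
    return {"ok": True}


def mailbox_stub() -> dict:
    raise ConnectionError("mailbox unreachable")


failures = run_session_start_checks(
    [
        ("agent_heartbeat", heartbeat_stub),
        ("check_agent_mailbox", mailbox_stub),
    ]
)
if failures:
    print("Report to Captain before proceeding:", failures)
```

Running every check rather than stopping at the first failure gives the Captain a complete picture in a single report.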
On-demand diagnostics
When something seems wrong:
# Quick check
cloud_ub_health()
# Detailed check
sre_status()
# Operational overview
factory_floor_status()
Incident detection
| Symptom | Likely cause | Diagnostic |
|---|---|---|
| search_brain returns empty | D1 or Vectorize issue | sre_status() |
| WO creation fails | D1 write issue | cloud_ub_health() |
| Agent not appearing on floor | Heartbeat failed | Check MCP connection |
| Slow responses | Network or DB overload | Check response times |
| Tool call timeouts | MCP server crashed | Restart MCP server |
SRE history logging
The Cloud UB Worker supports history logging via query parameter:
GET /sre/status?log=true
When log=true is passed, the SRE status result is also written to D1 as a timestamped record, enabling historical trend analysis.
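As a sketch of how a client might build this request URL, the helper below appends the `/sre/status` path and the optional `log=true` parameter to a base URL. The function name `sre_status_url` and the host `cloud-ub.example.com` are placeholders, not the real Worker address, and the sketch makes no network call.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def sre_status_url(base_url: str, log: bool = False) -> str:
    """Build the SRE status URL, adding log=true only when requested."""
    scheme, netloc, path, _query, _fragment = urlsplit(base_url)
    query = urlencode({"log": "true"}) if log else ""
    full_path = path.rstrip("/") + "/sre/status"
    return urlunsplit((scheme, netloc, full_path, query, ""))


# Placeholder host; substitute the actual Cloud UB Worker URL.
print(sre_status_url("https://cloud-ub.example.com", log=True))
# https://cloud-ub.example.com/sre/status?log=true
```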
Alerting (future work)
| Feature | Status | Description |
|---|---|---|
| Health endpoint monitoring | Planned | Periodic /health checks with alerting |
| D1 row count trending | Planned | Alert on unexpected growth or shrinkage |
| Agent offline detection | Planned | Alert when expected agent misses heartbeat window |
| Error rate tracking | Planned | Alert on elevated error rates |
| Automated recovery | Planned | Auto-restart MCP servers on failure |
These are inspection mirror gaps — capabilities we know we need.
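To make the planned agent offline detection concrete, here is one possible sketch. It assumes the last seen timestamps from the agent registry are available as a mapping of agent name to timezone-aware datetime. The function name `overdue_agents`, the agent names, and the 15-minute window are all hypothetical; the source does not define the heartbeat window.

```python
from datetime import datetime, timedelta, timezone


def overdue_agents(
    last_seen: dict[str, datetime],
    now: datetime,
    window: timedelta,
) -> list[str]:
    """Return agents whose last heartbeat is older than the window, sorted."""
    return sorted(name for name, seen in last_seen.items() if now - seen > window)


# Illustrative data only; agent names and window are assumptions.
now = datetime(2026, 3, 5, 10, 15, tzinfo=timezone.utc)
last_seen = {
    "agent-a": now - timedelta(minutes=2),
    "agent-b": now - timedelta(minutes=45),
}
print(overdue_agents(last_seen, now, window=timedelta(minutes=15)))  # ['agent-b']
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.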
Related pages
| Page | Relationship |
|---|---|
| Fleet Management | Operational overview |
| Data Residency | Infrastructure architecture |
| Backup & Recovery | What to do when things go wrong |
| Cloud UB MCP | Cloud UB server details |