Cluster operations
Day-to-day cluster management — adding nodes, watching health, managing capacity.
Operating a Samoza cluster day to day happens mostly in `zshell`. This page covers the common tasks beyond the zshell tour: adding capacity, draining a node, and watching the right things.
Watching the cluster
Three commands answer 80% of “is everything fine?”:
```
zshell[0] >> status        # all four services on this node
zshell[1] >> cluster       # cluster-wide rollup
zshell[2] >> mesh peers    # who's connected to whom
```
If any of those show `unreachable`, `degraded`, or fewer peers than you expected, dig further with the per-service commands.
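If you want something scriptable rather than interactive, the per-service HTTP APIs mentioned under Observability below can be polled instead. A minimal sketch; the ports and the `/health` path are assumptions, not documented endpoints, so substitute whatever your configuration exposes:

```sh
#!/bin/sh
# Poll each service's HTTP health endpoint once a minute.
# Ports and the /health path are hypothetical placeholders.
while true; do
  for port in 9001 9002 9003 9004; do   # zmesh, zrms, zrun, zbase (assumed)
    curl -fsS "http://localhost:${port}/health" > /dev/null \
      || echo "$(date -u) port ${port}: unhealthy" >&2
  done
  sleep 60
done
```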
Adding a node to the cluster
A new edge node joining is just the install procedure with a different `ZMESH_ZHOST_ID`:

```sh
export ZMESH_CONTROLPLANE="ws://cloud.example.com:9000/ws"
export ZMESH_REGION="edge-site-N"
export ZMESH_CLUSTER="edge-cluster"
export ZMESH_ZHOST_ID="edge-node-N"
./bin/startz -config configs/samoza_edge.yaml
```
Once it's running, `cluster zhosts` will show it. From that moment `zrms` is allowed to place new spaces on it. Existing topologies aren't moved automatically; they stay where they are unless you explicitly redeploy.
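If you join nodes often, a small wrapper keeps the per-node values in one place. This is a hypothetical helper, not a shipped tool; it only parameterises the snippet above:

```sh
#!/bin/sh
# join-edge-node.sh: hypothetical wrapper around the install procedure above.
# Usage: ./join-edge-node.sh <node-number>
N="${1:?usage: $0 <node-number>}"
export ZMESH_CONTROLPLANE="ws://cloud.example.com:9000/ws"
export ZMESH_REGION="edge-site-${N}"
export ZMESH_CLUSTER="edge-cluster"
export ZMESH_ZHOST_ID="edge-node-${N}"
exec ./bin/startz -config configs/samoza_edge.yaml
```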
Removing a node
The clean shutdown sequence (sketched as a script after this list):

- Drain. Stop accepting new placements on the target node. (Today this is implicit: stop submitting topologies that target it; future versions will support `cluster drain <zhost>`.)
- Migrate workloads. `topo stop <id>` and re-submit, or wait for the node to be marked failed (~30 s after shutdown begins) and let recovery do it.
- Stop services. `pkill zrun; pkill zmesh; pkill zbase` on the node.
- Verify. `cluster zhosts` should no longer list it (after the failure timeout).
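Putting the steps together from an operator's terminal. The zshell steps are interactive, so they appear here as comments; `edge-node-N` and the topology IDs are placeholders:

```sh
# 1. Drain (implicit today): stop submitting topologies that target the node.
# 2. Migrate: in zshell, run `topo stop <id>` for each affected topology and
#    re-submit it elsewhere, or skip this and let failure recovery re-place them.
# 3. Stop the services on the node itself:
ssh edge-node-N 'pkill zrun; pkill zmesh; pkill zbase'
# 4. Verify: in zshell, `cluster zhosts` should drop the node once the
#    failure timeout (30 s by default) expires.
```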
Heartbeats and timeouts
Defaults:
| Setting | Default | Where |
|---|---|---|
| Heartbeat interval | 10 s | zrms config: `resource_reporting.interval` |
| Failure timeout | 30 s | zrms config: `failure_detection.timeout` |
| Health check interval | 5 s | zrms config: `failure_detection.check_interval` |
Tighten the timeout if your network is reliable and you want faster recovery. Loosen it if intermittent latency is causing false positives.
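Expressed as config, those defaults might look like the fragment below. The nesting is inferred from the dotted key names in the table, and the target file name is a placeholder, so verify against the actual zrms config schema before applying:

```sh
# Hypothetical zrms config fragment; nesting inferred from the dotted keys.
cat <<'EOF' >> configs/zrms.yaml
resource_reporting:
  interval: 10s        # heartbeat interval
failure_detection:
  timeout: 30s         # mark a zhost failed after this long without a heartbeat
  check_interval: 5s   # how often heartbeats are evaluated
EOF
```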
Capacity headroom
Each `cluster zhosts` row reports resource usage if you ask for it (support varies by build). The general rules:
- Aim for ≤70% steady-state utilisation.
- If a single node fails, can the rest absorb its workload? If not, you don’t have N+1.
- If two fail simultaneously, can recovery still place all critical topologies? If not, you don’t have N+2.
For production, N+1 minimum; N+2 for anything safety-related.
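A back-of-envelope way to check headroom, assuming roughly homogeneous nodes: post-failure utilisation is total load divided by surviving capacity.

```sh
# 4 nodes at 70% steady state, one failure (illustrative numbers):
nodes=4; steady=70; failed=1
echo $(( nodes * steady / (nodes - failed) ))   # → 93 (% per surviving node)
```

At 93% the survivors cope, barely; the same failure starting from 80% steady state would need 106%, which is another way of saying the cluster never had N+1.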
Failure recovery in practice
When `zrms` declares a node failed:

```
ClusterManager: zhost-delta lastHeartbeat older than 30s
        │
        ▼
RecoveryManager.HandleZhostFailure(zhost-delta)
        │
        ├── Mark zhost-delta unavailable
        │
        ├── Find topologies with spaces on zhost-delta
        │
        └── Re-place those spaces on healthy zhosts
                │
                ▼
For each affected topology:
  state: running → degraded → running   (if recovery succeeds)
  state: running → degraded → failed    (if no zhost can host it)
```
A `degraded` state means the system is actively trying to recover. A `failed` state means it gave up, usually because no remaining zhost has the capacity, capabilities, or hardware to host the orphaned spaces.
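One consequence of the defaults above worth internalising: detection is not instant. In the worst case the node dies immediately after sending a heartbeat, its `lastHeartbeat` crosses the 30 s threshold 30 s later, and the checker notices on its next 5 s tick:

```sh
timeout=30; check_interval=5
echo $(( timeout + check_interval ))   # → 35 s, worst case, before recovery starts
```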
Observability
Today (v1):
- `zshell` is the operator's interface.
- Each service (`zmesh`, `zrms`, `zrun`, `zbase`) has an HTTP API for health, status, and metrics; see the configuration reference (coming soon).
- Logs go to stderr by default; redirect or pipe into your log aggregator of choice (a minimal example follows this list).
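A minimal log-redirection example; `logger(1)` is just one option, and the `samoza` tag is an arbitrary label:

```sh
# Pipe all service output (logs go to stderr) into syslog under one tag.
./bin/startz -config configs/samoza_edge.yaml 2>&1 | logger -t samoza
```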
Future:
- A web UI surfacing cluster status, topology placements, and node health.
- Native Prometheus metrics endpoints.
- Distributed tracing across `zmesh → zrun → zmesh`.
See also
- Topology lifecycle — deploy, stop, recover, redeploy.
- Troubleshooting — when something’s wrong.
- Architecture — what each service is doing under the hood.