Cluster operations
Day-to-day cluster management — adding nodes, watching health, managing capacity.
Operating a Samoza cluster day to day happens mostly in `zshell`. This page covers the common tasks beyond the zshell tour: adding capacity, draining a node, and watching the right things.
Watching the cluster
Three commands answer 80% of “is everything fine?”:
```
zshell[0] >> status        # all four services on this node
zshell[1] >> cluster       # cluster-wide rollup
zshell[2] >> mesh peers    # who's connected to whom
```
If any of those show `unreachable`, `degraded`, or fewer peers than you expected, dig further with the per-service commands.
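If you want something scriptable rather than interactive, the per-service HTTP APIs mentioned under Observability below can be polled instead. A minimal sketch; the ports and the `/health` path are assumptions, not documented endpoints, so substitute whatever your configuration exposes:

```sh
#!/bin/sh
# Poll each service's HTTP health endpoint once a minute.
# Ports and the /health path are hypothetical placeholders.
while true; do
  for port in 9001 9002 9003 9004; do   # zmesh, zrms, zrun, zbase (assumed)
    curl -fsS "http://localhost:${port}/health" > /dev/null \
      || echo "$(date -u) port ${port}: unhealthy" >&2
  done
  sleep 60
done
```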
Adding a node to the cluster
A new edge node joining is just the install procedure with a different `ZMESH_ZHOST_ID`:

```sh
export ZMESH_CONTROLPLANE="ws://cloud.example.com:9000/ws"
export ZMESH_REGION="edge-site-N"
export ZMESH_CLUSTER="edge-cluster"
export ZMESH_ZHOST_ID="edge-node-N"
./bin/startz -config configs/samoza_edge.yaml
```
Once it's running, `cluster zhosts` will show it. From that moment `zrms` is allowed to place new spaces on it. Existing topologies aren't moved automatically; they stay where they are unless you explicitly redeploy.
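If you join nodes often, a small wrapper keeps the per-node values in one place. This is a hypothetical helper, not a shipped tool; it only parameterises the snippet above:

```sh
#!/bin/sh
# join-edge-node.sh: hypothetical wrapper around the install procedure above.
# Usage: ./join-edge-node.sh <node-number>
N="${1:?usage: $0 <node-number>}"
export ZMESH_CONTROLPLANE="ws://cloud.example.com:9000/ws"
export ZMESH_REGION="edge-site-${N}"
export ZMESH_CLUSTER="edge-cluster"
export ZMESH_ZHOST_ID="edge-node-${N}"
exec ./bin/startz -config configs/samoza_edge.yaml
```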
Removing a node
The clean shutdown sequence (sketched as a script after this list):

- Drain. Stop accepting new placements on the target node. (Today this is implicit: stop submitting topologies that target it; future versions will support `cluster drain <zhost>`.)
- Migrate workloads. `topo stop <id>` and re-submit, or wait for the node to be marked failed (~30 s after shutdown begins) and let recovery do it.
- Stop services. `pkill zrun; pkill zmesh; pkill zbase` on the node.
- Verify. `cluster zhosts` should no longer list it (after the failure timeout).
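Putting the steps together from an operator's terminal. The zshell steps are interactive, so they appear here as comments; `edge-node-N` and the topology IDs are placeholders:

```sh
# 1. Drain (implicit today): stop submitting topologies that target the node.
# 2. Migrate: in zshell, run `topo stop <id>` for each affected topology and
#    re-submit it elsewhere, or skip this and let failure recovery re-place them.
# 3. Stop the services on the node itself:
ssh edge-node-N 'pkill zrun; pkill zmesh; pkill zbase'
# 4. Verify: in zshell, `cluster zhosts` should drop the node once the
#    failure timeout (30 s by default) expires.
```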
Heartbeats and timeouts
Defaults:
| Setting | Default | Where |
|---|---|---|
| Heartbeat interval | 10 s | zrms config: `resource_reporting.interval` |
| Failure timeout | 30 s | zrms config: `failure_detection.timeout` |
| Health check interval | 5 s | zrms config: `failure_detection.check_interval` |
Tighten the timeout if your network is reliable and you want faster recovery. Loosen it if intermittent latency is causing false positives.
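Expressed as config, those defaults might look like the fragment below. The nesting is inferred from the dotted key names in the table, and the target file name is a placeholder, so verify against the actual zrms config schema before applying:

```sh
# Hypothetical zrms config fragment; nesting inferred from the dotted keys.
cat <<'EOF' >> configs/zrms.yaml
resource_reporting:
  interval: 10s        # heartbeat interval
failure_detection:
  timeout: 30s         # mark a zhost failed after this long without a heartbeat
  check_interval: 5s   # how often heartbeats are evaluated
EOF
```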
Capacity headroom
Each `cluster zhosts` row reports resource usage if you ask for it (support varies by build). The general rules:
- Aim for ≤70% steady-state utilisation.
- If a single node fails, can the rest absorb its workload? If not, you don’t have N+1.
- If two fail simultaneously, can recovery still place all critical topologies? If not, you don’t have N+2.
For production, N+1 minimum; N+2 for anything safety-related.
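A back-of-envelope way to check headroom, assuming roughly homogeneous nodes: post-failure utilisation is total load divided by surviving capacity.

```sh
# 4 nodes at 70% steady state, one failure (illustrative numbers):
nodes=4; steady=70; failed=1
echo $(( nodes * steady / (nodes - failed) ))   # → 93 (% per surviving node)
```

At 93% the survivors cope, barely; the same failure starting from 80% steady state would need 106%, which is another way of saying the cluster never had N+1.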
Failure recovery in practice
When `zrms` declares a node failed:

```
ClusterManager: zhost-delta lastHeartbeat older than 30s
        │
        ▼
RecoveryManager.HandleZhostFailure(zhost-delta)
        │
        ├── Mark zhost-delta unavailable
        │
        ├── Find topologies with spaces on zhost-delta
        │
        └── Re-place those spaces on healthy zhosts
                │
                ▼
For each affected topology:
  state: running → degraded → running   (if recovery succeeds)
  state: running → degraded → failed    (if no zhost can host it)
```
A `degraded` state means the system is actively trying to recover. A `failed` state means it gave up, usually because no remaining zhost has the capacity, capabilities, or hardware to host the orphaned spaces.
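One consequence of the defaults above worth internalising: detection is not instant. In the worst case the node dies immediately after sending a heartbeat, its `lastHeartbeat` crosses the 30 s threshold 30 s later, and the checker notices on its next 5 s tick:

```sh
timeout=30; check_interval=5
echo $(( timeout + check_interval ))   # → 35 s, worst case, before recovery starts
```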
Observability
Today (v1):
- `zshell` is the operator's interface.
- Each service (`zmesh`, `zrms`, `zrun`, `zbase`) has an HTTP API for health, status, and metrics; see the configuration reference (coming soon).
- Logs go to stderr by default; redirect or pipe into your log aggregator of choice (a minimal example follows this list).
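A minimal log-redirection example; `logger(1)` is just one option, and the `samoza` tag is an arbitrary label:

```sh
# Pipe all service output (logs go to stderr) into syslog under one tag.
./bin/startz -config configs/samoza_edge.yaml 2>&1 | logger -t samoza
```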
Future:
- A web UI surfacing cluster status, topology placements, and node health.
- Native Prometheus metrics endpoints.
- Distributed tracing across `zmesh → zrun → zmesh`.
See also
- Topology lifecycle — deploy, stop, recover, redeploy.
- Troubleshooting — when something’s wrong.
- Architecture — what each service is doing under the hood.