Troubleshooting
Common failure modes — what they look like, why they happen, how to fix them.
A handful of patterns covers most of what goes wrong. This page is organised by symptom.
“zmesh: unreachable” in status
Symptom: zshell says it can’t reach the local zmesh.
Check, in order:
- Is `zmesh` running on this node? `ps aux | grep zmesh`.
- Is `SAMOZA_ZMESH_URL` correct? (Default: `http://127.0.0.1:9200`.)
- Is anything listening on the port? `curl http://127.0.0.1:9200/v1/status`.
- Firewall blocking it? On the same host this should never happen, but worth checking on virtualised setups.
“zrms: unreachable” in status
Same playbook as zmesh, against the resource manager:
- Is `zrms` running? `ps aux | grep zrms`.
- Is `SAMOZA_ZRMS_URL` correct? (Default: `http://127.0.0.1:7331`.)
- Is anything listening on the port? `curl http://127.0.0.1:7331/v1/health`.
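The zmesh and zrms checks follow the same three steps, so they can be wrapped in one small helper. The sketch below is illustrative only: `check_service` is a hypothetical function name, not a Samoza command, and it assumes the process name matches the service name.

```shell
# Sketch: run the reachability checks for one local service.
# check_service is a hypothetical helper, not part of Samoza.
check_service() {
  local name="$1" url="$2" path="$3"

  # 1. Is a process with this name running? (assumes the process is named
  #    after the service, as the "ps aux | grep" checks on this page do)
  if pgrep -f "$name" >/dev/null 2>&1; then
    echo "$name: process running"
  else
    echo "$name: no process found"
  fi

  # 2. Which URL are we about to probe?
  echo "$name: using $url"

  # 3. Does anything answer on that URL? -f makes HTTP errors count as failure.
  if curl -fsS --max-time 3 "$url$path" >/dev/null 2>&1; then
    echo "$name: endpoint responding"
  else
    echo "$name: endpoint NOT responding"
    return 1
  fi
}

# Usage (defaults taken from this page):
#   check_service zmesh "${SAMOZA_ZMESH_URL:-http://127.0.0.1:9200}" /v1/status
#   check_service zrms  "${SAMOZA_ZRMS_URL:-http://127.0.0.1:7331}"  /v1/health
```

If the process is running but the endpoint does not respond, the configured URL or a firewall is the more likely culprit.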
Edge node can’t connect to the control plane
Symptom: edge node starts, zmesh logs say it can’t reach the control plane.
# 1. Network reachability
curl http://cloud.example.com:9000/healthz
# 2. WebSocket URL format — must be ws://, NOT http://
echo $ZMESH_CONTROLPLANE
# Correct: ws://cloud.example.com:9000/ws
# 3. Firewall on the cloud node — port 9000 must be open
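The URL-format check in step 2 is easy to automate before starting an edge node. This is a minimal sketch: `check_controlplane_url` is a hypothetical helper name, and because this page only names `ws://` as correct, any other scheme (including `wss://`) is reported as unrecognised. Adjust that if your deployment is known to accept it.

```shell
# Sketch: validate the scheme of the control-plane URL.
# check_controlplane_url is a hypothetical helper, not part of Samoza.
check_controlplane_url() {
  case "$1" in
    ws://*)
      echo "ok: $1" ;;
    http://*|https://*)
      echo "wrong scheme: use ws:// instead of http(s):// in $1"
      return 1 ;;
    "")
      echo "ZMESH_CONTROLPLANE is empty or unset"
      return 1 ;;
    *)
      echo "unrecognised URL: $1"
      return 1 ;;
  esac
}

# Usage: check_controlplane_url "$ZMESH_CONTROLPLANE"
```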
Nodes don’t appear in cluster zhosts
A node started but never showed up.
# Check zbase registration on the cloud node
curl "http://127.0.0.1:7071/v1/instances?kind=zmesh" | jq .
# Each node must heartbeat to stay registered. TTL default is 30s.
If zbase shows the node but `cluster zhosts` doesn’t, the node is registered but zrms hasn’t claimed it yet; give it 10-30 seconds.
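Rather than re-running the check by hand during that window, you can poll. The sketch below rests on two assumptions that this page does not confirm: that the control plane's `/v1/zhosts` endpoint reflects what `cluster zhosts` shows, and that the node's name appears verbatim in the response body. `wait_for` and `node_listed` are hypothetical helper names.

```shell
# Sketch: retry a check until it succeeds or a deadline passes.
# Usage: wait_for <timeout_s> <interval_s> <command...>
wait_for() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  while ! "$@"; do
    waited=$((waited + interval))
    if [ "$waited" -gt "$timeout" ]; then
      return 1   # deadline passed without a successful check
    fi
    sleep "$interval"
  done
}

# Assumption: the node name appears as plain text in the /v1/zhosts response.
# The response format is not documented on this page.
node_listed() {
  curl -fsS --max-time 3 "http://127.0.0.1:9000/v1/zhosts" | grep -qF -- "$1"
}

# Usage: wait_for 30 5 node_listed edge-node-1 || echo "still missing after 30 s"
```

If the node is still missing after the timeout, go back to the zbase registration check rather than waiting longer.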
Topology stuck in pending
Symptom: `topo submit` succeeded, but the state stays `pending` for more than a minute.
Likely causes:
- No available zhosts. `cluster zhosts` shows everything `degraded` or empty.
- Insufficient resources. A space needs more CPU/memory than any zhost has free.
- Capability requirement. A space declared a capability (e.g. a specific sensor) no zhost provides.
Check zrms logs:
# On the cloud node
journalctl -u zrms -f # or wherever your logs go
You’ll usually see a line like `placement failed: no suitable zhost for space <name> (need: capability X)`.
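When several spaces fail placement, a one-line summary per space is easier to read than the raw log. The filter below is a sketch that assumes the message format quoted above; `placement_failures` is a hypothetical helper name.

```shell
# Sketch: list each space that failed placement with what it needed.
# Reads log text on stdin and prints "<space>: <need>" once per distinct pair.
placement_failures() {
  sed -nE 's/.*placement failed: no suitable zhost for space ([^ ]+) \(need: ([^)]*)\).*/\1: \2/p' \
    | sort -u
}

# Usage: journalctl -u zrms --no-pager | placement_failures
```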
Topology shows degraded
A zhost dropped out and recovery is in progress.
zshell[0] >> topo status <id>
# Watch the State field — it'll move to running if recovery succeeds,
# or failed if no zhost can host the orphaned spaces.
To accelerate or troubleshoot:
zshell[1] >> cluster zhosts # is the dropped zhost still degraded?
zshell[2] >> mesh ping # is the network healthy?
If recovery isn’t happening fast enough, the failure timeout (zrms config: `failure_detection.timeout`) may be set too high. The default is 30 s.
Topology marked failed
Recovery couldn’t place at least one space.
zshell[0] >> topo status <id>
# Look for FailureReason — it explains why
Most common reasons:
- No zhost satisfies capability requirement ’…’ — you need to add a zhost with that capability, or relax the requirement in your MEX.
- Insufficient capacity on remaining zhosts — add capacity or stop other workloads.
- Spatial constraint cannot be satisfied — the MEX pinned a space to a region or device that doesn’t have a healthy zhost.
“No spaces registered in the mesh”
Spaces are running locally (`spaces` shows them) but are invisible in `mesh spaces`.
Possible causes:
- The space hasn’t finished registering yet; wait 5-10 s.
- `zmesh` isn’t connected to `zrun`. Check `zmesh` logs for connection state.
- Space registration failed. Check `zrun` logs around the space’s start time.
High latency in mesh ping
Symptom: ping RTTs are seconds rather than milliseconds, or you see many hops where you expected direct paths.
- Physical network. Run a regular `ping` between hosts. If that’s slow too, it’s not Samoza.
- Multi-hop routing. If `mesh ping` shows two-hop paths to a node that should be direct, that node’s super-node connection is intermittent.
- Super-node overload. If one super is handling many edges, traffic concentrates. Consider electing additional supers.
Reading logs
Each service logs to stderr by default. Common patterns:
| Service | Useful greps |
|---|---|
| zmesh | connect, fail, register, spaces |
| zrms | placement, recovery, heartbeat, cluster manager started |
| zrun | space loaded, capability, wasm error, on_init |
| zbase | register, expire, query |
When in doubt:
tail -f /var/log/samoza/*.log | grep -iE "error|warn"
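The per-service keywords from the table can be kept in a small lookup so you don't retype them. The sketch below is one way to do that; `service_pattern` is a hypothetical helper name, and the log file name in the usage line is an assumption based on the `/var/log/samoza/*.log` path above.

```shell
# Sketch: map a service name to an extended-regex of its useful keywords.
service_pattern() {
  case "$1" in
    zmesh) echo 'connect|fail|register|spaces' ;;
    zrms)  echo 'placement|recovery|heartbeat|cluster manager started' ;;
    zrun)  echo 'space loaded|capability|wasm error|on_init' ;;
    zbase) echo 'register|expire|query' ;;
    *)     echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Usage (file name is an assumption):
#   grep -iE "$(service_pattern zrms)" /var/log/samoza/zrms.log
```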
Health-check endpoints
Quick “is this thing up” checks:
curl http://127.0.0.1:7071/healthz # zbase
curl http://127.0.0.1:9000/v1/zhosts # control plane
curl http://127.0.0.1:9200/v1/status # zmesh agent
curl http://127.0.0.1:7331/v1/health # zrms
curl http://127.0.0.1:8080/v0/health # zrun
If any of these don’t respond, you’ve narrowed the problem to that service.
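To run all five checks in one go, a loop over the same endpoints works. This is a minimal sketch: `check_all` is a hypothetical helper name, and it treats any HTTP error or timeout as "down".

```shell
# Sketch: probe every endpoint listed above and report which ones answer.
# Returns non-zero if at least one endpoint is down.
check_all() {
  local failed=0 name url
  while read -r name url; do
    # </dev/null keeps curl from consuming the endpoint list on stdin.
    if curl -fsS --max-time 3 "$url" >/dev/null 2>&1 </dev/null; then
      echo "UP    $name"
    else
      echo "DOWN  $name"
      failed=1
    fi
  done <<'EOF'
zbase http://127.0.0.1:7071/healthz
control-plane http://127.0.0.1:9000/v1/zhosts
zmesh http://127.0.0.1:9200/v1/status
zrms http://127.0.0.1:7331/v1/health
zrun http://127.0.0.1:8080/v0/health
EOF
  return "$failed"
}

# Usage: check_all || echo "at least one service is not responding"
```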
When to ask for help
If you’ve worked through the above and the issue is persistent or weird, capture:
- Output of `status` and `cluster` from the affected node.
- Output of `topo status <id>` for any affected topology.
- The last 200 lines of the relevant service’s log.
- The MEX manifest of the affected topology (without secrets).
Then file an issue or reach out to whoever’s standing up your Samoza deployment.
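Gathering the log excerpts can be scripted. The sketch below assumes that logs are written as files under `/var/log/samoza` (the path used in the `tail` example on this page; services log to stderr by default, so your setup may differ). `collect_logs` and `SAMOZA_LOG_DIR` are hypothetical names introduced here for illustration.

```shell
# Sketch: copy the last 200 lines of each service log into one directory.
# SAMOZA_LOG_DIR is a hypothetical override, not a Samoza setting.
collect_logs() {
  local out_dir="$1" log_dir="${SAMOZA_LOG_DIR:-/var/log/samoza}" f
  mkdir -p "$out_dir" || return 1
  for f in "$log_dir"/*.log; do
    [ -e "$f" ] || continue   # the glob matched nothing
    tail -n 200 "$f" > "$out_dir/$(basename "$f")"
  done
}

# Usage: collect_logs ./samoza-report
```

Review the collected files for secrets before attaching them to an issue.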