
Troubleshooting

Common failure modes — what they look like, why they happen, how to fix them.

A handful of patterns covers most of what goes wrong. This page is organised by symptom.

“zmesh: unreachable” in status

Symptom: zshell says it can’t reach the local zmesh.

Check, in order:

  1. Is zmesh running on this node? ps aux | grep '[z]mesh' (the brackets stop grep from matching its own process).
  2. Is SAMOZA_ZMESH_URL correct? (Default: http://127.0.0.1:9200.)
  3. Is anything listening on the port? curl http://127.0.0.1:9200/v1/status.
  4. Firewall blocking it? On the same host this should never happen, but worth checking on virtualised setups.
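
The checks above can be run in one pass. This is a minimal sketch, assuming pgrep and curl are installed and that the process is named exactly zmesh; check_zmesh and zmesh_url are illustrative helper names, not Samoza commands.

```shell
#!/usr/bin/env bash
# Walk the zmesh checklist in order and stop at the first failing step.
# check_zmesh and zmesh_url are hypothetical helpers, not part of Samoza.

zmesh_url() {
  # Fall back to the documented default when SAMOZA_ZMESH_URL is unset or empty.
  printf '%s\n' "${SAMOZA_ZMESH_URL:-http://127.0.0.1:9200}"
}

check_zmesh() {
  local url
  url="$(zmesh_url)"

  # Step 1: is a process named exactly "zmesh" running? (Assumed process name.)
  if ! pgrep -x zmesh >/dev/null; then
    echo "zmesh process not found"
    return 1
  fi

  # Steps 2-3: does the configured URL answer on /v1/status?
  echo "probing ${url}/v1/status"
  if ! curl -fsS --max-time 3 -o /dev/null "${url}/v1/status"; then
    echo "no response from ${url}; check SAMOZA_ZMESH_URL, the port, and any firewall"
    return 1
  fi

  echo "zmesh reachable at ${url}"
}
```

Run check_zmesh on the affected node. The first message it prints tells you which step of the list to look at.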

“zrms: unreachable” in status

Same playbook as zmesh, against the resource manager:

  1. ps aux | grep '[z]rms'
  2. SAMOZA_ZRMS_URL correct? (Default: http://127.0.0.1:7331.)
  3. curl http://127.0.0.1:7331/v1/health.

Edge node can’t connect to the control plane

Symptom: edge node starts, zmesh logs say it can’t reach the control plane.

# 1. Network reachability
curl http://cloud.example.com:9000/healthz

# 2. WebSocket URL format — must be ws://, NOT http://
echo "$ZMESH_CONTROLPLANE"
# Correct: ws://cloud.example.com:9000/ws

# 3. Firewall on the cloud node — port 9000 must be open
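
A quick scheme check catches the URL-format mistake from step 2. The sketch below is illustrative only: ws_url_check is a made-up helper name, and it accepts just the ws:// form documented on this page.

```shell
# Classify a control-plane URL by its scheme.
# ws_url_check is a hypothetical helper, not a Samoza command.
ws_url_check() {
  case "$1" in
    ws://*)
      echo "ok" ;;
    http://*|https://*)
      echo "wrong scheme: use ws://"
      return 1 ;;
    "")
      echo "ZMESH_CONTROLPLANE is empty or unset"
      return 1 ;;
    *)
      echo "unrecognised URL: $1"
      return 1 ;;
  esac
}
```

Call it as ws_url_check "$ZMESH_CONTROLPLANE" on the edge node before restarting zmesh.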

Nodes don’t appear in cluster zhosts

A node started but never showed up.

# Check zbase registration on the cloud node
curl "http://127.0.0.1:7071/v1/instances?kind=zmesh" | jq .

# Each node must heartbeat to stay registered. TTL default is 30s.

If zbase shows the node but cluster zhosts doesn’t, the node is registered but zrms hasn’t claimed it yet — give it 10-30 seconds.
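
Rather than re-running the check by hand during that window, you can poll. This is a sketch under assumptions: wait_for and body_mentions are hypothetical helpers, and the node's name is assumed to appear verbatim in the response body, which this page does not confirm.

```shell
# Retry a command until it succeeds or the timeout (in whole seconds) is used up.
# Usage: wait_for <timeout_s> <interval_s> <command> [args...]
wait_for() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Succeeds when the body returned by a URL contains the given text.
# Assumption: the node name shows up as plain text in the response.
body_mentions() {
  curl -fsS --max-time 3 "$1" | grep -q -- "$2"
}
```

For example, wait_for 30 2 body_mentions "http://127.0.0.1:7071/v1/instances?kind=zmesh" edge-01 gives a node named edge-01 (a placeholder) up to 30 seconds to register, checking every 2 seconds.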

Topology stuck in pending

Symptom: topo submit succeeded, state stays pending for more than a minute.

Likely causes:

  • No available zhosts. cluster zhosts shows everything degraded or empty.
  • Insufficient resources. A space needs more CPU/memory than any zhost has free.
  • Capability requirement. A space declared a capability (e.g. a specific sensor) no zhost provides.

Check zrms logs:

# On the cloud node
journalctl -u zrms -f   # or wherever your logs go

You’ll usually see placement failed: no suitable zhost for space <name> (need: capability X).
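
When several spaces fail to place, a summary is easier to read than the raw log. This is a minimal sketch that assumes log lines follow the shape quoted above; placement_failures is a hypothetical helper name, and the real log format may carry extra fields.

```shell
# Read zrms log text on stdin and print "<space>: <need>" for each placement failure.
# Assumes the line shape:
#   placement failed: no suitable zhost for space <name> (need: <reason>)
placement_failures() {
  sed -n 's/.*placement failed: no suitable zhost for space \([^ ]*\) (need: \(.*\)).*/\1: \2/p'
}
```

Typical use: journalctl -u zrms --no-pager | placement_failures | sort | uniq -c, which counts how often each space and requirement pair appears.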

Topology shows degraded

A zhost dropped out and recovery is in progress.

zshell[0] >> topo status <id>
# Watch the State field — it'll move to running if recovery succeeds,
# or failed if no zhost can host the orphaned spaces.

To see why recovery is slow or stalled:

zshell[1] >> cluster zhosts          # is the dropped zhost still degraded?
zshell[2] >> mesh ping               # is the network healthy?

If recovery isn’t starting fast enough, the failure timeout (zrms config: failure_detection.timeout) may be set too high. The default is 30 s.

Topology marked failed

Recovery couldn’t place at least one space.

zshell[0] >> topo status <id>
# Look for FailureReason — it explains why

Most common reasons:

  • No zhost satisfies capability requirement ’…’ — you need to add a zhost with that capability, or relax the requirement in your MEX.
  • Insufficient capacity on remaining zhosts — add capacity or stop other workloads.
  • Spatial constraint cannot be satisfied — the MEX pinned a space to a region or device that doesn’t have a healthy zhost.

“No spaces registered in the mesh”

Spaces are running locally (spaces shows them) but invisible in mesh spaces.

Possible causes:

  1. The space hasn’t finished registering yet — wait 5-10 s.
  2. zmesh isn’t connected to zrun. Check zmesh logs for connection state.
  3. Space registration failed. Check zrun logs around the space’s start time.

High latency in mesh ping

Symptom: ping RTTs are seconds rather than milliseconds, or you see many hops where you expected direct paths.

  • Physical network. Run a regular ping between hosts. If that’s slow too, it’s not Samoza.
  • Multi-hop routing. If mesh ping shows two-hop paths to a node that should be direct, that node’s super-node connection is intermittent.
  • Super-node overload. If one super is handling many edges, traffic concentrates. Consider electing additional supers.
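
For the physical-network check, pulling the average out of ping's summary line makes hosts easy to compare. This is a sketch that assumes the usual Linux or macOS summary format (a line containing min/avg/max); avg_rtt_ms is a hypothetical helper name.

```shell
# Read ping output on stdin and print the average RTT in milliseconds.
# Works on summary lines such as:
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms
avg_rtt_ms() {
  awk -F'=' '/min\/avg\/max/ { split($2, parts, "/"); print parts[2] }'
}
```

Use it as ping -c 5 other-host | avg_rtt_ms, where other-host is a placeholder. If the plain ping average is already high, the slowness is in the network underneath Samoza.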

Reading logs

Each service logs to stderr by default. Common patterns:

Service   Useful greps
zmesh     connect, fail, register, spaces
zrms      placement, recovery, heartbeat, cluster manager started
zrun      space loaded, capability, wasm error, on_init
zbase     register, expire, query

When in doubt:

tail -f /var/log/samoza/*.log | grep -iE "error|warn"
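
The per-service greps listed above can be wrapped in a small filter so you don't retype them. This is a sketch only: service_pattern and service_grep are hypothetical helper names, and the search terms are copied from this page rather than from any official log schema.

```shell
# Map a service name to the search terms listed for it on this page.
service_pattern() {
  case "$1" in
    zmesh) echo 'connect|fail|register|spaces' ;;
    zrms)  echo 'placement|recovery|heartbeat|cluster manager started' ;;
    zrun)  echo 'space loaded|capability|wasm error|on_init' ;;
    zbase) echo 'register|expire|query' ;;
    *)     echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Filter log text on stdin down to lines matching one service's terms.
service_grep() {
  local pattern
  pattern="$(service_pattern "$1")" || return 1
  grep -iE -- "$pattern"
}
```

For example: journalctl -u zrms --no-pager | service_grep zrms.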

Health-check endpoints

Quick “is this thing up” checks:

curl http://127.0.0.1:7071/healthz                # zbase
curl http://127.0.0.1:9000/v1/zhosts              # control plane
curl http://127.0.0.1:9200/v1/status              # zmesh agent
curl http://127.0.0.1:7331/v1/health              # zrms
curl http://127.0.0.1:8080/v0/health              # zrun

If any of these don’t respond, you’ve narrowed the problem to that service.
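
To run all five checks at once and see only which services are down, a loop like the following works. It is a minimal sketch that assumes curl is installed and every service uses the default ports shown above; probe and check_endpoints are hypothetical helper names.

```shell
# Probe one URL; succeeds only on an HTTP success status within 3 seconds.
probe() {
  curl -fsS --max-time 3 -o /dev/null "$1"
}

# Probe every endpoint from the list above and report which ones do not answer.
# Returns non-zero if at least one endpoint is down.
check_endpoints() {
  local failed=0 name url
  while read -r name url; do
    if probe "$url" </dev/null 2>/dev/null; then
      printf 'ok   %s %s\n' "$name" "$url"
    else
      printf 'DOWN %s %s\n' "$name" "$url"
      failed=1
    fi
  done <<'EOF'
zbase http://127.0.0.1:7071/healthz
control-plane http://127.0.0.1:9000/v1/zhosts
zmesh http://127.0.0.1:9200/v1/status
zrms http://127.0.0.1:7331/v1/health
zrun http://127.0.0.1:8080/v0/health
EOF
  return "$failed"
}
```

Run check_endpoints on the node in question. Any DOWN line names the service to investigate first.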

When to ask for help

If you’ve worked through the above and the issue is persistent or weird, capture:

  • Output of status and cluster from the affected node.
  • Output of topo status <id> for any affected topology.
  • The last 200 lines of the relevant service’s log.
  • The MEX manifest of the affected topology (without secrets).

Then file an issue or reach out to whoever’s standing up your Samoza deployment.
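
The log part of that checklist is easy to script. This sketch assumes logs live under /var/log/samoza as *.log files, as in the tail example earlier; collect_logs and the output directory name are illustrative. The zshell outputs (status, cluster, topo status) still need to be copied by hand.

```shell
# Save the last 200 lines of each *.log file in a directory into a bundle directory.
# Usage: collect_logs [log_dir] [out_dir]
collect_logs() {
  local log_dir="${1:-/var/log/samoza}" out_dir="${2:-./samoza-diag}" f
  mkdir -p "$out_dir" || return 1
  for f in "$log_dir"/*.log; do
    [ -e "$f" ] || continue   # the glob matched nothing
    tail -n 200 "$f" > "$out_dir/$(basename "$f").tail"
  done
}
```

Check the resulting files for secrets before attaching them to an issue.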