Mesh and zhosts
How zhosts discover each other, peer, and route messages between distributed spaces.
A zhost is a single machine running the Samoza stack. Multiple zhosts mesh together to form a cluster. The mesh is what makes the cluster feel like one machine to the spaces that run on it.
Two kinds of zhost
| Role | Responsibilities |
|---|---|
| Super node | Coordinates cluster operations, receives heartbeats, makes placement and recovery decisions. |
| Local node | Runs workloads. Reports to a super node. |
The control plane (a separate zmesh-controlplane service) is the source of truth for which zhosts exist, what cluster they belong to, and which super node each one reports to. Super nodes are elected; they aren’t fixed in config.
A small cluster has one super and several locals. A larger one might have multiple supers for resilience.
How a zhost joins
zhost-b starts
│
▼
zmesh connects to control plane (ws://controlplane:9000)
│
▼
POST /v1/register
│
▼
Control plane returns:
- Assigned super nodes
- Cluster peers
│
▼
zmesh connects to super node (zhost-a)
│
▼
zhost-b is now part of the cluster
Once joined, the local node sends heartbeats and is eligible for workload placement.
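The shape of that exchange is small enough to sketch. Below is a minimal Go sketch, assuming illustrative request/response field names; only the /v1/register path and the returned super-node and peer lists come from the flow above, and it models the registration as a plain HTTP POST for brevity rather than the websocket session shown in the diagram.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// RegisterRequest/RegisterResponse are illustrative; the real wire
// format may differ.
type RegisterRequest struct {
	ZhostID string `json:"zhost_id"`
}

type RegisterResponse struct {
	SuperNodes []string `json:"super_nodes"` // super nodes to report to
	Peers      []string `json:"peers"`       // other members of the cluster
}

// register POSTs to /v1/register and decodes the assignment.
func register(controlPlane, zhostID string) (*RegisterResponse, error) {
	body, err := json.Marshal(RegisterRequest{ZhostID: zhostID})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(controlPlane+"/v1/register", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out RegisterResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	info, err := register("http://controlplane:9000", "zhost-b")
	if err != nil {
		panic(err)
	}
	fmt.Println("supers:", info.SuperNodes, "peers:", info.Peers)
}
```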
Heartbeats
Every local node sends a heartbeat to its super node every 10 seconds (configurable). The heartbeat carries:
- Resource usage (CPU, memory, disk).
- The list of currently running spaces.
- A monotonic counter so duplicate or stale heartbeats are detectable.
The super node updates a lastHeartbeat[zhost] table. If a node misses heartbeats for ~30 seconds (configurable), it’s declared failed.
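A Go sketch of that bookkeeping, built around the lastHeartbeat table and monotonic counter described above. The struct layout and method names are assumptions; the heartbeat contents, the ~30-second timeout, the duplicate/stale check, and the failure callback are from the text.

```go
package cluster

import (
	"sync"
	"time"
)

// Heartbeat carries what the text above describes: resource usage, the
// running spaces, and a monotonic counter.
type Heartbeat struct {
	ZhostID string
	CPU     float64  // fraction of CPU in use
	MemMB   uint64
	DiskMB  uint64
	Spaces  []string // currently running spaces
	Seq     uint64   // monotonic counter for stale/duplicate detection
}

// ClusterManager is the super node's view of who is alive. Field and
// method names here are illustrative.
type ClusterManager struct {
	mu            sync.Mutex
	lastHeartbeat map[string]time.Time
	lastSeq       map[string]uint64
	timeout       time.Duration // ~30s by default, configurable
	failureCb     func(zhostID string)
}

// Record applies a heartbeat, discarding anything at or below the last
// sequence number seen from that zhost.
func (c *ClusterManager) Record(hb Heartbeat) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if hb.Seq <= c.lastSeq[hb.ZhostID] {
		return // duplicate or stale; ignore
	}
	c.lastSeq[hb.ZhostID] = hb.Seq
	c.lastHeartbeat[hb.ZhostID] = time.Now()
}

// Sweep runs periodically and declares failed any zhost whose last
// heartbeat is older than the timeout, firing the failure callback.
func (c *ClusterManager) Sweep() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, last := range c.lastHeartbeat {
		if time.Since(last) > c.timeout {
			delete(c.lastHeartbeat, id)
			go c.failureCb(id)
		}
	}
}
```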
Failure detection and recovery
zhost-b stops sending heartbeats
│
▼ (after 30s timeout)
ClusterManager on zhost-a detects failure
│
▼
failureCallback(zhost-b)
│
▼
RecoveryManager.HandleZhostFailure(zhost-b)
│
├── Mark zhost-b unavailable
│
├── Find affected topologies
│
└── Re-place affected spaces, excluding zhost-b
The runtime tries to recover automatically. A topology in this state shows up as degraded in topo status. When all spaces are re-placed and healthy, it returns to running.
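The diagram names the pieces; here is a minimal Go sketch of how they fit together. HandleZhostFailure, the unavailable marking, and the rule of excluding the failed zhost during re-placement come from the flow above; the ClusterState and Placer interfaces are assumptions for illustration.

```go
package recovery

// ClusterState and Placer are illustrative interfaces, not the real API.
type ClusterState interface {
	MarkUnavailable(zhostID string)
	TopologiesOn(zhostID string) []string       // topologies with spaces on this zhost
	SpacesOn(zhostID, topology string) []string // that topology's spaces on this zhost
}

type Placer interface {
	// Place schedules a space onto a healthy zhost not in exclude.
	Place(space string, exclude []string) error
}

type RecoveryManager struct {
	state  ClusterState
	placer Placer
}

// HandleZhostFailure mirrors the three steps in the diagram: mark the
// zhost unavailable, find the topologies it hosted, and re-place each
// affected space somewhere else.
func (r *RecoveryManager) HandleZhostFailure(zhostID string) {
	r.state.MarkUnavailable(zhostID)
	for _, topo := range r.state.TopologiesOn(zhostID) {
		for _, space := range r.state.SpacesOn(zhostID, topo) {
			// Exclude the failed zhost so the space lands elsewhere;
			// until this finishes, the topology reports degraded.
			_ = r.placer.Place(space, []string{zhostID})
		}
	}
}
```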
Cross-host message routing
Spaces in the same topology may live on different zhosts. The mesh handles routing transparently.
Space on zhost-a wants to send to Space on zhost-b
│
▼
zrun-a calls zmesh-a POST /v1/spaces/send
│
▼
zmesh-a looks up the destination space
│
├── Local? Deliver directly.
│
└── Remote? Forward via P2P connection
│
▼
zmesh-b receives message
│
▼
POST to zrun-b /v0/spaces/message
│
▼
Space on zhost-b receives message
The sending space has no idea whether the receiving space is local or three nodes away. Latency and routing path are observable in mesh ping.
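The fork in that diagram, local delivery versus P2P forwarding, reduces to one lookup and one branch. A Go sketch, assuming hypothetical Registry and Peer types; only the decision itself is from the flow above.

```go
package mesh

import "fmt"

type Message struct {
	From, To string // space names
	Payload  []byte
}

// Registry maps a space name to the zhost it currently runs on.
type Registry interface {
	Lookup(space string) (zhostID string, ok bool)
}

// Peer is an open P2P connection to another zhost's zmesh.
type Peer interface {
	Forward(Message) error
}

type Node struct {
	selfID   string
	registry Registry
	peers    map[string]Peer     // keyed by zhost ID
	local    func(Message) error // hands the message to the local zrun
}

// Send implements the local/remote fork: deliver directly if the
// destination space runs here, otherwise forward over the mesh.
func (n *Node) Send(msg Message) error {
	dest, ok := n.registry.Lookup(msg.To)
	if !ok {
		return fmt.Errorf("no registered location for space %q", msg.To)
	}
	if dest == n.selfID {
		return n.local(msg)
	}
	peer, ok := n.peers[dest]
	if !ok {
		return fmt.Errorf("no peer connection to %s", dest)
	}
	return peer.Forward(msg)
}
```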
Discovery
Spaces register their location with zmesh at startup. The super node maintains a global registry; mesh spaces lists every registered space across the cluster:
zshell[0] >> mesh spaces
SPACE_NAME ZHOST_ID TOPOLOGY
--------------------------------------------------------------------------------
web-frontend zhost-alpha topo-web-app
api-gateway zhost-beta topo-web-app
data-processor zhost-gamma topo-pipeline
When a space moves (during recovery, for instance), the registry updates. Subsequent messages route to the new location.
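Conceptually the registry is a name-to-location map where a newer registration overwrites the old one. A minimal Go sketch, with assumed type names; only the register-at-startup and update-on-move behavior comes from the text.

```go
package registry

import "sync"

type SpaceLocation struct {
	ZhostID  string
	Topology string
}

// SpaceRegistry is the super node's global map of where spaces live.
type SpaceRegistry struct {
	mu     sync.RWMutex
	byName map[string]SpaceLocation
}

// Register is called when a space starts, and again when recovery moves
// it; the newer location simply overwrites the old one, so subsequent
// messages route to the new zhost.
func (r *SpaceRegistry) Register(space string, loc SpaceLocation) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.byName == nil {
		r.byName = make(map[string]SpaceLocation)
	}
	r.byName[space] = loc
}

func (r *SpaceRegistry) Lookup(space string) (SpaceLocation, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	loc, ok := r.byName[space]
	return loc, ok
}
```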
Why a mesh
Three reasons.
No single point of failure for transport. Every zhost connects to multiple peers. If one peer is down, traffic flows through another.
Edge-friendly placement. A space that reads a sensor can run on the zhost physically next to it. A space that handles bulk compute can run in the cloud. The runtime sorts out routing.
Failure isolation. A zhost going down takes its workloads with it temporarily, but the mesh notices fast and replaces them elsewhere. The cluster heals around failures rather than waiting for human intervention.