CONCEPTS

Mesh and zhosts

How the network discovers, peers, and routes messages between distributed Spaces.

A zhost is a single machine running the Samoza stack. Multiple zhosts mesh together to form a cluster. The mesh is what makes the cluster feel like one machine to the spaces that run on it.

Two kinds of zhost

ROLE                 RESPONSIBILITIES
--------------------------------------------------------------------------------
Super node           Coordinates cluster operations, receives heartbeats,
                     makes placement and recovery decisions.
Local node           Runs workloads. Reports to a super node.

The control plane (a separate zmesh-controlplane service) is the source of truth for which zhosts exist, what cluster they belong to, and which super node each one reports to. Super nodes are elected; they aren’t fixed in config.

A small cluster has one super and several locals. A larger one might have multiple supers for resilience.

How a zhost joins

zhost-b starts
      ▼
zmesh connects to control plane (ws://controlplane:9000)
      ▼
POST /v1/register
      ▼
Control plane returns:
  - Assigned super nodes
  - Cluster peers
      ▼
zmesh connects to super node (zhost-a)
      ▼
zhost-b is now part of the cluster

Once joined, the local node sends heartbeats and is eligible for workload placement.
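The join exchange above can be sketched in a few lines of Python. This is an illustrative model only: the `ControlPlane` class, the response fields `super_nodes` and `peers`, and the `register` helper are assumptions for the sketch, not the real Samoza API.

```python
def register(zhost_id, control_plane):
    """Model of POST /v1/register: the new zhost learns its supers and peers."""
    response = control_plane.register(zhost_id)
    return response["super_nodes"], response["peers"]


class ControlPlane:
    """Toy in-memory control plane: the source of truth for membership."""

    def __init__(self, super_nodes):
        self.super_nodes = list(super_nodes)
        self.members = set(super_nodes)

    def register(self, zhost_id):
        # Snapshot the existing peers before admitting the new zhost.
        peers = sorted(self.members)
        self.members.add(zhost_id)
        return {"super_nodes": self.super_nodes, "peers": peers}


cp = ControlPlane(super_nodes=["zhost-a"])
supers, peers = register("zhost-b", cp)
# zhost-b would now open a connection to supers[0] ("zhost-a")
```

After this exchange the new zhost knows which super node to report to and which peers it can reach, which is all the state the join flow needs.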

Heartbeats

Every local node sends a heartbeat to its super node every 10 seconds (configurable). The heartbeat carries:

  • Resource usage (CPU, memory, disk).
  • The list of currently running spaces.
  • A monotonic counter so duplicate or stale heartbeats are detectable.

The super node updates a lastHeartbeat[zhost] table. If a node misses heartbeats for ~30 seconds (configurable), it’s declared failed.
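A minimal sketch of that bookkeeping, assuming nothing beyond what the text describes (the class and method names here are illustrative, not the real implementation):

```python
FAILURE_TIMEOUT = 30.0  # seconds (configurable)


class HeartbeatTable:
    """Toy version of the super node's lastHeartbeat[zhost] table."""

    def __init__(self):
        self.last_counter = {}    # zhost -> highest counter seen
        self.last_heartbeat = {}  # zhost -> arrival time of that heartbeat

    def on_heartbeat(self, zhost, counter, now):
        # The monotonic counter makes duplicate or stale heartbeats detectable.
        if counter <= self.last_counter.get(zhost, -1):
            return False
        self.last_counter[zhost] = counter
        self.last_heartbeat[zhost] = now
        return True

    def failed(self, now):
        """Zhosts that have missed heartbeats past the timeout."""
        return [z for z, t in self.last_heartbeat.items()
                if now - t > FAILURE_TIMEOUT]


table = HeartbeatTable()
table.on_heartbeat("zhost-b", counter=1, now=0.0)
table.on_heartbeat("zhost-b", counter=1, now=5.0)  # duplicate: rejected
print(table.failed(now=40.0))  # ['zhost-b'] once the 30 s timeout elapses
```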

Failure detection and recovery

zhost-b stops sending heartbeats
      ▼ (after 30s timeout)
ClusterManager on zhost-a detects failure
      ▼
failureCallback(zhost-b)
      ▼
RecoveryManager.HandleZhostFailure(zhost-b)
      ├── Mark zhost-b unavailable
      ├── Find affected topologies
      └── Re-place affected spaces, excluding zhost-b

The runtime tries to recover automatically. A topology in this state shows up as degraded in topo status. When all spaces are re-placed and healthy, it returns to running.
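The three recovery steps can be sketched as one function. This is a hedged model of the logic described above, not the real RecoveryManager: the `placements` shape and the round-robin re-placement policy are assumptions for illustration.

```python
def handle_zhost_failure(failed, placements, zhosts):
    """Re-place every space that lived on the failed zhost.

    placements: space name -> {"zhost": ..., "topology": ...}
    Returns the set of topologies that were degraded by the failure.
    """
    # Step 1: the failed zhost is excluded from placement candidates.
    candidates = [z for z in zhosts if z != failed]
    # Step 2: find the spaces (and thus topologies) that were on it.
    affected = [s for s, p in placements.items() if p["zhost"] == failed]
    degraded = {placements[s]["topology"] for s in affected}
    # Step 3: re-place each affected space on a surviving zhost.
    for i, space in enumerate(affected):
        placements[space]["zhost"] = candidates[i % len(candidates)]
    return degraded


placements = {
    "web-frontend": {"zhost": "zhost-b", "topology": "topo-web-app"},
    "api-gateway":  {"zhost": "zhost-a", "topology": "topo-web-app"},
}
degraded = handle_zhost_failure("zhost-b", placements,
                                ["zhost-a", "zhost-b", "zhost-c"])
# web-frontend moves off zhost-b; topo-web-app reads as degraded meanwhile
```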

Cross-host message routing

Spaces in the same topology may live on different zhosts. The mesh handles routing transparently.

Space on zhost-a wants to send to Space on zhost-b
      ▼
zrun-a calls zmesh-a POST /v1/spaces/send
      ▼
zmesh-a looks up the destination space
      ├── Local? Deliver directly.
      └── Remote? Forward via P2P connection
               ▼
          zmesh-b receives message
               ▼
          POST to zrun-b /v0/spaces/message
               ▼
          Space on zhost-b receives message

The sending space has no idea whether the receiving space is local or three nodes away. Latency and routing path are observable in mesh ping.
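The fork in the diagram (local delivery vs. P2P forwarding) is a single lookup-and-branch. A minimal sketch, assuming a flat space-to-zhost registry and caller-supplied delivery functions (none of these names are the real zmesh API):

```python
def route(msg, dest_space, registry, local_zhost, deliver_local, forward):
    """Deliver locally if the destination space lives here, else forward."""
    dest_zhost = registry[dest_space]
    if dest_zhost == local_zhost:
        return deliver_local(dest_space, msg)
    return forward(dest_zhost, dest_space, msg)


registry = {"api-gateway": "zhost-b", "web-frontend": "zhost-a"}
log = []
route("hello", "api-gateway", registry, local_zhost="zhost-a",
      deliver_local=lambda s, m: log.append(("local", s, m)),
      forward=lambda z, s, m: log.append(("p2p", z, s, m)))
print(log)  # [('p2p', 'zhost-b', 'api-gateway', 'hello')]
```

The sending side never branches on distance itself; it only hands the message to its local zmesh, which is why the space cannot tell local from remote delivery.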

Discovery

Spaces register their location with zmesh at startup. The super node maintains a global registry — mesh spaces lists every registered space across the cluster:

zshell[0] >> mesh spaces
SPACE_NAME           ZHOST_ID           TOPOLOGY
--------------------------------------------------------------------------------
web-frontend         zhost-alpha        topo-web-app
api-gateway          zhost-beta         topo-web-app
data-processor       zhost-gamma        topo-pipeline

When a space moves (during recovery, for instance), the registry updates. Subsequent messages route to the new location.
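That update path is just a re-registration under the same space name. A toy sketch (the registry shape is assumed, not the real zmesh data structure):

```python
class SpaceRegistry:
    """Toy global registry: maps each space name to its current zhost."""

    def __init__(self):
        self.locations = {}

    def register(self, space, zhost):
        # Re-registering under the same name overwrites the old location.
        self.locations[space] = zhost

    def lookup(self, space):
        return self.locations[space]


reg = SpaceRegistry()
reg.register("web-frontend", "zhost-alpha")
reg.register("web-frontend", "zhost-delta")  # moved during recovery
print(reg.lookup("web-frontend"))  # zhost-delta: later sends go here
```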

Why a mesh

Three reasons.

No single point of failure for transport. Every zhost connects to multiple peers. If one peer is down, traffic flows through another.

Edge-friendly placement. A space that reads a sensor can run on the zhost physically next to it. A space that handles bulk compute can run in the cloud. The runtime sorts out routing.

Failure isolation. A zhost going down takes its workloads with it temporarily, but the mesh notices fast and replaces them elsewhere. The cluster heals around failures rather than waiting for human intervention.