Topology & Chaos Patterns
This page focuses on cluster manipulation: node control, chaos patterns, and what the tooling supports today.
Node control availability
- Supported: restart control via NodeControlHandle (compose runner); a minimal restart sketch follows this section.
- Not supported: the local runner does not expose node control, and the k8s runner does not support it yet.
- Not yet supported: peer blocking/unblocking and network partitions.
See also: "RunContext: BlockFeed & Node Control" for the current node-control API surface and limitations.
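To make the supported surface concrete, here is a minimal sketch in Go. The NodeControl interface, its Restart method, and the RestartWithTimeout helper are hypothetical stand-ins for the actual NodeControlHandle API, which may differ.

```go
package chaos

import (
	"context"
	"fmt"
	"time"
)

// NodeControl is a hypothetical stand-in for the compose runner's
// NodeControlHandle; restart is the only operation supported today.
type NodeControl interface {
	Restart(ctx context.Context, node string) error
}

// RestartWithTimeout restarts one node and bounds the operation so a
// stuck container cannot stall the whole run.
func RestartWithTimeout(nc NodeControl, node string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := nc.Restart(ctx, node); err != nil {
		return fmt.Errorf("restart %s: %w", node, err)
	}
	return nil
}
```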
Chaos patterns to consider
- Restarts: random restarts with a minimum delay/cooldown to test recovery (see the restart-loop sketch after this list).
- Partitions (planned): block/unblock peers to simulate partial isolation, then assert height convergence after healing.
- Validator churn (planned): stop one validator and start another (new key) mid-run to test membership changes; expect convergence.
- Load SLOs: push tx/DA rates and assert inclusion/availability budgets instead of only liveness (an SLO-assertion sketch follows the list).
- API probes: poll HTTP/RPC endpoints during chaos to ensure external contracts stay healthy in both shape and latency (a probe sketch follows the list).
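A sketch of the random-restart pattern, assuming nothing more than a restart callback (which the compose runner's node-control handle could back); the delay bounds and per-node cooldown are illustrative parameters, not tooling defaults.

```go
package chaos

import (
	"context"
	"math/rand"
	"time"
)

// RestartFn restarts a single node; in practice this would be backed by
// the compose runner's node-control handle.
type RestartFn func(ctx context.Context, node string) error

// RestartChaos restarts one random eligible node per iteration, sleeping a
// random interval in [minDelay, maxDelay] between restarts and honoring a
// per-node cooldown so the same node is not flapped repeatedly.
func RestartChaos(ctx context.Context, restart RestartFn, nodes []string,
	minDelay, maxDelay, cooldown time.Duration) {
	lastRestart := make(map[string]time.Time)
	for {
		delay := minDelay
		if span := maxDelay - minDelay; span > 0 {
			delay += time.Duration(rand.Int63n(int64(span)))
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
		// Only nodes whose cooldown has expired are eligible this round.
		var eligible []string
		for _, n := range nodes {
			if time.Since(lastRestart[n]) >= cooldown {
				eligible = append(eligible, n)
			}
		}
		if len(eligible) == 0 {
			continue
		}
		target := eligible[rand.Intn(len(eligible))]
		if err := restart(ctx, target); err != nil {
			continue // a real harness would record this for diagnosis
		}
		lastRestart[target] = time.Now()
	}
}
```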
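For load SLOs, a sketch of turning recorded inclusion latencies into a pass/fail budget; the nearest-rank percentile and the p95 target are illustrative choices, not values the tooling prescribes.

```go
package chaos

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 1) of the samples
// using the nearest-rank method; the input slice is not modified.
func percentile(samples []time.Duration, p float64) time.Duration {
	s := append([]time.Duration(nil), samples...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	rank := int(math.Ceil(p*float64(len(s)))) - 1
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

// AssertInclusionSLO fails when the p95 inclusion latency exceeds the
// budget, rather than only checking that blocks keep being produced.
func AssertInclusionSLO(inclusion []time.Duration, budget time.Duration) error {
	if len(inclusion) == 0 {
		return fmt.Errorf("no inclusion samples recorded")
	}
	if p95 := percentile(inclusion, 0.95); p95 > budget {
		return fmt.Errorf("p95 inclusion latency %v exceeds budget %v", p95, budget)
	}
	return nil
}
```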
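For API probes, a sketch that checks both response shape and latency on each poll; the statusShape fields and the JSON status endpoint are placeholders, not the real API contract.

```go
package chaos

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusShape is the minimal response shape the probe requires; the
// field set here is a placeholder, not the real API contract.
type statusShape struct {
	Height uint64 `json:"height"`
}

// ProbeAPI issues one GET and checks status code, response shape, and
// latency, returning an error a chaos run can record as an SLO violation.
func ProbeAPI(client *http.Client, url string, maxLatency time.Duration) error {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("probe %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("probe %s: status %d", url, resp.StatusCode)
	}
	var got statusShape
	if err := json.NewDecoder(resp.Body).Decode(&got); err != nil {
		return fmt.Errorf("probe %s: bad shape: %w", url, err)
	}
	if lat := time.Since(start); lat > maxLatency {
		return fmt.Errorf("probe %s: latency %v over budget %v", url, lat, maxLatency)
	}
	return nil
}
```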
Expectations to pair
- Liveness/height convergence after chaos windows.
- SLO checks: inclusion latency, DA responsiveness, API latency/shape.
- Recovery checks: ensure nodes that were isolated or restarted catch up to cluster height within a timeout (see the convergence sketch below).
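A sketch of the recovery check, assuming a hypothetical per-node height accessor (in a real harness this would come from the block feed or an RPC call); the tolerance and one-second polling interval are illustrative.

```go
package chaos

import (
	"context"
	"fmt"
	"time"
)

// HeightFn is a hypothetical accessor for a node's current height.
type HeightFn func(ctx context.Context, node string) (uint64, error)

// WaitForConvergence polls until every node is within tol blocks of the
// highest node observed, or fails once the timeout expires.
func WaitForConvergence(ctx context.Context, height HeightFn, nodes []string,
	tol uint64, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		lo, hi := ^uint64(0), uint64(0)
		healthy := true
		for _, n := range nodes {
			h, err := height(ctx, n)
			if err != nil {
				healthy = false // a restarting node may not answer yet
				break
			}
			if h < lo {
				lo = h
			}
			if h > hi {
				hi = h
			}
		}
		if healthy && hi-lo <= tol {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("nodes did not converge within %d blocks after %v", tol, timeout)
		case <-ticker.C:
		}
	}
}
```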
Guidance
- Keep chaos realistic: avoid flapping or failure patterns you would never tolerate in production.
- Scope chaos: choose validators vs executors intentionally; don’t restart all nodes at once unless you’re testing full outages (a quorum-aware target-selection sketch follows this list).
- Combine chaos with observability: capture block feed/metrics and API health so failures are diagnosable.
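One way to scope chaos deliberately, sketched under the assumption that validators run a BFT-style quorum (3f+1) while executors tolerate a configurable fraction offline; the role split and fraction are assumptions for illustration, not tooling behavior.

```go
package chaos

import "math/rand"

// PickChaosTargets selects restart targets without breaking consensus:
// at most f validators (where 3f+1 <= len(validators)) plus a fraction
// of executors, so a >2/3 validator quorum survives the chaos window.
func PickChaosTargets(validators, executors []string, executorFraction float64) []string {
	f := (len(validators) - 1) / 3 // max simultaneous validator faults
	if f < 0 {
		f = 0
	}
	targets := make([]string, 0, f)
	perm := rand.Perm(len(validators))
	for _, i := range perm[:f] {
		targets = append(targets, validators[i])
	}
	n := int(float64(len(executors)) * executorFraction)
	perm = rand.Perm(len(executors))
	for _, i := range perm[:n] {
		targets = append(targets, executors[i])
	}
	return targets
}
```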