Topology & Chaos Patterns

This page focuses on cluster manipulation: node control, chaos patterns, and what the tooling supports today.

Node control availability

  • Supported: restart control via NodeControlHandle (compose runner).
  • Not supported: the local runner does not expose node control, and the k8s runner does not support it yet.
  • Not yet supported: peer blocking/unblocking and network partitions.

See also RunContext: BlockFeed & Node Control for the current node-control API surface and its limitations.

Chaos patterns to consider

  • Restarts: random restarts with a minimum delay/cooldown to test recovery (see the sketch after this list).
  • Partitions (planned): block/unblock peers to simulate partial isolation, then assert height convergence after healing.
  • Validator churn (planned): stop one validator and start another (new key) mid-run to test membership changes; expect convergence.
  • Load SLOs: push tx/DA rates and assert inclusion/availability budgets instead of only liveness.
  • API probes: poll HTTP/RPC endpoints during chaos to ensure external contracts stay healthy (shape + latency); see the probe sketch below.
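
A minimal sketch of the restart pattern, assuming a Rust harness. NodeControl here is a hypothetical stand-in for the compose runner's NodeControlHandle (its real surface is on the RunContext page), and the node selection, delay, and cooldown values are illustrative:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for NodeControlHandle; see the RunContext page
// for the real node-control API surface.
trait NodeControl {
    fn restart(&self, node: &str);
}

// Restart one node at a time, honoring a minimum delay between faults
// and a per-node cooldown so the cluster can recover between restarts.
fn restart_chaos(
    ctl: &dyn NodeControl,
    nodes: &[&str],
    min_delay: Duration,
    cooldown: Duration,
    window: Duration,
) {
    let start = Instant::now();
    let mut last_restart: Vec<Option<Instant>> = vec![None; nodes.len()];
    let mut i = 0;
    while start.elapsed() < window {
        std::thread::sleep(min_delay);
        // Round-robin keeps the sketch dependency-free; substitute a
        // real RNG (e.g. the rand crate) for random selection.
        i = (i + 1) % nodes.len();
        let cooled = last_restart[i].map_or(true, |t| t.elapsed() >= cooldown);
        if cooled {
            ctl.restart(nodes[i]);
            last_restart[i] = Some(Instant::now());
        }
    }
}
```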

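The API-probe pattern can be sketched the same way; reqwest (blocking feature) is assumed as the HTTP client, and the latency budget and probe interval are illustrative, not part of the tooling:

```rust
use std::time::{Duration, Instant};

// Poll an endpoint for the duration of a chaos window, failing fast if
// the response shape or latency budget degrades.
fn probe_api(url: &str, latency_budget: Duration, window: Duration) -> Result<(), String> {
    let start = Instant::now();
    while start.elapsed() < window {
        let t0 = Instant::now();
        let resp = reqwest::blocking::get(url).map_err(|e| e.to_string())?;
        let latency = t0.elapsed();
        // Shape check: status must be success; extend with body/schema
        // checks for the contracts you actually care about.
        if !resp.status().is_success() {
            return Err(format!("probe {url}: status {}", resp.status()));
        }
        // Latency SLO: every probe must come back within budget.
        if latency > latency_budget {
            return Err(format!("probe {url}: {latency:?} exceeds {latency_budget:?}"));
        }
        std::thread::sleep(Duration::from_secs(1));
    }
    Ok(())
}
```
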
Expectations to pair

  • Liveness/height convergence after chaos windows.
  • SLO checks: inclusion latency, DA responsiveness, API latency/shape.
  • Recovery checks: ensure nodes that were isolated or restarted catch up to cluster height within a timeout (see the convergence sketch below).
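
A sketch of the recovery check, with HeightSource as a hypothetical height query (in the real harness this would come from the BlockFeed or an RPC call); the lag tolerance and poll interval are illustrative:

```rust
use std::time::{Duration, Instant};

// Hypothetical height query; the real harness would read heights from
// the block feed or node RPC.
trait HeightSource {
    fn height(&self, node: &str) -> u64;
}

// Succeed once every node is within `max_lag` of the cluster's highest
// block; fail if that does not happen before the timeout.
fn assert_convergence(
    src: &dyn HeightSource,
    nodes: &[&str],
    max_lag: u64,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        let heights: Vec<u64> = nodes.iter().map(|n| src.height(n)).collect();
        let max = *heights.iter().max().expect("at least one node");
        if heights.iter().all(|&h| max - h <= max_lag) {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(format!(
                "convergence timeout: heights {heights:?} vs max {max} (allowed lag {max_lag})"
            ));
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```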

Guidance

  • Keep chaos realistic: avoid flapping or patterns you wouldn’t operate in prod.
  • Scope chaos: choose validators vs executors intentionally; don’t restart all nodes at once unless you’re testing full outages.
  • Combine chaos with observability: capture block feed/metrics and API health so failures are diagnosable.
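
Putting the pieces together, a chaos run might drive restarts and an API probe concurrently, then finish with a recovery check. This builds on the hypothetical sketches above; the health endpoint, port, and durations are illustrative:

```rust
use std::time::Duration;

// Run restart chaos with a live API probe, then assert the cluster
// converged. Reuses NodeControl, HeightSource, restart_chaos,
// probe_api, and assert_convergence from the sketches above.
fn chaos_run(
    ctl: &dyn NodeControl,
    src: &dyn HeightSource,
    nodes: &[&str],
) -> Result<(), String> {
    let window = Duration::from_secs(300);
    // Probe a (hypothetical) health endpoint for the whole window.
    let probe = std::thread::spawn(move || {
        probe_api("http://node-0:26657/health", Duration::from_millis(500), window)
    });
    restart_chaos(ctl, nodes, Duration::from_secs(20), Duration::from_secs(60), window);
    probe.join().map_err(|_| "probe thread panicked".to_string())??;
    assert_convergence(src, nodes, 2, Duration::from_secs(120))
}
```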