Best Practices

This page collects proven patterns for authoring, running, and maintaining test scenarios that are reliable, maintainable, and actionable.

Scenario Design

State your intent

  • Document the goal of each scenario (throughput, DA validation, resilience) so expectation choices are obvious
  • Use descriptive variable names that explain topology purpose (e.g., star_topology_3val_2exec vs topology)
  • Add comments explaining why specific rates or durations were chosen
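
For instance (the constant name here is hypothetical and the rate comment illustrative), intent can live in the identifiers and comments themselves:

// Star topology sized for DA validation: 3 validators for consensus,
// 2 executors so dispersal and sampling are both exercised.
let star_topology_3val_2exec = TopologyBuilder::new()
    .network_star()
    .validators(3)
    .executors(2)
    .generate();

// 10 tx/s stays well below saturation, so inclusion checks remain meaningful.
const TARGET_TX_RATE: f64 = 10.0;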

Keep runs meaningful

  • Choose durations that allow multiple blocks and make timing-based assertions trustworthy
  • Use the FAQ: Run Duration Calculator to estimate the minimum duration, or start from a rough estimate like the sketch below this list
  • Avoid runs shorter than 30 seconds unless testing startup behavior specifically
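
As a rough back-of-the-envelope check (not a substitute for the calculator), derive a floor for the duration from the number of blocks your assertions need and the expected block interval; this helper and its 10-second headroom are illustrative assumptions:

use std::time::Duration;

/// Illustrative heuristic: allow enough wall-clock time for `target_blocks`
/// blocks at the expected interval, plus headroom for startup and propagation.
fn minimum_run_duration(target_blocks: u32, expected_block_interval: Duration) -> Duration {
    expected_block_interval * target_blocks + Duration::from_secs(10)
}

// e.g. 10 blocks at roughly 5s each => at least 60s, well above the 30-second floor.
let run_for = minimum_run_duration(10, Duration::from_secs(5));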

Separate concerns

  • Start with deterministic workloads for functional checks
  • Add chaos in dedicated resilience scenarios to avoid noisy failures
  • Don’t mix high transaction load with aggressive chaos in the same test (hard to debug)

Start small, scale up

  • Begin with a minimal topology (1-2 validators) to validate scenario logic (see the sketch below)
  • Gradually increase topology size and workload rates
  • Use Host runner for fast iteration, then validate on Compose before production
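
Scaling the topology in steps might look like the following, reusing the builder methods from the preset example in the next section (variable names are illustrative):

// First pass: the smallest topology that exercises the scenario logic.
let smoke_topology = TopologyBuilder::new()
    .network_star()
    .validators(1)
    .executors(1)
    .generate();

// Once that passes reliably, move to the shared preset and higher rates.
let full_topology = standard_da_topology();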

Code Organization

Reuse patterns

  • Standardize on shared topology and workload presets so results are comparable across environments and teams
  • Extract common topology builders into helper functions
  • Create workspace-level constants for standard rates and durations

Example: Topology preset

pub fn standard_da_topology() -> GeneratedTopology {
    TopologyBuilder::new()
        .network_star()
        .validators(3)
        .executors(2)
        .generate()
}

Example: Shared constants

pub const STANDARD_TX_RATE: f64 = 10.0;
pub const STANDARD_DA_CHANNEL_RATE: f64 = 2.0;
pub const SHORT_RUN_DURATION: Duration = Duration::from_secs(60);
pub const LONG_RUN_DURATION: Duration = Duration::from_secs(300);

Debugging & Observability

Observe first, tune second

  • Rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology
  • Enable detailed logging (RUST_LOG=debug, NOMOS_LOG_LEVEL=debug) only after an initial failure
  • Use NOMOS_TESTS_KEEP_LOGS=1 to persist logs when debugging failures

Use BlockFeed effectively

  • Subscribe to BlockFeed in expectations for real-time block monitoring
  • Track block production rate to detect liveness issues early
  • Use block statistics (block_feed.stats().total_transactions()) to verify inclusion
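
A minimal sketch of that inclusion check; block_feed.stats().total_transactions() is the call named above, while expected_txs stands in for whatever count your workload reports:

// After the run, verify that submitted transactions actually landed in blocks
// rather than merely being accepted into mempools.
let included = block_feed.stats().total_transactions();
assert!(
    included >= expected_txs,
    "only {included} of {expected_txs} expected transactions were included"
);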

Collect metrics

  • Set up Prometheus/Grafana via scripts/setup/setup-observability.sh compose up to visualize node behavior
  • Use metrics to identify bottlenecks before adding more load
  • Monitor mempool size, block size, and consensus timing

Environment & Runner Selection

Environment fit

  • Pick runners that match the feedback loop you need:
    • Host: Fast iteration during development, quick CI smoke tests
    • Compose: Reproducible environments (recommended for CI), chaos testing
    • K8s: Production-like fidelity, large topologies (10+ nodes)

Runner-specific considerations

| Runner  | When to Use                                | When to Avoid                            |
|---------|--------------------------------------------|------------------------------------------|
| Host    | Development iteration, fast feedback       | Chaos testing, container-specific issues |
| Compose | CI pipelines, chaos tests, reproducibility | Very large topologies (>10 nodes)        |
| K8s     | Production-like testing, cluster behaviors | Local development, fast iteration        |

Minimal surprises

  • Seed only necessary wallets and keep configuration deltas explicit when moving between CI and developer machines
  • Use versions.env to pin node versions consistently across environments
  • Document non-default environment variables in scenario comments or README

CI/CD Integration

Use matrix builds

strategy:
  matrix:
    runner: [host, compose]
    topology: [small, medium]

Cache aggressively

  • Cache Rust build artifacts (target/)
  • Cache circuit parameters (assets/stack/kzgrs_test_params/)
  • Cache Docker layers (use BuildKit cache)

Collect logs on failure

- name: Collect logs on failure
  if: failure()
  run: |
    mkdir -p test-logs
    find /tmp -name "nomos-*.log" -exec cp {} test-logs/ \;
- uses: actions/upload-artifact@v3
  if: failure()
  with:
    name: test-logs-${{ matrix.runner }}
    path: test-logs/

Time limits

  • Set job timeout to prevent hung runs: timeout-minutes: 30
  • Use shorter run durations in CI (e.g., 60s) than in local testing (e.g., 300s)
  • Run expensive tests (k8s, large topologies) only on main branch or release tags

See also: CI Integration for complete workflow examples

Anti-Patterns to Avoid

DON’T: Run without POL_PROOF_DEV_MODE

# BAD: Will hang/timeout on proof generation
cargo run -p runner-examples --bin local_runner

# GOOD: Fast mode for testing
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner

DON’T: Use tiny durations

// BAD: Not enough time for blocks to propagate
.with_run_duration(Duration::from_secs(5))

// GOOD: Allow multiple consensus rounds
.with_run_duration(Duration::from_secs(60))

DON’T: Ignore cleanup failures

// BAD: Next run inherits leaked state
runner.run(&mut scenario).await?;
// forgot to call cleanup or use CleanupGuard

// GOOD: Cleanup via guard (automatic on panic)
let _cleanup = CleanupGuard::new(runner.clone());
runner.run(&mut scenario).await?;

DON’T: Mix concerns in one scenario

// BAD: Hard to debug when it fails
.transactions_with(|tx| tx.rate(50).users(100))  // high load
.chaos_with(|c| c.restart().min_delay(...))        // AND chaos
.da_with(|da| da.channel_rate(10).blob_rate(20))  // AND DA stress

// GOOD: Separate tests for each concern
// Test 1: High transaction load only
// Test 2: Chaos resilience only
// Test 3: DA stress only

DON’T: Hardcode paths or ports

// BAD: Breaks on different machines
let path = PathBuf::from("/home/user/circuits/kzgrs_test_params");
let port = 9000; // might conflict

// GOOD: Use env vars and dynamic allocation
let path = std::env::var("NOMOS_KZGRS_PARAMS_PATH")
    .unwrap_or_else(|_| "assets/stack/kzgrs_test_params/kzgrs_test_params".to_string());
let port = get_available_tcp_port();

DON’T: Ignore resource limits

# BAD: Large topology without checking resources
scripts/run/run-examples.sh -v 20 -e 10 compose
# (might OOM or exhaust ulimits)

# GOOD: Scale gradually and monitor resources
scripts/run/run-examples.sh -v 3 -e 2 compose  # start small
docker stats  # monitor resource usage
# then increase if resources allow

Scenario Design Heuristics

Minimal viable topology

  • Consensus: 3 validators (minimum for Byzantine fault tolerance)
  • DA: 2+ executors (test dispersal and sampling)
  • Network: Star topology (simplest for debugging)

Workload rate selection

  • Start with 1-5 tx/s per user, then increase
  • DA: 1-2 channels, 1-3 blobs/channel initially
  • Chaos: 30s+ intervals between restarts (allow recovery)
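
Put together with the builder calls used elsewhere on this page, a conservative first pass might look like this (the exact argument forms, including min_delay, are assumptions):

// Low, deterministic load first; raise rates only after this passes cleanly.
.transactions_with(|tx| tx.rate(2).users(10))
.da_with(|da| da.channel_rate(1).blob_rate(2))
// Chaos belongs in a dedicated resilience scenario, with 30s+ between restarts:
// .chaos_with(|c| c.restart().min_delay(Duration::from_secs(30)))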

Duration guidelines

| Test Type        | Minimum Duration | Typical Duration |
|------------------|------------------|------------------|
| Smoke test       | 30s              | 60s              |
| Integration test | 60s              | 120s             |
| Load test        | 120s             | 300s             |
| Resilience test  | 120s             | 300s             |
| Soak test        | 600s (10m)       | 3600s (1h)       |

Expectation selection

| Test Goal            | Expectations                                               |
|----------------------|------------------------------------------------------------|
| Basic functionality  | expect_consensus_liveness()                                |
| Transaction handling | expect_consensus_liveness() + custom inclusion check       |
| DA correctness       | expect_consensus_liveness() + DA dispersal/sampling checks |
| Resilience           | expect_consensus_liveness() + recovery time measurement    |
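
A sketch of the simplest row, assuming expectations attach through the same builder chain as the other examples on this page (the scenario_builder handle and .build() call are assumptions):

// Basic functionality: consensus liveness is the only expectation.
let mut scenario = scenario_builder              // hypothetical builder handle
    .with_run_duration(SHORT_RUN_DURATION)
    .expect_consensus_liveness()
    .build();                                    // assumed finalizer
runner.run(&mut scenario).await?;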

Testing the Tests

Validate scenarios before committing

  1. Run on Host runner first (fast feedback)
  2. Run on Compose runner (reproducibility check)
  3. Check logs for warnings or errors
  4. Verify cleanup (no leaked processes/containers)
  5. Run 2-3 times to check for flakiness

Handling flaky tests

  • Increase run duration (timing-sensitive assertions need longer runs)
  • Reduce workload rates (might be saturating nodes)
  • Check resource limits (CPU/RAM/ulimits)
  • Add debugging output to identify race conditions
  • Consider whether the test is over-specified (expectations that are too strict)

See also: