# Best Practices

This page collects proven patterns for authoring, running, and maintaining test scenarios that are reliable and maintainable and that produce actionable results.
## Scenario Design

### State your intent

- Document the goal of each scenario (throughput, DA validation, resilience) so expectation choices are obvious
- Use descriptive variable names that explain topology purpose (e.g., `star_topology_3val_2exec` vs `topology`)
- Add comments explaining why specific rates or durations were chosen
### Keep runs meaningful

- Choose durations that allow multiple blocks, so that timing-based assertions are trustworthy
- Use the FAQ's Run Duration Calculator to estimate the minimum duration, or start from the rough sketch below
- Avoid runs shorter than 30 seconds unless you are specifically testing startup behavior
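When the calculator is not at hand, a rough lower bound can be derived from the expected block interval. A minimal sketch, assuming an illustrative 2-second block interval (your actual consensus configuration is authoritative):

```rust
use std::time::Duration;

// Assumed block interval; substitute your network's actual consensus timing.
const BLOCK_INTERVAL: Duration = Duration::from_secs(2);

/// Rough lower bound: time for `min_blocks` blocks, doubled as a safety
/// margin for startup and propagation delays.
fn minimum_run_duration(min_blocks: u32) -> Duration {
    BLOCK_INTERVAL * min_blocks * 2
}

fn main() {
    // A scenario that should observe at least 10 blocks:
    println!("run for at least {:?}", minimum_run_duration(10));
}
```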
### Separate concerns

- Start with deterministic workloads for functional checks
- Add chaos in dedicated resilience scenarios to avoid noisy failures
- Don’t mix high transaction load with aggressive chaos in the same test (hard to debug)
### Start small, scale up

- Begin with a minimal topology (1-2 validators) to validate scenario logic
- Gradually increase topology size and workload rates
- Use the Host runner for fast iteration, then validate on Compose before production
## Code Organization

### Reuse patterns

- Standardize on shared topology and workload presets so results are comparable across environments and teams
- Extract common topology builders into helper functions
- Create workspace-level constants for standard rates and durations
Example: Topology preset

```rust
/// Standard DA topology: 3 validators and 2 executors in a star network.
pub fn standard_da_topology() -> GeneratedTopology {
    TopologyBuilder::new()
        .network_star()
        .validators(3)
        .executors(2)
        .generate()
}
```
Example: Shared constants

```rust
use std::time::Duration;

pub const STANDARD_TX_RATE: f64 = 10.0;
pub const STANDARD_DA_CHANNEL_RATE: f64 = 2.0;
pub const SHORT_RUN_DURATION: Duration = Duration::from_secs(60);
pub const LONG_RUN_DURATION: Duration = Duration::from_secs(300);
```
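Scenarios can then reference the constants instead of inline literals. A hedged usage sketch: `scenario_builder` and `build()` are stand-ins for whatever builder your scenarios use; `with_run_duration` is the call shown in the anti-patterns section below.

```rust
// Hedged sketch: `scenario_builder` is a stand-in for your scenario builder.
let scenario = scenario_builder
    .with_run_duration(SHORT_RUN_DURATION) // instead of a magic Duration::from_secs(60)
    .build();
```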
## Debugging & Observability

### Observe first, tune second

- Rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology
- Enable detailed logging (`RUST_LOG=debug`, `NOMOS_LOG_LEVEL=debug`) only after an initial failure
- Use `NOMOS_TESTS_KEEP_LOGS=1` to persist logs when debugging failures
### Use BlockFeed effectively

- Subscribe to BlockFeed in expectations for real-time block monitoring
- Track the block production rate to detect liveness issues early
- Use block statistics (`block_feed.stats().total_transactions()`) to verify inclusion, as sketched below
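For example, a post-run check can turn the feed's statistics into a production rate. A minimal sketch: `total_transactions()` is the accessor shown above, while `total_blocks()`, `block_feed`, and `run_duration` are assumptions about this API and its surroundings.

```rust
// Hedged sketch: total_blocks() is an assumed sibling of total_transactions().
let stats = block_feed.stats();
let block_rate = stats.total_blocks() as f64 / run_duration.as_secs_f64();
assert!(block_rate > 0.1, "possible liveness issue: {block_rate:.2} blocks/s");
assert!(stats.total_transactions() > 0, "no transactions were included");
```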
### Collect metrics

- Set up Prometheus/Grafana via `scripts/setup/setup-observability.sh compose up` to visualize node behavior
- Use metrics to identify bottlenecks before adding more load
- Monitor mempool size, block size, and consensus timing
## Environment & Runner Selection

### Environment fit

- Pick runners that match the feedback loop you need:
  - Host: Fast iteration during development, quick CI smoke tests
  - Compose: Reproducible environments (recommended for CI), chaos testing
  - K8s: Production-like fidelity, large topologies (10+ nodes)
### Runner-specific considerations
| Runner | When to Use | When to Avoid |
|---|---|---|
| Host | Development iteration, fast feedback | Chaos testing, container-specific issues |
| Compose | CI pipelines, chaos tests, reproducibility | Very large topologies (>10 nodes) |
| K8s | Production-like testing, cluster behaviors | Local development, fast iteration |
### Minimal surprises

- Seed only necessary wallets and keep configuration deltas explicit when moving between CI and developer machines (see the sketch below)
- Use `versions.env` to pin node versions consistently across environments
- Document non-default environment variables in scenario comments or the README
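One way to keep those deltas explicit is to resolve every environment-dependent choice in a single place. A hedged sketch: the `NOMOS_RUNNER` variable and the `RunnerKind` enum are illustrative names, not framework API.

```rust
// Hedged sketch: NOMOS_RUNNER and RunnerKind are illustrative, not framework API.
enum RunnerKind {
    Host,    // fast local iteration
    Compose, // reproducible CI runs
    K8s,     // production-like fidelity
}

fn runner_kind() -> RunnerKind {
    match std::env::var("NOMOS_RUNNER").as_deref() {
        Ok("compose") => RunnerKind::Compose,
        Ok("k8s") => RunnerKind::K8s,
        _ => RunnerKind::Host, // explicit default keeps local runs fast
    }
}
```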
## CI/CD Integration

### Use matrix builds

```yaml
strategy:
  matrix:
    runner: [host, compose]
    topology: [small, medium]
```
### Cache aggressively

- Cache Rust build artifacts (`target/`)
- Cache circuit parameters (`assets/stack/kzgrs_test_params/`)
- Cache Docker layers (use BuildKit cache)
### Collect logs on failure

```yaml
- name: Collect logs on failure
  if: failure()
  run: |
    mkdir -p test-logs
    find /tmp -name "nomos-*.log" -exec cp {} test-logs/ \;
- uses: actions/upload-artifact@v3
  if: failure()
  with:
    name: test-logs-${{ matrix.runner }}
    path: test-logs/
```
### Time limits

- Set a job timeout to prevent hung runs: `timeout-minutes: 30`
- Use shorter durations in CI (60s) than in local testing (300s), as sketched below
- Run expensive tests (K8s, large topologies) only on the main branch or release tags
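Most CI providers, including GitHub Actions, export `CI=true`, so a scenario can pick its duration accordingly. A minimal sketch; the 60s/300s values mirror the guideline above.

```rust
use std::time::Duration;

/// Shorter runs keep CI jobs fast; longer local runs give more signal.
fn run_duration() -> Duration {
    if std::env::var("CI").is_ok() {
        Duration::from_secs(60) // CI: stay within the job timeout budget
    } else {
        Duration::from_secs(300) // local: allow more consensus rounds
    }
}
```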
See also: CI Integration for complete workflow examples
## Anti-Patterns to Avoid

### DON’T: Run without POL_PROOF_DEV_MODE

```bash
# BAD: will hang or time out on proof generation
cargo run -p runner-examples --bin local_runner

# GOOD: fast mode for testing
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
```
### DON’T: Use tiny durations

```rust
// BAD: not enough time for blocks to propagate
.with_run_duration(Duration::from_secs(5))

// GOOD: allow multiple consensus rounds
.with_run_duration(Duration::from_secs(60))
```
### DON’T: Ignore cleanup failures

```rust
// BAD: next run inherits leaked state
runner.run(&mut scenario).await?;
// forgot to call cleanup or use CleanupGuard

// GOOD: cleanup via guard (automatic on panic)
let _cleanup = CleanupGuard::new(runner.clone());
runner.run(&mut scenario).await?;
```
### DON’T: Mix concerns in one scenario

```rust
// BAD: hard to debug when it fails
.transactions_with(|tx| tx.rate(50).users(100))  // high load
.chaos_with(|c| c.restart().min_delay(...))      // AND chaos
.da_with(|da| da.channel_rate(10).blob_rate(20)) // AND DA stress

// GOOD: separate tests for each concern
// Test 1: high transaction load only
// Test 2: chaos resilience only
// Test 3: DA stress only
```
### DON’T: Hardcode paths or ports

```rust
// BAD: breaks on different machines
let path = PathBuf::from("/home/user/circuits/kzgrs_test_params");
let port = 9000; // might conflict

// GOOD: use env vars and dynamic allocation
let path = std::env::var("NOMOS_KZGRS_PARAMS_PATH")
    .unwrap_or_else(|_| "assets/stack/kzgrs_test_params/kzgrs_test_params".to_string());
let port = get_available_tcp_port();
```
### DON’T: Ignore resource limits

```bash
# BAD: large topology without checking resources
scripts/run/run-examples.sh -v 20 -e 10 compose
# (might OOM or exhaust ulimits)

# GOOD: scale gradually and monitor resources
scripts/run/run-examples.sh -v 3 -e 2 compose  # start small
docker stats                                   # monitor resource usage
# then increase if resources allow
```
## Scenario Design Heuristics

### Minimal viable topology

- Consensus: 3 validators (minimum for Byzantine fault tolerance)
- DA: 2+ executors (test dispersal and sampling)
- Network: star topology (simplest for debugging)
### Workload rate selection

- Transactions: start with 1-5 tx/s per user, then increase (see the sketch after this list)
- DA: 1-2 channels, 1-3 blobs/channel initially
- Chaos: 30s+ intervals between restarts (to allow recovery)
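Translated into the builder calls used elsewhere on this page, a conservative starting point might look like the following. Hedged sketch: `scenario_builder` and the exact argument semantics (e.g., whether `rate` is per user) are assumptions.

```rust
// Hedged sketch: conservative starting rates per the heuristics above.
let scenario = scenario_builder
    .transactions_with(|tx| tx.rate(5).users(1))   // start at the low end of 1-5 tx/s
    .da_with(|da| da.channel_rate(1).blob_rate(2)) // 1 channel, 2 blobs per channel
    .build();
// Keep chaos in a separate scenario, with 30s+ between restarts.
```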
### Duration guidelines
| Test Type | Minimum Duration | Typical Duration |
|---|---|---|
| Smoke test | 30s | 60s |
| Integration test | 60s | 120s |
| Load test | 120s | 300s |
| Resilience test | 120s | 300s |
| Soak test | 600s (10m) | 3600s (1h) |
### Expectation selection
| Test Goal | Expectations |
|---|---|
| Basic functionality | expect_consensus_liveness() |
| Transaction handling | expect_consensus_liveness() + custom inclusion check |
| DA correctness | expect_consensus_liveness() + DA dispersal/sampling checks |
| Resilience | expect_consensus_liveness() + recovery time measurement |
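As a combined example, pairing the liveness expectation with a custom inclusion check might look like this. Hedged sketch: `expect_consensus_liveness()` is the expectation named in the table, but how it attaches to a scenario and the `expect_with` hook for custom checks are assumptions, not confirmed API.

```rust
// Hedged sketch: expect_with() is an illustrative hook for custom checks.
let scenario = scenario_builder
    .expect_consensus_liveness()
    .expect_with(|block_feed| {
        // Custom inclusion check via the stats API from the BlockFeed section.
        assert!(block_feed.stats().total_transactions() > 0);
    })
    .build();
```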
## Testing the Tests

### Validate scenarios before committing

- Run on the Host runner first (fast feedback)
- Run on the Compose runner (reproducibility check)
- Check logs for warnings or errors
- Verify cleanup (no leaked processes/containers)
- Run 2-3 times to check for flakiness
### Handling flaky tests

- Increase the run duration (timing-sensitive assertions need longer runs)
- Reduce workload rates (the test might be saturating nodes)
- Check resource limits (CPU, RAM, ulimits)
- Add debugging output to identify race conditions
- Consider whether the test is over-specified (expectations too strict)
See also:
- Troubleshooting for common failure patterns
- FAQ for design decisions and gotchas