# Best Practices

This page collects proven patterns for authoring, running, and maintaining test scenarios that are reliable and maintainable and that produce actionable results.
## Scenario Design

### State your intent

- Document the goal of each scenario (throughput, DA validation, resilience) so expectation choices are obvious
- Use descriptive variable names that explain topology purpose (e.g., `star_topology_3val_2exec` vs `topology`)
- Add comments explaining why specific rates or durations were chosen
### Keep runs meaningful

- Choose durations that allow multiple blocks, so that timing-based assertions are trustworthy
- Use the FAQ's Run Duration Calculator to estimate the minimum duration, or start from the rough sketch below
- Avoid runs shorter than 30 seconds unless you are specifically testing startup behavior
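When the calculator is not at hand, a rough lower bound can be derived from the expected block interval. A minimal sketch, assuming an illustrative 2-second block interval (your actual consensus configuration is authoritative):

```rust
use std::time::Duration;

// Assumed block interval; substitute your network's actual consensus timing.
const BLOCK_INTERVAL: Duration = Duration::from_secs(2);

/// Rough lower bound: time for `min_blocks` blocks, doubled as a safety
/// margin for startup and propagation delays.
fn minimum_run_duration(min_blocks: u32) -> Duration {
    BLOCK_INTERVAL * min_blocks * 2
}

fn main() {
    // A scenario that should observe at least 10 blocks:
    println!("run for at least {:?}", minimum_run_duration(10));
}
```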
### Separate concerns

- Start with deterministic workloads for functional checks
- Add chaos in dedicated resilience scenarios to avoid noisy failures
- Don’t mix high transaction load with aggressive chaos in the same test (hard to debug)
### Start small, scale up

- Begin with a minimal topology (1-2 validators) to validate scenario logic
- Gradually increase topology size and workload rates
- Use the Host runner for fast iteration, then validate on Compose before production
## Code Organization

### Reuse patterns

- Standardize on shared topology and workload presets so results are comparable across environments and teams
- Extract common topology builders into helper functions
- Create workspace-level constants for standard rates and durations
Example: Topology preset

```rust
/// Standard DA topology: 3 validators and 2 executors in a star network.
pub fn standard_da_topology() -> GeneratedTopology {
    TopologyBuilder::new()
        .network_star()
        .validators(3)
        .executors(2)
        .generate()
}
```
Example: Shared constants

```rust
use std::time::Duration;

pub const STANDARD_TX_RATE: f64 = 10.0;
pub const STANDARD_DA_CHANNEL_RATE: f64 = 2.0;
pub const SHORT_RUN_DURATION: Duration = Duration::from_secs(60);
pub const LONG_RUN_DURATION: Duration = Duration::from_secs(300);
```
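Scenarios can then reference the constants instead of inline literals. A hedged usage sketch: `scenario_builder` and `build()` are stand-ins for whatever builder your scenarios use; `with_run_duration` is the call shown in the anti-patterns section below.

```rust
// Hedged sketch: `scenario_builder` is a stand-in for your scenario builder.
let scenario = scenario_builder
    .with_run_duration(SHORT_RUN_DURATION) // instead of a magic Duration::from_secs(60)
    .build();
```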
## Debugging & Observability

### Observe first, tune second

- Rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology
- Enable detailed logging (`RUST_LOG=debug`, `NOMOS_LOG_LEVEL=debug`) only after an initial failure
- Use `NOMOS_TESTS_KEEP_LOGS=1` to persist logs when debugging failures
### Use BlockFeed effectively

- Subscribe to BlockFeed in expectations for real-time block monitoring
- Track the block production rate to detect liveness issues early
- Use block statistics (`block_feed.stats().total_transactions()`) to verify inclusion, as sketched below
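For example, a post-run check can turn the feed's statistics into a production rate. A minimal sketch: `total_transactions()` is the accessor shown above, while `total_blocks()`, `block_feed`, and `run_duration` are assumptions about this API and its surroundings.

```rust
// Hedged sketch: total_blocks() is an assumed sibling of total_transactions().
let stats = block_feed.stats();
let block_rate = stats.total_blocks() as f64 / run_duration.as_secs_f64();
assert!(block_rate > 0.1, "possible liveness issue: {block_rate:.2} blocks/s");
assert!(stats.total_transactions() > 0, "no transactions were included");
```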
### Collect metrics

- Set up Prometheus/Grafana via `scripts/setup/setup-observability.sh compose up` to visualize node behavior
- Use metrics to identify bottlenecks before adding more load
- Monitor mempool size, block size, and consensus timing
## Environment & Runner Selection

### Environment fit

- Pick runners that match the feedback loop you need:
  - Host: Fast iteration during development, quick CI smoke tests
  - Compose: Reproducible environments (recommended for CI), chaos testing
  - K8s: Production-like fidelity, large topologies (10+ nodes)
### Runner-specific considerations
| Runner | When to Use | When to Avoid |
|---|---|---|
| Host | Development iteration, fast feedback | Chaos testing, container-specific issues |
| Compose | CI pipelines, chaos tests, reproducibility | Very large topologies (>10 nodes) |
| K8s | Production-like testing, cluster behaviors | Local development, fast iteration |
### Minimal surprises

- Seed only necessary wallets and keep configuration deltas explicit when moving between CI and developer machines (see the sketch below)
- Use `versions.env` to pin node versions consistently across environments
- Document non-default environment variables in scenario comments or the README
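One way to keep those deltas explicit is to resolve every environment-dependent choice in a single place. A hedged sketch: the `NOMOS_RUNNER` variable and the `RunnerKind` enum are illustrative names, not framework API.

```rust
// Hedged sketch: NOMOS_RUNNER and RunnerKind are illustrative, not framework API.
enum RunnerKind {
    Host,    // fast local iteration
    Compose, // reproducible CI runs
    K8s,     // production-like fidelity
}

fn runner_kind() -> RunnerKind {
    match std::env::var("NOMOS_RUNNER").as_deref() {
        Ok("compose") => RunnerKind::Compose,
        Ok("k8s") => RunnerKind::K8s,
        _ => RunnerKind::Host, // explicit default keeps local runs fast
    }
}
```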
## CI/CD Integration

### Use matrix builds

```yaml
strategy:
  matrix:
    runner: [host, compose]
    topology: [small, medium]
```
### Cache aggressively

- Cache Rust build artifacts (`target/`)
- Cache circuit parameters (`assets/stack/kzgrs_test_params/`)
- Cache Docker layers (use BuildKit cache)
### Collect logs on failure

```yaml
- name: Collect logs on failure
  if: failure()
  run: |
    mkdir -p test-logs
    find /tmp -name "nomos-*.log" -exec cp {} test-logs/ \;
- uses: actions/upload-artifact@v3
  if: failure()
  with:
    name: test-logs-${{ matrix.runner }}
    path: test-logs/
```
### Time limits

- Set a job timeout to prevent hung runs: `timeout-minutes: 30`
- Use shorter durations in CI (60s) than in local testing (300s), as sketched below
- Run expensive tests (K8s, large topologies) only on the main branch or release tags
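Most CI providers, including GitHub Actions, export `CI=true`, so a scenario can pick its duration accordingly. A minimal sketch; the 60s/300s values mirror the guideline above.

```rust
use std::time::Duration;

/// Shorter runs keep CI jobs fast; longer local runs give more signal.
fn run_duration() -> Duration {
    if std::env::var("CI").is_ok() {
        Duration::from_secs(60) // CI: stay within the job timeout budget
    } else {
        Duration::from_secs(300) // local: allow more consensus rounds
    }
}
```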
See also: CI Integration for complete workflow examples
## Anti-Patterns to Avoid

### DON’T: Run without POL_PROOF_DEV_MODE

```bash
# BAD: will hang or time out on proof generation
cargo run -p runner-examples --bin local_runner

# GOOD: fast mode for testing
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
```
### DON’T: Use tiny durations

```rust
// BAD: not enough time for blocks to propagate
.with_run_duration(Duration::from_secs(5))

// GOOD: allow multiple consensus rounds
.with_run_duration(Duration::from_secs(60))
```
### DON’T: Ignore cleanup failures

```rust
// BAD: next run inherits leaked state
runner.run(&mut scenario).await?;
// forgot to call cleanup or use CleanupGuard

// GOOD: cleanup via guard (automatic on panic)
let _cleanup = CleanupGuard::new(runner.clone());
runner.run(&mut scenario).await?;
```
### DON’T: Mix concerns in one scenario

```rust
// BAD: hard to debug when it fails
.transactions_with(|tx| tx.rate(50).users(100))  // high load
.chaos_with(|c| c.restart().min_delay(...))      // AND chaos
.da_with(|da| da.channel_rate(10).blob_rate(20)) // AND DA stress

// GOOD: separate tests for each concern
// Test 1: high transaction load only
// Test 2: chaos resilience only
// Test 3: DA stress only
```
### DON’T: Hardcode paths or ports

```rust
// BAD: breaks on different machines
let path = PathBuf::from("/home/user/circuits/kzgrs_test_params");
let port = 9000; // might conflict

// GOOD: use env vars and dynamic allocation
let path = std::env::var("NOMOS_KZGRS_PARAMS_PATH")
    .unwrap_or_else(|_| "assets/stack/kzgrs_test_params/kzgrs_test_params".to_string());
let port = get_available_tcp_port();
```
### DON’T: Ignore resource limits

```bash
# BAD: large topology without checking resources
scripts/run/run-examples.sh -v 20 -e 10 compose
# (might OOM or exhaust ulimits)

# GOOD: scale gradually and monitor resources
scripts/run/run-examples.sh -v 3 -e 2 compose  # start small
docker stats                                   # monitor resource usage
# then increase if resources allow
```
## Scenario Design Heuristics

### Minimal viable topology

- Consensus: 3 validators (minimum for Byzantine fault tolerance)
- DA: 2+ executors (test dispersal and sampling)
- Network: star topology (simplest for debugging)
### Workload rate selection

- Transactions: start with 1-5 tx/s per user, then increase (see the sketch after this list)
- DA: 1-2 channels, 1-3 blobs/channel initially
- Chaos: 30s+ intervals between restarts (to allow recovery)
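Translated into the builder calls used elsewhere on this page, a conservative starting point might look like the following. Hedged sketch: `scenario_builder` and the exact argument semantics (e.g., whether `rate` is per user) are assumptions.

```rust
// Hedged sketch: conservative starting rates per the heuristics above.
let scenario = scenario_builder
    .transactions_with(|tx| tx.rate(5).users(1))   // start at the low end of 1-5 tx/s
    .da_with(|da| da.channel_rate(1).blob_rate(2)) // 1 channel, 2 blobs per channel
    .build();
// Keep chaos in a separate scenario, with 30s+ between restarts.
```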
### Duration guidelines
| Test Type | Minimum Duration | Typical Duration |
|---|---|---|
| Smoke test | 30s | 60s |
| Integration test | 60s | 120s |
| Load test | 120s | 300s |
| Resilience test | 120s | 300s |
| Soak test | 600s (10m) | 3600s (1h) |
### Expectation selection
| Test Goal | Expectations |
|---|---|
| Basic functionality | expect_consensus_liveness() |
| Transaction handling | expect_consensus_liveness() + custom inclusion check |
| DA correctness | expect_consensus_liveness() + DA dispersal/sampling checks |
| Resilience | expect_consensus_liveness() + recovery time measurement |
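As a combined example, pairing the liveness expectation with a custom inclusion check might look like this. Hedged sketch: `expect_consensus_liveness()` is the expectation named in the table, but how it attaches to a scenario and the `expect_with` hook for custom checks are assumptions, not confirmed API.

```rust
// Hedged sketch: expect_with() is an illustrative hook for custom checks.
let scenario = scenario_builder
    .expect_consensus_liveness()
    .expect_with(|block_feed| {
        // Custom inclusion check via the stats API from the BlockFeed section.
        assert!(block_feed.stats().total_transactions() > 0);
    })
    .build();
```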
## Testing the Tests

### Validate scenarios before committing

- Run on the Host runner first (fast feedback)
- Run on the Compose runner (reproducibility check)
- Check logs for warnings or errors
- Verify cleanup (no leaked processes/containers)
- Run 2-3 times to check for flakiness
### Handling flaky tests

- Increase the run duration (timing-sensitive assertions need longer runs)
- Reduce workload rates (the test might be saturating nodes)
- Check resource limits (CPU, RAM, ulimits)
- Add debugging output to identify race conditions
- Consider whether the test is over-specified (expectations too strict)
See also:
- Troubleshooting for common failure patterns
- FAQ for design decisions and gotchas