fix: reduce steady-state memory footprint on fully synced validators #653

Draft

skylar-simoncelli wants to merge 2 commits into main from skylar/fix-memory-growth-steady-state

Conversation

skylar-simoncelli (Contributor) commented Feb 11, 2026

Overview

This PR tunes the node for a maximum memory allocation of 8 GiB, sizing all caching parameters around that budget (see the config sketch after the list below):

  • trie_cache_size: 1 GiB → 128 MiB — The trie cache fills gradually as state is accessed and never shrinks. At 1 GiB it was the single largest memory consumer. 128 MiB is sufficient for validators with 6-second block times and frees ~900 MiB of the 8 GiB budget.
  • storage_cache_size: 1M nodes (~80 MiB) → 512K nodes (~40 MiB) — Halves the midnight-ledger arena LRU cache. Synced validators rarely need more than 512K nodes hot; the tradeoff is slightly more disk reads during sync. Saves ~40 MiB, which matters when every MiB counts toward the 8 GiB budget.
  • max_runtime_instances: 8 → 2 — Each pooled WASM instance reserves 128 MiB of heap that is allocated on demand and never released, so the default of 8 can hold up to 1 GiB. Validators only need one instance for block import and one for authoring. Saves ~768 MiB, which is critical for fitting within 8 GiB.
  • TX_VALIDATION_CACHE_MAX_CAPACITY: 1000 (no TTL) → 200 entries + 5-minute time-to-idle — Each VerifiedTransaction entry holds ZK proof data (50–200 KiB). At 1000 entries with no TTL, stale entries for old state hashes accumulated up to ~200 MiB. Reducing to 200 entries (plenty for low-traffic validator networks) and adding idle eviction lets quiet periods actively reclaim memory. Keeps the worst case under ~40 MiB.
  • pool-limit: 8192 → 1024 — The default Substrate tx pool of 8192 is sized for public networks. Our 6-validator testnets generate minimal transactions; 1024 is more than sufficient and saves ~30 MiB of pooled transaction overhead.
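
The config side of these changes could look roughly like the sketch below. The key names and units are assumptions based on the descriptions above and the existing res/cfg/default.toml; treat this as an illustration rather than a verbatim diff:

```toml
# Illustrative sketch, not a verbatim diff of res/cfg/default.toml;
# key names and units are assumptions based on the PR description.
trie_cache_size = 134217728   # 128 MiB, assuming the key is expressed in bytes (was 1 GiB)
storage_cache_size = 524288   # 512K nodes, ~40 MiB (was 1M nodes / ~80 MiB)
max_runtime_instances = 2     # was the Substrate default of 8
pool_limit = 1024             # tx pool size, matching --pool-limit 1024 (was 8192)
```

The 200-entry cap and 5-minute time-to-idle for the TX validation cache appear to live in code (ledger/src/versions/common/mod.rs) rather than config; see the moka sketch further down.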

These values target an estimated ~1.2 GiB baseline with ~6.8 GiB headroom for allocator fragmentation and ParityDB mmap growth — enough for multiple days of uptime before memory pressure.

Background

Fully synced validators exhibit continuous memory growth, approaching pod limits (12 Gi) within days. Heaptrack profiling confirmed the midnight_storage arena as the dominant allocator (94% of heap never freed), which was previously addressed by bounding storage_cache_size. However, three additional memory sources continue growing on synced nodes:

| Change | Before | After | Savings |
| --- | --- | --- | --- |
| trie_cache_size | 1 GiB | 256 MiB | ~768 MiB |
| max_runtime_instances | 8 (default) | 2 | ~768 MiB |
| TX validation cache TTI | none (count-only) | 5 min idle eviction | variable |

Combined: ~1.5 GiB reduction in steady-state memory per validator.

1. Substrate trie cache: 1 GiB → 256 MiB (res/cfg/default.toml)

The trie cache fills gradually as state is accessed and never shrinks, so at 1 GiB it eventually pinned ~1 GiB on every node. 256 MiB is sufficient for validator workloads with 6-second block times.

2. WASM runtime instances: 8 → 2 (res/cfg/default.toml)

Each pooled WASM instance reserves 128 MiB of heap that is allocated on demand and never released, so the Substrate default of 8 instances can hold up to 1 GiB. Validators only need one concurrent instance for block import and one for authoring.

3. Moka cache TTI: add 5-minute time_to_idle (ledger/src/versions/common/mod.rs)

The STRICT_TX_VALIDATION_CACHE stores VerifiedTransaction objects (50–200 KiB each containing ZK proof data) keyed by (state_hash, tx_hash). Since state_hash changes every block, the same transaction gets a new cache key each block — old entries for stale state hashes accumulated until the 1000-entry cap forced LRU eviction. On low-traffic networks, entries persisted indefinitely. Adding time_to_idle(5 min) evicts stale entries during quiet periods.
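
A minimal sketch of this eviction policy, assuming the caches are built with moka's sync Cache builder; the key and value types below are placeholders for (state_hash, tx_hash) and VerifiedTransaction, not the project's actual types:

```rust
use std::time::Duration;

use moka::sync::Cache;

// Capacity mirrors the reduced TX_VALIDATION_CACHE_MAX_CAPACITY described above.
const TX_VALIDATION_CACHE_MAX_CAPACITY: u64 = 200;

/// Placeholder types: ([u8; 32], [u8; 32]) stands in for (state_hash, tx_hash)
/// and Vec<u8> for the cached VerifiedTransaction data.
fn build_tx_validation_cache() -> Cache<([u8; 32], [u8; 32]), Vec<u8>> {
    Cache::builder()
        .max_capacity(TX_VALIDATION_CACHE_MAX_CAPACITY)
        // Evict entries that have been neither read nor written for 5 minutes,
        // so stale state_hash keys are reclaimed during quiet periods instead of
        // lingering until the count-based cap forces LRU eviction.
        .time_to_idle(Duration::from_secs(5 * 60))
        .build()
}
```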

Current production data (qanet, 12 Gi limit)

| Validator | Memory | Uptime | Growth rate |
| --- | --- | --- | --- |
| V1 | 7,399 Mi (60%) | 2d4h | ~3.5 Gi/day |
| V2 | 7,378 Mi (60%) | 2d4h | ~3.5 Gi/day |
| V4 | 7,418 Mi (61%) | 2d4h | ~3.5 Gi/day |

At this rate, pods would OOM within ~4 days of uptime: with ~7.4 Gi used after 2d4h, the remaining ~4.6 Gi of headroom against the 12 Gi limit is consumed in roughly 1.3 more days.

TODO before merging

  • CI build + test pass
  • Verify --max-runtime-instances 2 doesn't cause runtime instance contention under load

Submission Checklist

  • This is backward-compatible (config/CLI changes only; no runtime migration needed)
  • I have self-reviewed the diff
  • A change file has been added (changes/changed/reduce-steady-state-memory-footprint.md)
  • No version bump needed (config-only change)
  • AGENTS.md does not need updating

Testing Evidence

Deploy to qanet first and monitor memory over 48h. Expected: memory stabilizes around 3-4 Gi instead of growing past 7 Gi.

  • Additional tests are not needed (no logic changes, only cache configuration)

Fork Strategy

  • N/A

Links

Related commit: 69fe806 (fix-unbounded-ledger-storage-cache)

First commit message:

Three changes to cut ~1.5 GiB of steady-state memory per node:

- Reduce trie_cache_size from 1 GiB to 256 MiB. The trie cache fills
  gradually and never shrinks; 256 MiB is sufficient for 6s block times.

- Set --max-runtime-instances 2 (default 8). Each pooled WASM instance
  reserves 128 MiB of heap that is never released. Validators only need
  1 for import and 1 for authoring.

- Add 5-minute time_to_idle to moka transaction validation caches.
  VerifiedTransaction objects (50-200 KiB each, containing ZK proof data)
  were only evicted by count. On low-traffic networks stale entries for
  old state hashes persisted indefinitely, contributing to memory growth.
github-actions bot commented Feb 11, 2026

KICS version: v2.1.16

| Category | Results |
| --- | --- |
| CRITICAL | 0 |
| HIGH | 0 |
| MEDIUM | 96 |
| LOW | 12 |
| INFO | 83 |
| TRACE | 0 |
| TOTAL | 191 |

| Metric | Values |
| --- | --- |
| Files scanned | 31 |
| Files parsed | 31 |
| Files failed to scan | 0 |
| Total executed queries | 73 |
| Queries failed to execute | 0 |
| Execution time | 9 |

skylar-simoncelli marked this pull request as draft February 11, 2026 17:07
Second commit message:

- trie_cache_size: 256 MiB → 128 MiB
- storage_cache_size: 1M → 512K nodes (~80 MiB → ~40 MiB)
- TX validation cache: 1000 → 200 entries
- Add --pool-limit 1024 (default was 8192)

Brings estimated baseline to ~1.2 GiB with ~6.8 GiB headroom for
allocator fragmentation and ParityDB mmap growth.