fix: Add state root calculation on payload resolve #334

sieniven · 2025-11-27T04:49:28Z

📝 Summary

Context of the full issue: Cache miss on engine_newPayload to builder causes slow payload validation #333
When flashblocks stateroots calculation flag is switched to false, because state roots are never calculated in payloads, on payload resolving, engine api request to insert locally executed blocks will insert blocks without state roots
This eventually leads to cache misses when engine_newPayload is called on the op-rbuilder from rollup-boost, which leads to the full validation of this payload (even though they were locally built by the builder), including the re-execution of the payload transactions and subsequent state root calculation
This PR adds the additional stateroot calculation triggered on payload resolve, ensuring that we do not take up sequencing time re-executing the payload
Note that state root calculation regardless will block the sequencing time - however, rollup boost will obtain the cached incrementally built builder payload anyways and trigger the payload resolve asynchronously. This should be optimal when integrated with flashblocks

✅ I have completed the following steps:

[✅ ] Run make lint
[✅ ] Run make test
Added tests (if applicable)

crates/op-rbuilder/src/builders/flashblocks/payload.rs

SozinM · 2025-11-27T05:39:23Z

crates/op-rbuilder/src/builders/flashblocks/payload.rs

                    ctx = ctx.with_cancel(fb_cancel).with_extra_ctx(next_flashblocks_ctx);
                },
                _ = block_cancel.cancelled() => {
+                    self.resolve_best_payload(&mut state, &ctx, &mut info, &best_payload).await;


It's not guaranteed that rollup-boost will cancel building (send engine_getPayload)
I think correct way would be to calculate state root asynchronously after we sent last flashblock

The reason why it is placed here is because on the event where rollup-boost does not send the payload resolve, on a subsequent FCU, the payload will be resolved regardless.

Also, I placed it on resolve because we are keen in exploring using the flashblock payload builder directly on a custom reth node without the need to use rollup-boost. Maybe this implementation is better in terms of de-coupling with rollup boost service?

Also, it is hard to deterministically say for sure which is the last flashblock. As observed during our deployment, it seems like on the builder, on most cases, the payload building early cancels which results in the payload not ever reaching the target flashblock. This seems to me to be the most deterministic way of ensuring locally built payloads contain a state root.

yeah, you need to use --flashblock.leeway-time
And for external state root it should be considerable as usually it used in cases when it's slow to calculate state root, so you need some extra time to do this

As for decoupling
I think some assumptions behind state root calculation weren't designed for this case, so it's better to give it a good thought on how to change it

https://github.com/flashbots/op-rbuilder/pull/334/files#diff-20125ce86c70cfd520b09bc61e2032873f80249afe0027ee76dfbfcb503e7fb7R535-R542 - This issue should be resolved in the attached code block. We start the state root calculation immediately once the last target flashblock is built.

crates/op-rbuilder/src/builders/flashblocks/payload.rs

SozinM · 2025-11-27T05:45:43Z

Hi!
Have you tested this?
Because it seems to me it shouldn't work, as you still don't save executed block

SozinM · 2025-11-27T08:35:18Z

@sieniven could you please attach newPayload latency before and after fix?
I think this would be a good starting point of looking into the deeper

sieniven · 2025-11-27T10:46:40Z

@SozinM Hey, thanks and appreciate your time in helping review this :) Converting this PR to draft first because there are still some issues when testing - will re-open it in abit once everything is working on our end.

sieniven · 2025-11-27T13:53:19Z

@SozinM I have re-opened the PR. From the above discussion thread, I understand there is the leeway time configuration which allows us to early package flashblocks. However on the assumption that we are trying to maximize throughput of the chain, setting too big a leeway time may tradeoff chain max throughput (gas executed per block).

I am trying to think of the most optimal way of doing this - it still seems to me that unless the intent of the builder for flashbots is to always couple it with the rollup boost proxy service, perhaps we could explore the default behavior of the builder to calculate state root on payload resolve.

Giving some context here - we found the issue with the builder sync-ing with tip to have severe effects on the chain's maximum throughput (gas per second) during our stress testing on the flashblocks architecture. It was found that during our stress tests which benchmarks sending ERC20 transactions, the tps of the chain drops significantly and a huge part of the bottleneck was coming from cache misses due to the mismatch in block hash. This PR has been tested on our end to work and resolves the cache miss issue, and the current design is such that the builder will be default calculate state root on payload resolution, which will also nicely help our use case as we intend to use op-rbuilder as a dependency library crate to integrate the flashblock payload builder directly into our node.

Thanks again and appreciate the prompt review for this PR :)

sieniven · 2025-11-27T14:32:14Z

@SozinM another consideration to why this PR shouldnt have no impact with rollup boost is because engine_getPayload on rollup boost uses the underlying accumulated flashblocks' cache.

Since the payload resolution on the builder is called async, together with the expensive state root calculation of the unsafe payload with the default EL on FCU, this fix on cache misses to locally built blocks should have little impact on the intended design of rollup boost?

SozinM · 2025-11-27T14:40:27Z

@sieniven Can you please show newPayload graph with and without this fix

sieniven · 2025-12-01T04:09:52Z

@sieniven Can you please show newPayload graph with and without this fix

@SozinM We are onto this. Will provide you with the API response time from the CL soon.

SozinM · 2025-12-01T08:27:22Z

@sieniven reth already provides these metrics, you could set up their op-reth dashboard

sieniven · 2025-12-01T08:41:22Z

@sieniven reth already provides these metrics, you could set up their op-reth dashboard

We used a different approach where we used the metrics on op-node instead. Since that will be a better metric in terms of understanding the impact of this PR? Comparing engine_newPayload without state root calculations on the builder will definitely be significantly slower since we now add state root calculations at the end of the payload resolution.

SozinM · 2025-12-01T08:47:13Z

@sieniven engine_newPayload metric would suffice.
If you solution (in regard to caching) works, than we would see that new_payload latency without fix was high (prob 20+ms), and after fix became low (1-2 ms)

sieniven · 2025-12-01T08:54:17Z

@SozinM Sure, we will generate both on the EL response times and on the CL response times for you, during our stress test environment (which is an ERC20 stress test).

On a separate note, I think there are a few more issues which I have opened on rollup-boost which explains why cache misses will still happen in your current flashblock design:

Rollup-boost occasionally fails to accept new flashblock payload due to failure to handle ordering rollup-boost#449
Rollup-boost occasionally fails to accept flashblock payloads rollup-boost#448

Because the source of truth of the flashblock architecture is on the rollup-boost accumulated flashblock sequence, the consensus layer will always accept this as the unsafe payload (and subsequently calculate the external state root of that payload there). However, we cannot guarantee the state of the accumulated sequence of rollup boost and the builder to be consistent, thus resulting in cache misses on locally built payloads on engine_newPayload to the builder. We have tested and this PR is confirmed to have fixed cache misses on engine_newPayload if the payload state matches (on rollup-boost and on op-rbuilder during payload resolution). However, if the states are mismatched, payloads are still re-validated on the builder which affects the overall chain throughput.

Also, it was noticed on my end that when stress testing, payload re-validations could take more than 2 seconds on the builder. There might be a mutex issue on the payload, but I have yet to really debug this fully. I cant replicate that anymore with the fixes on the state root calculations, but might be worth adding to your todo to investigate that.

SozinM · 2025-12-01T09:24:26Z

Wdym by

locally built payloads

Is it payload built be fallback EL, or empty payload built by CL?

sieniven · 2025-12-02T06:38:18Z

Wdym by

locally built payloads

Is it payload built be fallback EL, or empty payload built by CL?

@SozinM Sorry for the confusion. What i meant is in this specific case, payload is locally built by the op-rbuilder. However, due to the accumulated payload sequence on rollup-boost state being different from the builder, the eventual accepted unsafe payload returned to the op-node sequencer is different from what was actually built on the op-rbuilder (when payload is resolved on rollup-boost). This results in the subsequent engine_newPayload to still cache miss - which is what I meant by this PR not being able to fully resolve the cache miss issue.

sieniven · 2025-12-04T03:46:54Z

@SozinM We have generated the metrics you were asking for. Note that these were generated under stress test situation with full blocks filled with ERC20 transactions from multiple addresses to simulate high chain traffic situation. The metrics captures a 15minute duration, which sufficiently captures the results. Hope this helps.

Before state root PR fixes:

After state root PR fixes:

Note that there are still occasional cache misses which was the default flashblocks design issue that I was mentioning above in this comment: https://github.com//pull/334#issuecomment-3595341837. Thus this PR still doesnt fully resolve the issue of cache misses (resulting in full validation of locally built payloads), but it significantly reduces this and should help with optimizing throughput of the default flashblocks architecture.

…x-state-root

sieniven · 2025-12-17T06:56:03Z

@SozinM hey, any updates for this PR? seems to be dangling for awhile now, would be great if we could get this merged soon. 🙏

SozinM · 2025-12-17T11:29:36Z

crates/op-rbuilder/src/builders/flashblocks/payload.rs

-    pub payload_tx: mpsc::Sender<OpBuiltPayload>,
+    /// Sender for sending built flashblock payloads to [`PayloadHandler`],
+    /// which broadcasts outgoing flashblock payloads via p2p.
+    pub built_fb_payload_tx: mpsc::Sender<OpBuiltPayload>,


Let's split p2p and main logic in different PRs?

SozinM · 2025-12-17T11:34:46Z

If i understand it correctly this PR would break base usage of op-rbuidler, because it takes 200-300ms to calculate state root on base
We shouldn't block resolve with state root calculation when we have zero state root
We could move it into spawned task to calculate state root, so they will race with canonical block update

sieniven · 2025-12-17T11:52:43Z

@SozinM Hmm may I understand how Base implements flashblocks? If their deployment is using the default flashblock architecture (rollup boost + op-rbuilder), this PR on the contrary should optimize their chain throughput by 1.5x, at least that is on our end.

The optimization should come from 2 parts:

without this PR, cache misses would happen 100% of the time on the builder which results in a blocking state root calculation during sequencing time for re-validation when committing new unsafe payloads
In the default architecture, rollup boost triggers payload resolve asynchronously on engine_getPayload, which the returned response is never handled - https://github.com/flashbots/rollup-boost/blob/bec1d42d97229f24fbaa3a7f82c98aa3005b919a/crates/rollup-boost/src/flashblocks/service.rs#L360

Would it be better if we could perhaps jump on a call to further discuss this? Maybe it might be more efficient

SozinM · 2025-12-17T11:54:17Z

I think we could simplify this PR, if we would just spawn state root calculation task when if we have zero state root flag set and this is last falshblock or resolve event

SozinM · 2025-12-17T11:56:53Z

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)
Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.
In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

sieniven · 2025-12-17T12:47:48Z

I think we could simplify this PR, if we would just spawn state root calculation task when if we have zero state root flag set and this is last falshblock or resolve event

i think state root calculation is triggered when:

target flashblock building completes, triggers payload resolve
engine_newpayload request occurs, resulting in a payload resolve

We could possibly implement the resolve to be on a separate thread. however, i dont believe this is going to be optimal on relatively empty blocks, since new payload request to commit the unsafe payload from the CL may come in before state root calc finishes. and in fact potentially worsen the chain throughput to what we are seeing now - which is sequencing time is taken 100% of the time to re-validate locally built payloads (re-exec + state root calculation)

sieniven · 2025-12-17T12:49:20Z

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)

Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.

In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

sieniven · 2025-12-17T12:51:13Z

@SozinM i would really appreciate if we could speed up this back and forth if we could pair reviewing this on a call perhaps? we have already tested this in-depth in production environment and it has optimized our chain throughput by 2x

sieniven · 2025-12-17T12:53:33Z

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)

Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.

In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

also did you read my reply here? #334 (comment). the codeblock from rollup boost in this comment sends an async engine_newPayload in a separate thread, that doesnt block the rollup boost's engine_newPayload handler response time itself

SozinM · 2025-12-17T12:53:56Z

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

No, it will try to get payload from the builder via engine_getPayload

SozinM · 2025-12-17T12:55:00Z

Sure, we could set up a call

sieniven · 2025-12-17T12:55:08Z

cc/ @akundaz @avalonche

sieniven · 2025-12-17T12:56:23Z

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

No, it will try to get payload from the builder via engine_getPayload

@SozinM am I missing something on your version of rollup boost? rollup-boost handles engine_getPayload through its stateful flashblocks service - and this service inherently holds the source of truth to the accumulated built flashblocks' sequence (via ws as you mentioned), so that cached sequence is returned on engine_getPayload request.

SozinM · 2025-12-17T12:57:15Z

If websocket stream is down rollup-boost will query both builder with engine_getPayload

sieniven · 2025-12-17T12:59:41Z

Sure, we could set up a call

@SozinM i will reach out to you through your gmail in abit, and lets set this up. 👍

sieniven · 2025-12-17T15:52:09Z

If websocket stream is down rollup-boost will query both builder with engine_getPayload

@SozinM I am just looking at this and i can see how this is a potential blocking vector. If thats the case I do believe we should work to make the state root calculation async.

Ideally, I would hope that we could perhaps make this configurable. What do you think if we hop on the current flashblocks.disable-state-root to instead be our configuration flag to determine whether on payload resolve, we block to calculate state root vs async calculate state root?

sieniven requested review from 0x416e746f6e, SozinM, avalonche and karim-agha as code owners November 27, 2025 04:49

Add state root calculation on payload resolve

ba5ada0

sieniven force-pushed the niven/fix-state-root branch from 5d86917 to ba5ada0 Compare November 27, 2025 05:00

SozinM requested changes Nov 27, 2025

View reviewed changes

sieniven added 5 commits November 27, 2025 13:51

Fix

f7a3933

fix: update executed block to upstream version

b670c23

Revert to previous logic on building fallback block

0ea5e3f

Remove merging transitions in statedb

c3e65b2

Fix triggering resolving payload

4e58bdb

sieniven marked this pull request as draft November 27, 2025 10:45

sieniven mentioned this pull request Nov 27, 2025

Add state root calculation on payload resolve okx/op-rbuilder#22

Merged

Fixes

18111e9

sieniven marked this pull request as ready for review November 27, 2025 13:47

sieniven added 5 commits November 27, 2025 23:13

Fix

51b161f

Fix

042b2f9

Fix state mismatch

edbc113

Fix

5f82b81

Fix

2d53279

sieniven changed the title ~~Fix Add state root calculation on payload resolve~~ fix: Add state root calculation on payload resolve Dec 4, 2025

sieniven added 3 commits December 8, 2025 11:18

Merge branch 'main' of github.com:flashbots/op-rbuilder into niven/fi…

668bb49

…x-state-root

Fix resolving payload on no tx pool flag (#45)

73f16ee

Resolve on error (#46)

148c8f5

SozinM reviewed Dec 17, 2025

View reviewed changes

fix: Add state root calculation on payload resolve #334

Are you sure you want to change the base?

fix: Add state root calculation on payload resolve #334

Conversation

sieniven commented Nov 27, 2025

📝 Summary

✅ I have completed the following steps:

Uh oh!

Uh oh!

SozinM Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

sieniven Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

sieniven Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

SozinM Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

SozinM Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

sieniven Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SozinM commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SozinM commented Nov 27, 2025

Uh oh!

sieniven commented Nov 27, 2025

Uh oh!

sieniven commented Nov 27, 2025

Uh oh!

sieniven commented Nov 27, 2025

Uh oh!

SozinM commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sieniven commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SozinM commented Dec 1, 2025

Uh oh!

sieniven commented Dec 1, 2025

Uh oh!

SozinM commented Dec 1, 2025

Uh oh!

sieniven commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SozinM commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sieniven commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sieniven commented Dec 4, 2025

Uh oh!

sieniven commented Dec 17, 2025

Uh oh!

SozinM Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SozinM commented Dec 17, 2025

Uh oh!

sieniven commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SozinM commented Dec 17, 2025

Uh oh!

SozinM commented Dec 17, 2025

Uh oh!

SozinM commented Nov 27, 2025 •

edited

Loading

SozinM commented Nov 27, 2025 •

edited

Loading

sieniven commented Dec 1, 2025 •

edited

Loading

sieniven commented Dec 1, 2025 •

edited

Loading

SozinM commented Dec 1, 2025 •

edited

Loading

sieniven commented Dec 2, 2025 •

edited

Loading

SozinM Dec 17, 2025 •

edited

Loading

sieniven commented Dec 17, 2025 •

edited

Loading

sieniven commented Dec 17, 2025 •

edited

Loading

sieniven commented Dec 17, 2025 •

edited

Loading

SozinM commented Dec 17, 2025 •

edited

Loading

sieniven commented Dec 17, 2025 •

edited

Loading