Skip to content

Conversation

@sieniven
Copy link

📝 Summary

  • Context of the full issue: Cache miss on engine_newPayload to builder causes slow payload validation #333
  • When flashblocks stateroots calculation flag is switched to false, because state roots are never calculated in payloads, on payload resolving, engine api request to insert locally executed blocks will insert blocks without state roots
  • This eventually leads to cache misses when engine_newPayload is called on the op-rbuilder from rollup-boost, which leads to the full validation of this payload (even though they were locally built by the builder), including the re-execution of the payload transactions and subsequent state root calculation
  • This PR adds the additional stateroot calculation triggered on payload resolve, ensuring that we do not take up sequencing time re-executing the payload
  • Note that state root calculation regardless will block the sequencing time - however, rollup boost will obtain the cached incrementally built builder payload anyways and trigger the payload resolve asynchronously. This should be optimal when integrated with flashblocks

✅ I have completed the following steps:

  • [✅ ] Run make lint
  • [✅ ] Run make test
  • Added tests (if applicable)

@sieniven sieniven force-pushed the niven/fix-state-root branch from 5d86917 to ba5ada0 Compare November 27, 2025 05:00
ctx = ctx.with_cancel(fb_cancel).with_extra_ctx(next_flashblocks_ctx);
},
_ = block_cancel.cancelled() => {
self.resolve_best_payload(&mut state, &ctx, &mut info, &best_payload).await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not guaranteed that rollup-boost will cancel building (send engine_getPayload)
I think correct way would be to calculate state root asynchronously after we sent last flashblock

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why it is placed here is because on the event where rollup-boost does not send the payload resolve, on a subsequent FCU, the payload will be resolved regardless.

Also, I placed it on resolve because we are keen in exploring using the flashblock payload builder directly on a custom reth node without the need to use rollup-boost. Maybe this implementation is better in terms of de-coupling with rollup boost service?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it is hard to deterministically say for sure which is the last flashblock. As observed during our deployment, it seems like on the builder, on most cases, the payload building early cancels which results in the payload not ever reaching the target flashblock. This seems to me to be the most deterministic way of ensuring locally built payloads contain a state root.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, you need to use --flashblock.leeway-time
And for external state root it should be considerable as usually it used in cases when it's slow to calculate state root, so you need some extra time to do this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for decoupling
I think some assumptions behind state root calculation weren't designed for this case, so it's better to give it a good thought on how to change it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/flashbots/op-rbuilder/pull/334/files#diff-20125ce86c70cfd520b09bc61e2032873f80249afe0027ee76dfbfcb503e7fb7R535-R542 - This issue should be resolved in the attached code block. We start the state root calculation immediately once the last target flashblock is built.

@SozinM
Copy link
Collaborator

SozinM commented Nov 27, 2025

Hi!
Have you tested this?
Because it seems to me it shouldn't work, as you still don't save executed block

@SozinM
Copy link
Collaborator

SozinM commented Nov 27, 2025

@sieniven could you please attach newPayload latency before and after fix?
I think this would be a good starting point of looking into the deeper

@sieniven sieniven marked this pull request as draft November 27, 2025 10:45
@sieniven
Copy link
Author

@SozinM Hey, thanks and appreciate your time in helping review this :) Converting this PR to draft first because there are still some issues when testing - will re-open it in abit once everything is working on our end.

@sieniven sieniven marked this pull request as ready for review November 27, 2025 13:47
@sieniven
Copy link
Author

@SozinM I have re-opened the PR. From the above discussion thread, I understand there is the leeway time configuration which allows us to early package flashblocks. However on the assumption that we are trying to maximize throughput of the chain, setting too big a leeway time may tradeoff chain max throughput (gas executed per block).

I am trying to think of the most optimal way of doing this - it still seems to me that unless the intent of the builder for flashbots is to always couple it with the rollup boost proxy service, perhaps we could explore the default behavior of the builder to calculate state root on payload resolve.

Giving some context here - we found the issue with the builder sync-ing with tip to have severe effects on the chain's maximum throughput (gas per second) during our stress testing on the flashblocks architecture. It was found that during our stress tests which benchmarks sending ERC20 transactions, the tps of the chain drops significantly and a huge part of the bottleneck was coming from cache misses due to the mismatch in block hash. This PR has been tested on our end to work and resolves the cache miss issue, and the current design is such that the builder will be default calculate state root on payload resolution, which will also nicely help our use case as we intend to use op-rbuilder as a dependency library crate to integrate the flashblock payload builder directly into our node.

Thanks again and appreciate the prompt review for this PR :)

@sieniven
Copy link
Author

@SozinM another consideration to why this PR shouldnt have no impact with rollup boost is because engine_getPayload on rollup boost uses the underlying accumulated flashblocks' cache.

Since the payload resolution on the builder is called async, together with the expensive state root calculation of the unsafe payload with the default EL on FCU, this fix on cache misses to locally built blocks should have little impact on the intended design of rollup boost?

@SozinM
Copy link
Collaborator

SozinM commented Nov 27, 2025

@sieniven Can you please show newPayload graph with and without this fix

@sieniven
Copy link
Author

sieniven commented Dec 1, 2025

@sieniven Can you please show newPayload graph with and without this fix

@SozinM We are onto this. Will provide you with the API response time from the CL soon.

@SozinM
Copy link
Collaborator

SozinM commented Dec 1, 2025

@sieniven reth already provides these metrics, you could set up their op-reth dashboard

@sieniven
Copy link
Author

sieniven commented Dec 1, 2025

@sieniven reth already provides these metrics, you could set up their op-reth dashboard

We used a different approach where we used the metrics on op-node instead. Since that will be a better metric in terms of understanding the impact of this PR? Comparing engine_newPayload without state root calculations on the builder will definitely be significantly slower since we now add state root calculations at the end of the payload resolution.

@SozinM
Copy link
Collaborator

SozinM commented Dec 1, 2025

@sieniven engine_newPayload metric would suffice.
If you solution (in regard to caching) works, than we would see that new_payload latency without fix was high (prob 20+ms), and after fix became low (1-2 ms)

@sieniven
Copy link
Author

sieniven commented Dec 1, 2025

@SozinM Sure, we will generate both on the EL response times and on the CL response times for you, during our stress test environment (which is an ERC20 stress test).

On a separate note, I think there are a few more issues which I have opened on rollup-boost which explains why cache misses will still happen in your current flashblock design:

  1. Rollup-boost occasionally fails to accept new flashblock payload due to failure to handle ordering rollup-boost#449
  2. Rollup-boost occasionally fails to accept flashblock payloads rollup-boost#448

Because the source of truth of the flashblock architecture is on the rollup-boost accumulated flashblock sequence, the consensus layer will always accept this as the unsafe payload (and subsequently calculate the external state root of that payload there). However, we cannot guarantee the state of the accumulated sequence of rollup boost and the builder to be consistent, thus resulting in cache misses on locally built payloads on engine_newPayload to the builder. We have tested and this PR is confirmed to have fixed cache misses on engine_newPayload if the payload state matches (on rollup-boost and on op-rbuilder during payload resolution). However, if the states are mismatched, payloads are still re-validated on the builder which affects the overall chain throughput.

Also, it was noticed on my end that when stress testing, payload re-validations could take more than 2 seconds on the builder. There might be a mutex issue on the payload, but I have yet to really debug this fully. I cant replicate that anymore with the fixes on the state root calculations, but might be worth adding to your todo to investigate that.

@SozinM
Copy link
Collaborator

SozinM commented Dec 1, 2025

Wdym by

locally built payloads

Is it payload built be fallback EL, or empty payload built by CL?

@sieniven
Copy link
Author

sieniven commented Dec 2, 2025

Wdym by

locally built payloads

Is it payload built be fallback EL, or empty payload built by CL?

@SozinM Sorry for the confusion. What i meant is in this specific case, payload is locally built by the op-rbuilder. However, due to the accumulated payload sequence on rollup-boost state being different from the builder, the eventual accepted unsafe payload returned to the op-node sequencer is different from what was actually built on the op-rbuilder (when payload is resolved on rollup-boost). This results in the subsequent engine_newPayload to still cache miss - which is what I meant by this PR not being able to fully resolve the cache miss issue.

@sieniven
Copy link
Author

sieniven commented Dec 4, 2025

@SozinM We have generated the metrics you were asking for. Note that these were generated under stress test situation with full blocks filled with ERC20 transactions from multiple addresses to simulate high chain traffic situation. The metrics captures a 15minute duration, which sufficiently captures the results. Hope this helps.

  • Before state root PR fixes:
image
  • After state root PR fixes:
image Note that there are still occasional cache misses which was the default flashblocks design issue that I was mentioning above in this comment: https://github.com//pull/334#issuecomment-3595341837. Thus this PR still doesnt fully resolve the issue of cache misses (resulting in full validation of locally built payloads), but it significantly reduces this and should help with optimizing throughput of the default flashblocks architecture.

@sieniven sieniven changed the title Fix Add state root calculation on payload resolve fix: Add state root calculation on payload resolve Dec 4, 2025
@sieniven
Copy link
Author

@SozinM hey, any updates for this PR? seems to be dangling for awhile now, would be great if we could get this merged soon. 🙏

pub payload_tx: mpsc::Sender<OpBuiltPayload>,
/// Sender for sending built flashblock payloads to [`PayloadHandler`],
/// which broadcasts outgoing flashblock payloads via p2p.
pub built_fb_payload_tx: mpsc::Sender<OpBuiltPayload>,
Copy link
Collaborator

@SozinM SozinM Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split p2p and main logic in different PRs?

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

If i understand it correctly this PR would break base usage of op-rbuidler, because it takes 200-300ms to calculate state root on base
We shouldn't block resolve with state root calculation when we have zero state root
We could move it into spawned task to calculate state root, so they will race with canonical block update

@sieniven
Copy link
Author

sieniven commented Dec 17, 2025

@SozinM Hmm may I understand how Base implements flashblocks? If their deployment is using the default flashblock architecture (rollup boost + op-rbuilder), this PR on the contrary should optimize their chain throughput by 1.5x, at least that is on our end.

The optimization should come from 2 parts:

  1. without this PR, cache misses would happen 100% of the time on the builder which results in a blocking state root calculation during sequencing time for re-validation when committing new unsafe payloads
  2. In the default architecture, rollup boost triggers payload resolve asynchronously on engine_getPayload, which the returned response is never handled - https://github.com/flashbots/rollup-boost/blob/bec1d42d97229f24fbaa3a7f82c98aa3005b919a/crates/rollup-boost/src/flashblocks/service.rs#L360

Would it be better if we could perhaps jump on a call to further discuss this? Maybe it might be more efficient

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

I think we could simplify this PR, if we would just spawn state root calculation task when if we have zero state root flag set and this is last falshblock or resolve event

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)
Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.
In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

@sieniven
Copy link
Author

sieniven commented Dec 17, 2025

I think we could simplify this PR, if we would just spawn state root calculation task when if we have zero state root flag set and this is last falshblock or resolve event

i think state root calculation is triggered when:

  1. target flashblock building completes, triggers payload resolve
  2. engine_newpayload request occurs, resulting in a payload resolve

We could possibly implement the resolve to be on a separate thread. however, i dont believe this is going to be optimal on relatively empty blocks, since new payload request to commit the unsafe payload from the CL may come in before state root calc finishes. and in fact potentially worsen the chain throughput to what we are seeing now - which is sequencing time is taken 100% of the time to re-validate locally built payloads (re-exec + state root calculation)

@sieniven
Copy link
Author

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)

Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.

In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

@sieniven
Copy link
Author

@SozinM i would really appreciate if we could speed up this back and forth if we could pair reviewing this on a call perhaps? we have already tested this in-depth in production environment and it has optimized our chain throughput by 2x

@sieniven
Copy link
Author

sieniven commented Dec 17, 2025

@sieniven problem with current aproach that it will block resolve until state root is calculate (which could take a lot). In this case if flashblock stream is down (and boost would rely on getPayload) it would cause block misses (because job won't resolve in time)

Can you test it? Just run builder with boost and kill websocket connection/misconfigure it.

In rollup-boost you should see the messages that Error getting fb best payload, falling back on client

also did you read my reply here? #334 (comment). the codeblock from rollup boost in this comment sends an async engine_newPayload in a separate thread, that doesnt block the rollup boost's engine_newPayload handler response time itself

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

No, it will try to get payload from the builder via engine_getPayload

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

Sure, we could set up a call

@sieniven
Copy link
Author

cc/ @akundaz @avalonche

@sieniven
Copy link
Author

sieniven commented Dec 17, 2025

am i understanding somethjng wrongly? on ws connection failures, rollup boost by default will use the payload built by the default EL

No, it will try to get payload from the builder via engine_getPayload

@SozinM am I missing something on your version of rollup boost? rollup-boost handles engine_getPayload through its stateful flashblocks service - and this service inherently holds the source of truth to the accumulated built flashblocks' sequence (via ws as you mentioned), so that cached sequence is returned on engine_getPayload request.

@SozinM
Copy link
Collaborator

SozinM commented Dec 17, 2025

If websocket stream is down rollup-boost will query both builder with engine_getPayload

@sieniven
Copy link
Author

Sure, we could set up a call

@SozinM i will reach out to you through your gmail in abit, and lets set this up. 👍

@sieniven
Copy link
Author

If websocket stream is down rollup-boost will query both builder with engine_getPayload

@SozinM I am just looking at this and i can see how this is a potential blocking vector. If thats the case I do believe we should work to make the state root calculation async.

Ideally, I would hope that we could perhaps make this configurable. What do you think if we hop on the current flashblocks.disable-state-root to instead be our configuration flag to determine whether on payload resolve, we block to calculate state root vs async calculate state root?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants