Reduce dashboard update cycle cost #690
base: multi-session-status
Conversation
relabel() clones the entire Element, triggering deep copies of all Param metadata (kdims, vdims, etc.). Passing label= directly to the HoloViews constructor avoids this overhead entirely. Benchmarks show ~40-47% speedup for LinePlotter and ImagePlotter compute(), and ~23% for Overlay1DPlotter (5 curves). Prompt: Profiling indicates that calling `relabel` in holoviews elements is a substantial fraction of the cost in PlotterCompute; help me understand why, and whether this can be avoided, either by setting the label directly, or potentially reusing objects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
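A minimal sketch of the difference, using illustrative data rather than the actual plotter code:

```python
import numpy as np
import holoviews as hv

xs = np.linspace(0.0, 1.0, 100)
ys = np.sin(2 * np.pi * xs)

# Before: relabel() returns a clone of the Element, deep-copying its Param
# metadata (kdims, vdims, ...) just to change the label.
curve_slow = hv.Curve((xs, ys)).relabel('detector_panel_0')

# After: passing label= at construction sets the same label without a clone.
curve_fast = hv.Curve((xs, ys), label='detector_panel_0')

assert curve_slow.label == curve_fast.label
```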
The epoch and timezone offset scalars are identical across calls, so compute them once and reuse. This also collapses two array-wide additions into one per extract() call. Subsequent calls take a fast path that skips dtype, unit, and coord-existence checks since they are implied by the cached _time_origin. Prompt: Benchmarks indicate substantial cost is spent in extractors.py, mostly in _ensure_datetime_coord, which I think could easily be avoided since most of it is "static" and the same for every call? Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
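A rough sketch of the caching idea; the real extractor works on scipp DataArrays and the class, attribute, and method names here are illustrative:

```python
import numpy as np

class TimeOriginCache:
    """Illustrative only: fold the epoch and the timezone offset into one
    cached scalar so each call needs a single array-wide addition."""

    def __init__(self) -> None:
        self._time_origin = None  # cached epoch + timezone offset

    def to_local_datetime(self, elapsed_ns: np.ndarray) -> np.ndarray:
        if self._time_origin is None:
            # Slow path, taken once: dtype/unit/coord checks would live here.
            epoch = np.datetime64('1970-01-01T00:00:00', 'ns')
            utc_offset = np.timedelta64(3600, 's')  # placeholder local offset
            self._time_origin = epoch + utc_offset
        # Fast path: one addition per call instead of two.
        return self._time_origin + elapsed_ns.astype('timedelta64[ns]')
```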
The data from TemporalBuffer is in chronological order, so start_time[0] and end_time[-1] give the same result as min/max without scanning the arrays. Prompt: The min and max from _ensure_datetime_coord also takes substantial time. Could we just use elements 0 and -1? Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
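A small sanity check of the equivalence, assuming ordered data as TemporalBuffer guarantees:

```python
import numpy as np

times = np.arange(1_000_000)  # chronologically ordered timestamps

# Same values, but indexing is O(1) while min()/max() scan the whole array.
assert times[0] == times.min()
assert times[-1] == times.max()
```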
Only call `session_layer.update_pipe()` for layers on the currently active grid tab. Since `dynamic=True` on Tabs means hidden tabs have no materialized Bokeh models, sending data to them is wasted work (88% of poll cycle time per profiling). Skipped layers keep their dirty flag set; on tab switch the next poll cycle sends the latest cached state. Also add `peek_grid()` to PlotOrchestrator, which returns the internal grid without `copy.deepcopy()`. The poll method only reads grid config, so the defensive copy was unnecessary overhead. Prompt: Implement the following plan: Skip pipe.send for hidden tabs
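A sketch of the skip logic; the attribute names (`_session_layers`, `dirty`) are hypothetical stand-ins for the real ones:

```python
def poll(self) -> None:
    """Sketch only: push data only to layers on the active grid tab."""
    active_grid_id = self._get_active_grid_id()
    for grid_id, layers in self._session_layers.items():
        for layer in layers:
            if not layer.dirty:
                continue
            if grid_id != active_grid_id:
                # Hidden tab (dynamic=True means no Bokeh models exist).
                # Keep the dirty flag so the first poll after a tab switch
                # pushes the latest cached state.
                continue
            layer.update_pipe()
            layer.dirty = False
```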
The modal overlay obscures all plots, making pipe updates wasted rendering work. Reuses the existing hidden-tab mechanism by forcing active_grid_id to None when the modal is open. Dirty flags are preserved so the first poll after the modal closes pushes the latest cached state. Prompt: The latest commit skips plot update in hidden tabs. Could we do something similar for when a plot config modal dialog is opened (either to add a new plot or to reconfigure an existing one)?
Wrap the periodic update body in doc.models.freeze() alongside the existing pn.io.hold(). This collapses N redundant recompute() calls into a single one at the end of the cycle. Each pipe.send() and layout mutation (e.g. updating widget lists) triggers HoloViews' hold_render decorator, which wraps the plot update in doc.models.freeze(). When multiple such operations run in sequence, each one's unfreeze reaches zero and triggers recompute() — a full BFS traversal of every Bokeh model in the document via collect_models(), costing O(M) where M is the total model count. With ~1500 models in a typical dashboard session, each recompute takes ~90 ms. For 4 active cells per update cycle this means ~350 ms (44% of the cycle) spent in _pop_freeze → recompute → collect_models, with the first N-1 results immediately invalidated by the next operation. Since Bokeh's freeze is reentrant (counter-based), adding an outer freeze() keeps the counter above zero so inner freeze/unfreeze cycles are no-ops. Only the outermost exit triggers recompute, once. Expected saving: (N-1)/N of the _pop_freeze cost (~75% for N=4). Also adds an early return with a warning log if pn.state.curdoc is None, which is a defensive guard for the doc.models.freeze() call. See docs/developer/plans/bokeh-freeze-batching.md for the full analysis. Prompt: Read bokeh-freeze-batching.md plan and implement the fix. Followed by investigation of why pn.io.hold doesn't use freeze internally, prep of a Discourse post for the Panel maintainers, and review of the staged implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
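The reentrancy that makes this safe can be seen on a bare Bokeh document (assuming Bokeh 3.x, where `Document.models.freeze()` is the counter-based context manager the commit relies on):

```python
from bokeh.document import Document
from bokeh.models import ColumnDataSource

doc = Document()
doc.add_root(ColumnDataSource(data={'x': [1, 2, 3]}))

# The inner with-blocks stand in for the freeze/unfreeze done by each
# pipe.send(); while the outer freeze holds the counter above zero they
# do not trigger recompute().
with doc.models.freeze():
    with doc.models.freeze():
        pass  # counter 2 -> 1 on exit, no recompute
    with doc.models.freeze():
        pass  # still no recompute
# Counter reaches zero here: a single recompute() for the whole batch.
```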
Force-pushed from f8963c4 to c3af8bc
The stale session cleanup path (heartbeat timeout) called updater.cleanup() but never stopped the periodic callback, since it had no reference to it. The callback was only stopped via pn.state.on_session_destroyed(), which doesn't fire on browser crashes or network disconnects. The orphaned callback kept firing after the session document was gone, and doc.models.freeze() (added in c3af8bc) triggered Panel's busy counter watchers which accessed state.curdoc.session_context on a None curdoc, producing an AttributeError. Fix: store the periodic callback on SessionUpdater so cleanup() can stop it. Both cleanup paths (normal session destroy and stale cleanup) already call updater.cleanup(). Prompt: "Sometimes we get this exception in the logs. Was it introduced in this branch or already in the base (multi-session): ERROR:bokeh.util.tornado:Error thrown from periodic callback: AttributeError: 'NoneType' object has no attribute 'session_context'" Follow-up: "yes, fix it"
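A sketch of the fix, with hypothetical attribute names; the key point is that both cleanup paths go through the same `cleanup()`:

```python
import panel as pn

class SessionUpdater:
    """Sketch only: keep a handle to the periodic callback so cleanup()
    can stop it on the normal path and on the stale-session path alike."""

    def __init__(self) -> None:
        self._periodic_callback = None

    def start(self, period_ms: int = 1000) -> None:
        self._periodic_callback = pn.state.add_periodic_callback(
            self._update, period=period_ms
        )

    def cleanup(self) -> None:
        if self._periodic_callback is not None:
            self._periodic_callback.stop()
            self._periodic_callback = None

    def _update(self) -> None:
        ...  # periodic update body
```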
… return
The early return when pn.state.curdoc is None prevented custom handlers and notifications from running. Replace with a _batched_update context manager that conditionally applies the freeze optimization only when a document exists. Prompt: "Should we avoid the early return to fix the test? And just not call doc.models.freeze if doc is None?"
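A sketch of the resulting context manager; the real one lives on SessionUpdater and the exact names may differ:

```python
from contextlib import contextmanager

import panel as pn

@contextmanager
def _batched_update():
    """Hold Panel events for the whole cycle and, when a document exists,
    add the outer Bokeh freeze so inner freeze/unfreeze cycles are no-ops."""
    doc = pn.state.curdoc
    with pn.io.hold():
        if doc is not None:
            with doc.models.freeze():
                yield
        else:
            # No document (e.g. in tests): custom handlers and notifications
            # still run, just without the freeze optimization.
            yield
```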
def plot(self, data: sc.DataArray, data_key: ResultKey, **kwargs) -> hv.Rectangles:
def plot(
    self, data: sc.DataArray, data_key: ResultKey, *, label: str = '', **kwargs
label is missing in the docstring. (And probably elsewhere.) Here in particular, can you add an explanation of why the argument exists but is unused?
@@ -0,0 +1,95 @@
# SPDX-License-Identifier: BSD-3-Clause
# Copyright (c) 2025 Scipp contributors (https://github.com/scipp)
Suggested change:
- # Copyright (c) 2025 Scipp contributors (https://github.com/scipp)
+ # Copyright (c) 2026 Scipp contributors (https://github.com/scipp)
# Dirty flags are preserved; the first poll after the modal closes will
# push the latest cached state.
modal_is_open = self._current_modal is not None
active_grid_idx = self._tabs.active - self._static_tabs_count
This looks fishy. Why is the index this difference of counts? (I am assuming self._tabs.active is the number of active tabs.) Also, the if 0 <= active_grid_idx < len(grid_keys) check below makes me think that this difference doesn't always correspond to an index.
    grid_keys[active_grid_idx]
    if 0 <= active_grid_idx < len(grid_keys) and not modal_is_open
    else None
)
Can you move all of the above into a separate function? It is logically separate from the code below.
if dim not in data.coords:
    return data
Should this check be swapped with the other if above? That would more closely correspond to the old code.
Swapped now, but _time_origin would never have been set, so the check seems redundant, right?
Hard to say from this class alone, because it depends on the inputs of __init__ and extract.
# the document via _pop_freeze → recompute → collect_models, at O(M)
# cost. The outer freeze keeps the counter above zero so that inner
# freeze/unfreeze cycles (e.g. HoloViews hold_render) are no-ops,
# collapsing N recomputes into 1.
This comment references the contents of _batched_update and so should be in that function instead of here.
- Add `label` param documentation to ROI readback plotter docstrings, explaining why it exists but is unused.
- Fix copyright year (2025 → 2026) in plotter_compute_benchmark.py.
- Extract active grid ID determination into a `_get_active_grid_id()` method with a clear docstring explaining the tab-index-to-grid-index mapping (a sketch follows below).
- Swap check order in `FullHistoryExtractor._to_local_datetime` so `dim not in data.coords` is checked before using the cached time origin, matching the original guard order.
- Move the freeze/hold explanation comment from `_update` into the `_batched_update` docstring where it belongs.

Prompt: "Please address review comments on the PR. Discuss if unclear how." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
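For reference, a sketch of what `_get_active_grid_id()` might look like after the extraction, reconstructed from the diff fragments quoted above; exact attribute names may differ:

```python
def _get_active_grid_id(self):
    """Map the selected tab to a grid key, or None if no grid is visible.

    self._tabs.active is the index of the currently selected tab; the first
    self._static_tabs_count tabs are static (non-grid) tabs, so subtracting
    gives an index into the ordered grid keys. A negative or out-of-range
    index means a static tab is selected, and an open modal hides all plots.
    """
    grid_keys = list(self._orchestrator.peek_grid())
    modal_is_open = self._current_modal is not None
    active_grid_idx = self._tabs.active - self._static_tabs_count
    return (
        grid_keys[active_grid_idx]
        if 0 <= active_grid_idx < len(grid_keys) and not modal_is_open
        else None
    )
```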
@jl-wynen Thanks for the useful comments, addressed in the latest commit!
Background-thread status callbacks from JobService and ServiceRegistry caused ~2x recompute calls vs baseline. Each callback was serialized to MainThread via next_tick_callback with its own freeze/unfreeze cycle, producing separate recompute calls outside the batched periodic update. Convert all three affected widgets (JobStatusListWidget, WorkflowStatusWidget, BackendStatusWidget) to the polling model already used by SessionStatusWidget: register a refresh method as a custom handler with SessionUpdater, so all status updates run inside the periodic _batched_update() context (pn.io.hold + doc.models.freeze). No version counters are needed — the widgets already have internal diffing or set-on-every-cycle property updates that Panel handles as no-ops when unchanged. Also removes the subscriber/notification machinery from JobService and ServiceRegistry, fixing the callback cleanup gap (no unregister mechanism = leaked callbacks from dead sessions), and removes the temporary LIVEDATA_DISABLE_STATUS_CALLBACKS diagnostic env var. Prompt: Read status-callback-performance-investigation.md - we need to continue this investigation and develop a complete and deep understanding of everything involved. Follow-up: Do we actually need the complication of version counters in all cases? Updates in JobService are low frequency (mainly heartbeat timestamp), if we make the polling low frequency we can just always set the latest state as the widget value.
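A sketch of the polling pattern; the handler-registration and service method names here (`register_custom_handler`, `job_statuses`) are hypothetical:

```python
class JobStatusListWidget:
    """Sketch only: refresh from the periodic update instead of from
    background-thread callbacks."""

    def __init__(self, job_service, session_updater) -> None:
        self._job_service = job_service
        # Runs inside SessionUpdater._batched_update() on every poll cycle.
        session_updater.register_custom_handler(self.refresh)

    def refresh(self) -> None:
        # Setting the same value again is a no-op for Panel, so no version
        # counter is needed for this low-frequency state.
        self._apply(self._job_service.job_statuses())

    def _apply(self, statuses) -> None:
        ...  # update widget properties / internal diffing
```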
WorkflowStatusWidget previously subscribed to JobOrchestrator lifecycle events (on_staged_changed, on_workflow_committed, on_workflow_stopped) via WidgetLifecycleCallbacks. These callbacks fired synchronously from the triggering session and rebuilt ALL subscribed widgets across ALL sessions outside any batched context, causing unbatched recomputes and cross-session Bokeh model mutations. Add a version counter to WorkflowState that increments on every state mutation. The widget's refresh() method (already called inside SessionUpdater._batched_update()) checks the version and triggers a full rebuild only when it changes. This ensures all widget updates run in the correct session context with batched recomputation. Removed: WidgetLifecycleCallbacks, WidgetLifecycleSubscriptionId, subscribe_to_widget_lifecycle(), unsubscribe_from_widget_lifecycle(), _widget_subscriptions dict, cleanup() on WorkflowStatusWidget. Added: WorkflowState.version field, get_workflow_state_version() getter, version tracking in WorkflowStatusWidget.refresh(). Prompt: Implement the following plan: Convert WorkflowStatusWidget Lifecycle Callbacks to Version-Based Polling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
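A sketch of the version-counter pattern described above, with names simplified from the commit message:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    version: int = 0
    workflows: dict = field(default_factory=dict)

    def commit_workflow(self, name: str, config: dict) -> None:
        self.workflows[name] = config
        self.version += 1  # every mutation bumps the version

class WorkflowStatusWidget:
    def __init__(self, state: WorkflowState) -> None:
        self._state = state
        self._seen_version = -1

    def refresh(self) -> None:
        # Called from SessionUpdater._batched_update(); rebuild only when the
        # state actually changed since the last poll.
        if self._state.version == self._seen_version:
            return
        self._seen_version = self._state.version
        self._rebuild()

    def _rebuild(self) -> None:
        ...  # full rebuild of the widget contents
```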
Summary
This branch targets the main performance bottlenecks in the dashboard's periodic update cycle, identified through profiling.
Data preparation optimizations reduce the per-update cost of converting and extracting time-series data:
- cb6e9bff Avoid hv.Element.relabel() — relabel() clones the entire Element (deep-copying all Param metadata). Passing label= at construction avoids this, yielding ~40-47% speedup for LinePlotter/ImagePlotter.compute() and ~23% for Overlay1DPlotter.
- 99697faf Cache time origin in FullHistoryExtractor — the epoch + timezone offset are identical across calls, so compute once and fast-path subsequent calls by skipping dtype/unit/coord checks.
- b0a2f831 Use indexing instead of min/max for time bounds — data from TemporalBuffer is chronologically ordered, so [0]/[-1] replaces full-array scans.

Rendering optimizations eliminate wasted Bokeh work when the user can't see the result:
- b1777172 Skip pipe.send for hidden tabs — with dynamic=True on Tabs, hidden tabs have no Bokeh models; sending data to them was 88% of poll cycle time per profiling. Also adds peek_grid() to avoid an unnecessary deepcopy.
- bb184963 Skip pipe.send while config modal is open — the modal obscures all plots; reuses the hidden-tab mechanism.
- f8963c42 Batch Bokeh model-graph recomputation — wraps the entire update body in doc.models.freeze(). Each pipe.send() triggers recompute() (BFS over all ~1500 models, ~90 ms each). Batching collapses N recomputes into 1, saving ~(N-1)/N of the cost (~75% for 4 active cells).

Further optimization
Two things need to be explored more: