@SimonHeybrock SimonHeybrock commented Feb 9, 2026

Summary

This branch targets the main performance bottlenecks in the dashboard's periodic update cycle, identified through profiling.

Data preparation optimizations reduce the per-update cost of converting and extracting time-series data:

  • cb6e9bff Avoid hv.Element.relabel() — relabel() clones the entire Element (deep-copying all Param metadata). Passing label= at construction avoids this, yielding a ~40-47% speedup for LinePlotter/ImagePlotter.compute() and ~23% for Overlay1DPlotter.
  • 99697faf Cache time origin in FullHistoryExtractor — the epoch + timezone offset are identical across calls, so compute once and fast-path subsequent calls by skipping dtype/unit/coord checks.
  • b0a2f831 Use indexing instead of min/max for time bounds — data from TemporalBuffer is chronologically ordered, so [0]/[-1] replaces full-array scans.

Rendering optimizations eliminate wasted Bokeh work when the user can't see the result:

  • b1777172 Skip pipe.send for hidden tabs — with dynamic=True on Tabs, hidden tabs have no Bokeh models; sending data to them was 88% of poll cycle time per profiling. Also adds peek_grid() to avoid an unnecessary deepcopy.
  • bb184963 Skip pipe.send while the config modal is open — the modal obscures all plots; this reuses the hidden-tab mechanism.
  • f8963c42 Batch Bokeh model-graph recomputation — wraps the entire update body in doc.models.freeze(). Each pipe.send() triggers recompute() (BFS over all ~1500 models, ~90 ms each). Batching collapses N recomputes into 1, saving ~(N-1)/N of the cost (~75% for 4 active cells).

Further optimization

Two things remain to be explored:

  • Tune the update cycle and sleep time of the background thread and the session periodic callbacks.
  • Try free-threaded Python builds, provided all dependencies support them.

SimonHeybrock and others added 6 commits February 9, 2026 11:20
relabel() clones the entire Element, triggering deep copies of all
Param metadata (kdims, vdims, etc.). Passing label= directly to
the HoloViews constructor avoids this overhead entirely.

Benchmarks show ~40-47% speedup for LinePlotter and ImagePlotter
compute(), and ~23% for Overlay1DPlotter (5 curves).
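The pattern can be illustrated with a stdlib-only stand-in (this is a sketch, not the real HoloViews API — `Element`, `metadata`, and the deep copy are stand-ins for hv.Element and its Param metadata):

```python
import copy

class Element:
    """Stand-in for hv.Element: relabel() clones the whole object,
    deep-copying all metadata, while label= at construction is free."""

    def __init__(self, data, label=''):
        self.data = data
        self.label = label
        # stand-in for Param metadata (kdims, vdims, ...)
        self.metadata = {'kdims': ['x'], 'vdims': ['y']}

    def relabel(self, label):
        clone = copy.deepcopy(self)  # the expensive part
        clone.label = label
        return clone

slow = Element([1, 2, 3]).relabel('counts')  # constructs, then clones
fast = Element([1, 2, 3], label='counts')    # single construction
assert slow.label == fast.label == 'counts'
```

Both paths end in an identical object; only the relabel path pays for a full deep copy.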

Prompt: Profiling indicates that calling `relabel` in holoviews elements
is a substantial fraction of the cost in PlotterCompute; help me
understand why, and whether this can be avoided, either by setting the
label directly, or potentially reusing objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The epoch and timezone offset scalars are identical across calls, so
compute them once and reuse. This also collapses two array-wide
additions into one per extract() call.

Subsequent calls take a fast path that skips dtype, unit, and
coord-existence checks since they are implied by the cached
_time_origin.
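A minimal sketch of the caching idea (illustrative names, not the real extractor API): the epoch and timezone offset are combined once, so later calls take a single addition per element and skip validation.

```python
class TimeOriginCache:
    """Sketch of the cached time-origin fast path."""

    def __init__(self, epoch_ns, tz_offset_ns):
        self._epoch_ns = epoch_ns
        self._tz_offset_ns = tz_offset_ns
        self._time_origin = None  # computed once, reused thereafter

    def to_local(self, times_ns):
        if self._time_origin is None:
            # Slow path, first call only: dtype/unit/coord-existence
            # checks would run here, then the origin is cached.
            self._time_origin = self._epoch_ns + self._tz_offset_ns
        # Fast path: one addition per element instead of two.
        return [t + self._time_origin for t in times_ns]

cache = TimeOriginCache(epoch_ns=1_000, tz_offset_ns=3_600)
assert cache.to_local([0, 10]) == [4_600, 4_610]
```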

Prompt: Benchmarks indicate substantial cost is spent in extractors.py,
mostly in _ensure_datetime_coord, which I think could easily be avoided
since most of it is "static" and the same for every call?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The data from TemporalBuffer is in chronological order, so
start_time[0] and end_time[-1] give the same result as min/max
without scanning the arrays.
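The invariant in one line: for sorted data the endpoints are the extrema, so O(1) indexing replaces an O(n) scan.

```python
# TemporalBuffer output is chronologically ordered, so the first and
# last elements equal the min and max without scanning the array.
times = [100, 250, 400, 900]  # already sorted by time
start, end = times[0], times[-1]
assert (start, end) == (min(times), max(times)) == (100, 900)
```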

Prompt: The min and max from _ensure_datetime_coord also takes
substantial time. Could we just use elements 0 and -1?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only call `session_layer.update_pipe()` for layers on the currently
active grid tab. Since `dynamic=True` on Tabs means hidden tabs have
no materialized Bokeh models, sending data to them is wasted work
(88% of poll cycle time per profiling). Skipped layers keep their
dirty flag set; on tab switch the next poll cycle sends the latest
cached state.

Also add `peek_grid()` to PlotOrchestrator, which returns the internal
grid without `copy.deepcopy()`. The poll method only reads grid config,
so the defensive copy was unnecessary overhead.
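A sketch of the poll loop with illustrative names (`Layer`, `grid_id`, `dirty` are stand-ins for the session-layer bookkeeping): only layers on the active tab are pushed; hidden layers keep their dirty flag so the first poll after a tab switch sends the latest cached state.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    grid_id: str
    dirty: bool = True
    sends: int = 0

    def update_pipe(self):
        self.sends += 1

def poll(layers, active_grid_id):
    for layer in layers:
        if not layer.dirty:
            continue
        if layer.grid_id != active_grid_id:
            continue  # hidden tab: no Bokeh models, sending is wasted
        layer.update_pipe()
        layer.dirty = False  # cleared only once actually sent

layers = [Layer('a'), Layer('b')]
poll(layers, active_grid_id='a')
assert layers[0].sends == 1 and not layers[0].dirty
assert layers[1].sends == 0 and layers[1].dirty  # deferred, still dirty
```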

Prompt: Implement the following plan: Skip pipe.send for hidden tabs

The modal overlay obscures all plots, making pipe updates wasted
rendering work. Reuses the existing hidden-tab mechanism by forcing
active_grid_id to None when the modal is open. Dirty flags are
preserved so the first poll after the modal closes pushes the latest
cached state.

Prompt: The latest commit skips plot update in hidden tabs. Could we do
something similar for when a plot config modal dialog is opened (either
to add a new plot or to reconfigure an existing one)?

Wrap the periodic update body in doc.models.freeze() alongside the
existing pn.io.hold(). This collapses N redundant recompute() calls
into a single one at the end of the cycle.

Each pipe.send() and layout mutation (e.g. updating widget lists)
triggers HoloViews' hold_render decorator, which wraps the plot update
in doc.models.freeze(). When multiple such operations run in sequence,
each one's unfreeze reaches zero and triggers recompute() — a full BFS
traversal of every Bokeh model in the document via collect_models(),
costing O(M) where M is the total model count.

With ~1500 models in a typical dashboard session, each recompute takes
~90 ms. For 4 active cells per update cycle this means ~350 ms (44% of
the cycle) spent in _pop_freeze → recompute → collect_models, with the
first N-1 results immediately invalidated by the next operation.

Since Bokeh's freeze is reentrant (counter-based), adding an outer
freeze() keeps the counter above zero so inner freeze/unfreeze cycles
are no-ops. Only the outermost exit triggers recompute, once. Expected
saving: (N-1)/N of the _pop_freeze cost (~75% for N=4).
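The counter mechanics can be demonstrated with a minimal stand-in for Bokeh's counter-based freeze (`Models` here is a sketch, not the real `DocumentModelManager`):

```python
from contextlib import contextmanager

class Models:
    """Stand-in for a counter-based, reentrant freeze."""

    def __init__(self):
        self._freeze_count = 0
        self.recomputes = 0

    @contextmanager
    def freeze(self):
        self._freeze_count += 1
        try:
            yield
        finally:
            self._freeze_count -= 1
            if self._freeze_count == 0:
                self.recomputes += 1  # the expensive O(M) recompute()

models = Models()
with models.freeze():          # outer freeze added around the update body
    for _ in range(4):         # 4 pipe.send() calls, each freezing inside
        with models.freeze():
            pass               # counter stays above zero: no recompute
assert models.recomputes == 1  # N inner cycles collapse into one
```

Without the outer freeze, each of the four inner exits would hit zero and trigger its own recompute.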

Also adds an early return with a warning log if pn.state.curdoc is None,
which is a defensive guard for the doc.models.freeze() call.

See docs/developer/plans/bokeh-freeze-batching.md for the full analysis.

Prompt: Read bokeh-freeze-batching.md plan and implement the fix. Followed
by investigation of why pn.io.hold doesn't use freeze internally, prep
of a Discourse post for the Panel maintainers, and review of the staged
implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@SimonHeybrock force-pushed the multi-session-optimization branch from f8963c4 to c3af8bc on February 9, 2026 11:21
The stale session cleanup path (heartbeat timeout) called
updater.cleanup() but never stopped the periodic callback, since it
had no reference to it. The callback was only stopped via
pn.state.on_session_destroyed(), which doesn't fire on browser
crashes or network disconnects.

The orphaned callback kept firing after the session document was gone,
and doc.models.freeze() (added in c3af8bc) triggered Panel's busy
counter watchers which accessed state.curdoc.session_context on a
None curdoc, producing an AttributeError.

Fix: store the periodic callback on SessionUpdater so cleanup() can
stop it. Both cleanup paths (normal session destroy and stale cleanup)
already call updater.cleanup().
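A sketch of the fix (names illustrative): keeping a reference to the periodic callback on the updater lets every cleanup path stop it, including stale-session cleanup where on_session_destroyed never fires.

```python
class FakeCallback:
    """Stand-in for a Bokeh/Panel periodic callback handle."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class SessionUpdater:
    def __init__(self):
        self.periodic_callback = None  # set when the callback is scheduled

    def cleanup(self):
        # Stops the callback regardless of which path invoked cleanup.
        if self.periodic_callback is not None:
            self.periodic_callback.stop()
            self.periodic_callback = None

updater = SessionUpdater()
updater.periodic_callback = FakeCallback()
cb = updater.periodic_callback
updater.cleanup()
assert cb.stopped and updater.periodic_callback is None
```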

Prompt: "Sometimes we get this exception in the logs. Was it introduced
in this branch or already in the base (multi-session):
ERROR:bokeh.util.tornado:Error thrown from periodic callback:
AttributeError: 'NoneType' object has no attribute 'session_context'"
Follow-up: "yes, fix it"

… return

The early return when pn.state.curdoc is None prevented custom handlers
and notifications from running. Replace with a _batched_update context
manager that conditionally applies the freeze optimization only when a
document exists.
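The conditional context manager might look like this (a sketch under the assumption that `doc.models.freeze()` is usable as a context manager, as described above):

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def _batched_update(doc):
    """Apply the freeze optimization only when a document exists, so
    custom handlers and notifications still run without one."""
    ctx = doc.models.freeze() if doc is not None else nullcontext()
    with ctx:
        yield  # custom handlers, notifications, pipe sends run here

ran = []
with _batched_update(None):  # no document: no freeze, body still runs
    ran.append('handlers')
assert ran == ['handlers']
```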

Prompt: "Should we avoid the early return to fix the test? And just not
call doc.models.freeze if doc is None?"

def plot(self, data: sc.DataArray, data_key: ResultKey, **kwargs) -> hv.Rectangles:
def plot(
self, data: sc.DataArray, data_key: ResultKey, *, label: str = '', **kwargs

`label` is missing in the docstring. (And probably elsewhere.) Here in particular, can you add an explanation of why the argument exists but is unused?

@@ -0,0 +1,95 @@
# SPDX-License-Identifier: BSD-3-Clause
# Copyright (c) 2025 Scipp contributors (https://github.com/scipp)

Suggested change
# Copyright (c) 2025 Scipp contributors (https://github.com/scipp)
# Copyright (c) 2026 Scipp contributors (https://github.com/scipp)

# Dirty flags are preserved; the first poll after the modal closes will
# push the latest cached state.
modal_is_open = self._current_modal is not None
active_grid_idx = self._tabs.active - self._static_tabs_count

This looks fishy. Why is the index this difference of counts? (I am assuming `self._tabs.active` is the number of active tabs.) Also, the `if 0 <= active_grid_idx < len(grid_keys)` check below makes me think that this difference doesn't always correspond to an index.

grid_keys[active_grid_idx]
if 0 <= active_grid_idx < len(grid_keys) and not modal_is_open
else None
)

Can you move all of the above into a separate function? It is logically separate from the code below.

Comment on lines +126 to +127
if dim not in data.coords:
return data

Should this check be swapped with the other if above? That would more closely correspond to the old code.


Swapped now, but _time_origin would never have been set, so the check seems redundant, right?


Hard to say from this class alone because this depends on inputs of __init__ and extract.

# the document via _pop_freeze → recompute → collect_models, at O(M)
# cost. The outer freeze keeps the counter above zero so that inner
# freeze/unfreeze cycles (e.g. HoloViews hold_render) are no-ops,
# collapsing N recomputes into 1.

This comment references the contents of _batched_update and so should be in that function instead of here.

- Add `label` param documentation to ROI readback plotter docstrings,
  explaining why it exists but is unused.
- Fix copyright year (2025 → 2026) in plotter_compute_benchmark.py.
- Extract active grid ID determination into `_get_active_grid_id()` method
  with clear docstring explaining the tab-index-to-grid-index mapping.
- Swap check order in `FullHistoryExtractor._to_local_datetime` so
  `dim not in data.coords` is checked before using the cached time origin,
  matching the original guard order.
- Move freeze/hold explanation comment from `_update` into the
  `_batched_update` docstring where it belongs.

Prompt: "Please address review comments on the PR. Discuss if unclear how."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@SimonHeybrock

@jl-wynen Thanks for the useful comments, addressed in the latest commit!

SimonHeybrock and others added 2 commits February 10, 2026 07:53
Background-thread status callbacks from JobService and ServiceRegistry
caused ~2x recompute calls vs baseline. Each callback was serialized to
MainThread via next_tick_callback with its own freeze/unfreeze cycle,
producing separate recompute calls outside the batched periodic update.

Convert all three affected widgets (JobStatusListWidget,
WorkflowStatusWidget, BackendStatusWidget) to the polling model already
used by SessionStatusWidget: register a refresh method as a custom
handler with SessionUpdater, so all status updates run inside the
periodic _batched_update() context (pn.io.hold + doc.models.freeze).

No version counters are needed — the widgets already have internal
diffing or set-on-every-cycle property updates that Panel handles as
no-ops when unchanged.
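The polling model in miniature (illustrative names — `register_custom_handler` and `periodic_update` are stand-ins for the SessionUpdater API): widgets register a refresh handler, and all handlers run inside the one batched periodic update that pn.io.hold + doc.models.freeze would wrap.

```python
class SessionUpdaterSketch:
    def __init__(self):
        self._custom_handlers = []

    def register_custom_handler(self, handler):
        self._custom_handlers.append(handler)

    def periodic_update(self):
        # In the real code this loop runs inside _batched_update();
        # widget-internal diffing makes unchanged updates no-ops.
        for handler in self._custom_handlers:
            handler()

updater = SessionUpdaterSketch()
calls = []
updater.register_custom_handler(lambda: calls.append('job_status'))
updater.register_custom_handler(lambda: calls.append('workflow'))
updater.periodic_update()
assert calls == ['job_status', 'workflow']
```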

Also removes the subscriber/notification machinery from JobService and
ServiceRegistry, fixing the callback cleanup gap (no unregister
mechanism = leaked callbacks from dead sessions), and removes the
temporary LIVEDATA_DISABLE_STATUS_CALLBACKS diagnostic env var.

Prompt: Read status-callback-performance-investigation.md - we need to
continue this investigation and develop a complete and deep
understanding of everything involved.

Follow-up: Do we actually need the complication of version counters in
all cases? Updates in JobService are low frequency (mainly heartbeat
timestamp), if we make the polling low frequency we can just always set
the latest state as the widget value.
WorkflowStatusWidget previously subscribed to JobOrchestrator lifecycle
events (on_staged_changed, on_workflow_committed, on_workflow_stopped)
via WidgetLifecycleCallbacks. These callbacks fired synchronously from
the triggering session and rebuilt ALL subscribed widgets across ALL
sessions outside any batched context, causing unbatched recomputes and
cross-session Bokeh model mutations.

Add a version counter to WorkflowState that increments on every state
mutation. The widget's refresh() method (already called inside
SessionUpdater._batched_update()) checks the version and triggers a
full rebuild only when it changes. This ensures all widget updates run
in the correct session context with batched recomputation.

Removed: WidgetLifecycleCallbacks, WidgetLifecycleSubscriptionId,
subscribe_to_widget_lifecycle(), unsubscribe_from_widget_lifecycle(),
_widget_subscriptions dict, cleanup() on WorkflowStatusWidget.

Added: WorkflowState.version field, get_workflow_state_version() getter,
version tracking in WorkflowStatusWidget.refresh().
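The version-based polling reduces to a small pattern (a sketch with illustrative names, not the real classes): every state mutation bumps a counter, and refresh() rebuilds only when the counter has moved since the last poll.

```python
class WorkflowStateSketch:
    def __init__(self):
        self.version = 0

    def mutate(self):
        self.version += 1  # incremented on every state mutation

class WorkflowStatusWidgetSketch:
    def __init__(self, state):
        self._state = state
        self._seen = -1
        self.rebuilds = 0

    def refresh(self):
        # Called inside the batched periodic update; a full rebuild
        # happens only when the version changed since the last poll.
        if self._state.version != self._seen:
            self._seen = self._state.version
            self.rebuilds += 1

state = WorkflowStateSketch()
widget = WorkflowStatusWidgetSketch(state)
widget.refresh()
widget.refresh()                 # no mutation: no second rebuild
state.mutate()
widget.refresh()                 # version changed: rebuild
assert widget.rebuilds == 2
```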

Prompt: Implement the following plan: Convert WorkflowStatusWidget
Lifecycle Callbacks to Version-Based Polling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>