Skip to content

Conversation

@elias-ba
Copy link
Contributor

@elias-ba elias-ba commented Dec 10, 2025

Description

This PR fixes the infinite loading issue when reopening a workflow in the collaborative editor.

1. extract_lock_version not handling nil values (persistence.ex)

When a workflow is opened before first save, the Y.Doc gets lock_version: nil. The extract_lock_version/1 function didn't handle the {:ok, nil} case from Yex.Map.fetch/2, causing it to fall through incorrectly. The stale?/2 function was also refactored for clarity.

2. merge_updates applying deltas to empty doc (persistence_writer.ex)

This was the root cause of the infinite loading. When the PersistenceWriter batches multiple Yjs updates together, merge_updates/1 was applying them to an empty Yex.Doc.new(). Since Yjs deltas only contain changes (not full state), applying them to an empty doc produced corrupted 2-byte outputs, losing all workflow data.

Evidence from reproduction:

  • Before fix: merged updates = 2 bytes (corrupted/empty)
  • After fix: merged updates = 969-1742 bytes (correct full state)

The fix loads the existing persisted state (checkpoint + updates) from the database before applying new deltas on top.

How this caused the nil lock_version crash

  1. User creates new workflow → Y.Doc initialized with lock_version: nil
  2. After 2s debounce → First batch saved with full state (including lock_version: nil) ~1000 bytes ✓
  3. User saves workflow → DB gets real lock_version, Y.Doc updated via merge_saved_workflow_into_ydoc
  4. This generates a delta update → Only contains the lock_version change
  5. After 2s debouncemerge_updates merges pending deltas
  6. BUG: Deltas applied to empty doc → Produces 2-byte corrupted output, lock_version update lost
  7. User reopens workflow → Loads first update (lock_version: nil) + corrupted update (2 bytes)
  8. Result: Y.Doc has lock_version: nil, DB has real value → stale check fails → crash

Closes #4164 #4154

Validation steps

  1. Create a new workflow in collab editor and save it
  2. Add a job node (but don't save)
  3. Navigate back to workflows list
  4. Open the workflow again
  5. Verify workflow loads correctly (no infinite loading)

Additional notes for the reviewer

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed a self-review of my code.
  • I have implemented and tested all related authorization policies. (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@github-project-automation github-project-automation bot moved this to New Issues in v2 Dec 10, 2025
@elias-ba elias-ba changed the title Fix infinite loading when reopening workflow in collaborative editor fix: infinite loading when reopening workflow in collaborative editor Dec 10, 2025
The bug was in merge_updates/1 in persistence_writer.ex. When multiple
Yjs delta updates were batched together, they were applied to an empty
Yex.Doc.new() instead of the current persisted state. Since Yjs deltas
only contain changes (not full state), applying them to an empty doc
produced corrupted 2-byte outputs, losing all workflow data including
lock_version.

The fix loads the existing persisted state (checkpoint + updates) from
the database before applying new deltas on top.

Evidence from reproduction:
- Before fix: merged updates = 2 bytes (corrupted/empty)
- After fix: merged updates = 969-1742 bytes (correct full state)

Closes #4164
@elias-ba elias-ba force-pushed the 4164-fix-nil-lock-version-crash branch from c53ecda to cf3886f Compare December 10, 2025 14:23
@codecov
Copy link

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.09%. Comparing base (9eb285d) to head (7578dd2).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
lib/lightning/collaboration/document_state.ex 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4175      +/-   ##
==========================================
+ Coverage   89.03%   89.09%   +0.05%     
==========================================
  Files         425      425              
  Lines       19646    19658      +12     
==========================================
+ Hits        17492    17514      +22     
+ Misses       2154     2144      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Move duplicated checkpoint and update query logic from Persistence and
PersistenceWriter into DocumentState module where data access belongs.

- Add get_checkpoint_and_updates/1, get_latest_checkpoint/1,
  get_updates_since/2, apply_to_doc/3, and load_into_doc/2 functions
- Update Persistence to use shared DocumentState functions
- Update PersistenceWriter.merge_updates/2 to use DocumentState.load_into_doc/2
- Remove duplicated load_persisted_state_into_doc/2 from PersistenceWriter
- Add @type t and proper @SPEC annotations to DocumentState

This eliminates code duplication while keeping the fix focused and
avoiding conflicts with #4178.
* Fix incorrect field access in create_checkpoint

The create_checkpoint/1 function was using .data instead of .state_data
when accessing DocumentState records. This would cause a KeyError crash
when a document reached 500 updates and triggered checkpoint creation.

Closes #4176

* Fix checkpoint creation bugs and add tests

Two bugs fixed in create_checkpoint/1:
1. Used .data instead of .state_data when accessing DocumentState records
2. Used string "checkpoint" instead of atom :checkpoint for Ecto.Enum

Added tests to verify checkpoint creation works correctly:
- Test creating checkpoint from multiple updates
- Test merging existing checkpoint with new updates

Both bugs would cause crashes when a document reached 500 updates
and triggered automatic checkpoint creation.

Closes #4176
end

defp merge_updates([update]), do: update
defp merge_updates([update], _document_name), do: update
Copy link
Contributor

@doc-han doc-han Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix for my issue #4154 overlaps with what you're doing here. I want to get rid of this function all together. If there's a single update doesn't mean that

  1. we don't want to generate a delta between the current state of the document(_document_name)
  2. also, properly encode the changes using Yex.encode_state_as_update!

with 2. I don't think that just returning the single update is the same as passing it through a new doc and encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my end, I've consistently reproduced the error whenever there's a single update coming through ydoc which leads to this function. checking whether I can reproduce on this branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. I think this PR fixes #4154 also. just revalidated whether a single update coming through causes an issue and it actually doesn't. Hence the issue with #4154 also had to do with the corrupted bytes. Took me a long route to get to merge_updates function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Infinite loading when opening workflow "Any workflow 1"

3 participants