Improve speed of hypdiff for large text by phorsuedzie · Pull Request #5 · krishan/hyp_diff

phorsuedzie · 2021-06-03T16:12:36Z

When large texts are diffed, Diff::LCS receives a lot of "single whitespace" sequence tokens, which weigh heavily on execution time. The cost does not grow linearly with the number of whitespace tokens; it is closer to O(n^2).
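A rough way to see where the quadratic growth comes from, assuming (as is typical for Hunt-McIlroy-style LCS implementations) that the work is roughly tied to the number of position pairs holding equal tokens. The helpers `matching_pairs` and `tokens` are made up for this illustration and are not part of hyp_diff or Diff::LCS:

```ruby
# Count the position pairs (i, j) with a[i] == b[j].
def matching_pairs(a, b)
  occurrences = Hash.new(0)
  b.each { |token| occurrences[token] += 1 }
  a.sum { |token| occurrences[token] }
end

# Hypothetical token stream: n distinct words, each followed by a single space.
def tokens(n)
  (1..n).flat_map { |i| ["word#{i}", " "] }
end

small = tokens(100)
large = tokens(200)

matching_pairs(small, small) # => 10100 (100 word pairs + 100 * 100 space pairs)
matching_pairs(large, large) # => 40200 (200 word pairs + 200 * 200 space pairs)

# Without the whitespace tokens the count grows linearly.
words_only = large.reject { |token| token == " " }
matching_pairs(words_only, words_only) # => 200
```

Doubling the text size quadruples the number of whitespace pairs, while the distinct words contribute only one pair each.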

The solution is to hide (most) whitespace sequence items from Diff::LCS. A whitespace TextFromNode is now "merged" into its successor TextFromNode if both belong to the same XML Text Node (i.e. originate from the same split).

This helps Diff::LCS in two ways:

- the number of sequence tokens is reduced
- the number of equal sequence tokens is massively reduced
  - Note: some (probably less critical) "token hotspots" might still be present, e.g. `<` and the like
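The merge step described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `Token` is a hypothetical stand-in for `TextFromNode`, and `node_id` is an assumed way to tell which XML text node a token was split from.

```ruby
# Hypothetical stand-in for TextFromNode: the token text plus an identifier
# of the XML text node it originates from.
Token = Struct.new(:text, :node_id) do
  def whitespace?
    text.match?(/\A\s+\z/)
  end
end

# Merge each whitespace token into the token that follows it, but only if
# both come from the same text node. Other tokens pass through unchanged.
def merge_whitespace_into_successor(tokens)
  merged = []
  pending = nil # whitespace token still waiting for its successor

  tokens.each do |token|
    if pending
      if pending.node_id == token.node_id
        token = Token.new(pending.text + token.text, token.node_id)
      else
        merged << pending
      end
      pending = nil
    end

    if token.whitespace?
      pending = token
    else
      merged << token
    end
  end

  merged << pending if pending
  merged
end

tokens = [
  Token.new("Hello", 1), Token.new(" ", 1), Token.new("world", 1),
  Token.new(" ", 1), Token.new("again", 2)
]
merge_whitespace_into_successor(tokens).map(&:text)
# => ["Hello", " world", " ", "again"]
```

The first space disappears into `" world"`, so Diff::LCS sees one token fewer and one fewer token that is equal to every other space. The second space stays on its own because its successor belongs to a different text node.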

Note that the specs are kept happy by first collapsing adjacent whitespace nodes (even across text nodes), and only afterwards merging a whitespace token into its successor.
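For illustration, the collapsing step that runs first might be sketched like this. `Token`, `node_id` and `collapse_adjacent_whitespace` are hypothetical names, and keeping the first token of each whitespace run is an assumption, not necessarily what hyp_diff does:

```ruby
# Hypothetical stand-in for TextFromNode, as assumed for this sketch.
Token = Struct.new(:text, :node_id) do
  def whitespace?
    text.match?(/\A\s+\z/)
  end
end

# Collapse every run of consecutive whitespace tokens into its first token,
# regardless of which text node the tokens came from.
def collapse_adjacent_whitespace(tokens)
  tokens.chunk_while { |a, b| a.whitespace? && b.whitespace? }.map(&:first)
end

tokens = [
  Token.new("a", 1), Token.new(" ", 1), # whitespace at the end of node 1
  Token.new(" ", 2), Token.new("b", 2)  # whitespace at the start of node 2
]
collapse_adjacent_whitespace(tokens).map(&:text)
# => ["a", " ", "b"]
```

In this sketch the surviving space still belongs to node 1, so a subsequent merge step that only joins tokens from the same text node would leave it as a separate token in front of `"b"`.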

phorsuedzie · 2021-11-04T22:27:09Z

Any update on this? See https://github.com/infopark/scrivito_cms/blob/master/app/static/hyp_diff_patches.rb ;-)

phorsuedzie added 7 commits June 3, 2021 16:04

Streamline some code

0aa7af8

- `Hash` knows how to initialize values of missing keys.
- `NodeMap` returns the map and is not (anymore) called with a block.

[test] aggregate_failures reduces red surprises

a6b2be6

[test] Ensure stable diff output for pseudo adjacent whitespaces

5ba85f1

[test] Should perform fast for medium sized (plain) text

221205e

Increase diff speed by hiding single whitespaces from Diff::LCS

b1e003b

[spec] Expect an edge case's outcome differently

026ec1b

Remove artifact

2af49b4

The loop's condition is sufficient. The artifact is from a spike which tried to identify a "common tail" of insertions and deletions.

phorsuedzie wants to merge 7 commits into krishan:master from phorsuedzie:faster