yadudoc commented Sep 30, 2025

Currently, any duplication check against a pre-computed index inserts new documents into that index. For cases where we want to keep the index frozen, say for comparing one corpus against many others, we need to avoid insertions that would mutate it.

This PR adds support for operating on an index without inserting new documents, via the `--skip-insertion` flag.

In addition, there are a couple of minor type fixes.
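
To make the intended behaviour concrete, here is a minimal sketch of a duplicate check with and without insertion. It is not the code from this PR: `fingerprint`, `find_unique`, and the plain `set` used as the index are hypothetical stand-ins, and exact SHA-256 matching is assumed in place of whatever index structure the project actually uses. Only the `skip_insertion` name comes from the PR.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Exact-match fingerprint (assumption: stand-in for the real index key)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_unique(
    docs: Iterable[str],
    index: Set[str],
    skip_insertion: bool = False,
) -> List[str]:
    """Return the documents whose fingerprints are not already in `index`.

    With skip_insertion=False (the pre-PR behaviour described above),
    every unique document is added to the index as a side effect.
    With skip_insertion=True, the index is only read, never mutated.
    """
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key in index:
            continue  # already known: treat as a duplicate
        unique.append(doc)
        if not skip_insertion:
            index.add(key)  # mutates the index
    return unique


if __name__ == "__main__":
    index = {fingerprint("a"), fingerprint("b")}
    # Frozen index: compare a corpus without changing the index.
    print(find_unique(["b", "c", "c"], index, skip_insertion=True))
    print(len(index))
```

One consequence in this sketch: with `skip_insertion=True`, repeated documents inside the incoming corpus are not deduplicated against each other, because nothing new is ever recorded. Whether the real implementation behaves the same way is not stated in the PR.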

…ents

* When `skip_insertion` is enabled, unique entries found are not inserted into the index. This enables deduplicating against an index without modifying it.
robertu94 added a commit to robertu94/data-general-text-code-web that referenced this pull request Oct 3, 2025
@robertu94
merged into #5
