feat: added lazy loading. #39
Open
Lazy loading support
First of all, most of the LOC only integrate the feature with the examples, so that we can easily test for regressions and check whether there is any memory improvement at all. The actual library diff is tiny in comparison. I apologise in advance; we can split this into multiple PRs, or just revert the examples to keep the feature small.
Summary
Adds a FeatureStore<T> trait that abstracts feature vector storage, enabling lazy loading from disk, mmap, or custom backends. The default implementation uses Vec<T> for backward compatibility. This PR also adds the new_with_storage and new_with_storage_and_params constructors, updates the recall examples with mmap-based storage demonstrations, and fixes a bug in a copy_from_slice call where the destination slice wasn't properly sized.
Motivation
Previously, all feature vectors were stored in a Vec<T> in memory. For large datasets, this can exceed available RAM. The FeatureStore trait allows users to provide custom storage backends, such as the disk-based or mmap-based ones mentioned in the summary.
Note: all API changes are backward compatible (I think).
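To make the idea concrete, here is a minimal sketch of what a storage abstraction of this kind could look like. It is not the trait from this PR: the names FeatureStoreSketch, FileStore, get, and create are hypothetical, and it reads from a plain file with seek instead of mmap so that it only needs the standard library.

```rust
use std::borrow::Cow;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;
use std::sync::Mutex;

/// Hypothetical storage trait; the real FeatureStore<T> in this PR may differ.
trait FeatureStoreSketch<T: Clone> {
    /// Number of stored features.
    fn len(&self) -> usize;
    /// Returns a feature, borrowed when it lives in memory, owned when loaded lazily.
    fn get(&self, index: usize) -> Cow<'_, T>;
}

/// In-memory backend: a Vec<T> hands out borrowed features without copying.
impl<T: Clone> FeatureStoreSketch<T> for Vec<T> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
    fn get(&self, index: usize) -> Cow<'_, T> {
        Cow::Borrowed(&self[index])
    }
}

/// Hypothetical disk backend: features stay in a file and are read on demand.
struct FileStore<const DIM: usize> {
    file: Mutex<File>,
    count: usize,
}

impl<const DIM: usize> FileStore<DIM> {
    /// Writes all features to `path` as little-endian f32 values.
    fn create(path: &Path, features: &[[f32; DIM]]) -> io::Result<Self> {
        let mut file = File::options()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(path)?;
        for feature in features {
            for value in feature {
                file.write_all(&value.to_le_bytes())?;
            }
        }
        file.flush()?;
        Ok(Self { file: Mutex::new(file), count: features.len() })
    }
}

impl<const DIM: usize> FeatureStoreSketch<[f32; DIM]> for FileStore<DIM> {
    fn len(&self) -> usize {
        self.count
    }
    fn get(&self, index: usize) -> Cow<'_, [f32; DIM]> {
        assert!(index < self.count, "feature index out of bounds");
        // Only the requested feature is read; nothing else is kept in RAM.
        let mut bytes = vec![0u8; DIM * 4];
        let mut file = self.file.lock().unwrap();
        file.seek(SeekFrom::Start((index * DIM * 4) as u64)).unwrap();
        file.read_exact(&mut bytes).unwrap();
        let mut feature = [0f32; DIM];
        for (slot, chunk) in feature.iter_mut().zip(bytes.chunks_exact(4)) {
            *slot = f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        }
        Cow::Owned(feature)
    }
}
```

Returning Cow lets the Vec backend stay zero-copy while a lazy backend hands back an owned value. An mmap-based backend could return borrowed data as well, which is one reason a real design might choose a different return type.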
There was also a small bug, which I uncovered while testing the recall examples:
In src/hnsw/hnsw_const.rs, the nearest method had incorrect copy_from_slice calls that would panic when the destination buffer was larger than the number of results found. The bug triggers when a search requests more neighbors than there are results available (I am not sure whether this was a bug before, but I checked and everything seems to work separately).
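For context, copy_from_slice panics whenever the source and destination slices differ in length, so a destination buffer larger than the result set has to be trimmed before copying. The sketch below illustrates that pattern only; copy_results is a hypothetical helper, not the code changed in this PR.

```rust
/// Hypothetical helper: copies as many results as fit into `dest`
/// and returns how many entries were written.
fn copy_results(dest: &mut [u32], found: &[u32]) -> usize {
    // copy_from_slice requires equal lengths, so trim both sides
    // to the shared length before copying.
    let n = dest.len().min(found.len());
    dest[..n].copy_from_slice(&found[..n]);
    n
}
```

Calling dest.copy_from_slice(found) directly with a four-element dest and two results would panic; the trimmed version writes the two results and leaves the remaining slots untouched.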
Benchmarks - Recall
Recall stayed identical; I am posting the gnuplots for completeness. Below are the Vec-based plots.
Discrete:
Mmap plots:
Benchmarks - Memory
Tested with 1 million 128-dimensional f32 vectors (512 bytes per feature):
To show the usefulness of lazy loading, I benchmarked RAM consumption. The disk-based storage reduces memory usage by ~83%: with the disk-based backend, only the graph structure remains in memory, bringing total usage down to ~216 MB.
These results are expected to scale linearly. A 100-million-vector index cannot be held in memory on a laptop, as it would take about 130 GB. With lazy loading, however, a 32 GB machine can hold more than 100 million vectors, and with 130 GB you could store around 0.5 billion vectors.
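The extrapolation can be reproduced as a back-of-the-envelope calculation. The sketch below assumes strictly linear scaling and decimal units (1 GB = 1000 MB), and derives the in-memory footprint from the two figures quoted above (~216 MB and ~83%) rather than from a separate measurement; the function names are made up for illustration.

```rust
/// Measured footprint with the disk-based backend, per 1 million vectors (from the text).
const LAZY_MB_PER_MILLION: f64 = 216.0;
/// Reported memory reduction relative to in-memory storage (from the text).
const REDUCTION: f64 = 0.83;

/// In-memory footprint per 1 million vectors implied by the two figures above.
fn in_memory_mb_per_million() -> f64 {
    LAZY_MB_PER_MILLION / (1.0 - REDUCTION)
}

/// Footprint in GB for `millions` of vectors, assuming linear scaling.
fn footprint_gb(mb_per_million: f64, millions: f64) -> f64 {
    mb_per_million * millions / 1000.0
}

/// How many millions of vectors fit in `ram_gb`, assuming linear scaling.
fn capacity_millions(ram_gb: f64, mb_per_million: f64) -> f64 {
    ram_gb * 1000.0 / mb_per_million
}
```

Under these assumptions, 100 million vectors need roughly 127 GB in memory versus about 21.6 GB with lazy loading, and 130 GB of RAM would hold on the order of 600 million lazily loaded vectors, which is in the same range as the estimate above.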