
Calling store_vector with MemoryStorage on scipy.sparse.csr_matrix allocates memory when it shouldn't. #93

@Apkar029

Description


I have input samples as a sparse matrix of shape (531990 samples, 85765 features).

The size of this matrix in memory is 56KB. The same matrix as a dense numpy array would be approximately 340GB.
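The gap between the sparse and dense footprints can be checked with a small sketch (hypothetical, much smaller dimensions than the real matrix; the idea is the same):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Small stand-in for the reporter's matrix; the real one is
# 531990 x 85765 with very few nonzero entries.
m = sparse_random(5000, 8000, density=1e-5, format='csr')

# CSR stores only the nonzero values plus index arrays.
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# A dense float64 array of the same shape stores every entry.
dense_bytes = m.shape[0] * m.shape[1] * 8

print(sparse_bytes, dense_bytes)
```

The CSR footprint grows with the number of nonzeros, while the dense footprint grows with rows times columns, which is why materializing the full array is not an option here.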

When I use the MemoryStorage option I run out of memory. This is caused by the vec = vec.tocsr() call in the unitvec function. The vectors I pass to store_vector are scipy.sparse.csr.csr_matrix of shape (85765, 1), because trying to store them with shape (1, 85765) gives:

File "nearpy/engine.py", line 96, in store_vector
  for bucket_key in lshash.hash_vector(v):
File "nearpy/hashes/randombinaryprojections.py", line 74, in hash_vector
  projection = self.normals_csr.dot(v)
File "scipy/sparse/base.py", line 359, in dot
  return self * other
File "scipy/sparse/base.py", line 479, in __mul__
  raise ValueError('dimension mismatch')
ValueError: dimension mismatch

Removing the vec = vec.tocsr() line solves the problem for matrices of shape (85765, 1), and no extra memory is allocated. This is strange behavior and might be a scipy bug, but what is the point of the .tocsr() conversion anyway?
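The orientation constraint behind the traceback can be reproduced with a minimal sketch (hypothetical sizes, standing in for the projection step in randombinaryprojections.py):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dim = 1000       # stand-in for the 85765 features
num_bits = 10    # stand-in for the number of projection hyperplanes

# Random projection matrix, analogous to normals_csr: shape (num_bits, dim).
normals = csr_matrix(rng.standard_normal((num_bits, dim)))

# A sparse column vector of shape (dim, 1) multiplies cleanly.
v_col = csr_matrix(rng.standard_normal((dim, 1)))
projection = normals.dot(v_col)
print(projection.shape)  # (num_bits, 1)

# The row orientation (1, dim) has incompatible inner dimensions
# and raises the error from the traceback.
v_row = v_col.T
try:
    normals.dot(v_row)
except ValueError as e:
    print(e)
```

Since normals has shape (num_bits, dim), sparse matrix multiplication only accepts a (dim, 1) operand on the right, which is why the vectors must be stored as columns.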
