Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move vector search from IndexInput to RandomAccessInput #13938

Open
jpountz opened this issue Oct 21, 2024 · 7 comments · May be fixed by #13981
Open

Move vector search from IndexInput to RandomAccessInput #13938

jpountz opened this issue Oct 21, 2024 · 7 comments · May be fixed by #13981

Comments

@jpountz
Copy link
Contributor

jpountz commented Oct 21, 2024

Description

Vector search currently loads vectors from disk by issuing a seek() followed by a readFloats(). We should instead:

  • Add an absolute readFloats() method to RandomAccessInput
  • Refactor the latest vector search file format to use RandomAccessInput instead of IndexInput to read vectors from disk.
@dungba88
Copy link
Contributor

Hi, I'm learning Lucene KNN and this seems to be a workable PR for beginner. Just curious about the motivation behind this change. Is it only for cleaner code, or are we also suppose to make any latency improvement on the absolute readFloats method compare to the current seek() + readFloats()?

@msokolov
Copy link
Contributor

I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read

@dungba88
Copy link
Contributor

dungba88 commented Nov 1, 2024

I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read

Gotcha, the current usage of seek + readFloats requires the Reader to keep the seek position. When we change to the RandomAccessInput, we expect the operation to have no side-effect to the Reader and thus they will be sharable.

@dungba88
Copy link
Contributor

dungba88 commented Nov 6, 2024

I looked at some implementation of RandomAccessInput, such as BufferedIndexInput. This particular class holds a single buffer for all reads, thus it cannot be shared. If we use temporary buffer (to make it shareable), then it kinda defeats the purpose of the single-buffer, which is to avoid excessive temporary buffers and GC. So it's unavoidable to have side-effects in read.

@dungba88
Copy link
Contributor

@jpountz it's only a draft (I need to add tests), but can you give some feedbacks on #13981. I'm not sure if I have fully captured the intention of this change.

@rmuir
Copy link
Member

rmuir commented Dec 4, 2024

@jpountz is this really appropriate? RandomAccessInput is to reduce the overhead when doing tiny (not bulk) reads, it was added to help move from fieldcache to docvalues, where you need to read e.g. single byte value at a specific location. it saves a bounds check for such tiny reads.

For bulk reads it isn't useful.

Basically, i think this is ok, as long as we remove bulk readFloats() method along with it.

@jpountz
Copy link
Contributor Author

jpountz commented Dec 4, 2024

I was thinking of it differently, that IndexInput is for sequential reading (possibly with skipping, like we do in postings) while RandomAccessInput is for random access like we do in doc values (except in the corner case when your query is a MatchAllDocsQuery) and vectors. But no strong feelings either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants