Move vector search from IndexInput to RandomAccessInput #13938

jpountz · 2024-10-21T07:04:11Z

Description

Vector search currently loads vectors from disk by issuing a seek() followed by a readFloats(). We should instead:

- Add an absolute readFloats() method to RandomAccessInput.
- Refactor the latest vector search file format to use RandomAccessInput instead of IndexInput to read vectors from disk.
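
For illustration, the two access patterns can be sketched as follows. This is a minimal, self-contained sketch over a `ByteBuffer`, not Lucene code: `PositionalReader` and `AbsoluteReader` are hypothetical stand-ins for `IndexInput` and `RandomAccessInput`, and the signature of the absolute `readFloats` is only an assumption about what the proposed method might look like.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class VectorReadSketch {

  /** Hypothetical stand-in for a stateful input: seek(), then readFloats(). */
  static final class PositionalReader {
    private final ByteBuffer buf;

    PositionalReader(ByteBuffer data) {
      // duplicate() resets the byte order, so set it again on the copy
      this.buf = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    void seek(long pos) {
      buf.position(Math.toIntExact(pos));
    }

    void readFloats(float[] dst, int offset, int len) {
      for (int i = 0; i < len; i++) {
        dst[offset + i] = buf.getFloat(); // relative get: advances the position
      }
    }

    long position() {
      return buf.position();
    }
  }

  /** Hypothetical stand-in for a random-access input with an absolute readFloats(). */
  static final class AbsoluteReader {
    private final ByteBuffer buf;

    AbsoluteReader(ByteBuffer data) {
      this.buf = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    void readFloats(long pos, float[] dst, int offset, int len) {
      int base = Math.toIntExact(pos);
      for (int i = 0; i < len; i++) {
        // absolute get: the buffer position is left untouched
        dst[offset + i] = buf.getFloat(base + i * Float.BYTES);
      }
    }

    long position() {
      return buf.position();
    }
  }

  /** Writes the vectors back to back as little-endian floats. */
  static ByteBuffer encode(float[][] vectors) {
    int dim = vectors[0].length;
    ByteBuffer data =
        ByteBuffer.allocate(vectors.length * dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float[] v : vectors) {
      for (float f : v) {
        data.putFloat(f);
      }
    }
    data.flip();
    return data;
  }

  /** Current pattern: two calls, and the reader's position changes. */
  static float[] readVectorPositional(PositionalReader in, int ord, int dim) {
    float[] dst = new float[dim];
    in.seek((long) ord * dim * Float.BYTES);
    in.readFloats(dst, 0, dim);
    return dst;
  }

  /** Proposed pattern: one call that carries its own offset. */
  static float[] readVectorAbsolute(AbsoluteReader in, int ord, int dim) {
    float[] dst = new float[dim];
    in.readFloats((long) ord * dim * Float.BYTES, dst, 0, dim);
    return dst;
  }
}
```

Both helpers return the same vector for a given ordinal; the only difference in this sketch is that the positional reader ends up at a new position after each read, while the absolute reader does not.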


dungba88 · 2024-10-31T08:58:16Z

Hi, I'm learning Lucene KNN and this seems to be a workable PR for a beginner. Just curious about the motivation behind this change. Is it only for cleaner code, or are we also supposed to get a latency improvement from the absolute readFloats method compared to the current seek() + readFloats()?

msokolov · 2024-10-31T12:22:44Z

I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read

dungba88 · 2024-11-01T06:11:29Z

> I think this will be helpful since currently we cannot share these readers across threads -- they retain the state information about the current position. Not sure how much benefit that will be since they must still typically maintain some local temporary storage to retain the value that is read

Gotcha, the current usage of seek + readFloats requires the reader to keep the seek position. When we change to RandomAccessInput, we expect the operation to have no side effects on the reader, and thus readers will be shareable.
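
A rough sketch of what "shareable" could mean in practice, assuming the reader is backed by immutable bytes. `SharedVectors` and `readAll` are hypothetical names for illustration, not Lucene classes. Each task still allocates its own scratch array, which is the local temporary storage mentioned in the quoted comment.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedReaderSketch {

  /** Hypothetical reader with no mutable state: every read carries its own offset. */
  static final class SharedVectors {
    private final byte[] bytes; // written once in the constructor, then only read
    private final int dim;

    SharedVectors(float[][] vectors) {
      this.dim = vectors[0].length;
      ByteBuffer b =
          ByteBuffer.allocate(vectors.length * dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
      for (float[] v : vectors) {
        for (float f : v) {
          b.putFloat(f);
        }
      }
      this.bytes = b.array();
    }

    /** Absolute read of one vector into a caller-owned array. */
    void readVector(int ord, float[] dst) {
      int base = ord * dim * Float.BYTES;
      for (int i = 0; i < dim; i++) {
        int p = base + i * Float.BYTES;
        // decode one little-endian float by hand
        int bits =
            (bytes[p] & 0xFF)
                | (bytes[p + 1] & 0xFF) << 8
                | (bytes[p + 2] & 0xFF) << 16
                | (bytes[p + 3] & 0xFF) << 24;
        dst[i] = Float.intBitsToFloat(bits);
      }
    }
  }

  /** Reads every vector on a thread pool while sharing a single reader instance. */
  static float[][] readAll(SharedVectors in, int count, int dim, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<float[]>> futures = new ArrayList<>();
      for (int ord = 0; ord < count; ord++) {
        final int o = ord;
        futures.add(
            pool.submit(
                () -> {
                  float[] scratch = new float[dim]; // per-task temporary storage
                  in.readVector(o, scratch);
                  return scratch;
                }));
      }
      float[][] out = new float[count][];
      for (int i = 0; i < count; i++) {
        out[i] = futures.get(i).get();
      }
      return out;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

With a positional reader, the same code would need one reader (or clone) per thread, because concurrent seek() calls would overwrite each other's position.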

dungba88 · 2024-11-06T03:43:03Z

I looked at some implementations of RandomAccessInput, such as BufferedIndexInput. This particular class holds a single buffer for all reads, thus it cannot be shared. If we use a temporary buffer (to make it shareable), then it kinda defeats the purpose of the single buffer, which is to avoid excessive temporary buffers and GC. So it's unavoidable to have side effects in reads.
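
To make the concern concrete, here is a simplified sketch of a single-buffer reader. It is not the actual BufferedIndexInput code; `SingleBufferReader` and its fields are hypothetical. Even though `readByte(long pos)` takes an absolute position, a miss refills the one internal buffer, so the call mutates state that every caller sees.

```java
final class BufferedAbsoluteReadSketch {

  /** Hypothetical single-buffer reader; absolute reads may refill the shared buffer. */
  static final class SingleBufferReader {
    private final byte[] file; // stands in for the on-disk data
    private final byte[] buffer; // the one buffer reused by all reads
    private long bufferStart = 0;
    private int bufferLength = 0;
    private int refills = 0;

    SingleBufferReader(byte[] file, int bufferSize) {
      this.file = file;
      this.buffer = new byte[bufferSize];
    }

    byte readByte(long pos) {
      if (pos < 0 || pos >= file.length) {
        throw new IndexOutOfBoundsException("pos=" + pos);
      }
      if (pos < bufferStart || pos >= bufferStart + bufferLength) {
        refill(pos); // side effect: replaces the buffered window
      }
      return buffer[(int) (pos - bufferStart)];
    }

    private void refill(long pos) {
      bufferStart = pos;
      bufferLength = (int) Math.min(buffer.length, file.length - pos);
      System.arraycopy(file, (int) pos, buffer, 0, bufferLength);
      refills++;
    }

    long bufferStart() {
      return bufferStart;
    }

    int refills() {
      return refills;
    }
  }
}
```

Two threads calling `readByte` on the same instance could interleave a refill with a buffer lookup and read the wrong window, which is why a reader like this cannot be shared without either synchronization or per-caller buffers.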

dungba88 · 2024-11-14T07:37:04Z

@jpountz it's only a draft (I need to add tests), but can you give some feedback on #13981? I'm not sure if I have fully captured the intention of this change.

rmuir · 2024-12-04T14:28:47Z

@jpountz is this really appropriate? RandomAccessInput is to reduce the overhead when doing tiny (not bulk) reads. It was added to help move from fieldcache to docvalues, where you need to read e.g. a single byte value at a specific location. It saves a bounds check for such tiny reads.

For bulk reads it isn't useful.

Basically, I think this is ok, as long as we remove the bulk readFloats() method along with it.

jpountz · 2024-12-04T14:59:49Z

I was thinking of it differently, that IndexInput is for sequential reading (possibly with skipping, like we do in postings) while RandomAccessInput is for random access like we do in doc values (except in the corner case when your query is a MatchAllDocsQuery) and vectors. But no strong feelings either way.

jpountz added the type:task label Oct 21, 2024

dungba88 added a commit to dungba88/lucene that referenced this issue Nov 8, 2024

Move vector search from IndexInput to RandomAccessInput (apache#13938)

b97aadb

This was referenced Nov 8, 2024

[DRAFT] Move vector search from IndexInput to RandomAccessInput (#13938) dungba88/lucene#28

Closed

[DRAFT] Change vector input from IndexInput to RandomAccessInput #13981

Open
