
Could Lucene's default Directory (FSDirectory.open) somehow preload .vec files? #13551

Open
mikemccand opened this issue Jul 8, 2024 · 9 comments


@mikemccand
Member

Description

This is really a "discussion" issue. I'm not sure at all that the idea is feasible:

I've been testing luceneutil with heavy KNN indexing (Cohere wikipedia en 768 dimension vectors) and one dismal part of the experience is the swap storm caused by HNSW's random access to the raw vectors stored in the .vec files for each segment on "cold start" searching.

Even when the box has plenty of RAM to hold all .vec files, the swap storm takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the index. Even if the index is freshly built, the OS doesn't seem to cache the .vec files since they appear to be "write once", until the searching starts up, and then the swap storm begins. This was with FLOAT32 vectors ... I suspect the problem is less severe with int4 or int8 compressed vectors (haven't tested).

At Amazon (customer facing product search) we also see this swap storm when cold starting the searching process even after NRT segment replication has just copied the files locally: they don't stay "hot" in the OS in that case either (looks like "write once" to the OS).

Lucene already has an awesome feature to "fix" this: MMapDirectory.setPreload. It will pre-touch all pages associated with that file on open, so the OS caches them "once" on startup, much more efficiently than HNSW's random access. But this only makes sense for applications/users that know they have enough RAM (we will test this at Amazon to see if it helps our cold-start problems). For my luceneutil tests, simply cat /path/to/index/*.vec >> /dev/null (basically the same as .setPreload, I think) fixed/sidestepped the swap storm.

(The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO is likely not helping either? Is this really a thing? I have not tested NIOFSDirectory to see if the swap storm is lessened).

Longer term we have discussed using AKNN algorithms that are more disk-friendly (e.g. DiskANN), but shorter term, I'm wondering if we could somehow help users with a default Directory (from FSDirectory.open) that somehow / sometimes preloads .vec files? It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all. And maybe only certain vectors in the .vec will ever be accessed, so you need to pay that long random-access price to find and cache just those ones.

@rmuir
Member

rmuir commented Jul 8, 2024

It's not easy to do -- you wouldn't know up front that the application will do KNN searching at all. And, maybe only certain vectors in the .vec will ever be accessed and so you need to pay that long random access price to find and cache just those ones.

I'm trying to understand this sentence:

  • it should be a one-liner using setPreload to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory
  • if the application isn't doing KNN searching then they won't have .vec? I struggle to imagine someone indexing a ton of "extra" vectors that they aren't "using" and not noticing the big performance impact

@jpountz
Contributor

jpountz commented Jul 8, 2024

It wouldn't solve the issue, only mitigate it, but hopefully cold start performance gets better when we start leveraging IndexInput#prefetch to load multiple vectors from disk concurrently (#13179).
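As a rough sketch (not the actual patch; the exact call sites inside the HNSW searcher are up to #13179, and the IndexInput#prefetch(offset, length) signature is assumed here), a caller that knows which vector ordinals it is about to visit could hint them up front before reading:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    // Sketch only: hint upcoming random vector reads before performing them,
    // assuming the IndexInput#prefetch(long offset, long length) API from #13179.
    class VectorPrefetchSketch {
      static void prefetchVectors(IndexInput vecIn, int[] ordinals, long vectorByteSize)
          throws IOException {
        for (int ord : ordinals) {
          // Asynchronous hint; the actual reads of these vectors follow later.
          vecIn.prefetch((long) ord * vectorByteSize, vectorByteSize);
        }
      }
    }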

The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO

FWIW, it's not that the readahead is less aggressive for mmap than for traditional I/O; it's that, since recently, MMapDirectory explicitly tells the OS not to perform readahead for vectors by passing MADV_RANDOM to madvise.

@mikemccand
Member Author

Oh sorry, I used the wrong term (thank you @rmuir for clarifying!): it's not a swap storm I'm seeing, it's a page storm. The OS has plenty of free RAM (as reported by free), and that goes down while buff/cache goes up as the OS pulls in and caches pages of the .vec file. I don't think I'm running too many RAM-hogging crapplications ;)

  • it should be a one-liner using setPreload to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory

+1 -- it would be a simple change. But I worry that it might do more harm than good in some cases, e.g. if there are truly cold HNSW cases where the application plans to suffer through paging to identify the subset of hot vectors? I don't know if that is even a thing -- a bunch of dark-matter vectors that never get visited? I guess we do know that HNSW can and does sometimes produce disconnected graphs, but I think those dark-matter islands are "typically" smallish.

  • if the application isn't doing KNN searching then they won't have .vec? I struggle to imagine someone indexing a ton of "extra" vectors that they aren't "using" and not noticing the big performance impact

+1! Especially because indexing them is quite costly!

It wouldn't solve the issue, only mitigate it, but hopefully cold start performance gets better when we start leveraging IndexInput#prefetch to load multiple vectors from disk concurrently (#13179).

+1 -- that would be a big help especially when paging in from fast SSDs since these devices have high IO concurrency.

The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO

FWIW, it's not that the readahead is less aggressive for mmap than for traditional I/O; it's that, since recently, MMapDirectory explicitly tells the OS not to perform readahead for vectors by passing MADV_RANDOM to madvise.

Ahhh, that makes sense! And I think it is still correct to madvise this way for our .vec files: HNSW's access pattern really is likely random (is there any locality to HNSW's access patterns?). It does make me wonder whether the scalar compression will actually help or not. I guess it might still help if that compression means multiple vectors fit into a single IO page.
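Back-of-the-envelope, assuming 4 KiB pages: a 768-dimension float32 vector is 768 × 4 = 3072 bytes, so at most one full vector fits per page and nearly every neighbour visited is a fresh page to fault in. With int8 quantization the same vector is 768 bytes (~5 per page), and with int4 roughly 384 bytes (~10 per page), so compression should at least cut the number of distinct pages touched even under a purely random access pattern.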

@uschindler
Contributor

I don't think we should change anything here in MMapDirectory. It is all available and easy to do for anyone who wants it. Elasticsearch is doing this for some files, but we have no default that preloads anything.

For preloading you can call MMapDirectory#setPreload and pass it a lambda expression to select which file extensions you want preloaded: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#setPreload(java.util.function.BiPredicate)

The default is: https://lucene.apache.org/core/9_11_0/core/org/apache/lucene/store/MMapDirectory.html#NO_FILES

@uschindler
Contributor

uschindler commented Jul 8, 2024

  • it should be a one-liner using setPreload to preload "*.vec" if we wanted to do it either from FSDirectory.open or by default in MMapDirectory

It is trivial:

mmapDir.setPreload((name,ctx) -> name.endsWith(".vec"));

So no change needed in the API or behaviour of Lucene.
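For completeness, a minimal sketch of wiring this up in an application that opens its index via FSDirectory.open (the instanceof check is only there because FSDirectory.open may return a non-mmap implementation on some platforms):

    import java.io.IOException;
    import java.nio.file.Path;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.MMapDirectory;

    // Minimal sketch: opt in to preloading raw vector data on the application's own directory.
    class PreloadVecSketch {
      static Directory openWithVecPreload(Path indexPath) throws IOException {
        Directory dir = FSDirectory.open(indexPath);
        if (dir instanceof MMapDirectory) {
          // Touch every page of *.vec files when they are opened.
          ((MMapDirectory) dir).setPreload((name, ctx) -> name.endsWith(".vec"));
        }
        return dir;
      }
    }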

@gautamworah96
Contributor

@uschindler At Amazon, we implemented the mmapDir.setPreload((name,ctx) -> name.endsWith(".vec")); style fix you had suggested, but later realized that it would not work for compound files that have vector data in them, e.g. .cfs.

One approach would be to prevent vector files from being folded into compound files by instructing Lucene not to do so. I was wondering if you had thoughts here / have thought about this approach before?

The slightly tricky bit here is we only want to preload the vector data from the compound files.
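One coarse interim workaround (all-or-nothing, not the per-extension behaviour we actually want) would be to keep segments out of the compound format entirely, so the .vec files stay separate and remain visible to setPreload; a sketch:

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    // Coarse sketch: disable compound files altogether so *.vec stays a separate file.
    // This is all-or-nothing; it does not selectively exclude vector data from .cfs.
    class NoCompoundFilesSketch {
      static IndexWriterConfig configWithoutCompoundFiles() {
        IndexWriterConfig iwc = new IndexWriterConfig();
        iwc.setUseCompoundFile(false);        // newly flushed segments: no .cfs
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setNoCFSRatio(0.0);               // merged segments: no .cfs either
        iwc.setMergePolicy(tmp);
        return iwc;
      }
    }

The obvious downside is losing the small-file and file-handle benefits of compound files for every format, not just vectors.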

@mikemccand
Member Author

...but later realized that it would not work for compound files that have vector data in them, e.g. .cfs.

Tracing MMapDirectory.openInput in main, it does this:

    return PROVIDER.openInput(
        path,
        context,
        chunkSizePower,
        preload.test(name, context),
        groupingFunction.apply(name),
        attachment);

So it looks like it applies the preload predicate to the .cfs file name... and then passes the resulting boolean to the JDK-version-specific impl. Opening further slices within this IndexInput (which is what CFS files do) no longer does any preloading, since the whole .cfs file was either preloaded or not.

@gautamworah96 -- maybe open a spinoff issue exploring whether we could offer preloading of just slices within a CFS file?

@gautamworah96
Contributor

Opened #13967

@ChrisHegarty
Contributor

ChrisHegarty commented Nov 12, 2024

Late to the party!!!

I want to dig a little further into the distinction between the implementation of preload and prefetch, at least with the mmap implementation. The former touches every page (effectively forcing it into the cache), while the latter simply advises the OS of the access pattern.

For sequential access, a combination of sequential advice and prefetch works really well, but not so well for random access. Random access, at least for the HNSW graph and the related vector data (be it quantised or not), really needs the data to be resident in memory. Preload gives us that, at least at the time of opening. Prefetch does not.

For optimized vector search use cases, assuming enough RAM, I was thinking that a separate IndexInput::load could be useful. This would be similar to preload (touches every page, or better still madvise MADV_WILLNEED), but can be called after opening. It may or may not be super helpful; it mostly depends on how things get set up, etc., and whether or not preload is good enough for your use case. The separately proposed IndexInput::updateReadAdvice is a more general solution (compared to a load method).

For optimized vector search use cases, assuming enough RAM, I was also thinking that a separate IndexInput::lock could be useful. Basically a wrapper around mlock. The idea being that vector search perf sucks when there is not enough RAM to keep the graph and vector data resident in memory, so allow loading and locking it, then leave the tradeoff to the user to decide if they want this "fail fast" rather than "search slowly" behaviour (at least for pure vector search).
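Purely as illustration -- neither method exists in Lucene today -- the shape of the two proposals might look something like this:

    import java.io.IOException;

    // Hypothetical sketch only; these methods are proposals, not existing Lucene APIs.
    interface LoadableIndexInput {
      // Proposed IndexInput::load: touch every page (or madvise MADV_WILLNEED) after
      // the input is already open, i.e. preload on demand rather than only at open time.
      void load() throws IOException;

      // Proposed IndexInput::lock: load and then pin the mapped region (mlock-style),
      // trading "search slowly while paging" for "fail fast when RAM is short".
      void lock() throws IOException;
    }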
