Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lucene-monitor: make abstract DocumentBatch public #13993

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cpoerschke
Copy link
Contributor

@cpoerschke cpoerschke commented Nov 13, 2024

Description

The static DocumentBatch.of methods are already public, if the class itself was public too that would allow applications -- e.g. see @kotman12's apache/solr#2382 -- to refer to the class e.g. in a visitor.

@cpoerschke cpoerschke marked this pull request as ready for review November 13, 2024 19:26
@romseygeek
Copy link
Contributor

DocumentBatch is really an implementation detail of the Monitor, so I'm not sure why client code would need to refer to it? The linked Solr PR is quite big so it's difficult to see where this would be helpful, could you explain in a bit more detail why it's necessary?

@kotman12
Copy link
Contributor

kotman12 commented Nov 14, 2024

Author here, so the current monitor API makes it really hard to integrate into anything more "custom" like solr. This is because it tightly seals relevant implementation details like cache and index and makes opinionated choices about them, i.e. the cache always caches everything and is a naive hashmap, the index is not exposed and can't be passed in, and the indexing and search flows both hit coarse synchronization that creates blocking.

The simple crud wrapper that monitor exposes today IMO doesn't make it amenable to be integrated into something like solr (maybe also part of the reason why elastic doesn't use it for percolator?). Tldr; The PR takes the most useful parts of monitor and uses them directly but relies on the visitor pattern. We'd like to avoid that. Not sure if there is a better way of changing the API to achieve the same thing? I welcome other thoughts and suggestions.

@romseygeek
Copy link
Contributor

I'm definitely up for making this more extensible. We already have two separate QueryIndex implementations, so maybe that's the best place to start? Cacheing of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier, but allowing clients to use their own QueryIndex implementations would allow some experimentation there.

@kotman12
Copy link
Contributor

Cacheing of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier

The current "solr monitor" PR avoids this problem entirely by just not using the monitor and internal caches altogether. Really the best part of the monitor module is the presearcher with its various optimizations. One of the problems is that to go to it you have to go through the coarse synchronization which is done to update the naive cache and index at the same time. The linked PR does no such synchronization and manages its own caffeine cache which is optimistically updated. Versioning is used to make sure the index matches the cache, and if not, the cache is updated (assuming cache version is lower).

@cpoerschke
Copy link
Contributor Author

Opened #13995 as an alternative i.e. of method visibility to match the class. WDYT?

@kotman12
Copy link
Contributor

kotman12 commented Nov 14, 2024

Following up briefly after revisiting the code, I just don't see a lot of use for a QueryIndex or a Monitor in a more "advanced" set-up such a Solr. The QueryIndex api is quite large and most of the operations aren't really necessary. The useful bits are the QueryTermFilter (and perhaps DataValues). For the common use case that doesn't have to worry about scale or replication Monitor api works fine. But when trying to adapt it to something more advanced like solr, trying to shoehorn QueryIndex seems like a large undertaking for not much gain (at least I don't perceive much). You'd have to bypass or amend SolrIndexSearcher for a workflow that really isn't all that different from regular search. The main place where it deviates is the expensive post-matching step for which solr already has abstractions.

@romseygeek
Copy link
Contributor

The monitor was initially written as a stand-alone application, and I didn't have integrations in mind at all when I was designing the API. But we can definitely re-work things a bit so that there are library functions as well as executable ones.

I guess what we really need is something that takes a Query and produces a Document, for the index-time side of things; and then a SearcherManager/IndexSearcher implementation that handles the QueryTermFilters, a Query that takes a set of Documents and rewrites itself into the presearcher query, and an extensible CollectorManager that calls an abstract method whenever it gets a candidate match. Clients can then implement their own ways of actually doing the match, whether that's re-parsing the query from serialized bytes or looking it up in a cache or whatever. Does that sound like it would fit with what you want to do with Solr?

@kotman12
Copy link
Contributor

The monitor was initially written as a stand-alone application, and I didn't have integrations in mind at all when I was designing the API. But we can definitely re-work things a bit so that there are library functions as well as executable ones.

Yes, as far as I can tell it is widely used as an application for fast query alerting and/or saved search. The API works fine for most stand-alone applications.

I guess what we really need is something that takes a Query and produces a Document, for the index-time side of things;

+

takes a set of Documents and rewrites itself into the presearcher query

I'd say these two are the most important features for the library API to have in order to easily fit into something like Solr, at least the way I was envisioning.

and an extensible CollectorManager that calls an abstract method whenever it gets a candidate match.

Having any kind of hook into the CandidateMatcher would be beneficial. I am not opposed to an IoC style I think you are suggesting. I assume the motivation for the callback-style API is because the CandidateMatcher can be async, i.e. in the ParallelMatcher? Either way, I think I'm on-board. Currently I create a new CandidateMatcher with each doc and call finish on each doc individually which is a bit cumbersome.

and then a SearcherManager/IndexSearcher implementation that handles the QueryTermFilters

Having a QueryTermFilter per IndexSearcher makes sense from a lucene perspective. Unfortunately, Solr inherits its own IndexSearcher so this wouldn't easily apply to the Solr use-case although that is not to say it is a bad idea. It might be beneficial (at least for the Solr use-case) to also expose the QueryTermFilter itself ... I know it's not a lot of code but it would be nice to not copy it. Solr could effectively manage its own QTF at the Solr searcher-level (currently I have it shared between many searchers which is not correct ..)

Obviously you are the best equipped person to make these changes and decisions on the lucene-monitor API but if you would like any help let me know! Either way thanks for taking a look.

@kotman12
Copy link
Contributor

Hey @romseygeek , this PR might not be the best forum but figure I'd mention this anyway.. I'm working on a ReverseSearchQuery (name subject to change) which consolidates the presearch and candidate matching under one query with two-phase iterator. The benefit is that you can take advantage of searcher parallelizations that have been added to lucene over the last years. I think it's especially beneficial for the still hypothetical "solr-monitor" since bolting on, say, a ParallelMatcher isn't very easy. I'm mentioning this in case you think it might have broader use. At any rate, I figured it was at least relevant to any future lucene-monitor API discussions. Also, I'd appreciate another pair of eyes on it since I have very little experience developing in this area of lucene 😃

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants