-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lucene-monitor: make abstract DocumentBatch public #13993
base: main
Are you sure you want to change the base?
lucene-monitor: make abstract DocumentBatch public #13993
Conversation
DocumentBatch is really an implementation detail of the Monitor, so I'm not sure why client code would need to refer to it? The linked Solr PR is quite big so it's difficult to see where this would be helpful, could you explain in a bit more detail why it's necessary? |
Author here, so the current monitor API makes it really hard to integrate into anything more "custom" like solr. This is because it tightly seals relevant implementation details like cache and index and makes opinionated choices about them, i.e. the cache always caches everything and is a naive hashmap, the index is not exposed and can't be passed in, and the indexing and search flows both hit coarse synchronization that creates blocking. The simple crud wrapper that monitor exposes today IMO doesn't make it amenable to be integrated into something like solr (maybe also part of the reason why elastic doesn't use it for percolator?). Tldr; The PR takes the most useful parts of monitor and uses them directly but relies on the visitor pattern. We'd like to avoid that. Not sure if there is a better way of changing the API to achieve the same thing? I welcome other thoughts and suggestions. |
I'm definitely up for making this more extensible. We already have two separate QueryIndex implementations, so maybe that's the best place to start? Cacheing of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier, but allowing clients to use their own QueryIndex implementations would allow some experimentation there. |
The current "solr monitor" PR avoids this problem entirely by just not using the monitor and internal caches altogether. Really the best part of the monitor module is the presearcher with its various optimizations. One of the problems is that to go to it you have to go through the coarse synchronization which is done to update the naive cache and index at the same time. The linked PR does no such synchronization and manages its own caffeine cache which is optimistically updated. Versioning is used to make sure the index matches the cache, and if not, the cache is updated (assuming cache version is lower). |
Opened #13995 as an alternative i.e. |
Following up briefly after revisiting the code, I just don't see a lot of use for a QueryIndex or a Monitor in a more "advanced" set-up such a Solr. The QueryIndex api is quite large and most of the operations aren't really necessary. The useful bits are the QueryTermFilter (and perhaps DataValues). For the common use case that doesn't have to worry about scale or replication Monitor api works fine. But when trying to adapt it to something more advanced like solr, trying to shoehorn QueryIndex seems like a large undertaking for not much gain (at least I don't perceive much). You'd have to bypass or amend SolrIndexSearcher for a workflow that really isn't all that different from regular search. The main place where it deviates is the expensive post-matching step for which solr already has abstractions. |
The monitor was initially written as a stand-alone application, and I didn't have integrations in mind at all when I was designing the API. But we can definitely re-work things a bit so that there are library functions as well as executable ones. I guess what we really need is something that takes a Query and produces a Document, for the index-time side of things; and then a SearcherManager/IndexSearcher implementation that handles the QueryTermFilters, a Query that takes a set of Documents and rewrites itself into the presearcher query, and an extensible CollectorManager that calls an abstract method whenever it gets a candidate match. Clients can then implement their own ways of actually doing the match, whether that's re-parsing the query from serialized bytes or looking it up in a cache or whatever. Does that sound like it would fit with what you want to do with Solr? |
Yes, as far as I can tell it is widely used as an application for fast query alerting and/or saved search. The API works fine for most stand-alone applications.
+
I'd say these two are the most important features for the library API to have in order to easily fit into something like Solr, at least the way I was envisioning.
Having any kind of hook into the
Having a Obviously you are the best equipped person to make these changes and decisions on the lucene-monitor API but if you would like any help let me know! Either way thanks for taking a look. |
Hey @romseygeek , this PR might not be the best forum but figure I'd mention this anyway.. I'm working on a ReverseSearchQuery (name subject to change) which consolidates the presearch and candidate matching under one query with two-phase iterator. The benefit is that you can take advantage of searcher parallelizations that have been added to lucene over the last years. I think it's especially beneficial for the still hypothetical "solr-monitor" since bolting on, say, a |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Description
The static
DocumentBatch.of
methods are already public, if the class itself was public too that would allow applications -- e.g. see @kotman12's apache/solr#2382 -- to refer to the class e.g. in a visitor.