lucene-monitor: make abstract DocumentBatch public #13993

cpoerschke · 2024-11-13T18:50:44Z

Description

The static DocumentBatch.of methods are already public, if the class itself was public too that would allow applications -- e.g. see @kotman12's apache/solr#2382 -- to refer to the class e.g. in a visitor.

romseygeek · 2024-11-14T10:19:11Z

DocumentBatch is really an implementation detail of the Monitor, so I'm not sure why client code would need to refer to it? The linked Solr PR is quite big so it's difficult to see where this would be helpful, could you explain in a bit more detail why it's necessary?

kotman12 · 2024-11-14T11:35:49Z

Author here, so the current monitor API makes it really hard to integrate into anything more "custom" like solr. This is because it tightly seals relevant implementation details like cache and index and makes opinionated choices about them, i.e. the cache always caches everything and is a naive hashmap, the index is not exposed and can't be passed in, and the indexing and search flows both hit coarse synchronization that creates blocking.

The simple crud wrapper that monitor exposes today IMO doesn't make it amenable to be integrated into something like solr (maybe also part of the reason why elastic doesn't use it for percolator?). Tldr; The PR takes the most useful parts of monitor and uses them directly but relies on the visitor pattern. We'd like to avoid that. Not sure if there is a better way of changing the API to achieve the same thing? I welcome other thoughts and suggestions.

romseygeek · 2024-11-14T13:01:29Z

I'm definitely up for making this more extensible. We already have two separate QueryIndex implementations, so maybe that's the best place to start? Cacheing of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier, but allowing clients to use their own QueryIndex implementations would allow some experimentation there.

kotman12 · 2024-11-14T13:11:15Z

Cacheing of parsed queries is pretty tightly coupled to how the query index is implemented so separating them out might be trickier

The current "solr monitor" PR avoids this problem entirely by just not using the monitor and internal caches altogether. Really the best part of the monitor module is the presearcher with its various optimizations. One of the problems is that to go to it you have to go through the coarse synchronization which is done to update the naive cache and index at the same time. The linked PR does no such synchronization and manages its own caffeine cache which is optimistically updated. Versioning is used to make sure the index matches the cache, and if not, the cache is updated (assuming cache version is lower).

cpoerschke · 2024-11-14T13:13:12Z

Opened #13995 as an alternative i.e. of method visibility to match the class. WDYT?

kotman12 · 2024-11-14T21:53:03Z

Following up briefly after revisiting the code, I just don't see a lot of use for a QueryIndex or a Monitor in a more "advanced" set-up such a Solr. The QueryIndex api is quite large and most of the operations aren't really necessary. The useful bits are the QueryTermFilter (and perhaps DataValues). For the common use case that doesn't have to worry about scale or replication Monitor api works fine. But when trying to adapt it to something more advanced like solr, trying to shoehorn QueryIndex seems like a large undertaking for not much gain (at least I don't perceive much). You'd have to bypass or amend SolrIndexSearcher for a workflow that really isn't all that different from regular search. The main place where it deviates is the expensive post-matching step for which solr already has abstractions.

romseygeek · 2024-11-16T13:43:50Z

The monitor was initially written as a stand-alone application, and I didn't have integrations in mind at all when I was designing the API. But we can definitely re-work things a bit so that there are library functions as well as executable ones.

I guess what we really need is something that takes a Query and produces a Document, for the index-time side of things; and then a SearcherManager/IndexSearcher implementation that handles the QueryTermFilters, a Query that takes a set of Documents and rewrites itself into the presearcher query, and an extensible CollectorManager that calls an abstract method whenever it gets a candidate match. Clients can then implement their own ways of actually doing the match, whether that's re-parsing the query from serialized bytes or looking it up in a cache or whatever. Does that sound like it would fit with what you want to do with Solr?

kotman12 · 2024-11-18T01:58:08Z

The monitor was initially written as a stand-alone application, and I didn't have integrations in mind at all when I was designing the API. But we can definitely re-work things a bit so that there are library functions as well as executable ones.

Yes, as far as I can tell it is widely used as an application for fast query alerting and/or saved search. The API works fine for most stand-alone applications.

I guess what we really need is something that takes a Query and produces a Document, for the index-time side of things;

+

takes a set of Documents and rewrites itself into the presearcher query

I'd say these two are the most important features for the library API to have in order to easily fit into something like Solr, at least the way I was envisioning.

and an extensible CollectorManager that calls an abstract method whenever it gets a candidate match.

Having any kind of hook into the CandidateMatcher would be beneficial. I am not opposed to an IoC style I think you are suggesting. I assume the motivation for the callback-style API is because the CandidateMatcher can be async, i.e. in the ParallelMatcher? Either way, I think I'm on-board. Currently I create a new CandidateMatcher with each doc and call finish on each doc individually which is a bit cumbersome.

and then a SearcherManager/IndexSearcher implementation that handles the QueryTermFilters

Having a QueryTermFilter per IndexSearcher makes sense from a lucene perspective. Unfortunately, Solr inherits its own IndexSearcher so this wouldn't easily apply to the Solr use-case although that is not to say it is a bad idea. It might be beneficial (at least for the Solr use-case) to also expose the QueryTermFilter itself ... I know it's not a lot of code but it would be nice to not copy it. Solr could effectively manage its own QTF at the Solr searcher-level (currently I have it shared between many searchers which is not correct ..)

Obviously you are the best equipped person to make these changes and decisions on the lucene-monitor API but if you would like any help let me know! Either way thanks for taking a look.

kotman12 · 2024-11-26T21:57:54Z

Hey @romseygeek , this PR might not be the best forum but figure I'd mention this anyway.. I'm working on a ReverseSearchQuery (name subject to change) which consolidates the presearch and candidate matching under one query with two-phase iterator. The benefit is that you can take advantage of searcher parallelizations that have been added to lucene over the last years. I think it's especially beneficial for the still hypothetical "solr-monitor" since bolting on, say, a ParallelMatcher isn't very easy. I'm mentioning this in case you think it might have broader use. At any rate, I figured it was at least relevant to any future lucene-monitor API discussions. Also, I'd appreciate another pair of eyes on it since I have very little experience developing in this area of lucene 😃

github-actions · 2024-12-11T00:24:46Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

lucene-monitor: make abstract DocumentBatch public

ba66cc9

cpoerschke added the module:monitor label Nov 13, 2024

cpoerschke requested a review from romseygeek November 13, 2024 18:50

cpoerschke mentioned this pull request Nov 13, 2024

SOLR-4587: integrate lucene-monitor into solr apache/solr#2382

Draft

7 tasks

cpoerschke added 2 commits November 13, 2024 19:03

add class level javadocs

dd04142

Update DocumentBatch.java

f8188bd

cpoerschke marked this pull request as ready for review November 13, 2024 19:26

cpoerschke mentioned this pull request Nov 14, 2024

lucene-monitor: make static DocumentBatch.of package scope #13995

Open

github-actions bot added the Stale label Dec 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lucene-monitor: make abstract DocumentBatch public #13993

lucene-monitor: make abstract DocumentBatch public #13993

cpoerschke commented Nov 13, 2024 •

edited

Loading

romseygeek commented Nov 14, 2024

kotman12 commented Nov 14, 2024 •

edited

Loading

romseygeek commented Nov 14, 2024

kotman12 commented Nov 14, 2024

cpoerschke commented Nov 14, 2024

kotman12 commented Nov 14, 2024 •

edited

Loading

romseygeek commented Nov 16, 2024

kotman12 commented Nov 18, 2024

kotman12 commented Nov 26, 2024

github-actions bot commented Dec 11, 2024

lucene-monitor: make abstract DocumentBatch public #13993

Are you sure you want to change the base?

lucene-monitor: make abstract DocumentBatch public #13993

Conversation

cpoerschke commented Nov 13, 2024 • edited Loading

Description

romseygeek commented Nov 14, 2024

kotman12 commented Nov 14, 2024 • edited Loading

romseygeek commented Nov 14, 2024

kotman12 commented Nov 14, 2024

cpoerschke commented Nov 14, 2024

kotman12 commented Nov 14, 2024 • edited Loading

romseygeek commented Nov 16, 2024

kotman12 commented Nov 18, 2024

kotman12 commented Nov 26, 2024

github-actions bot commented Dec 11, 2024

cpoerschke commented Nov 13, 2024 •

edited

Loading

kotman12 commented Nov 14, 2024 •

edited

Loading

kotman12 commented Nov 14, 2024 •

edited

Loading