Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support both single- and multi-value string fields in facet counting (non-taxonomy based approaches) [LUCENE-9950] #10989

Closed
asfimport opened this issue May 3, 2021 · 13 comments

Comments

@asfimport
Copy link

asfimport commented May 3, 2021

Users wanting to facet count string-based fields using a non-taxonomy-based approach can use SortedSetDocValueFacetCounts, which accumulates facet counts based on a SortedSetDocValues field. This requires the stored doc values to be multi-valued (i.e., SORTED_SET), and doesn't work on single-valued fields (i.e., SORTED). In contrast, if a user wants to facet count on a stored numeric field, they can use LongValueFacetCounts, which supports both single- and multi-valued fields (and in #10987, we now auto-detect instead of asking the user to specify).

Let's update SortedSetDocValueFacetCounts to also support, and automatically detect single- and multi-value fields. Note that this is a spin-off issue from #10985, where @rmuir points out that this can essentially be a one-line change, but we may want to do some class renaming at the same time. Also note that we should do this in ConcurrentSortedSetDocValuesFacetCounts while we're at it.


Migrated from LUCENE-9950 by Greg Miller (@gsmiller), resolved May 25 2021
Pull requests: #133, apache/lucene-solr#2497, #145, apache/lucene-solr#2497

@asfimport
Copy link
Author

asfimport commented May 6, 2021

Greg Miller (@gsmiller) (migrated from JIRA)

I've started digging into this code a bit and find myself a little confused on the role of SortedSetDocValueFacetCounts and the best approach for moving forward with this idea. Taking a step back from thinking about single- vs. multi-valued support, I was a little surprised to find that SSDV facet counting makes some pretty strict assumptions about the format of the SSDV values. Specifically, it assumes that each value represents a strict two-level facet "path" in the form of "dimension/value".

In contrast to this, looking at something like LongValueFacetCounts or RangeFacetCounts, the approach makes no assumptions about the stored doc values. These facet counting implementations can be pointed to any numeric doc value field, while SortedSetDocValueFacetCounts has to be pointed at a field that's indexed in a very specific way. In fact, it looks like most users of this functionality will add SortedSetDocValuesFacetField to their document and rely on FacetsConfig#build to create the doc value field in the proper format.

With all this in mind, I wonder if it makes sense to add a new facet counting implementation that makes no assumptions about what is stored in the doc value field (other than being string content – i.e., SortedSetDocValues or SortedDocValues), and implement counting functionality similar to LongValueFacetCounts. This would assume "flat" values in each field, where the field is effectively equivalent to the "dimension" (e.g., see the approach in LongValueFacetCounts).

It seems like this idea of a general string field facet counting implementation may have been behind SortedSetDocValueFacetCounts originally (#5860)? Maybe it evolved into something different that makes these assumptions about the contents of the stored field? One advantage of its implementation is that many different "dimensions" can be stored in the same field for efficient counting, but it loses the flexibility to just dynamically count against any string doc value field. This also makes me wonder a little bit at the use-cases it's designed for, given the existence of taxonomy-based facet counting. It seems like the only advantage it might offer over a taxonomy-based approach is not requiring the side-car index?

Anyway, back to the main point: I would propose adding a new type of facet counting implementation (something like "StringValueFacetCounts" as proposed by @rmuir), that has no requirements on the content stored in the field (which could be single- or multi-valued), and simply counts unique values in the same way LongValueFacetCounts does. Thoughts on this approach?

@asfimport
Copy link
Author

Robert Muir (@rmuir) (migrated from JIRA)

+1 to add a "normal" facet method here, and deprecate this existing approach (it sounds really bad) for eventual removal.
No need to encode "dimension" in any value, that's why different fields are for...

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

Thanks @rmuir. I'm working up a new facets collector implementation now that doesn't make any assumptions around the string payload (e.g., "dimensions" and such). Will see if I can get a PR out sometime in the coming week.

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

Just posted a PR for a new facet counting implementation as discussed here. I didn't go so far as to mark SortedSetDocValueFacetCounts as deprecated at this point, thinking there may still be some use-cases for "packing" multiple "dimensions" into one field (like the taxonomy-based approach) while not requiring a taxonomy index.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit ade50f0 in lucene's branch refs/heads/main from Greg Miller
https://gitbox.apache.org/repos/asf?p=lucene.git;h=ade50f0

LUCENE-9950: New facet counting implementation for general string doc value fields (#133)

Co-authored-by: Greg Miller <[email protected]>

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

Thanks @mikemccand for having a look at the PR! I think we can also backport this to 8.9, so I went ahead and worked up a PR for that, along with one more PR for moving the CHANGES entry in the apache/lucene repo.

@asfimport
Copy link
Author

ASF subversion and git services (migrated from JIRA)

Commit 5e99b8b in lucene-solr's branch refs/heads/branch_8x from Greg Miller
https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5e99b8b

LUCENE-9950: New facet counting implementation for general string doc value fields (#2497)

Co-authored-by: Greg Miller <[email protected]>

@asfimport
Copy link
Author

Alexander Lukyanchikov (@sqshq) (migrated from JIRA)

Thank you for adding the new facet implementation, @gsmiller!

It seems like the only advantage it might offer over a taxonomy-based approach is not requiring the side-car index

A couple of SSDVFF advantages we found is the ability to perform fast index merge operation, since it is a regular index and does not require global ordinals translation logic (regular index merge with HardlinkCopyDirectoryWrapper takes 3 minutes in our tests, while main+taxonomy pairs merge is about 85 minutes for ~ 200Gb index size). Also, SSDVFF indexing performance is better and unlike the Taxonomy approach, scales with added threads. These advantages tipped the scales in favor of SSDVFF in our case, although Taxonomy provides a bit better query performance and allows hierarchical faceting.

— I didn't go so far as to mark SortedSetDocValueFacetCounts as deprecated at this point, thinking there may still be some use-cases for "packing" multiple "dimensions" into one field (like the taxonomy-based approach) while not requiring a taxonomy index.

I wonder what use cases do you have in mind for that? Also, any chance that you have done some performance comparison with SortedSetDocValuesFacetField implementation? I remember reading somewhere that facet dimensions stored in a single field can provide better performance (e.g due to CPU reference locality), but not sure how big the difference can be.

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

Thanks for adding your additional insights and use-case details @sqshq! As for specific use cases and performance comparison to SSDVFC, I'll describe what I was thinking (but unfortunately don't have any benchmark data for performance). In cases where a large majority of all possible dims need to be counted, it should be more efficient to pack them into a single field, allowing the matching docs to be iterated a single time (counting all dims/values along the way and probably getting some locality benefits as you mention). On the other hand, if only a small percentage of all available dims need to be counted, a lot of wasteful counting/computation takes place during the single counting iteration. In these cases, using a separate field-per-dimension means iterating the hits multiple times but not doing any unnecessary dim/value counting. Seems like there could be use-cases for both approaches.

 

To be honest, I approached this more from the standpoint of being able to count doc value fields with any string data in them (unlike SSDVFC which assumes a specific format involving a dim). This is a nice functional complement to something like LongValueFacetCounts. It would be really interesting to do some performance benchmarking though!

@asfimport
Copy link
Author

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I've noticed a few failures in the new TestStringValueFacetCounts class. I didn't have the chance to dig into them. This line reproduces reliably on main:

./gradlew test --tests TestStringValueFacetCounts.testRandom -Dtests.seed=B6D7A1E1813518E3 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=nus-SS -Dtests.timezone=Asia/Pyongyang -Dtests.asserts=true -Dtests.file.encoding=UTF-8

@asfimport
Copy link
Author

Greg Miller (@gsmiller) (migrated from JIRA)

Hmm, interesting. Thanks for calling this out @jtibshirani! Seems like random testing might have caught some sort of bug here. I'll have a look as soon as I'm able to.

@asfimport
Copy link
Author

asfimport commented Jun 4, 2021

Greg Miller (@gsmiller) (migrated from JIRA)

I think I found the issue. It's with the test itself. I created #11030 to track. Thanks!

@asfimport
Copy link
Author

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Closing after the 8.9.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant