feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting #10887

Open
wants to merge 2 commits into
base: develop
2 changes: 2 additions & 0 deletions conf/solr/schema.xml
@@ -814,6 +814,8 @@
     <!-- KD-tree versions of date fields -->
     <fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
     <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>

+    <fieldType name="date_range" class="solr.DateRangeField"/>
+
     <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
     <fieldType name="binary" class="solr.BinaryField"/>
11 changes: 11 additions & 0 deletions doc/release-notes/10887-solr-field-types.md
@@ -0,0 +1,11 @@
This release enhances how numerical and date fields are indexed in Solr. Previously, all fields were indexed as English text (text_en), but with this update:

* Integer fields are indexed as `plong`
* Float fields are indexed as `pdouble`
* Date fields are indexed as `date_range` (`solr.DateRangeField`)

This enables range queries via the search bar or API, such as `exampleIntegerField:[25 TO 50]` or `exampleDateField:[2000-11-01 TO 2014-12-01]`.
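
As a minimal sketch of how such range clauses are put together (this is not code from this PR), the helper below builds the inclusive `[low TO high]` form in plain Java. The class and method names are hypothetical, and the field names are the placeholders used in the sentence above.

```java
// Hypothetical helper, for illustration only: builds Solr range clauses as strings.
class RangeQueryExample {

    /** Returns an inclusive range clause of the form field:[low TO high]. */
    public static String inclusiveRange(String field, String low, String high) {
        return field + ":[" + low + " TO " + high + "]";
    }

    public static void main(String[] args) {
        // Numeric range on a field indexed as plong.
        System.out.println(inclusiveRange("exampleIntegerField", "25", "50"));
        // Date range on a field indexed as date_range.
        System.out.println(inclusiveRange("exampleDateField", "2000-11-01", "2014-12-01"));
        // Solr accepts * as an open-ended bound.
        System.out.println(inclusiveRange("exampleIntegerField", "25", "*"));
    }
}
```

The resulting strings can be typed into the search bar or passed as the query parameter of a search API call.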

To activate this feature, Dataverse administrators must update their Solr schema.xml (manually or by rerunning `update-fields.sh`) and reindex all datasets.

Additionally, search result highlighting is now more accurate: only fields that actually match the query are highlighted. If the query is restricted to specific fields, highlighting is restricted to those fields as well.
9 changes: 4 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/DatasetFieldType.java
@@ -531,15 +531,14 @@ public String getDisplayName() {
     public SolrField getSolrField() {
         SolrField.SolrType solrType = SolrField.SolrType.TEXT_EN;
         if (fieldType != null) {
-
-            /**
-             * @todo made more decisions based on fieldType: index as dates,
-             * integers, and floats so we can do range queries etc.
-             */
             if (fieldType.equals(FieldType.DATE)) {
                 solrType = SolrField.SolrType.DATE;
             } else if (fieldType.equals(FieldType.EMAIL)) {
                 solrType = SolrField.SolrType.EMAIL;
+            } else if (fieldType.equals(FieldType.INT)) {
+                solrType = SolrField.SolrType.INTEGER;
+            } else if (fieldType.equals(FieldType.FLOAT)) {
+                solrType = SolrField.SolrType.FLOAT;
             }

             Boolean parentAllowsMultiplesBoolean = false;
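
The mapping in this hunk, together with the Solr type names listed in the release note, can be sketched as a standalone lookup. The enums below are simplified stand-ins assumed for illustration; they are not the actual `DatasetFieldType.FieldType` or `SolrField.SolrType` classes.

```java
// Simplified, hypothetical stand-ins for the Dataverse types; illustration only.
class FieldTypeMappingSketch {

    enum FieldType { TEXT, DATE, EMAIL, INT, FLOAT }

    enum SolrType {
        TEXT_EN("text_en"), INTEGER("plong"), FLOAT("pdouble"), DATE("date_range"), EMAIL("text_en");

        private final String type;

        SolrType(String type) {
            this.type = type;
        }

        public String getType() {
            return type;
        }
    }

    /** Falls back to TEXT_EN for null or otherwise unmapped field types. */
    public static SolrType toSolrType(FieldType fieldType) {
        if (fieldType == null) {
            return SolrType.TEXT_EN;
        }
        switch (fieldType) {
            case DATE:  return SolrType.DATE;
            case EMAIL: return SolrType.EMAIL;
            case INT:   return SolrType.INTEGER;
            case FLOAT: return SolrType.FLOAT;
            default:    return SolrType.TEXT_EN;
        }
    }

    public static void main(String[] args) {
        // Print the Solr schema type chosen for each field type.
        for (FieldType ft : FieldType.values()) {
            System.out.println(ft + " -> " + toSolrType(ft).getType());
        }
    }
}
```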
@@ -1061,6 +1061,8 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
         // no-op. we want to keep email address out of Solr per
         // https://github.com/IQSS/dataverse/issues/759
     } else if (dsfType.getSolrField().getSolrType().equals(SolrField.SolrType.DATE)) {
+        // we index dates as full strings (YYYY, YYYY-MM or YYYY-MM-DD)
+        // for use in facets, we index only the year (YYYY)
         String dateAsString = "";
         if (!dsf.getValues_nondisplay().isEmpty()) {
             dateAsString = dsf.getValues_nondisplay().get(0);
@@ -1080,7 +1082,7 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
     logger.fine("YYYY only: " + datasetFieldFlaggedAsDate);
     // solrInputDocument.addField(solrFieldSearchable,
     // Integer.parseInt(datasetFieldFlaggedAsDate));
-    solrInputDocument.addField(solrFieldSearchable, datasetFieldFlaggedAsDate);
+    solrInputDocument.addField(solrFieldSearchable, dateAsString);
     if (dsfType.getSolrField().isFacetable()) {
         // solrInputDocument.addField(solrFieldFacetable,
         // Integer.parseInt(datasetFieldFlaggedAsDate));
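
The split described in the comments (full date string for the searchable field, year only for the facet) can be sketched as follows. This is an assumption-laden illustration: the class, method names, and the regular expression are hypothetical and stand in for whatever parsing `toSolrDocs` actually performs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the parsing logic used in the PR.
class DateIndexingSketch {

    // Accepts YYYY, YYYY-MM or YYYY-MM-DD and captures the year in group 1.
    private static final Pattern PARTIAL_DATE =
            Pattern.compile("(\\d{4})(-\\d{2}(-\\d{2})?)?");

    /** Value for the searchable field: the date string as entered, trimmed. */
    public static String searchableValue(String dateAsString) {
        return dateAsString.trim();
    }

    /** Value for the facet field: the four-digit year, if the input has a supported shape. */
    public static Optional<String> facetYear(String dateAsString) {
        Matcher matcher = PARTIAL_DATE.matcher(dateAsString.trim());
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"2014", "2000-11", "2014-12-01", "not a date"}) {
            System.out.println(searchableValue(input) + " -> facet year: "
                    + facetYear(input).orElse("(none)"));
        }
    }
}
```

Note that the pattern only checks the shape of the string; it does not reject impossible months or days.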
@@ -271,7 +271,7 @@ public SolrQueryResponse search(
     List<DatasetFieldType> datasetFields = datasetFieldService.findAllOrderedById();
     Map<String, String> solrFieldsToHightlightOnMap = new HashMap<>();
     if (addHighlights) {
-        solrQuery.setHighlight(true).setHighlightSnippets(1);
+        solrQuery.setHighlight(true).setHighlightSnippets(1).setHighlightRequireFieldMatch(true);
         Integer fragSize = systemConfig.getSearchHighlightFragmentSize();
         if (fragSize != null) {
             solrQuery.setHighlightFragsize(fragSize);
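
To make the effect of the new setter easier to see, the sketch below lists the Solr highlighting request parameters that, as far as I understand SolrJ, the chained calls above correspond to. It deliberately avoids SolrJ so that it runs on its own; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, SolrJ-free illustration of the highlighting parameters.
class HighlightParamsSketch {

    /** Builds the highlighting parameters; fragSize may be null to keep Solr's default. */
    public static Map<String, String> highlightParams(Integer fragSize) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("hl", "true");                   // setHighlight(true)
        params.put("hl.snippets", "1");             // setHighlightSnippets(1)
        params.put("hl.requireFieldMatch", "true"); // setHighlightRequireFieldMatch(true)
        if (fragSize != null) {
            params.put("hl.fragsize", String.valueOf(fragSize)); // setHighlightFragsize(fragSize)
        }
        return params;
    }

    public static void main(String[] args) {
        highlightParams(320).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With `hl.requireFieldMatch=true`, Solr highlights a field only when the query matched in that field, which is what limits highlighting to the fields named in a field-specific query.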
@@ -63,7 +63,7 @@ public enum SolrType {
      * support range queries) in
      * https://github.com/IQSS/dataverse/issues/370
      */
-    STRING("string"), TEXT_EN("text_en"), INTEGER("int"), LONG("long"), DATE("text_en"), EMAIL("text_en");
+    STRING("string"), TEXT_EN("text_en"), INTEGER("plong"), FLOAT("pdouble"), DATE("date_range"), EMAIL("text_en");

     private String type;
