This project contains:
- a stemmer for the Latin language,
- a filter that converts Roman numerals into Arabic ones, and
- a value source that sorts strings containing numbers in numeric order.
Usage example in conf/schema.xml:
<fieldType name="text_la_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="de.uni_koeln.capitularia.lucene_tools.LatinStemFilterFactory"
            preserveOriginal="true" minNounSize="3" minVerbSize="3"/>
  </analyzer>
</fieldType>
The stemmer uses an algorithm by Schinke et al.
See:
Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.
The filter converts Roman numerals into Arabic ones, for example XLII into 42.
Usage example in conf/schema.xml:
<fieldType name="text_la_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="de.uni_koeln.capitularia.lucene_tools.RomanNumeralsFilterFactory"
            preserveOriginal="true"/>
  </analyzer>
</fieldType>
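To illustrate the kind of conversion the Roman numerals filter performs, here is a minimal Python sketch. It is not the project's Java implementation: the function name `roman_to_arabic` is hypothetical, and the sketch assumes the standard subtractive notation without checking that the numeral is well-formed.

```python
from typing import Optional

# Value of each Roman digit (lowercase, since the analyzer chain above
# lowercases tokens before they reach the filter).
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_arabic(token: str) -> Optional[int]:
    """Return the integer value of a Roman numeral, or None if the
    token contains a character that is not a Roman digit.

    Hypothetical helper for illustration only; it does not reject
    malformed numerals such as "iix".
    """
    digits = token.lower()
    if not digits or any(ch not in ROMAN_VALUES for ch in digits):
        return None
    total = 0
    for i, ch in enumerate(digits):
        value = ROMAN_VALUES[ch]
        following = ROMAN_VALUES[digits[i + 1]] if i + 1 < len(digits) else 0
        # A smaller digit in front of a larger one is subtracted (xl = 40).
        total += -value if value < following else value
    return total


print(roman_to_arabic("XLII"))
print(roman_to_arabic("rex"))
```

Tokens that are not made up entirely of Roman digits, such as "rex", yield None in this sketch.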
The value source generates sort keys that order strings with embedded numbers numerically, like this:
- paris-bn-lat-4638
- paris-bn-lat-10528
instead of alphabetically, like this:
- paris-bn-lat-10528
- paris-bn-lat-4638
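One common way to obtain this ordering is to zero-pad every run of digits to a fixed width before comparing. The Python sketch below shows that idea only; it is an assumption about the general technique, not a description of how StringNumberSortValueSourceParser builds its keys, and the names `sort_key` and `width` are hypothetical.

```python
import re


def sort_key(identifier: str, width: int = 10) -> str:
    """Zero-pad each run of ASCII digits to `width` characters so that
    plain string comparison orders the numbers numerically.

    Hypothetical helper; numbers longer than `width` digits are left
    unpadded and would still sort incorrectly.
    """
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), identifier)


ids = ["paris-bn-lat-10528", "paris-bn-lat-4638"]

alphabetical = sorted(ids)             # "1" sorts before "4"
numeric = sorted(ids, key=sort_key)    # 4638 sorts before 10528

print(alphabetical)
print(numeric)
```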
Usage example in conf/solrconfig.xml:
<config>
  <valueSourceParser
    name="strnumsort"
    class="de.uni_koeln.capitularia.lucene_tools.StringNumberSortValueSourceParser"
  />
  ...
</config>
In the query, set the sort parameter to: strnumsort(my_alphanum_id) asc
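As a sketch of how a client might pass this parameter, the Python snippet below builds a URL-encoded select request. The host, port, and core name in `SOLR_SELECT_URL` are placeholders, and `build_query_url` is a hypothetical helper; only the sort expression comes from the configuration above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint: adjust host, port, and core name to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(field: str, query: str = "*:*", descending: bool = False) -> str:
    """Build a select URL that sorts by strnumsort(<field>)."""
    direction = "desc" if descending else "asc"
    params = {"q": query, "sort": f"strnumsort({field}) {direction}"}
    # urlencode escapes the parentheses and the space in the sort expression.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_query_url("my_alphanum_id")
print(url)

# Decoding the query string recovers the sort expression unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["sort"][0])
```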