In German, the "Ergänzungsstrich" (suspended hyphen) is very common. For example, you often see something like:
- "Im- und Export" <- "Import und Export"
- "Tablet-, PC- und Smartphonebasis" <- "Tabletbasis, PC-Basis und Smartphonebasis"
- "Gesellschafts- und sonstigen Vermögensbeteiligungen" <- "Gesellschaftsbeteiligungen und sonstigen Vermögensbeteiligungen"
- "Haarschneide- und -scherapparate" <- "Haarschneideapparate und Haarscherapparate"
This is a major obstacle for automatic text recognition (e.g. NLP) and all kinds of machine learning. So this is a PROOF-OF-CONCEPT for removing these supplemental hyphens from texts. The first version works on the first two examples but not on the last two, more difficult ones.
PS: PROOF-OF-CONCEPT in this context means that I am not a trained programmer and that the published source code does not meet any requirements in terms of performance or readability. It shows an example of a way to solve the problem described, as I have not been able to find any other solution so far. Use of the idea for implementation in software is welcome.
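The expansion idea can be sketched roughly as follows. This is a minimal illustration, not the published proof-of-concept code: the regular expression, the function names, and the tiny KNOWN_HEADS list are all assumptions made for this example. A real implementation would get the head of the last word from a compound splitter instead of a hard-coded list.

```python
import re

# Hypothetical toy list of compound heads; a real system would use a
# compound splitter to find the head of the complete word.
KNOWN_HEADS = ("basis", "port")

def split_head(word, heads=KNOWN_HEADS):
    """Return (body, head) using the longest known head suffix, else None."""
    lower = word.lower()
    for head in sorted(heads, key=len, reverse=True):
        if lower.endswith(head) and len(word) > len(head):
            return word[:-len(head)], word[-len(head):]
    return None

# One or more truncated parts ("Tablet-, PC-"), a conjunction, a full word.
PATTERN = re.compile(r"((?:\w+-,\s*)*\w+-)\s+(und|oder)\s+(\w+)")

def expand(text):
    """Append the head of the full word to every truncated part."""
    def repl(match):
        parts = re.findall(r"(\w+)-", match.group(1))
        conjunction, last = match.group(2), match.group(3)
        split = split_head(last)
        if split is None:
            return match.group(0)  # head unknown: leave the text untouched
        head = split[1]
        expanded = [part + head for part in parts]
        return ", ".join(expanded) + f" {conjunction} {last}"
    return PATTERN.sub(repl, text)

print(expand("Im- und Export"))
print(expand("Tablet-, PC- und Smartphonebasis"))
```

Like the first version described above, this sketch handles only the first two examples. It leaves "Gesellschafts- und sonstigen Vermögensbeteiligungen" and "Haarschneide- und -scherapparate" unchanged, and it writes "PCbasis" rather than "PC-Basis".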
Splits a German compound into its body and head, e.g.
Autobahnraststätte -> Autobahn - Raststätte
Implementation of the method described in the appendix of the thesis:
Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.
The method calculates probabilities of n-grams occurring at the beginning, end and in the middle of words and identifies the most likely position for a split.
The method achieves ~95% accuracy for head detection on the GermaNet compound test set.
A model is provided, trained on 1 million German nouns from Wikipedia.
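To make the idea concrete, here is a strongly simplified sketch of position-based n-gram scoring. It is not the formula from the thesis or the code in char_split.py: the fixed trigram length, the scoring rule (end probability plus start probability minus middle probability) and all function names are assumptions chosen for illustration.

```python
from collections import Counter

N = 3  # n-gram length; an assumption for this sketch

def train(words, n=N):
    """Count how often each n-gram occurs at the start, end and middle of words."""
    start, end, middle, total = Counter(), Counter(), Counter(), Counter()
    for word in words:
        w = word.lower()
        last = len(w) - n
        for i in range(last + 1):
            gram = w[i:i + n]
            total[gram] += 1
            if i == 0:
                start[gram] += 1
            if i == last:
                end[gram] += 1
            if 0 < i < last:
                middle[gram] += 1
    return start, end, middle, total

def score_split(word, i, model, n=N):
    """Score a split before position i: high if the left part ends like a
    word, the right part starts like a word, and the n-gram spanning the
    boundary is rarely seen inside words."""
    start, end, middle, total = model
    w = word.lower()

    def prob(counter, gram):
        return counter[gram] / total[gram] if total[gram] else 0.0

    left_end = w[i - n:i]
    right_start = w[i:i + n]
    across = w[i - 1:i - 1 + n]
    return prob(end, left_end) + prob(start, right_start) - prob(middle, across)

def rank_splits(word, model, n=N):
    """Return (score, body, head) tuples for all positions, best first."""
    candidates = [
        (score_split(word, i, model, n), word[:i], word[i:].capitalize())
        for i in range(n, len(word) - n + 1)
    ]
    return sorted(candidates, reverse=True)

# Toy training data, far too small for real use.
model = train(["Autobahn", "Eisenbahn", "Raststätte", "Rasthof", "Auto", "Bahn"])
for score, body, head in rank_splits("Autobahnraststätte", model)[:3]:
    print(round(score, 3), body, head)
```

With this toy training list, the split Autobahn + Raststätte is ranked first. The scores are not comparable to those printed by char_split.py.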
$ python char_split_train.py <your_train_file>
where <your_train_file> contains one word (noun) per line.
From the command line:
$ python char_split.py <word>
Outputs all possible splits, ranked by their score, e.g.
$ python char_split.py Autobahnraststätte
0.84096566854 Autobahn Raststätte
-0.54568851959 Auto Bahnraststätte
-0.719082070993 Autobahnrast Stätte
...
As a module:
$ python
>>> import char_split
>>> char_split.split_compound('Autobahnraststätte')
[[0.7945872450631273, 'Autobahn', 'Raststätte'],
[-0.7143290887876655, 'Auto', 'Bahnraststätte'],
[-1.1132332878581173, 'Autobahnrast', 'Stätte'],
[-1.4010051533086552, 'Aut', 'Obahnraststätte'],
[-2.3447843979244944, 'Autobahnrasts', 'Tätte'],
[-2.4761904761904763, 'Autobahnra', 'Ststätte'],
[-2.4761904761904763, 'Autobahnr', 'Aststätte'],
[-2.5733333333333333, 'Autob', 'Ahnraststätte'],
[-2.604651162790698, 'Autobahnras', 'Tstätte'],
[-2.7142857142857144, 'Autobah', 'Nraststätte'],
[-2.730248306997743, 'Autobahnrastst', 'Ätte'],
[-2.8033113109925973, 'Autobahnraststä', 'Tte'],
[-3.0, 'Autoba', 'Hnraststätte']]
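Since the result is a list of [score, body, head] entries, picking one split is a small post-processing step. The helper below is a hypothetical sketch and not part of char_split. The min_score threshold of 0.0 is an assumption made here to reject words that are probably not compounds, not a documented recommendation.

```python
def best_split(results, min_score=0.0):
    """Pick the highest-scoring [score, body, head] entry.

    `results` has the shape shown above for split_compound.
    Returns (body, head), or None if no entry reaches `min_score`.
    """
    if not results:
        return None
    score, body, head = max(results, key=lambda entry: entry[0])
    if score < min_score:
        return None
    return body, head

# Sample data copied from the output above.
sample = [
    [0.7945872450631273, "Autobahn", "Raststätte"],
    [-0.7143290887876655, "Auto", "Bahnraststätte"],
    [-1.1132332878581173, "Autobahnrast", "Stätte"],
]
print(best_split(sample))
```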
From the command line:
$ python doc_split.py <dict>
Reads everything from standard input and writes out the same, with the best splits separated by the middle dot character ·.
Each word is split as many times as possible based on the file <dict>, which contains German words, one per line (comment lines beginning with # are allowed).
The name of the default dictionary is in the file doc_config.py.
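The dictionary format described above (one word per line, # comment lines allowed) is simple to read. The loader below is only a sketch of that format. The function name is hypothetical, and how doc_split itself parses the file is not shown here.

```python
def load_wordlist(lines):
    """Collect words from an iterable of lines, skipping blanks and # comments."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words.add(line)
    return words

example = [
    "# sample dictionary\n",
    "Laden\n",
    "Schluß\n",
    "\n",
    "Gesetz\n",
]
print(sorted(load_wordlist(example)))
```

To read a real file, pass an open file object, e.g. `with open(path, encoding="utf-8") as f: words = load_wordlist(f)`.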
Note that the doc_split module retains a cache of words already split, so long documents will typically be processed proportionately faster than short ones.
The cache is discarded when the program ends.
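The effect of such a cache can be illustrated with a small sketch. This is not the doc_split implementation: the helper names are hypothetical and toy_split stands in for the real, much slower splitter. Each distinct word is split only once, so repeated words in a long document cost almost nothing.

```python
import re

def make_cached_splitter(split_word):
    """Wrap a word-splitting function so each distinct word is split once."""
    cache = {}

    def cached(word):
        if word not in cache:
            cache[word] = split_word(word)
        return cache[word]

    return cached, cache

def split_document(text, split_word):
    """Apply split_word to every run of word characters in the text."""
    return re.sub(r"\w+", lambda m: split_word(m.group(0)), text)

calls = []

def toy_split(word):
    # Stand-in for an expensive splitter; records every real invocation.
    calls.append(word)
    return {"Handteller": "Hand·teller"}.get(word, word)

cached_split, cache = make_cached_splitter(toy_split)
print(split_document("Handteller und Handteller", cached_split))
print(len(calls), "splitter calls for 3 tokens")
```

The cache here lives in memory only, so it disappears when the process exits, as described above.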
$ cat sentence1.txt
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus,
sinnlose Bürokratie wie Ladenschlußgesetz und Nachtbackverbot auszutricksen.
$ python doc_split.py <sentence1.txt
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus,
sinnlose Bürokratie wie Laden·schluß·gesetz und Nacht·back·verbot auszutricksen.
As a module:
$ python
>>> import doc_split
>>> # Constant containing a middle dot
>>> doc_split.MIDDLE_DOT
'·'
>>> # Split a word as much as possible, return a list
>>> doc_split.maximal_split('Verfassungsschutzpräsident')
['Verfassungs', 'Schutz', 'Präsident']
>>> # Split a word as much as possible, return a word with middle dots
'Verfassungs·schutz·präsident'
>>> # Split all splittable words in a sentence
>>> doc_split.doc_split('Der Marquis schlug mit dem Handteller auf sein Regiepult.')
Der Marquis schlug mit dem Hand·teller auf sein Regie·pult.
Because of the startup time, you can run the document splitter as a simple server, and the responses will be quicker.
$ python doc_server [ -d ] <dict> <port>
The server will load <dict> and listen on <port>.
The client must
send the raw data in UTF-8 encoding to the port
and close the write side of the port, and the
server will return the split data.
The option -d
causes the server to return a sorted dictionary
of split words instead. Each word is on a single line,
with the original word followed by a tab character followed by the split word.
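A response in the -d format (one entry per line, original word, tab, split word) can be turned into a mapping with a few lines. The function name is hypothetical and the sample response is made up from words used elsewhere in this README.

```python
def parse_split_dictionary(response):
    """Parse 'original<TAB>split' lines into a dict, ignoring blank lines."""
    result = {}
    for line in response.splitlines():
        if not line.strip():
            continue
        original, _, split = line.partition("\t")
        result[original] = split
    return result

sample_response = "Handteller\tHand·teller\nRegiepult\tRegie·pult\n"
print(parse_split_dictionary(sample_response))
```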
Because of Python restrictions, the server is single-threaded.
The default dictionary and port are in the file doc_config.py.
A trivial client is provided:
$ python doc_client <port> <host>
Reads a document from standard input, sends it to the server running on <host> and <port>, and sends the server's output to standard output.
Thus it has the same interface as doc_split (except that the dictionary cannot be specified), but should run somewhat faster.
The default host and port are in the file doc_config.py.
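If you prefer to talk to the server from your own code, the protocol described above (send UTF-8 data, close the write side, read the reply until the server closes) can be sketched with the standard socket module. This is an independent illustration, not the code of the provided client. The function names are hypothetical, and host and port must match your running server.

```python
import socket

def exchange(sock, text):
    """Send text as UTF-8, half-close the socket, and read the full reply."""
    sock.sendall(text.encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)  # signal end of input, keep reading
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # server closed its side: reply is complete
            break
        chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")

def request_split(text, host, port):
    """Connect to a running document-splitting server and return its reply."""
    with socket.create_connection((host, port)) as sock:
        return exchange(sock, text)
```

Decoding only after all chunks have arrived avoids breaking multi-byte characters such as · that may be split across two recv calls.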
To download German and Dutch dictionaries for doc_split and doc_server:
$ cd dicts
$ sh getdicts
This will download the spelling plugins from the LibreOffice site,
extract the wordlists, and write five files into the current directory.
It leaves a good many files in /tmp, which are not needed further.
- The dictionaries de-DE.dic, de-AT.dic, and de-CH.dic are fairly extensive (about 250,000 words each) and provide current German, Austrian, and Swiss spelling.
- The file de-1901.dic provides the spelling used between 1901 and 1996.
- The file misc.dic is a collection of nouns that are mis-split and are therefore included in the dictionary so that they won't be split.
- The file legal.dic contains legal terms. Remove it before running getdicts if you don't want it to be included.
- The file de-mixed.dic is a merger of all of the other files.
- The file nl-NL.dic is from OpenOffice and provides Dutch spelling (not currently used).
You can add your own wordlists before running getdicts if you want.
They must be plain UTF-8 text with one word per line and begin with the correct language code (de for German).
If the program is not splitting hard enough for your purposes, you may want to find and use a smaller dictionary.