Skip to content

macocu/documentation

Repository files navigation

Software developed for the MaCoCu Action

Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages (MaCoCu) is an action funded by the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. The project focuses on collecting monolingual and parallel data from the Internet, specially for under-resourced languages and DSI-specific data. All the software developed for this action will be released under free/open-source licenses.

At the moment, the second data release has been published for the ten languages targeted in the Action (Albanian, Bulgarian, Croatian, Icelandic, Maltese, Macedonian, Montenegrin, Serbian, Slovene, Turkish) plus a bonus language: Bosnian. The steps followed to collect this data started with a top-level domain crawling process. In this way, the national top-level domains of the countries where the targeted languages are official plus several generic top-level domains, such as .eu, .com, or .org, were crawled for a few months, and all the documents downloaded in the targeted languages (and also in English) were stored. The tool used for this purpose was the MaCoCu crawler. The output of this tool is produced in the prevertical format, which can be processed using the python module prevert.

Crawled documents were then processed using two pipelines: Monotextor, aimed at building monolingual corpora, Bitextor, aimed at building parallel corpora. The specific versions of the pipelines used to produce the corpora included in the first data release, are:

The main purpose of Bitextor is to preprocess, identify and align parallel content within a collection of multilingual documents. However, it also includes several curation components aimed at improving the quality of the final corpora. These components are:

Apart from these components integrated in the pipeline, two independent tools were also applied on the clean corpora to enrich them with metadata. These two components are:

The Monofixer pipeline is aimed at building monolingual corpora from a collection of multilingual documents. One of the key components of this pipeline is: Monocleaner, a tool that aims at detecting disfluent sentences in a monolingual corpus. For the first data release of the project, BiROAMer was also used on machine-translated content to identify fragments of text containing personal data.

Finally, the second data release of the project includes, for each segment pair, the probability distribution of belonging to the DSIs targeted in the MaCoCu project: e-Health, e-Justice, Online Dispute Resolution, Europeana, Open Data Portal, Business Registers Interconnection System, e-Procurement, Safer Internet, Cybersecurity, and Electronic Exchange of Social Security Information. These probabilities are obtained by using the tool MaCoCu DSI.

Disclaimer

The contents of this document are the sole responsibility of the members of the MaCoCu consortium, and do not necessarily reflect the opinion of the European Union.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published