A pipeline of Python programs to retrieve EU regulatory documents from the Eur-Lex portal (https://eur-lex.europa.eu).
The program starts by sending a given SPARQL query (defining, for example, the types, the language, and the domain of the documents to retrieve) to the EU Sparql endpoint to retrieve CELLAR records. The CELLAR ids are then collected and sent to the EU CELLAR endpoint to download the corresponding .xml
and .html
files. The program can also check which files have already been downloaded so as to download only new files. Finally, the text of the .xml
and .html
files is extracted, cleaned up, and output in new text files.
For further details, see the comments in the code of the Python files.
- In
get_cellar_docs.py
specify the following paths:- path to SPARQL query (
sparql_query
). - path to directory containing already downloaded files to check for existing downloads (
dir_to_check
) - path to directory to store downloaded files (
dwnld_folder_path
) - path to directory to store the text files (
txt_folder_path
)
- path to SPARQL query (
- Run
get_cellar_docs.py
to send the SPARQL query to the EU Sparql endpoint, download the files corresponding to the returned CELLAR ids, and output the clean text intxt
files.
The SPARQL query in the sparql_queries/
directory was designed to retrieve EU regulatory documents in the financial domain using EuroVoc concept ids. It can be used as a template to create new queries for other domains, languages, types of documents, etc.
- The information retrieved from the SPARQL endpoint is stored by default under
sparql_query_results/query_results_<date>-<time>.json
(e.g.,sparql_query_results/query_results_query_results_20201203-145051.json
). - The list of files already downloaded is stored by default under
in_dir_lists/in_dir_<date>-<time>.txt
(e.g.,in_dir_lists/in_dir_20201214-155143.txt
). - The list of new CELLAR ids to send to the EU CELLAR server is stored by default under
new_cellar_ids/new_cellar_ids_<date>-<time>.txt
(e.g.,new_cellar_ids/new_cellar_ids_20201214-155143.txt
). - The retrieved
.xml
and.html
files are downloaded to a new directory named by defaultdata/cellar_files_<date>-<time>/<CELLAR_ID>/
(e.g.,data/cellar_files_20201214-155143/39ca1c1c-3091-11eb-b27b-01aa75ed71a1/
). - The generated
.txt
files are stored by default underdata/text_files_<download_date>-<download_time>.txt
(e.g.,data/text_files_20201214-155143/
).
- The downloaded HTML files are renamed with their CELLAR id (e.g.,
data/cellar_files_20201214-155143/1e4dc7cb-903d-11ea-812f-01aa75ed71a1/1e4dc7cb-903d-11ea-812f-01aa75ed71a1.html
) - The downloaded XML files are not renamed, as a single CELLAR id may apply to several documents (e.g.,
data/cellar_files_20201214-155143/39ca1c1c-3091-11eb-b27b-01aa75ed71a1/C_2020411EN.01050002.xml
) - The generated
.txt
files are all renamed with the name of the original.xml
or.html
file (e.g.,data/text_files_20201214-155143/C_2020411EN.01050002.txt
ordata/text_files_20201214-155143/1e4dc7cb-903d-11ea-812f-01aa75ed71a1.txt
)
- When extracting the text from XML files, footnotes are currently removed to avoid them being inserted in the middle of a sentence in the text file.
- The text from nested tables in HTML files is repeated, as shown in this example.
Portugal — Região Autónoma da Madeira (Autonomous Region of Madeira) — Região Autónoma dos Açores (Autonomous Region of Azores) — Municipalities
— Região Autónoma da Madeira (Autonomous Region of Madeira)
— Região Autónoma dos Açores (Autonomous Region of Azores)
— Municipalities
- Some CELLAR ids point to HTML files that contain URIs instead of content. The linked contents are currently not retrieved (e.g., http://publications.europa.eu/resource/cellar/d4661dab-51b2-11e7-a5ca-01aa75ed71a1)
- The number of CELLAR ids to be downloaded might be different from the number of files actually downloaded due to the fact that a single CELLAR id can correspond to multiple
.xml
files. - When subtracting the list of files that had been downloaded on a previous occasion (e.g.,
14949
) from the list of CELLAR ids retrieved by the query (e.g.,15175
), we obtain the number of CELLAR ids to be sent to the EU CELLAR endpoint to download the corresponding files (e.g.,15175 - 14949 = 226
). However, the number thus obtained may be different from the number of actual downloads (e.g.,242
). This seems to be due to the fact that some of the pre-existing CELLAR ids are not present on the newly retrieved CELLAR id list (e.g.,16 + 226 = 242
).
Selja Seppälä (selja.seppala [at] ucc.ie)
This program was developed as part of the reseearch project titled "RegDef: A Computer-assisted Definition Authoring and Formalisation System for Legal Experts", and supported by the Marie Skłodowska-Curie Career-FIT Fellowship MF20180003, awarded by the Marie Skłodowska-Curie Actions and Enterprise Ireland. Career-FIT has received funding from the European Union’s Horizon2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713654.
Further details on the RegDef project are available at https://seljaseppala.wordpress.com/research/regdef/.