Scripts and final data from harvesting §20 questions from the Danish Parliament, 2021-2024
The PDF files are omitted for obvious reasons.
You will need:
- A platform that can run Java, wget, pdftotext, jq and curl
- Time (it takes several hours to download the PDFs)
- Disk space (about a gigabyte)
Use the Java program `BuildDownloadScript` to create a script that downloads the initial JSON files from the Danish Parliament's Open Data platform. These files contain metadata about the §20 questions and links to the PDFs for questions and answers. Collect the output from the Java code in a script file (I used `getit.sh`). Make it executable with `chmod`. Execute it and it will download the initial JSON in chunks.
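For example, along these lines (a sketch; the exact compile step and class path depend on this repo's layout):

```bash
# Compile the helper and capture its output as a shell script.
javac BuildDownloadScript.java
java BuildDownloadScript > getit.sh
chmod +x getit.sh
./getit.sh    # downloads the initial JSON in chunks matching the file*.json glob used below
```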
Extract the list of files to download with `jq ".value[].Dokument.Fil[].filurl" file*.json | sort -n | grep -v '""' > filelist.txt` (the `grep` drops entries with an empty `filurl`).
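For reference, the `jq` path implies each downloaded chunk has roughly this shape (only the fields used here are shown; the URL is a placeholder):

```json
{
  "value": [
    {
      "Dokument": {
        "Fil": [
          {
            "titel": "…",
            "versionsdato": "…",
            "filurl": "https://example.org/question.pdf"
          }
        ]
      }
    }
  ]
}
```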
Use the Java program `BuildWgetScript` to read `filelist.txt` and generate a script that downloads the PDF files (`downloadpdf.sh`). Make it executable with `chmod`. Execute it and it will download the PDF files and extract the text in them using `pdftotext`.
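Again as a sketch (the generated commands are my guess at the pattern; `pdftotext doc.pdf` writes `doc.txt` by default):

```bash
javac BuildWgetScript.java
java BuildWgetScript > downloadpdf.sh   # reads filelist.txt
chmod +x downloadpdf.sh
./downloadpdf.sh
# Each generated line presumably pairs a download with text extraction, e.g.:
#   wget -O 12345.pdf "https://example.org/12345.pdf" && pdftotext 12345.pdf
```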
Generate the list of metadata to include in the big JSON file with `jq ".value[].Dokument.Fil[] | .titel, .versionsdato, .filurl" file*.json > titlerdatofiler.txt`.
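`jq` prints the three fields on consecutive lines, so each file contributes a triple like this (the values are made up):

```
"S 123 (endeligt svar)"
"2023-01-15T10:00:00"
"https://example.org/S123.pdf"
```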
Use the Java program `GenerateJson` to read `titlerdatofiler.txt` and generate a huge JSON file called `bigjson.json`.
This Java program creates a big JSON file that, for each question, contains the following (a sketch of the structure follows this list):
- Link to the PDF it came from
- Date of the file
- Full text extracted from the file
- A cleaned-up version of that text which only contains the actual question
- A list of 0..n questions, each with the same four fields (PDF link, file date, full text, cleaned-up question text)
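A minimal sketch of one entry, with field names of my own choosing (`GenerateJson` defines the real ones):

```json
{
  "pdf": "https://example.org/S123.pdf",
  "date": "2023-01-15T10:00:00",
  "fulltext": "…raw pdftotext output…",
  "question": "…cleaned-up question text…",
  "questions": [
    {
      "pdf": "https://example.org/S123-b.pdf",
      "date": "2023-01-16T10:00:00",
      "fulltext": "…",
      "question": "…"
    }
  ]
}
```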
Use `jq . bigjson.json > prettyjson.json` to make that JSON more human-friendly.