-
Notifications
You must be signed in to change notification settings - Fork 30
Dealing with duplication in council exports
Will Roper edited this page Feb 15, 2024
·
4 revisions
It's 2021 and a lot of the Xpress exports are giving us a record per election type...
This significantly slows down imports in large councils, so here's how I'm dealing with it.
I'm using miller but I'm sure these commands can be adapted to csvkit, xsv, rows, awk, q etc.
Check the csv and get it's length:
$ mlr --csv check Democracy_Club__06May2021.CSV && wc -l Democracy_Club__06May2021.CSV
691117 Democracy_Club__06May2021.CSV
Look at how it's broken down
mlr --itsv --opprint --barred count-distinct -f ElectionType Democracy_Club__06May2021.CSV
+-------------------------------+--------+
| ElectionType | count |
+-------------------------------+--------+
| Police and Crime Commissioner | 247457 |
| County | 247456 |
| Parish | 196203 |
+-------------------------------+--------+
Pick one of the ElectionTypes
to filter on:
mlr --csv filter '$ElectionType=="County"' Democracy_Club__06May2021.CSV > durham_deduped.csv
Finally check if there are any uprns in the original file that aren't in the new one:
mlr --icsv --oxtab join -f 'durham_deduped.csv' -r Property_URN -l Property_URN -j Property_URN --np --ur Democracy_Club__06May2021.CSV
As long as it's just one or two I've been excepting it.
Bonus points... check there aren't loads of uprn rounding errors:
mlr --tsv count-distinct -f Property_URN then sort -nr count then head Democracy_Club__06May2021.tsv