Dealing with duplication in council exports

It's 2021 and a lot of the Xpress exports are giving us a record per election type...

This significantly slows down imports in large councils, so here's how I'm dealing with it.

I'm using miller but I'm sure these commands can be adapted to csvkit, xsv, rows, awk, q etc.

Check the csv and get it's length:

$ mlr --csv check Democracy_Club__06May2021.CSV && wc -l Democracy_Club__06May2021.CSV

691117 Democracy_Club__06May2021.CSV

Look at how it's broken down

mlr --itsv --opprint --barred count-distinct -f ElectionType Democracy_Club__06May2021.CSV

+-------------------------------+--------+
| ElectionType                  | count  |
+-------------------------------+--------+
| Police and Crime Commissioner | 247457 |
| County                        | 247456 |
| Parish                        | 196203 |
+-------------------------------+--------+

Pick one of the ElectionTypes to filter on:

mlr --csv filter '$ElectionType=="County"' Democracy_Club__06May2021.CSV > durham_deduped.csv

Finally check if there are any uprns in the original file that aren't in the new one:

mlr --icsv --oxtab join -f 'durham_deduped.csv' -r Property_URN -l Property_URN -j Property_URN --np --ur Democracy_Club__06May2021.CSV

As long as it's just one or two I've been excepting it.

Bonus points... check there aren't loads of uprn rounding errors:

mlr --tsv count-distinct -f Property_URN then sort -nr count then head Democracy_Club__06May2021.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dealing with duplication in council exports

Clone this wiki locally