Course material for a 3-day training course in data cleaning during the International Week at Prague University of Economics and Business, Faculty of Informatics and Statistics.
Delivered by Mark van der Loo
To get started: download the zip file with the materials and unzip it.
- Data Quality
- Processing data in production: the statistical value chain
- Techniques for cleaning text data with R
- Data validation with R package validate
- Rule-based data cleaning with R package dcmodify
- Error localization with R package errorlocate
- Imputation with R package simputation
- Process monitoring with R package lumberjack
Assignment: build a data cleaning system.
- Teams will spend one part of the day building a small production system that cleans a data set (to be provided) and estimates statistics.
- Groups can download their individual data and assignments here:
- Quality Assurance Framework of the European Statistical System pdf.
- T de Waal, J Pannekoek, S Scholtus (2011). Handbook of Statistical Data Editing and Imputation. John Wiley & Sons. link
- MPJ van der Loo, E de Jonge (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons. link
- MPJ van der Loo, KO ten Bosch (2023). The Data Validation Cookbook. Free Online Book.
- MPJ van der Loo, E de Jonge (2021). Data Validation Infrastructure for R. Journal of Statistical Software 97 1--22. pdf
- MPJ van der Loo, E de Jonge (2020). Data Validation. In Wiley StatsRef: Statistics Reference Online, pages 1-7. American Cancer Society. pdf.
- MPJ van der Loo (2021). Monitoring data in R with the lumberjack package. Journal of Statistical Software 98 1--11. pdf
- MPJ van der Loo (2014). The stringdist Package for Approximate String Matching. The R Journal 6 111--122. pdf
This work is licensed under a Creative Commons Attribution 4.0 International License.