📍 Lesson 3 - Identifying Duplicate Records and Data Cleaning
Record Counts
SELECT COUNT(*)
FROM health.user_logs;
Unique Records
SELECT COUNT(DISTINCT id)
FROM health.user_logs;
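Note that COUNT(DISTINCT id) counts distinct id values, not distinct rows. A minimal sketch of counting fully unique rows with a Common Table Expression is shown here; the CTE name deduped_logs and the alias unique_row_count are illustrative, not from the lesson.

```sql
-- Count rows that are unique across every column of the table.
-- deduped_logs is a hypothetical name used only for this sketch.
WITH deduped_logs AS (
  SELECT DISTINCT *
  FROM health.user_logs
)
SELECT COUNT(*) AS unique_row_count
FROM deduped_logs;
```

If this count is lower than the plain COUNT(*) above, the table contains exact duplicate rows.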
Single Column Frequency Counts By Measure
SELECT
  measure,
  COUNT(*) AS frequency,
  ROUND(
    100 * COUNT(*) / SUM(COUNT(*)) OVER (),
    2
  ) AS percentage
FROM health.user_logs
GROUP BY measure
ORDER BY frequency DESC;
Single Column Frequency Counts By Id
SELECT
  id,
  COUNT(*) AS frequency,
  ROUND(
    100 * COUNT(*) / SUM(COUNT(*)) OVER (),
    2
  ) AS percentage
FROM health.user_logs
GROUP BY id
ORDER BY frequency DESC
LIMIT 10;
Individual Column Distributions
Measure Value
SELECT
  measure_value,
  COUNT(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Systolic
SELECT
  systolic,
  COUNT(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Diastolic
SELECT
  diastolic,
  COUNT(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Exercises
Which id value has the highest number of duplicate records in the health.user_logs table?
SELECT
  id,
  COUNT(id) AS id_count
FROM health.user_logs
GROUP BY id
ORDER BY id_count DESC
LIMIT 1;
Which log_date value had the most duplicate records after removing the id with the most duplicates from question 1?
SELECT
  log_date,
  COUNT(log_date)
FROM health.user_logs
WHERE
  -- Exclude the id with the most records (the answer to question 1)
  id != (
    SELECT
      id
    FROM health.user_logs
    GROUP BY
      id
    ORDER BY COUNT(id) DESC
    LIMIT 1
  )
GROUP BY
  log_date
ORDER BY COUNT(log_date) DESC
LIMIT 1;
Which measure_value had the most occurrences in the health.user_logs table when measure = 'weight'?
SELECT
  measure_value,
  COUNT(measure_value) AS measurevalue_count
FROM health.user_logs
WHERE
  measure = 'weight'
GROUP BY
  measure_value
ORDER BY
  measurevalue_count DESC
LIMIT 1;
How many single duplicated rows exist when measure = 'blood_pressure' in the health.user_logs table? How about the total number of duplicate records in the same table?
WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)
SELECT
  COUNT(*) AS single_duplicate_rows,
  SUM(frequency) AS total_duplicate_records
FROM groupby_counts
WHERE frequency > 1;
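To inspect the duplicated rows themselves rather than just count them, the frequency filter can be moved into a HAVING clause on the grouped query. This is a sketch of that alternative, not the lesson's official answer; the LIMIT of 10 is an arbitrary choice for readability.

```sql
-- List the blood_pressure rows that appear more than once,
-- most frequently duplicated first.
SELECT
  id,
  log_date,
  measure,
  measure_value,
  systolic,
  diastolic,
  COUNT(*) AS frequency
FROM health.user_logs
WHERE measure = 'blood_pressure'
GROUP BY
  id,
  log_date,
  measure,
  measure_value,
  systolic,
  diastolic
HAVING COUNT(*) > 1  -- HAVING filters after aggregation, WHERE filters before it
ORDER BY frequency DESC
LIMIT 10;
```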
What percentage of records have measure_value = 0 when measure = 'blood_pressure' in the health.user_logs table? How many records meet this same condition?
WITH all_measure_values AS (
  SELECT
    measure_value,
    COUNT(*) AS total_records,
    SUM(COUNT(*)) OVER () AS overall_total
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY 1
)
SELECT
  measure_value,
  total_records,
  overall_total,
  ROUND(100 * total_records::NUMERIC / overall_total, 2) AS percentage
FROM all_measure_values
WHERE measure_value = 0;
What percentage of records are duplicates in the health.user_logs table?
WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)
SELECT
  ROUND(
    100 * SUM(CASE
      WHEN frequency > 1 THEN frequency - 1
      ELSE 0 END
    )::NUMERIC / SUM(frequency),
    2
  ) AS duplicate_percentage
FROM groupby_counts;
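The frequency - 1 term treats the first copy of each row as the original and only the extra copies as duplicates. A small self-contained check of that arithmetic is sketched here on made-up data: toy_logs and its five rows are hypothetical and are not taken from health.user_logs.

```sql
-- Five toy rows: (1, 'weight') appears three times, so two rows are duplicates.
WITH toy_logs (id, measure) AS (
  VALUES
    (1, 'weight'),
    (1, 'weight'),
    (1, 'weight'),
    (2, 'weight'),
    (3, 'blood_pressure')
),
groupby_counts AS (
  SELECT
    id,
    measure,
    COUNT(*) AS frequency
  FROM toy_logs
  GROUP BY
    id,
    measure
)
SELECT
  ROUND(
    100 * SUM(CASE
      WHEN frequency > 1 THEN frequency - 1
      ELSE 0 END
    )::NUMERIC / SUM(frequency),
    2
  ) AS duplicate_percentage
FROM groupby_counts;
-- duplicate_percentage: 40.00 (2 extra copies out of 5 rows)
```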
Additional Notes
Remove all duplicate records from a dataset using DISTINCT
Use Common Table Expressions and subqueries to calculate unique record counts
Clean up existing temporary tables using DROP TABLE IF EXISTS
Create temporary tables using the results from a SELECT statement
Detect the presence of duplicates by comparing basic record counts with unique counts
Identify exact duplicate records using a GROUP BY on all columns in a table
Calculate the number of times a record occurs in a table
Filter records from a SELECT statement with a GROUP BY using the HAVING clause
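The temporary table notes above have no accompanying query in this lesson, so here is one possible sketch in PostgreSQL. The table name deduplicated_user_logs is an assumption chosen for illustration.

```sql
-- Remove any leftover copy from an earlier run so the script can be re-run safely.
DROP TABLE IF EXISTS deduplicated_user_logs;

-- Create a session-scoped temporary table from a SELECT that keeps
-- one copy of each distinct row.
CREATE TEMP TABLE deduplicated_user_logs AS
SELECT DISTINCT *
FROM health.user_logs;

-- Compare this count with SELECT COUNT(*) FROM health.user_logs
-- to see how many duplicate rows were removed.
SELECT COUNT(*) AS deduplicated_count
FROM deduplicated_user_logs;
```

A temporary table is dropped automatically at the end of the session, so the explicit DROP only matters when the script is re-run within the same session.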