📍 Lesson 3 - Identifying Duplicate Records and Data cleaning

Record Counts

SELECT COUNT(*)
FROM health.user_logs;

Unique Records

SELECT COUNT(DISTINCT ID) 
FROM health.user_logs;

Single Column Frequency Counts By Measure

SELECT
  measure,
  COUNT(*) AS frequency,
  ROUND(
    100 * COUNT(*) / SUM(COUNT(*)) OVER (),
    2
  ) AS percentage
FROM health.user_logs
GROUP BY measure
ORDER BY frequency DESC;

Single Column Frequency Counts by Id

SELECT
  id,
  COUNT(*) AS frequency,
  ROUND(
    100 * COUNT(*) / SUM(COUNT(*)) OVER (),
    2
  ) AS percentage
FROM health.user_logs
GROUP BY id
ORDER BY frequency DESC
LIMIT 10;

Individual Column Distributions
- Measure

SELECT
  measure_value,
  COUNT(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

Systolic

SELECT
  systolic,
  COUNT(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

Diastolic

SELECT
  Diastolic
  COUNt(*) AS frequency
FROM health.user_logs
GROUP BY 1
ORDER BY 2 DESC 
LIMIT 10;

Exercises

Which id value has the most number of duplicate records in the health.user_logs table?

      Select id, 
      count(id) as id_count 
      from health.user_logs 
      group by id 
      order by id_count DESC 
      LIMIT 
      1;

Which log_date value had the most duplicate records after removing the max duplicate id value from question 1?

select
  log_date,
  count(log_date)
from
  health.user_logs
where
  id != (
    Select
      id
    from
      health.user_logs
    group by
      id
    order by
      count(id) DESC
    LIMIT
      1
  )
group by
  log_date
order by
  count(log_date) desc
limit
  1;

Which measure_value had the most occurences in the health.user_logs value when measure = 'weight'?

select
  measure_value,
  count(measure_value) as measurevalue_count
from
  health.user_logs
where
  measure = 'weight'
group by
  measure_value
order by
  measurevalue_count desc
limit
  1;

How many single duplicated rows exist when measure = 'blood_pressure' in the health.user_logs? How about the total number of duplicate records in the same table?

WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)
SELECT
  COUNT(*) as single_duplicate_rows,
  SUM(frequency) as total_duplicate_records
FROM groupby_counts
WHERE frequency > 1;

What percentage of records measure_value = 0 when measure = 'blood_pressure' in the health.user_logs table? How many records are there also for this same condition?

WITH all_measure_values AS (
  SELECT
    measure_value,
    COUNT(*) AS total_records,
    SUM(COUNT(*)) OVER () AS overall_total
  FROM health.user_logs
  WHERE measure = 'blood_pressure'
  GROUP BY 1
)
SELECT
  measure_value,
  total_records,
  overall_total,
  ROUND(100 * total_records::NUMERIC / overall_total, 2) AS percentage
FROM all_measure_values
WHERE measure_value = 0;

What percentage of records are duplicates in the health.user_logs table?

WITH groupby_counts AS (
  SELECT
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic,
    COUNT(*) AS frequency
  FROM health.user_logs
  GROUP BY
    id,
    log_date,
    measure,
    measure_value,
    systolic,
    diastolic
)
SELECT
  ROUND(
    100 * SUM(CASE
        WHEN frequency > 1 THEN frequency - 1
        ELSE 0 END
    )::NUMERIC / SUM(frequency),
    2
  ) AS duplicate_percentage
FROM groupby_counts;

Additional Notes

Remove all duplicate records from a dataset using DISTINCT
Use Common Table Expressions and subqueries to calculate unique record counts
Clean up existing temporary tables using the DROP TABLE IF EXISTS
Create temporary tables using the results from a SELECT statement
Detect the presence of duplicates by comparing basic record counts with unique counts
Identify exact duplicate records using a GROUP BY on all columns in a table
Calculate the number of occurences a record appears in a table
Filter records from a SELECT statement with a GROUP BY using the HAVING clause

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

duplicate data and data cleaning.md

duplicate data and data cleaning.md

📍 Lesson 3 - Identifying Duplicate Records and Data cleaning

Exercises

Additional Notes

Files

duplicate data and data cleaning.md

Latest commit

History

duplicate data and data cleaning.md

File metadata and controls

📍 Lesson 3 - Identifying Duplicate Records and Data cleaning

Exercises

Additional Notes