Merge pull request the-turing-way#3818 from the-turing-way/missingdat…
…a-updates

Updating Data Missingness Subchapter
Zeena-Shawa committed Sep 25, 2024
2 parents bd19269 + 7034a28 commit bf2d55a
Showing 4 changed files with 38 additions and 4 deletions.
@@ -4,9 +4,12 @@

An alternative way of characterising missing data, known as structured missingness (SM), has been pioneered by researchers of the [Turing-Roche Partnership](https://www.turing.ac.uk/research/research-projects/alan-turing-institute-roche-strategic-partnership). SM arises in data that is MCAR, MAR or MNAR and whose missingness has some structure or pattern {cite:ps}`Mitra2023structuredmissingness`. Specifically, standard definitions of missingness mechanisms (such as those introduced in {ref}`pd-missing-data-structures`) assume that the missingness of one variable is independent of the missingness of another, when conditioning on the relevant data. In contrast, in SM the missingness of a variable can depend on the data *and* on the missingness of other variables {cite:ps}`Jackson2023structuredmissingness`.
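
One informal way to look for this kind of dependence is to compare the missingness indicators of different variables. The sketch below is illustrative only and is not taken from the cited papers: the toy values, the column names, and the helper functions `missing_indicator` and `indicator_correlation` are all hypothetical, and `None` is assumed to mark a missing value.

```python
from itertools import combinations
from math import sqrt


def missing_indicator(values):
    """Return 1 where a value is missing (None) and 0 where it is observed."""
    return [1 if v is None else 0 for v in values]


def indicator_correlation(a, b):
    """Pearson correlation between two 0/1 lists; None if either list is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A fully observed (or fully missing) column has no variation to correlate.
        return None
    return cov / sqrt(var_a * var_b)


# Hypothetical toy data: both blood pressure readings are missing together.
toy = {
    "diastolic_bp": [80, None, 75, None, 82, None],
    "systolic_bp": [120, None, 118, None, 125, None],
    "blood_test": [1.2, 0.9, None, 1.1, 1.0, 1.3],
    "age": [60, 82, 65, 85, 70, 88],
}

indicators = {name: missing_indicator(col) for name, col in toy.items()}

# Correlate the missingness indicators of every pair of variables.
correlations = {
    (x, y): indicator_correlation(indicators[x], indicators[y])
    for x, y in combinations(indicators, 2)
}

for pair, value in correlations.items():
    print(pair, value)
```

A correlation far from zero between two indicators suggests that the variables tend to be missing together (or in opposition), which hints at structure in the missingness. This is a descriptive check under the stated assumptions, not a formal test for SM.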


This is common in research contexts where data is combined from multiple studies or sources. For instance, many large-scale healthcare studies are multimodal and attempt to include a diverse set of patients, thereby capturing data for a heterogeneous group of individuals. As a result, data is often collected at multiple time points and multiple sites, where different measurements may be taken, such as clinical, genomic or imaging measures. Our example dataset (introduced in {ref}`pd-missing-data-structures`) also exhibits SM.

> **Example**: The missing values in the blood test results, blood pressure readings, and cognitive scores are all examples of SM. The blood test results (MCAR) are missing due to batch failure. The cognitive scores (MNAR) are missing in participants with significant cognitive decline. The blood pressure readings (MAR) are missing in participants who could not attend the clinic due to being older and having more motor dysfunction. Therefore, the missingness in all these variables is *not* equally likely for all individuals, even after conditioning on the relevant data. The missingness carries information that can be leveraged in further analyses, which is why it is also considered SM.
>
> Moreover, the missingness present here is directly related to digital health equity and fairness, as certain participants were unable to attend due to differences in accessibility. Therefore, this example also demonstrates potential ethical consequences and importance of analysing data missingness by interrogating the missingness structure or patterns.
>
> | Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
> |--------------------|-----|--------------------------|-------------------------|---------------------------------------------------|-------------|-------------------------------------------------|
@@ -24,6 +27,17 @@ This is common in research contexts where data is combined from multiple studies

Many datasets that fuse data from multiple sites and modalities do take care to follow a certain design and data collection process. However, machine learning methods perform best with large datasets. It is common practice for a machine learning model to include data from many studies, often with different designs and variables. Missing values may therefore carry information in and of themselves; they may be related to sampling methodologies or reflect population characteristics. Traditional imputation methods, such as those introduced in {ref}`pd-missing-data-methods`, are frequently not appropriate for handling SM and do not take advantage of the information inherent in SM {cite:ps}`Mitra2023structuredmissingness`. SM also has consequences for downstream analyses; if the SM mechanisms are biased, the fairness of the model would be in question. Further research is required to identify appropriate methods for universally handling SM and to define SM within the MCAR, MAR, and MNAR framework {cite:ps}`Jackson2023structuredmissingness`.

```{figure} ../../figures/missing-data-structured-missingness.png
---
height: 500px
name: missing-data-structured-missingness
alt:
---
Overview of the structured missingness (SM) life cycle [Text adapted from: {cite:ps}`Mitra2023structuredmissingness`]. **A.** In a given dataset, data may come from different sources and modalities. Some examples are electronic medical records (EMR), wearable devices, or social media. Data may not be collected at the same time or for the same individuals in each case, which may result in both random and SM when joining these data sources together. Researchers are working on developing tools to minimize the effects of SM on any downstream analysis. **B.** Unique models can be built using different combinations or portions of datasets. However, SM may affect how models learn from data and cause bias. Researchers similarly need to develop tools that handle and adapt SM appropriately. **C.** These models can then be used to perform analysis, inferences, and predictions. Effective data imputation can have a large impact on results. Therefore, it is important to evaluate and benchmark the consequences of data imputation. **D.** Often the end goal of a scientific exploration is to understand causality between different variables. Inferences obtained can provide the foundation for determining causality and counterfactuals, but can be compromised by the presence of SM. The SM life cycle then repeats as these insights can be used to further understand the missingness of the data.
Therefore, developing tools that address SM at every step of the cycle is important so that insights and analyses are unbiased. Researchers are working to solve these different SM challenges.
```


(pd-missing-data-structured-missingness-summary)=
## Summary
26 changes: 23 additions & 3 deletions contributors.md
@@ -1588,7 +1588,7 @@ what it would cover.
* TPS Staff (2021-Present)
* Core Member, Reviewers & Editors Working Group (2022)
* Book Dash Participant (2021-2023)
* GitHub id: Shelton
* GitHub id: [vhellon](https://github.com/vhellon)
* Twitter: @vickyhellon

* Short bio:
@@ -1639,5 +1639,25 @@ what it would cover.
## Y--->


<!---Z
## Z--->

## Z

### Zeena Shawa

* Role:
* PhD student (2021-Present)
* Book Dash Participant (June 2024)
* Turing-Roche Community Scholar (October 2023-2024)
* GitHub id: [Zeena-Shawa](https://github.com/Zeena-Shawa)

* Short bio:
  > Zeena Shawa is a PhD student in the i4Health Medical Imaging CDT programme at University College London. She is part of the Progression of Neurodegenerative Diseases (POND) Group in the Centre for Medical Image Computing (CMIC), supervised by Dr. Neil Oxtoby and Dr. Rimona Weil. Her PhD project aims to understand Parkinson’s disease progression using machine learning approaches developed within POND, with a focus on medical imaging data. The insights obtained can aid in understanding disease mechanisms and identifying biomarkers associated with disease progression, potentially providing targets for therapeutic development.
>
> Zeena was an Enrichment Student at the Alan Turing Institute, where she first engaged with the Turing-Roche Partnership and started looking at imputing missing data. She was also part of the 2023/2024 Turing-Roche Partnership Community Scholar Scheme.



* Personal highlights:
  > The main aim of my Turing-Roche Community Scholar project was to create the {ref}`pd-missing-data` Chapter in the Turing Way handbook, consolidating what I learned during the [Turing Enrichment Scheme](https://www.turing.ac.uk/work-turing/studentships/enrichment) (and more) with current research of the [Turing-Roche Partnership](https://www.turing.ac.uk/research/research-projects/alan-turing-institute-roche-strategic-partnership). Imputation and data missingness are an important aspect of a lot of research, because missing data is a common problem, especially in large cross-cohort multimodal datasets.
> A main highlight of this project was completing the chapter and then doing a live merge in a Collaboration Cafe, which also happened to be a Book Dash Q&A session. It was also great to have the Chapter reviewed by [Vicky Hellon](https://github.com/vhellon).
Binary file added image.png
