docs preprocessing1
francesco-vaselli committed Aug 30, 2023
1 parent 69bc6d1 commit 23ba6e9
Showing 2 changed files with 25 additions and 1 deletion.
25 changes: 25 additions & 0 deletions docs/preprocessing.md
@@ -0,0 +1,25 @@
Our dataset consists of multiple .json files per patient, and each patient is identified by a unique ID.
Below we describe the codebase used to extract the raw data from these files, along with the optional upsampling operations.
We use about a hundred patients to build the training dataset and hold out 4 patients to experiment with transfer learning.
Because of the large amount of patient data, we end up with a rather large dataset (5 million time series) compared with the reference work.

## Overview of Codebase Architecture for Patient Data Aggregation and Preprocessing

This codebase comprises a sequence of modular components, each fulfilling a specific role in the pipeline of aggregating and preprocessing Continuous Glucose Monitoring System (CGMS) time-series data. The pipeline has been developed to enhance usability, modularity, and extensibility. Below is a detailed breakdown of each component.

### Configuration Management via YAML Files

The first component centralizes the configuration required for data preprocessing and model training. A single YAML file holds an organized hierarchy of parameters, such as data directory paths, data scaling options, and smoothing parameters. Keeping these settings in an external YAML file makes them easier to manage and lets the pipeline be reconfigured without code changes.
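
To make the idea concrete, here is a minimal sketch of how a parsed configuration could be validated before use. It assumes the YAML file has already been loaded into a plain dictionary (for example with PyYAML's `yaml.safe_load`); the key names, the `PreprocessingConfig` class, and the `config_from_mapping` helper are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessingConfig:
    """Typed view of the preprocessing options (field names are hypothetical)."""
    data_dir: str
    scale_factor: float
    smooth: bool
    smoothing_window: int


def config_from_mapping(raw: dict) -> PreprocessingConfig:
    """Validate a parsed YAML mapping and convert it to a PreprocessingConfig."""
    data = raw["data"]
    prep = raw.get("preprocessing", {})
    window = int(prep.get("smoothing_window", 1))
    if window < 1:
        raise ValueError("smoothing_window must be >= 1")
    return PreprocessingConfig(
        data_dir=str(data["dir"]),
        scale_factor=float(prep.get("scale_factor", 1.0)),
        smooth=bool(prep.get("smooth", False)),
        smoothing_window=window,
    )


# Stand-in for the dictionary a YAML parser would return (assumed layout).
example = {
    "data": {"dir": "data/patients"},
    "preprocessing": {"scale_factor": 0.01, "smooth": True, "smoothing_window": 5},
}
config = config_from_mapping(example)
```

Converting the mapping into a frozen dataclass catches missing or invalid keys once, at start-up, instead of deep inside the preprocessing code.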

### CGMSDataSeg Class: Data Preprocessing

The second core component is the *CGMSDataSeg* class, designed to handle the segmentation and preprocessing of raw CGMS data. The class provides data slicing, scaling, and smoothing. It also offers optional data augmentation techniques, including Gaussian noise and MixUp, to improve the robustness of the resulting dataset. The class thus serves as a toolkit for turning raw CGMS time-series data into a machine-learning-ready dataset.
<!-- The *_build_dataset* method in CGMSDataSeg constructs time-series windows from raw glucose readings for machine learning models. Specifically, it takes continuous glucose measurements and slices them into overlapping 'windows' of fixed lengths, defined by sampling_horizon and prediction_horizon. These windows serve as input features (x) and corresponding targets (y) for supervised learning. The method allows for different padding strategies to adjust the shape of the output, catering to the needs of various types of temporal models. -->
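
As an illustration of the slicing and augmentation steps, here is a minimal sketch in plain Python. It is not the actual CGMSDataSeg interface: the function names, the single-value target, and the `sigma` and `alpha` parameters are assumptions made for this example. `sampling_horizon` is taken to mean the number of readings in each input window and `prediction_horizon` the look-ahead distance of the target.

```python
import random


def build_windows(glucose, sampling_horizon, prediction_horizon):
    """Slice one series into overlapping (input, target) pairs.

    Each input holds `sampling_horizon` consecutive readings; its target is
    the reading `prediction_horizon` steps after the end of the input.
    """
    xs, ys = [], []
    last_start = len(glucose) - sampling_horizon - prediction_horizon
    for start in range(last_start + 1):
        end = start + sampling_horizon
        xs.append(glucose[start:end])
        ys.append(glucose[end + prediction_horizon - 1])
    return xs, ys


def add_gaussian_noise(window, sigma, rng):
    """Return a copy of the window with zero-mean Gaussian noise added."""
    return [value + rng.gauss(0.0, sigma) for value in window]


def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two samples with a weight drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs, ys = build_windows([1, 2, 3, 4, 5, 6], sampling_horizon=3, prediction_horizon=2)
noisy = add_gaussian_noise(xs[0], sigma=0.1, rng=rng)
mixed_x, mixed_y = mixup(xs[0], ys[0], xs[1], ys[1], alpha=0.4, rng=rng)
```

With the six-reading series above, `build_windows` yields the inputs `[1, 2, 3]` and `[2, 3, 4]` with targets `5` and `6`. A series shorter than `sampling_horizon + prediction_horizon` produces no windows.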

### DataReader Utility: Data Collection

Our third component is the *DataReader* utility class, which reads and parses the JSON files containing patient-specific time-series data. The utility converts the raw JSON data into Python lists, making it more manageable and ready for the subsequent preprocessing stages. Beyond reading the files, it interprets the data according to specific attributes and time intervals.
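
The following sketch shows one way such a reader could turn JSON records into Python lists. The schema is an assumption: a JSON array of objects with an ISO 8601 timestamp under `time` and a glucose value under `value`, sampled every 5 minutes. The `read_segments` function and its parameters are hypothetical and do not reproduce the DataReader API.

```python
import json
from datetime import datetime, timedelta


def read_segments(json_text, value_key="value", time_key="time",
                  interval=timedelta(minutes=5)):
    """Parse JSON records into lists of readings, split at time gaps.

    Records missing either key are skipped. A new list is started whenever
    two consecutive readings are not exactly `interval` apart.
    """
    readings = [
        (datetime.fromisoformat(record[time_key]), float(record[value_key]))
        for record in json.loads(json_text)
        if value_key in record and time_key in record
    ]
    readings.sort(key=lambda pair: pair[0])

    segments, previous = [], None
    for stamp, value in readings:
        if previous is None or stamp - previous != interval:
            segments.append([])  # gap detected: begin a new contiguous segment
        segments[-1].append(value)
        previous = stamp
    return segments


sample = json.dumps([
    {"time": "2023-01-01T00:00:00", "value": 100},
    {"time": "2023-01-01T00:05:00", "value": 110},
    {"time": "2023-01-01T00:30:00", "value": 120},
    {"time": "2023-01-01T00:10:00", "note": "record without a value"},
])
segments = read_segments(sample)
```

Splitting at gaps keeps each list contiguous in time, so a later windowing step never spans a period in which the sensor produced no readings.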

### Dataset Aggregator: Data Compilation and Transformation

The fourth component is a stand-alone script that acts as a dataset aggregator. Using the DataReader and CGMSDataSeg classes, it merges data from multiple patients based on their unique identifiers, including the case where a patient has several JSON files. After aggregation, the entire dataset is saved as a NumPy array, ready for later machine learning use.
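
A minimal sketch of the grouping step is given below. It assumes file names of the form `<patient_id>_<anything>.json`, which is a hypothetical convention; the helper names and the example IDs are likewise invented for illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_patient(paths):
    """Map each patient ID to the sorted list of its JSON files.

    Assumes the patient ID is the part of the file name before the first
    underscore (hypothetical naming scheme).
    """
    grouped = defaultdict(list)
    for path in map(Path, paths):
        if path.suffix != ".json":
            continue  # ignore anything that is not a patient data file
        patient_id = path.stem.split("_", 1)[0]
        grouped[patient_id].append(path)
    return {pid: sorted(files) for pid, files in sorted(grouped.items())}


def split_patients(grouped, held_out_ids):
    """Separate training patients from those reserved for transfer learning."""
    train = {pid: f for pid, f in grouped.items() if pid not in held_out_ids}
    held_out = {pid: f for pid, f in grouped.items() if pid in held_out_ids}
    return train, held_out


files = ["p01_a.json", "p02_a.json", "p01_b.json", "notes.txt"]
grouped = group_files_by_patient(files)
train, held_out = split_patients(grouped, held_out_ids={"p02"})
```

The windows built from each training patient's files could then be concatenated and written to disk, for instance with `numpy.save`.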
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -6,7 +6,6 @@ theme:
warning: material/alert

site_name: GlucoseGuard
# "GlucoseGuard: A Time-Series Approach to Predicting Blood Sugar Levels"
site_url: https://francesco-vaselli.github.io/GlucoseGuard/

nav: