diff --git a/docs/preprocessing.md b/docs/preprocessing.md
index e69de29..ffb6235 100644
--- a/docs/preprocessing.md
+++ b/docs/preprocessing.md
@@ -0,0 +1,25 @@
+Our dataset consists of multiple .json files per patient. Each patient is identified by a unique ID.
+Below we describe the codebase used to extract the raw data from these files, along with the optional upsampling operations.
+In the end, we use about a hundred patients to build the training dataset and hold out 4 patients to experiment with transfer learning.
+This large amount of patient data yields a rather large dataset (5 million time series) compared to the reference work.
+
+## Overview of Codebase Architecture for Patient Data Aggregation and Preprocessing
+
+The codebase comprises a sequence of modular components, each with a specific role in the pipeline that aggregates and preprocesses Continuous Glucose Monitoring System (CGMS) time-series data. The pipeline is designed for usability, modularity, and extensibility. Each component is described below.
+
+### Configuration Management via YAML Files
+
+The first component centralizes the configuration required for data preprocessing and model training. A YAML file holds an organized hierarchy of parameters, such as data directory paths, data scaling options, and smoothing parameters. Keeping the configuration in an external YAML file makes it easier to manage and keeps the system configuration flexible.
+
+### CGMSDataSeg Class: Data Preprocessing
+
+The second core component is the *CGMSDataSeg* class, which handles the segmentation and preprocessing of raw CGMS data. The class provides data slicing, scaling, and smoothing. It also offers optional data augmentation techniques, including Gaussian noise and MixUp, to improve the robustness of the resulting dataset. The class thus serves as a toolkit for turning raw CGMS time-series data into a machine-learning-ready dataset.
+
+
+### DataReader Utility: Data Collection
+
+The third component is the *DataReader* utility class, which reads and parses JSON files containing patient-specific time-series data. It converts the raw JSON data into Python lists, making the data easier to handle in subsequent preprocessing stages. Beyond reading the files, it interprets the data according to specific attributes and time intervals.
+
+### Dataset Aggregator: Data Compilation and Transformation
+
+The fourth component is a stand-alone script that functions as a dataset aggregator. Using the DataReader and CGMSDataSeg classes, this script merges data from multiple patients based on their unique identifiers. It also handles the case where a patient has multiple JSON files. After aggregation, the entire dataset is saved as a NumPy array, making it readily accessible for later machine learning applications.
diff --git a/mkdocs.yml b/mkdocs.yml
index 6fbdf3a..22f0375 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -6,7 +6,6 @@ theme:
       warning: material/alert
 
 site_name: GlucoseGuard
-# "GlucoseGuard: A Time-Series Approach to Predicting Blood Sugar Levels"
 site_url: https://francesco-vaselli.github.io/GlucoseGuard/
 
 nav:
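
The slicing and optional augmentation that docs/preprocessing.md attributes to *CGMSDataSeg* can be illustrated with a minimal, standard-library-only sketch. It does not reproduce the class itself: the function names (`slice_windows`, `add_gaussian_noise`, `mixup`), the window length, forecast horizon, noise scale, and `alpha` value are all assumptions made for illustration, and the sample trace is made-up data. MixUp is shown in its usual form, blending two samples with a weight drawn from Beta(alpha, alpha).

```python
import random


def slice_windows(series, input_len, horizon):
    """Cut one glucose trace into (input window, target value) pairs."""
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        window = series[start:start + input_len]
        # The target is the reading `horizon` steps after the window ends.
        target = series[start + input_len + horizon - 1]
        pairs.append((window, target))
    return pairs


def add_gaussian_noise(window, sigma, rng):
    """Return a copy of the window with zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in window]


def mixup(pair_a, pair_b, alpha, rng):
    """Blend two (window, target) pairs with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    (xa, ya), (xb, yb) = pair_a, pair_b
    x = [lam * a + (1.0 - lam) * b for a, b in zip(xa, xb)]
    y = lam * ya + (1.0 - lam) * yb
    return x, y


# Hypothetical values throughout: a made-up trace and arbitrary settings.
rng = random.Random(0)
trace = [100.0, 104.0, 110.0, 118.0, 121.0, 119.0, 115.0, 111.0]
pairs = slice_windows(trace, input_len=4, horizon=2)
noisy = add_gaussian_noise(pairs[0][0], sigma=2.0, rng=rng)
mixed_x, mixed_y = mixup(pairs[0], pairs[1], alpha=0.4, rng=rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing runs with and without augmentation.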