TensorFlow-based training, inference, and feature-engineering pipelines used in the OSIC Kaggle competition

5m0k3/osic-pulmonary-fibrosis-tf

TensorFlow-based Quantile Regression solution - OSIC Pulmonary Fibrosis

A complete TensorFlow pipeline of training, inference, and feature-extraction notebooks used in the Kaggle competition OSIC Pulmonary Fibrosis Progression (July-Oct 2020)

Brief overview of the competition data

The data consisted of DICOM files (images + metadata) of chest CT scans of patients, along with tabular data such as smoking status, age, and Forced Vital Capacity (FVC) measurements.
A preview of slices from a patient's chest CT scan:
The lung-mask segmentation process deployed (3rd image - final mask):

Stacking the 2D segmented masks produces a 3D plot of the lung:

Apart from the DICOM data, the tabular data was as follows:

Notebooks description

A brief description of each notebook is provided here; for details, see the notebooks themselves.

Feature Engineering notebook

A major task was engineering and extracting features from the DICOM slices.
In total, I engineered 5 features:

  1. Chest Volume:
    - Calculated via numpy.trapz() integration over all 2D slices, using the pixel count together with the SliceThickness and PixelSpacing (voxel spacing) metadata from the DICOM files
    - Handled the inconsistencies in the data; final distplot:

  2. Chest Area:
    - Maximum chest area, calculated from the average of the 3 middle-most slices, in the same fashion as Chest Volume
    - distplot

  3. Lung - Tissue ratio:
    - Ratio of the pixel area of the segmented lung mask to the total tissue pixel area in the original DICOM file
    - The idea behind this feature was to detect lung shrinkage inside the chest
    - distplot

  4. Chest Height:
    - Chest height, calculated from SliceThickness and the number of slices forming the lung
    - distplot

  5. Height of the Patient:
    - Approximate height, calculated from a patient's FVC values and age according to formulae and observations drawn from external medical research data
    - distplot
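The mask-based features above (volume, area, and lung-tissue ratio) can be sketched in NumPy as follows. This is a minimal sketch, not the notebook's actual code: the function names are illustrative, and boolean mask arrays are assumed as inputs.

```python
import numpy as np

def chest_volume(masks, pixel_spacing, slice_thickness):
    """Trapezoidal integration of per-slice mask area along the slice axis.

    masks: (n_slices, H, W) boolean chest masks
    pixel_spacing: (row_mm, col_mm) from the DICOM PixelSpacing tag
    slice_thickness: spacing between slices in mm (SliceThickness tag)
    """
    pixel_area = pixel_spacing[0] * pixel_spacing[1]  # mm^2 per pixel
    areas = masks.reshape(len(masks), -1).sum(axis=1) * pixel_area
    # trapezoidal rule, equivalent to numpy.trapz(areas, dx=slice_thickness)
    return float(np.sum((areas[:-1] + areas[1:]) / 2) * slice_thickness)

def chest_area(masks, pixel_spacing):
    """Chest area averaged over the 3 middle-most slices."""
    mid = len(masks) // 2
    areas = masks[mid - 1:mid + 2].reshape(3, -1).sum(axis=1)
    return float(areas.mean() * pixel_spacing[0] * pixel_spacing[1])

def lung_tissue_ratio(lung_mask, tissue_mask):
    """Ratio of segmented-lung pixels to total tissue pixels."""
    return float(lung_mask.sum() / tissue_mask.sum())
```

Chest Height then follows directly as `slice_thickness * n_slices` over the slices that contain lung.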

Plots of Features vs FVC / Percent

[TRAIN] notebook

The EffNet training notebook is described below; the custom TensorFlow tabular-data-only model is listed in [INFERENCE] itself.

  1. Pre-Processing:
    - Handled issues arising from varying image sizes and missing slices
    - Stratified 5-fold split based on PatientID

  2. Augmentations:
    - Albumentations - RandomSizedCrop, Flips, Gaussian Blur, CoarseDropout, Rotate (0-90)

  3. Configurations:
    - Optimizer - NAdam
    - LR Scheduler - ReduceLROnPlateau (initial LR = 0.0005, patience = 5, factor = 0.5)
    - Model - EfficientNet B5
    - Input Size - 512 × 512
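The split and training configuration above can be sketched with scikit-learn and tf.keras. This is a sketch under stated assumptions: the fold split uses `GroupKFold` so that no PatientID spans train and validation, while the regression head, the MAE loss, and `weights=None` are assumptions rather than the notebook's actual choices.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
import tensorflow as tf

def make_folds(patient_ids, n_splits=5):
    # 5-fold split grouped by PatientID: a patient never appears in both
    # the train and validation sides of a fold
    folds = np.zeros(len(patient_ids), dtype=int)
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(gkf.split(patient_ids, groups=patient_ids)):
        folds[val_idx] = fold
    return folds

def build_model():
    # EfficientNet B5 backbone on 512x512 inputs (weights=None skips the download)
    backbone = tf.keras.applications.EfficientNetB5(
        include_top=False, weights=None, input_shape=(512, 512, 3), pooling="avg")
    out = tf.keras.layers.Dense(1)(backbone.output)  # regression head (assumed)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=5e-4),
                  loss="mae")  # loss choice is an assumption
    return model

# ReduceLROnPlateau with the listed settings: patience 5, factor 0.5
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(patience=5, factor=0.5)
```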

[INFERENCE] Submission notebook

Also contains the training and inference of the custom tabular-data model.

  1. Custom Net:
    - A small net over the given tabular data and engineered features, built from swish-activated dense layers
    - A pinball loss over multiple quantiles was used; the difference between the first and last predicted quantiles served as the uncertainty measure

  2. Ensemble:
    - The final submission was made using an ensemble of the EffNet image model and the custom tabular model
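The pinball loss, the quantile-spread uncertainty, and the ensemble blend can be sketched in NumPy as follows. The quantile values and the 50/50 blend weight are assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np

QUANTILES = np.array([0.2, 0.5, 0.8])  # assumed quantile levels

def pinball_loss(y_true, y_pred, quantiles=QUANTILES):
    """Mean pinball loss; y_true: (n,), y_pred: (n, n_quantiles)."""
    err = np.asarray(y_true)[:, None] - np.asarray(y_pred)
    q = quantiles[None, :]
    # penalize under-prediction by q and over-prediction by (1 - q)
    return float(np.mean(np.maximum(q * err, (q - 1) * err)))

def uncertainty(y_pred):
    """Spread between the last and first predicted quantiles."""
    return y_pred[:, -1] - y_pred[:, 0]

def ensemble(pred_effnet, pred_tabular, w=0.5):
    """Weighted blend of image-model and tabular-model predictions."""
    return w * pred_effnet + (1 - w) * pred_tabular
```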

How to use

Just change the directories according to your environment.

Google Colab versions are available for:
- [TRAIN] EffNet (Open In Colab)
- [TRAIN] Base Custom Net (Open In Colab)