pylidc is a Python library intended to streamline workflows associated with the LIDC dataset. It includes utilities for querying by attributes (e.g., collecting all annotations where malignancy is labeled greater than 3 and spiculation is labeled equal to 1) and for common routines that act on the associated data (e.g., estimating the diameter or volume of a nodule from one of its annotations).
Routines for visualizing the annotations, both atop the CT data and as a surface in 3D, are also provided. This functionality is implemented via an object relational mapping (ORM), using sqlalchemy, to an sqlite database containing the annotation information from the XML files provided by the LIDC dataset.
pylidc has been used on Linux, Mac, and Windows, and on Python 2 and 3.
The package can be installed via pip:
pip install pylidc
While pylidc has many functions for analyzing and querying annotation data alone (which do not require access to DICOM image data), it also has many functions that do require access to the DICOM files associated with the LIDC dataset. pylidc looks for a configuration file that tells it where the DICOM data is located on your system. You can use pylidc without creating this configuration file, but any functions that depend on CT image data will not be usable.
pylidc looks in your home folder for a configuration file called .pylidcrc on Mac and Linux, or pylidc.conf on Windows. You must create this file yourself. On Linux, the file should be located at /home/[user]/.pylidcrc (on Mac, the home folder is typically /Users/[user]). On Windows, the file should be located at C:\Users\[User]\pylidc.conf.
The configuration file should be formatted as follows:
[dicom]
path = /path/to/big_external_drive/datasets/LIDC-IDRI
warn = True
If you want to use pylidc without the DICOM data (say, only to query annotation attributes), you can remove path and set warn to False, i.e.,
[dicom]
warn = False
and pylidc won't bother you about the missing DICOM path each time you import it.
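As a minimal sketch, the configuration file can also be generated from Python with the standard-library configparser module. This assumes the file is ordinary INI syntax, as the examples above suggest. The helper name write_pylidc_config is hypothetical and not part of pylidc, and the demonstration writes to a temporary directory rather than to your home folder.

```python
import configparser
import tempfile
from pathlib import Path


def write_pylidc_config(config_file, dicom_path=None, warn=True):
    """Write a [dicom] section in the format shown above.

    Hypothetical helper for illustration; it is not part of pylidc.
    """
    parser = configparser.ConfigParser()
    parser["dicom"] = {}
    # Omit `path` entirely when the DICOM data is not available.
    if dicom_path is not None:
        parser["dicom"]["path"] = str(dicom_path)
    parser["dicom"]["warn"] = str(warn)
    with open(config_file, "w") as handle:
        parser.write(handle)
    return Path(config_file)


# Demonstrate in a temporary directory so nothing in the home folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".pylidcrc"
    write_pylidc_config(
        target, dicom_path="/path/to/big_external_drive/datasets/LIDC-IDRI"
    )
    config_text = target.read_text()
    print(config_text)
```

To create the real file, you would pass Path.home() / ".pylidcrc" on Mac and Linux, or Path.home() / "pylidc.conf" on Windows, as the target.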
The expected folder hierarchy in the specified path is: PatientID > StudyInstanceUID > SeriesInstanceUID > *.dcm. If you downloaded the data from the TCIA site, the folder hierarchy will probably already be formatted in this way.
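If you are unsure whether your download matches this layout, a quick check along the following lines may help. This is only a sketch using the standard library: the function find_misplaced_dicom_files is a hypothetical helper, not part of pylidc, and the folder names in the demonstration are placeholders rather than real UIDs.

```python
import tempfile
from pathlib import Path


def find_misplaced_dicom_files(root):
    """Return .dcm files that are not exactly three folders below `root`.

    Hypothetical helper for illustration; it is not part of pylidc.
    """
    root = Path(root)
    misplaced = []
    for dcm in root.rglob("*.dcm"):
        # Expected relative path:
        # PatientID/StudyInstanceUID/SeriesInstanceUID/file.dcm (4 parts)
        if len(dcm.relative_to(root).parts) != 4:
            misplaced.append(dcm)
    return sorted(misplaced)


# Build a tiny fake tree to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = root / "LIDC-IDRI-0001" / "study-uid" / "series-uid"
    good.mkdir(parents=True)
    (good / "000001.dcm").touch()
    # This patient is missing the StudyInstanceUID level.
    bad = root / "LIDC-IDRI-0002" / "series-uid"
    bad.mkdir(parents=True)
    (bad / "000001.dcm").touch()
    misplaced_names = [
        p.relative_to(root).as_posix() for p in find_misplaced_dicom_files(root)
    ]
    print(misplaced_names)
```

An empty result for your real dataset root suggests every DICOM file sits at the expected depth; it does not verify that the folder names are the correct identifiers.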
If you find pylidc helpful to your research, you could cite it as:
Matthew C. Hancock. Pylidc - An object relational mapping for the LIDC dataset using sqlalchemy. https://github.com/pylidc/pylidc/ (2016)
If you want to cite something more formal, pylidc was developed for, and first mentioned in, the following publication:
Matthew C. Hancock, Jerry F. Magnan. Lung nodule malignancy classification using only radiologist quantified image features as inputs to statistical learning algorithms: probing the Lung Image Database Consortium dataset with two statistical learning methods. SPIE Journal of Medical Imaging. Dec. 2016. http://dx.doi.org/10.1117/1.JMI.3.4.044504
There are three data models: Scan, Annotation, and Contour. The relationships are "one to many" for each model going left to right, i.e., each Scan has many Annotation objects, and each Annotation has many Contour objects. The main models to query are the Scan and Annotation models.
The main workhorse for querying is the pylidc.query function, which simply wraps the sqlalchemy.query function.
Here's some example usage for querying scan objects.
import pylidc as pl
qu = pl.query(pl.Scan).filter(pl.Scan.slice_thickness <= 1)
print(qu.count())
# => 97
scan = qu.first()
print(scan.patient_id, scan.pixel_spacing, scan.slice_thickness)
# => LIDC-IDRI-0066, 0.63671875, 0.6
print(len(scan.annotations))
# => 11
print(scan.get_path_to_dicom_files())
# '/path/to/big_external_drive/datasets/LIDC-IDRI/LIDC-IDRI-0066/1.3.6.1.4.1.14519.5.2.1.6279.6001.143774983852765282237869625332/1.3.6.1.4.1.14519.5.2.1.6279.6001.430109407146633213496148200410'
You can engage an interactive slice view by calling:
scan.visualize()
Note that calling visualize on a Scan object doesn't display its annotation information. To see the annotations, call the visualize_in_scan member function of an Annotation object instead.
Let's grab the first annotation from the Scan object above:
ann = scan.annotations[0]
print(ann.scan.patient_id)
# => LIDC-IDRI-0066
print(ann.spiculation, ann.Spiculation())
# => 3, Medium Spiculation
print(ann.estimate_diameter(), ann.estimate_volume())
# => 15.4920358194, 888.052284241
ann.print_formatted_feature_table()
# => Feature              Meaning                    #
# => -                    -                          -
# => Subtlety           | Obvious                  | 5
# => Internalstructure  | Soft Tissue              | 1
# => Calcification      | Absent                   | 6
# => Sphericity         | Ovoid                    | 3
# => Margin             | Poorly Defined           | 1
# => Lobulation         | Near Marked Lobulation   | 4
# => Spiculation        | Medium Spiculation       | 3
# => Texture            | Solid                    | 5
# => Malignancy         | Moderately Suspicious    | 4
from pylidc.Annotation import feature_names as fnames
fvals, fstrings = ann.feature_vals(return_str=True)
print(fnames[0].title(), fstrings[0], fvals[0])
# => Subtlety, Obvious, 5
Let's try a different query on the annotations directly:
qu = pl.query(pl.Annotation).filter(pl.Annotation.lobulation > 3, pl.Annotation.malignancy == 5)
print(qu.count())
# => 183
ann = qu.first()
print(ann.lobulation, ann.Lobulation(), ann.malignancy, ann.Malignancy())
# => 4, Near Marked Lobulation, 5, Highly Suspicious
print(len(ann.contours))
# => 8
print(ann.contours_to_matrix().shape)
# => (671, 3)
print(ann.contours_to_matrix().mean(axis=0) - ann.centroid())
# => [ 0. 0. 0.]
You can engage an interactive slice viewer that displays annotation values and the radiologist-drawn contours:
ann.visualize_in_scan()
You can also view the nodule contours in 3D by calling:
ann.visualize_in_3d()
One common objective for data exploration is to grab a random instance from some query. You can accomplish this by importing func from sqlalchemy and ordering the query by func.random().
from sqlalchemy import func
scan = pl.query(pl.Scan).filter(pl.Scan.contrast_used == True).order_by(func.random()).first()
ann = pl.query(pl.Annotation).filter(pl.Annotation.malignancy == 5).order_by(func.random()).first()
The first query grabs a random Scan instance where contrast is used. The second query grabs a random Annotation instance where malignancy is equal to 5.
Another common objective is to query for an Annotation object that is constrained by its corresponding Scan in some way. For example:
anns = pl.query(pl.Annotation).join(pl.Scan).filter(pl.Scan.slice_thickness < 1, pl.Annotation.malignancy != 3)
The Annotation member function uniform_cubic_resample takes a cubic region of interest with the annotation's centroid at the center of the volume. The corresponding CT volume is resampled to have a voxel spacing of 1 millimeter and a side length given by the function's side_length parameter. Along with the uniformly resampled, cubic CT image volume, a corresponding boolean-valued volume is returned that is 1 where the nodule exists in the resampled CT volume and 0 otherwise.
import pylidc as pl
import matplotlib.pyplot as plt
from skimage.measure import find_contours
ann = pl.query(pl.Annotation).first()
vol, seg = ann.uniform_cubic_resample(side_length = 100)
print(vol.shape, seg.shape)
# => (101, 101, 101) (101, 101, 101)
# View middle slice of interpolated volume (pixel spacing now = 1mm)
plt.imshow(vol[:,:,50], cmap=plt.cm.gray)
# View middle slice of interpolated segmentation volume as contours
# atop the interpolated image.
contours = find_contours(seg[:,:,50], 0.5)
for contour in contours:
    plt.plot(contour[:,1], contour[:,0], '-r')
plt.show()
The LIDC dataset doesn't assign unique global identifiers to the physical nodules. For a given physical nodule, there may exist up to 4 annotations that refer to it. The annotations are anonymous, so even if it is known that 4 annotations refer to the same nodule, it is impossible to tell which annotator provided each annotation across multiple nodules consistently.
However, we can estimate when annotations refer to the same physical nodule in a scan by examining the properties of the annotations and clustering them based on those properties. pylidc provides a number of distance metrics between annotations based on the annotation contour coordinates. The Scan model provides a cluster_annotations function, which clusters annotations by determining the connected components of the adjacency graph associated with a chosen distance-between-annotations metric and a chosen distance tolerance.
Here's an example:
import pylidc as pl
scan = pl.query(pl.Scan).first()
nods = scan.cluster_annotations()
print("Scan is estimated to have", len(nods), "nodules.")
for i,nod in enumerate(nods):
    print("Nodule", i+1, "has", len(nod), "annotations.")
    for j,ann in enumerate(nod):
        print("-- Annotation", j+1, "centroid:", ann.centroid())
Output:
Scan is estimated to have 4 nodules.
Nodule 1 has 4 annotations.
-- Annotation 1 centroid: [ 331.90680101 312.30982368 1480.44962217]
-- Annotation 2 centroid: [ 328.60546875 309.91796875 1479.73046875]
-- Annotation 3 centroid: [ 327.91666667 309.88293651 1479.01785714]
-- Annotation 4 centroid: [ 332.55660377 313.88050314 1479.94339623]
Nodule 2 has 4 annotations.
-- Annotation 1 centroid: [ 360.81122449 169.19642857 1542.10459184]
-- Annotation 2 centroid: [ 360.82233503 169.21319797 1542.14720812]
-- Annotation 3 centroid: [ 361.05243446 168.86142322 1542.34269663]
-- Annotation 4 centroid: [ 361.25501433 171. 1542.80659026]
Nodule 3 has 1 annotations.
-- Annotation 1 centroid: [ 336.41666667 348.83333333 1545.75 ]
Nodule 4 has 4 annotations.
-- Annotation 1 centroid: [ 340.54020979 245.07692308 1606.14160839]
-- Annotation 2 centroid: [ 341.29061103 244.65275708 1605.90834575]
-- Annotation 3 centroid: [ 341.75417299 244.03490137 1606.95827011]
-- Annotation 4 centroid: [ 341.53110048 245.58532695 1606.5 ]
You can supply annotation clusters (the variable nods above) to the scan.visualize function, and arrows will mark where the nodules are present in the scan.
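To make the clustering idea more concrete, here is a minimal, self-contained sketch of grouping points by the connected components of a distance-threshold graph. It is not pylidc's implementation: it assumes plain Euclidean distance between centroids as a stand-in for pylidc's contour-based metrics, the function name cluster_by_distance is hypothetical, and the tolerance of 10.0 is an arbitrary value chosen for illustration.

```python
import math


def cluster_by_distance(points, tol):
    """Group point indices into connected components of the graph whose
    edges join points that are at most `tol` apart.

    Hypothetical helper for illustration; it is not part of pylidc.
    """
    n = len(points)
    # Adjacency list: i and j are linked when their distance is within tolerance.
    adjacent = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= tol]
        for i in range(n)
    ]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacent[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Centroids rounded from the example output above (from nodules 1 and 3).
centroids = [
    (331.9, 312.3, 1480.4),
    (328.6, 309.9, 1479.7),
    (336.4, 348.8, 1545.8),
    (327.9, 309.9, 1479.0),
]
clusters = cluster_by_distance(centroids, tol=10.0)
print(clusters)
# => [[0, 1, 3], [2]]
```

Because clusters are connected components, two annotations can land in the same group without being within the tolerance of each other directly, as long as a chain of nearby annotations links them.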