pip install MLProvCodeGen
Our goal in this research was to find out whether provenance data can be used to support the end-to-end reproducibility of machine learning experiments.
In short, provenance data is data that contains information about a specific data point: how, when, and by whom it was created, and by which processes (functions, methods) it was generated.
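For instance, a provenance record for one trained model might look roughly like the following sketch. The field names here are illustrative, not MLProvCodeGen's actual schema:

```python
# Illustrative provenance record for a single ML training run.
# Field names are hypothetical, not MLProvCodeGen's actual schema.
provenance_record = {
    "entity": "trained_model.pt",                # the data point being described
    "generated_by": "train_model()",             # which process produced it
    "generated_at": "2023-01-15T10:30:00Z",      # when it was produced
    "attributed_to": "jane.doe",                 # who ran the experiment
    "used": ["mnist_train.csv", "config.json"],  # inputs the process consumed
}
```

A record like this answers exactly the "how, when, by whom, and by which process" questions above, which is what makes a one-to-one reproduction of the run possible.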
The functionalities of MLProvCodeGen can be split into two parts:
MLProvCodeGen's original purpose was to automatically generate code for training machine learning (ML) models, offering users multiple options for ML tasks, datasets, model parameters, training parameters, and evaluation metrics. We then extended MLProvCodeGen to generate code according to real-world provenance data models, to automatically capture provenance data from the generated experiments, and to take provenance data files captured with MLProvCodeGen as input to generate one-to-one reproductions of the original experiments. MLProvCodeGen can also generate relational graphs of the captured provenance data, providing a visual representation of the implemented experiments.
The specific use-cases for this project are twofold:
- Image Classification
  - We can generate code to train an ML model on image input files to classify handwritten digits (MNIST), clothing articles (FashionMNIST), and a mix of vehicles and animals (CIFAR10).
- Multiclass Classification
  - We can generate code to train an ML model on tabular data (.csv) to classify different species of iris flowers, and to test different models using 'toy datasets': synthetic datasets specifically designed to mimic patterns that could occur in real-world data, such as spirals.
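To give a feel for the second use case, the generated multiclass-classification code is conceptually similar to the following sketch using scikit-learn and the iris dataset. This is a simplified stand-in, not the notebook MLProvCodeGen actually emits:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the tabular iris dataset (150 samples, 3 species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train a simple classifier and evaluate it on the held-out split.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

In MLProvCodeGen, choices such as the dataset, model type, and split ratio are exactly the kind of parameters recorded as provenance data, so the run can be regenerated later.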
Please open MLProvCodeGen by using the Binder Button at the top of this page. This opens a virtual installation.
The JupyterLab interface should look like this:
Please proceed by pressing the 'MLProvCodeGen' button located in the 'other' section to open the extension.
Here is an example interface:
And generated notebooks look like this:
If you are seeing the frontend extension, but it is not working, check that the server extension is enabled:
jupyter server extension list
If the server extension is installed and enabled, but you are not seeing the frontend extension, check the frontend extension is installed:
jupyter labextension list
Note: You will need NodeJS to build the extension package.
The `jlpm` command is JupyterLab's pinned version of `yarn` that is installed with JupyterLab. You may use `yarn` or `npm` in lieu of `jlpm` below.
# Clone the repo to your local environment
# Change directory to the MLProvCodeGen directory
# Install package in development mode
pip install -e .
# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite
# Rebuild extension Typescript source after making changes
jlpm run build
You can watch the source directory and run JupyterLab at the same time in different terminals to watch for changes in the extension's source and automatically rebuild the extension.
# Watch the source directory in one terminal, automatically rebuilding when needed
jlpm run watch
# Run JupyterLab in another terminal
jupyter lab
With the watch command running, every saved change will immediately be built locally and available in your running JupyterLab. Refresh JupyterLab to load the change in your browser (you may need to wait several seconds for the extension to be rebuilt).
By default, the `jlpm run build` command generates the source maps for this extension to make it easier to debug using the browser dev tools. To also generate source maps for the JupyterLab core extensions, you can run the following command:
jupyter lab build --minimize=False
The following steps must be taken to add a new ML experiment to this extension:
- Have an existing Python script for your machine learning experiment.
- Paste the code into a Jupyter notebook and split it into cells following the execution order of your experiment.
- Create a Jinja template for each cell and wrap if-statements around the Python code depending on which variables are important. Refer to existing modules for what the provenance data of your experiment might look like.
- Load the templates in a Python procedure that also creates a new notebook element and write their rendered outputs to the notebook.
- Expect every local variable for the procedure to be extracted from a dictionary input.
- Add HTML input elements to the user interface based on your provenance variables.
- Combine the variable values into a JavaScript/TypeScript dictionary.
- Create a new server request for your module and pass the dictionary through it as “stringified” JSON data.
- Once the frontend, backend, and server connection work, your module has been added successfully.
Note that while these steps might seem complicated, most of them only require copy-pasting already existing code. The only new part for most users is templating with Jinja. However, Jinja has good documentation, and its syntax is simple, mostly requiring only `if` statements and `for` loops.
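A template cell with such a conditional might look like this (the model and optimizer variables are illustrative, not taken from the extension's actual templates):

```python
from jinja2 import Template

# A tiny template: the optimizer line is only emitted when the
# provenance data specifies an optimizer.
cell_template = Template(
    "model = build_model('{{ model_name }}')\n"
    "{% if optimizer %}optimizer = torch.optim.{{ optimizer }}"
    "(model.parameters()){% endif %}"
)

rendered = cell_template.render(model_name="resnet18", optimizer="Adam")
```

When `optimizer` is missing or `None`, the `{% if %}` block is simply dropped from the generated cell, which is how the templates adapt the notebook to whichever provenance variables are present.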
pip uninstall MLProvCodeGen