[WIP] a proposal to document all datasets and models #163
# Datasets, Kernels, Models, and Problems
As we start publishing more datasets and models, it is important to keep in mind why we're doing this.
> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.

> **Reviewer:** It's not only contributing back, as it happens with code. We also want to increase the popularity of MLonCode and attract more people. A dataset is always the starting point of any DS research. No data => no research.
We consider datasets and models to be good when they are:

- discoverable,
- reproducible, and
- reusable.

> **Reviewer:** What is meant by "reusable"? It is not a typical concept. I would say that what makes a good dataset is:
Keeping all of this in mind, let me propose a way to write documentation for these.
## A Common Vocabulary
It seems well established that the relationship between datasets, models, and the other concepts involved can be expressed with the following graph.
![dataset graph](graph.png)

> **Reviewer:** "Kernel" is not a typical word. It may be used in NN contexts, but in general it is confusing; I would replace it with "training algorithm". Usually it trains the model, not generates it. The same thing with "inferencer": I would replace it with "application". Awesome chart, I like it a lot!
<!--
To rebuild the graph above, run:

    $ dot -Tpng -o graph.png

and give the following as input:

    digraph G {
        dataset -> kernel [ label = "feeds" ];
        {kernel dataset} -> model [ label = "generates" ];
        model -> problem [ label = "solves" ];
        inferencer -> model [ label = "uses" ];
        inferencer -> problem [ label = "solves" ];
    }
-->
The following sections describe each concept in more detail,
but let me first give a quick introduction to all of them.
### Problems
Everything we do at source{d} revolves around solving problems and
making predictions. Problems are the starting motivation
and end point of most of our Machine Learning processes.
Problems have a clear objective and a measure of success; together, these
let us rank different solutions to any problem in an objective
way. Think about accuracy, recall, etc.

> **Reviewer:** This Oxford comma really confuses me. I was about to propose changing "let" to "lets" but then realized it was about both things.
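To make such a measure concrete, here is a minimal sketch of accuracy; the function is ours, purely for illustration:

```python
def accuracy(predictions, truths):
    """Fraction of predictions that exactly match the ground truth."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# For the next-key problem: accuracy([111, 120], [111, 101]) == 0.5,
# i.e. one of the two predicted ASCII codes was right.
```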
An example problem could be predicting the next key
a developer will press, given what they've written so far.
### Models
Problems are solved using Models. Models are trained
to solve a specific problem by feeding a Dataset to a
Kernel that optimizes a set of parameters.
These parameters, once optimized, are what models are made of.
Models can be considered black boxes, where the only things
we care about are the input and output formats. This makes it
possible to reuse a model, either to solve the same problem
or to somehow feed it into a different model (by knowledge
transfer or other techniques).
Given the previous problem of predicting the next key pressed,
a model could take as input the sequence of all keys pressed
so far, as ASCII codes, and output a single ASCII code with
the prediction.
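To illustrate that contract, here is a minimal sketch of the black-box interface; `NextKeyModel` is hypothetical, not an existing source{d} API:

```python
from typing import Sequence

class NextKeyModel:
    """Hypothetical black box: only the input and output formats matter."""

    def predict(self, keys_so_far: Sequence[int]) -> int:
        """Takes the ASCII codes typed so far; returns the predicted next code."""
        raise NotImplementedError

# Any trained implementation would be used the same way:
#   model.predict([104, 101, 108, 108])  # "hell" -> ideally 111, i.e. "o"
```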
A secondary goal of models is to be reproducible, meaning that
someone could repeat the same process we went through and
expect to obtain a similar result. If the kernel that generated
the model requires metaparameters (such as the learning rate),
these values should also be documented.
This is normally documented in research papers, with references
to what datasets and kernels were used, as well as how much
training time it took to obtain the resulting model.

> **Reviewer:** So this is the thing: there is a huge difference between a research paper and what we want to achieve. Papers are always limited in size and the authors desperately try to squeeze in as much information as possible. This often leads to excluding important descriptions which are not strictly necessary but simplify the reproduction. Think of it as a physical experiment. A paper includes: initial conditions; methodology; observations throughout the experiment's lifetime; the empirical results; explanations and conclusions. Unfortunately, not ML papers. So a dream model documentation should have:
### Kernels
Kernels are algorithms that feed from datasets and
generate models. These algorithms are responsible for describing
the model architecture chosen to solve a problem, e.g. RNN,
CNN, etc., and what metaparameters were used.
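The overall shape of a kernel can be sketched as follows; this toy example is ours, with plain gradient descent standing in for whatever optimizer a real kernel would use:

```python
def kernel(dataset, learning_rate=0.1, epochs=100):
    """Toy kernel: fits a single parameter w to (x, y) pairs by gradient descent.

    A real kernel encodes a full architecture (RNN, CNN, ...), but the shape is
    the same: dataset + metaparameters in, optimized parameters (the model) out.
    """
    w = 0.0
    for _ in range(epochs):
        for x, y in dataset:
            grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
            w -= learning_rate * grad
    return w  # the trained "model" is just this optimized parameter

# kernel([(1.0, 2.0), (2.0, 4.0)]) converges to w ≈ 2.0
```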
### Datasets
Datasets contain information retrieved from one or more
data sources, then pre-processed so that it can easily be used
to answer questions, solve problems, train models, or even
serve as the data source for another dataset.
The most important aspects of a dataset are its format, how to
download it, how to reproduce it, and what exactly each version
contains.
Datasets evolve over time, and it's important to have versions
that can be explicitly referred to from trained models.

> **Reviewer:** Some of them do, others don't.

> **Reviewer:** That is okay. v1 can stay v1 ;) or be deprecated at some point.
### Inferencers
The last piece of the puzzle is what I call an inferencer.
An inferencer uses a model (sometimes more than one, sometimes no model
at all) to predict the answer to a question given some input.

> **Reviewer:** Well, there is always a model. It can be hardcoded using the domain knowledge.
For instance, given a model trained with a large dataset of
the keystrokes of thousands of developers, we could write an
inferencer that uses that trained model to make predictions.
That would be a pretty decent inferencer.
But we could also use a simple function that outputs random
ASCII codes, ignoring any other information available. This
inferencer would probably have a lower accuracy for the given
problem.

> **Reviewer:** The accuracy is going to be exactly 1/128, no doubts on this.
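That baseline takes one line to implement; a minimal sketch (the function name is ours):

```python
import random

def random_inferencer(keys_so_far):
    """Model-free baseline: ignores its input and guesses a random ASCII code.

    Its expected accuracy is 1/128 no matter what was typed, as the reviewer
    notes, so any trained model should comfortably beat this floor.
    """
    return random.randrange(128)
```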
## Documenting these Artifacts
So far we've documented models and some datasets to a certain
extent, but I think it's time to provide a framework for all
of these elements to be uniformly documented, improving the
discoverability, reproducibility, and reusability of our
results.

> **Reviewer:** amen!

> **Reviewer:** amen!
We will evolve our documentation over time into something that
will hopefully delight every one of our engineers and users.
But for now, let's keep it realistic and propose a reduced set
of measures we can start applying today to evolve towards that
perfect solution.
## Current status
Currently we document only datasets and models, in two different
repositories: github.com/src-d/datasets and
github.com/src-d/models.
We also have a modelforge tool that is intended to provide a way
to discover and download existing models.
### Datasets
We currently have only one public dataset: Public Git Archive.
For this dataset we document:
- how to download the current version of the dataset with the `pga` CLI tool
- how to reproduce the dataset with borges and GHTorrent

> **Reviewer:** aaaand
What are we missing?
- versioning of the resulting dataset: how to download this and previous versions?
- format of the dataset
- what other datasets (and versions) were used to generate this?
- what models have been trained with this dataset?
- LICENSE (the tools and scripts are licensed, but not the datasets?)

> **Reviewer:** Some of the issues are to be resolved by attaching the paper.

> **Reviewer:** Yes, but we can't assume having a paper for everything. It won't be feasible from a time perspective.

> **Reviewer:** Absolutely. I meant solely PGA here.
### Models
Models are already documented following some structure, thanks to the
efforts put in place for [modelforge](https://github.com/src-d/modelforge).
Currently models have an ID, which is a long random-looking string such as
`f64bacd4-67fb-4c64-8382-399a8e7db52a`.
Models are accompanied by an example of how to use them; unfortunately, the
examples are a bit simpler than expected. They mostly look like this:
```python
from ast2vec import DocumentFrequencies

# Load the default document-frequencies model and report its size.
df = DocumentFrequencies().load()
print("Number of tokens:", len(df))
```
What are we missing?

- Versioned models, corresponding to versioned datasets.
- Reference to the code (kernel) that was used to generate the model.
- Technical sheet with accuracy, recall, etc. for the given model and dataset.
- Format of input and output of the model.
- At least one example using the model to make a prediction.

> **Reviewer** (on versioned models): There is versioning, though not reflected in src-d/models. Models can derive from each other, either with the relation to the parent or not. E.g. it is a common situation when our data engineering and filtering are not perfect and we miss data or pass in garbage; in that case, the relation to the previous model is saved. Sometimes it is just a regular update without the relation to the previous one.

> **Reviewer:** I think the parent relation makes @campoy's analogy to containers even stronger here. There is a lot we can/should learn from how the Docker registry tackled this.

> **Reviewer** (on referencing the kernel): Another important notice: models contain dependencies on upstream models which were used in the generation process. Datasets are also models in this terminology and should have a UUID (yeah, this is confusing, I know). The only way I see to reference the code is to record the whole Python package dependency tree. This still misses the actual custom calling code in many cases, and I need to apply some dark Python alchemy to discover and record it. We also need to store it somewhere in the model file.

> **Reviewer:** Since each model references the code it was created from, the dependency tree is there already.
## My Proposal
Since we care about individual versioning of datasets and models,
using one git repository per dataset and per model seems like an
obvious choice.

> **Reviewer:** We will die. Seriously. I tried it and it is completely out of maintenance. Special pain belongs to adding new models and being blocked for a few days until the repository is created. I am strongly against this idea. There is also the central registry of our models in src-d/models, which is a strong necessity as the only way to fetch the index and automatically download models in downstream apps. We already have 5 models to date, and the only reason why it is so few is that we did not have data. Now that we have PGA, we will bake new models like pies, with tens of different architectures and problems. Models are not code repositories; there is no point in contributing to existing ones, it is always about adding something new. Besides, we need to solve the problem with the community, because we want to allow external people to push models into our registry. Think of it as DockerHub for models.

> **Reviewer:** What I believe @campoy is proposing is DockerHub for models, datasets, training algorithms, applications, etc.

> **Reviewer:** A model is an artifact of a training algorithm. The algorithm is code and we can improve it, fix it, etc. So the algorithms should be on GitHub/Git, separate from the model storage.
Problems, inferencers, and kernels can, for now, be documented directly with
a model. If we see that we start to have too much repetition because we have
many models for a single problem, we will reassess this decision.
As with any other source{d} repository, we need to follow the guidelines in
[Documentation at source{d}](https://github.com/src-d/guide/blob/master/engineering/documentation.md).
This includes having a CONTRIBUTING.md, a Code of Conduct, etc.
This is an initial list of the information required per repository.
### Dataset Repository
A dataset repository should contain the following information:
- short description
- long description and links to papers and blog posts
- technical sheet
- size of dataset
- schema(s) of the dataset
- download link
- using the dataset:
  - downloading the dataset
  - related tools
- reproducing the dataset:
  - link to the original data sources
  - related tools

> **Reviewer** (on the technical sheet): It might seem obvious, but we should also include:
### Model Repository
A model repository should contain the following information:
- short description
- long description and links to papers and blog posts
- technical sheet
- size of model
- input/output schemas
- download link
- datasets used to train the model (including versions)
- using the model:
  - downloading the model
  - loading the model
  - prerequisites (tensorflow? keras?)
  - quick guide: making a prediction
- reproducing the model:
  - link to the original dataset
  - kernel used to train the model
  - training process
  - hardware and time spent
  - metaparameters if any
  - any other relevant details
### Versioning and Releases
Every time a new version of a dataset or model is released, a new tag and
associated release should be created in the repository.

> **Reviewer:** Do we want to standardize dataset versions to something? Maybe just a date, or semver?

> **Reviewer:** semver sounds pretty good, but it's a solution, so I don't wanna decide on it just yet.
The release should include links to anything that has changed since the
previous release, such as a new version of the datasets or changes in
the kernel.
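Assuming we settled on semver (one of the options raised above, not yet decided), cutting such a release would be the standard git flow; the version number here is purely illustrative:

```
$ git tag -a v1.2.0 -m "dataset release v1.2.0"
$ git push origin v1.2.0
```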
### github.com/src-d/datasets and github.com/src-d/models
These two repositories should simply contain what is common to all datasets
or to all models. They will also provide all the tooling built on top of
the documentation for datasets and models.
Since we imagine these tools extracting information from the repositories
automatically, it is important to keep formatting in mind.
I'm currently considering whether a `toml` file should be defined containing
the data common to all the datasets and models.
For instance, we could have the download size for each dataset and model,
as well as the associated schemas. A simple tool could then generate
documentation based on these values.

> **Reviewer:** This will not work. All the metadata should be generated automatically from the self-contained ASDF files. Otherwise this will be a nightmare to support.
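For illustration only, such a file could look like the sketch below; every field name and value is invented, since the schema is precisely what remains to be defined:

```toml
[dataset]
name = "public-git-archive"   # hypothetical entry
version = "v1"
download-size-gb = 1000       # placeholder value
license = "TBD"

[[dataset.schemas]]           # one entry per file format in the dataset
name = "repository-index"
format = "csv"
```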