This is the official repository for the paper "Co-Training for Commit Classification", published at the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). This repository contains the dataset used in the paper. poster.pdf is our poster.
The 900Repo dataset is contained in negative+CC-900repos.csv and positive+CC-900repos.csv. These can be used directly to replicate experiments.
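As a starting point for working with the two files, the following is a minimal sketch of loading them into one labeled list using only the Python standard library. The column names in the CSV files are not documented here, so the sketch keeps whatever columns it finds; the `label` key, the helper names `load_labeled` and `load_900repo`, and the raised field size limit are assumptions made for illustration, not part of the paper's code.

```python
import csv

# Assumption: cells holding code diffs may exceed the csv module's
# default field size limit, so raise it before reading.
csv.field_size_limit(2**31 - 1)


def load_labeled(path, label):
    """Read a CSV file and attach a numeric label to every row.

    The "label" key is a hypothetical name; it overrides any column
    of the same name already present in the file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row, label=label) for row in csv.DictReader(f)]


def load_900repo(neg_path="negative+CC-900repos.csv",
                 pos_path="positive+CC-900repos.csv"):
    """Return positive rows (label 1) followed by negative rows (label 0)."""
    return load_labeled(pos_path, 1) + load_labeled(neg_path, 0)
```

Calling `load_900repo()` from the repository root should return a list of dictionaries, one per commit sample, which can then be shuffled and split as an experiment requires.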
To replicate the construction of the 900Repo dataset, follow these steps.
- Get the dataset by [RA21]:
  $ git clone https://github.com/TQRG/security-patches-dataset.git
  Then, inside the cloned repository, set the git HEAD to the commit we used:
  $ git reset --hard ebcbfc8cdc1e1f3d1dfb97b6c5e75804b20c079f
- At this commit, the security-patches-dataset repository contains about 8,000 positive and 110,000 negative samples (see security-patches-dataset/dataset/negative.csv and positive.csv). However, code diffs are not provided, so we have to download them ourselves.
- Run data-get.ipynb. This notebook (a) downloads code diffs for the commit samples via the GitHub API, and (b) randomly selects a handful of negative commit samples out of the 110,000 provided by [RA21]. The .csv files saved will be similar to negative+CC-900repos.csv and positive+CC-900repos.csv.
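For readers who want a rough idea of what the two notebook steps involve, here is a minimal sketch, not the notebook's actual code. It assumes the diff is requested from the GitHub REST API commit endpoint with the `application/vnd.github.diff` media type, and that negatives are drawn with a seeded random sample. The function names, the `token` parameter, and the `seed` default are hypothetical.

```python
import random
import urllib.request

GITHUB_API = "https://api.github.com"


def commit_diff_request(owner, repo, sha, token=None):
    """Build a request for one commit's diff (assumed endpoint and media type)."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/commits/{sha}"
    headers = {"Accept": "application/vnd.github.diff"}
    if token:
        # A personal access token raises the API rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_diff(owner, repo, sha, token=None):
    """Download the diff text for one commit (requires network access)."""
    request = commit_diff_request(owner, repo, sha, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def sample_negatives(rows, k, seed=0):
    """Pick up to k negative samples at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))
```

Because the sampling is random, a rerun with a different seed (or the notebook's own sampling) will select different negative commits, which is why the regenerated files are similar to, rather than identical with, the ones shipped here.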
If our work has been useful, we would be grateful if you would consider citing:
@inproceedings{lee2021co,
title={Co-training for Commit Classification},
author={Lee, Jian Yi David and Chieu, Hai Leong},
booktitle={Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)},
pages={389--395},
year={2021}
}
[RA21] Sofia Oliveira Reis and Rui Abreu. 2021. A ground-truth dataset of real security patches.