Co-Training for Commit Classification

Overview

This is the official repository for the paper Co-Training for Commit Classification published at the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). This repository contains the dataset used in the paper. poster.pdf is our poster.

The 900Repo dataset is contained in negative+CC-900repos.csv and positive+CC-900repos.csv. These can be used directly to replicate experiments.

Constructing 900Repo Dataset

To replicate the construction of the 900Repo dataset, follow these steps.

Get dataset by [RA21]. $ git clone https://github.com/TQRG/security-patches-dataset.git Set git head to commit we used: $ git reset --hard ebcbfc8cdc1e1f3d1dfb97b6c5e75804b20c079f.
At this commit, the security-patches-dataset repository contains about 8,000 positive and 110,000 negative samples (see security-patches-dataset/dataset/negative.csv and positive.csv). However, code diffs are not provided -- we'll have to download these ourselves.
Run data-get.ipynb. This notebook (a) downloads code diffs for the commit samples via the Github API, and (b) randomly selects a handful of negative commit samples out of the 110,000 provided by [RA21]. The .csv files saved will be similar to negative+CC-900repos.csv and positive+CC-900repos.csv.

Cite

If our work has been useful, we would be grateful if you would consider citing

@inproceedings{lee2021co,
  title={Co-training for Commit Classification},
  author={Lee, Jian Yi David and Chieu, Hai Leong},
  booktitle={Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)},
  pages={389--395},
  year={2021}
}

References

[RA21] Sofia Oliveira Reis and Rui Abreu. 2021. A ground-truth dataset of real security patches.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitattributes		.gitattributes
data-get.ipynb		data-get.ipynb
negative+CC-900repos.csv		negative+CC-900repos.csv
positive+CC-900repos.csv		positive+CC-900repos.csv
poster.pdf		poster.pdf
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Co-Training for Commit Classification

Overview

Constructing 900Repo Dataset

Cite

References

About

Releases

Packages

Languages

davidleejy/wnut21-cotrain

Folders and files

Latest commit

History

Repository files navigation

Co-Training for Commit Classification

Overview

Constructing 900Repo Dataset

Cite

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages