Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the unstructured nature of email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practices of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch with its revised versions and by grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practices of large OSS projects without additional effort in data collection and curation.
This project provides a suite of tools for mining and further processing Patchwork data. It consists of three parts:
- Scrapy framework for crawling data
- Django REST framework and Python application for accessing and processing the data
- MongoDB database for storing the data
Sample Jupyter notebooks for processing the raw crawled data (including identity grouping and patch grouping) and for importing the processed data into the database are provided in the app folder.
Two approaches, Exact Bags-of-Words (BoW) Grouping and One-word Difference Grouping, are implemented for patch grouping. Below are the constraints for each approach (a conceptual sketch follows the constraint lists).
Exact BoW Grouping
- The bags-of-words of the summary phrases of the patches are identical
- The patches do not belong to the same series
One-word Difference Grouping
- The bag-of-words of one group differs from that of the other group by exactly one word
- The differing word must not be "revert"
- The version references of the two groups must not intersect
- The two groups share at least one common patch submitter
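The sketch below illustrates these constraints conceptually; it is not the tool's actual implementation, and the patch field names (summary, series_id) and the normalisation are assumptions for illustration.

```python
# Conceptual sketch of the two grouping heuristics (assumed field names,
# not the tool's actual implementation).
from collections import Counter

def bag_of_words(summary):
    # hypothetical normalisation: lowercase and split on whitespace
    return Counter(summary.lower().split())

def exact_bow_match(patch_a, patch_b):
    # identical bags-of-words AND the patches do not belong to the same series
    return (bag_of_words(patch_a["summary"]) == bag_of_words(patch_b["summary"])
            and patch_a["series_id"] != patch_b["series_id"])

def one_word_difference(bow_a, bow_b):
    # one interpretation: exactly one word is present in one bag but not the
    # other, and that word is not "revert" (the version-reference and
    # common-submitter constraints are omitted here)
    differing = set(bow_a) ^ set(bow_b)
    return len(differing) == 1 and "revert" not in differing
```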
Accuracy of grouping
For our manual evaluation, a patch group is considered correct if, by investigating the content of each patch (e.g., commit message, related comments, code changes), all patches in the group are found to be related to the same review process. Similarly, we consider an individual identification correct if all the identities in the group are certainly from the same individual, examined by checking whether 1) the identities have submitted patches in the same group, 2) the identities have commented on the same patches, and 3) the identities share other characteristics such as organisation email addresses. Finally, we compute the grouping accuracy as:

accuracy = |correct groups| / |evaluated groups|

where correct groups refer to 1) groups of patches that belong to the same code review process or 2) groups of identities that belong to the same individual that are correctly identified; and evaluated groups refer to the sampled groups that are manually evaluated.
Below are the correctness and Cohen's kappa results of each approach applied to the five selected projects. (N/A indicates that both raters agreed to exclude zero sampled groups)
Exact BoW grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 97.80%(±5%) | 100% | N/A | N/A |
QEMU | 99.48%(±5%) | 100% | N/A | N/A |
U-Boot | 98.42%(±5%) | 98.95% | N/A | N/A |
Linux Arm Kernel | 99.48%(±5%) | 100% | N/A | N/A |
Netdev + BPF | 98.41%(±5%) | 96.81% | N/A | N/A |
One-word difference grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 82.27%(±5%) | 91.43% | 0.6809 | substantial agreement |
QEMU | 88.10%(±5%) | 94.87% | 0.7709 | substantial agreement |
U-Boot | 84.39%(±5%) | 91.04% | 0.7141 | substantial agreement |
Linux Arm Kernel | 86.38%(±5%) | 93.83% | 0.7265 | substantial agreement |
Netdev + BPF | 82.95%(±5%) | 92.59% | 0.6301 | substantial agreement |
Individual grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 81.94%(±5%) | 88.89% | 0.6786 | substantial agreement |
QEMU | 82.35%(±5%) | 91.67% | 0.7115 | substantial agreement |
U-Boot | 84.55%(±5%) | 87.27% | 0.5276 | moderate agreement |
Linux Arm Kernel | 82.47%(±5%) | 94.52% | 0.8016 | substantial agreement |
Netdev + BPF | 87.37%(±5%) | 92.00% | 0.4565 | moderate agreement |
We provide both a database dump and the data in plain JSON format, each containing the data of all projects from the three OSS communities (Linux Kernel, OzLabs, and FFmpeg) until 30/06/2023. There are ten collections: Project, Identity, Series, Patch, Comment, and MailingList store the original crawled data (some fields are updated during further processing), while Individual, Change1, Change2, and NewSeries record the results of processing.
The compressed complete datasets and the plain JSON data can be downloaded here. Decompress the downloaded file in the root folder of the project before proceeding with the following steps.
To use the provided dataset, simply run the Docker containers without migrating the database, using the following commands in the terminal.
cd app/docker
# build docker images
docker-compose -f docker-compose-non-migrate.yml build
# run docker containers
docker-compose -f docker-compose-non-migrate.yml up -d
To run only certain services in Docker, specify the service names, e.g. docker-compose -f docker-compose-non-migrate.yml up <service_name> <service_name> -d
Then, restore the database by running the following command.
docker exec -i mongodb_docker_container sh -c 'exec mongorestore --archive --nsInclude=code_review_db.*' < /path/to/code_review_db.archive
After the restoration process is done, the MongoDB database will be available for local access at mongodb://localhost:27017.
Sample analyses of code review metrics in a Jupyter notebook, together with their outputs, can be found in the analysis folder.
Data stored in the MongoDB database can be retrieved through the Django REST API by using the retrieve_data method in the Python application, either as the whole set of data in a collection or as a specific subset by using filters.
The item type of the data to be retrieved has to be specified. Available item types include project, identity, series, patch, comment, newseries, change1, change2, mailinglist, and individual.
from application_layer import AccessData
access_data = AccessData()
item_type = 'project'
retrieved_data = access_data.retrieve_data(item_type)
Similarly, the item type also needs to be specified. To filter data, specify the filter in the arguments.
from application_layer import AccessData
access_data = AccessData()
item_type = 'series'
# filter series data which belong to the FFmpeg project and whose cover letter message content contains the word "improve"
filter = 'project_original_id=ffmpeg-project-1&cover_letter_content_contain=improve'
retrieved_data = access_data.retrieve_data(item_type, filter)
All available filters can be found in the Application filter section.
To crawl new data from the source, use the Scrapy framework in the suite. Retrieved data will first be stored in JSON Lines files, whose content can then be imported into the database with the help of the application layer.
There are three spiders for crawling patchwork data. Their spider names are patchwork_project, patchwork_series, and patchwork_patch.
- patchwork_project crawls patchwork projects and corresponding maintainer accounts data.
- patchwork_series crawls patchwork series and corresponding series submitter accounts data.
- patchwork_patch crawls patchwork patches, comments and corresponding submitter accounts data.
The retrieved data will be stored under /docker/scrapy_docker_app/retrieved_data.
There are three ways to run spiders.
- Schedule spiders on scrapyd
- Run spiders from a script
- Run spiders using commands
Run the docker containers by entering the following commands in the terminal.
cd docker
# build docker images
docker-compose -f docker-compose.yml build
# run docker containers
docker-compose -f docker-compose.yml up -d
To run only certain services in Docker, specify the service names, e.g. docker-compose -f docker-compose.yml up <service_name> <service_name> -d
Then, run the following command to schedule and run a spider in scrapyd. Multiple spiders can be run and managed by scrapyd.
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name>
A basic structure for running from a script is provided in /docker/scrapy_docker_app/patchwork_crawler/spiders/patchwork_api.py.
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

# spider classes live in patchwork_crawler.spiders (per SPIDER_MODULES below)
from patchwork_crawler.spiders import (PatchworkProjectSpider,
                                       PatchworkSeriesSpider,
                                       PatchworkPatchSpider)

if __name__ == '__main__':
    configure_logging()
    runner = CrawlerRunner(settings={
        'ITEM_PIPELINES': {'patchwork_crawler.pipelines.PatchworkExporterPipeline': 300},
        'HTTPERROR_ALLOWED_CODES': [404, 500],
        'SPIDER_MODULES': ['patchwork_crawler.spiders'],
        'NEWSPIDER_MODULE': 'patchwork_crawler.spiders'
    })

    @defer.inlineCallbacks
    def crawl():
        # run the three spiders sequentially, then stop the reactor
        yield runner.crawl(PatchworkProjectSpider)
        yield runner.crawl(PatchworkSeriesSpider)
        yield runner.crawl(PatchworkPatchSpider)
        reactor.stop()

    crawl()
    reactor.run()
To run the spider, run the following command under /scrapy_docker_app/patchwork_crawler in the container terminal.
python -m patchwork_crawler.spiders.patchwork_api
For more information, visit the Scrapy documentation.
Run the following command in the container terminal.
scrapy crawl <spider-name>
Each spider crawls patchwork API web pages by item id (e.g. patch id -> https://patchwork.ffmpeg.org/api/patches/1/). It automatically increments the item id to crawl the next web page until the id reaches the default limit or a specified limit. The start id and the endpoint to be crawled can also be specified if necessary.
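Conceptually, the id-increment pattern looks like the sketch below. This is a simplified stand-in, not the project's actual spider code; the class name is hypothetical, and the argument names mirror the documented options.

```python
import scrapy

class ItemIdSketchSpider(scrapy.Spider):
    """Simplified sketch of crawling Patchwork API pages by item id."""
    name = "item_id_sketch"

    def __init__(self, start_patch_id=1, end_patch_id=1000,
                 base_url="https://patchwork.ffmpeg.org/api/patches", **kwargs):
        super().__init__(**kwargs)
        self.item_id = int(start_patch_id)
        self.end_id = int(end_patch_id)
        self.base_url = base_url

    def start_requests(self):
        yield scrapy.Request(f"{self.base_url}/{self.item_id}/")

    def parse(self, response):
        yield {"id": self.item_id, "data": response.json()}
        # increase the item id until the (default or specified) limit is reached
        if self.item_id < self.end_id:
            self.item_id += 1
            yield scrapy.Request(f"{self.base_url}/{self.item_id}/")
```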
Pass arguments in the command. Each argument should follow the option -d.
# crawl projects
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_project_id=<specified-id> -d end_project_id=<specified-id> -d endpoint_type=<endpoint-name>
# crawl series
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_series_id=<specified-id> -d end_series_id=<specified-id> -d endpoint_type=<endpoint-name>
# crawl patches
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_patch_id=<specified-id> -d end_patch_id=<specified-id> -d endpoint_type=<endpoint-name>
In the provided structure, specify the arguments in the crawl calls.
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(PatchworkProjectSpider, start_project_id=..., end_project_id=..., endpoint_type=...)
    yield runner.crawl(PatchworkSeriesSpider, start_series_id=..., end_series_id=..., endpoint_type=...)
    yield runner.crawl(PatchworkPatchSpider, start_patch_id=..., end_patch_id=..., endpoint_type=...)
    reactor.stop()
Similar to the customisation in scrapyd, pass the arguments with the option -a.
# take crawling project as an example
scrapy crawl <spider-name> -a start_project_id=<specified-id> -a end_project_id=<specified-id> -a endpoint_type=<endpoint-name>
After crawling data from Patchwork, data can be processed and imported to the database with the help of the application layer.
The application layer falls into two parts:
- AccessData, for 1) importing data into the database, 2) accessing JSON files on the local machine, and 3) querying data from the database
- ProcessIdentity, ProcessMailingList, and ProcessPatch, for processing data, where ProcessIdentity identifies individuals (i.e. unique developers) within each project, ProcessMailingList sorts the mailing list data, and ProcessPatch groups related code review activities of the same proposed patch
It is advised to run the automated approaches, i.e. identity (email aliases) grouping, exact bags-of-words grouping, and one-word difference grouping, provided in ProcessIdentity and ProcessPatch before data import. An implementation example can be found in implementation.ipynb.
Note that identity data come from multiple sources, including project, series, patch, and comment, and thus they are stored in different files when data crawling is completed. Therefore, the separate identity files should be merged into one before processing and importing the data, for example as sketched below.
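A minimal sketch of such a merge, assuming the identity files are JSON Lines files; the file locations and names used here are hypothetical:

```python
import glob
import json

merged, seen = [], set()
# hypothetical glob pattern; adjust to the actual identity file names
for path in glob.glob("docker/scrapy_docker_app/retrieved_data/*identity*.jl"):
    with open(path) as f:
        for line in f:
            identity = json.loads(line)
            if identity["original_id"] not in seen:  # de-duplicate across sources
                seen.add(identity["original_id"])
                merged.append(identity)

with open("merged_identity.jl", "w") as f:
    for identity in merged:
        f.write(json.dumps(identity) + "\n")
```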
In addition, newseries, change1, change2, and individual are newly generated data and do not have an original id like the data crawled from Patchwork (see the data attributes in the data dictionary), so their initial original ids need to be specified if the approaches are not being run for the first time.
To specify the original ids, the corresponding maximum original id can be derived from the database via db.collection-name.count_documents({}), which returns the number of records in a collection. ProcessIdentity and ProcessPatch accept arguments to specify the starting original id.
import pymongo
import application_layer
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["code_review_db"]
individual_original_id = db.patchwork_individual.count_documents({}) + 1
process_identity = application_layer.main.ProcessIdentity(individual_original_id)
Alternatively, a specific number can be specified.
process_identity = application_layer.main.ProcessIdentity(individual_original_id=10)
Data can be imported into the database using the insert_data() function provided in AccessData. Specifically, specify the data to be imported (or the location of the data to be imported) and its corresponding item type. Available item types include identity, project, mailinglist, individual, series, newseries, change1, change2, patch, and comment.
Note that for the msg_content field in the patch and comment datasets and the content field in the mailinglist dataset, the null character was converted to its Unicode escape (i.e. \u0000) during crawling, and thus it has to be converted back.
# data has been loaded before import
access_data.insert_data(data=project_data, item_type="project")
# data has not been loaded before import
access_data.insert_data(data="path/to/project/data", item_type="project")
However, the import of each item type should follow a specific order: identity -> project -> mailinglist -> individual -> series -> newseries -> change1 -> change2 -> patch -> comment, unless it is confirmed that the related foreign-key data in the data to be imported are already in the database (see the complete ER diagram).
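For instance, a minimal loop enforcing this order (the data paths are placeholders):

```python
# access_data: an AccessData instance, as in the earlier examples
import_order = ["identity", "project", "mailinglist", "individual", "series",
                "newseries", "change1", "change2", "patch", "comment"]
for item_type in import_order:
    # "path/to/<item_type>/data" is a placeholder for the actual file location
    access_data.insert_data(data=f"path/to/{item_type}/data", item_type=item_type)
```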
This section describes the high-level structure of the dataset. (Note that fields named change1 refer to ExactBoWGroup and those named change2 refer to OWDiffGroup)
Below is a complete ER diagram depicting the database structure, which can also be accessed in dbdiagram.
Identity
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and the original Django id presented in the API data, e.g. ffmpeg-people-1; note that for maintainer identities of a project, the item type is user, e.g. ffmpeg-user-1 |
email | Email of the identity |
name | Name of the identity |
api_url | API URL for retrieving the original data in patchwork (Authentication needed as patchwork blocks the access) |
Individual
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-individual-1 |
project | The project in which this individual has submitted patches, comments, etc. |
identity | Identities which belong to the same individual in a project |
Project
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-project-1 |
name | Name of the project |
repository_url | URL of the git repository of the project if applicable |
api_url | API URL for retrieving the project data in patchwork |
web_url | URL of the project web page (if applicable) |
list_id | Id of the mailing list of the project |
list_address | Email address of the mailing list of the project |
maintainer_identity | Maintainers' corresponding identity detail |
Series
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-series-1 |
name | Name of the series (may be null) |
date | Created date of the series |
version | Version of the series |
total | Number of patches submitted under the series |
received_total | Number of patches submitted under the series and received by the mailing list |
cover_letter_msg_id | Message id of the cover letter email (the email that all following patch emails reply to) |
cover_letter_content | Content of the cover letter email (may be null) |
api_url | API URL for retrieving the series data in patchwork |
web_url | URL of the series in patchwork |
project | Detail of the project that this series belongs to |
submitter_identity | Submitter's identity detail |
submitter_individual | Submitter's individual detail |
Patch
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-patch-1 |
name | Name of the patch |
state | Status of the patch (accepted, superseded, new, etc.) |
date | Date when the patch is submitted |
msg_id | Message id of the patch, which can be referenced to message id of the original email in the mailing list |
msg_content | Message content of the patch |
code_diff | Differences of the code changes in the patch |
api_url | API URL for retrieving the patch data in patchwork |
web_url | URL of the patch in patchwork |
commit_ref | Commit id in the corresponding git repository |
in_reply_to | Message id of the email that the patch replies to (in most cases, it is the msg_id of a cover letter) |
change1 | Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping) |
change2 | Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping) |
mailinglist | Referencing the original_id in the mailinglist collection |
series | Referencing the original_id in the series collection |
newseries | Referencing the original_id in the newseries collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
project | Referencing the original_id in the project collection |
Comment
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-comment-1 |
msg_id | Message id of the comment, which can be referenced to message id of the original email in the mailing list |
msg_content | Content of the comment |
date | Date when the comment is submitted |
subject | Email subject of the comment |
in_reply_to | Message id of the patch that the comment replies to |
web_url | URL of the comment in patchwork |
change1 | Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping) |
change2 | Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping) |
mailinglist | Referencing the original_id in the mailinglist collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
patch | Referencing the original_id in the patch collection |
project | Referencing the original_id in the project collection |
NewSeries
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-newseries-1 |
cover_letter_msg_id | In-reply-to id of patches; Referenced by the reply_to_msg_id in the patches collection |
project | Referencing the original_id in the project collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
series | Referencing the original_id in the series collection |
inspection_needed | Indicates that the corresponding data item might have problems and manual checking might be needed |
Change (ExactBoWGroup / OWDiffGroup)
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-change1-1 , ffmpeg-change2-1 |
is_accepted | Whether the improved patch in the change set is accepted |
parent_commit_id | Commit id of the previous version before this change is merged into the git repository |
merged_commit_id | Commit id of the merge of this change into the git repository |
commit_date | Date of commit |
project | Referencing the original_id in the project collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
series | Referencing the original_id in the series collection |
newseries | Referencing the original_id in the newseries collection |
inspection_needed | Indicates that the corresponding data item might have problems and manual checking might be needed |
patches | Details of the list of related patches |
comments | Details of the list of related comments |
MailingList
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-mailinglist-1 |
msg_id | Message id of the email, referencing the msg_id of patches/comments |
subject | Email subject |
content | Email content |
date | Date when the email is sent |
sender_name | Name of the sender of the email |
web_url | URL of the original email in the mailing list |
project | Referencing the original_id in the project collection |
When accessing the dataset via the Python application, the fields available for filtering differ between collections.
For the filter types exact and -, directly use the field name to filter, e.g. id=1.
For the filter types icontains, gt, and lt, append the filter type, prefixed with two underscores, right after the field name, e.g. original_id__icontains=ffmpeg.
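For example, combining both filter styles on the patch collection (the date format shown here is an assumption):

```python
from application_layer import AccessData

access_data = AccessData()
# patches whose original_id contains "ffmpeg", submitted after 2022-01-01
filter = 'original_id__icontains=ffmpeg&date__gt=2022-01-01'
retrieved_data = access_data.retrieve_data('patch', filter)
```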
Identity
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
username | exact, icontains |
email | exact, icontains |
user_original_id | exact |
Individual
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
Project
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
Series
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
date | gt, lt |
cover_letter_content_contain | - |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
Patch
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
state | exact |
date | gt, lt |
msg_content_contain | - |
code_diff_contain | - |
commit_ref | exact |
change1_original_id | exact |
change2_original_id | exact |
mailing_list_original_id | exact |
series_original_id | exact |
new_series_original_id | exact |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
Comment
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
subject | exact, icontains |
date | gt, lt |
msg_content_contain | - |
change1_original_id | exact |
change2_original_id | exact |
mailing_list_original_id | exact |
patch_original_id | exact |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
NewSeries
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
project_original_id | exact |
inspection_needed | exact |
Change (ExactBoWGroup / OWDiffGroup)
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
is_accepted | exact |
parent_commit_id | exact |
merged_commit_id | exact |
project_original_id | exact |
inspection_needed | exact |
MailingList
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
msg_id | exact |
subject | exact |
date | gt, lt |
sender_name | exact |
project_original_id | exact |
@inproceedings{Liang2024,
  title     = {Curated Email-Based Code Reviews Datasets},
  author    = {Liang, Mingzhao and Charoenwet, Wachiraphan and Thongtanunam, Patanamon},
  booktitle = {Proceedings of the IEEE/ACM International Conference on Mining Software Repositories (MSR)},
  pages     = {to appear},
  year      = {2024},
  url       = {https://www.researchgate.net/publication/378711871_Curated_Email-Based_Code_Reviews_Datasets}
}