Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the unstructured nature of email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practices of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch with its revised versions and by grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practices of large OSS projects without additional effort in data collection and curation.
This project provides a suite of tools for mining and further processing Patchwork data. It consists of three parts:
- Scrapy framework for crawling data
- Django REST framework and Python application for accessing and processing the data
- MongoDB database for storing the data
Sample Jupyter notebooks for processing the raw crawled data (including identity grouping and patch grouping) and for importing the processed data into the database are provided in the app folder.
Two approaches, Exact Bags-of-Words (BoW) Grouping and One-word Difference Grouping, are implemented for patch grouping. Below are the constraints for each approach (a conceptual sketch follows the constraint lists).
Exact BoW Grouping
- The bags-of-words of the summary phrases of the patches are identical
- The patches do not belong to the same series
One-word Difference Grouping
- The bag-of-words of one group differs from that of the other group by exactly one word
- The differing word must not be "revert"
- The version references of the two groups must not intersect
- The two groups share at least one common patch submitter
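The sketch below illustrates these constraints conceptually; it is not the tool's actual implementation, and the patch field names (summary, series_id) and the normalisation are assumptions for illustration.

```python
# Conceptual sketch of the two grouping heuristics (assumed field names,
# not the tool's actual implementation).
from collections import Counter

def bag_of_words(summary):
    # hypothetical normalisation: lowercase and split on whitespace
    return Counter(summary.lower().split())

def exact_bow_match(patch_a, patch_b):
    # identical bags-of-words AND the patches do not belong to the same series
    return (bag_of_words(patch_a["summary"]) == bag_of_words(patch_b["summary"])
            and patch_a["series_id"] != patch_b["series_id"])

def one_word_difference(bow_a, bow_b):
    # one interpretation: exactly one word is present in one bag but not the
    # other, and that word is not "revert" (the version-reference and
    # common-submitter constraints are omitted here)
    differing = set(bow_a) ^ set(bow_b)
    return len(differing) == 1 and "revert" not in differing
```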
Accuracy of grouping
For our manual evaluation, a patch group is considered correct if, by investigating the content of each patch (e.g., commit message, related comments, code changes), all patches in the group are found to be related to the same review process. Similarly, we consider an individual identification correct if all the identities in the group are certainly from the same individual, examined by checking whether 1) the identities have submitted patches in the same group, 2) the identities have commented on the same patches, and 3) the identities share other characteristics such as organisation email addresses. Finally, we compute the grouping accuracy as:

accuracy = |correct groups| / |evaluated groups|

where correct groups refer to 1) groups of patches that belong to the same code review process or 2) groups of identities that belong to the same individual that are correctly identified; and evaluated groups refer to the sampled groups that are manually evaluated.
Below are the correctness and Cohen's kappa results of each approach applied to the five selected projects. (N/A indicates that both raters agreed to exclude zero sampled groups)
Exact BoW grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 97.80%(±5%) | 100% | N/A | N/A |
QEMU | 99.48%(±5%) | 100% | N/A | N/A |
U-Boot | 98.42%(±5%) | 98.95% | N/A | N/A |
Linux Arm Kernel | 99.48%(±5%) | 100% | N/A | N/A |
Netdev + BPF | 98.41%(±5%) | 96.81% | N/A | N/A |
One-word difference grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 82.27%(±5%) | 91.43% | 0.6809 | substantial agreement |
QEMU | 88.10%(±5%) | 94.87% | 0.7709 | substantial agreement |
U-Boot | 84.39%(±5%) | 91.04% | 0.7141 | substantial agreement |
Linux Arm Kernel | 86.38%(±5%) | 93.83% | 0.7265 | substantial agreement |
Netdev + BPF | 82.95%(±5%) | 92.59% | 0.6301 | substantial agreement |
Individual grouping
Projects | Correctness | Agreement | Cohen's κ | Interpretation |
---|---|---|---|---|
FFmpeg | 81.94%(±5%) | 88.89% | 0.6786 | substantial agreement |
QEMU | 82.35%(±5%) | 91.67% | 0.7115 | substantial agreement |
U-Boot | 84.55%(±5%) | 87.27% | 0.5276 | moderate agreement |
Linux Arm Kernel | 82.47%(±5%) | 94.52% | 0.8016 | substantial agreement |
Netdev + BPF | 87.37%(±5%) | 92.00% | 0.4565 | moderate agreement |
We provide both a database dump and the data in plain JSON format, each containing the data of all projects from the three OSS communities (Linux Kernel, OzLabs, and FFmpeg) until 30/06/2023. There are ten collections: Project, Identity, Series, Patch, Comment, and MailingList store the original crawled data (some fields are updated during further processing), while Individual, Change1, Change2, and NewSeries record the results of processing.
The compressed complete datasets and the plain JSON data can be downloaded here. Decompress the downloaded file in the root folder of the project before proceeding with the following steps.
To use the provided dataset, simply run the Docker containers without migrating the database, using the following commands in the terminal.
cd app/docker
# build docker images
docker-compose -f docker-compose-non-migrate.yml build
# run docker containers
docker-compose -f docker-compose-non-migrate.yml up -d
To run only certain services in Docker, specify the service names, e.g. docker-compose -f docker-compose-non-migrate.yml up <service_name> <service_name> -d
Then, restore the database by running the following command.
docker exec -i mongodb_docker_container sh -c 'exec mongorestore --archive --nsInclude=code_review_db.*' < /path/to/code_review_db.archive
After the restoration process is done, the MongoDB database will be available for local access at mongodb://localhost:27017.
Sample analyses of code review metrics in a Jupyter notebook, together with their outputs, can be found in the analysis folder.
Data stored in the MongoDB database can be retrieved through the Django REST API by using the retrieve_data method in the Python application, either as the whole set of data in a collection or as a specific subset by using filters.
The item type of the data to be retrieved has to be specified. Available item types include project, identity, series, patch, comment, newseries, change1, change2, mailinglist, and individual.
from application_layer import AccessData
access_data = AccessData()
item_type = 'project'
retrieved_data = access_data.retrieve_data(item_type)
Similarly, the item type also needs to be specified. To filter data, specify the filter in the arguments.
from application_layer import AccessData
access_data = AccessData()
item_type = 'series'
# filter series data which belong to the FFmpeg project and whose cover letter message content contains the word "improve"
filter = 'project_original_id=ffmpeg-project-1&cover_letter_content_contain=improve'
retrieved_data = access_data.retrieve_data(item_type, filter)
All available filters can be found in the Application filter section.
To crawl new data from the source, use the Scrapy framework in the suite. Retrieved data will first be stored in JSON Lines files, whose content can then be imported into the database with the help of the application layer.
There are three spiders for crawling patchwork data. Their spider names are patchwork_project, patchwork_series, and patchwork_patch.
- patchwork_project crawls patchwork projects and corresponding maintainer accounts data.
- patchwork_series crawls patchwork series and corresponding series submitter accounts data.
- patchwork_patch crawls patchwork patches, comments and corresponding submitter accounts data.
The retrieved data will be stored under /docker/scrapy_docker_app/retrieved_data.
There are three ways to run spiders.
- Schedule spiders on scrapyd
- Run spiders from a script
- Run spiders using commands
Run the docker containers by entering the following commands in the terminal.
cd docker
# build docker images
docker-compose -f docker-compose.yml build
# run docker containers
docker-compose -f docker-compose.yml up -d
To run only certain services in Docker, specify the service names, e.g. docker-compose -f docker-compose.yml up <service_name> <service_name> -d
Then, run the following command to schedule and run a spider in scrapyd. Multiple spiders can be run and managed by scrapyd.
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name>
A basic structure for running from a script is provided in /docker/scrapy_docker_app/patchwork_crawler/spiders/patchwork_api.py.
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

# spider classes live in patchwork_crawler.spiders (per SPIDER_MODULES below)
from patchwork_crawler.spiders import (PatchworkProjectSpider,
                                       PatchworkSeriesSpider,
                                       PatchworkPatchSpider)

if __name__ == '__main__':
    configure_logging()
    runner = CrawlerRunner(settings={
        'ITEM_PIPELINES': {'patchwork_crawler.pipelines.PatchworkExporterPipeline': 300},
        'HTTPERROR_ALLOWED_CODES': [404, 500],
        'SPIDER_MODULES': ['patchwork_crawler.spiders'],
        'NEWSPIDER_MODULE': 'patchwork_crawler.spiders'
    })

    @defer.inlineCallbacks
    def crawl():
        # run the three spiders sequentially, then stop the reactor
        yield runner.crawl(PatchworkProjectSpider)
        yield runner.crawl(PatchworkSeriesSpider)
        yield runner.crawl(PatchworkPatchSpider)
        reactor.stop()

    crawl()
    reactor.run()
To run the spider, run the following command under /scrapy_docker_app/patchwork_crawler in the container terminal.
python -m patchwork_crawler.spiders.patchwork_api
For more information, visit the Scrapy documentation.
Run the following command in the container terminal.
scrapy crawl <spider-name>
Each spider crawls patchwork API web pages by item id (e.g. patch id -> https://patchwork.ffmpeg.org/api/patches/1/). It automatically increments the item id to crawl the next web page until the id reaches the default limit or a specified limit. The start id and the endpoint to be crawled can also be specified if necessary.
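Conceptually, the id-increment pattern looks like the sketch below. This is a simplified stand-in, not the project's actual spider code; the class name is hypothetical, and the argument names mirror the documented options.

```python
import scrapy

class ItemIdSketchSpider(scrapy.Spider):
    """Simplified sketch of crawling Patchwork API pages by item id."""
    name = "item_id_sketch"

    def __init__(self, start_patch_id=1, end_patch_id=1000,
                 base_url="https://patchwork.ffmpeg.org/api/patches", **kwargs):
        super().__init__(**kwargs)
        self.item_id = int(start_patch_id)
        self.end_id = int(end_patch_id)
        self.base_url = base_url

    def start_requests(self):
        yield scrapy.Request(f"{self.base_url}/{self.item_id}/")

    def parse(self, response):
        yield {"id": self.item_id, "data": response.json()}
        # increase the item id until the (default or specified) limit is reached
        if self.item_id < self.end_id:
            self.item_id += 1
            yield scrapy.Request(f"{self.base_url}/{self.item_id}/")
```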
Pass arguments in the command. Each argument should follow the option -d.
# crawl projects
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_project_id=<specified-id> -d end_project_id=<specified-id> -d endpoint_type=<endpoint-name>
# crawl series
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_series_id=<specified-id> -d end_series_id=<specified-id> -d endpoint_type=<endpoint-name>
# crawl patches
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_patch_id=<specified-id> -d end_patch_id=<specified-id> -d endpoint_type=<endpoint-name>
In the provided structure, specify the arguments in the crawl calls.
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(PatchworkProjectSpider, start_project_id=..., end_project_id=..., endpoint_type=...)
    yield runner.crawl(PatchworkSeriesSpider, start_series_id=..., end_series_id=..., endpoint_type=...)
    yield runner.crawl(PatchworkPatchSpider, start_patch_id=..., end_patch_id=..., endpoint_type=...)
    reactor.stop()
Similar to the customisation in scrapyd, pass the arguments with the option -a.
# take crawling project as an example
scrapy crawl <spider-name> -a start_project_id=<specified-id> -a end_project_id=<specified-id> -a endpoint_type=<endpoint-name>
After crawling data from Patchwork, data can be processed and imported to the database with the help of the application layer.
The application layer falls into two parts:
- AccessData, for 1) importing data into the database, 2) accessing JSON files on the local machine, and 3) querying data from the database
- ProcessIdentity, ProcessMailingList, and ProcessPatch, for processing data, where ProcessIdentity identifies individuals (i.e. unique developers) within each project, ProcessMailingList sorts the mailing list data, and ProcessPatch groups related code review activities of the same proposed patch
It is advised to run the automated approaches, i.e. identity (email aliases) grouping, exact bags-of-words grouping, and one-word difference grouping, provided in ProcessIdentity and ProcessPatch before data import. An implementation example can be found in implementation.ipynb.
Note that identity data come from multiple sources, including project, series, patch, and comment, and thus they are stored in different files when data crawling is completed. Therefore, the separate identity files should be merged into one before processing and importing the data, for example as sketched below.
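A minimal sketch of such a merge, assuming the identity files are JSON Lines files; the file locations and names used here are hypothetical:

```python
import glob
import json

merged, seen = [], set()
# hypothetical glob pattern; adjust to the actual identity file names
for path in glob.glob("docker/scrapy_docker_app/retrieved_data/*identity*.jl"):
    with open(path) as f:
        for line in f:
            identity = json.loads(line)
            if identity["original_id"] not in seen:  # de-duplicate across sources
                seen.add(identity["original_id"])
                merged.append(identity)

with open("merged_identity.jl", "w") as f:
    for identity in merged:
        f.write(json.dumps(identity) + "\n")
```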
In addition, newseries, change1, change2, and individual are newly generated data and do not have an original id like the data crawled from Patchwork (see the data attributes in the data dictionary), so their initial original ids need to be specified if the approaches are not being run for the first time.
To specify the original ids, the corresponding maximum original id can be derived from the database via db.collection-name.count_documents({}), which returns the number of records in a collection. ProcessIdentity and ProcessPatch accept arguments to specify the starting original id.
import pymongo
import application_layer
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["code_review_db"]
individual_original_id = db.patchwork_individual.count_documents({}) + 1
process_identity = application_layer.main.ProcessIdentity(individual_original_id)
Alternatively, a specific number can be specified.
process_identity = application_layer.main.ProcessIdentity(individual_original_id=10)
Data can be imported into the database using the insert_data() function provided in AccessData. Specifically, specify the data to be imported (or the location of the data to be imported) and its corresponding item type. Available item types include identity, project, mailinglist, individual, series, newseries, change1, change2, patch, and comment.
Note that for the msg_content field in the patch and comment datasets and the content field in the mailinglist dataset, the null character was converted to its Unicode escape (i.e. \u0000) during crawling, and thus it has to be converted back.
# data has been loaded before import
access_data.insert_data(data=project_data, item_type="project")
# data has not been loaded before import
access_data.insert_data(data="path/to/project/data", item_type="project")
However, the import of each item type should follow a specific order: identity -> project -> mailinglist -> individual -> series -> newseries -> change1 -> change2 -> patch -> comment, unless it is confirmed that the related foreign-key data in the data to be imported are already in the database (see the complete ER diagram).
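For instance, a minimal loop enforcing this order (the data paths are placeholders):

```python
# access_data: an AccessData instance, as in the earlier examples
import_order = ["identity", "project", "mailinglist", "individual", "series",
                "newseries", "change1", "change2", "patch", "comment"]
for item_type in import_order:
    # "path/to/<item_type>/data" is a placeholder for the actual file location
    access_data.insert_data(data=f"path/to/{item_type}/data", item_type=item_type)
```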
This section describes the high-level structure of the dataset. (Note that fields named change1 refer to ExactBoWGroup and those named change2 refer to OWDiffGroup)
Below is a complete ER diagram depicting the database structure, which can also be accessed in dbdiagram.
Identity
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and the original Django id presented in the API data, e.g. ffmpeg-people-1; note that for maintainer identities of a project, the item type is user, e.g. ffmpeg-user-1 |
email | Email of the identity |
name | Name of the identity |
api_url | API URL for retrieving the original data in patchwork (Authentication needed as patchwork blocks the access) |
Individual
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-individual-1 |
project | The project in which this individual has submitted patches, comments, etc. |
identity | Identities which belong to the same individual in a project |
Project
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-project-1 |
name | Name of the project |
repository_url | URL of the git repository of the project if applicable |
api_url | API URL for retrieving the project data in patchwork |
web_url | URL of the project web page (if applicable) |
list_id | Id of the mailing list of the project |
list_address | Email address of the mailing list of the project |
maintainer_identity | Maintainers' corresponding identity detail |
Series
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-series-1 |
name | Name of the series (may be null) |
date | Created date of the series |
version | Version of the series |
total | Number of patches submitted under the series |
received_total | Number of patches submitted under the series and received by the mailing list |
cover_letter_msg_id | Message id of the cover letter email (the email that all following patch emails reply to) |
cover_letter_content | Content of the cover letter email (may be null) |
api_url | API URL for retrieving the series data in patchwork |
web_url | URL of the series in patchwork |
project | Detail of the project that this series belongs to |
submitter_identity | Submitter's identity detail |
submitter_individual | Submitter's individual detail |
Patch
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-patch-1 |
name | Name of the patch |
state | Status of the patch (accepted, superseded, new, etc.) |
date | Date when the patch is submitted |
msg_id | Message id of the patch, which can be referenced to message id of the original email in the mailing list |
msg_content | Message content of the patch |
code_diff | Differences of the code changes in the patch |
api_url | API URL for retrieving the patch data in patchwork |
web_url | URL of the patch in patchwork |
commit_ref | Commit id in the corresponding git repository |
in_reply_to | Message id of the email that the patch replies to (in most cases, it is the msg_id of a cover letter) |
change1 | Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping) |
change2 | Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping) |
mailinglist | Referencing the original_id in the mailinglist collection |
series | Referencing the original_id in the series collection |
newseries | Referencing the original_id in the newseries collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
project | Referencing the original_id in the project collection |
Comment
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-comment-1 |
msg_id | Message id of the comment, which can be referenced to message id of the original email in the mailing list |
msg_content | Content of the comment |
date | Date when the comment is submitted |
subject | Email subject of the comment |
in_reply_to | Message id of the patch that the comment replies to |
web_url | URL of the comment in patchwork |
change1 | Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping) |
change2 | Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping) |
mailinglist | Referencing the original_id in the mailinglist collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
patch | Referencing the original_id in the patch collection |
project | Referencing the original_id in the project collection |
NewSeries
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-newseries-1 |
cover_letter_msg_id | In-reply-to id of patches; Referenced by the reply_to_msg_id in the patches collection |
project | Referencing the original_id in the project collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
series | Referencing the original_id in the series collection |
inspection_needed | Indicates that the corresponding data item might have problems and manual checking might be needed |
Change (ExactBoWGroup / OWDiffGroup)
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-change1-1 , ffmpeg-change2-1 |
is_accepted | Whether the improved patch in the change set is accepted |
parent_commit_id | Commit id of the previous version before this change is merged into the git repository |
merged_commit_id | Commit id of the merge of this change into the git repository |
commit_date | Date of commit |
project | Referencing the original_id in the project collection |
submitter_identity | Referencing the original_id in the identity collection |
submitter_individual | Referencing the original_id in the individual collection |
series | Referencing the original_id in the series collection |
newseries | Referencing the original_id in the newseries collection |
inspection_needed | Indicates that the corresponding data item might have problems and manual checking might be needed |
patches | Details of the list of related patches |
comments | Details of the list of related comments |
MailingList
Fields | Description |
---|---|
_id | Object id created by MongoDB |
id | Auto-increment id generated by Django |
original_id | A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-mailinglist-1 |
msg_id | Message id of the email, referencing the msg_id of patches/comments |
subject | Email subject |
content | Email content |
date | Date when the email is sent |
sender_name | Name of the sender of the email |
web_url | URL of the original email in the mailing list |
project | Referencing the original_id in the project collection |
When accessing the dataset via the Python application, the fields available for filtering differ between collections.
For the filter types exact and -, directly use the field name to filter, e.g. id=1.
For the filter types icontains, gt, and lt, append the filter type, prefixed with two underscores, right after the field name, e.g. original_id__icontains=ffmpeg.
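For example, combining both filter styles on the patch collection (the date format shown here is an assumption):

```python
from application_layer import AccessData

access_data = AccessData()
# patches whose original_id contains "ffmpeg", submitted after 2022-01-01
filter = 'original_id__icontains=ffmpeg&date__gt=2022-01-01'
retrieved_data = access_data.retrieve_data('patch', filter)
```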
Identity
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
username | exact, icontains |
email | exact, icontains |
user_original_id | exact |
Individual
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
Project
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
Series
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
date | gt, lt |
cover_letter_content_contain | - |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
Patch
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
name | exact, icontains |
state | exact |
date | gt, lt |
msg_content_contain | - |
code_diff_contain | - |
commit_ref | exact |
change1_original_id | exact |
change2_original_id | exact |
mailing_list_original_id | exact |
series_original_id | exact |
new_series_original_id | exact |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
Comment
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
subject | exact, icontains |
date | gt, lt |
msg_content_contain | - |
change1_original_id | exact |
change2_original_id | exact |
mailing_list_original_id | exact |
patch_original_id | exact |
project_original_id | exact |
submitter_account_original_id | exact |
submitter_user_original_id | exact |
NewSeries
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
project_original_id | exact |
inspection_needed | exact |
Change (ExactBoWGroup / OWDiffGroup)
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
is_accepted | exact |
parent_commit_id | exact |
merged_commit_id | exact |
project_original_id | exact |
inspection_needed | exact |
MailingList
Available field | Available filter type |
---|---|
id | exact |
original_id | exact, icontains |
msg_id | exact |
subject | exact |
date | gt, lt |
sender_name | exact |
project_original_id | exact |
@inproceedings{Liang2024,
  title     = {Curated Email-Based Code Reviews Datasets},
  author    = {Liang, Mingzhao and Charoenwet, Wachiraphan and Thongtanunam, Patanamon},
  booktitle = {Proceedings of the IEEE/ACM International Conference on Mining Software Repositories (MSR)},
  pages     = {to appear},
  year      = {2024},
  url       = {https://www.researchgate.net/publication/378711871_Curated_Email-Based_Code_Reviews_Datasets}
}