
Curated Email-Based Code Reviews Datasets

Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the unstructured nature of email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch and its revised versions and by grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practices of large OSS projects without additional effort in data collection and curation.

Introduction

This project provides a suite of tools for mining and further processing Patchwork data. It consists of three parts:

  • Scrapy framework for crawling data
  • Django REST framework and Python application for accessing and processing the data
  • MongoDB database for storing the data

Table of Contents

  1. Approaches and evaluation results
  2. Provided dataset
    1. Get provided dataset
    2. Use provided dataset
  3. Data crawling
    1. Basic steps for crawling
      1. Schedule spiders on scrapyd
      2. Run spiders from script
      3. Run spiders using commands
    2. Customise spiders
      1. Customise spiders in scrapyd
      2. Customise spiders from script
      3. Customise spiders run using commands
    3. Data Process and Import
  4. Data dictionary
    1. Application filter
  5. BibTeX Citation

1. Approaches and evaluation results

A sample Jupyter notebook for processing the raw crawled data, including identity grouping and patch grouping, and another for importing the processed data into the database are provided in the app folder.

1.1 Patch grouping approach constraints

Two approaches, Exact Bags-of-Words (BoW) Grouping and One-word Difference Grouping, are implemented for patch grouping. Below are the constraints for the approaches; a simplified sketch of the Exact BoW grouping follows the two lists.

Exact BoW Grouping

  • The bag-of-words of the summary phrases of the patches are the same
  • The patches do not belong to the same series

One-word Difference Grouping

  • The bag-of-words of a group should differ from that of another group by one word
  • The differing word should not be "revert"
  • Version references of both groups should not intersect
  • Both groups contain at least one common patch submitter
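
To make the Exact BoW constraints concrete, below is a simplified, illustrative Python sketch (not the ProcessPatch implementation): bag_of_words and exact_bow_groups are hypothetical helpers, and the tokenisation, the prefix stripping, and the input shape (a list of dicts with name and series keys) are assumptions.

import re
from collections import defaultdict

def bag_of_words(summary):
    # strip a leading "[PATCH ...]" tag and tokenise into lower-case words;
    # order is ignored but word counts are kept (hypothetical normalisation)
    summary = re.sub(r"^\[[^\]]*\]\s*", "", summary)
    return tuple(sorted(re.findall(r"[a-z0-9]+", summary.lower())))

def exact_bow_groups(patches):
    # patches: assumed to be dicts with at least 'name' and 'series' keys
    groups = defaultdict(list)
    for patch in patches:
        members = groups[bag_of_words(patch["name"])]
        # Exact BoW constraint: identical bag-of-words, but never two patches
        # from the same series in one group
        if all(patch["series"] != m["series"] for m in members):
            members.append(patch)
    return list(groups.values())

patches = [
    {"name": "[PATCH v1] avcodec: fix leak", "series": "s1"},
    {"name": "[PATCH v2] avcodec: fix leak", "series": "s2"},
]
print(exact_bow_groups(patches))  # both patches fall into the same group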

1.2 Evaluation results

Accuracy of grouping

For our manual evaluation, a patch grouping is considered correct if, based on the content of each patch (e.g., commit message, related comments, code changes), all patches in the group are related to the same review process. Similarly, we consider an individual identification as correct if all the identities in the group are certainly from the same individual, which we judge by examining whether 1) the identities have submitted patches in the same group, 2) the identities have commented on the same patches, and 3) the identities share other characteristics such as organisation email addresses. Finally, we compute the grouping accuracy using the following calculation:

$Correctness = \frac{\text{\#Correct groups}}{\text{\#Evaluated groups}}$

where correct groups refer to 1) groups of patches that belong to the same code review process or 2) groups of identities that belong to the same individual that are correctly identified; and evaluated groups refer to the sampled groups that are manually evaluated.
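
For instance, with purely illustrative numbers (not taken from the evaluation below), if 44 of the 50 sampled groups are judged correct:

$Correctness = \frac{44}{50} = 88\%$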

Below are the correctness and Cohen's Kappa results of each approach applied to the five selected projects. (N/A indicates that both raters agreed to exclude zero sampled groups.)

Exact BoW grouping

Projects Correctness Agreement Cohen's κ Interpretation
FFmpeg 97.80%(±5%) 100% N/A N/A
QEMU 99.48%(±5%) 100% N/A N/A
U-Boot 98.42%(±5%) 98.95% N/A N/A
Linux Arm Kernel 99.48%(±5%) 100% N/A N/A
Netdev + BPF 98.41%(±5%) 96.81% N/A N/A

One-word difference grouping

Projects Correctness Agreement Cohen's κ Interpretation
FFmpeg 82.27%(±5%) 91.43% 0.6809 substantial agreement
QEMU 88.10%(±5%) 94.87% 0.7709 substantial agreement
U-Boot 84.39%(±5%) 91.04% 0.7141 substantial agreement
Linux Arm Kernel 86.38%(±5%) 93.83% 0.7265 substantial agreement
Netdev + BPF 82.95%(±5%) 92.59% 0.6301 substantial agreement

Individual grouping

Projects Correctness Agreement Cohen's κ Interpretation
FFmpeg 81.94%(±5%) 88.89% 0.6786 substantial agreement
QEMU 82.35%(±5%) 91.67% 0.7115 substantial agreement
U-Boot 84.55%(±5%) 87.27% 0.5276 moderate agreement
Linux Arm Kernel 82.47%(±5%) 94.52% 0.8016 substantial agreement
Netdev + BPF 87.37%(±5%) 92.00% 0.4565 moderate agreement

2. Provided dataset

We provide both a database dump and the data in plain JSON format, both of which contain the data of all projects from the three OSS communities (Linux Kernel, OzLabs, and FFmpeg) up to 30/06/2023. There are ten collections: Project, Identity, Series, Patch, Comment, and MailingList store the original crawled data (some fields are updated during further processing), while Individual, Change1, Change2, and NewSeries record the results of the processing.

2.1. Get provided dataset

The compressed complete datasets and the plain JSON data can be downloaded here. Decompress the downloaded file in the root folder of the project before following the steps below.

2.2. Use provided dataset

To use the provided dataset, simply run the docker containers without migrating the database by using the following commands in the terminal.

cd app/docker

# build docker images
docker-compose -f docker-compose-non-migrate.yml build

# run docker containers
docker-compose -f docker-compose-non-migrate.yml up -d

To run only certain services in docker, specify the service names, e.g. docker-compose -f docker-compose-non-migrate.yml up <service_name> <service_name> -d

Then, restore the database by running the following command.

docker exec -i mongodb_docker_container sh -c 'exec mongorestore --archive --nsInclude=code_review_db.*' < /path/to/code_review_db.archive

After the restoration process is done, the MongoDB database will be available for local access at mongodb://localhost:27017.

2.2.1. Access database directly

Sample analyses of code review metrics in a Jupyter notebook, together with their outputs, can be found in the analysis folder.
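
For direct access, the restored database can also be queried with any MongoDB client. Below is a minimal pymongo sketch; the patchwork_patch collection name is an assumption, following the patchwork_individual naming used in the data-processing example in Section 3.3.1.

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["code_review_db"]

# total number of patch records (collection name assumed to follow the patchwork_* convention)
print(db.patchwork_patch.count_documents({}))

# peek at one patch, projecting a few fields listed in the data dictionary
print(db.patchwork_patch.find_one({}, {"name": 1, "state": 1, "date": 1}))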

2.2.2. Access database via Python application

Data stored in the MongoDB database can be retrieved through the Django REST API by using the retrieve_data method in the Python application, either for the whole set of data in a collection or for a specific subset by applying filters.

Retrieve the whole collection

The item type of the data to be retrieved has to be specified. Available item types include project, identity, series, patch, comment, newseries, change1, change2, mailinglist, and individual.

from application_layer import AccessData

access_data = AccessData()
item_type = 'project'
retrieved_data = access_data.retrieve_data(item_type)

Retrieve a specific set of data

Similarly, the item type also needs to be specified. To filter data, specify the filter in the arguments.

from application_layer import AccessData

access_data = AccessData()
item_type = 'series'

# filter series data which belong to the FFmpeg project and whose cover letter message content contains the word "improve"
filter = 'project_original_id=ffmpeg-project-1&cover_letter_content_contain=improve'
retrieved_data = access_data.retrieve_data(item_type, filter)

All available filters can be found in section Application filter.

3. Data crawling

To crawl new data from the source, apply the Scrapy framework in the suite. The retrieved data will first be stored in JSON Lines files, whose content can then be imported into the database with the help of the application layer.

There are three spiders for crawling patchwork data. Their spider names are patchwork_project, patchwork_series, and patchwork_patch.

  • patchwork_project crawls patchwork projects and corresponding maintainer accounts data.
  • patchwork_series crawls patchwork series and corresponding series submitter accounts data.
  • patchwork_patch crawls patchwork patches, comments and corresponding submitter accounts data.

The retrieved data will be stored under /docker/scrapy_docker_app/retrieved_data.

3.1. Basic steps for crawling

There are three ways to run spiders.

  • Schedule spiders on scrapyd
  • Run spiders from a script
  • Run spiders using commands.

3.1.1. Schedule spiders on scrapyd

Run the docker containers by entering the following commands in the terminal.

cd docker

# build docker images
docker-compose -f docker-compose.yml build

# run docker containers
docker-compose -f docker-compose.yml up -d

To run only certain services in docker, specify the service names, e.g. docker-compose -f docker-compose.yml up <service_name> <service_name> -d

Then, run the following command to schedule and run a spider in scrapyd. Multiple spiders can be run and managed by scrapyd.

curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name>

3.1.2. Run spiders from script

A basic structure for running from the script is provided in /docker/scrapy_docker_app/patchwork_crawler/spiders/patchwork_api.py.

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor

if __name__ == '__main__':

    configure_logging()
    runner = CrawlerRunner(settings={
        'ITEM_PIPELINES': {'patchwork_crawler.pipelines.PatchworkExporterPipeline': 300},
        'HTTPERROR_ALLOWED_CODES': [404, 500],
        'SPIDER_MODULES': ['patchwork_crawler.spiders'],
        'NEWSPIDER_MODULE': 'patchwork_crawler.spiders'
    })

    @defer.inlineCallbacks
    def crawl():
        # crawl projects, series, and patches sequentially, then stop the reactor
        yield runner.crawl(PatchworkProjectSpider)
        yield runner.crawl(PatchworkSeriesSpider)
        yield runner.crawl(PatchworkPatchSpider)
        reactor.stop()

    crawl()
    reactor.run()

To run the spider, run the following command under /scrapy_docker_app/patchwork_crawler in the container terminal.

python -m patchwork_crawler.spiders.patchwork_api

For more information, visit the Scrapy documentation.

3.1.3. Run spiders using commands

Run the following command in the container terminal.

scrapy crawl <spider-name>

3.2. Customise spiders

Each spider crawls a Patchwork API web page by item id (e.g. patch id -> https://patchwork.ffmpeg.org/api/patches/1/). It automatically increments the item id to crawl the next web page until the id reaches the default or specified limit. The start id, end id, and the endpoint to be crawled can be specified if necessary.
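
Conceptually, the crawling loop of a spider looks like the simplified sketch below (illustration only, not the spider code; the URL pattern is taken from the example above and the id limits are placeholders).

# simplified illustration of the id-increment behaviour described above
start_patch_id, end_patch_id = 1, 1000              # placeholder limits
base_url = "https://patchwork.ffmpeg.org/api/patches/{}/"

for patch_id in range(start_patch_id, end_patch_id + 1):
    url = base_url.format(patch_id)
    print(url)  # the spider requests this URL, parses the JSON response, and yields the item
    # 404/500 responses are tolerated via HTTPERROR_ALLOWED_CODES (see the script in 3.1.2)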

3.2.1. Customise spiders in scrapyd

Pass arguments in the command. Each argument should follow a -d option.

# crawl projects
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_project_id=<specified-id> -d end_project_id=<specified-id> -d endpoint_type=<endpoint-name>

# crawl series
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_series_id=<specified-id> -d end_series_id=<specified-id> -d endpoint_type=<endpoint-name>

# crawl patches
curl http://localhost:6800/schedule.json -d project=default -d spider=<spider-name> -d start_patch_id=<specified-id> -d end_patch_id=<specified-id> -d endpoint_type=<endpoint-name>

3.2.2. Customise spiders from script

In the provided structure, specify the argument in the crawlers.

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(PatchworkProjectSpider, start_project_id=..., end_project_id=..., endpoint_type=...)
        yield runner.crawl(PatchworkSeriesSpider, start_series_id=..., end_series_id=..., endpoint_type=...)
        yield runner.crawl(PatchworkPatchSpider, start_patch_id=..., end_patch_id=..., endpoint_type=...)
        reactor.stop()

3.2.3. Customise spiders run using commands

Similar to customisation in scrapyd, but with the argument option -a.

# take crawling project as an example
scrapy crawl <spider-name> -a start_project_id=<specified-id> -a end_project_id=<specified-id> -a endpoint_type=<endpoint-name>

3.3. Data Process and Import

After crawling data from Patchwork, data can be processed and imported to the database with the help of the application layer.

3.3.1. Data process

The application layer falls into two parts: AccessData for 1) importing data into the database, 2) accessing JSON files on the local machine, and 3) querying data from the database; and ProcessIdentity, ProcessMailingList, and ProcessPatch for processing data, where ProcessIdentity identifies individuals (i.e. unique developers) within each project, ProcessMailingList sorts the mailing list data, and ProcessPatch groups related code review activities of the same proposed patch.

It is advised to run the automated approaches, i.e. identity (email alias) grouping, exact bag-of-words grouping, and one-word difference grouping, provided in ProcessIdentity and ProcessPatch, before data import. An implementation example can be found in implementation.ipynb.

Note that identity data come from multiple sources, including project, series, patch, and comment, and thus they are stored in different files when data crawling is completed. Therefore, the separate identity files should be merged into one before processing and importing the data.

In addition, newseries, change1, change2, and individual are newly generated data and do not have an original id like the data crawled from Patchwork (see the data attributes in the data dictionary), so their initial original ids need to be specified if the approaches are not being run for the first time.

To specify the original ids, the current maximum original id can be derived from db.<collection_name>.count_documents({}), which returns the number of records in a collection. ProcessIdentity and ProcessPatch accept arguments to specify the starting original id.

import pymongo, application_layer

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["code_review_db"]

individual_original_id = db.patchwork_individual.count_documents({}) + 1

process_identity = application_layer.main.ProcessIdentity(individual_original_id)

Alternatively, a specific number can be specified.

process_identity = application_layer.main.ProcessIdentity(individual_original_id=10)

3.3.2. Data import

Data can be imported into the database using the insert_data() function provided in AccessData. Specifically, specify the data to be imported, or the location of the data, along with its corresponding item type. The item types include identity, project, mailinglist, individual, series, newseries, change1, change2, patch, and comment.

Note that for the msg_content field in the patch and comment datasets and the content field in the mailinglist dataset, the null character is converted to its Unicode escape sequence, i.e. \u0000, when the data are crawled, and thus it has to be converted back.
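
A minimal sketch of that conversion, assuming the escape sequence is stored as the literal six characters \u0000 inside the text fields (restore_null_chars is a hypothetical helper):

def restore_null_chars(text):
    # replace the literal "\u0000" escape sequence with the actual null character
    return text.replace("\\u0000", "\u0000")

# hypothetical msg_content value
sample = "first line\\u0000second line"
print(repr(restore_null_chars(sample)))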

# data has been loaded before import
access_data.insert_data(data=project_data, item_type="project")

# data has not been loaded before import
access_data.insert_data(data="path/to/project/data", item_type="project")

However, the import of each item type should follow a specific order: identity -> project -> mailinglist -> individual -> series -> newseries -> change1 -> change2 -> patch -> comment, unless it is confirmed that the related foreign key data for the data to be imported are already in the database (see the complete ER diagram).
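
A minimal sketch that follows this order programmatically, assuming the processed data are stored as one JSON file per item type under a hypothetical data/ folder:

from application_layer import AccessData

access_data = AccessData()

# foreign-key dependencies require this import order (see the ER diagram)
import_order = ["identity", "project", "mailinglist", "individual", "series",
                "newseries", "change1", "change2", "patch", "comment"]

for item_type in import_order:
    # hypothetical file layout: one file per item type under data/
    access_data.insert_data(data=f"data/{item_type}.json", item_type=item_type)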

4. Data dictionary

This section describes the high-level structure of the dataset. (Note that fields named change1 refer to ExactBoWGroup and those named change2 refer to OWDiffGroup)

A complete ER diagram depicting the database structure can be accessed in dbdiagram.


Collections

Identity

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-people-1; Note that for maintainer identities of a project, the item type is user, e.g. ffmpeg-user-1
email Email of the identity
name Name of the identity
api_url API URL for retrieving the original data in patchwork (Authentication needed as patchwork blocks the access)

Individual

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-individual-1
project The project in which this individual has submitted patches, comments, etc.
identity Identities which belong to the same individual in a project

Project

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-project-1
name Name of the project
repository_url URL of the git repository of the project if applicable
api_url API URL for retrieving the project data in patchwork
web_url URL of the project web page (if applicable)
list_id Id of the mailing list of the project
list_address Email address of the mailing list of the project
maintainer_identity Maintainers' corresponding identity detail

Series

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-series-1
name Name of the series (may be null)
date Created date of the series
version Version of the series
total Number of patches submitted under the series
received_total Number of patches submitted under the series and received by the mailing list
cover_letter_msg_id Message id of the cover letter email (the email that all following patch emails reply to)
cover_letter_content Content of the cover letter email (may be null)
api_url API URL for retrieving the series data in patchwork
web_url URL of the series in patchwork
project Detail of the project that this series belongs to
submitter_identity Submitter's identity detail
submitter_individual Submitter's individual detail

Patch

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-patch-1
name Name of the patch
state Status of the patch (accepted, superseded, new, etc.)
date Date when the patch is submitted
msg_id Message id of the patch, which can be referenced to message id of the original email in the mailing list
msg_content Message content of the patch
code_diff Differences of the code changes in the patch
api_url API URL for retrieving the patch data in patchwork
web_url URL of the patch in patchwork
commit_ref Commit id in the corresponding git repository
in_reply_to Message id of the email that the patch replies to (in most cases, it is the msg_id of a cover letter)
change1 Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping)
change2 Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping)
mailinglist Referencing the original_id in the mailinglist collection
series Referencing the original_id in the series collection
newseries Referencing the original_id in the newseries collection
submitter_identity Referencing the original_id in the identity collection
submitter_individual Referencing the original_id in the individual collection
project Referencing the original_id in the project collection

Comment

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and original Django id presented in the api data, e.g. ffmpeg-comment-1
msg_id Message id of the comment, which can be referenced to message id of the original email in the mailing list
msg_content Content of the comment
date Date when the comment is submitted
subject Email subject of the comment
in_reply_to Message id of the patch that the comment replies to
web_url URL of the comment in patchwork
change1 Referencing the original_id in the ExactBoWGroup collection (generated by Exact BoW Grouping)
change2 Referencing the original_id in the OWDiffGroup collection (generated by One-word Difference Grouping)
mailinglist Referencing the original_id in the mailinglist collection
submitter_identity Referencing the original_id in the identity collection
submitter_individual Referencing the original_id in the individual collection
patch Referencing the original_id in the patch collection
project Referencing the original_id in the project collection

NewSeries

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-newseries-1
cover_letter_msg_id In-reply-to id of patches; Referenced by the reply_to_msg_id in the patches collection
project Referencing the original_id in the project collection
submitter_identity Referencing the original_id in the identity collection
submitter_individual Referencing the original_id in the individual collection
series Referencing the original_id in the series collection
inspection_needed Indicates that the corresponding data item might have some problems and manual checking might be needed

ExactBoWGroup / OWDiffGroup

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-change1-1, ffmpeg-change2-1
is_accepted Whether the improved patch in the change set is accepted
parent_commit_id Commit id of the previous version before this change is merged into the git repository
merged_commit_id Commit id of the merge of this change into the git repository
commit_date Date of commit
project Referencing the original_id in the projects collection
submitter_identity Referencing the original_id in the accounts collection
submitter_individual Referencing the original_id in the users collection
series Referencing the original_id in the series collection
newseries Referencing the original_id in the newseries collection
inspection_needed Indicates that the corresponding data item might have some problems and manual checking might be needed
patches Details of the list of related patches
comments Details of the list of related comments

MailingList

Fields Description
_id Object id created by MongoDB
id Auto-increment id generated by Django
original_id A combination of OSS community name, item type, and an auto-generated index, e.g. ffmpeg-mailinglist-1
msg_id Message id of the email, referencing the msg_id of patches/comments
subject Email subject
content Email content
date Date when the email is sent
sender_name Name of the sender of the email
web_url URL of the original email in the mailing list
project Referencing the original_id in the projects collection

4.1. Application filter

When accessing dataset via Python application, the fields that are available for filtering in each collection are different.

For filter types exact and -, directly use the field name to filter. For example, id=1.

For filter types icontains, gt, and lt, the filter type, prefixed with two underscores, has to be appended right after the field name. For instance, original_id__icontains=ffmpeg.
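
For example, several patch filters can be combined in one query string and passed to retrieve_data as in Section 2.2.2. The snippet below is a sketch; the date value format is an assumption and may need to match the stored representation.

from application_layer import AccessData

access_data = AccessData()

# accepted FFmpeg patches submitted after 1 Jan 2023 whose message mentions "leak"
filter = ("project_original_id=ffmpeg-project-1"
          "&state=accepted"
          "&date__gt=2023-01-01"           # date format is an assumption
          "&msg_content_contain=leak")
retrieved_patches = access_data.retrieve_data("patch", filter)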

Identity

Available field Available filter type
id exact
original_id exact, icontains
username exact, icontains
email exact, icontains
user_original_id exact

Individual

Available field Available filter type
id exact
original_id exact, icontains

Project

Available field Available filter type
id exact
original_id exact, icontains
name exact, icontains

Series

Available field Available filter type
id exact
original_id exact, icontains
name exact, icontains
date gt, lt
cover_letter_content_contain -
project_original_id exact
submitter_account_original_id exact
submitter_user_original_id exact

Patch

Available field Available filter type
id exact
original_id exact, icontains
name exact, icontains
state exact
date gt, lt
msg_content_contain -
code_diff_contain -
commit_ref exact
change1_original_id exact
change2_original_id exact
mailing_list_original_id exact
series_original_id exact
new_series_original_id exact
project_original_id exact
submitter_account_original_id exact
submitter_user_original_id exact

Comment

Available field Available filter type
id exact
original_id exact, icontains
subject exact, icontains
date gt, lt
msg_content_contain -
change1_original_id exact
change2_original_id exact
mailing_list_original_id exact
patch_original_id exact
project_original_id exact
submitter_account_original_id exact
submitter_user_original_id exact

NewSeries

Available field Available filter type
id exact
original_id exact, icontains
project_original_id exact
inspection_needed exact

ExactBoWGroup / OWDiffGroup

Available field Available filter type
id exact
original_id exact, icontains
is_accepted exact
parent_commit_id exact
merged_commit_id exact
project_original_id exact
inspection_needed exact

MailingList

Available field Available filter type
id exact
original_id exact, icontains
msg_id exact
subject exact
date gt, lt
sender_name exact
project_original_id exact

5. BibTeX Citation

@inproceedings{Liang2024,
  bibtex_show = {true},
  title = {Curated Email-Based Code Reviews Datasets},
  author = {Liang, Mingzhao and Charoenwet, Wachiraphan and Thongtanunam, Patanamon},
  booktitle = {Proceedings of the IEEE/ACM International Conference on Mining Software Repositories},
  abbr = {MSR},
  pages = {to appear},
  year = {2024},
  doi = {},
  html = {},
  pdf = {https://www.researchgate.net/publication/378711871_Curated_Email-Based_Code_Reviews_Datasets}
}
