Throughput GitHub Scrapers

The Throughput Database was seeded with information obtained by scraping GitHub using an authorized script. The scrapers looked for repositories associated with specific NSF Awards, and for repositories associated with databases defined within Re3Data.

These scripts connect to an instance of the Throughput Neo4j graph database (using the py2neo Python package) and make calls to the GitHub API using the PyGithub package for Python. Each query is wrapped in checks to ensure that the script has not triggered the rate limiter and that the query is successfully returning information that is relevant to Throughput and valid for the award, before the data is posted to the Throughput Database.
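
As a rough illustration, the connection and rate-limit pattern might look like the sketch below; the database URI, credentials, and token are placeholders rather than the scripts' actual values:

import time
from py2neo import Graph
from github import Github

# Connect to the Throughput Neo4j instance (placeholder credentials).
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Authenticate against the GitHub API (placeholder token).
gh = Github("YOUR_GITHUB_TOKEN")

def wait_for_search_quota(client):
    # Pause while the GitHub search API rate limit is exhausted.
    while client.get_rate_limit().search.remaining == 0:
        time.sleep(60)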

Both the NSF scraper and the Re3Data scraper write two files, one for positive matches and one for negative matches, letting us better assess the quality of the returned data.

Contributors

This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done either through pull requests to project forks or branches.

All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.

How to use this repository

This project requires both an instance of a Neo4j graph database and the use of Python virtual environments. Future work may focus on the development of a Docker-ized workflow, but at present there are no plans to do so.

The required packages are located within the file requirements.txt in the root folder of each of the two scrapers. The requirements were generated using the pipreqs package for Python.

To start using one of the scripts, begin by initializing a virtual environment and installing the required packages:

python3 -m venv .
source ./bin/activate
pip install -r requirements.txt

Workflow Overview

Each script links to a Neo4j database and queries a set of objects (NSF Awards or Re3Data databases). This returns an array of records that the script then loops through.

Each object is used as the basis of a specific query to the GitHub code search API. The result is a set of repositories (or no result). In the case of no result, we continue to the next API call. If repositories match, we test each one to determine whether it is of use to Throughput, as sketched below.
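
Continuing the connection sketch above, the loop might look like the following; the Cypher query, node label, and search string here are hypothetical stand-ins for those in the actual scripts:

# Fetch the objects to iterate over (hypothetical Cypher and label).
awards = graph.run("MATCH (a:AWARD) RETURN a.award AS award").data()

for record in awards:
    # Build a code-search query of the form shown in the log example below.
    query = '"NSF  {}" in:file'.format(record["award"])
    wait_for_search_quota(gh)
    results = gh.search_code(query)
    if results.totalCount == 0:
        continue  # no matching repositories; move to the next object
    for match in results:
        repo = match.repository
        # Test the repository for relevance to Throughput here, then
        # post it to the graph and record it in pass_log.txt or
        # fail_log.txt; printing stands in for that step.
        print(repo.full_name)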

System Requirements

This project uses Python v3.7.6 and Neo4j v4.1.1. Development was undertaken on a system running Linux Mint 20.

Data Requirements

The project requires a version of the Throughput Database. A recent snapshot is available here, or the database can be reconstructed (in part) using the code within the throughputdb repository.

Key Outputs

This project adds data directly to the local (or remote) Neo4j database. Two files are created during the execution of the script:

  • pass_log.txt - Records each repository that was returned through the GitHub API request and was added to the Throughput graph.
  • fail_log.txt - Records each repository that was returned through the GitHub API request and was not added to the Throughput graph.

Both files record a single result per line, in JSON format:

{"query": "\"NSF  1541002\" in:file", "text": [" Paleoecology Database and Neotoma data stewards. Work on this paper was supported by NSF Awards NSF-1541002, NSF-1550707 and NSF-1550707.\n\n# REFERENCES\n"]}

It is possible to quickly check how many results have passed or failed using the Linux command wc -l pass_log.txt, which returns the number of lines in the file.

Metrics

This project is evaluated along the following metrics:

  • Successful implementation of NSF Award Scraper
  • Successful implementation of Re3Data Scraper
  • Number of GitHub repositories added to the Throughput Database
