ASKMe QA dataset analysis project

This project contains scripts for downloading and processing the Google Natural Questions, HotpotQA, and TriviaQA datasets, along with Jupyter notebooks for analyzing them. It is meant as a starting point for researchers and developers who want to work with these datasets, offering a simple interface for downloading, processing, and analyzing the data.

This project is part of the ASKMe QA benchmark.

Project Structure

The project is organized into one directory per dataset:

GoogleNQ/
HotpotQA/
TriviaQA/

Each directory contains a script for downloading and processing the respective dataset (download_dataset.sh) and a Jupyter notebook (stats.ipynb) for analyzing it. In addition to downloading the dataset, the GoogleNQ directory's download_and_simplify_dataset.sh script simplifies it into a more manageable format that is better suited to analysis.

Each directory also contains a sample.json or sample.jsonl file holding a single entry from the dataset, to illustrate the structure of the data.
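
To get a quick look at a sample file's structure before opening a notebook, something like the following could be used. It is a minimal sketch: `load_sample` and `top_level_keys` are hypothetical helpers, not part of this repository, and the code assumes that a `.jsonl` sample holds one JSON object per line while a `.json` sample holds a single JSON document.

```python
import json
from pathlib import Path


def load_sample(path):
    """Load a sample file and return its entries as a list.

    Hypothetical helper (not part of this repository). Assumes .jsonl
    files hold one JSON object per line and .json files hold one document.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # Normalize so callers always receive a list of entries.
    return data if isinstance(data, list) else [data]


def top_level_keys(entry):
    """Return the sorted top-level field names of one entry."""
    return sorted(entry.keys()) if isinstance(entry, dict) else []
```

For example, `top_level_keys(load_sample(sample_path)[0])` lists the fields of the first entry, where `sample_path` points to the sample file in the dataset directory of interest.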

The data/ directory is where the raw and processed data files for each dataset are stored.

Getting Started

Clone the repository.
Install the dependencies listed in pyproject.toml using Poetry:

poetry install

Download and process the datasets using the scripts in the respective directories. For example, to download and process the Google Natural Questions dataset:

./GoogleNQ/download_dataset.sh

Note that this script requires the gsutil command-line tool, which can be installed by following the official gsutil installation instructions. The remaining datasets can be downloaded and processed in the same way, without any additional dependencies.
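
Before starting a long download, it may help to confirm that gsutil is actually on the PATH. The following is a minimal sketch using only the Python standard library; `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil


def missing_tools(tools=("gsutil",)):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper (not part of this repository).
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty list from `missing_tools()` means gsutil was found and the GoogleNQ download script should be able to call it.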

License

This project is licensed under the Apache License 2.0, in accordance with Google's Natural Questions dataset. See the LICENSE file for details.
