GitHub - NUS-Curiosity/VulZoo: VulZoo: A Comprehensive Vulnerability Intelligence Dataset (ASE 2024 Demo)

Introduction

VulZoo is a large-scale vulnerability intelligence dataset that integrates various sources of structural and non-structural data. It is designed to be used by security researchers, penetration testers, and security analysts to get a comprehensive view of vulnerabilities and their associated data.

This dataset is divided into two parts: raw data and processed data.

raw-data/: contains the raw data from different sources.
processed/: contains the processed data that is extracted or converted from the raw data.

VulZoo aims to provide the most comprehensive profiling of vulnerabilities for downstream tasks, e.g., vulnerability detection, assessment, prioritization, exploitation, and mitigation.

The following figure shows the conceptual overview of VulZoo:

[Figure: conceptual overview of VulZoo]

README.md in processed/ provides more details about the processed data.

Quick Start

If the existing data in VulZoo meets your needs, you can clone this repository without the --recurse-submodules option:

git clone https://github.com/NUS-Curiosity/VulZoo

The dataset is in the processed/ directory. If you need up-to-date data, please follow the data management process below.

Data Management

Pre-requisites:

Python 3.6+
Disk space: 25GB+
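
The two pre-requisites above can be checked before cloning with submodules. The following is a minimal sketch and not part of the VulZoo scripts: the function name `check_prerequisites` is hypothetical, and "25GB" is assumed to mean 25 * 10^9 bytes.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)            # "Python 3.6+" from the pre-requisites
MIN_FREE_BYTES = 25 * 10**9    # "25GB+", assumed to mean 25 * 10^9 bytes


def check_prerequisites(path=".", version=None, free_bytes=None):
    """Return a list of problems found; an empty list means both checks pass.

    `version` and `free_bytes` can be passed explicitly (useful for testing);
    by default they are read from the running interpreter and from `path`.
    """
    if version is None:
        version = sys.version_info[:2]
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if tuple(version) < MIN_PYTHON:
        problems.append(
            "Python {}.{} found, 3.6+ required".format(version[0], version[1])
        )
    if free_bytes < MIN_FREE_BYTES:
        problems.append(
            "{:.1f} GB free, 25 GB+ required".format(free_bytes / 10**9)
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the directory where you plan to clone the repository, since free space is measured on the filesystem that holds `path`.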

VulZoo is composed of both git-based and non-git-based sources. The git-based sources are from upstream repositories and organized as git submodules in this repository. The non-git-based sources are crawled and maintained in this repository. To get started, clone the repository with the following command:

git clone --recurse-submodules https://github.com/NUS-Curiosity/VulZoo

VulZoo provides some useful scripts to help you manage the data. As some scripts require specific Python packages, it is recommended to install the required packages first:

pip install -r requirements.txt

You can run the sync-raw-data.sh script to incrementally update the local raw data:

./sync-raw-data.sh

Then, you can run the sync-processed.sh script to process the raw data and synchronize the processed data with the latest raw data:

./sync-processed.sh
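
The two steps above depend on each other: the processed data should only be rebuilt after the raw data has been updated successfully. A minimal sketch of a wrapper that enforces this order is shown below. The `run_sync` helper is hypothetical and not shipped with VulZoo; it assumes a POSIX system and that both scripts are executable in the repository root.

```python
import subprocess

# Order matters: raw data first, then the processed data derived from it.
SYNC_STEPS = ["./sync-raw-data.sh", "./sync-processed.sh"]


def run_sync(repo_dir=".", run=subprocess.run):
    """Run both sync scripts in order from `repo_dir`.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so the second script never runs if the first one fails.
    `run` is injectable so the function can be exercised without the scripts.
    """
    completed = []
    for script in SYNC_STEPS:
        run([script], cwd=repo_dir, check=True)
        completed.append(script)
    return completed


if __name__ == "__main__":
    run_sync()
```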

P.S.

You can run print-statistics.py to get the statistics of the processed data.
Updating attackerkb-database requires an API key provided by AttackerKB. Please set it via an environment variable and run sync-attackerkb.py in scripts/raw-data manually.
The CPE dictionary is too large to be uploaded to GitHub. Please run the sync-cpe.sh scripts in both scripts/raw-data and scripts/processed locally.
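
For the AttackerKB note above, it can help to verify that the key is set before starting the sync. The sketch below is illustrative only: this README does not name the environment variable, so `ATTACKERKB_API_KEY` is a hypothetical placeholder. Check sync-attackerkb.py for the name it actually reads.

```python
import os

# Hypothetical name: the README does not state which variable the script reads.
API_KEY_VAR = "ATTACKERKB_API_KEY"


def require_api_key(env=None, name=API_KEY_VAR):
    """Return the API key from `env` (default: os.environ) or raise if unset."""
    if env is None:
        env = os.environ
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            "Set {} before running scripts/raw-data/sync-attackerkb.py".format(name)
        )
    return key
```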

Data Sources

Structural

Non-structural

Hybrid

Linux Kernel Vulns

Citation

If you use this dataset, please cite the VulZoo paper:

@article{ruan2024vulzoo,
      title={VulZoo: A Comprehensive Vulnerability Intelligence Dataset}, 
      author={Bonan Ruan and Jiahao Liu and Weibo Zhao and Zhenkai Liang},
      year={2024},
      eprint={2406.16347},
      eprinttype={arXiv}
}
