
VulZoo: A Comprehensive Vulnerability Intelligence Dataset (ASE 2024 Demo)


NUS-Curiosity/VulZoo



Introduction

VulZoo is a large-scale vulnerability intelligence dataset that integrates various sources of structural and non-structural data. It is designed to be used by security researchers, penetration testers, and security analysts to get a comprehensive view of vulnerabilities and their associated data.

This dataset is divided into two parts: raw data and processed data.

  • raw-data/: contains the raw data from different sources.
  • processed/: contains the processed data that is extracted or converted from the raw data.

VulZoo aims to provide the most comprehensive profiling of vulnerabilities for downstream tasks, e.g., vulnerability detection, assessment, prioritization, exploitation, and mitigation.

The following figure shows the conceptual overview of VulZoo:

[Figure: VulZoo Overview]

README.md in processed/ provides more details about the processed data.

Quick Start

If the existing data in VulZoo meets your needs, you can clone this repository without the --recurse-submodules option:

git clone https://github.com/NUS-Curiosity/VulZoo

The dataset is in the processed/ directory. If you need up-to-date data, please follow the data management process below.

Data Management

Pre-requisites:

  • Python 3.6+
  • Disk space: 25GB+
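The two prerequisites above can be checked before cloning. The following is a minimal sketch, not part of VulZoo: the function name check_prerequisites is hypothetical, and "25GB+" is read as 25 decimal gigabytes, which is an assumption.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)          # from the prerequisites list
MIN_FREE_BYTES = 25 * 10**9  # "25GB+", read here as decimal gigabytes (assumption)


def check_prerequisites(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK.

    version and free_bytes can be passed explicitly (useful for testing);
    by default they are read from the running interpreter and from the
    filesystem that contains `path`.
    """
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python %d.%d found, 3.6+ required" % version)
    if free_bytes < MIN_FREE_BYTES:
        problems.append("only %.1f GB free, 25 GB+ required" % (free_bytes / 10**9))
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the directory where you plan to clone the repository, so that the free-space check looks at the right filesystem.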

VulZoo is composed of both git-based and non-git-based sources. The git-based sources are from upstream repositories and organized as git submodules in this repository. The non-git-based sources are crawled and maintained in this repository. To get started, clone the repository with the following command:

git clone --recurse-submodules https://github.com/NUS-Curiosity/VulZoo

VulZoo provides some useful scripts to help you manage the data. As some scripts require specific Python packages, it is recommended to install the required packages first:

pip install -r requirements.txt

You can run the sync-raw-data.sh script to incrementally update the local raw data:

./sync-raw-data.sh

Then, you can run the sync-processed.sh script to process the raw data and synchronize the processed data with the latest raw data:

./sync-processed.sh
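If you want to run both steps from Python (for example from a scheduled job), the sequence above can be wrapped as follows. This is a sketch under assumptions: the helper names build_commands and run_sync are hypothetical, and the scripts are assumed to be executable and to expect the repository root as the working directory.

```python
import subprocess
from pathlib import Path

# Script names as given in this README, in the order the text describes:
# raw data first, then processed data.
SYNC_SCRIPTS = ("sync-raw-data.sh", "sync-processed.sh")


def build_commands(repo_dir):
    """Return one argv list per sync script, using absolute paths."""
    repo = Path(repo_dir).resolve()
    return [[str(repo / name)] for name in SYNC_SCRIPTS]


def run_sync(repo_dir):
    """Run both scripts from the repository root, stopping at the first failure."""
    repo = Path(repo_dir).resolve()
    for argv in build_commands(repo):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so processed data is never rebuilt from a failed raw-data sync.
        subprocess.run(argv, cwd=str(repo), check=True)
```

For example, run_sync("VulZoo") would run the two scripts inside a clone located at ./VulZoo.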

P.S.

  • You can run print-statistics.py to get the statistics of the processed data.
  • Updating attackerkb-database requires an API key provided by AttackerKB. Please set it via an environment variable and run sync-attackerkb.py in scripts/raw-data manually.
  • The CPE dictionary is too large to be uploaded to GitHub. Please run the sync-cpe.sh scripts in both scripts/raw-data and scripts/processed locally.
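For the manual AttackerKB step, a small guard can catch a missing API key before the script starts. This is only a sketch: the helper attackerkb_command is hypothetical, and because this README does not name the environment variable, its name is a required argument that you need to look up in sync-attackerkb.py yourself.

```python
import os
import sys
from pathlib import Path


def attackerkb_command(repo_dir, key_var, environ=None):
    """Build the argv for the manual AttackerKB sync, or raise if the key is unset.

    key_var is the name of the environment variable that holds the API key.
    It is deliberately not given a default here, because the README does not
    state it.
    """
    environ = os.environ if environ is None else environ
    if not environ.get(key_var):
        raise RuntimeError("environment variable %s is not set" % key_var)
    script = Path(repo_dir).resolve() / "scripts" / "raw-data" / "sync-attackerkb.py"
    # Use the current interpreter so the packages from requirements.txt are available.
    return [sys.executable, str(script)]
```

The returned list can be passed to subprocess.run; which working directory the script expects is not stated here, so that choice is left to the caller.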

Data Sources

Structural

Non-structural

Hybrid

Citation

If you use this dataset, please cite the VulZoo paper:

@article{ruan2024vulzoo,
      title={VulZoo: A Comprehensive Vulnerability Intelligence Dataset}, 
      author={Bonan Ruan and Jiahao Liu and Weibo Zhao and Zhenkai Liang},
      year={2024},
      eprint={2406.16347},
      eprinttype={arXiv}
}