March 5, 2019 Dataset Open Access

The Software Heritage Graph Dataset

Antoine Pietri; Diomidis Spinellis; Stefano Zacchiroli

Software Heritage is the largest existing public archive of software source

code and accompanying development history: it currently spans more than five

billion unique source code files and one billion unique commits, coming from

more than 80 million software projects.

This is the Software Heritage graph dataset: a fully-deduplicated

Merkle DAG representation of the Software Heritage archive. The dataset links

together file content identifiers, source code directories, Version Control

System (VCS) commits tracking evolution over time, up to the full states of VCS

repositories as observed by Software Heritage during periodic crawls. The

dataset’s contents come from major development forges (including GitHub and

GitLab), FOSS distributions (e.g., Debian), and language-specific package

managers (e.g., PyPI). Crawling information is also included, recording

when and where all archived source code artifacts have been

observed in the wild.
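
To make the idea of a fully-deduplicated Merkle DAG concrete, here is a minimal Python sketch. It is an illustration only, not the identifier scheme Software Heritage actually uses: the SHA-256 hashing, the type prefixes, and the helper names (content_id, directory_id, commit_id, put, store) are all assumptions made for this example. Each node is identified by a hash of its own content plus the identifiers of its children, so identical files and directories collapse into a single node no matter how many commits reference them.

```python
import hashlib
from typing import Dict, List

# Hypothetical in-memory node store: node id -> payload.
# Identical nodes map to the same id, so they are stored only once.
store: Dict[str, object] = {}


def put(node_id: str, payload: object) -> str:
    """Store a node unless a node with the same id already exists."""
    store.setdefault(node_id, payload)
    return node_id


def content_id(data: bytes) -> str:
    """Identify a file purely by its bytes (assumed scheme: prefixed SHA-256)."""
    return hashlib.sha256(b"content\0" + data).hexdigest()


def directory_id(entries: Dict[str, str]) -> str:
    """Identify a directory by its sorted (name, child id) pairs."""
    h = hashlib.sha256(b"directory\0")
    for name, child in sorted(entries.items()):
        h.update(name.encode() + b"\0" + child.encode() + b"\n")
    return h.hexdigest()


def commit_id(root_dir: str, parents: List[str], message: str) -> str:
    """Identify a commit by its root directory, parent commits, and message."""
    h = hashlib.sha256(b"commit\0")
    h.update(root_dir.encode() + b"\n")
    for parent in parents:
        h.update(parent.encode() + b"\n")
    h.update(message.encode())
    return h.hexdigest()


# Two commits that share an unchanged README but differ in main.py.
readme = put(content_id(b"hello\n"), b"hello\n")
main_v1 = put(content_id(b"print(1)\n"), b"print(1)\n")
main_v2 = put(content_id(b"print(2)\n"), b"print(2)\n")

entries_v1 = {"README": readme, "main.py": main_v1}
entries_v2 = {"README": readme, "main.py": main_v2}
dir_v1 = put(directory_id(entries_v1), entries_v1)
dir_v2 = put(directory_id(entries_v2), entries_v2)

commit_1 = put(commit_id(dir_v1, [], "init"), (dir_v1, []))
commit_2 = put(commit_id(dir_v2, [commit_1], "update"), (dir_v2, [commit_1]))

# Re-adding the README from another "repository" creates no new node.
readme_again = put(content_id(b"hello\n"), b"hello\n")
```

In this sketch the README is stored once even though both directories, and therefore both commits, refer to it. The same property is what lets an archive of many repositories share common files and history instead of duplicating them.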

The Software Heritage graph dataset is available in multiple formats, including

downloadable CSV dumps and Apache Parquet files for local use, as well as a

public instance on the Amazon Athena interactive query service for ready-to-use

powerful analytical processing.

By accessing the dataset, you agree with the Software Heritage Ethical Charter

for using the archive data, and the terms of use for bulk access.

If you use this dataset for research purposes, please cite the following paper:

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.

The Software Heritage Graph Dataset: Public software development under one roof.

In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.

preprint, bibtex

You can also refer to the above paper for more information about the dataset and for sample queries.