One of the 1 million+ notebooks we scraped from GitHub in July 2017. This notebook combines code, visualizations, and text to create an effective computational narrative.

This is a guest post on how members of the Jupyter community publish code, visualizations, and text using Jupyter Notebooks. We’re excited the Design Lab is sharing their research and data on the blog. If you have a post relevant to the community you’d like to share on the Jupyter Blog, please contact us. -The Jupyter Team

In July 2017, my team in the Design Lab at UC San Diego scraped and analyzed over 1 million Jupyter Notebooks from GitHub. Today I am excited to announce we are making these data publicly available for you to explore! While only a snapshot of one corner of the Jupyter universe, these data provide a unique perspective into how people use and share Jupyter Notebooks.

The collection includes over 1 million notebooks as well as metadata about the nearly 200,000 repositories where they lived. The full dataset is nearly 600 GB, so we have created a smaller 5 GB sampler dataset for you to get started. The sampler includes roughly 6,000 notebooks from 1,000 repositories.

We originally collected these data to explore how people use narrative text in Jupyter Notebooks. We found many notebooks, even those accompanying academic publications, had little in the way of descriptive text. This is likely because many analysts view their notebooks as personal and messy works-in-progress. On the other hand, many of the notebooks we collected were masterpieces of computational narrative, elegantly explaining complex analyses (one notebook even had more text than The Great Gatsby). We think this spread reflects a tension between data exploration, which tends to produce messy notebooks, and process explanation, in which analysts clean and organize their notebooks for a particular audience.

Over 25% of the 1 million+ notebooks we collected from GitHub had no descriptive text, yet some rivaled classic novels in length.
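For readers who want to try a similar measurement on the sampler dataset, here is a minimal sketch of counting the words of descriptive text in a single notebook. It assumes the notebook follows the nbformat 4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`), and it treats whitespace-separated tokens in markdown cells as words. The function names `cell_source` and `markdown_word_count` are hypothetical and are not taken from our analysis code.

```python
def cell_source(cell):
    """Return a cell's source as one string.

    In nbformat 4, "source" may be a single string or a list of lines.
    """
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def markdown_word_count(notebook):
    """Count whitespace-separated tokens across all markdown cells.

    Assumption: `notebook` is a dict parsed from an nbformat 4 .ipynb file,
    for example with json.load(). Markdown syntax such as "#" counts as a token.
    """
    return sum(
        len(cell_source(cell).split())
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "markdown"
    )


# A tiny hand-written notebook used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some descriptive text."]},
        {"cell_type": "code", "source": "print('hi')"},
    ]
}

word_count = markdown_word_count(example)
```

A notebook with no markdown cells returns 0 under this sketch, which is one simple way to flag notebooks with no descriptive text.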

Beyond simply counting lines and words, we also looked at how authors organized their code and text. For example, most notebooks had markdown headers and nearly a third linked to other resources. Most notebooks had code comments and over a third defined new functions.

Analyzing these data helped us see how people organize notebook code and text.
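The organizational features mentioned above (markdown headers, links, code comments, and function definitions) can be approximated with a few regular expressions. The sketch below is a rough heuristic and not the method used in our paper: it assumes nbformat 4 notebooks containing Python code, it only detects comments that start a line, and the name `notebook_features` is hypothetical.

```python
import re

# Heuristic patterns; each one is an assumption about what counts as a feature.
HEADER_RE = re.compile(r"^ {0,3}#{1,6}\s", re.MULTILINE)    # markdown ATX header
LINK_RE = re.compile(r"\[[^\]]+\]\([^)]+\)|https?://")      # markdown link or bare URL
COMMENT_RE = re.compile(r"^\s*#", re.MULTILINE)             # Python comment starting a line
FUNCTION_RE = re.compile(r"^\s*def\s+\w+\s*\(", re.MULTILINE)  # Python function definition


def joined_source(notebook, cell_type):
    """Concatenate the source of every cell of the given type."""
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == cell_type:
            source = cell.get("source", "")
            parts.append("".join(source) if isinstance(source, list) else source)
    return "\n".join(parts)


def notebook_features(notebook):
    """Report which organizational features appear in a notebook dict."""
    markdown = joined_source(notebook, "markdown")
    code = joined_source(notebook, "code")
    return {
        "has_header": bool(HEADER_RE.search(markdown)),
        "has_link": bool(LINK_RE.search(markdown)),
        "has_comment": bool(COMMENT_RE.search(code)),
        "defines_function": bool(FUNCTION_RE.search(code)),
    }


# A tiny hand-written notebook used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "source": "## Results\nSee [docs](https://example.org)."},
        {"cell_type": "code", "source": "# load data\ndef load(path):\n    return path"},
    ]
}

features = notebook_features(example)
```

Regular expressions like these will misclassify some cases, such as a `#` inside a multi-line string, so treat the results as estimates.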

We will be presenting the full results of our analysis in April at the 2018 ACM CHI Conference on Human Factors in Computing Systems, and you can read more about our work in this preprint copy of our paper. In the meantime, our team has moved on to developing tools that take some of the effort out of cleaning and organizing Jupyter Notebooks.

There is so much left to explore in the data we collected. We are excited to see what you do with them! Thank you to the UC San Diego Library for graciously hosting the data. If you encounter a problem downloading them, please open an issue on this GitHub repo.