Version control

In a professional environment, notebooks are designed by, say, a data scientist, but the task of running them in production may be handled by a different team. So in general people have to share notebooks. This is best done through a version control system.

Jupyter notebooks are famous for the difficulty of their version control. Let’s consider our notebook above, with a file size of 3 MB, much of it being contributed by the embedded Plotly library. The notebook is less than 80 KB if we remove the output of the second code cell. And as small as 1.75 KB when all outputs are removed. This shows how much of its contents is unrelated to pure code! If we don’t pay attention, code changes in the notebook will be lost in an ocean of binary contents.

To get meaningful diffs, we use Jupytext (disclaimer: I’m the author of Jupytext). Jupytext can be installed with pip or conda . Once the notebook server is restarted, a Jupytext menu appears in Jupyter:

We click on Pair Notebook with Markdown, save the notebook… and we obtain two representations of the notebook: world_fact.ipynb (with both input and output cells) and world_fact.md (with only the input cells).

Jupytext’s representation of notebooks as Markdown files is compatible with all major Markdown editors and viewers, including GitHub and VS Code. The Markdown version is for example rendered by GitHub as:

As you can see, the Markdown file does not include any output. Indeed, we don’t want it at this stage since we only need to share the notebook code. The Markdown file also has a very clear diff history, which makes versioning notebooks simple.

The world_facts.md file is automatically updated by Jupyter when you save the notebook. And the other way round also works! If you modify world_facts.md with either a text editor, or by pulling the latest contributions from the version control system, then the changes appear in Jupyter when you refresh the notebook in the browser.

In our version control system, we only need to track the Markdown file (and we even explicitly ignore all .ipynb files). Obviously, the team that will execute the notebook needs to regenerate the world_fact.ipynb document. For this they use Jupytext in the command line:

$ jupytext world_facts.md --to ipynb

[jupytext] Reading world_facts.md

[jupytext] Writing world_facts.ipynb

We are now properly versioning the notebook. The diff history is much clearer. See for instance how the addition of the gross domestic products to our report looks like:

Jupyter notebooks as Scripts?

As an alternative to the Markdown representation, we could have paired the notebook to a world_facts.py script using Jupytext. You should give it a try if your notebook contains more code than text. That's often a first good step towards a complete and efficient refactoring of long notebooks: once the notebook is represented as a script, you can extract any complex code and move it to a (unit-tested) library using the refactoring tools in your IDE.

JupyterLab, JupyterHub, Binder, Nteract, Colab & Cloud notebooks?

Do you use JupyterLab and not Jupyter Notebook? No worries: the method above also applies in this case. You will just have to use the Jupytext extension for JupyterLab instead of the Jupytext menu. And in case you were wondering, Jupytext also work in JupyterHub and Binder.

If you use other notebook editors like Nteract desktop, CoCalc, Google Colab, or another cloud notebook editor, you may not be able to use Jupytext as a plugin in the editor. In this case you can simply use Jupytext in the command line. Close your notebook and inject the pairing information into world_facts.ipynb with

$ jupytext --set-formats ipynb,md world_facts.ipynb

and then keep the two representations synchronised with