In our post on deep learning environments, we described how we’re using Jupyter-driven Docker containers for a faster on-ramp and better collaboration across several technologies. As we mentioned in that post, we recently put together a collaboration feature that brings Jupyter notebooks into the fold of our existing GitHub workflow. In this post, we shed light on its component parts and security considerations in case anyone else is looking for something similar.

To better integrate Jupyter with our existing development workflow, we wrote a custom Jupyter extension to “Commit-and-Push” directly to GitHub from a notebook. The extension has two core components:

A new button on the frontend, implemented in JavaScript, captures the user’s commit message and the name of the current notebook

A backend Python script receives that data and uses the GitPython module to automate the following tasks:

Check out a new branch
Commit changes to the identified notebook
Push commits to GitHub
Submit a pull request

Commit-and-push extension within a Jupyter notebook

Commit message in Jupyter notebook

GitHub pull request from commit-and-push extension

Security Considerations

Even though we push all our code to GitHub, our Jupyter notebooks connect to several internal data and compute resources. Like most developers who want to keep their jobs, we want to protect internal-only data and account information. Simply put, automatically pushing changes to GitHub requires a few important security considerations:

GitHub Account: Pushing changes to GitHub without a username and password requires SSH keys tied to the account. Rather than attach any of our developers’ private keys to the notebook server, we created a separate GitHub account (we named it gitbot41). When building the Docker image, we ADD the SSH keys (which must be generated prior to building) into the container. That method isolates all notebook changes and SSH key management to a separate account.

GitHub Token: Interacting with GitHub’s API requires an access token. Following Twelve-Factor App best practices, we use Docker’s --env-file runtime option to pass all GitHub parameters into the container as environment variables. That method isolates sensitive variables to a single file we can easily update, as well as omit from our public repo.

Notebook Output: Code executed within Jupyter notebooks sometimes includes sensitive data such as directory listings and server paths. To lessen the possibility of information leaks, our Docker image includes the git filter ipynb_stripout. That filter ensures any output stays isolated to the notebook instead of leaking into public commits, and since it is a straightforward Python script, we could easily extend it with additional text processing or git repo management functionality.
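On the backend, the token-handling pattern amounts to reading everything from the environment that `docker run --env-file` injects. A minimal sketch, assuming hypothetical variable names (GITHUB_TOKEN, GITHUB_REPO) rather than the extension’s actual ones:

```python
import os

# These variables would live in a git-ignored file, e.g. `github.env`:
#   GITHUB_TOKEN=...
#   GITHUB_REPO=owner/notebooks
# and be injected with `docker run --env-file github.env ...`


def load_github_config():
    """Read GitHub settings from the environment, failing loudly if absent."""
    try:
        return {
            "token": os.environ["GITHUB_TOKEN"],
            "repo": os.environ["GITHUB_REPO"],
        }
    except KeyError as missing:
        raise RuntimeError(f"Missing GitHub setting: {missing}") from None
```

Because the code never hard-codes credentials, the same image can run against different accounts by swapping the env file.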

Output stripped from Jupyter notebook

Room for Improvements

We just started using the Commit-and-Push extension on a limited basis to evaluate its utility and points of failure. We’ve already identified a few ways we could make a future version better:

GitHub authorship: The extension commits everyone’s work as a single GitHub user, thereby failing to associate specific commits with individual developers. We’re looking into JupyterHub as a potential way to provide a multi-user notebook environment.

More automation: To use the extension, we must jump through several manual steps on GitHub. For example, we had to create a separate account, generate SSH keys, associate the SSH keys, generate an access token, create and fork the notebook repo, and update environment variables before building and running the container. Most of those steps cannot be avoided, but we’re pretty sure at least a few could be better automated.

Better code filtering: The extension completely strips all notebook output before each commit, which is blunt to say the least. The extension would be better if it made output filtering optional, as well as if it detected specific keywords, phrases, or regular expressions anywhere in the notebook.
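The keyword/regex detection we have in mind could look something like the sketch below, which scans a notebook’s JSON for sensitive patterns before committing. The patterns and function name are ours for illustration, not part of the extension.

```python
import json
import re

# Illustrative patterns only; a real deployment would maintain its own list.
SENSITIVE_PATTERNS = [
    re.compile(r"/home/\w+"),            # leaked server paths
    re.compile(r"(?i)password\s*[:=]"),  # credential assignments
]


def find_sensitive_cells(notebook_json):
    """Return indices of cells whose source or output matches a pattern."""
    flagged = []
    for i, cell in enumerate(json.loads(notebook_json).get("cells", [])):
        text = "".join(cell.get("source", []))
        for out in cell.get("outputs", []):
            text += "".join(out.get("text", []))
        if any(p.search(text) for p in SENSITIVE_PATTERNS):
            flagged.append(i)
    return flagged
```

A pre-commit hook could then refuse to push, or strip output only from the flagged cells, instead of blanking the whole notebook.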

That’s pretty much it. Right now this extension is simply a component of our Docker-based environment. If you think it would be useful to the wider Jupyter ecosystem, go ahead and submit a GitHub issue and we’ll make that a priority. Thanks for reading!