Our unit of value at Quilt Data is data packages: like code packages, but for data. We wanted data packages to look and feel as much like code packages as possible — right down to how you import them.

Unfortunately import wasn’t an immediate fit for data packages; the default import logic is intended for code modules, which point to a .py file somewhere on disk, whilst our data packages are objects, and are backed by .json files.

Luckily imports are extremely hackable!

We were able to get the import behavior we wanted by building our own module loader:finder pair. In this blog post we’ll learn how this feature works — and how you can use it yourself to enable Python imports on almost anything.

Here’s how we used this feature in t4:

# code dependencies
import pandas as pd
import numpy as np

# data dependencies
from t4.data.aleksey import fashion_mnist
from t4.data.quilt import open_images

# your code
# ???
# profit!

To get this code snippet working, we had to “teach” Python how to construct data packages (e.g. aleksey/fashion_mnist) out of JSON files.

A crash course on import

Before we can hack import we first have to know a little bit about how it works.

The first time that you try to import foo, Python will scan a pre-selected group of paths for an importable thing with the name foo. You can see the list of paths that it rummages through using sys.path. For example, on my machine:

In [1]: import sys; sys.path
Out[1]:
['',
 '/Users/alex/miniconda3/envs/quilt-dev/bin',
 '/Users/alex/miniconda3/envs/quilt-dev/lib/python36.zip',
 '/Users/alex/miniconda3/envs/quilt-dev/lib/python3.6',
 '/Users/alex/miniconda3/envs/quilt-dev/lib/python3.6/lib-dynload',
 '/Users/alex/miniconda3/envs/quilt-dev/lib/python3.6/site-packages',
 ...
]

The first path in sys.path is usually '', which is an alias for “the current directory”. The next handful of paths are the various locations where Python code packages live. Python scans these paths in order, and uses the first matching name it finds.

So for instance, if you were to create and pip install a new package named os , when you run import os you will get the stdlib os module instead. That’s because stdlib os is in the python3.6 directory, whereas your version of os is in site-packages — which is further down the list.

Editing sys.path is the easiest way to make your own code importable. If you want to make something.py importable, all you have to do is append the path to the directory containing that file to your sys.path.
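As a rough illustration of the sys.path trick, here is a minimal sketch. The module name something_demo and the temporary directory are made up for this example; nothing here comes from t4.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory containing a module file.
# "something_demo" is a hypothetical name used only for this sketch.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "something_demo.py").write_text("GREETING = 'hello from something_demo'\n")

# Until its directory is on sys.path, the module cannot be found.
try:
    import something_demo
    found_before = True
except ImportError:
    found_before = False

# Append the directory, then import as usual.
sys.path.append(str(module_dir))
importlib.invalidate_caches()  # make sure the path finder sees the new file
import something_demo

print(found_before, something_demo.GREETING)
```

Appending (rather than inserting at the front) means anything earlier on sys.path with the same name would still win, which matches the ordering behavior described above.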

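Another important feature of Python imports is module caching. Every time you try to import a name, Python first checks the module cache (sys.modules) to see if it has already imported it. If it hasn’t, it tries to import that name, then (if successful) adds it to the module cache. If it has, it skips the import and hands back the cached module object.

This is why you can’t import a package, change the code, then import it again and expect to see your changes: the second import is served from the cache. To pick up the changes without restarting the interpreter, you have to reload the module explicitly, for example with importlib.reload.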
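The caching behavior is easy to see for yourself. The sketch below uses a made-up module called cache_demo written to a temporary directory; it is only meant to show that a second import is answered from sys.modules, and that importlib.reload re-executes the source.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid writing .pyc files so a quick edit can't be masked by stale bytecode.
sys.dont_write_bytecode = True

# "cache_demo" is a hypothetical module created just for this sketch.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "cache_demo.py"
module_file.write_text("VALUE = 1\n")
sys.path.append(str(module_dir))
importlib.invalidate_caches()

import cache_demo
first_value = cache_demo.VALUE

# Change the code on disk after the first import.
module_file.write_text("VALUE = 2  # edited after the first import\n")

import cache_demo  # answered from sys.modules, the file is not re-read
value_after_reimport = cache_demo.VALUE

importlib.reload(cache_demo)  # explicitly re-executes the module source
value_after_reload = cache_demo.VALUE

print(first_value, value_after_reimport, value_after_reload)
```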

Here’s a feature of Python imports that’s much less well known: sys.meta_path.

In [3]: import sys; sys.meta_path
Out[3]:
[_frozen_importlib.BuiltinImporter,
 _frozen_importlib.FrozenImporter,
 _frozen_importlib_external.PathFinder,
 <six._SixMetaPathImporter at 0x10c14ef60>]

The objects in this list are what are known as finders. Every time you import a name, Python presents that name to each of these finders (in precedence order, just like in sys.path). Finders determine whether they can import a particular name; if so, they return a module spec naming a loader object, which then actually, you know, loads it.

So here’s an updated picture of how Python imports work. Python goes through the list of finders in your sys.meta_path, asking each one if it knows how to import a name. The first finder to say it can wins the rights to the import. If no finder knows what to do with a name, Python gives up and raises an ImportError.

So what is sys.path? It’s actually just the list of directories that PathFinder will search through in an attempt to find a code module with the given name. Most Python imports are handled by PathFinder; only a handful of modules, namely built-in modules compiled directly into the interpreter (like sys) and frozen modules, are imported by BuiltinImporter or FrozenImporter further up the list.
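To make the “ask each finder in turn” picture concrete, here is a small sketch that walks sys.meta_path by hand. The helper first_finder_for is a made-up name for this example; the real import machinery does more (caching, locking, parent packages), so treat this as an approximation.

```python
import importlib.util
import sys


def first_finder_for(name):
    """Return the first finder on sys.meta_path that claims a top-level name."""
    for finder in sys.meta_path:
        find_spec = getattr(finder, "find_spec", None)
        if find_spec is None:
            continue  # very old-style finder without find_spec
        if find_spec(name, None) is not None:
            return finder
    return None


# Compare a module built into the interpreter with an ordinary stdlib module.
for name in ("sys", "json"):
    spec = importlib.util.find_spec(name)
    print(name, "claimed by", first_finder_for(name), "loader:", spec.loader)

# When nobody claims a name, the import statement raises ImportError.
try:
    import no_such_module_for_this_demo_12345
    import_failed = False
except ImportError:
    import_failed = True
```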

sys.meta_path is our “in”! In order to implement a new type of module in Python, we need to create a new finder:loader pair for our new module type and append it to sys.meta_path.

Building your own finder:loader pair

I’ll demonstrate how finders and loaders work using the importer code in t4.

The code that follows has only been verified to work in Python 3.6+.

First, the finder object:

DataPackageFinder implements a required find_spec method. It is the job of find_spec to determine whether or not it can import a name. If it can’t, it should return None. If it can, it should return a module specification parameterized with two things: the module name, and the module loader.

In our case this was easy: we just matched any names starting with t4.data. Other use cases may require more complicated name matching; for an example, see the PathFinder code in the Python standard library.
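The shape of a prefix-matching finder can be sketched as follows. This is not the t4 source: the names PrefixFinder, PlaceholderLoader and the demo_pkgs.data prefix are invented for illustration, and the loader does nothing because only find_spec is being shown.

```python
from importlib.abc import Loader, MetaPathFinder
from importlib.machinery import ModuleSpec

# Hypothetical prefix standing in for "t4.data".
MODULE_PREFIX = "demo_pkgs.data"


class PlaceholderLoader(Loader):
    """Do-nothing loader, present only so the finder has something to return."""

    def create_module(self, spec):
        return None  # fall back to default module creation

    def exec_module(self, module):
        pass


class PrefixFinder(MetaPathFinder):
    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        # Claim the prefix itself and anything nested under it.
        if fullname == MODULE_PREFIX or fullname.startswith(MODULE_PREFIX + "."):
            return ModuleSpec(fullname, PlaceholderLoader())
        # Returning None tells Python to keep asking the other finders.
        return None


claimed = PrefixFinder.find_spec("demo_pkgs.data.aleksey")
ignored = PrefixFinder.find_spec("json")
print(claimed, ignored)
```

Making find_spec a classmethod is one possible choice; it lets the class itself, rather than an instance, be placed on sys.meta_path.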

The module loader is more complicated, and requires a bit more study.

DataPackageImporter implements two required methods. The first is create_module. If this method returns None, the default module creator will be used; this is probably what you want, unless you’re doing something super weird.

The second required method is exec_module.

exec_module takes a constructed module object as input: a bare-bones representation of an importable module whose only special characteristic is a __name__ attribute, which has been set to the value of the fullname parameter from DataPackageFinder.find_spec.

In Python, modules are really objects, and objects are really dictionaries. So to extend this module, we hang new “stuff” directly on the module __dict__, e.g. via module.__dict__['foo'] = 'bar'.
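A quick way to convince yourself of this is to build a module object by hand. The module name scratch below is arbitrary; types.ModuleType is the standard way to create an empty module.

```python
import types

# Build an empty module object, much like the one exec_module receives.
module = types.ModuleType("scratch")

# Writing into __dict__ is the same as setting an attribute...
module.__dict__["foo"] = "bar"
via_attribute = module.foo

# ...and setting an attribute shows up in __dict__.
module.answer = 42
via_dict = module.__dict__["answer"]

print(module.__name__, via_attribute, via_dict)
```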

Module names with many parts are executed in a top-down manner. So to import foo.bar.baz, we first import foo, then import foo.bar, and only then import foo.bar.baz. The only restriction on what you can return on import is that you have to return a module object. The module objects you return as you go down the list of module namespaces don’t have to be at all related!

In our case, since we wanted to be able to import using a from t4.data.namespace import packagename pattern, we need logic for the t4, t4.data, and t4.data.namespace names. For t4.data we return an empty module, and for t4.data.namespace we return a module object with the relevant package objects inserted into its module.__dict__, keyed by packagename.

However, since Python requires that you answer imports with a module object, but we want to return a t4.Package object, we can’t and don’t implement logic for import t4.data.namespace.packagename. Instead we tell users to run from t4.data.foo import bar. Running this code imports the t4.data.foo module object, then plucks the bar object out of the module __dict__.
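Putting the pieces together, a finder:loader pair along these lines might look like the sketch below. It is a simplified stand-in, not the t4 implementation: the demo_pkgs root, the CATALOG dictionary (a substitute for packages discovered from JSON files) and the class names are all assumptions made for this example, and plain strings play the role of package objects.

```python
import sys
from importlib.abc import Loader, MetaPathFinder
from importlib.machinery import ModuleSpec

ROOT = "demo_pkgs"  # hypothetical root; in the post this role is played by t4

# Hypothetical catalog: namespace -> {package name -> package object}.
CATALOG = {
    "aleksey": {"fashion_mnist": "package object for aleksey/fashion_mnist"},
    "quilt": {"open_images": "package object for quilt/open_images"},
}


class DemoLoader(Loader):
    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        # Virtual modules still need __path__ so submodules can be imported.
        module.__path__ = []
        parts = module.__name__.split(".")
        if len(parts) == 3:  # demo_pkgs.data.<namespace>
            # Hang every package in the namespace on the module.
            for package_name, package in CATALOG[parts[2]].items():
                module.__dict__[package_name] = package


class DemoFinder(MetaPathFinder):
    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        parts = fullname.split(".")
        known = (
            parts == [ROOT]
            or parts == [ROOT, "data"]
            or (len(parts) == 3 and parts[:2] == [ROOT, "data"] and parts[2] in CATALOG)
        )
        return ModuleSpec(fullname, DemoLoader()) if known else None


sys.meta_path.append(DemoFinder)

from demo_pkgs.data.aleksey import fashion_mnist
from demo_pkgs.data.quilt import open_images

print(fashion_mnist)
print(open_images)
```

Unlike t4, where the top-level package is an ordinary installed package, this sketch makes the root module virtual as well so that it runs on its own.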

All of this logic is implemented in the DataPackageImporter above. Scroll up to the code sample again and see if you now understand what it’s doing!

There’s just a couple more details to keep in mind:

It’s not possible to “look ahead” and check what specific sub-module a user asked for. Because of this, we have to include every possible user request in the module we return (this is why we have to iterate over list_packages() in the code above). This is a conscious design decision: we’re worse off if we only import one object, since we had to do all the additional work of finding all the other things; but better off if we import many objects, since now every object except for the first is just plucked from the module cache.

Every module that you want to import submodules from must have its __path__ attribute set. This is true even if the module is “virtual”, i.e. it doesn’t actually exist on-disk. You can get around this by setting __path__ to an empty list [] (module paths are lists for…reasons).
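The __path__ requirement can be observed directly with two hand-made modules. The names virtual_plain and virtual_pkg are invented for this sketch; the only difference between them is the empty __path__.

```python
import sys
import types

# A virtual module with no __path__: Python refuses to look for submodules.
plain = types.ModuleType("virtual_plain")
sys.modules["virtual_plain"] = plain
try:
    import virtual_plain.child
except ImportError as exc:
    error_without_path = str(exc)

# The same thing with __path__ = []: Python now treats it as a package and
# goes on to ask the finders, which fail only because nobody claims "child".
pkg = types.ModuleType("virtual_pkg")
pkg.__path__ = []
sys.modules["virtual_pkg"] = pkg
try:
    import virtual_pkg.child
except ImportError as exc:
    error_with_path = str(exc)

print(error_without_path)
print(error_with_path)
```

In the second case a custom finder on sys.meta_path would get the chance to claim virtual_pkg.child; in the first case it is never consulted.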

Finally, there’s one more thing we need to do — hang the finder on sys.meta_path :

from imports import DataPackageFinder

import sys

sys.meta_path.append(DataPackageFinder)

That’s it! We’re now ready to import our data packages. Code like from t4.data.foo import bar will “just work”.

Conclusion

Hopefully from reading this blog post you’ve learned a lot about how Python import works, and about how you can extend it yourself.

If you’re not satisfied with what you’ve seen here, and want to dig even deeper into the (super hairy) world of Python imports, I highly recommend watching David Beazley’s classic PyCon talk, “Packages and modules: live and let die!”. Warning though — it’s three hours long. 🙂

For more finder:loader code samples, browse our source code or see the Python stdlib loaders and finders.

If you want to see just how far you can go with import hacks check out “How to use loader and finder objects in Python”, which talks through how a group of researchers hacked import so they could interface with model files written in the Clojure (!) programming language.

Interested in data packages? Help us build them in the Quilt T4 repo.