✏️ safer: a safer file writer ✏️

Part 1 of the Coroniad

Introduction

This series of articles is aimed at any Python reader past the beginner level. The article describes not just a tiny library that does just one essential thing, but some of the backstory of how it went from being a humble utility in a project somewhere to being a slightly less humble, productionized library.

Update

When you’ve finished this article, see the sequel, safer 2.0, here.

What is safer?

It’s a tiny, single-file Python library for safely writing to files where either the entire file is written, or nothing is changed.

Why would you use it?

In all programming languages, it’s very easy to start writing an important file, and then run into an exceptional condition, and fail to finish, leaving the file broken.

For example, in Python, to write a JSON file you often see code like this:

with open(filename, 'w') as fp:
    json.dump(data, fp)

( fp is a stream: a Python object that reads and/or writes strings.)

And that code might work perfectly well for years. But one day, something goes wrong — say, by mistake the wrong sort of object gets into the data dictionary, maybe under some conditions that your tests don’t cover — and json.dump() raises an exception.

Which would be easily fixed by pushing out a new release…except that the JSON file is already partially written, and the data file is corrupted.

One bad software release and users’ data files are destroyed. Oh no.

With safer , you write almost identical code:

with safer.open(filename, 'w') as fp:
    json.dump(data, fp)

Now if json.dump() throws an exception, the original file is unchanged, so your important data file lives to see another day.
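You can reproduce the failure mode directly. In this sketch (the non-serializable object() stands in for the “wrong sort of object”), json.dump() raises partway through, and closing the stream flushes a truncated fragment to disk:

```python
import json
import os
import tempfile

# 'b' maps to an object json cannot serialize, so json.dump() raises
# TypeError after it has already written the first key/value pair.
data = {'a': 1, 'b': object()}
path = os.path.join(tempfile.mkdtemp(), 'data.json')

try:
    with open(path, 'w') as fp:
        json.dump(data, fp)
except TypeError:
    pass

# The file now exists, but holds only a truncated, invalid fragment.
print(repr(open(path).read()))
```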


What can you learn from this code?

It’s short enough to understand completely

The actual executable Python code for v1.0.0 is 28 lines here and two lines there: almost nothing. There is also some wrapping and a couple of decorators, which are harder to grasp than regular old code, but they only add a handful of lines.

Compared with some 150 executable lines of test code, it’s almost nothing. But the tiny target code hides a turbulent history.

I had pulled in working code, with tests, from another project named gitz .

It was working fine on the second commit — and then a couple of dozen commits later I felt it was finished.

I needed safer in another project, but I realized fast that I needed a printer — a callable object that looks like the built-in print() function — and not a writer — a stream that can be written to.

I added a lot of features, kept only a few, and threw the rest away. I found a lot of edge cases and tweaked it. It grew much bigger, and then I cut it down again.

And a lot of renaming and re-renaming went on to get things as clear as absolutely possible.

It’s a chance to study Python context managers

One of the nice things about Python is that so many parts of the language are immediately understandable and elegant.

Context managers are just as elegant, but somewhat more difficult to understand. Luckily, you can use a context manager perfectly successfully without really understanding how it works: I certainly did for a long time.

When you first start Python, you might write data to a JSON file like this:

fp = open(filename, 'w')

json.dump(data, fp)

It works for you, but it turns out there’s a subtle trap there that might one day bite you.

Let me ask you a question here: in the Python code above, when does the data get written to the file on disk?

Trick question: nowhere in that code. In fact, the fragment above is not, technically speaking, guaranteed to write any data to the file at all. Surprise!

Oh, json.dump() certainly writes to the stream fp , but most streams, like the one you get from open() , are buffered: it would be horribly inefficient to write each tiny bit of data to the actual file on disk, so the stream writes data to an internal buffer with a fixed size, and that buffer only gets written to the file when it fills up, or when the stream is closed.

If data isn’t very big, it probably won’t overflow the buffer, so it won’t get written until fp is closed.

So what closes that stream fp ?

It is Python’s famous garbage collector that does the dirty work.

After the code above is executed, fp is not referenced anymore, and “probably very soon” the garbage collector finds the stream and calls the stream’s destructor, which closes the stream, which flushes its buffer, which actually writes to the file.

The garbage collector is amazingly good, but it isn’t perfect, and there are no hard guarantees on how quickly this will happen. What’s more, the garbage collector doesn’t even guarantee to call the destructor on every object, though the cases where it doesn’t are very rare.

So the actual answer is that you don’t know for 100%-guaranteed-every-single-time sure when or even if it gets written.

In practice, the data is nearly always written except in the most unusual circumstances, and it “nearly always” happens “almost immediately”… but sometimes that just isn’t good enough, particularly if some other thread or process immediately starts reading the incomplete file, or if you accidentally keep the reference to the stream so it’s never garbage collected.
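You can watch the buffer at work with a small experiment (the path here is a throwaway temporary file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buffered.txt')

fp = open(path, 'w')
fp.write('hello')  # lands in the stream's internal buffer, not on disk

# The file exists but is still empty: 'hello' hasn't reached the disk.
size_before = os.path.getsize(path)

fp.close()  # closing flushes the buffer, which finally writes the file
size_after = os.path.getsize(path)

print(size_before, size_after)  # 0 5
```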

Managing resources

A resource is some software thing with a limited lifespan that you, the programmer, need to manage.

Resources are potentially expensive and important: memory that’s been allocated, a socket or database connection, or a cursor within a database query, or a temporary file, or many other things you can’t just drop on the floor and hope it all works out.

The Pythonic way to manage resources is with a context manager, which sets up beforehand, runs a block of code, then cleans up afterwards.

So now, instead of

fp = open(filename, 'w')

json.dump(data, fp)

where fp almost certainly gets closed very soon, you write

with open(filename, 'w') as fp:
    json.dump(data, fp)

where fp is definitely closed once the block is finished, whether or not an exception is raised.
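Here’s a quick check that the with block really does close the stream even when the body raises (the filename is a throwaway temporary file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out.txt')

try:
    with open(path, 'w') as fp:
        fp.write('partial')
        raise ValueError('something went wrong mid-write')
except ValueError:
    pass

# The with block closed (and therefore flushed) the stream anyway.
print(fp.closed, repr(open(path).read()))  # True 'partial'
```

Note that the flush-on-close is exactly why the partially written data still reaches the disk.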

This is better, but, if an exception is raised, then the file will be partially overwritten, and you probably don’t want that.

And so safer.open() is a context manager that offers exactly the same functionality you get from open() , but goes one step further by not actually overwriting the original file until the context exits.

There are two ways to create a context manager, and safer uses the easiest one: the decorator contextlib.contextmanager .

Here’s the place in the code where safer uses the decorator, and here’s where the actual context manager is implemented: the yield statement on line 84 is where the user gets given the stream fp to write to.
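To make the idea concrete, here is a minimal sketch of the same technique — this is not safer’s actual code, and atomic_open is a made-up name: write everything to a temporary file in the same directory, and only rename it over the original when the block succeeds.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_open(filename, mode='w'):
    """Yield a stream to a temporary file; commit it only on success."""
    # The temporary file lives in the same directory as the target,
    # so the final rename never crosses a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(filename)))
    try:
        with os.fdopen(fd, mode) as fp:
            yield fp                  # the caller writes here
        os.replace(tmp, filename)     # success: atomically commit the file
    except BaseException:
        os.remove(tmp)                # failure: discard the temporary file
        raise


path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with atomic_open(path) as fp:
    fp.write('original')

try:
    with atomic_open(path) as fp:
        fp.write('corrupt')
        raise RuntimeError('failure mid-write')
except RuntimeError:
    pass

print(open(path).read())  # the original contents survive
```

The rename is the key design choice: os.replace() is atomic on POSIX filesystems, so readers see either the old file or the new one, never a half-written mixture.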


safer uses functools.wraps() for something that isn’t a decorator

functools.wraps() lets your function copy the signature and documentation of another function.

A decorator is a function that wraps other functions, so you should always use functools.wraps() if you are writing a decorator, and up until now I had never used it for anything else.

But safer has two public functions — safer.open() and safer.printer() . Both functions have the same signature and very similar documentation.

In earlier versions of safer , the documentation and signature were duplicated and got out of sync more than once. Oh no.

functools.wraps() copies the signature and documentation of safer.open onto safer.printer , so the definition of safer.printer() looks like this:

@functools.wraps(open)
@contextlib.contextmanager
def printer(name, mode='w', *args, **kwargs):
    with open(name, mode, *args, **kwargs) as fp:
        yield functools.partial(print, file=fp)

and help(safer.printer) still shows the full signature and documentation.

One source of error gone, less code, and I keep the correct function signature: it’s like free money.

I’ve run into this issue of massive duplication between function signatures in several projects, and this is the first time I’ve thought of using functools.wraps() to solve it.
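Outside of safer , the same trick works between any two functions that share a signature. In this toy example (both functions are made up for illustration), functools.wraps() copies greet ’s docstring and signature onto shout :

```python
import functools
import inspect


def greet(name, punctuation='!'):
    """Return a friendly greeting for name."""
    return 'Hello, ' + name + punctuation


@functools.wraps(greet)
def shout(name, punctuation='!'):
    return greet(name, punctuation).upper()


# shout now reports greet's docstring and signature to help() and
# inspect -- note that wraps() copies __name__ across as well.
print(shout.__doc__)
print(inspect.signature(shout))
print(shout('world'))  # HELLO, WORLD!
```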

safer works on every version of Python, including 2.7

(Update: this became obsolete as of release v3.0.0; the last Python 2.7-compatible implementation is here.) Python 2.7 is the only remaining version of Python 2, and it’s rather a curiosity to be writing new code that supports it.

Python 2 is supposed to be dead and gone, and the porting procedure from 2 to 3 is near trivial and can be done incrementally, one file at a time, but some people are still on legacy systems. The lazy need to be provided for too!

Me, I stopped supporting Python 2 last year, and I had safer mostly written before I realized that I was using almost nothing from Python 3 at all.

So for a lark, I ported it to Python 2 (and did old Python 3.4 at the same time).

This was not my greatest lark. Even though the changes to the code itself were tiny, the tests relied on several Python 3 features, and to find a neat way around that I had to write all this and then rewrite every test. Later on I made changes that broke in just one version (2.7 or 3.4), though I quickly learned to test those two versions during development.

But it worked. And so you can use this on any Python codebase.

The README.rst is automatically generated with almost no machinery

If you write Python code that you want other people to use, you need to make sure that the documentation in the source file ends up being documentation on the web.

For a one-file library, you can put it all into one file, often named README.rst .

You still don’t want to have to change the documentation by hand every time you change the code, so you need a tool to extract the code documentation into web documentation.

Unfortunately, Sphinx, the industry-standard documentation tool for Python, is both complex and complicated, and in previous projects I and other developers have struggled to set it up, then struggled further to make seemingly minor tweaks.

Having to debug your documentation is demoralizing and a waste of time. It’s particularly annoying when your whole darned program is only one file in the first place.

This time, I asked, “What’s the least amount of work I’d have to put in to do it myself? How can I make a useful document extractor just for this project in the absolute minimum amount of work?” (Because I love to work, but I also want to do the least possible amount of work to get results.)

And here it is, doc_safer.py . It took me much, much less time than wrestling with Sphinx ever did.

Now I never directly touch README.rst at all, but instead change the comments in safer.py and run doc_safer.py .

To be fair, Sphinx can write several formats including html , and I just wanted to generate .rst files, so I simply put a little reStructuredText into the Python source code. This changes the problem from sophisticated parsing and productions to mostly just copying things around.

The real work is just in these few lines here, which use Python’s inspect module to pull the signatures and documentation out from each public function in safer . There’s a bit of a hack to get the EXAMPLES section that I wanted, but overall it came out nicely, and I’m thinking of extracting it into a tiny tool for a later article.
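This is not doc_safer.py itself, just a sketch of the inspect calls it relies on ( extract_docs and sample are made-up names here):

```python
import inspect


def extract_docs(functions):
    """Render reStructuredText sections from live signatures and docstrings."""
    sections = []
    for fn in functions:
        header = '``%s%s``' % (fn.__name__, inspect.signature(fn))
        # Underline the header, reStructuredText-style.
        sections.append('%s\n%s\n\n%s\n'
                        % (header, '-' * len(header), inspect.getdoc(fn)))
    return '\n'.join(sections)


def sample(filename, mode='w'):
    """Open filename for writing, but safely."""


print(extract_docs([sample]))
```

Because the signatures and docstrings come from the live objects, the generated document can never drift out of sync with the code.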

It’s a good template for a single-file library

This first commit has everything: a single file, a test, an installer, flake8 configuration, and a continuous integration setup with Travis (you need to sign up for it, but it’s free for open source).

And then I rewrote it!

See part 2 here.

Thanks for reading!

If you want to read more: