November 28, 2011 at 07:48 Tags Articles , Python , Python internals

For one of the hobby projects I'm currently hacking on, I recently had to do a lot of binary data processing in memory. Large chunks of data are being read from a file, then examined and modified in memory and finally used to write some reports.

This made me think about the most efficient way to read data from a file into a modifiable memory chunk in Python. As we all know, the standard file read method, for a file opened in binary mode, returns a bytes object, which is immutable:

```python
# Snippet #1
f = open(FILENAME, 'rb')
data = f.read()
# oops: TypeError: 'bytes' object does not support item assignment
data[0] = 97
```

This reads the whole contents of the file into data - a bytes object which is read-only. But what if we now want to perform some modifications on the data? Then we need to somehow get it into a writable object. The most straightforward writable data buffer in Python is a bytearray. So we can do this:

```python
# Snippet #2
f = open(FILENAME, 'rb')
data = bytearray(f.read())
data[0] = 97  # OK!
```

Now, the bytes object returned by f.read() is passed into the bytearray constructor, which copies its contents into an internal buffer. Since data is a bytearray, we can manipulate it.

Although it appears that the goal has been achieved, I don't like this solution. The extra copy made by bytearray is bugging me. Why is this copy needed? f.read() just returns a throwaway buffer we don't need anyway - can't we just initialize the bytearray directly, without copying a temporary buffer?

This use case is one of the reasons the Python buffer protocol exists.

The buffer protocol - introduction

The buffer protocol is described in the Python documentation and in PEP 3118. Briefly, it provides a way for Python objects to expose their internal buffers to other objects. This is useful to avoid extra copies and for certain kinds of sharing. There are many examples of the buffer protocol in use: in the core language, in builtin types such as bytes and bytearray; in the standard library (for example array.array and ctypes); and in 3rd party libraries (some important Python libraries such as numpy and PIL rely extensively on the buffer protocol for performance).

There are usually two or more parties involved in each protocol. In the case of the Python buffer protocol, the parties are a "producer" (or "provider") and a "consumer". The producer exposes its internals via the buffer protocol, and the consumer accesses those internals.

Here I want to focus specifically on one use of the buffer protocol that's relevant to this article. The producer is the built-in bytearray type, and the consumer is a method in the file object named readinto.
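As a quick aside, one way to see the producer side from pure Python is to wrap various built-in types in a memoryview (a consumer) and inspect the buffer they hand over - the readonly flag and size come straight from the buffer protocol. A minimal sketch:

```python
import array

# bytes, bytearray and array.array are all buffer protocol producers;
# memoryview is a simple consumer we can use to peek at their buffers.
for producer in (b'abc', bytearray(b'abc'), array.array('b', [97, 98, 99])):
    view = memoryview(producer)
    # bytes exports a read-only buffer; the other two are writable
    print(type(producer).__name__, view.readonly, view.nbytes)
```

Note how bytes reports a read-only buffer while bytearray and array.array report writable ones - exactly the distinction that makes readinto possible for the latter.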

A more efficient way to read into a bytearray

Here's the way to do what Snippet #2 did, just without the extra copy:

```python
# Snippet #3
import os

f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)
```

First, a bytearray is created and pre-allocated to the size of the data we're going to read into it. The pre-allocation is important - since readinto directly accesses the internal buffer of bytearray, it won't write more than has been allocated. Next, the file.readinto method is used to read the data directly into the bytearray's internal storage, without going through temporary buffers. The result: this code runs ~30% faster than Snippet #2.
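If you want to check the speedup on your own machine, here's a self-contained sketch that writes a throwaway 1 MiB file (a stand-in for FILENAME) and times both approaches with timeit; the exact numbers will of course vary with file size, OS caching and hardware:

```python
import os
import tempfile
import timeit

# Create a throwaway 1 MiB file to read from (a stand-in for FILENAME).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\x00' * (1 << 20))
    filename = tmp.name

def with_copy():
    with open(filename, 'rb') as f:
        return bytearray(f.read())        # Snippet #2: read, then copy

def with_readinto():
    data = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(data)                  # Snippet #3: read in place
    return data

data = with_readinto()
assert data == with_copy()                # both approaches agree on contents
print('copy:    ', timeit.timeit(with_copy, number=100))
print('readinto:', timeit.timeit(with_readinto, number=100))
os.unlink(filename)
```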

Variations on the theme

Other objects and modules could be used here. For example, the built-in array.array class also supports the buffer protocol, so it too can be written to and read from a file directly and efficiently. The same goes for numpy arrays. On the consumer side, the socket module can also read directly into a buffer with the recv_into method. I'm sure that it's easy to find many other sample uses of this protocol in Python itself and some 3rd party libraries - if you find something interesting, please let me know.
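Here's a small sketch of the socket case, using a connected local socket pair (socket.socketpair, available on POSIX and on Windows since Python 3.5) so the example is self-contained:

```python
import socket

# recv_into reads received bytes straight into a pre-allocated,
# writable buffer -- no intermediate bytes object is created.
a, b = socket.socketpair()
a.sendall(b'hello')
buf = bytearray(5)
nread = b.recv_into(buf)
a.close()
b.close()
print(nread, buf)   # 5 bytearray(b'hello')
```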

The buffer protocol - implementation

Let's see how Snippet #3 works under the hood using the buffer protocol. We'll start with the producer. bytearray declares that it implements the buffer protocol by filling the tp_as_buffer slot of its type object. What's placed there is the address of a PyBufferProcs structure, which is a simple container for two function pointers:

```c
typedef int (*getbufferproc)(PyObject *, Py_buffer *, int);
typedef void (*releasebufferproc)(PyObject *, Py_buffer *);

/* ... */

typedef struct {
    getbufferproc bf_getbuffer;
    releasebufferproc bf_releasebuffer;
} PyBufferProcs;
```

bf_getbuffer is the function used to obtain a buffer from the object providing it, and bf_releasebuffer is the function used to notify the object that the provided buffer is no longer needed. The bytearray implementation in Objects/bytearrayobject.c initializes an instance of PyBufferProcs thus:

```c
static PyBufferProcs bytearray_as_buffer = {
    (getbufferproc)bytearray_getbuffer,
    (releasebufferproc)bytearray_releasebuffer,
};
```

The more interesting function here is bytearray_getbuffer:

```c
static int
bytearray_getbuffer(PyByteArrayObject *obj, Py_buffer *view, int flags)
{
    int ret;
    void *ptr;
    if (view == NULL) {
        obj->ob_exports++;
        return 0;
    }
    ptr = (void *) PyByteArray_AS_STRING(obj);
    ret = PyBuffer_FillInfo(view, (PyObject*)obj, ptr, Py_SIZE(obj), 0, flags);
    if (ret >= 0) {
        obj->ob_exports++;
    }
    return ret;
}
```

It simply uses the PyBuffer_FillInfo API to fill the buffer structure passed to it. PyBuffer_FillInfo provides a simplified method of filling the buffer structure, which is suitable for unsophisticated objects like bytearray (if you want to see a more complex example that has to fill the buffer structure manually, take a look at the corresponding function of array.array).

On the consumer side, the code that interests us is the buffered_readinto function in Modules/_io/bufferedio.c.
I won't show its full code here since it's quite complex, but with regards to the buffer protocol, the flow is simple:

1. Use the PyArg_ParseTuple function with the w* format specifier to parse its argument as a R/W buffer object. This itself calls PyObject_GetBuffer - a Python API that invokes the producer's "get buffer" function.
2. Read data from the file directly into this buffer.
3. Release the buffer using the PyBuffer_Release API, which eventually gets routed to the bytearray_releasebuffer function in our case.

To conclude, here's what the call sequence looks like when f.readinto(data) is executed in the Python code:

```
buffered_readinto
  |
  \--> PyArg_ParseTuple(..., "w*", ...)
  |      |
  |      \--> PyObject_GetBuffer(obj)
  |             |
  |             \--> obj->ob_type->tp_as_buffer->bf_getbuffer
  |
  |--> ... read the data
  |
  \--> PyBuffer_Release
         |
         \--> obj->ob_type->tp_as_buffer->bf_releasebuffer
```
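Incidentally, the export counting done in bytearray_getbuffer (the ob_exports++ lines) is observable from pure Python: while a buffer is exported, bytearray refuses to resize itself, since resizing could move the memory out from under the consumer. A small sketch:

```python
buf = bytearray(b'abcd')
mv = memoryview(buf)     # bf_getbuffer runs; the export count goes up
try:
    buf.append(101)      # resizing would invalidate the exported pointer
except BufferError as e:
    print(e)             # CPython refuses: object has existing exports
mv.release()             # bf_releasebuffer runs; export count drops back
buf.append(101)          # now resizing is fine again (101 == ord('e'))
print(buf)               # bytearray(b'abcde')
```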

Memory views

The buffer protocol is an internal implementation detail of Python, accessible only on the C-API level. And that's a good thing, since the buffer protocol requires certain low-level behavior, such as properly releasing buffers. Memoryview objects were created to expose it to a user's Python code in a safe manner: memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The linked documentation page explains memoryviews quite well and should be immediately comprehensible if you've read this far in the article. Therefore I'm not going to explain how a memoryview works, just show some examples of its use.

It is a known fact that in Python, slices on strings and bytes make copies. Sometimes when performance matters and the buffers are large, this is a big waste. Suppose you have a large buffer and you want to pass just half of it to some function (that will send it to a socket or do something else). Here's what happens (annotated Python pseudo-code):

```python
mybuf = ...                   # some large buffer of bytes
func(mybuf[:len(mybuf)//2])   # passes the first half of mybuf into func
                              # COPIES half of mybuf's data to a new buffer
```

The copy can be expensive if there's a lot of data involved. What's the alternative? Using a memoryview:

```python
mybuf = ...                   # some large buffer of bytes
mv_mybuf = memoryview(mybuf)  # a memoryview of mybuf
func(mv_mybuf[:len(mv_mybuf)//2])
# passes the first half of mybuf into func as a "sub-view" created
# by slicing a memoryview.
# NO COPY is made here!
```

A memoryview behaves just like bytes in many useful contexts (for example, it supports the mapping protocol) so it provides an adequate replacement if used carefully. The great thing about it is that it uses the buffer protocol beneath the covers to avoid copies and just juggle pointers to data.
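It's easy to convince yourself that a sliced memoryview really shares memory with the original buffer rather than copying it - mutate the underlying buffer and the change shows up through the view. A small sketch:

```python
buf = bytearray(b'0123456789')
half = memoryview(buf)[:5]  # "sub-view" over the first half; no copy made
buf[0] = ord('X')           # mutate the underlying buffer...
print(bytes(half))          # b'X1234' -- the view sees the change
print(half.obj is buf)      # True: the view refers to buf's own memory
```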
The performance difference is dramatic - I timed a ~300x speedup on slicing out half of a 1MB bytes buffer when using a memoryview as demonstrated above. And this speedup will grow with larger buffers, since slicing a memoryview is O(1) vs. the O(n) of copying.

But there's more. On writable producers such as bytearray, a memoryview creates a writable view that can be modified:

```python
>>> buf = bytearray(b'abcdefgh')
>>> mv = memoryview(buf)
>>> mv[4:6] = b'ZA'
>>> buf
bytearray(b'abcdZAgh')
```

This gives us a way to do something we couldn't achieve by any other means - read from a file (or receive from a socket) directly into the middle of some existing buffer:

```python
buf = bytearray(...)  # pre-allocated to the needed size
mv = memoryview(buf)
numread = f.readinto(mv[some_offset:])
```
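To make that last fragment concrete, here's a runnable sketch; it uses io.BytesIO as a stand-in for a real file opened in binary mode (readinto behaves the same on both), and the buffer contents are just illustrative:

```python
import io

# Simulate a binary file; real files opened with 'rb' work identically.
f = io.BytesIO(b'WORLD')
buf = bytearray(b'hello ?????')  # pre-allocated; '?' bytes are placeholders
mv = memoryview(buf)
numread = f.readinto(mv[6:])     # read directly into the middle of buf
print(numread, buf)              # 5 bytearray(b'hello WORLD')
```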