The pickle module implements a serialization protocol that lets you save Python objects in a special binary format and load them back later. Unlike json, pickle is not limited to simple objects. It can also store references to functions and classes, as well as the state of class instances.
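As a quick illustration of that difference, the sketch below round-trips a few standard-library objects (picked here only as examples): a function, a class, and a Fraction instance. json rejects the Fraction, while pickle restores all three.

```python
import collections
import json
import math
import pickle
from fractions import Fraction

value = Fraction(3, 4)

# json only knows basic types (dict, list, str, numbers, bool, None).
try:
    json.dumps(value)
    json_ok = True
except TypeError:
    json_ok = False

# pickle stores functions and classes by reference (module + name)
# and instances by their class plus the data needed to rebuild them.
restored_func = pickle.loads(pickle.dumps(math.sqrt))
restored_class = pickle.loads(pickle.dumps(collections.OrderedDict))
restored_value = pickle.loads(pickle.dumps(value))

print(json_ok)
print(restored_func is math.sqrt)
print(restored_class is collections.OrderedDict)
print(restored_value == value)
```

Note that functions and classes are stored by name only, so the module that defines them must be importable when the pickle is loaded.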

Before we start, it is worth mentioning that Python 2 ships two versions of the module: pickle and cPickle. The latter implements the same algorithm in C and is faster. The downside is that you cannot subclass its Pickler and Unpickler. In Python 3, the accelerated version is imported automatically when it is available.

Pickle example

    import pickle
    import pickletools


    class Node:
        def __init__(self, data):
            self.data = data
            self.children = []

        def add_child(self, obj):
            self.children.append(obj)


    node = Node({"int": 1, "float": 2.0})
    # Protocol 3 is pinned so that the output uses the same opcodes as the
    # listings in this article; Python 3.8+ defaults to protocol 4.
    data = pickle.dumps(node, protocol=3)

The binary output looks as follows:

    >>> data
    b'\x80\x03c__main__\nNode\nq\x00)\x81q\x01}q\x02(X\x08\x00\x00\x00childrenq\x03]q\x04X\x04\x00\x00\x00dataq\x05}q\x06(X\x03\x00\x00\x00intq\x07K\x01X\x05\x00\x00\x00floatq\x08G@\x00\x00\x00\x00\x00\x00\x00uub.'

We can use pickletools to convert it to a human-readable format:

    >>> pickletools.dis(data)
        0: \x80 PROTO      3
        2: c    GLOBAL     '__main__ Node'
       17: q    BINPUT     0
       19: )    EMPTY_TUPLE
       20: \x81 NEWOBJ
       21: q    BINPUT     1
       23: }    EMPTY_DICT
       24: q    BINPUT     2
       26: (    MARK
       27: X        BINUNICODE 'children'
       40: q        BINPUT     3
       42: ]        EMPTY_LIST
       43: q        BINPUT     4
       45: X        BINUNICODE 'data'
       54: q        BINPUT     5
       56: }        EMPTY_DICT
       57: q        BINPUT     6
       59: (        MARK
       60: X            BINUNICODE 'int'
       68: q            BINPUT     7
       70: K            BININT1    1
       72: X            BINUNICODE 'float'
       82: q            BINPUT     8
       84: G            BINFLOAT   2.0
       93: u            SETITEMS   (MARK at 59)
       94: u        SETITEMS   (MARK at 26)
       95: b    BUILD
       96: .    STOP
    highest protocol among opcodes = 2

Serialization algorithm

Internally, a pickle is a program for a stack-based virtual machine called the pickle machine (PM). The name and the format can be confusing, but pickle is actually based on a simple concept.

The pickle byte stream is a sequence of opcodes, each followed by at most one argument. Opcodes are executed once each, from left to right, until the STOP opcode is reached. To store intermediate results, pickle uses two data structures: a stack (based on a list) and a memo (which can be based on a list or a dictionary).
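The memo is easiest to see when the same object appears twice. The following sketch (variable names are illustrative) uses pickletools.genops, which yields one (opcode, argument, position) triple per opcode, to list the opcodes of such a pickle. Opcodes that take no argument report None.

```python
import pickle
import pickletools

shared = [1]
data = pickle.dumps([shared, shared], protocol=3)

# Each triple is (OpcodeInfo, decoded argument or None, byte offset).
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]
for name, arg in ops:
    print(name, arg)

# BINPUT saved the inner list in the memo; BINGET fetches it again
# instead of serializing the list a second time.
restored = pickle.loads(data)
print(restored[0] is restored[1])
```

Because the second occurrence is read back from the memo, both elements of the restored list are the same object, just as in the original.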

To get an idea, let's start with a simple example:

    >>> pickle.dumps([1, 2, 3, 4], protocol=3)
    b'\x80\x03]q\x00(K\x01K\x02K\x03K\x04e.'
    >>> pickletools.dis(_)
        0: \x80 PROTO      3
        2: ]    EMPTY_LIST
        3: q    BINPUT     0
        5: (    MARK
        6: K        BININT1    1
        8: K        BININT1    2
       10: K        BININT1    3
       12: K        BININT1    4
       14: e        APPENDS    (MARK at 5)
       15: .    STOP
    highest protocol among opcodes = 2

Here PROTO indicates the version of the protocol, which you can change for compatibility with older Python versions. The EMPTY_LIST opcode creates an empty Python list and pushes it onto the stack. BINPUT stores the object on top of the stack in the memo under the given index. The MARK opcode pushes a special marker object. In our case, it marks where the list items start on the stack.
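For instance, the protocol argument of pickle.dumps selects the format version. The sketch below pins a few versions and looks at how each stream starts. Protocol 0 is a text-oriented format without a PROTO opcode, which only appeared in protocol 2.

```python
import pickle

value = [1, 2, 3, 4]

for proto in (0, 2, 3):
    data = pickle.dumps(value, protocol=proto)
    # From protocol 2 on, the stream starts with PROTO (b"\x80")
    # followed by the version byte.
    print(proto, data[:2], pickle.loads(data) == value)

# The defaults depend on the Python version in use.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

Any of these streams can be read back by pickle.loads without naming the protocol, since the loader detects it from the data.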

The BININT1 opcode reads a one-byte unsigned integer from the stream and pushes it onto the stack. The pickle machine does not know the number of items in the list in advance, so it keeps pushing values onto the stack until a different opcode is reached.
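BININT1 only covers integers that fit into one unsigned byte. As a small experiment (the helper name int_opcode is made up for this sketch), we can ask pickletools which opcode pickle picks for larger values:

```python
import pickle
import pickletools


def int_opcode(value):
    """Return the name of the opcode that carries `value` (protocol 3)."""
    data = pickle.dumps(value, protocol=3)
    names = [op.name for op, _arg, _pos in pickletools.genops(data)]
    # names[0] is PROTO and names[-1] is STOP; the integer sits in between.
    return names[1]


for value in (1, 1000, 100000, 2 ** 40):
    print(value, int_opcode(value))
```

The wider the integer, the wider the opcode: two-byte and four-byte variants come first, and values beyond 32 bits switch to a variable-length encoding.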

The APPENDS opcode takes all the objects from the top of the stack down to (but not including) the topmost marker object and appends them to the list that sits just below the marker.

Python implementation of APPENDS:

    def load_appends(self):
        items = self.pop_mark()
        list_obj = self.stack[-1]
        try:
            extend = list_obj.extend
        except AttributeError:
            pass
        else:
            extend(items)
            return
        # Even if the PEP 307 requires extend() and append() methods,
        # fall back on append() if the object has no extend() method
        # for backward compatibility.
        append = list_obj.append
        for item in items:
            append(item)

But what about other objects? How does it work for a dictionary, for example? Instead of pushing one value per item, pickle pushes a key followed by its value, and the SETITEMS opcode stores the pairs in the dictionary:

    def load_setitems(self):
        items = self.pop_mark()
        dict = self.stack[-1]
        for i in range(0, len(items), 2):
            dict[items[i]] = items[i + 1]
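To see how these pieces fit together, here is a toy pickle machine that understands only the handful of opcodes discussed so far. It is a rough sketch, not how pickle.py is written: the names toy_loads, pop_mark and the MARK sentinel are invented for this example, and decoding the byte stream is delegated to pickletools.genops.

```python
import pickle
import pickletools

MARK = object()  # sentinel pushed by the MARK opcode (an assumption of this sketch)


def pop_mark(stack):
    """Pop everything above the topmost MARK, and the MARK itself."""
    items = []
    while stack[-1] is not MARK:
        items.append(stack.pop())
    stack.pop()  # discard the marker
    items.reverse()  # restore the original push order
    return items


def toy_loads(data):
    stack, memo = [], {}
    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name == "PROTO":
            pass  # version header, nothing to execute
        elif name == "EMPTY_LIST":
            stack.append([])
        elif name == "EMPTY_DICT":
            stack.append({})
        elif name == "MARK":
            stack.append(MARK)
        elif name in ("BININT1", "BINUNICODE", "BINFLOAT"):
            stack.append(arg)  # genops has already decoded the value
        elif name == "BINPUT":
            memo[arg] = stack[-1]
        elif name == "APPENDS":
            items = pop_mark(stack)
            stack[-1].extend(items)
        elif name == "SETITEMS":
            items = pop_mark(stack)
            target = stack[-1]
            for i in range(0, len(items), 2):
                target[items[i]] = items[i + 1]
        elif name == "STOP":
            return stack.pop()
        else:
            raise NotImplementedError(name)


print(toy_loads(pickle.dumps([1, 2, 3, 4], protocol=3)))
print(toy_loads(pickle.dumps({"int": 1, "float": 2.0}, protocol=3)))
```

Anything outside this small opcode set (class instances, single-item containers, large integers) raises NotImplementedError, which keeps the sketch honest about what it covers.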

How pickle stores class instances

To serialize a class instance, we need to know its class name and state (i.e., data attributes). In some languages, this requires a complicated class traversal algorithm. In Python, however, all instance attributes (except for those declared in __slots__) are stored in a dictionary.

Every class inherits generic __reduce__ and __reduce_ex__ methods from object. They return all the data needed to rebuild an instance (i.e., the class, the object constructor, slots, and the attributes dictionary).
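To make the returned data concrete, this sketch prints what __reduce_ex__(3) gives back for a small class in CPython 3. The Node class mirrors the one from the first example, and the names given to the five fields are labels chosen for this sketch.

```python
import copyreg


class Node:
    def __init__(self, data):
        self.data = data
        self.children = []


node = Node({"int": 1, "float": 2.0})

# For protocol 2 and higher, a plain instance reduces to a 5-tuple.
func, args, state, list_items, dict_items = node.__reduce_ex__(3)

print(func is copyreg.__newobj__)  # helper that calls cls.__new__(cls, *args)
print(args)                        # the class to instantiate, as a 1-tuple
print(state)                       # the instance's attribute dictionary
print(list_items, dict_items)      # only used by list and dict subclasses
```

The first two fields say how to create an empty object, and the third says what to put into it.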

Let's restore our Node instance (from the first example) by hand:

    # Get the state using protocol 3
    constructor, _, state, _, _ = node.__reduce_ex__(3)

    # Create an empty instance
    # (equivalent to node = Node.__new__(Node))
    node = constructor(Node)

    # Replace the instance's dictionary
    instance_dict = node.__dict__
    for k, v in state.items():
        instance_dict[k] = v

    print(node.data)

Why pickle is not secure

The documentation of the pickle module states:

The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

Pickle has a REDUCE opcode, which was intended for custom object reconstruction but can be used for evil. It pops a callable and a tuple of arguments from the stack and immediately calls the callable with them. Unfortunately, there are no safety checks.

This is how you can call eval with arbitrary code in it:

    >>> payload = b"c__builtin__\neval\n(S'print(123)'\ntR."
    >>> pickletools.dis(payload)
        0: c    GLOBAL     '__builtin__ eval'
       18: (    MARK
       19: S        STRING     'print(123)'
       33: t        TUPLE      (MARK at 18)
       34: R    REDUCE
       35: .    STOP
    highest protocol among opcodes = 0
    >>> pickle.loads(payload)
    123
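If you still have to load pickles you do not fully trust, the pickle documentation describes restricting which globals may be loaded by overriding Unpickler.find_class. A minimal sketch of that idea follows. The names RestrictedUnpickler, restricted_loads and the allow-list ALLOWED are assumptions made for this example, and the approach narrows the attack surface rather than making pickle safe.

```python
import io
import pickle

# Hypothetical allow-list of (module, name) pairs; empty means "no globals at all".
ALLOWED = set()


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global such as a class or function.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside ALLOWED."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


payload = b"c__builtin__\neval\n(S'print(123)'\ntR."

try:
    restricted_loads(payload)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)

# Plain containers need no globals, so they still load.
print(restricted_loads(pickle.dumps([1, 2, 3])))
```

The payload is rejected at the GLOBAL opcode, before REDUCE gets a chance to call anything.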

More about pickle

If you want to understand more, see pickletools.py for extensive comments about the protocol, and pickle.py for implementation details.