ZetCode

Python pickle

last modified August 29, 2020

This Python pickle tutorial shows how to serialize and deserialize data in Python with the pickle module.

The pickle module

The pickle module implements binary protocols for serializing and deserializing a Python object structure. Serialization is the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Deserialization is the process of converting a byte stream back to a Python object.

This process is also called pickling/unpickling or marshalling/unmarshalling.

The pickletools module contains tools for analyzing data streams generated by pickle.
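As a minimal sketch of what pickletools offers, the following snippet disassembles a small pickle into a readable opcode listing. The sample list and the choice of protocol 2 are assumptions made only to keep the listing short.

```python
import io
import pickle
import pickletools

# Pickle a small list; protocol 2 is chosen here only to keep the listing short.
payload = pickle.dumps([1, 2, 3], protocol=2)

# pickletools.dis writes a symbolic disassembly to the given text stream.
buf = io.StringIO()
pickletools.dis(payload, out=buf)
listing = buf.getvalue()

print(listing)
```

Each line of the listing shows a byte offset, an opcode name such as PROTO or STOP, and the opcode's argument, if any.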

Note: data serialization with the pickle module is insecure. The documentation stresses that we should never unpickle data that comes from an untrusted source or is transmitted over an insecure network.

Python pickle serialize

The following example serializes data into a binary file.

simple_write.py
#!/usr/bin/python

import pickle

data = {
    'a': [1, 4.0, 3, 4+6j],
    'b': ("a red fox", b"and old falcon"),
    'c': {None, True, False}
}

with open('data.bin', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

We have a dictionary of different data types. The data is pickled into a binary file.

with open('data.bin', 'wb') as f:
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

The dump function writes the pickled representation of the object to the file object. Over time, several protocol versions have been developed. The protocol version determines the capabilities of the serialization process, such as which object types are stored efficiently. In our code, we choose the highest protocol version available.

$ hexdump -C data.bin
00000000  80 05 95 75 00 00 00 00  00 00 00 7d 94 28 8c 01  |...u.......}.(..|
00000010  61 94 5d 94 28 4b 01 47  40 10 00 00 00 00 00 00  |a.].(K.G@.......|
00000020  4b 03 8c 08 62 75 69 6c  74 69 6e 73 94 8c 07 63  |K...builtins...c|
00000030  6f 6d 70 6c 65 78 94 93  94 47 40 10 00 00 00 00  |omplex...G@.....|
00000040  00 00 47 40 18 00 00 00  00 00 00 86 94 52 94 65  |..G@.........R.e|
00000050  8c 01 62 94 8c 09 61 20  72 65 64 20 66 6f 78 94  |..b...a red fox.|
00000060  43 0e 61 6e 64 20 6f 6c  64 20 66 61 6c 63 6f 6e  |C.and old falcon|
00000070  94 86 94 8c 01 63 94 8f  94 28 89 88 4e 90 75 2e  |.....c...(..N.u.|
00000080

Binary files cannot be read with simple text editors; we need tools that can work with hexadecimal data.
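The effect of the protocol version can also be explored from Python itself. The sketch below pickles a dictionary with every protocol the running interpreter supports and records the size of each result. The sample data is an assumption loosely modelled on the dictionary above; the sizes depend on the Python version, so none are quoted here.

```python
import pickle

# Sample data, loosely modelled on the dictionary in simple_write.py.
data = {
    'a': [1, 4.0, 3, 4+6j],
    'b': ("a red fox", b"and old falcon"),
}

sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=proto)
    # Every protocol must reproduce an equal object.
    assert pickle.loads(payload) == data
    sizes[proto] = len(payload)

for proto, size in sizes.items():
    print(f"protocol {proto}: {size} bytes")

# From protocol 2 on, the stream starts with the PROTO opcode (0x80)
# followed by the protocol number.
payload_v2 = pickle.dumps(data, protocol=2)
print(payload_v2[:2])
```

This matches the hexdump above, where the first two bytes, 80 05, mark the file as a protocol 5 pickle.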

Python pickle deserialize

In the next example, we unpickle data from a binary file.

simple_read.py
#!/usr/bin/python

import pickle

with open('data.bin', 'rb') as f:

    data = pickle.load(f)

    print(data)

The load function reads the pickled representation of an object from the file object and returns the reconstituted object.

$ ./simple_read.py
{'a': [1, 4.0, 3, (4+6j)], 'b': ('a red fox', b'and old falcon'),
    'c': {False, True, None}}

We have successfully recreated the dictionary object.
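A file can hold more than one pickle. As a sketch under that assumption, the following snippet dumps several objects one after another and loads them back until load raises EOFError. An in-memory io.BytesIO buffer stands in for a real file, and the sample records are made up for illustration.

```python
import io
import pickle

# An in-memory buffer stands in for a binary file opened with 'wb' / 'rb'.
buf = io.BytesIO()

# Illustrative records; each dump call appends one complete pickle.
for record in ({'id': 1}, [1, 2, 3], "a red fox"):
    pickle.dump(record, buf)

# Rewind and read the pickles back one by one.
buf.seek(0)
records = []
while True:
    try:
        records.append(pickle.load(buf))
    except EOFError:
        # No more pickles in the stream.
        break

print(records)
```

Each call to load consumes exactly one pickled object and leaves the stream positioned at the start of the next one.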

Python pickle dumps/loads

The dumps function returns the pickled representation of the object as a bytes object, instead of writing it to a file. The loads function reconstructs the object hierarchy from a pickled representation and returns it. The data passed to loads must be a bytes-like object.

dumps_loads.py
#!/usr/bin/python

import pickle


data = [1, 2, 3, 4, 5]

dumped = pickle.dumps(data)
print(dumped)

loaded = pickle.loads(dumped)
print(loaded)

In the example, we serialize and deserialize a Python list with dumps and loads.

$ ./dumps_loads.py
b'\x80\x04\x95\x0f\x00\x00\x00\x00\x00\x00\x00]\x94(K\x01K\x02K\x03K\x04K\x05e.'
[1, 2, 3, 4, 5]

This is the output.
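A round trip through dumps and loads produces a new, independent object rather than a reference to the original. The short sketch below, which uses made-up variable names, also shows that references shared inside a single pickled structure stay shared after unpickling.

```python
import pickle

inner = [1, 2, 3]
# Both keys refer to the very same list object.
outer = {'x': inner, 'y': inner}

restored = pickle.loads(pickle.dumps(outer))

# The restored structure is equal to the original but is a separate object.
print(restored == outer)
print(restored is outer)

# Inside the restored structure the list is still shared.
print(restored['x'] is restored['y'])

# Changing the copy does not affect the original.
restored['x'].append(4)
print(inner)
```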

Python pickle __getstate__/__setstate__

The process of pickling and unpickling can be customized with the __getstate__ and __setstate__ methods. The __getstate__ method is called upon pickling and returns the state to be stored; the __setstate__ method is called upon unpickling and receives that state.

words.txt
blue, rock, water, sky, cloud, forest, hawk, falcon

This is the words.txt file.

colours.txt
red, green, blue, pink, orange

This is the colours.txt file.

state.py
#!/usr/bin/python

import pickle


class MyData:

    def __init__(self, filename):

        self.name = filename
        self.fh = open(filename)

    def __getstate__(self):

        odict = self.__dict__.copy()
        print(odict)
        del odict['fh']
        return odict

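    def __setstate__(self, state):

        # Restore the saved name and reopen the file; the parameter is
        # called state so that it does not shadow the built-in dict.
        self.name = state['name']
        self.fh = open(state['name'])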

obj = MyData('words.txt')

res = pickle.loads(pickle.dumps(obj))
print(res.fh.read())

obj2 = MyData('colours.txt')

res = pickle.loads(pickle.dumps(obj2))
print(res.fh.read())

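In the example, the __getstate__ method removes the file handle, which cannot be pickled, from the object's state. The __setstate__ method reopens the file and restores the handle when the object is unpickled.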

$ ./state.py 
{'name': 'words.txt', 'fh': <_io.TextIOWrapper name='words.txt' mode='r' encoding='UTF-8'>}
blue, rock, water, sky, cloud, forest, hawk, falcon
{'name': 'colours.txt', 'fh': <_io.TextIOWrapper name='colours.txt' mode='r' encoding='UTF-8'>}
red, green, blue, pink, orange

This is the output.
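The same pattern applies to any attribute that cannot be pickled. The sketch below uses a hypothetical Counter class, not part of the example above, that holds a threading.Lock; the lock is dropped in __getstate__ and recreated in __setstate__.

```python
import pickle
import threading


class Counter:
    """Hypothetical class used only for illustration."""

    def __init__(self, start=0):
        self.value = start
        # Lock objects cannot be pickled.
        self.lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()


original = Counter(7)
restored = pickle.loads(pickle.dumps(original))

print(restored.value)
print(restored.lock is original.lock)
```

Without the two methods, pickling the object would fail with a TypeError because of the lock attribute.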

Python pickle is insecure

The pickle module is insecure. The unpickler works like a small virtual machine that executes predefined opcodes, some of which import and call arbitrary callables. With a specially crafted byte string, an attacker can run system commands that damage data or open reverse shells.

insec.py
#!/usr/bin/python

import pickle

pickle.loads(b"cos\nsystem\n(S'ls -l'\ntR.")

This example launches the Linux ls command.

$ ./insec.py 
total 36
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Desktop
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:18 Documents
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Downloads
-rwxr-xr-x 1 user2 user2   79 Aug 29 11:08 insec.py
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Music
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Pictures
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Public
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Templates
drwxr-xr-x 2 user2 user2 4096 Aug 13 16:16 Videos

This is a sample output.
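One mitigation described in the Python documentation is to subclass pickle.Unpickler and override find_class so that only an allow-list of globals can be loaded. The following is a minimal sketch of that idea; the names SAFE_BUILTINS, RestrictedUnpickler, and restricted_loads are chosen for this example, and the allow-list is an assumption. It reduces the attack surface but does not make unpickling untrusted data safe in general.

```python
import builtins
import io
import pickle

# Assumed allow-list: only these built-in names may be loaded.
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}


class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Allow only selected names from the builtins module.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Refuse everything else, including os.system.
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but with the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Harmless data still loads.
print(restricted_loads(pickle.dumps([1, 2, range(15)])))

# A payload of the same kind as in insec.py is rejected before any command runs.
try:
    restricted_loads(b"cos\nsystem\n(S'echo hello'\ntR.")
except pickle.UnpicklingError as e:
    print(e)
```

For data that crosses a trust boundary, a format that cannot execute code, such as JSON, is usually the safer choice.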

In this tutorial, we have worked with the Python pickle module.
