Code reading: The python std lib module - shelve.py
Introduction
Reading Python standard library (stdlib) modules is fun. It is also instructive, because the modules show how idiomatic Python is written. In this article, we are going to look at a stdlib module called shelve.
Given below is a sample usage of the module. Running the code creates a local database file; on my machine it is called bio.db, but the exact name depends on the dbm backend in use.
#!/usr/bin/env python
import dbm
import shelve


class Bio:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"Bio(name={self.name}, age={self.age})"


def main(name: str, age: int):
    # Open (or create) the shelf and store one object in it.
    with shelve.open("bio", writeback=True) as d:
        # dbm._defaultmod is a private attribute; it is set once dbm.open() has run.
        print(f"The shelve module is currently using: {dbm._defaultmod}")
        d[name] = Bio(name, age)
    # Reopen the shelf and read the object back.
    with shelve.open("bio") as d:
        if name in d:
            print(d[name])


if __name__ == '__main__':
    main("Jane Doe", 18)
Source Code
I will go over the source code of shelve.py and document many interesting tidbits along the way.
"""Manage shelves of pickled objects.
A "shelf" is a persistent, dictionary-like object. The difference
with dbm databases is that the values (not the keys!) in a shelf can
be essentially arbitrary Python objects -- anything that the "pickle"
module can handle. This includes most class instances, recursive data
types, and objects containing lots of shared sub-objects. The keys
are ordinary strings.
...
"""
One of the main advantages of using shelve instead of using the dbm module directly is that the values can be arbitrary picklable Python objects, while dbm itself only stores bytes. shelve does the low-level work of encoding the keys and pickling the values into bytes, and then delegates the storage of those bytes to the dbm module. Allowing only str keys is also a pragmatic design choice.
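To make the difference concrete, here is a minimal sketch that stores a value with dbm directly and then with shelve. The file names raw and shelf and the sample values are made up for illustration, and everything lives in a temporary directory.

```python
import dbm
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # dbm on its own: keys and values are bytes.
    with dbm.open(os.path.join(tmp, "raw"), "c") as db:
        db[b"name"] = b"Jane"
        raw_value = db[b"name"]

    # shelve: str keys, and any picklable object as the value.
    with shelve.open(os.path.join(tmp, "shelf")) as shelf:
        shelf["bio"] = {"name": "Jane", "tags": ["a", "b"]}
        rich_value = shelf["bio"]

print(raw_value)
print(rich_value)
```

With dbm you would have to serialize the dictionary yourself before storing it; shelve does that step for you.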
from pickle import DEFAULT_PROTOCOL, Pickler, Unpickler
from io import BytesIO

import collections.abc

__all__ = ["Shelf", "BsdDbShelf", "DbfilenameShelf", "open"]

class _ClosedDict(collections.abc.MutableMapping):
    'Marker for a closed dict.  Access attempts raise a ValueError.'

    def closed(self, *args):
        raise ValueError('invalid operation on closed shelf')
    __iter__ = __len__ = __getitem__ = __setitem__ = __delitem__ = keys = closed

    def __repr__(self):
        return '<Closed Dictionary>'
It is interesting to note that shelve internally uses the pickle module to serialize the values. Pickle is Python's standard binary serializer/deserializer and supports more types than json. It is not secure, though: unpickling data from an untrusted source can execute arbitrary code.
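As a small illustration of "more types than json", the sketch below round-trips a value containing a set and a tuple through pickle and then tries the same value with json. The sample value is made up for this example.

```python
import json
import pickle

value = {"tags": {"a", "b"}, "point": (1, 2)}

# pickle round-trips the set and the tuple unchanged.
restored = pickle.loads(pickle.dumps(value))

# json has no representation for a set and refuses to serialize it.
try:
    json.dumps(value)
    json_ok = True
except TypeError:
    json_ok = False

print(restored == value, json_ok)
```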
Defining __all__ is a convention in Python. The symbols listed in it are the public API of the module, and they are the only symbols imported into another module when from shelve import * is run.
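One way to see this for yourself is to run the star import into a scratch namespace and look at what arrived. This is only a sketch; the namespace dictionary is a throwaway name used for the experiment.

```python
import shelve

# Run the star import inside an empty namespace instead of this module.
namespace = {}
exec("from shelve import *", namespace)

# exec() adds __builtins__ to the namespace, so filter out dunder names.
imported = sorted(name for name in namespace if not name.startswith("__"))

print(imported)
print("_ClosedDict" in namespace)
```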
_ClosedDict is a private symbol because its name starts with an underscore, which is another convention in Python. Interestingly, it is a class that denotes a "closed" dict: any access through a method like __getitem__ raises ValueError. It is implemented by binding all the relevant dictionary methods to a single method, closed, whose implementation raises ValueError when invoked. This is beautiful and idiomatic Python.
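The effect is easy to observe from the outside. The following sketch opens a shelf in a temporary directory, closes it, and then tries to read from it; the file name demo and the key are arbitrary.

```python
import os
import shelve
import tempfile

message = None
with tempfile.TemporaryDirectory() as tmp:
    shelf = shelve.open(os.path.join(tmp, "demo"))
    shelf["k"] = 1
    shelf.close()  # close() swaps the backing dict for a _ClosedDict

    try:
        shelf["k"]  # any access now goes through _ClosedDict.closed
    except ValueError as exc:
        message = str(exc)

print(message)
```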
class Shelf(collections.abc.MutableMapping):
    """Base class for shelf implementations.

    This is initialized with a dictionary-like object.
    See the module's __doc__ string for an overview of the interface.
    """

    def __init__(self, dict, protocol=None, writeback=False,
                 keyencoding="utf-8"):
        self.dict = dict
        if protocol is None:
            protocol = DEFAULT_PROTOCOL
        self._protocol = protocol
        self.writeback = writeback
        self.cache = {}
        self.keyencoding = keyencoding
    ...
The base class Shelf extends collections.abc.MutableMapping, an abstract base class that defines the interface for dictionary-like behavior. protocol is passed to pickle to select the pickle protocol to use. keyencoding defines the encoding used to turn the string keys into bytes. cache holds, when writeback is True, the in-memory Python objects associated with the string keys in the shelf. Finally, dict is the dictionary-like object that stores the encoded keys and pickled values; shelve.open() passes a dbm database here for disk persistence.
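Because Shelf only needs a dictionary-like object, you can hand it a plain dict and inspect what it stores. This is a sketch for exploration, not how the module is normally used; the backing variable and the sample key are made up.

```python
import pickle
import shelve

backing = {}                    # stands in for the dbm database
shelf = shelve.Shelf(backing)   # keyencoding defaults to "utf-8"
shelf["point"] = (1, 2)

# The key is stored as encoded bytes and the value as a pickle.
stored_keys = list(backing)
stored_value = pickle.loads(backing[b"point"])

print(stored_keys, stored_value)
```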
    def __iter__(self):
        for k in self.dict.keys():
            yield k.decode(self.keyencoding)

    def __len__(self):
        return len(self.dict)

    def __contains__(self, key):
        return key.encode(self.keyencoding) in self.dict

    def get(self, key, default=None):
        if key.encode(self.keyencoding) in self.dict:
            return self[key]
        return default
    ...
The __iter__ dunder method is a generator that yields each key, decoded from bytes to a string using the keyencoding originally passed to the Shelf constructor. The rest of the dunder methods are standard fare for a Python programmer. Crucially, self[key] in the get method calls __getitem__, which is defined further down.
    def __getitem__(self, key):
        try:
            value = self.cache[key]
        except KeyError:
            f = BytesIO(self.dict[key.encode(self.keyencoding)])
            value = Unpickler(f).load()
            if self.writeback:
                self.cache[key] = value
        return value

    def __setitem__(self, key, value):
        if self.writeback:
            self.cache[key] = value
        f = BytesIO()
        p = Pickler(f, self._protocol)
        p.dump(value)
        self.dict[key.encode(self.keyencoding)] = f.getvalue()

    def __delitem__(self, key):
        del self.dict[key.encode(self.keyencoding)]
        try:
            del self.cache[key]
        except KeyError:
            pass
    ...
__getitem__ returns the Python object associated with the key from cache. If that fails, it loads the value from the persistent store. This is a two-step process:
a) Wrap the raw bytes in a BytesIO object, which is an in-memory file-like object over a byte stream.
b) Invoke Unpickler(f).load() to deserialize the raw bytes into a Python object.
If writeback is True, it then stores the Python object in cache before returning it to the caller.
__setitem__ tracks the key-value pair in cache if writeback is True. Then it does the reverse of __getitem__: it creates an in-memory file-like object using BytesIO, uses a Pickler to serialize the value into it as a byte stream, and assigns the resulting bytes to dict, which persists them in storage.
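The two directions can be sketched as a pair of standalone helpers. The function names to_bytes and from_bytes are hypothetical and do not exist in shelve; they only mirror the steps that __setitem__ and __getitem__ perform.

```python
from io import BytesIO
from pickle import DEFAULT_PROTOCOL, Pickler, Unpickler


def to_bytes(value, protocol=DEFAULT_PROTOCOL):
    """Mirror of __setitem__: pickle a value into an in-memory buffer."""
    buffer = BytesIO()
    Pickler(buffer, protocol).dump(value)
    return buffer.getvalue()


def from_bytes(raw):
    """Mirror of __getitem__: unpickle a value from raw bytes."""
    return Unpickler(BytesIO(raw)).load()


raw = to_bytes({"name": "Jane Doe", "age": 18})
restored = from_bytes(raw)
print(type(raw).__name__, restored)
```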
When writeback is True, cache keeps a reference to every object that has been read or written, because a value may be mutated in place after it is fetched and the change still needs to be persisted. When close is called, shelve writes all entries in cache back to persistent storage. This is expensive in both memory and time, since cache cannot know which values were actually mutated, so it writes back every entry.
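The difference shows up as soon as you mutate a stored value in place. Here is a sketch using a temporary directory; the file name demo and the key nums are arbitrary.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo")

    # Without writeback, the append happens on a throwaway copy.
    with shelve.open(path) as shelf:
        shelf["nums"] = [1]
        shelf["nums"].append(2)
        without_writeback = shelf["nums"]

    # With writeback, the cached list is mutated and saved on close.
    with shelve.open(path, writeback=True) as shelf:
        shelf["nums"].append(2)

    with shelve.open(path) as shelf:
        with_writeback = shelf["nums"]

print(without_writeback, with_writeback)
```

Without writeback you would have to reassign the key yourself, for example by reading the list, appending to it, and storing it again.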
__delitem__ deletes the item from storage and then from cache. The KeyError from the cache deletion is ignored because the key may not be in cache at all, for example when writeback is False.
    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()
    ...
Shelf is a context manager by virtue of defining the __enter__ and __exit__ dunder methods. Leaving the with block calls close().
    def close(self):
        if self.dict is None:
            return
        try:
            self.sync()
            try:
                self.dict.close()
            except AttributeError:
                pass
        finally:
            # Catch errors that may happen when close is called from __del__
            # because CPython is in interpreter shutdown.
            try:
                self.dict = _ClosedDict()
            except:
                self.dict = None

    def __del__(self):
        if not hasattr(self, 'writeback'):
            # __init__ didn't succeed, so don't bother closing
            # see http://bugs.python.org/issue1339007 for details
            return
        self.close()

    def sync(self):
        if self.writeback and self.cache:
            self.writeback = False
            for key, entry in self.cache.items():
                self[key] = entry
            self.writeback = True
            self.cache = {}
        if hasattr(self.dict, 'sync'):
            self.dict.sync()
    ...
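In __del__, a check is made before close() is called to prevent an AttributeError. Even after looking at the bug report and the fix, it is not clear to me why only the writeback attribute is checked and not all of the attributes set from the arguments to shelve.open(). One plausible reading is that when dbm.open() raises inside DbfilenameShelf.__init__, Shelf.__init__ never runs, so none of the attributes exist and testing any one of them is enough.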
sync() makes sure that the key-value pairs in cache are written back to persistent storage if writeback is True, and that dict, which is normally a dbm instance, is synced to disk if it has a sync method. Here writeback is set to False before calling self[key] = entry so that __setitem__ does not write to cache while sync() is iterating over it.
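You can watch sync() do its job by peeking at the raw bytes in the backing store before and after the call. This sketch reaches into shelf.dict and shelf.cache purely for inspection; the file name demo and the key nums are arbitrary.

```python
import os
import pickle
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    shelf = shelve.open(os.path.join(tmp, "demo"), writeback=True)
    shelf["nums"] = [1]
    shelf["nums"].append(2)  # mutates the cached list only

    # The backing store still holds the pickle of the original list.
    before = pickle.loads(shelf.dict[b"nums"])

    shelf.sync()             # re-pickles every cached entry, then empties cache
    after = pickle.loads(shelf.dict[b"nums"])
    cache_after_sync = dict(shelf.cache)

    shelf.close()

print(before, after, cache_after_sync)
```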
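close() calls sync() and then closes the dict, ignoring the AttributeError raised when the dict has no close method. The finally block then replaces dict with a _ClosedDict so that any later access raises ValueError. If even that assignment fails, which according to the comment can happen when close() is called from __del__ during interpreter shutdown, dict is set to None, and the check at the top of close() turns later calls into no-ops.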
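class BsdDbShelf(Shelf):
    ...
BsdDbShelf is a subclass of Shelf backed by the "bsddb" interface, which differs from the generic dbm interface by offering extra methods such as first(), next(), previous(), last() and set_location().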
class DbfilenameShelf(Shelf):
    """Shelf implementation using the "dbm" generic dbm interface.

    This is initialized with the filename for the dbm database.
    See the module's __doc__ string for an overview of the interface.
    """

    def __init__(self, filename, flag='c', protocol=None, writeback=False):
        import dbm
        Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)

    def clear(self):
        """Remove all items from the shelf."""
        # Call through to the clear method on dbm-backed shelves.
        # see https://github.com/python/cpython/issues/107089
        self.cache.clear()
        self.dict.clear()
The class is self-explanatory. Inside __init__, we import dbm and then initialize Shelf, setting dict to the database object returned by dbm.open().
def open(filename, flag='c', protocol=None, writeback=False):
    """Open a persistent dictionary for reading and writing.

    The filename parameter is the base filename for the underlying
    database.  As a side-effect, an extension may be added to the
    filename and more than one file may be created.  The optional flag
    parameter has the same interpretation as the flag parameter of
    dbm.open(). The optional protocol parameter specifies the
    version of the pickle protocol.

    See the module's __doc__ string for an overview of the interface.
    """

    return DbfilenameShelf(filename, flag, protocol, writeback)
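Again, self-explanatory. When shelve.open() is called, it invokes this function, which returns a DbfilenameShelf. If you pass bio as filename, then bio.db is the name of the database file persisted on disk on my machine; as the docstring says, the extension and the number of files depend on the dbm backend.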
You might be wondering which implementation of the dbm interface is used by default. It depends on your system. On my macOS machine, it defaults to dbm.ndbm.
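To check this on your own system without touching private attributes, dbm.whichdb() can be asked which backend an existing database file belongs to. The sketch below creates a shelf in a temporary directory and lists the files that appear; the output will differ from machine to machine.

```python
import dbm
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bio")
    with shelve.open(path) as shelf:
        shelf["x"] = 1

    # whichdb() returns the backend module name as a string, e.g. "dbm.ndbm".
    backend = dbm.whichdb(path)
    # The backend decides the extension(s) and how many files are created.
    files = sorted(os.listdir(tmp))

print(backend)
print(files)
```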
Takeaway
The Python stdlib is vast and contains supremely readable, idiomatic code. If you are looking for a way to develop your code-reading skills and become a better engineer, definitely check out the source code of the Python stdlib.