Regrettable Code, Episode 1

It’s easy to talk about pretty code. Let’s talk about nasty code and see what we can learn from it.

Background

The product I was working on has a concept of an Account . On a given Account there may be one-to-many Teams to which users belong. There was a RESTful API that allowed querying various settings of Account and Team objects of the form

/api/v1/features/team-<id>

For example, /api/v1/features/team-21 would return all features of team 21. There was also a form of the endpoint that would return all teams and their respective settings: /api/v1/features/team-*

The Problem

In a nutshell, on some accounts with many teams, querying for team-21 was slow. Querying for team-* was exceptionally slow — for an account with 500ish teams, it would easily take ten to fifteen seconds to return a response.

Let’s Profile!

My initial gut was: someone probably stuck a database query in a loop or something silly like that, and I’d fix it with a simple change to make a single database call. To confirm this hypothesis, I did a quick profile.

This code was in python, so cProfile was a quick way to get some timings against a production account. A call to cProfile.run('features = _inner_fetch_all_features("team-*", account_id)') gave me results. And the results were very, very unexpected.

The first profile returned in 11.935 seconds. The first thing that leapt out was this code was spending an extraordinary amount of time in some class called AttrDict :

ncalls tottime percall cumtime percall filename:lineno(function) ... 11825 0.049 0.000 0.060 0.000 attrdict.py:26(__getattr__)

384912/432 1.762 0.000 7.012 0.016 attrdict.py:39(deepcopy)

454032/64800 0.650 0.000 6.943 0.000 attrdict.py:49(<genexpr>)

23650 0.023 0.000 0.023 0.000 attrdict.py:58(__setstate__)

…roughly 59% of the entire time spent was in AttrDict . Whoa. Did not expect. The other item that was fishy involved a library called pickle :

46859 0.178 0.000 0.347 0.000 pickle.py:1033(load_dict)

3 0.000 0.000 0.000 0.000 pickle.py:1093(load_global)

3 0.000 0.000 0.000 0.000 pickle.py:1128(find_class)

11825 0.032 0.000 0.088 0.000 pickle.py:1135(load_reduce)

225226 0.497 0.000 0.797 0.000 pickle.py:1156(load_get)

106928 0.180 0.000 0.258 0.000 pickle.py:1170(load_put)

175038 0.393 0.000 0.626 0.000 pickle.py:1199(load_setitem)

11825 0.091 0.000 0.271 0.000 pickle.py:1217(load_build)

58684 0.081 0.000 0.117 0.000 pickle.py:1256(load_mark)

1 0.000 0.000 0.000 0.000 pickle.py:1260(load_stop)

1 0.011 0.011 4.788 4.788 pickle.py:1386(loads)

1 0.000 0.000 0.000 0.000 pickle.py:83(__init__)

1 0.000 0.000 0.000 0.000 pickle.py:835(__init__)

1 1.183 1.183 4.777 4.777 pickle.py:851(load)

58684 0.094 0.000 0.132 0.000 pickle.py:876(marker)

88873 0.230 0.000 0.348 0.000 pickle.py:917(load_int)

2 0.000 0.000 0.000 0.000 pickle.py:967(load_string)

36414 0.116 0.000 0.168 0.000 pickle.py:985(load_unicode)

11825 0.027 0.000 0.056 0.000 pickle.py:999(load_tuple)

The remaining 40% was spent in pickle , which was some library for serializing/deserializing python objects.

I was a bit stunned at this point; not a single slow spot involved database code, which meant this was purely some algorithmic problem in python. And for some reason, it was slow when there were more teams.

Those Still Waters Run Deep

I decided the first thing I’d go after was AttrDict.deepcopy , since that was the slow spot. This also turned out to be a poor assumption, because I didn’t fully understand what this code was doing and why, but we’ll get to that shortly.

AttrDict looked like so (I’ve removed fluff):

class AttrDict(dict):

"""

Dict that allows attribute access

"""

def __getattr__(self, name):

try:

return self[name]

except KeyError:

return super(AttrDict, self).__getattribute__(name)



def __setattr__(self, name, value):

self[name] = value



@classmethod

def deepcopy(cls, other):

if isinstance(other, basestring):

return other

elif isinstance(other, collections.Sequence):

assert isinstance(other, list)

return [AttrDict.deepcopy(item) for item in other]

elif isinstance(other, collections.Mapping):

assert isinstance(other, (AttrDict, dict))

return AttrDict(

(key, AttrDict.deepcopy(value))

for key, value in other.iteritems()

)

else:

return other

So AttrDict was basically a python dict with some flair. One piece of flair was the ability to access the dictionary members with a dot syntax similar to javascript. So instead of some_attrdict['blah'] , you could write some_attrdict.blah . This code was likely ripped off from stack overflow somewhere, since whoever wrote this wasn’t the first person to desire this functionality.

The second part was deepcopy , the offending code. At first glance, it took some object called other and would recursively create a deep copy of that. One interesting side-effect of this was supplied dict objects would be turned into AttrDict .

The other interesting part was: this code was using isinstance , and abusively. isinstance is pricy, because it needs to check an object’s inheritance hierarchy to see if the supplied object — or some child class — is of the supplied type.

There was also some pointless asserting (footnote: our asserts run in production. None of these had fired as far back as I had log history for, so they are worthless), and a nonsensical if/else structure.

So win. I can rewrite this to improve performance and that may help. I changed to:

@classmethod

def deepcopy(cls, other):

other_type = type(other)

if other_type in [dict, AttrDict]:

return AttrDict(

(key, AttrDict.deepcopy(value))

for key, value in list(other.items())

)

elif other_type is list:

return [AttrDict.deepcopy(item) for item in other]

else:

return other

…there was no need to check for string types because they followed the same path as the else path: return other .

And rather than use isinstance each time, use type once and check for specific types I knew were supported. This change was not without risk: it meant I could no longer properly handle inherited objects like the code previously did. To mitigate this, I confirmed this updated code delivered the same results in production.

After configuring prod to re-run this code so I could get a clean profile, the bulk of AttrDict now consumed roughly two seconds. Not great, but acceptable. Given this was less than the other offending chunk of code, I decided to move on to pickle .

This Must Be The Pickle

First, I found where we were using pickle. There was a chunk of code that would fetch account features, that cached them to avoid regenerating:

def _all_features(account_id):

cache_key = _get_cache_key(account_id)

account_features = cache.get(cache_key)

if account_features:

try:

account_features = AttrDict(pickle.loads(zlib.decompress(account_features)))

except (...):

...

else:

return account_features



account_features = _split_features(_query_features(account_id))

cache.set(cache_key, zlib.compress(pickle.dumps(account_features)))

return account_features

A basic overview:

Try to fetch features out of a cache. If they exist, decompress with zlib, hydrate them with pickle, and return those. If they don’t exist, re-generate them, dehydrate with pickle, compress with zlib, save to the cache and return them.

After reading this code, I had more questions than answers. A few thoughts/ideas that went through my head:

The whole reason we were using pickle was definitely because of AttrDict . At some point, someone decided we needed to cache this stuff. But because we were using AttrDict , we needed a way to serialize/deserialize python objects. The use of zlib was odd. After consulting a coworker, I learned this was because a cache object had a 1 MB limit, and for accounts with many teams, we had exceeded that limit. So zlib compression was added as a quick stop-gap measure.

The second part kinda floored me; what the heck was pickle persisting that would bump into a 1 MB limit? Also, if all we were ultimately storing was dict/list objects, what did it matter if we used pickle ? We could serialize with json .

I did some quick terminal hacking to see how JSON would compare:

In [35]: json_compressed = zlib.compress(json.dumps(account_features)) In [36]: pickle_compressed = zlib.compress(pickle.dumps(account_features)) In [37]: len(pickle_compressed)

Out[37]: 642794 In [38]: len(json_compressed)

Out[38]: 50585

…goodness gracious. The compressed payload with pickle is twelve times the size. But it gets worse:

In [41]: start = time.clock()

...: json_compressed = zlib.compress(json.dumps(account_features))

...: print time.clock() - start

...:

0.035571 In [42]: start = time.clock()

...: pickle_compressed = zlib.compress(pickle.dumps(account_features))

...: print time.clock() - start

...:

1.201328

…working with a JSON payload was exceptionally faster — roughly 35x for hydrating/dehydrating the payload. Swapping from pickle to json alone would solve this problem.

But one major snag: when taking an AttrDict , dumping to JSON, compressing, decompressing and then loading with JSON, we would lose AttrDict entirely — it’d come back as dict , not AttrDict . And updating vast swaths of our codebase that used AttrDict was out of the question. I’d never make it through QA in a reasonable amount of time, nor would I be able to have confidence in such a massive change.

And then I remembered deepcopy . I could do this:

account_features = AttrDict.deepcopy(json.loads(zlib.decompress(account_features)))

…when rehydrating, and that would automatically convert dict to AttrDict . Problem solved. And adding one additional deepcopy operation was irrelevant, even though that was the previous offending code. We were already doing hundreds of them (see profile above) and this codepath was called only once.

Edit: it was pointed out that rather than using deepcopy , it would be preferable to use json.load 's object_hook feature:

json.loads( zlib.decompress(account_features), object_hook=AttrDict)

…this was roughly 5x better than the deepcopy strategy. Thank you!

Except Not Solved :/

At this point, it was lightning fast for accounts with small team counts, and 2–3 seconds for accounts with 500ish teams. So I created an account with a thousand teams to get a sense of performance.

I was right back at ten seconds for team-* .

It was this moment that I put it together; my problem with this endpoint wasn’t bad python code, although that was a contributing factor. It was that I had a big-O problem. If 500 took me 2.5 seconds, and a 1000 took me 10 seconds, that mapped almost perfectly to O(n^2) . The underlying algorithmic problem still existed, and it grew with the addition of teams.

Unfortunately, at this point I had to acknowledge one other frustrating reality of engineering. My improvements were already good enough to get this work through QA. We didn’t have any accounts with more than 500 teams. I asked our QA people to create a new defect for 1000 teams to track the issue moving forward, and closed out the bug ticket.

As much as I wanted to figure it out and fix it, perfect is the enemy of good enough. Adios, amigo.

Git History, A Few Take-Aways, A Few Random Musings

I did, however, dig through the git history to figure out what the heck the deal was with AttrDict and pickle :

It was written five years prior by a junior developer. The intent was definitely just convenience — the ability to do some_attr_dict.blah instead of some_attr_dict['blah'] . The bad if/else logic and nonsensical asserting also made more sense given a junior developer. At this time, however, the codebase was small and there probably was little consequence to AttrDict A few days later, a more senior developer added the ability to pickle this object so we could start caching it. The caching was added to address performance issues. It’s unclear what those performance issues were; there was no clear history on why this was done. At some point, when the cache exploded as we gained accounts with more and more teams, some other developer added zlib compression to the caching. I can’t fault this decision; it was an emergency fix when stuff started exploding in production.

My takeaways from all of this:

I think adding a wrapper around basic python types like dict and list should be avoided. That decision had serious consequences down the line, for what appears to be nothing more than syntactic sugar. It is unfortunate the senior developer didn’t push the junior developer towards simplicity for the sake of simplicity. Although, that brings up: It is unfortunate our code review process did not exist at the time AttrDict was written. The original code is the sort of stuff I would have asked be rewritten or justified. It is unfortunate the performance issues that warranted pickle weren’t clearly communicated in git history, or some linked defect in an issue tracker. It makes me understand why having clear guidelines around commit messages is so vitally important for large, long-term codebases. This is an interesting study in the long-term consequences of a programming choice. It is clear to me this is one of those places where “clever” code had serious consequences and the business would have been better off without it. It is clear to me as a senior developer, it is important to mentor junior developers towards the simplest thing that could possibly work.

And some other random, final musings: