Python stdlib gems: collections.Counter

Here's a toy problem. Given a corpus of phone numbers for different countries determine a most prevalent display format in each country and use it to re-format an arbitrary phone number for its country. For example, if most US numbers in our data corpus are written like xxx-xxx-xxxx then the string (206) 1234567 should be converted to 206-123-4567 .

For simplicity, let's assume that all numbers are local so we don't have to deal with complexity of international prefixes.

Conversion

The actual conversion is not what this post is about, but for completeness here's a pair of complementary functions, one determining a pattern, and the other formatting a phone number according to it:

import re def get_pattern(phone): pattern = re.sub(r'\s+', ' ', phone) # normalize whitespace return re.sub(r'\d', '{}', pattern) # replace digits with placeholders def format(pattern, phone): return pattern.format(*re.findall(r'\d', phone))

(Error handling omitted for clarity.)

Straightforward solution

The usual approach is to use a dict-like structure that would hold patterns as keys and count how often they appear in the source database, so the result is going to look something like this:

{ 'US': { '({}{}{}) {}{}{}-{}{}{}{}': 1, '{}{}{}-{}{}{}-{}{}{}{}': 2, }, 'GB': { '{}{}{} {}{}{} {}{}{}{}': 1, }, 'RU': { '({}{}{}) {}{}{}-{}{}-{}{}': 1, }, }

(Don't mind the numbers, it's a small database!)

As far as I have seen, people usually turn to defaultdict for something like this. It automatically initializes values of absent keys on first access which lets you simply do result[key] += 1 every time. In our case we need a more complicated thing: a defaultdict for countries which would initialize keys with nested defaultdict s for counts. Overall it might look like this:

from collections import defaultdict stats = defaultdict(lambda: defaultdict(int)) # nested defaultdicts for country, phone in DB: pattern = get_pattern(phone) stats[country][pattern] += 1 # Replace each country stats with the most common pattern for country, patterns in stats.items(): pattern = max(patterns.items(), key=lambda i: i[1])[0] stats[country] = pattern

Counter

Here's a rule of thumb: whenever you see defaultdict(int) you can replace it with a Counter. It can do everything a defaultdict can, plus a few convenient tricks, like the .most_common() method which we can use instead of the ungainly max() expression from the previous example.

from collections import defaultdict, Counter stats = defaultdict(Counter) for country, phone in DB: pattern = get_pattern(phone) stats[country][pattern] += 1 for country, patterns in stats.items(): stats[country] = patterns.most_common(1)[0][0] # most_common would return # a list [(pattern, count)], so # we need this [0][0] here

Flat and functional

Let me offer you another solution, which uses Counter magic even more. I am also going to replace for-loops with a functional approach, so it won't be so embarrassingly imperative :-)

Here's the code, followed by an explanation:

from collections import Counter def determine_patterns(db): stats = Counter((country, get_pattern(phone)) for country, phone in db) stats = reversed(stats.most_common()) return {country: pattern for (country, pattern), _ in stats}

Look ma no loops!

Let me explain all that's happening here.

As it happens, Counter can consume a flat sequence of things and count them in one go: Counter(['a', 'b', 'a', 'c']) => {'a': 2, 'b': 1, 'c': 1} For this to work in our case we'll have to replace the nested data structure with a flat one. This can be done by simply gluing keys together in a tuple, replacing stats[country][pattern] with equivalent stats[(country, pattern)] (you can do it for any number of keys).

The expression (country, get_pattern(phone)) for country, phone in db looks like a list comprehension without square brackets. It's a generator expression which you can iterate over without constructing an actual list in memory. It is usually enclosed in parentheses ( .. ) but when it is a sole argument in a function call you can omit them and avoid ugly doubling like Counter(( .. )) .

stats.most_common() with no arguments returns the entire contents of the structure as a sorted list of pairs (key, count) : [ (('US', '{}{}{}-{}{}{}-{}{}{}{}'), 2), (('GB', '{}{}{} {}{}{} {}{}{}{}'), 1), (('US', '({}{}{}) {}{}{}-{}{}{}{}'), 1), (('RU', '({}{}{}) {}{}{}-{}{}-{}{}'), 1), ] It's sorted by count, with no regard for countries, but we don't care as you'll see later, as well as why we need to reverse it.

The last expression is a dict comprehension which walks over the sorted list, destructuring its nested pairs into individual variables using (country, pattern), _ which mimics the structure of one pair. _ is a conventional name for things we don't use, count in our case. So the sequence we actually fold into a final dict looks like this, with countries becoming keys and patterns becoming values: ('US', '{}{}{}-{}{}{}-{}{}{}{}'), ('GB', '{}{}{} {}{}{} {}{}{}{}'), ('US', '({}{}{}) {}{}{}-{}{}{}{}'), ('RU', '({}{}{}) {}{}{}-{}{}-{}{}'),

Here's the important thing: subsequent values with the same key replace the ones already in the dict. This is why we need to reverse the list, so the patterns with bigger counts replace least common ones as we go along. This is also why we don't care about countries being intermixed, as they won't affect each other. Admittedly, the last bit is the most unobvious here.

Is it better?

I don't really care, you can judge for yourself :-) I only wanted to demonstrate collections.Counter in action!

Working code as a bonus: phone-counter.py.