Python's concurrent.futures

In this essay I'll describe how to use the concurrent.futures API from Python 3.2. Since I'm still using Python 2.7, I'll use Alex Grönholm's backport instead.

PEP 3148 gives the motivation for the new concurrent module:

Python currently has powerful primitives to construct multi-threaded and multi-process applications but parallelizing simple operations requires a lot of work i.e. explicitly launching processes/threads, constructing a work/results queue, and waiting for completion or some other termination condition (e.g. failure, timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.

Basically, using "threading" and "multiprocessing" is harder than it should be.

The guiding problem: analyze web logs

My web site archives the daily server logs. Filenames are of the form "www.20120115.gz". Each access is a single line in "combined log format." Here's an example line:

198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"

It contains the host IP address, date, URL path, referrer information, user agent, and a few more fields.

I have 169 files which I want to analyze. "gzcat *.gz | wc -l" says there are 1,346,595 records. I'll use this data set to show some examples of how to use concurrent.futures.

Number of accesses per day (single-threaded)

For the start, how many log events are there per day?

    import glob
    import gzip

    for filename in glob.glob("www_logs/www.*.gz"):
        with gzip.open(filename) as f:
            num_lines = sum(1 for line in f)
        print filename.split(".")[1], num_lines

Note: gzip files didn't support context managers until Python 2.7. If you are on Python 2.6 then you'll get the error message:

    AttributeError: GzipFile instance has no attribute '__exit__'

When I run that I get output which looks like:

    20110801 7305
    20110802 7594
    20110803 7470
    20110804 7348
    20110805 7504
    20110806 4774
    20110807 4870
    20110808 9815
    ...
    20120113 18124
    20120114 9245
    20120115 8100
    20120116 14117

That's too detailed, and hard to interpret. A graph would be nicer. Here it is:

I seem to get more people during the work week than the weekend, and one of my other essays got on Hacker News in early January.

I made that plot using the matplotlib's "pylab" API:

    import glob
    import gzip
    from pylab import *
    import datetime

    dates = []
    counts = []
    for filename in glob.glob("www_logs/www.*.gz"):
        with gzip.open(filename) as f:
            num_lines = sum(1 for line in f)
        date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
        dates.append(date)
        counts.append(num_lines)

    plot(dates, counts)
    ylim(0, max(counts))
    title("My website accesses")
    show()

That code is a bit ugly, so I'll clean it up a bit and conveniently put it into a form which helps transition to the parallelization code:

    import glob
    import gzip
    import datetime
    import time

    def count_lines(filename):
        with gzip.open(filename) as f:
            num_lines = sum(1 for line in f)
        date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
        return (date, num_lines)

    filenames = glob.glob("www_logs/www.*.gz")

    dates = []
    counts = []
    for filename in filenames:
        date, count = count_lines(filename)
        dates.append(date)
        counts.append(count)

    ## Believe it or not, this next line does the same as the previous block(!)
    # dates, counts = zip(*(count_lines(filename) for filename in filenames))

    from pylab import *
    plot(dates, counts)
    ylim(0, max(counts))
    title("My website accesses")
    show()

It's slow. Make it faster!

That code takes 5.5 seconds to read the 1.3 million lines. I have a four core machine - surely I can make better use of my hardware!
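If you aren't sure how many cores your own machine has, the standard library will tell you (this is just a quick aside; the multiprocessing module has had cpu_count() since Python 2.6):

```python
import multiprocessing

# Number of CPU cores the OS reports for this machine
num_cores = multiprocessing.cpu_count()
print(num_cores)
```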

I'll start with multiple threads. Python has supported threads since the 1990s, but as we all know, CPython has the Global Interpreter Lock which prevents multiple threads from running Python code at the same time. On the other hand, this task is doing file I/O, and gzip uncompression in code which might release the GIL. Perhaps threads will work here?

I'll use a very standard approach. I'll define a set of jobs, and pass that over to a thread pool. Each job takes a filename to process as input, calls the function "count_lines", and returns the timestamp and number of lines in the file.

Here's how you do that with the concurrent.futures API:

    import glob
    import gzip
    import datetime
    from concurrent import futures

    def count_lines(filename):
        with gzip.open(filename) as f:
            num_lines = sum(1 for line in f)
        date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
        return (date, num_lines)

    filenames = glob.glob("www_logs/www.*.gz")

    dates = []
    counts = []
    with futures.ThreadPoolExecutor(max_workers=2) as executor:
        for (date, count) in executor.map(count_lines, filenames):
            dates.append(date)
            counts.append(count)

    from pylab import *
    plot(dates, counts)
    ylim(0, max(counts))
    title("My website accesses")
    show()

The "ThreadPoolExecutor" creates a thread pool, in this case with two workers. You can submit as many jobs as you want to this thread pool, but only two (in this case) will be processed at a time. The thread pool is also a context manager, and no more jobs can be submitted once the context is finished.

How are jobs submitted? You can either submit a single job using "submit()", or submit a number of jobs using the "map()" idiom, which is what I did here. Remember, this is a Python 3.x API, so map() returns an iterator, and not a list like it does in Python 2.x.
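As a quick sketch of the two submission styles, using ord() as a stand-in for a real worker function:

```python
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    # map(): submit many jobs at once; results come back in input order
    results = list(executor.map(ord, "Andrew"))

    # submit(): submit one job; the call returns a Future immediately
    future = executor.submit(ord, "A")
    single = future.result()   # blocks until this one job finishes

print(results)   # [65, 110, 100, 114, 101, 119]
print(single)    # 65
```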

What is "map"?

The term "map" comes from functional programming, but functional programming is not emphasized in the Python language. Instead, we more often use a list or generator comprehension, or build a list manually. The following three methods are equivalent:

    >>> print [ord(c) for c in "Andrew"]
    [65, 110, 100, 114, 101, 119]

    >>> print map(ord, "Andrew")
    [65, 110, 100, 114, 101, 119]

    >>> result = []
    >>> for c in "Andrew":
    ...     result.append(ord(c))
    ...
    >>> print result
    [65, 110, 100, 114, 101, 119]

So "map(count_lines, filenames)" is roughly the same as:

    for filename in filenames:
        yield count_lines(filename)

and "executor.map" does the same thing, only it uses a thread in the thread pool to evaluate the function.

Also, to turn the above code into its almost exact single-threaded equivalent, you can use the Python 2.x iterator version of "map" (itertools.imap) and rewrite the above as:

    import itertools
    ...
    for (date, count) in itertools.imap(count_lines, filenames):
        dates.append(date)
        counts.append(count)

But is it faster?

No. ;)

With one thread in the thread pool, the task takes 5.5 seconds. The overall time is unchanged from the unthreaded version, as we should expect.

With two worker threads, it takes 7.0 seconds - even longer than with one thread!

With three worker threads it takes 7.3 seconds, and with four, 7.4 seconds. This is not a trend you want to see when you need to parallelize your software.

There are two likely candidates for the slowdown. The GIL is the obvious one, but perhaps my computer doesn't handle parallel disk I/O that well.

What about multiple processes?

What I'll do is switch from the multi-threaded version to the multi-processing version. Instead of using a thread pool, I'll have a process pool, which uses interprocess communications to send the job request to each process and get the results:

    import glob
    import gzip
    import datetime
    from concurrent import futures

    def count_lines(filename):
        with gzip.open(filename) as f:
            num_lines = sum(1 for line in f)
        date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
        return (date, num_lines)

    filenames = glob.glob("www_logs/www.*.gz")

    dates = []
    counts = []
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        for (date, count) in executor.map(count_lines, filenames):
            dates.append(date)
            counts.append(count)

    from pylab import *
    plot(dates, counts)
    ylim(0, max(counts))
    title("My website accesses")
    show()

Did you see the difference? I used a "ProcessPoolExecutor" instead of a "ThreadPoolExecutor".

With that small change, a process pool with only one worker finishes in 5.6 seconds, which is a bit slower. That's probably due to the overhead of starting a new process and sending data back and forth.

What's exciting is that two workers finish in 3.6 seconds, three workers in 2.8 seconds, and four workers in 2.6 seconds. It's obviously not great speedup (perfect scaling would be 5.5, 2.8, 1.8, and 1.4 seconds), but I end up cutting my time in half with relatively little work.

Faster, please

At this point it's safe to assume that most of the gzip+line count code runs while holding the GIL. A quick look at "gzip.py" tells me that, yes, that is the case.

With some non-trivial effort, I could write a specialized C extension to replace the gzip module. That's overkill for this project. Instead, my computer has the usual Unix utilities, so I'll rewrite the "count_lines" function and let them do the work instead.

    import subprocess

    def count_lines(filename):
        gzcat = subprocess.Popen(["gzcat", filename], stdout = subprocess.PIPE)
        wc = subprocess.Popen(["wc", "-l"], stdin = gzcat.stdout, stdout = subprocess.PIPE)
        num_lines = int(wc.stdout.readline())
        date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
        return (date, num_lines)

Using this version, my single-threaded time is 3.2 seconds, with two threads it's 2.0 seconds, three threads is 1.8 seconds, and four threads is 1.7 seconds.

The respective times with the process pool are 3.3 seconds, 2.1 seconds, 1.8 seconds and 1.8 seconds. This means that very little time in either of these cases is spent in the GIL, and the slightly slower multiprocess times likely reflect the extra cost of starting a process and doing interprocess communication (IPC).

What are the top URLs on my site?

Okay, I admit that the previous section was overkill, but it's fun sometimes to try out and compare different alternatives.

I want to mine my logs for more information. What are the top 10 most downloaded URLs?

This is the perfect situation for Python's Counter container. This was added in Python 2.7; see that link for how to support older versions of Python.
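For those who haven't used it, a Counter is a dict subclass for tallying. Missing keys start at zero, and most_common() reports the highest counts; a quick sketch:

```python
from collections import Counter

counter = Counter()
for word in ["spam", "egg", "spam", "spam"]:
    counter[word] += 1   # missing keys default to 0, so no setdefault dance

print(counter.most_common(1))   # [('spam', 3)]
```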

I'll start with the simplest single-threaded version; remember that a line in the log file looks like:

198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"

The following analysis code:

    import glob
    import gzip
    from collections import Counter

    counter = Counter()
    for filename in glob.glob("www_logs/www.*.gz"):
        with gzip.open(filename) as f:
            for line in f:
                # Extract the path field from the log string
                request = line.split('"')[1]
                path = request.split(" ")[1]
                counter[path] += 1

    for path, count in counter.most_common(10):
        print count, path

takes 8.9 seconds to generate this listing:

    170073 /favicon.ico
    93354 /writings/diary/diary-rss.xml
    81513 /dss.css
    78961 /images/toplogo_left.gif
    78655 /images/spacer.gif
    78526 /images/toplogo_right.gif
    74223 /images/news_title.gif
    26528 /
    25349 /robots.txt
    16962 /writings/NBN/python_intro/standard.css

That's really not exciting information. In a bit, I'll have it only display counts for the HTML pages.

A concurrent.futures version

We've determined that Python's gzip reader uses the GIL, so it's pointless to parallelize the above code using threads.

There's another issue. The "counter" is a global data structure, and that can't be shared across multiple Python processes. I'll have to update the algorithm somewhat. I'll let each worker function process a file and create a new counter for that file. Once it's done, I'll send the counter instance back to the main process for further processing.

Here's a worker function which does that.

    def count_urls(filename):
        counter = Counter()
        with gzip.open(filename) as f:
            for line in f:
                request = line.split('"')[1]
                path = request.split(" ")[1]
                counter[path] += 1
        return counter

The code in the main process has to kick off all of the jobs, collect the counters from each file, merge the counters into one, and report the top hits. The new(ish) Counter object helps make this easy because the "update()" method sums the values for shared keys instead of replacing like it would for a dictionary.
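Here's that difference in miniature: dict.update() replaces the value for a shared key, while Counter.update() adds to it:

```python
from collections import Counter

d = {"a": 1}
d.update({"a": 5})
print(d["a"])   # 5 - a plain dict replaces the old value

c = Counter({"a": 1})
c.update(Counter({"a": 5}))
print(c["a"])   # 6 - a Counter sums the values
```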

    merged_counter = Counter()
    filenames = glob.glob("www_logs/www.*.gz")
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        for counter in executor.map(count_urls, filenames):
            merged_counter.update(counter)

    for path, count in merged_counter.most_common(10):
        print count, path

(You might be asking "How does it exchange Python objects?" Answer: through pickles.)
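That is, each Counter a worker returns is serialized in the worker process and reconstructed in the main process. Roughly, and leaving out the pipe plumbing, the round trip looks like:

```python
import pickle
from collections import Counter

counter = Counter({"/index.html": 3})

# Serialize in the worker, deserialize in the main process
data = pickle.dumps(counter)      # the bytes that cross the process boundary
restored = pickle.loads(data)     # an equal Counter, rebuilt on the other side

print(restored == counter)   # True
```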

The above runs in 4.4 seconds, about half the time of the single-process version. And after I fixed a bug (I used "counter" in my report, not "merged_counter"), I got identical values to the single-threaded version.

4.4 seconds is pretty good. As we saw before, Python's gzip reader is not as fast as calling out to gzcat, so I decided to use a Popen call instead. Also, I changed the code slightly so it only reports paths which end with ".html".

The final code runs in 3.2 seconds, and here it is:

    from collections import Counter
    from concurrent import futures
    import glob
    import gzip
    import itertools
    import subprocess

    def count_urls(filename):
        counter = Counter()
        p = subprocess.Popen(["gzcat", filename], stdout = subprocess.PIPE)
        for line in p.stdout:
            request = line.split('"')[1]
            path = request.split(" ")[1]
            if path.endswith(".html"):
                counter[path] += 1
        return counter

    filenames = glob.glob("www_logs/www.*.gz")

    merged_counter = Counter()
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        for counter in executor.map(count_urls, filenames):
            merged_counter.update(counter)

    for path, count in merged_counter.most_common(10):
        print count, path

It tells me that the 10 most popular HTML pages from my site are:

    15830 /Python/PyRSS2Gen.html
    13722 /writings/NBN/python_intro/command_line.html
    11739 /writings/NBN/threads.html
    10663 /writings/NBN/validation.html
    6635 /writings/diary/archive/2007/06/01/lolpython.html
    4525 /writings/NBN/writing_html.html
    3756 /writings/NBN/generators.html
    3465 /writings/NBN/parsing_with_ply.html
    2958 /writings/diary/archive/2005/04/21/screen_scraping.html
    2786 /writings/NBN/blast_parsing.html

Resolving host names from IP addresses

My logs contain bare IP addresses. I'm curious about where they come from. I write about cheminformatics; are any computers from pharma companies reading my pages? To find out, I need a fully qualified domain name for each IP address. Moreover, I want to save the IP address to domain name mapping so I can use it in other analyses.

Here's how to get the FQDN given an IP address as a string.

    >>> import socket
    >>> socket.getfqdn("82.94.164.162")
    'dinsdale.python.org'

DNS lookups take a surprisingly long time; 0.2 seconds on my desktop, and I understand this is typical. Since I have 117,504 addresses, doing them one at a time would take over six hours. On the other hand, all of that time is spent waiting for the network to respond. This is easily parallelized.
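The back-of-envelope arithmetic for the serial case:

```python
# 117,504 addresses at roughly 0.2 seconds each, one at a time
total_seconds = 117504 * 0.2
print(total_seconds / 3600.0)   # about 6.5 hours
```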

socket.getfqdn() is single-threaded on a Mac

I tried at first to use multiple threads for this, but that didn't work. No matter how many threads I used, the overall time was the same. After a wild goose chase where I suspected that my ISP throttled the number of DNS lookups, I found the problem.

The getfqdn function is a thin wrapper to socket.gethostbyaddr(), which itself is a thin layer on top of the C function "gethostbyaddr()". In most cases, the underlying API may only be called from a single thread. A common solution is to implement a reentrant version, usually named "gethostbyaddr_r", but the OS X developers decided that people should use a different API for that case. ("getaddrinfo ... is a replacement for and provides more flexibility than the gethostbyname(3) and getservbyname(3) functions".) The Python module only calls the single-threaded code, and uses a lock to ensure that only one thread calls it at a time.

The problem is easily solved by using a process pool instead of a thread pool.

Extracting IP addresses from a set of gzip compressed log files

The first step is to get the IP addresses which I want to convert. I only care about unique IP addresses, and don't want to waste time looking up duplicates. The code to extract the IP addresses is straightforward. Reading the compressed file is not the slow part, so there's no reason to parallelize this or use an external gzip process to speed things up.

    def get_unique_ip_addresses(filenames):
        # Report only the unique IP addresses in the set
        seen = set()
        for filename in filenames:
            with gzip.open(filename) as gzfile:
                for line in gzfile:
                    # The IP address is the first word in the line
                    ip_addr = line.split()[0]
                    if ip_addr not in seen:
                        seen.add(ip_addr)
                        yield ip_addr

I don't want to process the entire data set during testing and debugging. The above returns an iterator, so I use itertools.islice to get a section of 100 terms, also as an iterator:

    filenames = glob.glob("www_logs/www.*.gz")
    ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)

(I started with the range (0, 1000), but then ran into the gethostbyaddr reentrancy problems. I didn't want my computer to do a simple local cache lookup, so I changed the range to (1000, 1100), then (1100, 1200), and so on. This shows that it took me a while to figure out what was wrong!)

Using "executor.submit()" instead of "executor.map"

How am I going to do the parallel call? I could do a simple

    with ProcessPoolExecutor(max_workers=20) as executor:
        for fqdn in executor.map(socket.getfqdn, ip_addresses):
            print fqdn

but then I lose track of the original IP address, and I wanted to cache the IP address to FQDN mapping for later use. While it might be possible to use a combination of itertools.tee and itertools.izip, I decided that "map" wasn't the right call in the first place.
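For the curious, that tee-based approach would look something like the following sketch. It works because executor.map() returns results in input order, so the duplicated iterator stays in step with the results. (I use a hypothetical fake_fqdn stand-in rather than socket.getfqdn so it doesn't hit the network, a thread pool for brevity, and izip falls back to the builtin zip on Python 3.)

```python
import itertools
from concurrent import futures

try:
    from itertools import izip   # Python 2
except ImportError:
    izip = zip                   # Python 3

def fake_fqdn(ip_addr):
    # Stand-in for socket.getfqdn, so the sketch stays off the network
    return "host-" + ip_addr

ip_addresses = iter(["10.0.0.1", "10.0.0.2"])

# tee() duplicates the iterator: one copy feeds executor.map(), the
# other supplies the matching input for each in-order result.
inputs, to_map = itertools.tee(ip_addresses)

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    for ip_addr, fqdn in izip(inputs, executor.map(fake_fqdn, to_map)):
        print(ip_addr + " -> " + fqdn)
```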

The executor's "map" function guarantees that the result order will be the same as the input order. I don't care about the order. Instead, I'll submit each job using the "submit()" method.

    with futures.ProcessPoolExecutor(max_workers=10) as executor:
        jobs = []
        for ip_addr in ip_addresses:
            job = executor.submit(resolve_fqdn, ip_addr)
            jobs.append(job)

The submit function returns a "concurrent.futures.Future" object. For now, there are two important things about it. You can ask it for its "result()", like this:

    ip_addr, fqdn = job.result()

The "result()" method blocks until the future has a result. Blocking is bad for performance, so how do you know which job futures are actually ready? Use "concurrent.futures.as_completed()" for that:

    for job in futures.as_completed(jobs):
        ip_addr, fqdn = job.result()
        print ip_addr, fqdn

The last part of this puzzle is to have the actual job return a two-element tuple with both the input IP address and the resulting FQDN:

    def resolve_fqdn(ip_addr):
        fqdn = socket.getfqdn(ip_addr)
        return ip_addr, fqdn

Put it all together and the code is:

    from concurrent import futures
    import glob
    import gzip
    import itertools
    import socket
    import time

    def get_unique_ip_addresses(filenames):
        # Report only the unique IP addresses in the set
        seen = set()
        for filename in filenames:
            with gzip.open(filename) as gzfile:
                for line in gzfile:
                    # The IP address is the first word in the line
                    ip_addr = line.split()[0]
                    if ip_addr not in seen:
                        seen.add(ip_addr)
                        yield ip_addr

    def resolve_fqdn(ip_addr):
        fqdn = socket.getfqdn(ip_addr)
        return ip_addr, fqdn

    filenames = glob.glob("www_logs/www.*.gz")
    ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)

    with futures.ProcessPoolExecutor(max_workers=20) as executor:
        jobs = []
        for ip_addr in ip_addresses:
            job = executor.submit(resolve_fqdn, ip_addr)
            jobs.append(job)

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):
            ip_addr, fqdn = job.result()
            print ip_addr, fqdn

This processes 100 IP addresses in about 2-8 seconds. The actual time is highly dependent on DNS response times from servers around the world. To reduce the variability, I increased the number of IP addresses I used for my measurements. I found that with 20 processes I could do about 50 lookups per second, and with 50 processes I could do about 90 lookups per second. I didn't try a higher number.

Use a dictionary of futures instead of a list

What I did seems somewhat clumsy in that I send the IP address to the process, and the process sends the IP address back to me. I did that because it was easy. The module documentation shows another technique.

You can keep the jobs in a dictionary, where the key is the future object (returned by "submit()"), and its value is the information you want to save. That is, you can rewrite the above as:

    with futures.ProcessPoolExecutor(max_workers=20) as executor:
        jobs = {}
        for ip_addr in ip_addresses:
            job = executor.submit(socket.getfqdn, ip_addr)
            jobs[job] = ip_addr

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):
            ip_addr = jobs[job]
            fqdn = job.result()
            print ip_addr, fqdn

Notice how it doesn't need the "resolve_fqdn" function; it can call socket.getfqdn directly.

Add a callback to the job future

The conceptual model so far is "create all the jobs" followed by "do something with the results." This works well, except for latency. I only processed 100 IP addresses in my example. I removed the "islice()" call and asked it to process all 117,504 IP addresses in my data set. The code looked like it wasn't working because it wasn't giving output. As it turned out, it was still loading all of the jobs.

The concurrent module uses an asynchronous model, and just like Twisted's Deferred and jQuery's deferred.promise(), there's a way to attach a callback function to a future, which will be called once the answer is ready. Here's how it works:

    with futures.ProcessPoolExecutor(max_workers=50) as executor:
        for ip_addr in ip_addresses:
            job = executor.submit(resolve_fqdn, ip_addr)
            job.add_done_callback(print_mapping)

When each job future is ready, the concurrent library will call the "print_mapping" callback, with the completed future as its sole parameter:

    def print_mapping(job):
        ip_addr, fqdn = job.result()
        print ip_addr, fqdn

Technical notes: the callback occurs in the same process which submitted the job, which is exactly what's needed here. However, the documentation doesn't say that all of the callbacks will be done from the same thread, so if you are using a thread pool then you probably want a thread lock around any shared resource. (sys.stdout is a shared resource, so you would need one around the print statement here. I'm using a process pool, and the concurrent process pool implementation uses a single local worker thread, so I don't think I have to worry about contention. You should verify that.)

Here is the final callback-based code:

    from concurrent import futures
    import glob
    import gzip
    import socket

    def get_unique_ip_addresses(filenames):
        # Report only the unique IP addresses in the set
        seen = set()
        for filename in filenames:
            with gzip.open(filename) as gzfile:
                for line in gzfile:
                    # The IP address is the first word in the line
                    ip_addr = line.split()[0]
                    if ip_addr not in seen:
                        seen.add(ip_addr)
                        yield ip_addr

    def resolve_fqdn(ip_addr):
        fqdn = socket.getfqdn(ip_addr)
        return ip_addr, fqdn

    ## A multi-threaded version should create a resource lock
    # import threading
    # write_lock = threading.Lock()

    def print_mapping(job):
        ip_addr, fqdn = job.result()
        print ip_addr, fqdn
        ## A multi-threaded version should use the resource lock
        # with write_lock:
        #     print ip_addr, fqdn

    filenames = glob.glob("www_logs/www.*.gz")

    with futures.ProcessPoolExecutor(max_workers=50) as executor:
        for ip_addr in get_unique_ip_addresses(filenames):
            job = executor.submit(resolve_fqdn, ip_addr)
            job.add_done_callback(print_mapping)

It processed my 117,504 addresses in 1236 seconds (about 21 minutes), which is a rate of 95 per second. That's much better than my original rate of 5 per second!

functools.partial

By the way, just like earlier, there's no absolute need for the worker function to return the ip address. I could have written this as:

    job = executor.submit(socket.getfqdn, ip_addr)
    job.add_done_callback(functools.partial(print_mapping, ip_addr))

or even as an ugly-looking lambda function with a default value to get around scoping issues.
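That lambda variant would look something like the following sketch. The ip_addr=ip_addr default freezes the current value for each callback; without it, every lambda would see the loop variable's final value. (fake_fqdn is a hypothetical stand-in for socket.getfqdn so this stays off the network, and I use a thread pool for brevity.)

```python
from concurrent import futures

def fake_fqdn(ip_addr):
    # Stand-in for socket.getfqdn, so the sketch stays off the network
    return "host-" + ip_addr

def print_mapping(ip_addr, job):
    print(ip_addr + " " + job.result())

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    for ip_addr in ["10.0.0.1", "10.0.0.2"]:
        job = executor.submit(fake_fqdn, ip_addr)
        # The default argument captures this iteration's ip_addr
        job.add_done_callback(lambda job, ip_addr=ip_addr: print_mapping(ip_addr, job))
```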

In this variation, print_mapping becomes:

    def print_mapping(ip_addr, job):
        fqdn = job.result()
        print ip_addr, fqdn

where the "ip_addr" was stored by the "partial()", and where "job" comes from the completed future.

This approach feels more "pure", but I find that methods like this are harder for most people to understand.

Who subscribes to my blog's RSS feed?

A quick check of the list of hostnames shows that no one from AstraZeneca reads my blog from a work machine. Actually, I don't have any accesses from them at all, which is a bit surprising since I know some of them follow what I do. They might use a blog aggregator like Google Reader, or use a home account, or perhaps AZ's data goes through a proxy which doesn't have the name "az" or "astrazeneca" in it.

There are requests from Roche, and Vertex, but no blog subscribers. Who then subscribes to my blog?

Here I print the hostnames for requests which fetch my blog's RSS feed. With 'zgrep' it's fast enough that I'm not going to parallelize the code.

    import subprocess
    import glob

    hostname_table = dict(line.split() for line in open("hostnames"))

    filenames = glob.glob("www_logs/www.*.gz")
    p = subprocess.Popen(["zgrep", "--no-filename", "/writings/diary/diary-rss.xml"] + filenames,
                         stdout = subprocess.PIPE)
    for line in p.stdout:
        ip_addr = line.split()[0]
        hostname = hostname_table[ip_addr]
        if hostname == ip_addr:
            # Couldn't find a reverse lookup; ignore
            continue
        print hostname

A quick look at the output shows a lot of requests from Amazon and Google, so I removed those, and report the results using:

    % python2.7 readers.py | sort | uniq -c | sort -n | grep -v amazon | grep -v google.com

Since I have 169 days of log files, I'll say that "avid readers" poll the URL at least once per day. That gives me:

    176 modemcable139.154-178-173.mc.videotron.ca
    184 62.197.198.100
    187 v041222.dynamic.ppp.asahi-net.or.jp
    195 modemcable147.252-178-173.mc.videotron.ca
    200 94-226-195-151.access.telenet.be
    202 217.28.199.236
    223 123.124.21.91
    223 65.52.56.128
    241 5a-m02-d1.data-hotel.net
    263 117.218.210-67.q9.net
    263 173-11-122-218-sfba.hfc.comcastbusiness.net
    274 71-222-225-175.albq.qwest.net
    332 modemcable069.85-178-173.mc.videotron.ca
    335 no-dns-yet.convergencegroup.co.uk
    337 ip-81-210-146-57.unitymediagroup.de
    338 embln.embl.de
    353 cpe-72-183-122-94.austin.res.rr.com
    365 90-227-178-245-no128.tbcn.telia.com
    370 138.194.48.143
    408 210.96-246-81.adsl-static.isp.belgacom.be
    428 k8024-02l.mc.chalmers.se
    490 cpe-70-115-243-212.satx.res.rr.com
    493 www26006u.sakura.ne.jp
    527 219.239.34.54
    534 44.186.34.193.bridgep.com
    535 5a-m02-d2.data-hotel.net
    586 5a-m02-c6.data-hotel.net
    666 168-103-109-30.albq.qwest.net
    676 y236106.dynamic.ppp.asahi-net.or.jp
    698 82-169-211-97.ip.telfort.nl
    759 w-192.cust-7150.ip.static.uno.uk.net
    1071 adsl-75-23-68-58.dsl.peoril.sbcglobal.net
    1217 hekate.eva.mpg.de
    1223 211.103.236.94
    1307 li147-78.members.linode.com
    1342 static24-72-40-170.r.rev.accesscomm.ca
    1346 dinsdale.python.org
    1398 145.253.161.126
    1771 pat1.orbitz.net
    2060 artima.com
    3164 148.188.1.60
    3767 90-224-169-87-no128.tbcn.telia.com
    4518 jervis.textdrive.com
    5791 cpe-70-114-252-25.austin.res.rr.com
    6171 ip21.biogen.com
    8264 it18689.research.novo.dk
    10919 61.135.216.104

I know who comes from one of the Max Planck Institute machines, and a big "hello!" to the readers from Novo Nordisk and Biogen - thanks for subscribing to my blog! "dinsdale.python.org" is the Planet Python aggregator, and artima.com is the Artima Developer Community, another aggregator.

I know more than 60 people read my blog posts within 12 hours of when they are posted, so this tells me that most people read blogs through a web-based aggregator (like Planet Python or Google Reader), and not through a program running on their desktop. I'm glad to know I'm not alone in doing that!

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me