There’s been no shortage of pain for both the ABS and Australians trying to fill out the census. This article attempts to look at what may have caused this outage, and why.

Was the Census ‘hacked’?

No. The ABS has claimed the Census was the target of a Denial of Service (DOS) attack: an attempt to flood the available servers with fake requests, to the extent that legitimate users cannot connect to the server.

Was the Census attacked by foreigners?

It’s extremely unlikely. Currently, the census site is not accessible from outside Australia, which may be an attempt to fix the issues the site has been having. However, security experts have not seen the influx of traffic you would normally associate with a foreign-originating distributed denial of service (DDOS) attack.

Could the census site cope with the load?

There appear to be a few questionable decisions in the design of the Census site. First, it appears the census site isn’t using a content delivery network (CDN). Whether queried from Sydney or Melbourne, the census site appears to be served from Melbourne. It looks like IBM has rented some servers from “Nextgen Networks” and hosted the site there.

If a content delivery network were in place, requests from Sydney would be served by a host of smaller cache servers closer to the user. Without a CDN, it becomes harder and harder to scale a web site to serve many users at once. Further, the servers actually work harder the slower a user’s internet connection is, and the further that user is from the servers, because each connection must be held open for longer. This could potentially explain why a relatively small number of connections from outside Australia might initially look like a DOS attack.
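The effect of slow, distant clients can be sketched with Little’s Law: the number of connections a server must hold open at once is the request arrival rate multiplied by the average connection duration. The numbers below are purely illustrative, not measurements from the census site.

```python
def concurrent_connections(requests_per_second: float, avg_duration_s: float) -> float:
    """Little's Law: connections held open = arrival rate * average duration."""
    return requests_per_second * avg_duration_s

# A nearby user on a fast link might complete a request in ~0.1 s;
# a distant user on a slow link might hold the connection for ~2 s.
fast = concurrent_connections(1000, 0.1)   # 100.0 connections held open
slow = concurrent_connections(1000, 2.0)   # 2000.0 connections held open

print(fast, slow)
```

At the same request rate, the slow clients tie up twenty times as many server connections, which is why a modest amount of distant traffic can look disproportionately heavy.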

CDNs are cheap, effective, and in use by government already. They are a commoditised solution with many providers such as Akamai, Amazon Web Services and Section.io providing cheap, capable services. It’s hard to imagine why one wouldn’t be in use, especially when you consider the number of simultaneous connections they should be expecting, which brings me to the next question.

It passed ‘load testing’, so why would it fail if it wasn’t attacked?

The simple answer is that the ABS tested double the estimated average load, rather than double the estimated peak load. If your testing is predicated on census users gradually filling it out during the day in an orderly fashion, rather than directly after work or after their evening meal, then it’s obviously making the wrong assumptions. Most people work 9–5 and, at a guess, will be filling out the census between 18:00 and 22:00. That means the average over that period could exceed 2 million hits per hour, with an actual peak more like 4 million.
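The gap between “double the average” and the evening peak can be shown with some rough arithmetic. The household count, census window, and evening share below are my own illustrative assumptions, not the ABS’s actual figures.

```python
# Hypothetical sketch of why testing double the *average* load under-tests the peak.
households = 10_000_000          # rough number of Australian households (assumption)
census_window_hours = 24         # submissions spread evenly over census day

average_per_hour = households / census_window_hours   # ~417k submissions/hour
tested_capacity = 2 * average_per_hour                 # ~833k/hour tested

# But if most people submit in the four hours after work (18:00-22:00)...
evening_window_hours = 4
evening_share = 0.8              # say 80% of households submit in that window
evening_average = households * evening_share / evening_window_hours  # 2,000,000/hour
evening_peak = 2 * evening_average                     # ~4,000,000/hour

print(f"tested capacity:  {tested_capacity:,.0f}/hour")
print(f"evening average:  {evening_average:,.0f}/hour")
print(f"likely peak:      {evening_peak:,.0f}/hour")
```

Under these assumptions the real peak is roughly five times what was load-tested, without any attack at all.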

So, it cost $9.6M and didn’t do the job?

No! It cost more like $27M, and didn’t do the job. The $9.6M is just one section of IBM’s billing.

This table was posted to Linkedin by Matt Barrie, CEO of Freelancer.com:

Yes and no. If you are running a finite number of servers (behind and in front of load balancers, as panic set in over the course of last night), then it doesn’t make sense to buy or rent servers in one geographic location to serve a whole country. When you consider the hundreds of requests required to render one page, and multiply that by millions of page views, suddenly the round-trip time, the time spent fulfilling static requests, and the overhead of load balancing and of initiating and terminating connections become massive problems. As these exceed the ratings, tolerances, memory, and bandwidth available, the problems start to cascade.

Why wouldn’t you leverage a content delivery network that already has the infrastructure in place to route a huge percentage of the internet’s traffic seamlessly, rather than roll your own solution? Not only do you need to buy redundancy and overhead that you will never use again after the census (increasing cost), but you also have to hope that your staff in architecture, implementation, and operations are up to the task. For hundreds of thousands of dollars, companies like the above-mentioned CDNs, whose networks run the world’s news sites, sports sites, and top retail sites, would have made sure the site stayed up, or that the failures were better managed.
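The round-trip cost of serving every page asset from a single distant origin can be sketched as follows. The asset count, round-trip times, and per-origin connection limit are illustrative assumptions, not measurements of the census site.

```python
# Hypothetical sketch: how round-trip time (RTT) compounds when every asset
# on a page is fetched from one distant origin instead of a nearby CDN edge.

def page_network_time(requests: int, rtt_s: float, parallel: int = 6) -> float:
    """Rough lower bound on network time: browsers open a limited number of
    parallel connections per origin, so each batch costs at least one RTT."""
    batches = -(-requests // parallel)   # ceiling division
    return batches * rtt_s

assets_per_page = 60   # scripts, styles, images, etc. (assumption)

# Assumed RTTs: Sydney to a Melbourne origin vs. a cache server in Sydney.
single_origin = page_network_time(assets_per_page, rtt_s=0.020)
nearby_edge = page_network_time(assets_per_page, rtt_s=0.002)

print(f"single distant origin: {single_origin:.3f}s of pure RTT per page")
print(f"nearby edge cache:     {nearby_edge:.3f}s of pure RTT per page")
```

Multiply that per-page difference by millions of page views, and the extra time each connection stays open translates directly into extra load the origin servers have to carry.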

EDIT 19:00:

The ABC has an officially released timeline. It doesn’t detract from anything I’ve written so far. A CDN is built to be able to turn off geographic regions without failing; that’s just a fundamental requirement. IBM has likely spent a lot more money replicating a failed version of this feature.

The government has stepped away from saying this is a DOS attack. Are they saying it was accidental? The correct term for a DOS that isn’t an attack is heavy load: the kind you get when you project 500,000 hits an hour for the census…