
Why your internet experience is slow

Here is a random-ish URL from Salon.com, a not too unusual online magazine: http://www.salon.com/tech/col/smith/2008/05/09/askthepilot276/.

This HTML page contains the first chunk of a piece of journalism by Patrick Smith; the actual body copy runs to approximately 950 words of text. The average word in English is 5.5 characters long; add 1 character for punctuation or whitespace and we get 6.5 characters per word, so we would reasonably expect this file to be on the close order of 6Kb in size.
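For anyone who wants to check the back-of-envelope sum, here is a minimal sketch. It assumes one byte per character (plain ASCII text), which is my assumption rather than anything measured, and the function name is made up for illustration.

```python
# Back-of-envelope size of the body copy, using the figures quoted above.
WORDS = 950            # approximate word count of the article body
CHARS_PER_WORD = 5.5   # average English word length, as stated in the text
SEPARATOR = 1          # one character of whitespace or punctuation per word

def estimated_bytes(words, chars_per_word=CHARS_PER_WORD, separator=SEPARATOR):
    """Estimate plain-text size, assuming one byte per character."""
    return words * (chars_per_word + separator)

size = estimated_bytes(WORDS)
print(f"{size:.0f} bytes, or about {size / 1024:.1f}Kb")
```

That works out to 6,175 bytes, or roughly 6Kb of actual words.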

(Patrick, if you're reading this, I am not picking on you; I just decided to do some digging when I got annoyed by how long my browser was taking to load your words.)

In actual fact, the web page my browser was downloading turned out to be 68.4Kb in size. The bulk of the extra content consists of HTML tags and links. It's difficult to say how much cruft there is — much of it is Javascript, and I used a non-Javascript web browser for some of this analysis — but a naive dump of the content reveals 128 URLs.
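A naive URL count of this kind can be approximated in a few lines. This is only a sketch: the regular expression is deliberately crude, the sample markup and its example.com hostnames are hypothetical, and it is not necessarily how the count of 128 above was obtained.

```python
import re

# Deliberately naive: match anything that looks like an absolute http(s) URL.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def count_urls(html):
    """Return (total matches, distinct URLs) found in the page source."""
    urls = URL_PATTERN.findall(html)
    return len(urls), len(set(urls))

# Hypothetical markup, for illustration only.
sample = '''<a href="http://www.example.com/a">one</a>
<img src="http://ads.example.net/pixel.gif">
<a href="http://www.example.com/a">one again</a>'''

total, distinct = count_urls(sample)
print(total, distinct)
```

Relative links and URLs assembled by Javascript at run time would slip past a pattern like this, so treat the result as a lower bound.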

So, we now have an order of magnitude bloat, courtesy of the salon.com content management system adding in links and other cruft. But that's just the text, and as we all know, no web page is complete without an animated GIF image. So how big is this article, really?

I stared at it for some time while it loaded over a 10Mbps cable modem connection. Then I switched off my browser anti-advertising plugins (AdBlock and NoScript), hit "reload", and then saved the web page. Inline in the page are: 4 JPEG images, 4 Shockwave FLASH animations, 4 PNG images, 8 GIF images (of which no fewer than five are single-pixel web bugs), 4 HTML sub-documents, 6 CSS (style sheet) files, 22 separate Javascript files ... and a bunch of other crap.
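If you want to repeat the inventory on a page you have saved yourself, something like the following would do the counting. It is a rough sketch, assuming the browser saves the page's resources into a folder alongside the HTML; the file names and sizes in the example are invented.

```python
from collections import defaultdict
from pathlib import Path

def tally_by_extension(files):
    """Group (filename, size_in_bytes) pairs by file extension.

    Returns {extension: (file_count, total_bytes)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for name, size in files:
        ext = Path(name).suffix.lower() or "(none)"
        counts[ext][0] += 1
        counts[ext][1] += size
    return {ext: tuple(v) for ext, v in counts.items()}

def scan_directory(directory):
    """Yield (filename, size) for every regular file under a saved-page folder."""
    for path in Path(directory).rglob("*"):
        if path.is_file():
            yield path.name, path.stat().st_size

# Hypothetical file names and sizes, for illustration only.
example = [("story.html", 70000), ("ad.swf", 30000), ("bug.gif", 43), ("logo.GIF", 900)]
for ext, (n, total) in sorted(tally_by_extension(example).items()):
    print(f"{ext:8} {n:3d} files {total:8d} bytes")
```

On a real saved page you would call `tally_by_extension(scan_directory("saved_page_files"))`, where the folder name is whatever your browser chose.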

The grand total of extras comes to 880Kb by dry weight, meaning that in order to read 950 words by Patrick Smith my cable modem had to pull in 948Kb (the 68.4Kb page plus the extras), of which 942Kb was in no way related to the stuff I wanted to actually read.
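The proportion is easy to check from the two figures that matter, the 948Kb total and the roughly 6Kb of words. A minimal sketch, with names of my own choosing:

```python
# Figures quoted in the text, in the same Kb units used throughout.
TOTAL_KB = 948      # everything the browser pulled in
PAYLOAD_KB = 6      # rough size of the 950 words themselves

def overhead_share(total, payload):
    """Fraction of the download that is not the article text."""
    return (total - payload) / total

overhead_kb = TOTAL_KB - PAYLOAD_KB
share = overhead_share(TOTAL_KB, PAYLOAD_KB)
print(f"{overhead_kb}Kb of overhead, {share:.1%} of the download")
```

That is 942Kb of overhead, a shade over 99% of the bytes on the wire.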

With AdBlock and NoScript switched back on, the cruft dropped off considerably, but not completely: the core HTML file squished down to 52Kb (after a bunch of Javascript extensions failed to load) and the hairball of advertising cruft dropped from 62 to 41 included files, for a grand total of 372Kb of crap (down from 880Kb). Finally, I updated my /etc/hosts file to include a blacklist of advertising sites, redirecting all requests for objects hosted on them into the bit bucket. The final download came to 40Kb of HTML in the main file and 208Kb of unwanted crap.
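For readers who have not met the /etc/hosts trick: the idea is to map each advertising hostname to an address that serves nothing, so requests for it fail fast. The sketch below generates such lines. The hostnames are hypothetical placeholders, and pointing them at the loopback address 127.0.0.1 is a common convention that I am assuming here, not a description of the particular blacklist mentioned above.

```python
# Hypothetical ad-server hostnames; a real blacklist would be much longer.
AD_HOSTS = ["ads.example.com", "tracker.example.net"]

def hosts_entries(hostnames, sink="127.0.0.1"):
    """Build /etc/hosts lines that resolve each hostname to the sink address."""
    return [f"{sink}\t{name}" for name in hostnames]

for line in hosts_entries(AD_HOSTS):
    print(line)
```

Appending the printed lines to /etc/hosts (as root, on a Unix-like system) is what sends requests for those hosts into the bit bucket.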

Let me put this in perspective:

This is a novel in HTML, with three small image files (totaling about 10Kb). "Accelerando" runs to 145,000 words; it fits in about 400 pages, typeset as a book, using very small print. It is 949Kb in size, or about 1Kb larger than a Salon.com feature containing 950-odd words.

Here's another novel, available for download in HTML. "Down and Out in the Magic Kingdom" runs to 328Kb in HTML; it's about 180 pages in book form, and it's still 40Kb smaller than the hairball you get from Salon.com after you switch AdBlock and NoScript on.
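To put a number on the "Accelerando" comparison, here is the words-per-Kb arithmetic as a small sketch, using only the word counts and sizes quoted above (no word count is given for the second novel, so it is left out):

```python
def words_per_kb(words, size_kb):
    """Words of actual reading matter delivered per Kb downloaded."""
    return words / size_kb

novel = words_per_kb(145_000, 949)   # "Accelerando" as HTML
article = words_per_kb(950, 948)     # the Salon page with everything loaded
print(f"novel: {novel:.0f} words/Kb, article: {article:.1f} words/Kb")
```

The novel delivers about 153 words per Kb; the magazine page manages about one.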

If content is king, why is there so little of it on the web? And why are content providers like Salon always whining about their huge bandwidth costs, given that 99% of what they ship — and that is an exact measurement, not hyperbole — is spam?

(Note: these are rhetorical questions. Despite the burning certainty that someone on the internet is wrong, you don't need to try and explain how the advertising industry works to me. Really and truly. I'm just taking my sense of indignation for a Sunday walk.)
