Announcing the HTTP Archive

I’m proud to announce the release of the HTTP Archive. From the mission statement:

Successful societies and institutions recognize the need to record their history – this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web’s digitized content.

In addition to the content of web pages, it’s important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

The HTTP Archive code is open source and the data is downloadable. Approximately 17,000 top websites are examined every two weeks; I started gathering the data in October 2010. The list of URLs is derived from various sources including Alexa, Fortune 500, and Quantcast. The system is built on the shoulders of WebPagetest, which downloads each URL and gathers a HAR file, screenshots, a video of the site loading, and other information. From this information the HTTP Archive extracts data, stores it in a database, aggregates it, and provides various statistical analyses.
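
To make the extraction step concrete, here's a minimal sketch of pulling per-content-type transfer sizes and a failed-request count out of a HAR file. The field names follow the HAR 1.2 format that WebPagetest produces, but the function itself is illustrative, not the HTTP Archive's actual code.

```python
import json
from collections import defaultdict

def summarize_har(har_path):
    """Illustrative sketch: extract per-content-type transfer sizes and
    a failed-request count from a HAR file (HAR 1.2 field names)."""
    with open(har_path) as f:
        har = json.load(f)

    bytes_by_type = defaultdict(int)
    failed = 0
    for entry in har["log"]["entries"]:
        resp = entry["response"]
        # Status 0 (no response) or 4xx/5xx counts as a failed request.
        if resp["status"] == 0 or resp["status"] >= 400:
            failed += 1
            continue
        # mimeType looks like "image/png" or "text/html; charset=utf-8".
        mime = resp["content"].get("mimeType", "").split(";")[0].strip()
        # bodySize is the transferred body size; -1 means unknown.
        size = resp.get("bodySize", 0)
        if size > 0:
            bytes_by_type[mime] += size
    return {"bytesByType": dict(bytes_by_type), "failedRequests": failed}
```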

This chart shows one of the many interesting stats from the HTTP Archive: the number of bytes downloaded for the average web page, broken out by content type.
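
Computing that breakdown is a straightforward aggregation across pages. Here's a rough sketch that averages per-page summaries (like the ones from the previous snippet) into coarse buckets; the bucket names approximate the chart's categories and are my assumption, not the exact grouping rules used.

```python
from collections import defaultdict

def bucket(mime):
    """Map a MIME type to a coarse content-type bucket (assumed categories)."""
    if "javascript" in mime:
        return "Scripts"
    if mime == "text/css":
        return "Stylesheets"
    if mime.startswith("image/"):
        return "Images"
    if mime in ("text/html", "application/xhtml+xml"):
        return "HTML"
    return "Other"

def average_page_breakdown(page_summaries):
    """page_summaries: list of {'bytesByType': {mime: bytes}} dicts,
    e.g. produced by summarize_har() above. Returns the average bytes
    per page for each bucket."""
    if not page_summaries:
        return {}
    totals = defaultdict(int)
    for page in page_summaries:
        for mime, size in page["bytesByType"].items():
            totals[bucket(mime)] += size
    n = len(page_summaries)
    return {b: total / n for b, total in totals.items()}
```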

In addition to interesting stats for an individual run, the HTTP Archive also has trending charts. For example, this chart shows that the total transfer size of web pages has increased 88 kB (15%) over the last six months. (To compare apples to apples, since the list of top sites can change from month to month, these comparisons use the intersection of ~1100 URLs analyzed in every run.)
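
The apples-to-apples comparison itself is simple set arithmetic. Here's a sketch, assuming each run is available as a mapping from URL to total page transfer size; the function and field names are illustrative, not the site's actual code.

```python
def trend_on_common_urls(run_a, run_b):
    """Compare average total transfer size across two runs using only
    the URLs present in both (the intersection described above).
    run_a, run_b: dicts mapping URL -> total page transfer size in kB."""
    common = run_a.keys() & run_b.keys()
    avg_a = sum(run_a[u] for u in common) / len(common)
    avg_b = sum(run_b[u] for u in common) / len(common)
    delta = avg_b - avg_a
    return {
        "urlsCompared": len(common),
        "avgKBBefore": round(avg_a, 1),
        "avgKBAfter": round(avg_b, 1),
        "deltaKB": round(delta, 1),
        "deltaPct": round(100 * delta / avg_a, 1),
    }
```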

The likely cause of this increase is the growth in image size. The next chart shows that the total transfer size of images increased 61 kB (22%) over the same time period, even though the total number of image requests increased by only two (6%). Why did the size of images grow so much? This highlights one of the benefits of the HTTP Archive: anyone who wants to answer that question can download the data and perform their own analysis. Possible things to check include whether the ratio of JPG images increased (the average JPG is 14 kB, compared to 3 kB for GIF and 8 kB for PNG) and whether the increase occurs across most images in the page or is concentrated in a few larger images.
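
As a starting point for that kind of analysis, here's a sketch that tests the JPG-ratio hypothesis against a single run of downloaded data. It assumes the request records have been flattened into (page URL, MIME type, bytes) tuples; the actual downloadable schema may differ, so treat the input format as a placeholder.

```python
from collections import defaultdict

def image_format_stats(requests):
    """requests: iterable of (page_url, mime_type, bytes) tuples, e.g.
    flattened from the downloadable HTTP Archive data (schema assumed
    here for illustration). Returns each image format's share of image
    requests and its average size, to check whether the JPG ratio or
    the per-image size is what grew."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for _page, mime, size in requests:
        if not mime.startswith("image/"):
            continue
        fmt = mime.split("/", 1)[1].upper()  # JPEG, GIF, PNG, ...
        counts[fmt] += 1
        sizes[fmt] += size
    total = sum(counts.values())
    return {
        fmt: {
            "shareOfImageRequests": round(counts[fmt] / total, 3),
            "avgKB": round(sizes[fmt] / counts[fmt] / 1024, 1),
        }
        for fmt in counts
    }
```

Running this against two runs six months apart would show whether the format mix shifted or the per-image sizes themselves grew.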

There are hours of slicing and dicing in store for performance engineers and anyone who loves data. About 20 charts are available today, and more will come (add suggestions to the list of issues). The most important thing to remember is that the data is being collected and archived, and will be accessible to new views added in the future.

Check out the FAQ for more information. I encourage you to join the HTTP Archive group to follow our progress and add to the discussion. Once there you’ll see the first email from Brewster Kahle, who says, “Fantastic! I have not seen such a wealth of information on the performance of the web recorded and so enriched.” I hope you too will find it worthwhile.