In October 2011 Helen Papadopoulos wrote about the Genome project: a mammoth effort to digitise an issue of the Radio Times from every week between 1923 and 2009 and make searchable programme listings available online. Helen expected there to be between 3 and 3.5 million programme entries. Since then the number has grown to 4,423,653 programmes from 4,469 issues. You can now browse and search all of them at http://genome.ch.bbc.co.uk/

Back in 2011 the process of digitising the scanned magazines was well advanced and our thoughts were turning to how to present the archive online. It's taken three years and a few prototypes to get us to our first public release.

The first edition of the Radio Times

Jake Berger and Hilary Bishop have written on the About the BBC blog about the Genesis of Genome, but I know some of you share our fascination with the technicalities of projects such as this, so I'm going to give you some of the gritty behind-the-scenes details.

The website is hosted on Linux servers. We have database servers running MySQL and Sphinx Search, and application servers running Apache and a web application written in Perl using Dancer and Template Toolkit. In front of the application servers, a layer of Varnish cache servers helps to handle the load. Along the way we've used a lot of other Open Source software to index, catalogue and transform data, scale images and automate development processes. As is often the case with projects undertaken at the BBC, we couldn't have done it without the work of countless Open Source developers and, as ever, we are extremely grateful to them for making our work possible.

The web application was designed from the outset to allow speedy browsing and searching of over four million records, so we always had an eye on performance when making technical choices. Even so, the sheer size of the data was more of a problem than we expected. Until we switched our development servers to solid state disks (SSDs), it took over a week just to load the database into MySQL. Even after the switch to SSDs, a complete database load would take more than twenty-four hours. We quickly learnt that "we'll just load up another copy of the database for testing" was not an option.

Once the data is loaded it's mostly unchanging, which helps a little with caching. However, the data set is so large that we can't realistically cache all of it. You may notice that some pages of listings take longer to load than others. If that's the case (and assuming it's not too frustrating for you), you can congratulate yourself for finding a bit of the archive that nobody else has looked at recently; if they had, it would be 'hot' in the cache and served to you quickly.

One of the most common ways to use the archive will be the search box; we expect a lot of searching and, naturally, the search queries are unpredictable (although we guess that a lot of people are going to search for "Doctor Who"). The unpredictability of the search terms means that we can't cache search results at all. We've done quite a bit of performance testing, but one of the reasons for the "Beta" badge that the site is currently wearing is that we couldn't really tell how well it would perform until it went live.
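To give a flavour of the application layer, here is a minimal sketch of a Dancer route that renders a listings page through Template Toolkit and marks the response as cacheable so that Varnish can serve repeat visits without touching the application servers. The route, the template name and the schedule_for() lookup are invented for illustration; this is not the real Genome code.

    #!/usr/bin/env perl
    # Minimal Dancer route sketch: a cacheable listings page rendered
    # with Template Toolkit. The names here are hypothetical, not Genome's.
    use Dancer;

    set template => 'template_toolkit';

    # Placeholder for the real MySQL lookup of a day's programmes.
    sub schedule_for {
        my ( $service, $date ) = @_;
        return [];    # the real application would return programme rows here
    }

    get '/schedules/:service/:date' => sub {
        my $service    = params->{service};
        my $date       = params->{date};
        my $programmes = schedule_for( $service, $date );

        # Listings rarely change once loaded, so let Varnish cache the page.
        header 'Cache-Control' => 'public, max-age=3600';

        template 'schedule' => {
            service    => $service,
            date       => $date,
            programmes => $programmes,
        };
    };

    dance;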
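Search is the one part Varnish can't help with, since almost every query is different. The sketch below shows how a Perl application might run a query against Sphinx using the CPAN Sphinx::Search client; the host, port and index name ('programmes') are assumptions for illustration, not the real Genome configuration.

    #!/usr/bin/env perl
    # Sketch of a Sphinx query from Perl via the Sphinx::Search CPAN client.
    # Host, port and index name are illustrative assumptions.
    use strict;
    use warnings;
    use Sphinx::Search;

    my $sphinx = Sphinx::Search->new();
    $sphinx->SetServer( 'localhost', 9312 );    # searchd host and port
    $sphinx->SetLimits( 0, 20 );                # first page of results

    my $results = $sphinx->Query( 'doctor who', 'programmes' );

    if ($results) {
        printf "%d programmes matched\n", $results->{total_found};
        for my $match ( @{ $results->{matches} } ) {
            # Each match carries a document id (the programme's primary key),
            # which the application would look up in MySQL to build the page.
            printf "programme id %d (weight %d)\n", $match->{doc}, $match->{weight};
        }
    }
    else {
        warn 'Search failed: ' . $sphinx->GetLastError() . "\n";
    }

In the web application, the matching search route would also need to tell Varnish not to cache the response (for example with a 'Cache-Control: no-cache' header), so that every query reaches the application servers.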

Radio Times covers through the years