Helsinki cathedral entrance

Over the last few weeks my photo collection application has been struggling seriously with its performance. The Tomcat server would sometimes crash, which isn't so serious, as my monitoring script would restart it at most 30 minutes later. What's worse is that often it would get stuck and also make Apache freeze, and this would kill the entire site (including this blog), and the monitor script doesn't detect that. Or, load on the server would soar into the double digits, and just stay there, basically making the server unusable until I did a manual restart.

The robots

So, of course I started digging into the problem to see what might be causing this. And pretty quickly I found that Yahoo's Slurp robot was hammering tmphoto with more requests than it was able to handle. At the same time msnbot and Googlebot were also pretty active, though not as bad as Slurp.

My stats made it abundantly clear that most of the traffic in the application was from robots. I checked the logs quickly now, and I find that out of 2.3 million requests, 93.8% were from robots. Of the total, 23% were from Slurp. However, Slurp tends to bunch its requests together, so at times of peak Slurp traffic a much higher proportion of the traffic would be Slurp.

I figured the easiest way to solve this would be to make the robots go a little easier on the site, and added "crawl-delay" statements for Slurp and msnbot to my robots.txt file, telling the robots to wait 45 seconds between requests. I could see both robots picking up the new robots.txt, and while they might have slowed down a bit over the days that followed, they didn't really slow down much.

So I wrote to the Yahoo team complaining about their robot, and asking them to do something. Contrary to expectation, they wrote back the same day, asking me to set the "crawl-delay" to slow down the robot. Wonderful. So no help there. From what I read on the web other people are finding much the same.

A closer look at my access log also revealed gems like this one:

74.6.22.172 - - [22/Jun/2008:04:41:38 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 67.195.37.177 - - [22/Jun/2008:04:42:34 -0400] "GET /tmphoto/photo.jsp?id=t88323 HTTP/1.0" 200 4324 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" 74.6.22.172 - - [22/Jun/2008:04:41:40 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)" 74.6.22.172 - - [22/Jun/2008:04:41:36 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

What you're seeing here is four consecutive lines lifted straight out of the log. Ignore the second line, as that's not what we're interested in. The other three show Apache receiving three requests for the same page from the same IP within a span of four seconds. I think what's happening is that when Slurp doesn't get a response quickly enough, it tries again.

This is not exactly sympathetic behaviour. Imagine, for a moment, what might make it take a while to get a response from a server. Well, it might be load on the server, right? So what happens when you ask again? Well, you increase the load further, don't you? So basically, when my server first started struggling, Slurp would hammer it even more, thus making it even more overloaded. Reasonable behaviour would be to make a note that this URL is not responding quickly right now, and put it back in the queue to be tried again later, rather than trying again right away. This is much more efficient for the robot (it's likely to be slow on the next try as well, so the robot is likely to get through more pages if it changes to a different URL), and much friendlier to other people's web sites.

URL design

Passing boat, Løvøysund, Norway

There was one thing about this application that had been worrying me for a long time, which was the way I'd implemented filtering and context-tracking. Initially, people had found it confusing that if, for example, you went to the list of beer photos and clicked on a photo, the next/previous buttons on the photo page would not necessarily take you to a beer photo. Instead, they would just go to the next/previous photo chronologically, regardless of whether that was also a beer photo.

This was because I'd done a naive implementation where every page was completely stateless, and so the next/previous buttons would not know where you came from, and would always behave the same. To get around this, I made some pages pass in an extra request parameter saying where you'd come from, if you came from a person, a place, or a category. This way the previous/next buttons would move within the list you'd come from.

Similarly, in lists like the list of beer photos you can use the filters on the right-hand side to select only beer photos from a specific place, event, or with a specific person in them, etc. The ID of the filtering topic is also passed as a request parameter back to the same page, telling it which topic to filter by.

This all worked fine, but unfortunately it increased the number of URLs in my web application quite a lot. Without this feature there is basically one URL per parameterless page (such as the start page, lists of places/events/etc) and one URL per topic (that is, one per event, person, etc). With 82 categories, 157 events, 324 places, 195 persons, and 8521 photos this comes to 9279 URLs plus the static pages. Not so bad.

But the list stepping parameters and the filtering parameters changes this, because now there is one URL per combination. Photos can be filtered by categories, events, places, and persons, so we get 8521 * (82 + 157 + 324 + 195 + 1) = 6,467,439 URLs there. Lists of photos by categories, events, places, and persons can be filtered by the same topics, so we get roughly (82 + 157 + 324 + 195)2 = 574,564 URLs there. That gives us a rough estimate of 7 million URLs. And while normal people will never try out more than a tiny fraction of those combinations there's nothing to tell the robots not to do it, and so of course they'll just keep crawling and crawling the site endlessly.

So I sat down to do a quick rewrite and within an hour or two I had changed the code to store the filters in the user's server-side session and not pass them as request parameters. This also had the nice side-effect of simplifying the code quite a lot, as I no longer had to remember to keep propagating these request parameters in every link. (There were actually a few bugs here, which are now gone.) The downside is that most likely people will find the filters on in places where they don't expect it (because of caching, or because they don't pass through a page which removes the filters from their session), but as few people use this anyway, and it's just one click to turn them off, that shouldn't be much of an issue.

Of course, I also had to add code to handle the old URLs, which I did by adding Java code at the very top of the pages to check for filtering request parameters. If any are found, the code updates the session, then sends a permanent forward to the URL without filtering. Thus, old URLs will continue to work, and the search engines are told not to keep these URLs in their indexes, but instead to replace them with the filterless URL.

The redirect code is the first that executes on the page, so for these requests the server hardly does any work, and barely produces any output at all, thus dramatically reducing the server load from these requests.

The Four Courts, Dublin, Ireland

So, I deployed, and waited to see the results. After some hours it became clear that while the site would no longer go into a death-spiral immediately after being started, it was still struggling, and would still wind up barely crawling along once every two hours or so, and then never come back to health without a restart. This was disappointing, to put it mildly.

It was at this point I started doing what I should have done immediately: do some more precise analysis to see what the impact of the filtering URLs actually was. Some quick log analysis revealed that, actually, pages with filtering URLs was only about 55% of the traffic. So I'd reduced the load by about half, but no more, and apparently this was not enough.

From this I learned two things:

Consider the design of your URLs with search engines in mind, and avoid designs that cause an explosion of possible web-visible URLs. Don't make optimizations without estimating the effect of your optimization first! You often find that the optimization has much less effect than you thought, making it a waste of effort (and, often, a needless source of bugs).

In this case, the optimization had other benefits, so I won't be going back to the old design, but my site was still near-unusable. So then what? Well, this blog entry is now more than long enough as it is, and the solution I eventually found takes us into very different territory, so I'll save that story for the next entry.

Update: next entry now published.