Thu, Jun. 12th, 2008, 02:12 pm

mod_failgracefully

Failing gracefully is a far more difficult problem than people intuitively expect. Take the case of a web site which has insufficient bandwidth or CPU to handle its current load. Graceful failure in this case would be to let some fraction of users in, and politely tell the other users they have to wait because of the load. Web sites don't by default do anything even vaguely resembling this. Instead, page load times become very slow, and when things time out it's on a per-page or even per-object basis, resulting in an acceptable site experience for no one. Even worse, when users get a bad object load in a page or a bad page load as a whole, they'll usually hit reload rather than leaving, causing the site load to be far greater than it would be were it serving the same number of users in a reasonable fashion.



It would be nice if someone could write a thing for Apache to do this automatically. When a site starts hitting bandwidth or CPU limits (preferably automatically detected by the site itself) it starts rejecting users and giving them cookies to keep repeated reloads from getting in. Ideally, it would even create a queue of users and let them know when they're up and give an estimated amount of time until they're let in.
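A minimal Python sketch of the cookie idea, under stated assumptions: the class name `AdmissionGate`, its parameters, and the fixed cap on concurrent visitors are all hypothetical, and the automatic detection of bandwidth or CPU limits that the post asks for is not attempted here. A visitor who is turned away gets a token to store in a cookie, and reloads carrying that token are refused cheaply until its retry time has passed.

```python
import secrets
import time


class AdmissionGate:
    """Admit at most `max_active` visitors; tell everyone else to wait."""

    def __init__(self, max_active, session_ttl=300.0, retry_after=30.0,
                 clock=time.monotonic):
        self.max_active = max_active
        self.session_ttl = session_ttl    # idle seconds before a slot frees up
        self.retry_after = retry_after    # seconds a turned-away visitor must wait
        self.clock = clock
        self.admitted = {}     # cookie token -> time the visitor was last seen
        self.turned_away = {}  # cookie token -> earliest time worth retrying

    def _expire(self, now):
        cutoff = now - self.session_ttl
        self.admitted = {t: seen for t, seen in self.admitted.items()
                         if seen >= cutoff}

    def check(self, cookie=None):
        """Return ("admit" or "wait", token to store in the visitor's cookie)."""
        now = self.clock()
        self._expire(now)
        if cookie in self.admitted:
            self.admitted[cookie] = now
            return "admit", cookie
        if cookie in self.turned_away and now < self.turned_away[cookie]:
            return "wait", cookie  # early reload: refuse without more work
        token = cookie if cookie in self.turned_away else secrets.token_hex(16)
        if len(self.admitted) < self.max_active:
            self.turned_away.pop(token, None)
            self.admitted[token] = now
            return "admit", token
        self.turned_away[token] = now + self.retry_after
        return "wait", token
```

The sketch never prunes `turned_away`, so a real deployment would need to expire old tokens; it also trusts the cookie value as-is, which a hostile client could forge.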



Of course, it's always better to simply have enough bandwidth and CPU around. But once in a while someone finds themselves in a situation where they don't, and if it were possible to simply install a generic fail gracefully utility and leave that running as an interim solution until they can get a proper one in place, that would make the world a better place.

Thu, Jun. 12th, 2008 11:55 pm (UTC)

chouyu_31 I read a paper a few years ago about handling load by proxy; you have a service that accepts all incoming connections, and depending on the load of the machine, either says "sorry, go away right now" (it was written in the context of POP3 mailboxes, so actually pretended to be a POP3 server and said "sorry, no mail"), or it passes the file handle to the actual server (there are methods to do this on linux, bsd, and windows).
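For the hand-off step, here is a hedged Python sketch of passing an already-accepted connection to another process over a Unix domain socket. It assumes Python 3.9 or later on a Unix-like system, where `socket.send_fds` and `socket.recv_fds` are available; the helper names `hand_off` and `take_over` are made up for illustration, and nothing here is taken from the paper mentioned above.

```python
import socket


def hand_off(control, conn):
    """Send the file descriptor behind `conn` over the Unix socket `control`."""
    # At least one byte of ordinary data must accompany the descriptor.
    socket.send_fds(control, [b"conn"], [conn.fileno()])


def take_over(control):
    """Receive one descriptor from `control` and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(control, 16, 1)
    # Family, type and protocol are detected from the descriptor itself.
    return socket.socket(fileno=fds[0])
```

The proxy would call `hand_off` and then close its own copy of the connection; the receiving process keeps the connection alive through the descriptor it got from `take_over`.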



If one were to make the proxy smart enough, one wouldn't even need to use cookies (in the case of http), but then the question becomes how to pass the file handle to apache so that it can handle the request properly. Maybe through a mod? I don't know, I know very little about apache internals.

Fri, Jun. 13th, 2008 12:15 am (UTC)

root_fu Maybe what the world really needs is a distributed network architecture which allows clients connecting to a server to act as intermediaries.



This would allow a server under duress to shift bandwidth and CPU loads to the client side, allowing (in theory) better connectivity.



We could call it: bit torrent. With it we could rule the world. Ok, let's be 'real', maybe 30% of the internet bandwidth.



What do you say?

Fri, Jun. 13th, 2008 12:31 am (UTC)

bramcohen Web surfing has very stringent demands on latency, which peer to peer can't really meet. Peers also can't handle dynamic content at all.

Fri, Jun. 13th, 2008 01:17 am (UTC)

mackys When I wrote the JWS, I grappled with this problem. The JWS accepts connections on the listen socket and immediately throws them into a queue. The connections in the queue are serviced by a pool of threads. The queue has a finite size (WORKERPOOL_TASKQ_MAXSIZE) and if it gets full, it begins rejecting enqueue attempts. The main accept() loop notices this and sends an "HTTP/1.0 500 Internal Server Error (is the server overloaded?)" to the rejected socket, and then closes it.

Fri, Jun. 13th, 2008 01:22 am (UTC)
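The bounded-queue pattern described for the JWS can be sketched in Python roughly as follows. This is not the JWS code: `TASKQ_MAXSIZE` is a hypothetical stand-in for WORKERPOOL_TASKQ_MAXSIZE, the function names are invented, and the sketch answers with 503 Service Unavailable plus `Retry-After` (the HTTP status intended for temporary overload) instead of the 500 that the JWS sends.

```python
import queue
import socket
import threading

TASKQ_MAXSIZE = 64  # hypothetical stand-in for WORKERPOOL_TASKQ_MAXSIZE
OVERLOADED = (
    b"HTTP/1.0 503 Service Unavailable\r\n"
    b"Retry-After: 30\r\n"
    b"Content-Length: 0\r\n\r\n"
)


def start_workers(tasks, handle, count=4):
    """Start daemon threads that pull connections off `tasks` and serve them."""
    def worker():
        while True:
            conn = tasks.get()
            try:
                handle(conn)
            except Exception:
                pass  # a failed request must not kill the worker thread
            finally:
                conn.close()
                tasks.task_done()

    for _ in range(count):
        threading.Thread(target=worker, daemon=True).start()


def enqueue_or_reject(tasks, conn):
    """Queue the connection, or tell the client we are overloaded and hang up."""
    try:
        tasks.put_nowait(conn)
        return True
    except queue.Full:
        try:
            conn.sendall(OVERLOADED)
        except OSError:
            pass  # client already went away
        finally:
            conn.close()
        return False


def serve_forever(host, port, handle):
    """Accept loop: never blocks on a full queue, it rejects instead."""
    tasks = queue.Queue(maxsize=TASKQ_MAXSIZE)
    start_workers(tasks, handle)
    with socket.create_server((host, port)) as listener:
        while True:
            conn, _addr = listener.accept()
            enqueue_or_reject(tasks, conn)
```

Because `put_nowait` never blocks, the accept loop stays responsive even when every worker is busy, which is what lets it send the rejection promptly.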

mackys WARNING: The JWS is insecure and should not be used in a production environment.

(Yeah, most people will be smart enough to notice, but there are always a few idiots...)

Fri, Jun. 13th, 2008 01:44 am (UTC)

bramcohen That's about as good as you can do with normal web server behavior, but suffers from the people hitting reload problem, and doesn't detect when the web server doesn't have enough bandwidth, just when it doesn't have enough CPU.

Fri, Jun. 13th, 2008 02:06 am (UTC)

mackys "doesn't detect when the web server doesn't have enough bandwidth"



I wonder how you even do that... You can't just judge based on pure number of connections, since the computer could be on a 9600 baud slip line or could be gig-e connected to a router on a DS3. I can't think of anything in netstat that would give you useful info either.



I guess the machine could continually ping the default gateway and start refusing connections when the pings start coming back slow. Though that would give false readings if the WAN connection had gone down. The LAN would still be fast but the WAN wouldn't even be passing packets. (Though I suppose the problem of politely replying to incoming connections is kind of moot if people can't establish incoming connections...)



"suffers from the people hitting reload problem"



That one's tough. Even if you send them an error page saying "Overloaded, please wait five minutes" you know 90% of people are going to hit reload anyway. I guess you could keep a hash of recent IPs that have tried to connect, and reject connections from ones that try too much.

Fri, Jun. 13th, 2008 04:46 am (UTC)
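The "hash of recent IPs" idea could look something like this in Python. The class name `RecentAttempts`, the limit of three attempts, and the 60-second window are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque


class RecentAttempts:
    """Remember recent connection attempts per IP and flag the impatient ones."""

    def __init__(self, max_attempts=3, window=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window = window  # seconds of history kept per address
        self.clock = clock
        self.attempts = defaultdict(deque)  # ip -> timestamps of recent attempts

    def allow(self, ip):
        """Record an attempt from `ip`; return False once it exceeds the limit."""
        now = self.clock()
        stamps = self.attempts[ip]
        # Forget attempts that have aged out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) <= self.max_attempts
```

One caveat with keying on IP address: many users behind a single NAT or proxy share an address, so a limit this simple can punish people who have not reloaded at all.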

bramcohen Maxing out upload is noticeable by every outgoing connection getting slow and having lots of packet loss. The information about this is on the machine, but it's in the TCP stack, not generally available to applications.



To stop people from hitting reload you need to message properly. Ideally you'd give them a cookie and have an actual queue for people to be let in and have the page auto-reload once in a while until it's their turn (if there isn't much on the page the amount of CPU and bandwidth in the reloads isn't all that much). You'd have to respond to a reload request by serving up a message saying 'hitting reload won't make this page load faster, in fact I just pushed you back in the queue a bit'. That would get the message across fairly quickly.

Fri, Jun. 13th, 2008 01:30 am (UTC)
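A rough Python sketch of the queue with a reload penalty, assuming each visitor is identified by a cookie token. `WaitingRoom`, the 30-second auto-reload interval, and the five-place penalty are hypothetical choices, not anything specified in the thread.

```python
import time


class WaitingRoom:
    """A queue of cookie tokens where reloading early costs you places."""

    def __init__(self, reload_interval=30.0, penalty=5, clock=time.monotonic):
        self.reload_interval = reload_interval  # seconds between scheduled auto-reloads
        self.penalty = penalty                  # places lost per early reload
        self.clock = clock
        self.line = []     # cookie tokens, front of the queue first
        self.next_ok = {}  # token -> earliest time the next reload is welcome

    def join(self, token):
        """Add a new visitor to the back of the line; return their position."""
        self.line.append(token)
        self.next_ok[token] = self.clock() + self.reload_interval
        return len(self.line)

    def reload(self, token):
        """Handle a reload from a queued visitor; return (position, message)."""
        now = self.clock()
        pos = self.line.index(token)
        if now < self.next_ok[token]:
            # Manual reload before the scheduled one: move the visitor back.
            new_pos = min(pos + self.penalty, len(self.line) - 1)
            self.line.insert(new_pos, self.line.pop(pos))
            message = ("Hitting reload won't make this page load faster; "
                       "you just moved back in the queue.")
        else:
            new_pos = pos
            message = "Thanks for waiting."
        self.next_ok[token] = now + self.reload_interval
        return new_pos + 1, message

    def admit_next(self):
        """Let the visitor at the front of the line into the site."""
        token = self.line.pop(0)
        del self.next_ok[token]
        return token
```

The waiting page itself would carry the auto-reload timer, so a visitor who simply leaves the tab open never triggers the penalty.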

suppressingfire Indeed. This was the whole point of the classic "livelock" paper: http://citeseer.ist.psu.edu/326777.html

It's a shame that 13-year-old research still hasn't made its way into mainstream production.

(also, just read somewhere that you're a yellow pig, too!)

Fri, Jun. 13th, 2008 04:47 am (UTC)

bramcohen Yeah, I went to HCSSiM one summer.

Fri, Jun. 13th, 2008 07:56 am (UTC)

slayemin I can't imagine anyone actually wanting to use this feature. Who would the audience be? If you're a small business, the last thing you want to do is to turn away potential customers. If you're a large business, public access to your website is very important. If you're a single geo-cities kinda webpage, it's very unlikely that you'd ever be hammered by web traffic.

A better solution is to just use network load balancing with multiple webservers in high traffic situations. NLB balances traffic based on CPU and bandwidth loads. All you need is the extra hardware. Hardware is cheap. Lost time and business is not.

Fri, Jun. 13th, 2008 08:27 am (UTC)

krellan I agree that this feature wouldn't be desirable.



I've seen this in practice. I believe that some Microsoft web servers offer a feature to put up an empty "Server Too Busy" error webpage. I've seen this appear when trying to visit some sites. It's maddening, because the message appears so quickly, providing the site has plenty of bandwidth remaining, because it's able to still serve the error message. It seems that an arbitrary administrative limit was set too low.



If there is indeed congestion, I'd much rather be queued and wait, than simply rejected outright. As you said, turning away potential customers is never a good idea!

Fri, Jun. 13th, 2008 09:27 am (UTC)

electrichamster "I've seen this appear when trying to visit some sites. It's maddening, because the message appears so quickly, providing the site has plenty of bandwidth remaining, because it's able to still serve the error message."



The problem is more likely to be CPU than bandwidth, so the error page being served quickly has little indication of the load that the servers are under.



We actually do something similar to this with a patch to the Perlbal load balancer. When the length of time that people are waiting in the serve queue exceeds 30s we serve an error page to any new connections. This happens incredibly rarely, generally only during hardware failure. It's enabled us to keep serving subscribers (who are prioritised) and a small number of users even when our capacity has been tiny - without this the site would have been inaccessible for nearly everyone.

Fri, Jun. 13th, 2008 04:36 pm (UTC)
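The Perlbal patch itself is not shown in the thread, so the following Python sketch only illustrates the general rule: refuse new requests once the longest wait in the serve queue exceeds 30 seconds. How the real patch treats subscribers is not described; this sketch assumes they are exempt from shedding and are dequeued first. `ServeQueue` and its methods are invented names.

```python
import time
from collections import deque

MAX_WAIT = 30.0  # seconds, the threshold mentioned above


class ServeQueue:
    """Queue requests for the backends, shedding new ones when waits get long."""

    def __init__(self, max_wait=MAX_WAIT, clock=time.monotonic):
        self.max_wait = max_wait
        self.clock = clock
        self.subscribers = deque()  # (enqueue_time, request), oldest first
        self.others = deque()

    def longest_wait(self):
        """Seconds the oldest queued request has been waiting (0 if empty)."""
        now = self.clock()
        heads = [q[0][0] for q in (self.subscribers, self.others) if q]
        return max((now - t for t in heads), default=0.0)

    def offer(self, request, subscriber=False):
        """Queue the request, or return False to signal 'serve the error page'."""
        if not subscriber and self.longest_wait() > self.max_wait:
            return False
        target = self.subscribers if subscriber else self.others
        target.append((self.clock(), request))
        return True

    def take(self):
        """Next request for a free backend: subscribers first, then the rest."""
        for q in (self.subscribers, self.others):
            if q:
                return q.popleft()[1]
        return None
```

Measuring wait time instead of queue length means the threshold does not need retuning when backend capacity changes, for example after a hardware failure.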

therealdrhyde What if you're neither a small business, a large business, nor a geocities user? There's plenty of people who run useful little sites on their own. Such as this one of mine which fails Really Badly under heavy load. It fails so badly that I have a script on another box that, if it can't connect for a few minutes, remotely bounces the power.

Thankfully, some of my recent Apache config tweaks seem (touch wood) to have made that problem go away. We'll see what happens when you all rush in to take a look :-)

Fri, Jun. 13th, 2008 08:06 am (UTC)

ingulf I'm not sure I agree with this. I usually prefer a web page to load slowly rather than give up, because if it loads slowly I can just leave it in another tab to get on with it, but if it gives up, I have to come back and refresh until I get through.



More useful would be to return a web page which automatically reloaded some time in the future - the web server could adjust the time it gives to different users - booking its future load.

Fri, Jun. 13th, 2008 09:05 am (UTC)
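One way to picture "booking future load" is a small scheduler that hands each waiting visitor a reload delay and never promises more reloads to a time slot than the server wants to handle. This Python sketch is an illustration only; `ReloadBooker`, `per_slot`, and the 10-second slot size are assumptions.

```python
import math
from collections import Counter


class ReloadBooker:
    """Hand out reload delays so that no future time slot is overbooked."""

    def __init__(self, per_slot, slot_seconds=10):
        self.per_slot = per_slot          # reloads the server accepts per slot
        self.slot_seconds = slot_seconds
        self.booked = Counter()           # slot index -> reloads already promised

    def book(self, now):
        """Return how many seconds the browser should wait before reloading."""
        slot = math.floor(now / self.slot_seconds) + 1  # earliest: the next slot
        while self.booked[slot] >= self.per_slot:
            slot += 1                                   # that slot is full, look later
        self.booked[slot] += 1
        return slot * self.slot_seconds - now
```

The returned delay could be written into the waiting page, for instance in a `<meta http-equiv="refresh">` tag. The sketch never discards slots that are already in the past, which a long-running server would need to do.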

zanfur I do know that some network hardware will do this. In particular, for sites where there are both paying and free customers (like LJ, for instance, though I have no idea if they use this technology), some F5 equipment can give priority to the logged in users who are actually paying, and only let the freeriders have whatever bandwidth is left. Configurable as you like. I think some Foundry equipment does it as well.



Be really nice if it was an Apache module, though...

Fri, Jun. 13th, 2008 08:43 pm (UTC)

teh_munchkin The solution that immediately comes to mind to me is to profile page load times and use a probabilistic drop routine to maintain reasonable response times, probably with weighting that leans heavily towards not dropping clients that have open sessions, or are loading referenced objects, etc. Could be self tuning to a degree. One could also do some tricks to serve lower quality versions of images when the server detects that it's running out of upload bandwidth (pretty easy to figure based on load times).
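A hedged Python sketch of that probabilistic drop routine: it keeps a moving average of page load times and ramps the drop probability linearly between a target and a hard limit, similar in spirit to random early detection in routers. The class name, the thresholds, and the 0.1 weighting for clients with open sessions are all assumptions; the self-tuning and image-quality ideas are not covered.

```python
import random


class ProbabilisticDropper:
    """Drop a growing fraction of requests as average page load time rises."""

    def __init__(self, target=1.0, limit=5.0, alpha=0.2, session_weight=0.1,
                 rng=random.random):
        self.target = target  # seconds: at or below this, nothing is dropped
        self.limit = limit    # seconds: at or above this, new visitors are always dropped
        self.alpha = alpha    # smoothing factor for the moving average
        self.session_weight = session_weight  # scales drop probability for open sessions
        self.rng = rng
        self.avg = 0.0        # exponentially weighted average load time

    def record(self, load_time):
        """Fold one measured page load time into the moving average."""
        self.avg += self.alpha * (load_time - self.avg)

    def drop_probability(self, has_session=False):
        """Linear ramp from 0 at `target` to 1 at `limit`, reduced for sessions."""
        p = (self.avg - self.target) / (self.limit - self.target)
        p = min(1.0, max(0.0, p))
        return p * self.session_weight if has_session else p

    def should_drop(self, has_session=False):
        return self.rng() < self.drop_probability(has_session)
```

With the defaults, a client with an open session is dropped at most 10% of the time even when new visitors are being refused outright, which matches the stated preference for not interrupting people already on the site.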