Monday, December 23, 2013 at 8:54AM

There's a deep similarity between how long running systems like our brains and computers accumulate errors and repair themselves.

Reboot it. Isn’t that the common treatment for most computer ailments? You may have noticed that now that your iPhone supports background processing, it reboots a lot more often. Your DVR, phone, computer, router, car, and an untold number of long running computer systems all suffer from a nasty problem: over time they accumulate flaws and die or go crazy.

Now think about your brain. It’s a long running program running on very complex and error prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing. Sleep.

There’s new research out on how our brains are cleansed during sleep that has some interesting parallels to how we keep long running hardware-software systems up and running properly. This is a fun topic. Let’s explore it a little more.

One of the most frustrating system tests for a computer system is to just let it sit idle day after day and check that it doesn’t reboot, leak memory, or fail in some surprising never-thought-that-would-happen sort of way. Systems are never really idle on the inside. Stuff is always happening. Interrupts are being served, timers are firing, all kinds of metrics are being collected and sent, connections are being kept up, protocols are being serviced. Just like when you sleep, an apparently idle computer can be quite active.

Flaws in more complex hardware-software systems are of many kinds: memory leaks; garbage accumulation; out of memory conditions; CPU starvation; data structures like lists that grow larger and larger over time, causing memory problems and increased latency because testing probably never covered these scenarios; improper freeing of objects; stack corruption; deadlocks; pointer corruption; memory fragmentation; priority inversion; timer delays; timers and protocols, inside a box and across nodes, that tend to synchronize, which means you can blow your latency budgets; cascading failures as work stalls across a system and timeouts kick in; missed hardware interrupts; counter overflows; calendar problems like not handling daylight saving time correctly; upgrades that eat up memory or leave a system in a strange state it has never experienced before; buggy upgrades that convert data structures improperly; improper or new configuration; hardware errors; errors from remote services; unknown and unhandled errors; applications that move into and out of different states incorrectly; logs that aren’t truncated properly; and the exercise of rarely used code paths in hard to test states that end in potentially disastrous ways.
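
To make one of these concrete, take counter overflow. The sketch below assumes a hypothetical 32-bit tick counter that counts milliseconds since boot (the names MASK32 and elapsed_ms are made up for illustration). Such a counter wraps after about 49.7 days, which is exactly the kind of thing a short test run never sees.

```python
# Assumption: a free-running 32-bit counter of milliseconds since boot.
MASK32 = 0xFFFFFFFF

# How long the system must stay up before the counter wraps to zero.
days_until_wrap = 2**32 / (1000 * 60 * 60 * 24)


def elapsed_ms(now_tick: int, start_tick: int) -> int:
    """Wrap-safe elapsed time between two 32-bit tick values.

    A naive `now_tick - start_tick` goes negative once the counter wraps;
    masking the difference to 32 bits gives the right answer as long as
    the real interval is shorter than one full wrap.
    """
    return (now_tick - start_tick) & MASK32


start = 0xFFFFFF00                 # shortly before the wrap
now = (start + 500) & MASK32       # 500 ms later, after the wrap
print(now < start)                 # the raw values look like time ran backwards
print(elapsed_ms(now, start))      # the masked difference is still 500
```

Code that compares raw tick values works fine for seven weeks and then misbehaves, which is why idle soak tests need to run for a long time.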

Just the general accumulation of strange little inconsistency errors over time can make it look like your system has dementia. This is why you want to do a cold reboot. Power off, let all the hardware go dark, leave nothing in memory, reload the OS, reload applications, and just start over. Rebooting is like grace for computers.

How our biology works is necessarily different. Maiken Nedergaard, who led the study Sleep Drives Metabolite Clearance from the Adult Brain, thinks she may have discovered the long sought after secret to why sleep is crucial for all living organisms:

We sleep to clean our brains. Through a series of experiments on mice, the researchers showed that during sleep, cerebral spinal fluid is pumped around the brain, and flushes out waste products like a biological dishwasher. The process helps to remove the molecular detritus that brain cells churn out as part of their natural activity, along with toxic proteins that can lead to dementia when they build up in the brain.

Without access to spinal fluid or nicely chunked pieces of garbage like molecules, here are some common tactics for removing software detritus from a system:

Simplicity . Create a system so simple none of these issues apply. This isn’t a good option for a human brain or usually for software. Modern systems are complex. They just are.

Random Reboot . A venerable option. The system is down only for the period it takes to restart and it fixes a myriad of different problems. Some systems periodically do a reboot on purpose just to get a fresh start. It also can remove any active security breaches. In a cloud architecture this is a good design option as you have the resources available to take over for the rebooting node.
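
Here is a minimal sketch of how a scheduled reboot policy might look. Everything in it is an assumption for illustration: the function names, the jitter fraction, and the rule that only a limited share of the cluster may be down at once. The idea is that nodes booted together shouldn't all restart together.

```python
import random


def next_restart_delay(base_interval_s: float, jitter_fraction: float,
                       rng: random.Random) -> float:
    """Seconds until this node's next voluntary restart.

    The delay is the base interval plus or minus a random fraction of it,
    so nodes that started at the same time drift apart.
    """
    if not 0.0 <= jitter_fraction < 1.0:
        raise ValueError("jitter_fraction must be in [0, 1)")
    spread = base_interval_s * jitter_fraction
    return base_interval_s + rng.uniform(-spread, spread)


def may_restart_now(nodes_restarting: int, cluster_size: int,
                    max_fraction_down: float) -> bool:
    """Allow a restart only if enough of the cluster keeps serving."""
    return (nodes_restarting + 1) / cluster_size <= max_fraction_down


rng = random.Random(0)  # seeded only to make the example repeatable
delay = next_restart_delay(24 * 3600, 0.25, rng)
print(round(delay), may_restart_now(0, 10, 0.1), may_restart_now(1, 10, 0.1))
```

The second check is what makes this safe in a cloud architecture: a node reboots only when the rest of the cluster can take over its work.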

Primary-secondary Failover . In high availability designs it’s common to have an active primary and a passive secondary to take over just in case the primary fails. The secondary is usually in some sort of simplified state so that it doesn’t suffer from all the degrading influences experienced by the primary. One strategy is to have the primary periodically fail over to the secondary so that the system reinitializes, removing many of the flaws that have accumulated over time. Service is offline during this period, and the period is proportional to the amount of code, state, and computation needed to transition from secondary to primary state. Which isn't to say the system is completely down for all functionality. There's almost always some low level protection and consistency code running regardless of what state the system is in. The period can be minimized by keeping the secondary closer to live state, but this means the secondary will probably suffer the same flaws as the primary, which is counterproductive. As failover is a highly error prone process, this strategy doesn’t work so well in practice.
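
The promotion decision itself can be sketched in a few lines. This is a toy under stated assumptions: FailoverMonitor is a hypothetical class running on the secondary, the primary sends heartbeats, and time is passed in explicitly. All of the hard parts mentioned above (rebuilding state, fencing the old primary) are left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Runs on the secondary and promotes it when heartbeats stop."""
    timeout_s: float
    last_heartbeat_s: float = 0.0
    role: str = "secondary"

    def on_heartbeat(self, now_s: float) -> None:
        """Record that the primary was alive at time now_s."""
        self.last_heartbeat_s = now_s

    def tick(self, now_s: float) -> str:
        """Promote to primary if the primary has been silent too long."""
        silent_for = now_s - self.last_heartbeat_s
        if self.role == "secondary" and silent_for > self.timeout_s:
            # A real system would also fence the old primary here so
            # that two primaries never run at the same time.
            self.role = "primary"
        return self.role


monitor = FailoverMonitor(timeout_s=3.0)
monitor.on_heartbeat(10.0)
print(monitor.tick(12.0))   # still within the timeout
print(monitor.tick(13.5))   # heartbeats stopped for more than 3 seconds
```

A deliberate, periodic failover would use the same path, with the primary simply going quiet on schedule.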

Background consistency agents . Run software in the background that looks for flaws and then repairs them. This can lead to local inconsistencies of service, but over time the system cleans itself to the degree it can. When things get complex at scale errors are just a fact of life. Might as well embrace and use that fact in your designs to make them more robust.
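
As a rough illustration, here is what one pass of such an agent could look like. The setup is assumed, not taken from any particular system: records live in a dict, each record has a CRC32 checksum stored at write time, and a replica is available to repair from. The function name scrub is made up.

```python
import zlib


def scrub(primary: dict[str, bytes], checksums: dict[str, int],
          replica: dict[str, bytes]) -> list[str]:
    """Verify every record against its stored CRC32 and repair bad ones.

    Returns the keys that were repaired from the replica. Records that
    are damaged in both copies are left alone for a human to look at.
    """
    repaired = []
    for key, expected in checksums.items():
        value = primary.get(key)
        if value is not None and zlib.crc32(value) == expected:
            continue  # record is intact
        candidate = replica.get(key)
        if candidate is not None and zlib.crc32(candidate) == expected:
            primary[key] = candidate
            repaired.append(key)
    return repaired


good = {"a": b"alpha", "b": b"beta"}
sums = {key: zlib.crc32(value) for key, value in good.items()}
store = {"a": b"alpha", "b": b"bXta"}   # one record silently corrupted
print(scrub(store, sums, dict(good)))   # the damaged key gets repaired
```

Between the corruption and the next pass, readers may see the bad record. That is the local inconsistency of service mentioned above.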

Partial shutdown . Shut down certain lower priority services when metrics show higher priority services are degraded. This hopefully frees up resources, but freeing up resources is always a buggy as hell process.
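
The selection step might look something like the sketch below. The service names, the priority numbers, and the latency budget are all invented for the example, and services_to_stop is a hypothetical helper. It only decides what to stop. Actually stopping it cleanly is the buggy part.

```python
def services_to_stop(services: dict[str, int], critical_latency_ms: float,
                     latency_budget_ms: float, max_to_stop: int = 1) -> list[str]:
    """Pick the lowest priority services to stop when a critical metric degrades.

    `services` maps a service name to its priority (larger means more
    important). Nothing is stopped while latency is within budget, and at
    most `max_to_stop` services are chosen per call so that the effect of
    each shutdown can be observed before going further.
    """
    if critical_latency_ms <= latency_budget_ms:
        return []
    by_priority = sorted(services, key=lambda name: services[name])
    return by_priority[:max_to_stop]


running = {"checkout": 10, "search": 5, "thumbnails": 1, "analytics": 2}
print(services_to_stop(running, critical_latency_ms=80, latency_budget_ms=100))
print(services_to_stop(running, critical_latency_ms=250, latency_budget_ms=100))
```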

Whole Cluster failover . The cloud has made an entirely new form of sanity possible: a whole cluster failover. Often used for software upgrades, it will also work as a form of cleaning out software detritus. If you have a 100 node cluster, for example, spin up another hundred nodes and then move processing over to the newly constructed cluster. Any faults accumulated on the old hardware will magically disappear.
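
The cutover can be sketched as a single guarded switch. This is an assumption-heavy toy: the router is just a dict, cut_over and is_healthy are hypothetical names, and real traffic shifting would be gradual rather than all at once.

```python
from typing import Callable


def cut_over(router: dict, new_cluster: list[str],
             is_healthy: Callable[[str], bool]) -> bool:
    """Move traffic to the new cluster only if every new node is healthy.

    The old cluster is remembered under "previous" so that rolling back
    is just another switch.
    """
    if not new_cluster or not all(is_healthy(node) for node in new_cluster):
        return False
    router["previous"] = router.get("active", [])
    router["active"] = list(new_cluster)
    return True


router = {"active": ["old-1", "old-2"]}
fresh = ["new-1", "new-2"]
print(cut_over(router, fresh, lambda node: node != "new-2"))  # refused
print(cut_over(router, fresh, lambda node: True))             # switched
print(router["active"])
```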

Hardening . Put hardware on a replacement schedule to reduce its probability of failure. Upgrade software in a bug patch only mode to make the software more robust to problems found in the field. This is an expensive approach, but it hardens and bullet proofs a system over time. At least until requirements change and the whole thing is blown away. This is an area where biologics have an advantage in that their mandate never changes (in their ecological niche) while human systems are changing all the time, often at deep levels. So for a biologic, techniques like cell repair and immune system responses are quite powerful, but they aren't enough for human systems.

Planned Downtime . It's not uncommon for high availability systems to have maintenance windows for upgrades and other sensitive operations. If a failure occurs in a maintenance window you have a get out of jail free card, it doesn't count, you were on a break. The idea is that it's pretty much impossible to change a running system and meet SLA contractual obligations while providing failover capabilities while making sure state is known and consistent. For survival an organism can never really be in a maintenance window or it may die. Sleep is a sort of maintenance window, but it's one you can exit quickly and if it's interrupted you may lose data in the form of memories and learning, but that's OK. Some web systems enter a read-only mode during upgrades and operate completely from cache. This doesn't seem to be an option for biologics as the hardware is really the memory and learning mechanism, so it wouldn't be easy to have two orthogonal systems for accomplishing the same tasks.

Defragmentation . Variable sized allocations drawn from fixed sized pools invariably fragment over time. This makes access slower as more lookups are required to find something and it makes it harder to store new data as chunks of storage need to be reclaimed and merged together to get enough room to store new items. This same failure mode happens for RAM and disk. If you profile a long running system it can be stunning to realize memory management can be eating up 50% of your CPU. It's a silent killer. So computer systems have a defragmentation process which compacts all the storage to a more efficient state. Defragmentation is difficult to do continuously so a system often is paused while it occurs, as in JVM garbage collection pauses or an inability to use the disk while running a disk defrag program. Long running embedded systems will often statically allocate memory on startup and never free memory, which removes fragmentation problems, but does create others related to fixed queue and buffer sizes that don't adapt well to change.
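
A toy model shows why fragmentation hurts even when plenty of space is free. The pool here is just a Python list of slots where None means free. The names largest_free_run and compact are made up, and a real allocator or garbage collector is far more involved.

```python
from itertools import groupby


def largest_free_run(pool: list) -> int:
    """Length of the longest contiguous run of free (None) slots."""
    runs = (
        len(list(run))
        for is_free, run in groupby(pool, key=lambda slot: slot is None)
        if is_free
    )
    return max(runs, default=0)


def compact(pool: list) -> list:
    """Slide live slots to the front, in order, so free space is contiguous."""
    live = [slot for slot in pool if slot is not None]
    return live + [None] * (len(pool) - len(live))


pool = ["a", None, "b", None, None, "c", None, "d"]
print(pool.count(None), largest_free_run(pool))  # free slots vs. biggest hole
packed = compact(pool)
print(largest_free_run(packed))                  # biggest hole after compaction
```

Before compaction, a request for three contiguous slots fails even though four slots are free. After compaction it succeeds, but every live object had to move, which is why the system usually has to pause while this happens.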

Graceful degradation . Don't fail the whole because a part dies. For both disks and memory when a bad block is detected the block is removed from the available resource list. This degrades overall capacity, but it prevents a failure from corrupting data and bringing the whole system down. The same idea is used for whole servers: when a server is detected to have gone rogue, other servers shouldn't include it in any reindeer games. On a web page if individual services fail then the application degrades gracefully by substituting an available service. Netflix, for example, if it can't load your personalized recommendations because that service is down, will use a more robust but less useful set of recommendations until the better service is back up. With a little thought you can hide a lot of failures.
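
The recommendations example can be sketched as a simple fallback wrapper. This is not how Netflix implements it. It's a minimal illustration with hypothetical names (recommendations, personalized, popular_titles), assuming the personalized service signals failure by raising an exception.

```python
from typing import Callable, Sequence


def recommendations(user_id: int,
                    personalized: Callable[[int], list[str]],
                    popular_titles: Sequence[str]) -> tuple[list[str], str]:
    """Return personalized results, or a static popular list if that fails.

    The second element of the result says which path was taken, so the
    degraded mode can be counted and monitored instead of going unnoticed.
    """
    try:
        return personalized(user_id), "personalized"
    except Exception:
        # Any failure of the better service degrades to the robust one.
        return list(popular_titles), "fallback"


def service_down(user_id: int) -> list[str]:
    raise TimeoutError("recommendation service unavailable")


print(recommendations(7, service_down, ["Title A", "Title B"]))
print(recommendations(7, lambda user_id: ["Title X"], ["Title A", "Title B"]))
```

Returning which path was used matters: a fallback that hides failures from users shouldn't also hide them from you.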

Flaw removal . Of course you should monitor and test your software with an eye towards fixing any and all problems. This is a good and worthy goal. It’s also impossible. The state space for any moderately complex program running in production is astronomical, especially under the rate of change experienced by many online properties. So what you need to do is prioritize the kind of bugs you need to fix and fill in the background with strategies that make your system more robust in the face of inevitable failure.

When you think about it Mother Nature has had a tough design task that in a biological system could only be solved by something as potentially anti-survival as sleep. But the sleep period has been a great hook in which to insert other services like memory consolidation, something that would be difficult to do in an always active system.

In many ways we can now design systems that are more robust than Mother Nature, though certainly not in the same space for the same power budget. Not yet at least.