ROS1 single point of failure

A typical ROS1 topology

If you’ve used ROS in automation applications requiring multiple robotic agents or more than one compute node, you’ve likely experienced the limitations imposed by rosmaster’s single point of failure.

Only one computer can run the rosmaster, and in ROS1 your only option is to choose wisely and hope luck is on your side. Any unexpected failure on the selected compute node can quickly bring your automation to a halt.

In an ideal world, you’ll never experience a node failure.

In ROS1 a single point of failure stops all new communications :(

Sadly, that’s just not the world we live in. Unexpected failure happens for many reasons:

Memory Exhaustion

Network Fault

Power Fault

Hardware Failure

Operating System Crash

In ROS1, an uncommunicative roscore is a show stopper. Suddenly none of the working compute nodes can accomplish service discovery; they are rapidly orphaned, with no way to elect a leader and no clear process for what to do if the roscore does come back online. Parameter lookups immediately begin failing, and no new topic or service connections can be established while the roscore is away.
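You can see this failure mode directly, because the ROS1 master speaks plain XML-RPC. Below is a minimal sketch, using only the Python standard library, that probes a master the way a node would during discovery (the caller ID and the default port 11311 are illustrative):

```python
import xmlrpc.client


def master_alive(master_uri="http://localhost:11311"):
    """Probe a ROS1 master the way a node does during discovery.

    getSystemState asks the master for the current publishers,
    subscribers, and services; if the roscore is down, the call
    cannot even connect.
    """
    master = xmlrpc.client.ServerProxy(master_uri)
    try:
        code, msg, state = master.getSystemState("/health_probe")
        return code == 1  # code 1 means success in the master API
    except OSError:
        # Connection refused or timed out: the master is gone, and
        # with it all new discovery and parameter lookups.
        return False


print(master_alive())  # False if no roscore is listening on 11311
```

Every node in the graph depends on calls like this succeeding, which is why a single dead roscore orphans the entire system.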

Even if the compute node running roscore recovers, it may have lost all rosgraph data, since that state is stored only in memory. Rebooting the failed compute node results in a network partition: the roscore forgets about the remaining compute nodes, even though they may still be in a working state.

ROS1 failure recovery? Reboot everything >.<

The only solution to this messy outcome is to reboot all computers or restart the ROS software. Downtime sucks in the cloud, but on mobile robots and assembly lines, poor failure recovery behavior can be dangerous.

Flying a drone, driving down a street, or making hamburgers are all tasks that can be disastrous if stopped mid-process!

Vapor makes ROS1 resilient

When failure strikes in a Vapor-enabled solution, the failure only affects the failing computer and its direct peers. The other nodes continue using their local Vapor instances to accomplish service discovery. Importantly, no parameter reads or writes are lost, and new or existing topics and services can continue operating and be re-contacted once the failed node recovers.

Winning!

With Vapor, failure is localized and contained

Since all instances of Vapor can be synchronized via a MongoDB replica set, just the failed compute node can be rebooted, letting it recover from failure rapidly.

Vapor recovers from failure quickly

Once the replacement is online, it can rejoin the MongoDB replica set and will automatically sync any rosgraph changes that occurred while it was away.
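A minimal sketch of such a replica set, run from the mongo shell on one of the compute nodes (the hostnames and set name are illustrative; the exact topology Vapor expects may differ):

```javascript
// Each compute node runs mongod with the same replica set name, e.g.:
//   mongod --replSet rosgraph --bind_ip_all
// Then, from the mongo shell on any one member:
rs.initiate({
  _id: "rosgraph",            // must match the --replSet name
  members: [
    { _id: 0, host: "robot-a:27017" },
    { _id: 1, host: "robot-b:27017" },
    { _id: 2, host: "robot-c:27017" }
  ]
})
```

A rebooted member rejoins the set automatically and catches up from the oplog, which is the mechanism that carries rosgraph changes back to a recovering node.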

Getting started

Vapor is Open Source built by ROSHub

The ROSHub team is hard at work on cloud platforms that help ROS developers scale faster and accomplish more.

Need help scaling your ROS solutions?