Transcript

Dadgar: Thanks for coming, I wanted to spend some time talking a little bit about research in the real world, or how we bring some of the academic CS research into play at what we do at HashiCorp.

By means of a super quick introduction, my name is Armon Dadgar. I'm one of the co-founders and CTO at HashiCorp. For those of you who maybe don't know HashiCorp, oftentimes people are probably more familiar with the tooling than with the company itself. We make a suite of different tools, everything from Vagrant, Packer, Terraform, Consul, Vault, Nomad, and the core focus for us is really cloud infrastructure. How do we deploy and manage applications at scale in the cloud, and cover a few different bases from how do we do provisioning with things like Terraform, how do we do security with Vault, how to do networking with Consul, how do we do scheduling and deployment with Nomad. Different pieces of the portfolio, but one logical focus, which is how do we deliver applications in the cloud and do it as code in a modern way.

Transitioning heavily, I think one of the interesting things about the origin of the company is that in many ways it's actually based in research. What I mean by that is the origin - Mitchell is the other co-founder, we actually met on a research project. At the time, we were both at University of Washington working on a small project called the Seattle Project, given that name, it's actually impossible to Google for it. Good luck ever finding anything called Seattle Project. The core idea behind it was, if you were familiar with something like Folding@home or SETI@home, this notion of the scientific compute cloud, that I could take a scientific application that's doing protein folding or scanning for extraterrestrial signals or analyzing some compound, whatever that scientific goal was, and fan it out to hundreds of thousands of machines. Then, each person on their own laptop was donating 10%, 20% of their personal cycles.

In some sense, where we started we were thinking about how do you take an application that you don't necessarily touch, trust, package it in some way, some sort of containerized thing, deploy it onto machines you don't really trust and execute it at massive global scale. Many of these challenges were closely related to the things HashiCorp now solves, but I think that gives us this exposure to how does research get done. How do you think about solving these problems through the lens of how academia would solve them? Solving many of the same problems HashiCorp does, but the way academics works is different than the way we tend to work in industry.

The Value of Research

I think when we summarize quickly and say, "Ok, what do we think the value of research is?" Why is research important? Why should we care about it, talk about it? I think it really boils down to a few of these key things. One is this notion of, can we discover what the state of the art is, rather than us trying to reinvent it on first principles? I think this is probably something as an industry we tend to be really quite bad at, this notion that no one has ever solved this problem before, so I'm just going to start with first principles and build a thing. Often, what we end up building is maybe a simplistic thing or a naive thing, because we haven't really spent time soaked in that domain to understand what are the fundamental tradeoffs, what's been done before, what are the improvements to that naive thing? I think you see this over and over again, is this notion of we reinvent what was invented in 1970 or earlier.

I think the other side of it is getting into this habit of when you're solving a problem, looking at the literature, it will point you to relevant adjacent work. You might think how the right way I should solve this is by thinking about this problem as a matrix and doing some matrix operation. When you spend some time with the literature, you might find that actually, if you turn this into a graph problem, it's just a slightly different representation, but that there's this whole world of graph algorithms that work on my problem and solve it much more efficiently than if I treated it like a matrix. I think, often, just going and doing that exploration will point you to these other ways of thinking that you might not have even thought about when you start going down one path.

The other side, and I think this makes you wince when you see this, is sometimes it's important to just understand what the physical limits are, what is the upper bound of what can be done. I think often you'll see database systems, for example, startups, they'll say, ''Hey, we'll give you these properties.'' Folks who are maybe more versed in the literature will tell you those properties are impossible. In fact, it's been proven by literature that you can't have your cake and eat it too, in many cases. I think sometimes it's useful to just understand what the fundamental tradeoffs are, and that will then shape, "Hey, for my problem domain, which trade-offs actually make sense?"

Then the last one, which maybe seems esoteric, is this notion of metrics for evaluation. If I'm building a system, what are the metrics that I want to judge it as being good or bad or stable or unstable? What are the things, the data that I care about, this exhaust coming off of it? It'll tell me, is this thing behaving well or not? Often, what you'll find in research is, as a group goes and builds a thing to solve a problem, there is this body of how do you evaluate, how do you understand the system? I think you get all of these values. I don't want to spend too much time on necessarily the value of research. I want to just set the basis of, here's what we think are the important pieces. There's obviously a lot more there, but I think these are good essential pieces.

Building Consul: a Story of (Service) Discovery

What I want to go through for a little bit is actually a story, the building of one of our large pieces of software Consul. I think, in many ways, it's a story of discovery through research, that got us to the end state, the end final design, and I want to walk you through what that journey was like. To set the context, if we go back to 2011 at the time, there are two trends that were just starting to gain traction. One was this notion of immutability. How do we move away from configuration management and evolving things in place? I think this was the height of config management with Chef and Puppet and Ansible, really this notion that I'm going to provision a set of VMs, and then those VMs are very long-lived. I'm just going to constantly upgrade them in place. I'll just repave on top of the old stuff and do configuration convergence runs.

I think there was this shifting away and saying, “You know what? When you do this at massive scale, you actually introduce a whole bunch of risk and complexity and it’s hard to understand the system when you get these partial failures.” When a config run doesn't fully succeed, or fully fail, you end up in these hybrid states that are hard to understand. On one hand, we saw this push towards immutability. On the other hand, this was when we started saying, "The lamp stack is actually very difficult to scale organizationally. It's hard to have dozens, hundreds of people working on one code base that's just monolithic, so let's blow that up and move into a microservice architecture."

The intersection of these two trends led to this challenge, which is, “Ok, how do you actually configure how these things talk to one another?” If I have my front ends that talk to my API's, my API’s that talk to my databases, but these things live at different IP addresses and they're coming and going, because we're nuking the old version and booting a new version in this immutable regime, how does my front end find the new API server after I've deleted the old one and launched a new one? You have this challenge of discovery.

Common Solutions

If we look at the common solutions, the way we thought about solving this, I'd say there are three modes that we saw. One was hard coding of IP’s. Whether it was the IP of a host, the IP of a virtual IP, a load balancer, we'd hard code and say, "My API that I talked to is always at this IP.” The challenge with that becomes, what if it's not at that IP? You deleted that VM and it moved or it failed or whatnot.

The other way of dealing with this was not have it be a hard-coded IP, but still have an IP that's configured through something like a config management tool, so Chef, Puppet, Ansible, maybe you run them every 30 minutes. This way, if you scale up and add a new API server, 30 minutes later, the next Chef run catches and updates the configs. Then the last more sophisticated large scale cloud-native shops, you'd see custom zookeeper based systems. Think Netflix and their architectures. They would talk about how all of this was dynamic and you'd use systems like Zookeeper to coordinate and actually manage the fact that these IP’s are dynamic and not static. This was where the world was around then.

This got us thinking and starting to imagine what would we want the solution to be like? A blank slate, not worried about constraints or implementation. What we had today was something like this. API layers are hard coding and saying that "This IP and port, here's where I go to talk to my database," where ideally, what we'd like is to really talk about and name, some sort of logical thing. What I want to get to is my database. I don't really care what IP of the database is at. Maybe the database moves around, maybe its IP changes, it's a detail. What you really want is something that's going to do that translation, from the name of the thing, which is logical to the physical thing, the IP, which is a detail you don't really care. That's what we wanted.

Exploring the Literature

At the time, we were fascinated with this idea of could it be done entirely peer-to-peer? Could you do it without a central system like Zookeeper, and just have all the nodes coordinate among themselves? Share and decentralize, solve this problem? We started by exploring this literature. We went down the path of saying, "Ok, if we look at distributed systems and particularly gossip-based peer-to-peer, what do you start to find?" I think what you start to find is it's a very broad literature. It's everything from very large-scale internet-based systems like BitTorrent, that's decentralized peer-to-peer. File sharing has a ton of work in the space towards things like distributed hash tables, maybe not quite as broadly used, to a much more small scale gossip systems. There's a huge range of different literature looking at it in different ways.

On one side, what you start to find is there are these different axes, and it's a spectrum of solutions. One spectrum is how centralized is the system. You find systems that are highly centralized, that depend on strong central servers, maybe a Zookeeper type system. I have a central server, Chubby, when you might put that in that category, to then on the extreme other side is total decentralization, totally peer-to-peer, no notion of central control. It's all self-organizing, and you have things then in the middle. Things in the middle think like BitTorrent, where you might have some notion of super peers, where among a pool of equal peers, some nodes get elected into doing extra work. There are these supernodes, like a BitTorrent tracker, maybe it might be in that category.

Another major access you end up seeing is how structured do we think about our distributed system as being? Is it totally flat, totally decentralized, every node talks to every other node? That's the right-hand side. You'd call that an unstructured gossip, where you might see words like epidemic broadcast, meaning you pick a node, pick three other nodes, broadcast to those three, those three pick another three, keep rebroadcasting. You'll see things like Mesh Networks in there. This notion of flatness randomization will come up a lot. The notion is that I have this large pool of nodes, they're talking to each other, but not in a very structured way. We're just randomly communicating.

On the extreme other side, you have highly structured, where you're saying, "I'm going to try to impose some sort of forced structure." Among nodes, we're going to build a spanning tree or we're going to build a ring or some sort of a more cohesive structure. This is how most networks operate. If you have switches and things like that, you plug a bunch of switches, then they're going to form a spanning tree in terms of how packets flow through the network, they'll create a structure. Then along the middle, you'll find algorithms and straddle it, some things that'll be more hybrid, some things that are adapted and maybe they can switch between different structures as necessary, so you see that there's a spectrum of literature as well.

Then the third spectrum you start to see is around visibility. If I have a very large scale system like BitTorrent, and I say I might have a million peers, it's not practical for every node to know every other node. If every node had to know when a new node joined a BitTorrent cluster, you'd just be spending all of your time broadcasting nodes joining and leaving. You don't really want full visibility; you want enough visibility to be able to do what you need to do.

On one side, you have these notions of limited visibility systems where you say, "How do I have just enough awareness that I can get what I need done?" Or in some cases I'm a privacy-preserving system like Tor, I want to have extremely limited information, I want to know one peer and nothing else. To the very other extreme, it’s full visibility systems. It's a small cluster, I operate all the nodes, it's inside my data center, let's say, and I want to know what are all of the members of my cluster. You get these different tradeoffs in terms of, how would you design a system on one side or the other side? Then again, you have middle systems, where you might know some subset, but not everybody and not a minimal set.

Imposing Constraints

I think what we start to say is, "Great, there's this huge body of research." One of the first things you have to do in almost any field is impose some constraints that matter to you. For the problem you're solving, your constraints are different. In our case, we said, "What we're really trying to do is solve this within a cloud data center environment." We're not solving the bit-torrent use case, where you have a million people and some of it's at home and some of it's disconnected and things like that. There are a few common constraints that we say, "Can we scope down our search of this literature if we said we only care about a few nodes?" where a few in this case is less than 5,000 call it, where in some of the literature you might be talking to millions of nodes.

The other piece is talking about low latency and high bandwidth. What's the operating environment that we're in? Why this is important is there are whole sets of literature around things like IoT devices where you might say it's very high latency, and very low bandwidth. You're connected to an edge network device and you're a little sensor on the back of a truck. You don't have a low latency connection to a data center, you have very little bandwidth. The design around that system is very different than a data center. I think that's part of it.

Then the last one, I think this ties it back into the industrial question is, is it simple to implement? I think this is actually a really interesting one, because this forces you to look at the research through a different lens, which is, there might be some research better than others in terms of, this does a better job handling this set of problems, but there's an enormous complexity of its implementation. You have to balance how much complexity do we want in implementation, because the more complex it is, the more inevitably you're going to screw it up and there's going to be bugs and sort pathologies, and it's hard to understand. I think this one is actually a very important constraint, which is how complex is this implementation?

The SWIM Approach

What this ultimately landed us on is a paper called "SWIM”. SWIM is short for Scalable Weakly-consistent Infectious-style Membership Protocol. If we broke that down, the notion of scalability for this paper is not that it will scale to millions, it's that the amount of work done per node is linear. If I have a 1,000-node cluster, each node is doing the same amount of work as when I was a 10-node cluster, so it scales very well. You don't have any nodes doing some exponential amount of work.

Weakly consistent means there's no central point of consensus. There are no nodes that are strongly super peers that manage the rest of the cluster. The nodes all have some view of what the cluster looks like, and it's weakly consistent. It will converge over time, but at any given moment, every node may not agree. Infectious style, they broadcast in this epidemic broadcast way of pick a node, pick three other random nodes, send it to them and fan it out step-by-step. Then the membership protocol just being what it's solving for, which is who all is in this cluster.

I think when we think back to those different axes, the things that were important to us, were picking and choosing where we want it to sit on that spectrum. On each of those spectrums, you had a choice to make. If you change your choice, you would've ended up with a different paper or different implementation. What we wanted was something that was completely decentralized. We didn't want any super-peer or a central server because we wanted this to go back to our original view of these are just a bunch of random services. They're not really talking to each other. Then there's no central server, so we don't want any super-peer. We wanted something that was unstructured with epidemic dissemination, because it makes it really simple to implement, and very robust to any failure. You don't have a supernode or some structure that if that structure gets disrupted, it's now a really costly process to recover.

Full visibility made sense because you're, "This is such a small system. We only have 5,000 nodes." There is no reason to optimize around every node can't hold a million nodes in memory. The problem is small enough that that's not an issue. The things that we wanted to trade here was we're willing to trade off and say we'll spend a little bit more bandwidth communicating and maintaining that structure, because it's randomized, because everyone knows about everyone. We're going to use more bandwidth as a result, but what we're going to gain is simplicity and fault tolerance. Because we have this practical consideration of this thing is going to run in a data center, you're not really that bandwidth-constraint if we're talking about using 10 kilobytes versus 5 kilobytes a minute. That level of bandwidth is just not important, simplicity and fault tolerance are.

Closely Considered

What were the other things we considered? I think what you end up finding is, as you're choosing those considerations, there are some papers that end up being the obvious peers nearby. If you said, "I do care about the structure. I'm really sensitive to bandwidth, and I really want an efficient dissemination." Well, you might have selected something like Plumtree. You might've said it's actually important to have this hybrid tree because it's actually a much more efficient in the normal mode of operation. The downside of a system like Plumtree was it's a much more complex implementation, so getting it right is hard, and it's less failure resistant. If you lose a few key nodes it has to flip into an expensive reconciliation rebuild process, during which the cluster is churning. It's a different set of tradeoffs.

T-man is a similar thing, adaptive, it was able to switch between, if I have a small cluster or maybe it's unstructured as I get larger, or maybe I go to a ring as I go even larger, maybe I switch into a tree structure. It lets the cluster adapt the underlying structure based on scale. You're, "That's great if I have this dynamic system that can go from five nodes all the way to a million nodes, I can dynamically reconfigure the scale," but the reality was you're inheriting an immense amount of complexity to implement this correctly, and how many users are going to do this? How many users are going to scale from 10 to a million and back? This was not really a problem that you see in practical use.

You might see that if it's a BitTorrent system. Maybe there's one seed that nobody really cares about. It's just you sharing a PDF with your buddies, great. There are 10 things versus something that's really popular and it gets to a million, and that sort of adaptivity is useful. Then you have systems like HyParView, where you really only want subsets of the nodes known. If I'm going to have 10 million people participate, you can't have everybody know all 10 million nodes. These were based on what trade-offs you're going to make. Ultimately, I think one of the biggest determinants for us was really this question of complexity. How difficult is it to implement this thing correctly? The size of cluster, not really a big deal. They expect an amount of traffic, also not really that high. All we're using this for is to know what nodes are in my cluster, what's a web server, what's a database. The amount of traffic is low.

Adaptations Used

That said, I think one of the things that ends up happening is as you do this survey of literature, you're not necessarily taking pieces in a whole. You might say, "I actually don't like the overall design of this system, because it's not ideal for my constraints, but they had this little nugget of a good idea." That little nugget can be lifted up and used with our implementation if we rejigger things a little bit. I think part of it is not necessarily rejecting systems outright, but really understanding, "What are their novel contributions, and can I pick and choose pieces of that?" I think that's really valuable.

Some of the stuff that we picked up that wasn't in SWIM, we took from the "Bimodal Multicast" paper, around this notion of active push-pull sync. Instead of just randomly sending gossip messages out and hoping for the best, you pick some interval on what you connect to other nodes and do a full state transfer and say, "Tell me everything about your worldview. I'll tell you everything about my world view, and we'll figure out where we disagree." It turns out that actually adds relatively low overhead, but makes the system much more stable under a large churn.

One thing that I think a lot of the structured implementations do is think about two different data paths. One data path is, "Am I in happy path? The system is stable, there are no failures, we're just trying to do things efficiently,” versus, "Oops, we've detected something is wrong, nodes have failed. Things are not working correctly. Can we flip over into a more expensive recovery mode?" I think making that distinction, you can look at systems like SWIM and say, actually there are opportunities to say, “If we're in this happy path, we can actually gossip much faster and more efficiently. We don't need to necessarily have a bunch of message redundancy if we're in a good place.” Then if we do detect, “We're starting to see message loss”, you can dynamically tune. I think you can look at that and say, ''You can't.'' It's not quite the same. Our system doesn't operate in two different modes, but you have a notion of dynamic tuning based on your state.

Then there are other pieces that we were able to pull in from related work, not directly a tradeoff, but an augmentation and improvement. You can look at something like Lamport clocks and say, "If we're going to send these broadcast messages, I'm going to pick a node and broadcast to them randomly”, messages are all going to arrive out of order, any given notice, just receiving random messages. There's no real ordering to this. It can be hard to say, “If I see a node that says this node joined” and someone else says, “This node left,” it's hard to sequence that and say, "Did the node leave and then join, or was this an old node that died, and then it rejoined?" The ordering is actually important here. What we found was you can actually pull in vector clocks and Lamport clocks to impose an ordering and some causal ordering and have some notion of this thing occurred before this other message. These make the system more resilient to message ordering problems.

Then ultimately we were able to pull in some other network, Vivaldi, which is this notion of, if you have nodes that are communicating over a network, what we can do is measure the latency between our communications, actually use that to start building a network coordinate system, almost a GPS for nodes on a network, where it's not going to assign you a GPS address. We can use this to estimate a distance - network distance, not geo distance. For any two given nodes, you can actually query and say, "How far apart are they on the network?" This starts to let you do interesting optimization like nearest neighbor routing. Instead of sending me to any database or any Memcached, send me to the nearest one, because now I have a notion of what nearest means without manually maintaining a bunch of, you're on this rack, this switch, this cage, etc., because that would get tedious.

Serf Product (serf.io)

Ultimately, this research work was really around how do we explore this and get to some version of an idealized product. You have to package up that research as an implementation, polish it, make it usable, put it behind an interface. For us, that became a product called Serf. Serf is still a product, the goal is really these three things. It gives you a decentralized membership so you can know who all is in the cluster. It gives you a failure detection, so a node dies without gracefully leaving, you can know, "The web server is dead, don't route traffic to it anymore." And orchestration; if I want to broadcast an event and say, "Deploy, get commit X, Y, Z," I can broadcast that as an event efficiently. You get all these properties from the underlying SWIM gossip.

This was the view of how we thought the world would work with SWIM. That you run a SWIM agent on all of the nodes, each of the nodes says tags, what type of a service it is. You might say A is a web server, B is a database, etc. They all start broadcasting this information, and this gossips through the cluster. Now, when the web server wants to talk to the database, it can look at this local membership list, it knows, because it's full membership, what are all of the databases in the cluster, what are all of the caches in the cluster. We can look at that and make a determination of, "What's the IP of the database? Go route traffic to it." This is how we thought this would be applied in the real world.

Serf in practice - there was good and bad. The good was, our goal was for immutable systems that have this microservice architecture. How do we move away from these convergence runs or complex ZooKeeper deployments? For that kind of stuff, it worked quite well. What's nice is you bake in this agent that's stateless, you boot an immutable image that joins the cluster, and it very quickly figures out who's where, what are the dynamic IPs. You don't need to have configuration management running in production. You can quickly just join a cluster and figure out what's running where and dynamically handle this.

The other aspect of it, what's nice is it's insanely fault-tolerant. You could have 90% of your fleet fail and the cluster is totally fine with that. Because there is no central point of management, it quickly detects what are the dead nodes, works its way around them and the cluster stabilizes. You can have incredible churn of nodes coming, going, dying, etc., and the system's tolerant of that. Now, the downsides of it, is where you get eventual consistency problems. I think this is a fascinating one, and we're going to have a panel about this after that I recommend. We often get asked this all the time, which is around Consul, which is why should you treat service discovery as an eventually consistent problem? You message things out. Eventually, people figure out what are the web servers, the databases, etc.

The problem was not that it doesn't technically work, the problem is human operators. Human operators have a very hard time dealing with the fact that, if I have 5,000 nodes and they disagree on what's currently a web server, what's currently a database, what's healthy, what's unhealthy. Then you start getting these weird pathologies, and the pathology makes sense from the computer's perspective. If you log in and really do the debugging, you're, “Yes, there's this period where this set of web servers didn't think the database was healthy, so they weren't routing to it, that's why that database didn't see any traffic.”

As a human operator, you're looking at a graph and you're seeing these random variations and traffic and you're, "Why is this happening?" You'll log into one node and that one node says, "Yes, the database is healthy." From that one node's perspective, you're extrapolating that, "Every node must think that node is healthy," but the reality is no. The system is eventually consistent, these views disagree. This behavior makes perfect sense if you say I have 5,000 nodes and they have different views of the cluster, but human operators don't think that way. We like to think about a single source of truth. This eventual consistency was actually not a technical problem, it was an operator problem. It was hard for users to wrap their head around.

The other challenges were there was no key-value configuration. If I want to do key-value configs, how big should my connection pool be, or should I turn feature flag fu on and off, etc., I might have a very large corpus of key-value. You have some deployments that have millions of config keys that they have. The challenge becomes gossip is a bad way to disseminate this. Once you start getting into high volume of data that you're trying to reconcile, it's not a good approach.

If you think about membership, it's relatively small. I have 5,000 nodes, they're tagged as either a web server, an API database, whatever it is, so the amount of metadata is small. Each node has a little bit of metadata. Once we get into key-value, you're not coupled to a node anymore. It's not metadata per node, it's just metadata that you have that exists that's unbounded, so gossip starts to become a bad dissemination. I might have a million keys, but only two nodes need these set of five, and five nodes needed these set of 10 keys. Every piece of config is not relevant to every node, but when you're disseminating over a broadcast, every node has a full copy of the database. There's this very expensive distribution, like the blockchain problem.

The other part of it was, because there are no central servers, there's no central API. When you're trying to build automation tooling around this, the challenge becomes, "What's the API I hit to query and say, 'what are the nodes in the cluster?' Or fire off an event?" In some sense, the nice answers are you can talk to any node. Every node is equivalent, every node is up here, so you can query any node or you can tell any node to broadcast a message, and it's the same. The challenge is as an automation, it becomes very hard, because there's no server. There's no “What is the thing I talked to?” it's just like picking a random node. Operators don't really like that, they like to know “This is the server I talk to that it exposes it a certain API and I can firewall it and restrict who talks to it.” Versus what you say, any node is the same, it becomes sort of awkward. So serf in practice, had some drawbacks, that people didn't feel necessarily that it was a perfect fit for the problem.

Central Servers Challenges

Going back to the drawing board, what people really wanted was something more like a classic client-server. They're, "Yes, we like aspects of the gossip. It makes it really easy to deal with immutable architectures, but if we had a central server, then we'd get a bunch of nice benefits. We'd have somewhere to define our key value, we'd have a central API to talk to. I can firewall this thing logically, and say only some people can hit these APIs. It gives me that point of centrality to talk about the cluster in a way that I don't have today, I'm, "Ok, great."

You start to think about what are the challenges of building an architecture that looks like this. One thing you're going to care about is if this is the lifeblood of how my systems find other things, how my web server finds this database, it better be highly available. If these things are down, then my whole data center goes down with it.

The other thing I care about is durability of state. I don't want this to be best effort and "Oops, a node restarted and now it's total data loss. I lost all your key-value config." You'd relatively care that if I'm writing things like key-value, that persists, it doesn't get lost. Something like relatively lower on the scale is what's a web server. If I restarted the node and you forgot what was a web server and 10 seconds later, it re-registered, that's more ok. I think when we talk about service discovery, you'll see systems like Eureka. Netflix's system is very stateless. If you restart Eureka, it's total state loss. That's ok because the client's checking in every 15 seconds and it's eventually consistent, and you're like, "The data is easily repairable." If you restarted your database and it had total data loss, you'd be, “That's a bit unfortunate.”

The other aspect of it is you want strong consistency. The reason is not because there's any technical need for it, for service discovery; you really don't. It's actually for your operators. Operators have a very hard time wrapping their head around the behavior of a large system when you get these pathologies that come from eventual consistency. Our requirement for this was really driven more around user experience than anything. I think that's an important industrial concern sometimes. From an academic perspective, you don't necessarily care if it's usable by people. It's like, is it novel? But from an industrial perspective, it has to ultimately be usable.

Paxos or How Hard Is It to Agree?

This brings you into the world of consensus. You're, “The thing that has that shape, we need it to be highly available, meaning better be able to do leader election and failover dynamically.” You want data replication, so there should be some consensus on what the data actually is. You ultimately end up in the world of consensus. The very bottom, the keystone work on which everything is built is really Paxos. You go all the way back to the early work, it's “The Part-Time Parliament” by Leslie Lamport, which if you've ever tried to give it a read, is a relatively inscrutable document. What you find about Leslie Lamport is the man is a genius, and unfortunately, he writes for other geniuses, so it's not great if you're not a genius.

He decided, "People keep hemming and hawing about how complicated this is. Maybe I should just make this simple for them." He himself wrote a follow-up to it called ''Paxos Made Simple.'' I think it has my favorite abstract of all, this is literally the abstract of the paper. You can tell he was trying, he was making more of an effort, but if you read this paper, it's clear that it's still not that simple. In good practice of academic tradition, you're, “What are the related works, what are the extensions?” Where have things gone since? When you click into Paxos, you realize, dear God, where do we even begin? There are so many branches of this. You quickly get into Multi Paxos, Egalitarian Paxos, fast, cheap, better, stronger, whatever you want, every version of it is there.

The challenge then becomes, "How do I pick and choose which of these things is relevant?” In many cases they compose. You can get multi-egalitarian, and you're, "Oh, my God.” It's on you to figure out how these things start composing together. This research quickly spirals, there's a lot going on here. This to me always felt like the perfect culmination of trying to understand Paxos. By the time you have read the fifth paper, just trying to explain how the fourth paper was the simplification of the third paper, you're, "I'm feeling like I'm being told off here."

Thankfully, there finally was a Paxos made simple, several years ago now, maybe 2013, 2014. It was a paper called "Raft," which I would argue maybe could actually be called ''Paxos Made Simple,'' which is just a distillation of many of those other things. It takes ideas from the other variations of Paxos, but I think really packages it into something relatively straight forward and usable.

I think a lot of the Paxos implementations is a lot of exercise left to reader. It's, "Here's how you do this one little thing," but then exercise left to reader in terms of how you'd build this into a larger system, or how you'd handle changes on membership or whatever. It's a simple matter of engineering to do the extension. Until you try and do it and you're, “It's not so simple.” I think what Raft did well was really saying, “There are a million ways you could do this, but we're going to simplify it down and make some opinionated choices of this is how the system's going to work.” I think what's nice about that is, it eliminates degrees of freedom and allows the system to become much simpler in its overall design.

We ultimately landed on Raft as the desired implementation. Is it perfect? No. Are there advantages of some of those previous implementations? I think a lot of these ones you can get systems that are more high-performance, or able to really squeeze out availability under different kinds of failure conditions, by making different tradeoffs, the difference being that they're all very complex. If you're going to do that, you're opening up this Pandora's Box of complexity. It goes back to that Kim Hagan simple principle, which is that's great for an academic work. If you just needed to compile and barely get through the evaluation of the paper, fine. If it's a production-worthy system that you understand and can debug and you're going to be on the other end of a support call, hopefully, it's a little bit more stable and understandable.

Consul Product (consul.io)

Ultimately, this is what got packaged up into our Consul product. Consul, in some sense, was the result of us doing a V2 of Serf. We acknowledged that we liked the ideas behind Serf, it had some good things going on, but it missed the mark of user experience. It was too complicated for operators to wrap their head around, and eventual consistency and lack of centrality were a problem. We said, "Can we end up with this hybridized design?" which is sort of Consul. This is why it's awkward when people ask, "Is Consul a CP or an AP system?" You're, "It depends how you squint," because the core servers, the central servers, use Raft, they're strongly consistent. They run a consensus protocol, they do leader election, they replicate, they write data to disk, etc.

The center of the system is a CP, consensus-driven system, but it also still uses gossip. All the nodes still gossip with one another. The underlying Serf implementation is still a part of it. It's used as a library, so all the nodes are talking to one another, and what this gives us is still a lot of the nice properties. The system can be highly immutable. You can move the IPs of your servers and have thousands of nodes and not reconfigure them, because the gossip lets us deal with the fact that servers live at a dynamic address. We can have nodes that fail and we can detect it very quickly. We've got a one-sec sub-second failure detector, without the central servers doing all the work. You can have enormous clusters where the servers are idle. You get into this hybrid design where in our world, our view at least, was you got a bit of the best of both worlds. You've got the centralized API and the logically consistent state. It's easy to wrap your head around, but you've got this decentralized operation and a lot of the niceness that comes from gossip.

When you take a step back and really look at what is all the research work embedded in Consul, what you end up finding is there's actually a ton in there. It's not obvious when you use the product, but the core servers use consensus among themselves. All of the nodes use gossip protocols, that Vivaldi paper of figuring out how far away is a node and being able to use that for routing, is a whole subfield known as network tomography. How do I map the tomography of the network and use that to do interesting routing?

There's a twist on some of the security we've used of a capabilities-based model, which isn't necessarily as commonly seen. You see more core screen role-based access control in most systems. The core state management of the system is based on an in-memory immutable radix tree implementation, and the key advantage of this is multi-version concurrency control, which lets us basically have many readers; thousands of reads can be taking place at the same time without doing any locking.

We can have a highly paralyzed read path where we don't have to bother taking any locks, and this is hugely important when you think about a system that might be handling 100,000 queries a second. You don't want lock contention in that. Then a final piece, things like using Lamport and vector clocks to able to do things like message ordering in a large distributed system, and still be able to do reliable event delivery and have some notion of before and after.

Research across Products

There's a ton in there, and the goal is making that invisible. I think, this was a journey, a story around Consul, but I could share a similar one of these stories with almost all of the HashiCorp products. Terraform, really at the heart of it, is really a graph processing engine that happens to know how to make API calls. Really, that's what it is.

The core of Terraform itself does nothing about any API call, nothing about any cloud, it's really just processing a graph internally. Internally makes use of a lot of graph theory, type theory, automata theory. The original design of Terraform was actually dramatically different than the Terraform of today. It was originally a server-based system that used finite state machines and it would do transactions over them. It was one of these things where it was really going through the literature and trying to understand the limitations of things like two-phase commit that moved us away from a design like that, into Terraform of how it looks today.

Vault, it's funny. In many ways, it's literally Kerberos. If you look at the original Kerberos papers and reread those, you're, “Yes, there are some slight variations.” But honestly, I think our starting point was really saying, “What's done security in a very principled way?” I think one of the most foundational works in that space is Kerberos. It's still the most widely deployed, if you think about active directory, it’s really built on Kerberos. It's the most widely deployed security system.

There are a ton of things Kerberos got right. I think we looked at it and said, "How do we borrow a lot of the good design decisions?" Then really look at what were the things that we have now 30, 40 years with Kerberos in practice. What were the bad design decisions? I think the bad design of Kerberos was having a very specific SDK that every app in the universe needs to integrate with. That makes it very challenging for people to actually leverage Kerberos.

Vault says, "What if you largely took Kerberos and invert the integration model?" and say actually, “Vault will have a series of plugins that talk out to the other system and speak their native API, rather than trying to modify every system in the world to speak Kerberos as API.” Ultimately, it was really looking at it and saying, "There's a lot they got right here” in terms of designing a key distribution center and pluggable authentication mechanisms and time-based expiration of tokens, etc. There's a ton of other stuff in there around security protocols and cryptography and access control systems, but a lot of it's based on that core research. Nomad, in some sense, is the most straightforward from research. It literally is an implementation of Google's Omega. If you read the Omega paper, you're, "That's literally the design of Nomad."

I think that's super valuable because you can really look at a company like Google with decades of scheduler experience and, "How do we go from our monolithic scheduler of Borg to really making it a more extensible, higher throughput, being able to push new capabilities into the system?” They talked about their evolution from Borg to Omega, some of the architectural changes, but that's a super-rich area because there's a lot of pure CS problems. If you look at things like bin-packing or classic knapsack problems, its been used in everything from how does Amazon put items in a warehouse, to how do you pack the order of transactions, to process, to how should you place nodes on a machine.

It was very theoretical computer science work around this. It's not really applied when you look at knapsack problem space, that actually is very much directly attributable to Nomad. Same with things like preemption. I think there are classic problems theoretically, of if I have a work queue, how do I know what works should be preempted to make room for other high priority work? How do you interact with that? How does that impact, if you think about queue theory and control theory?

There are a lot of good things you can learn from. With a lot of these things, what you realize is if you do them naively, there are enormous sharp edges. The value of really being able to go into the research is understanding what are those sharp edges and how do I avoid them? In some cases, like Omegas cases, what are the clever tricks you can use that will then be, "Now this system actually has 10X the scheduling throughput, because of this one weird trick. Sometimes, that one weird trick is not obvious in a naive approach to it. You really have to be soaked in the world.

Forming HashiCorp Research

Shifting gears a little bit. A lot of this exploration of research took place in the early formation of the products. A lot of this was possible because me and Mitchell were closely involved in a lot of the early products and we brought a bit of this research background with us. We knew, "For this Terraform shape thing, maybe we should go look at graph theory. There might be some interesting things there." How do you formalize that? How do you bring that up to scale? This led to this notion of HashiCorp research.

It's odd, even as a very small company, we were 30 people at the time. We had an industrial research group. We hired in a guy, Jon Currey, who has actually spoken at QCon London as our director of research. He formerly worked at Microsoft Research and MSR. He was at Samsung's research labs. He'd spent a lot of time doing this industrial research work, which is halfway between engineering and prod work, versus pure academic research.

He joined us to found this group. The key when you talk about industrial research, is you have to make this distinction around what's the goal? What's the output of this? Often, when you think about research, particularly theoretical computer science, it's "How do I take the upper bound of this algorithm from n to the 10th to n to the 9th?" You're, "Yes, that's an interesting contribution.” It’s novel, but practically unlikely to be useful to have an algorithm that runs at n to the ninth. Not going to be that useful in the near future. For us, we really defined this charter as it's this focus on industrial, meaning we want this to only be 18 to 24 months ahead of what would be useful in product or engineering.

The other piece of this is, what makes this different from an advanced engineering group is that it has to be novel work. If it was just a group that was saying, "You're just working 18 months out ahead of engineering, but none of it is novel. You're just implementing things that are well-known or exist in the world," then it's not research. That's an advanced engineering group, and there's value in that as well. The key here is the novelty means that it has to be publishable. We have to be able to say that one of the outputs of this is a paper that we submit and that we're adding something to overall scientific knowledge. That's how we very narrowly defined the focus of this group.

Research Goals

We think about how this group wants to approach a problem. It starts on the left by saying, "What's an interesting problem?" It might be interesting because we just think it's interesting, because our customers think it's interesting, because it's a problem our users have all the time. The first pass is doing an exploration of the research. It's going out, looking at the literature and saying, "Is this a solved problem? Has someone else solved this?" Is there a paper we can look to or a system that's input implemented or an algorithm that exists that solves this, in which case, great, there's an existing solution. Someone has solved this. What we can do is hand that over to our engineering team and say, ''Hey, you're having this problem with Terraform. Lucky for you, there's an answer to this. Here's the algorithm, go implement this thing.'' That becomes part of it.

The other side then is, can we create a novel solution? If we can find a novel solution, then great, this is the goal of where we should spend the bulk of our time. From there, there are two outputs, one is publishing - there has to be a paper that comes out of this, and the other is that we integrate it ultimately into a product work. One customer problem, one version of this, I was working with a large scale user of Consul, where what they found was when they were getting DDoS-ed right on their front end, that random nodes on the interior of the cluster would be marked as unhealthy. This is a strange problem, you wouldn't expect this. You'd expect the front end servers to be impacted, but not the backend.

We went through the cycle where we said, "What you have to do is start by collecting data.'' Why are we seeing this? What is the cluster doing? Why is it flapping? Then you make a hypothesis, "I think this is the problem, and let's design an experiment that we can do in-house, a no cleanroom that'll reproduce the problem.” Step one, can we reproduce? If we can, we validated that our hypothesis is probably correct. From there, can we go design some novel solution to this? Now, the validation of it can be done in that same research testbed. In the same testbed that allows us to create the problem, we can now validate, did we solve the problem? This is the core loop of the research team.

If we look at the gossip protocol, this is a simplification of it, but it falls through a simple FSM like this. You're either healthy, or if you don't respond to a ping in time, you're suspected of being unhealthy. If you're a suspect for long enough, you're dead. Then there are arcs back to being alive, if you come back, you can rejoin and rebroadcast. The hypothesis was there's a timing issue here, because the core algorithms are simple enough. You're, "How are we seeing this weird behavior?" Nodes that are healthy are being marked unhealthy.

The supposition was, “Then what must be happening is that there's some set of these nodes that are not processing the reputation in a timely way.” They're getting the message back. They think they're unhealthy, but that node responds, "No, I'm perfectly fine," but they're not processing it in a timely fashion, maybe because they're being DDoS-ed. This is causing it to freak out and think that that node is dead, even though it's perfectly alive.

Reducing Sensitivity

From that hypothesis, we ultimately ended up building a solution where we said it's actually three different things that will come together to solve this. One is a notion of our own local nodal health. If I'm in a degraded state, maybe I shouldn't go around claiming that other nodes are unhealthy. Maybe the problem is local and not actually with the other node. So how do we build some notion of local health?

The other is one looking at the literature and this notion of having this exponential convergence. Actually, some of the inspiration for this was bloom filters. Seemingly irrelevant work, but one of the core ideas behind bloom filters was, if I do a hash function, I'm going to get a good distribution, my input is going to get distributed over a broad range of keys, so hashes are nice for that. I'll have these different keys, but if I have a limited number of bits that I'm working with, the odds that there's a collision, that I hash to the same value if I have a small output size, is still pretty good. I might hash the same key accidentally because I'm only hashing down to 16,000 keys.

One way to do that is to have multiple different hash functions that hash to different buckets. Then the odds that multiple independent hash functions all hash into the same key while having a different value is very low. There's this notion of K independence. Each of these K things makes it exponentially less likely for this to happen. We took that nugget and said, "What if multiple nodes have to confirm that you're actually dead, not just one node?” It's exponentially more unlikely that I have five different nodes, all think you're dead and all five of them are wrong. Then the last one is a simple optimization, which was just like early notification. The earlier I can tell you I think you're dead, the more time you have to respond to me.

Ultimately, we went to, “Great, we built this solution” that we thought was novel, went through that evaluation process. What we found - and we can see these are log scales - is that we can get hundred-fold reductions in detecting detection of false positives, even under extreme DDoS-like circumstances. A 100X improvement, in a theoretical benchmark of a torture test translates to, in real-world, an even more impactful improvement, because you're not generally being DDoS-ed all the time. Ultimately, this got published as a paper, "Lifeguard," earlier last year at DSN. If you're interested, you can go check it out. The paper is called ''Lifeguard.'' You can find it on archive.

Then the final step is ultimately integration with product. The goal then, you can see these are just highlights of our changelog as this got slipstreamed back into the release for both Serf and Consul as part of their release. It's like, great, here's this improvement. Now these systems are just more stable at scale. That kind of completes the arc.

Picking the Problem

With the Lifeguard, what we ended up doing is really, starting with a problem that a customer handed us, “Here's a real-world problem that the customer has.” In research, you have the luxury of sometimes if you can pick your problem. You can pick what you want to solve. One thing that we started looking at late into 2017, was this notion of, if a customer is using something like Vault, then every time there's a user action, there's an audit trail from it. Wouldn't it be neat if we could look at this audit trail and detect anomalies, and say, "Armon is stealing data. If something bad is happening, you should look into this." When you think about it, at a high level how an anomaly detector works, it's something like this: an event comes in, then the detector is evaluating that event against some model, as some internal model that says what's normal behavior, what's bad behavior, etc., and it's making a decision, "Yes, this is expected. Ignore it," or "No, this is unexpected," trigger a user to come think about it.

From there, you jump in and say, “It's about construction of this model.” This is the heart of it as that model construction. When you start exploring the literature, you see that there's actually a spectrum, just like there was with gossip. You have systems that optimize to have fewer false positives, and you have systems that optimize for fewer false negatives. It turns out you can't have your cake and eat it too. You can't say, "I just want it to be perfectly better than everything." The problem is, both of these are bad in different ways, because one is saying, "If I have fewer false negatives, I have lots of false positives. If I have fewer false positives, I have lots of false negatives.”

When you translate that into what would that mean for Vault, these are your choices. You're either, "I get to screen millions of events that are most likely fine, the system's just overly sensitive”, or, “The system’s quiet all the time, and I actually missed the real issues." These are your fundamental choices. It turns out anomaly detection is a sad space.

Refining Configuration

The crux of the problem is this model. What we're trying to do is learn a model in an unlabeled way. No one's labeling the data and saying, "This is an attack" or "This is good behavior," because if you knew that, then why aren't you stopping the attack? Why are you labeling it? These models are building these heuristics and they can only become so good. This led to a different way of thinking about the problem, which is what if instead of detecting an anomaly, what we help you do is build a model? This changed the question of what we're really researching. If you think about a system like Vault, put a firewall in there, pick any security system, there's some set of configuration that says who is allowed to do what. That configuration represents a model. There's a config that says Armon can access 500 secrets, that means you have some model which says it's normal for Armon to access these 500 secrets by definition. You've granted me access.

Can we look at this and build a closed-loop where we can actually say, "Yes, you granted Armon access to 500 secrets, but in practice, he only ever uses five. We've never seen him use anything other than these five. So those other 495 things you gave him access to, are actually a failure of your model. Your model is badly designed. You should actually remove access." In this way, what you're really doing is instead of detecting the failure, you're trying to prevent that credential from ever leaking in the first place. If you can remove my credential from even having access to it, then if I decide to go rogue one day, well guess what? I can't go rogue. This model itself has already been updated, the configuration has been tightened. I don't have access.

I'm not going to go into depth of how the system works. If you're interested, we've talked about it. There's a good video on YouTube from our research team talking about Vault Advisor in-depth and the system design and how the research behind it works. It turns out there's a lot of interesting research problems about approximate set cover, and that it turns out it's a terrible NP hard problem everywhere you look.

Where we are with that is we're at this interesting place where we have a novel solution, there was no existing solution exactly to this. The next steps are, how do you publish and how do you integrate to product. With the Lifeguard work, this optimization around gossip was much simpler, because it's an existing product. What the research team did is basically create a project fork internally, it's just the research version of Consul, implement our solution on top of that, right on our private fork. When it's good and we like it and we're happy and we are, "Great, it's in a good place," we basically just create a pull request, send it upstream, engineering strips puts it in and it shows up in a changed lot.

The problem is something like Advisor; it's a whole new project. It's a whole new product. It's not just "Here's a 500 line difference to an existing algorithm, existing system." The challenge is, if you look at how the research team has been working, there's this long prototyping phase of exploring, building different algorithms, building the core of the system, and then ultimately it's switching of gears into publish mode, where we're working on evaluations and paper writing and things like that. What we've done is basically parallelized the track with Eng to say, "At some point as we get closer to this thing where we're like, 'Yes, the system is getting pretty close to the final shape,' we're going to start training the engineering team on, 'Here was the design, here's the RFCs, here's how the overall system works,'" doing this shared model, and then ultimately having an embedded research model.

One of our research engineers is embedded with a production engineer to say, "We're going to do the knowledge transfer. The context sharing of how does the system work, how do the core algorithms operate?" Then with the goal that this ultimately hands-off where it's no longer researcher's product, it becomes engineering's product. Now it goes into it's a product and there's long-term engineering and product management, and all of that sort of associated stuff, because once research does their publication, they wipe their hands. It's no longer their thing. That brings us to these two, which hopefully you'll see both of these come out of this work, and we'll publish the Vault Advisor work and you all can get a view into the inner guts of the system.

Research Culture

The last thing that I want to end on is, how do you build a notion of a research culture? I think why this is important is, if you look at our product and engineering org, it's about 100 times bigger than our research org. So you need to take a cultural approach. It has to be part of the culture, because it's very hard to scale and say we're going to just create an embedded research model, it's a cost-prohibitive. What you really want is that for the most part, product and engs, you don't expect them to produce research. You want them to consume it.

There are a few things that we do to encourage this. One is, all of our engineering teams, all of our research teams, publish internal product requirement docs and RFCs. We have internal mailing lists that these all get broadcast to. This becomes a really good way, for example, for the research team to expose what they're doing in spark thinking to product and Eng, but also the reverse. When product and eng publish, the research team can look at it and say, "For this shape of problem, this really is a knapsack problem," or "This really is a problem, you should go look at that literature.'' It's a good way for the team to plugin and say, "Here's a relevant literature to go look at.'' It's a good way for us to just keep abreast.

The other side of it is creating space for people to talk about it internally. We have a talk research channel internally that anyone can join, it’s open. People have freely discussed papers "Hey, this might be relevant to Terraform," or "This might be relevant to Vault." You'll often see things like the morning paper gets shared and there'll be commentary around, "Is this useful for anything that we do?"

The other one is building internal brown bags. Our research team will do internal brown bags and expose people to different interesting research or the work that we're actively working on. Half of the research team talks at conferences as well and it just creates that awareness of this research, and sparks that interest in these different areas.

Then the last one is giving access to people, this is underrated. A lot of these papers actually sit behind paywalls. You need to have access to ACM or IEEE or whatever, to actually be able to download it. Internally, we have a policy that will reimburse people for that. In many of these cases actually, HashiCorp is either a sponsor or a corporate member or whatever, and we give engineers access. Anyone can say, ''I want an ACM account," and the company pays for it. Part of it is, how do you lower that barrier to give people actually access to the papers if you want them to consume it, and then encourage people to go attend things like Papers We Love Conference. We've sponsored it, we've spoken at it, we've sent engineers to it, so actively encouraging people to go, not just a pure industry events, but going into more of this hybridized industrial research. Papers We Love is a great venue, I think most practitioners would feel out of place at a USENIX conference.

I think the cultural goals for us are really, how do you build an awareness of research? Sometimes it's about just knowing what area should I go look at? "I should go think about graph theory." The other piece is giving access. I think that's a huge part in enabling it. Creating engagement channels internally, so you know who to go speak to, and then promoting involvement in the community. The real-world value of it is really building on the state of the art.