The Challenges & Opportunities in Distributed Networking

Leadership & Org Update 13

Intro

In our regular Ask Me Anything (AMA 30) last week we focused on several technical questions with our co-founder and networking lead, Eric Harris-Braun. We wanted to take a closer look at his long and deep response to a very specific question: What are the technical progress/blockers/challenges with networking? We’ll let Eric’s words speak for themselves (with slight readability edits) by quoting him directly about the four challenges he mentioned and then we’ll provide some additional context and commentary so that you and the rest of our community can better understand the challenges and opportunities with distributed networking.

The Challenge & Opportunity of Rust

One of the challenges with networking is our use of Rust as a programming language. I wouldn’t call this a blocker but it certainly is a challenge. And the challenge is that as a programming language, Rust makes you front load a lot of effort that other programming languages don’t require you to do. The challenge is that it takes time. And that front-loading of time is necessary because the Rust compiler is really picky and catches a bunch of errors for you. So it makes you do a lot of work upfront and that’s a challenge because we would like to do it more quickly. What we realized is that the extra work on the front end, which makes it take longer, reaps greater speed later since there are way fewer bugs. There’s a thing called the asynchronous model in Rust which allows you to not block the program workflow when waiting for the results of certain actions.

Opportunity of Rust

So that had a bit of jargon in it about Rust, which is the programming language we selected for building Holochain. There were several important reasons we chose to switch from Go, the language our prototype was built in, to Rust. Rust is a strongly typed, multi-paradigm programming language that was designed and built more for systems software than for applications. It is also highly secure, stable, and safely concurrent, all of which are incredibly important for Holochain. However, the tradeoff for that stability, safety, and security is that programming in Rust requires extreme specificity from developers while they are coding. This front loads a lot of the time required for development, but it also means that once the software is built it will have fewer errors when it is running and living in the wild. If you’re familiar with human languages, it might be akin to the extreme grammatical specificity Spanish has baked into every object and predicate, whereas in English it can often take more time and words to untangle the errors in multiple possible meanings. Back to computer languages: when we saw the specificity and efficiency gains in using Rust, the tradeoff felt well worth the extra time up front.
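To make "extreme specificity" a little more concrete, here is a minimal sketch of the kind of thing the Rust compiler insists on. The names (`Delivery`, `find_peer`, `send`) are made up for this post and are not taken from the Holochain codebase. The point is only that every possible outcome has to be spelled out before the program will build.

```rust
// Hypothetical types for illustration only; none of these names come from Holochain.
#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    TimedOut,
    PeerUnknown,
}

// The compiler rejects this match if any variant is left out, so adding a
// new variant later forces every match like this one to be updated.
fn describe(outcome: &Delivery) -> &'static str {
    match outcome {
        Delivery::Delivered => "delivered",
        Delivery::TimedOut => "timed out",
        Delivery::PeerUnknown => "peer unknown",
    }
}

// A lookup that may fail returns Option, so "not found" cannot be silently ignored.
fn find_peer<'a>(peers: &[&'a str], name: &str) -> Option<&'a str> {
    peers.iter().copied().find(|peer| *peer == name)
}

// The caller has to handle both the Some and the None case explicitly.
fn send(peers: &[&str], to: &str) -> Delivery {
    match find_peer(peers, to) {
        Some(_) => Delivery::Delivered,
        None => Delivery::PeerUnknown,
    }
}

fn main() {
    let peers = ["alice", "bob"];
    println!("{}", describe(&send(&peers, "bob")));
    println!("{}", describe(&send(&peers, "carol")));
    println!("{}", describe(&Delivery::TimedOut));
}
```

Deleting one arm of the `match` in `describe` turns a potential runtime surprise into a compile error, which is the front-loaded effort Eric is describing.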

The Challenge of Distributed Testing

Planning for large scale testing is a challenge. We know that Holochain is designed to work at the scale of millions of nodes. What that means though is that our developers have to be able to test in simulated modes, to be ready for when we’re running at large scale, by finding bugs and figuring out when things go wrong and why they’re going wrong. One of the things we’re adding into our networking module is this thing called tracing. Tracing allows you to see where everything is going. A request that happens at the very top level of Holochain makes its way down to different components of the networking module. This, in turn, makes its way into a networking request that goes out to another computer with a little ID that follows that request all the way through the chain. Additionally, this makes its way to another computer that is also (if it has tracing turned on) recording that ID and allows us to see these requests from the top all the way around. We have to do this testing when we’re creating a complicated system such as this. So planning for that and building that into our testing framework is complicated. It’s a challenge to do.

The Challenge of Distributed Testing

So tracing sounds a little scary, doesn’t it? And that’s okay because tracing is not something that will ever be in a final product. Tracing is a necessity for the large scale testing we need to do but it will only be used within our test framework. In centralized systems, the time when something happens is typically managed for you, in the sense that you enforce a form of synchronicity of time, or a single view of time, that your database then manages and your business rules interact with. However, in a distributed system, and specifically with a distributed state, there is more than one perspective of time. There is no central place where it is worked out within the app. Often apps will use logging as a way to report the sequence of events in support of debugging, but in async systems debugging with logging is still incredibly challenging.

In order to build a functional and robust distributed app in the run up to a final product, we obviously have to minimize bugs and have enough information so that when something does go wrong, we will know how it can be fixed. Tracing is about getting that information in the dynamic conditions of a Holochain app during testing. You can think of it as a private investigator on the transaction. Again, tracing will NOT be in the final version of Holochain. We just need to turn the lights on so we can directly observe the problems when they happen.
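For the curious, the "little ID that follows the request" can be sketched in a few lines of Rust. This is a toy model under our own assumptions, not Holochain's tracing code: `TraceId`, `Span`, `Node`, and the layer names "app" and "network" are all hypothetical. Each node records the ID at every layer the request passes through, but only if tracing is turned on for that node.

```rust
// Hypothetical names throughout; a sketch of the idea, not Holochain's tracing code.
#[derive(Clone, Copy, Debug, PartialEq)]
struct TraceId(u64);

struct Request {
    trace: TraceId,
    payload: String,
}

// One observation: which node saw which trace ID at which layer.
#[derive(Debug)]
struct Span {
    node: &'static str,
    layer: &'static str,
    trace: TraceId,
}

struct Node {
    name: &'static str,
    tracing_on: bool,
    spans: Vec<Span>,
}

impl Node {
    fn new(name: &'static str, tracing_on: bool) -> Self {
        Node { name, tracing_on, spans: Vec::new() }
    }

    // Record the request's ID only when tracing is enabled on this node.
    fn record(&mut self, layer: &'static str, req: &Request) {
        if self.tracing_on {
            self.spans.push(Span { node: self.name, layer, trace: req.trace });
        }
    }

    // The request travels from the top layer down to the network, then to the peer.
    fn send(&mut self, req: &Request, peer: &mut Node) -> usize {
        self.record("app", req);
        self.record("network", req);
        peer.receive(req)
    }

    // On the receiving side it travels back up; returns the payload size handled.
    fn receive(&mut self, req: &Request) -> usize {
        self.record("network", req);
        self.record("app", req);
        req.payload.len()
    }
}

// Collect every recorded step for one trace ID across all nodes, in the order given.
fn trace_path(nodes: &[&Node], id: TraceId) -> Vec<String> {
    nodes
        .iter()
        .flat_map(|node| node.spans.iter())
        .filter(|span| span.trace == id)
        .map(|span| format!("{}/{}", span.node, span.layer))
        .collect()
}

fn main() {
    let mut alice = Node::new("alice", true);
    let mut bob = Node::new("bob", true);
    let req = Request { trace: TraceId(7), payload: "ping".to_string() };
    alice.send(&req, &mut bob);
    for step in trace_path(&[&alice, &bob], TraceId(7)) {
        println!("{}", step);
    }
}
```

Because the same ID shows up in the records of both machines, the two sets of records can be stitched together into one path, which is what plain per-machine logging struggles to give you.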

The Challenge of State Management & Determinism

Another thing that’s really complicated is what’s called determinism. This is ‘the big problem’ in computer science. This is what Holochain, Ethereum, and other blockchain solutions are building around. The fact is that there are different views of time, so when something happens in a different order, you have a different result. When you’re testing things, you need to be able to record or see in which order things happened so if things don’t work in one time sequence of events you can find out why. For example, how is it that your code didn’t anticipate that particular timed sequence of events? To figure that out, we’re adding some really interesting things into our testing code which create random number seeds. Random number seeds make things happen in a random order, but in a way that lets us figure out later what that order was. We can play it all back in that exact same order to find out what the bug was. Otherwise, it just happened in some random order without being able to track it. It’s like forcing randomness in a way that we can still know and visualize using this tool called Pseudo Random Numbers. It’s a challenge to get that right, to get all that additional code wrapped up into your code. This is so that you can run it during testing and then not run it when you’re not testing.

The Challenge of State Management & Determinism

The challenge of State Management and Determinism is a massive one in all of computer science, not just with Holochain or blockchains. It essentially boils down to tracking the exact time any one event happened in a system, and when it happened in the sequence of every other event that happened in that system. It may sound simple, but it is not.

Every blockchain is working to solve this challenge in a distributed setting, rather than in a centralized setting as the rest of the world does. However, those blockchain solutions still mostly require a centralized global state. Solving for this from Holochain’s agent centric perspective requires a different solution. Simply put, Holochain is solving for distributed consensus (like blockchains) without a centralized global state (unlike blockchains).

The Holochain whitepaper is probably the best place to look for how Holochain solves for distributed consensus without a centralized global state. You can also chat with others about this on the Forum if you have questions that are more technical.

In a nutshell, what Eric’s talking about here is that during testing we need to provide for a wide variability in scenarios, but that in order to retest the code after a bug has been found and fixed for a particular scenario, we have to be able to replicate that exact scenario again. The ‘pseudo random number generator’ creates the randomness from a specific seed, so that when we force a test to use that same seed, we get the replicability of the scenario.
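Here is a small sketch of that idea in Rust. To keep it self-contained it uses a hand-rolled xorshift generator rather than whatever generator our test framework actually uses, and the event names and the `shuffled_order` function are invented for the example. A real test suite would more likely reach for an established crate.

```rust
// A tiny xorshift64 pseudo random number generator, written out by hand so
// the example needs no external crates. This is an assumption for the
// sketch, not the generator used in Holochain's tests.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift gets stuck at zero, so swap a zero seed for a fixed constant.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        XorShift64 { state }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

// Put the events into a "random" order that is fully determined by the seed
// (a Fisher-Yates shuffle; the small modulo bias does not matter here).
fn shuffled_order(events: &[&str], seed: u64) -> Vec<String> {
    let mut rng = XorShift64::new(seed);
    let mut order: Vec<String> = events.iter().map(|e| e.to_string()).collect();
    for i in (1..order.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}

fn main() {
    let events = ["publish", "gossip", "validate", "store"];
    // Same seed, same order: a failing run can be replayed exactly.
    println!("{:?}", shuffled_order(&events, 42));
    println!("{:?}", shuffled_order(&events, 42));
    // A different seed explores a different scenario.
    println!("{:?}", shuffled_order(&events, 43));
}
```

When a test fails, logging the seed is enough to replay the same sequence of events after the fix has gone in.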

The Challenge of Staging

Staging is another thing that’s both a technical and a project management challenge. As you’re building any project, you have to start with parts that are simple versions of what you’re going to replace later with a more complicated version. You do this as you go since you can’t build it all at once. That’s just the nature of building software. That’s why we create mocks, AKA simplified versions of things. For example, one of the things that we have in our networking module is what we call in-memory networking. Instead of the signals going over the network they actually stay inside the computer. The code that sends a message through a ‘web socket’ connection is made exactly the same. It looks the same to the calling code, but the message didn’t go out to the internet; instead it stayed in memory. So we have that example and we have to run that simultaneously while also being able to switch that out with the web socket networking transport. That means that we have to write code that can handle multiple types of transport. So you create a greater level of abstraction. It takes more time. It’s more complex. It’s a little bit harder to think about but it’s absolutely necessary in doing staging because of how long different parts take. A similar example to the networking one is the DHT, when we’re doing our distributed hash table with full sharding and our full implementation. Well, that’s taking a while to build out and we’re working on that, and at the same time we have a mock version which is a full-sync DHT, where everybody on the distributed hash table gets everything. But again you have to have that be switchable and be able to run and play it.

The Challenge of Staging

Admittedly, this is a challenge that all organizations face no matter what they’re creating. In our case however, there is added complexity in both the technical and project management aspects. From a technical perspective, it’s because no one has built a system that gives distributed consensus without a centralized global state, so we have to continuously build, test, and create new solutions. What Eric is pointing to in his comments here is that from time to time we analyze our situation, and sometimes we elect to develop a short term implementation for some aspect of the system. That allows other Holo teams who depend on that aspect of the system to move forward when the longer term solution would otherwise block progress. If you are familiar with web page development, it is a little like putting Lorem Ipsum text into a web design before you have your real content ready for your site to go live.
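In code, the "switchable" part of Eric's in-memory networking example usually comes down to an interface that both the simple and the full implementation share. The sketch below is our own illustration: the `Transport` trait, `InMemoryTransport`, and `announce` are hypothetical names, and only the in-memory side is shown. The assumption is that a web socket backed type would implement the same trait, so the calling code would not change.

```rust
use std::collections::VecDeque;

// Hypothetical interface: anything that can carry a message to a named peer.
trait Transport {
    fn send(&mut self, to: &str, msg: &str);
    fn receive(&mut self, at: &str) -> Option<String>;
}

// The mock: messages never leave the process, they just sit in a queue.
struct InMemoryTransport {
    queue: VecDeque<(String, String)>,
}

impl InMemoryTransport {
    fn new() -> Self {
        InMemoryTransport { queue: VecDeque::new() }
    }
}

impl Transport for InMemoryTransport {
    fn send(&mut self, to: &str, msg: &str) {
        self.queue.push_back((to.to_string(), msg.to_string()));
    }

    // Hand over the oldest message addressed to `at`, if there is one.
    fn receive(&mut self, at: &str) -> Option<String> {
        let pos = self.queue.iter().position(|(to, _)| to.as_str() == at)?;
        self.queue.remove(pos).map(|(_, msg)| msg)
    }
}

// Calling code is written against the trait, so it does not know or care
// which transport it has been given.
fn announce<T: Transport>(transport: &mut T, peers: &[&str], msg: &str) {
    for peer in peers {
        transport.send(peer, msg);
    }
}

fn main() {
    let mut transport = InMemoryTransport::new();
    announce(&mut transport, &["alice", "bob"], "hello");
    println!("{:?}", transport.receive("bob"));
    println!("{:?}", transport.receive("carol"));
}
```

The extra layer of abstraction is the cost Eric mentions: `announce` is slightly harder to read than code hard-wired to one transport, but the mock can be swapped out later without touching it. The same shape would apply to switching between a full-sync DHT and a sharded one.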

Close

So, those are some of the things that are challenging us — testing, determinism, staging and how to make it all play out as we build it. I’m very excited about our progress. Things are moving along quite well from where I’m sitting but of course we would like them to be moving along more quickly but that’s the way it goes.

Thanks for tuning into this multimedia, multi-perspective review of networking here at Holo. We hope the peek into our development process and the way that we are enabling peer-to-peer technology in innovative ways is as fascinating for you as it is for us. In some of our future posts, we’ll dig into the scalability question in some more detail so that we can tease out in language that non-techy devs can understand why distributed state is so radically different from what other blockchain projects are doing.