Over my last two posts, I’ve talked about the challenges facing real-time applications like League that arise from the internet’s architecture, and how Riot is tackling some of those challenges by creating our own network. In this post, I’d like to look forward - what’s next, and how can we collectively get there? This topic has prompted a lot of reflection on my own experience building networks, and has reinforced my conviction that things are changing for the better. At its core, this post deals with how we can enact meaningful and positive change to the internet itself.

We started with this question: what more can Riot, a game company, do to improve the internet? After much soul searching, ideation, and experimentation, surprisingly, the answer is: “a lot!” We think Riot can participate in a paradigm shift in how tech companies approach the economics, deployment, and business of the internet. To give a few examples, we believe we can help lead the way on driving down the cost of owning and deploying networks, reducing the complexity of maintaining them, and improving the peering workflow. I’ll dive into each of those three points, but these are just a few areas that highlight how the whole network ecosystem is changing.

Before I do, however, allow me one brief digression on scale. It’s easy to think that challenges like these should be handled exclusively by companies with a profile like that of Amazon, Facebook, and Google - but I don’t buy that. Those three are doing amazing work in this space, manufacturing their own switches and routers and participating in the Open Compute Project. But I see no reason that the lessons learned there can’t apply to firms that aren’t quite their size. We are driving towards an ecosystem where companies of any size can contribute to how networks, networking protocols, and even networking devices are developed, deployed, and evolved. We want to allow the users of networks to own their own destiny, and manage it accordingly.

Driving down the cost of owning and deploying networks

Networks are confusing. Worldwide networks even more so.

Think about all the things that make them up: routers, switches, network engineers, IOS, JUNOS, SROS, MPLS, IS-IS - that’s a whole lot of s’s. Who the heck understands all that? I can tell you that when we first started, Riot didn’t. Without sharing too much (#Garland4Life), let’s just say we made a lot of mistakes. And those mistakes hurt our development velocity and the stability of our network environment.

Riot understood software development - we wanted to make games, not build networks. We thought that our applications running slowly or our players experiencing lag was just a reality we had to accept. Then, when we realized that we might be able to fix some of this, we didn’t know where to begin. So we did what most companies do when faced with a problem they don’t totally understand - we threw money at it. In this case, that involved hiring vendors to help us with connectivity problems. And while we have many great partners, we’re often not 100% aligned - they want to sell hardware and network access; we want to provide the very best League experience possible. To them, 300ms latency was the same as 60ms - to us those numbers are worlds apart. This situation is how we ended up with NA game servers in Portland - our partners wanted us to leverage the great work being done by Facebook and Apple. That’s certainly a valid argument, but it wasn’t as focused on players as we endeavor to be. (I couldn’t be happier we remedied that situation with the server move to Chicago last year.)

In short, we really didn’t have the toolset to fix our networking issues by ourselves, and (as the adage says) if all you have is a hammer, everything looks like a nail. So we kept buying things that vendors told us would fix our problems, whether it be a new piece of hardware, or a new data center location. I think in part we got caught up in the challenges, and lost sight of what we value most: players.

Fortunately, somehow we found the space to take a step back, and the best parts of our tech culture kicked in. We quickly formed teams that understood the value we could deliver to players with the right approach. Over the past year we have improved the way players experience LoL and provided a much better level of stability for our developers. Riot now spends less on internet access and internet-facing networking than ever before, and we have better performance across the board.

To that end, I think three strategies we have employed are of particular note:

1. Bring technology-agnostic expertise in house
2. Create knowable and measurable networks
3. Utilize agile and lean development methodologies

The first point is probably the most difficult. Finding network engineering talent that isn’t preprogrammed to do things just one way is hard. During my time at Time Warner Cable, we built everything with Cisco products and life was great - until some things started breaking that Cisco couldn’t fix. (That’s not Cisco's fault; it’s just the nature of how the industry evolved. Vendors are motivated to capture 100% of your business.) We brought in Juniper routers, but we had problems getting them to work in our Cisco environment. We couldn’t even get them to bring up an OSPF session. The culprit was mismatched MTU values - OSPF advertises the interface MTU in its database description packets and won’t complete an adjacency when the two sides disagree, a problem that never arises if you work exclusively with one vendor. Some network engineers decided the situation proved Juniper’s equipment wasn’t good - but others knew better. The former gave up. The latter did the work to understand the differences, learn an entirely new configuration language (JUNOS versus IOS), and start thinking at a deeper level about the protocols involved.

The second point revolves around building transparency into the work of network engineers. To best focus on players, from the start Riot Direct’s mission included the ability to measure the performance of the technology we built. As we put load on the system, we could immediately show how it impacted the player experience. We could also measure our work in order to accurately forecast what kind of impact we could make on players’ behalf. Metrics like the number of OSPF sessions, the amount of data passing through an interface, or the number of processes per CPU are interesting, but they might not tell the whole story. To best understand the experience, we track latency, packet loss, and jitter as more impactful values.
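To make those three numbers concrete, here is a minimal sketch of how they can be computed from a series of round-trip probes. This is an illustration, not Riot’s tooling: the probe data is made up, lost probes are represented as None, and jitter is expressed as the mean difference between consecutive round-trip times (RFC 3550 uses a smoothed variant of the same idea).

```python
# Illustrative only: compute latency, jitter, and packet loss from a list of
# round-trip-time probes in milliseconds, where None marks a lost probe.
from statistics import mean

def summarize(rtts_ms):
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    latency = mean(received)
    # Jitter as the mean absolute difference between consecutive RTTs.
    jitter = mean(abs(a - b) for a, b in zip(received, received[1:]))
    return latency, jitter, loss_pct

# Hypothetical probe data: a mostly steady ~60 ms path with one spike and two drops.
probes = [61.2, 60.8, 62.1, None, 95.4, 61.0, 60.7, None, 61.3, 60.9]
latency, jitter, loss = summarize(probes)
print(f"latency {latency:.1f} ms, jitter {jitter:.1f} ms, loss {loss:.0f}%")
```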

The third point is that I believe it’s essential that we change the way we work as network engineers. In my experience, most networks are built slowly and with massive inertia - changing or pivoting them is extremely difficult. I believe we can change the product by changing how we develop it - Riot Direct works in two-week sprints that allow us to quickly change course and execute on our goals when appropriate.

Reducing the complexity of maintaining networks

While the last section dealt with improving how we work with the networks of today, this section addresses working with the networks of tomorrow. To spoil the ending: those networks won’t be built by vendors; they will be built by us, the networking community.

Open source has dramatically increased the velocity at which we develop software, and now that revolution is going to hit networking - and it’s overdue. Currently it can take years for features to be developed for networking equipment, which is still not nearly as robust as compute equipment. Consider a web server - I can easily set it up to stream data all day long. But if I ask a router one too many questions via SNMP, the entire box might reload. I’m not talking about aggressive polling - to stay safe I have to poll less often than once every 5 minutes, and 5 minutes is an eon in networking time.
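For a sense of what that kind of gentle polling looks like in practice, here is a minimal sketch that assumes the net-snmp snmpget CLI is installed and the device exposes the standard IF-MIB 64-bit input-octets counter; the host, community string, and interface index are hypothetical.

```python
# Illustrative only: poll one interface counter no more often than every five
# minutes and turn the counter delta into an average throughput figure.
# Ignores counter wrap and error handling for brevity.
import subprocess
import time

HOST = "192.0.2.1"             # hypothetical router
COMMUNITY = "public"           # hypothetical community string
IF_HC_IN_OCTETS = ".1.3.6.1.2.1.31.1.1.1.6.1"   # IF-MIB::ifHCInOctets.1
POLL_INTERVAL = 300            # seconds; gentle on the device's control plane

def get_counter():
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, IF_HC_IN_OCTETS],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

previous = get_counter()
while True:
    time.sleep(POLL_INTERVAL)
    current = get_counter()
    bits_per_second = (current - previous) * 8 / POLL_INTERVAL
    print(f"~{bits_per_second / 1e6:.1f} Mbit/s average over the last poll")
    previous = current
```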

Two concepts are helping bring routing into a place where we, the networking community, can actually tackle the issues above: DPDK and open source routing code. Today, using these two resources, we have built a router that can process 400M packets per second. That may not be huge in the service provider world, but it’s a start! And we believe that it is something that any company can do, at any size.
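The speed comes from DPDK (and that world lives in C), but the routing logic itself is not magic. As a toy illustration only - not our implementation - here is what a longest-prefix-match lookup against a software routing table can look like on ordinary compute, with made-up prefixes and next hops.

```python
# Toy illustration: a software routing table with longest-prefix-match lookup,
# built on Python's standard ipaddress module. Real fast paths use DPDK and
# purpose-built data structures; this only shows the lookup logic.
import ipaddress

class RoutingTable:
    def __init__(self):
        # prefix length -> {network: next hop}
        self.routes = {}

    def add_route(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        self.routes.setdefault(net.prefixlen, {})[net] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        # Longest-prefix match: try the most specific prefix lengths first.
        for plen in sorted(self.routes, reverse=True):
            for net, next_hop in self.routes[plen].items():
                if addr in net:
                    return net, next_hop
        return None, None

table = RoutingTable()
table.add_route("10.0.0.0/8", "core-1")        # hypothetical next hops
table.add_route("10.20.0.0/16", "game-pop-1")
print(table.lookup("10.20.5.7"))    # matches the more specific /16
print(table.lookup("10.99.1.1"))    # falls back to the /8
```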

This completely changes the game on our internal network. We can now use common compute to build routing tables, and instead of polling devices for information over SNMP, devices can live-stream health data to us with updates every millisecond. We could even develop our own internal routing protocol, or update an existing protocol without waiting on standards bodies. Similarly, we could start building systems that diagnose network problems as they are happening, not after the network has already failed. All of this changes the way we can understand and operationalize our networks. Below is output from our latency analysis tool - you can see several events that affect the player experience. In the future, we hope to mitigate such events automatically.

Latency analysis tool
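As a toy sketch of that “diagnose problems as they happen” idea (again, not Riot’s tooling - the sample source, window size, and threshold are all made up), here is one way to flag latency events against a rolling baseline from a stream of per-millisecond samples. A production system would be fed by streaming telemetry from the routers themselves.

```python
# Illustrative only: flag latency samples that jump well above a rolling baseline.
from collections import deque
from statistics import mean

WINDOW = 1000          # last second of per-millisecond samples
THRESHOLD_MS = 20      # flag anything this far above the rolling baseline

def detect_events(samples):
    """Yield (index, latency, baseline) for samples that look like events."""
    window = deque(maxlen=WINDOW)
    for i, latency in enumerate(samples):
        if len(window) == WINDOW:
            baseline = mean(window)
            if latency > baseline + THRESHOLD_MS:
                yield i, latency, baseline
        window.append(latency)

# Hypothetical stream: a steady 30 ms path with a brief spike to 80 ms.
stream = [30.0] * 2000 + [80.0] * 5 + [30.0] * 2000
for i, latency, baseline in detect_events(stream):
    print(f"sample {i}: {latency:.0f} ms vs ~{baseline:.0f} ms baseline")
```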