

Author: “No Bugs” Hare

Job Title: Sarcastic Architect

Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

Over the course of the last 10 years there has been a strong trend of more and more games becoming substantially networked. While adding network support to a game opens a whole world of challenges, my recent experience (both as a player and as a consultant) has shown that way too many game developers out there violate the very basic principles of a reasonably good network application. This can usually be observed as “frozen” UIs, unmotivated disconnects (when the rest of the Internet is accessible), sporadic crashes, and server overloads during peak time. The bad news is that these issues directly affect customer satisfaction (much more directly than management and graphics developers usually think). The good news is that dealing with all these issues is not rocket science, provided that network engine developers do have a clue of what they’re doing – and that they did read this very article 😉 .

In this article we’ll concentrate on certain aspects of network development which are apparently not so obvious to many game engine developers. If something I write here is obvious to you – my apologies, but please don’t hit me too hard for writing it. I assure you that for each of the items on the list (except for maybe one or two), there is a hugely popular game with a player base in the millions which violates this item. So, this article is intended as a list of advice which will help you to avoid the most annoying (and at the same time most popular) mistakes which developers make when it comes to implementing the network layer for highly interactive applications such as games or stock exchanges.

In Part I of this article, we will discuss issues which are common to client-side-app network development, regardless of the protocols used. Upcoming parts cover protocols and APIs, the server side, the TCP-vs-UDP debate, UDP, TCP, and security.

0. Scope

The subject of network support for game engines, taken as a whole, is very large. That’s why for the purposes of this article we limit the scope of our advice. More specifically:

We will concentrate on the games-which-have-client-app and won’t consider browser-/AJAX-based games; while many things are quite similar for app-based and browser-based games, they do have enough differences to consider them separately.

However, this article attempts to cover most of the other aspects related to the development of networking layer for games:

Surprisingly, MMORPGs, social games, casinos, and stock exchanges have quite a bit of similarities (see also Part IV. Great TCP vs UDP debate).

We also don’t restrict ourselves to one platform: in fact, we strongly advocate writing cross-platform engines, and this includes network engines. In practice, I’ve personally written a network engine which runs on 5+ quite different platforms (the list is available below under item #6).

While this article is written from a game engine writer point of view, we should note that rather often developers need to write their own engines for their own games. In such cases, most of the advice within this article is still applicable.

While the question “which of the existing engines/network engines is the best one” is outside of the scope of this article, the article is still expected to be useful in answering it; however, the answer depends on the specifics of your game, so you need to read the article and decide what is applicable to your case and what is not. In other words: if your game engine/framework provides a way to handle networking – you can use this article as a tool which allows you to take a look at that framework, and see if its network implementation makes sense for your specific game.

Now, with preliminaries out of the way, let’s go down to business:



1. DO use Event-Driven Programming Model on the Client Side

Most client-side UI frameworks out there have a so-called “main thread” (or “main loop” which implicitly runs within this “main thread”), and this “main thread” essentially just processes certain events (originally – UI events). This model applies through the whole spectrum of client-side frameworks, from Windows GUI, DirectX, and Cocoa, to Unity 3D, Android, and iOS. And there is a really good reason for it: otherwise programming becomes a nightmare. In fact, I know of only one framework which doesn’t work like this: it is Java’s original AWT, and programming an app in AWT was quite a well-recognized pain-in-the-ahem-neck (to the point that AWT never became popular; in particular, Google needed to develop a whole new GUI framework for Android).

How should this event-driven model change when we add networking to our executable/app? The answer is: “it shouldn’t”. All game network communications logically consist of messages sent and received; each received network message should be considered as yet another event for the game’s event-driven logic (alongside traditional UI events such as mouse clicks or keypresses). This usually can be implemented quite easily by injecting a message into the “message queue” of your main thread (for example, in Win32 it is done via PostMessage() or PostThreadMessage() ); if the graphics framework you’re using (such as Unity3D) doesn’t support this concept – you may need to simulate it via creating your own queue and polling it (see, for example, [Unity3D2012]). The question whether to pass events as data (as in Win32), or as callbacks (as in [Unity3D2012]) is not that important compared to the mandatory processing of all the events (which include both UI events and network messages) within one single thread. NB: if using Unity, this trick is rarely needed, as Unity’s own built-in networking (which already uses Unity’s event-processing thread) is usually good enough for “real-time world simulator” games; however, using Unity’s networking implies UDP transport, which, as we’ll see in Part IV, may or may not be the best thing depending on the game (especially if deviating from “real-time world simulator” stuff).

In some cases your event-processing thread might be different from your framework’s “main thread”, but what is important is to keep all the processing of your at-least-somehow-related events within one single thread. However, purely communication-related things (which are not related to the game logic at all), such as marshalling, en/decryption and (de)compression, can (and if possible – should) be done outside of the “main thread”; some further details of thread separation are discussed in the item #3 down the road.
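To make the “create your own queue and poll it” idea a bit more tangible, here is a minimal sketch in Python (chosen purely for brevity; the same structure applies in C/C++). All the names in it (`events`, `network_thread`, `poll_events`, `on_event`) are made up for illustration and do not come from any particular framework.

```python
import queue
import threading

# Hypothetical event queue shared between the network thread and the main thread.
events = queue.Queue()
main_ident = threading.get_ident()   # identity of the event-processing thread

def network_thread(incoming_messages):
    """Stands in for a thread which reads messages from a socket."""
    for msg in incoming_messages:
        # Never touch game state here: just hand the message over.
        events.put(("network", msg))

def poll_events(handler):
    """Called once per frame from the main thread; drains the queue without blocking."""
    count = 0
    while True:
        try:
            kind, payload = events.get_nowait()
        except queue.Empty:
            return count
        handler(kind, payload)
        count += 1

game_log = []

def on_event(kind, payload):
    # UI events and network messages are handled by the same code, in the same thread.
    game_log.append((kind, payload, threading.get_ident()))

t = threading.Thread(target=network_thread, args=(["hello", "world"],))
t.start()
t.join()                      # only to keep the sketch deterministic
events.put(("ui", "click"))   # UI events go through the very same queue
processed = poll_events(on_event)
```

The only object which both threads touch is the queue itself; everything which looks like game logic runs in whichever thread calls `poll_events()`.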



2. DON’T call Application Callbacks outside of Event-Processing Thread

When I was a young and relatively inexperienced developer bunny, I wrote a network framework for a stock exchange (don’t ask how it could have happened for an inexperienced developer to be responsible for such a supposedly Big Task – I have no idea myself). And I have to admit that despite it being reasonably good for a first attempt at writing a network library, I made one significant mistake there. I made the network framework’s own thread call a callback in the application layer (if I’m not mistaken, it was a callback in reply to my own sendMessageOverTheNetworkAndCallbackOnReply() -style function). This one measly callback has caused quite a lot of inconvenience for the fellow developers who used the framework. First, interactions (and potential races(!)) became quite difficult to understand for the fellow developers (for me everything was obvious, but it still was my problem, and for a reason: it was an avoidable problem which I had forced them to deal with). Second, it has caused quite a few difficult-to-track bugs and races. Eventually, it wasn’t too bad and the overall program worked really well, but development could have been much smoother than it was, if not for this single callback.

A few years later I was tasked with writing a network framework for quite a large multi-player game (half a million users simultaneously online and half-a-billion network messages per day is something I like to brag about 😉 ). This time I had learned my lesson, and avoided this kind of threaded callbacks. The whole thing worked like a charm (and was also much simpler to port to a multitude of platforms).

Bottom line: if you need to implement a callback from network layer to application layer, first pass your event to your event-processing thread (usually your ‘main’ thread), and then process your event within your network-layer library call originating from your event-processing thread, calling your application-level callbacks when necessary.

In other words, the following approach is good:

network thread –> inter-thread-communication –> event-processing thread –> network-library-call –> application-callback –> no-thread-sync-needed

and the following one is workable, but is not so good for the other developers in the long run:

network thread –> network-library-call –> application-callback –> thread-sync-required

For the “good” approach described above, the callback is always called in the context of the event-processing thread, which simplifies application development greatly. All application-level processing becomes strictly deterministic (which translates into “far fewer opportunities for races to arise”), without any thread synchronization necessary at application level. The wording above is admittedly bulky, and the approach may sound complicated, but it will save game developers a lot of trouble down the road.
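The “good” chain can be sketched roughly as follows. This is an illustration under assumptions, not the framework described in this item: `NetLibrary`, `send_request()` and `process_events()` are hypothetical names, and the “server reply” is faked by upper-casing the request.

```python
import queue
import threading

class NetLibrary:
    """Hypothetical network library; replies are dispatched only from process_events()."""

    def __init__(self):
        self._replies = queue.Queue()   # inter-thread communication
        self._callbacks = {}            # request id -> application callback
        self._next_id = 0

    def send_request(self, payload, callback):
        # Called from the event-processing thread.
        request_id = self._next_id
        self._next_id += 1
        self._callbacks[request_id] = callback
        worker = threading.Thread(target=self._network_thread,
                                  args=(request_id, payload))
        worker.start()
        return worker

    def _network_thread(self, request_id, payload):
        # Pretend a reply arrived from the server; do NOT call the callback here.
        self._replies.put((request_id, payload.upper()))

    def process_events(self):
        # Called from the event-processing thread; this is where callbacks fire.
        while True:
            try:
                request_id, reply = self._replies.get_nowait()
            except queue.Empty:
                return
            self._callbacks.pop(request_id)(reply)

event_thread_id = threading.get_ident()
seen = []
lib = NetLibrary()
worker = lib.send_request(
    "ping", lambda reply: seen.append((reply, threading.get_ident())))
worker.join()          # only to keep the sketch deterministic
lib.process_events()   # the application callback runs here, in the calling thread
```

Note that `_callbacks` is touched only from the event-processing thread, so it needs no lock; the queue is the only thing shared with the network thread.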



3. DON’T call potentially Blocking Network Functions from Event-Processing Thread

This is one of the most annoying single fallacies a network developer can commit. As noted above, you SHOULD have your events processed within one single thread. This is fine and convenient; however, calling an innocent-looking gethostbyname() within one of the event handlers (which are usually implicitly called from within the event-processing thread) will usually work without any apparent problems in your office environment, but in some cases for some real-world users it can block for minutes (!). If you called such a function from your GUI thread, it usually means that for the user the GUI looks “frozen”/“hung” for all the time while the function is blocked, which is a Big No-No from the user experience’s point of view.

The proper way of doing network interaction with GUI is to have all the network function calls either as non-blocking, or in a separate thread(s). In this case, you’ll need to make your event state machine more complicated (you’ll effectively get states such as “waiting for DNS resolution”), but at the same time it will allow you to avoid a “frozen” GUI (which is a Good Thing per se), and will additionally allow you to handle networking delays, including:

notifying the user when it is appropriate. For example, whenever after a second or five of extra waiting you know that there is a problem, the user usually knows it too, so it is better to let her know that you’re aware of the problem and working on it

aborting the operation and initiating a retry when necessary (it is related, for example, to application-level keep-alive discussed in item #46 in Part VI)

allowing the user to terminate the request/application gracefully (instead of forcing her to resort to using task manager)

It should be noted that while this item may seem to contradict items #1 and #2 above, it does not. To the question: “hey, so should I do it single-threaded or multi-threaded?” the answer is: “system-level network calls SHOULD be either non-blocking, or from a non-event-processing thread; at the same time, all event processing SHOULD be within the event-processing thread”. It means that if using threads, you should call something like blocking recv() in a non-event-processing network-processing thread, convert the result of this call to an event, and pass this event to the event-processing thread via some kind of queue (see item #1 above for details). Stuff such as decryption/decompression may be, strictly speaking, processed in either of these two threads, though to keep the event-processing thread from becoming a bottleneck, it is usually better to leave the encryption/compression to the network-processing threads.
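As a rough illustration of the threaded variant, the sketch below runs a potentially blocking lookup in a worker thread and turns its result into an event for a tiny state machine with a “waiting for DNS resolution” state. The class and state names are hypothetical; the `resolver` parameter is an assumption added so that the example can run without touching the real DNS (by default it is Python’s `socket.gethostbyname`, and `192.0.2.10` is just a documentation address).

```python
import queue
import socket
import threading

def resolve_in_background(hostname, events, resolver=socket.gethostbyname):
    """Run a potentially blocking lookup off the event-processing thread."""
    def worker():
        try:
            events.put(("dns_ok", hostname, resolver(hostname)))
        except OSError as exc:          # socket.gaierror is an OSError
            events.put(("dns_failed", hostname, exc))
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread

class Connection:
    """Tiny state machine which lives in the event-processing thread."""

    def __init__(self):
        self.state = "idle"
        self.address = None

    def start(self, hostname, events, resolver=socket.gethostbyname):
        self.state = "waiting_for_dns"   # the GUI stays responsive in this state
        return resolve_in_background(hostname, events, resolver)

    def on_event(self, event):
        kind, _hostname, value = event
        if self.state == "waiting_for_dns" and kind == "dns_ok":
            self.address, self.state = value, "ready_to_connect"
        elif self.state == "waiting_for_dns" and kind == "dns_failed":
            self.state = "dns_error"

events = queue.Queue()
conn = Connection()
worker = conn.start("game.example.com", events,
                    resolver=lambda host: "192.0.2.10")  # fake resolver for the sketch
worker.join()                        # only to keep the sketch deterministic
conn.on_event(events.get_nowait())   # a real main loop would poll once per frame
```

While the connection sits in `waiting_for_dns`, the event-processing thread remains free to show a notification, start a timeout, or honour a cancel request from the user.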

An alternative to the network threads is non-blocking IO. There are quite a few caveats (including that gethostbyname() and getaddrinfo() don’t have a non-blocking counterpart at least on one major platform), and in general, I am not sure that going non-blocking is worth the trouble for the client side (server side is a different story, which will be described in Part III of the article).



4. DON’T use User as a Freebie Error Handler

There are some developers out there who’re using a very simplistic (and I’d say sadistic) approach to handling network errors. Namely, they just throw the error in the face of the user and say something like “There is a little problem on the server. Please retry”. This is terribly annoying and serves no purpose (except to make the life of the developer a little bit easier at the expense of the user). There is absolutely no reason (except for developers being lazy) not to handle network errors internally, and not to retry automatically. Some kind of a notification to the user that there are some problems should be made, but it should not require any user input. To implement such a notification, you can either show a message in some prominent area on the screen, or make a dialog box (without an ‘ok’ button, just with a ‘cancel’ button(!)), which will automatically disappear when you deal with the problem (yes, if the user was looking away while you fixed the problem, and you were able to recover from it meanwhile, there is no reason to bother him with your problems).

A note to those who will argue that relying on user reduces network congestion: being a strong advocate of the position that it is Internet which should serve users’ needs, and not the other way around, I am sure that it is our responsibility as developers, to make life of the users easier. And while I agree that congestion control is important, the needs of the end-users should still come first. On the other hand, to reduce network hit in case when user doesn’t really need it – a reasonable timeout (for example, a few minutes – which is often enough for user to get frustrated and go away from computer) to stop retrying and display “Sorry, we tried hard but cannot do anything at the moment” would be a good thing.

Oh, and to comply with the items #1-3 above – you generally should detect a network problem in your network-processing thread, convert it to an event, and process the event in your event-processing thread (for example, by showing a dialog box).
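A minimal sketch of such automatic retrying might look as follows. Everything here is an assumption made for illustration: the function and parameter names, the status strings, and in particular the exponential backoff between attempts (the text above only asks for automatic retries, a non-modal notification, and a timeout of a few minutes). In a real client this loop would live in the network-processing thread, and `notify` would post an event to the event-processing thread rather than touch the GUI directly.

```python
import time

def retry_with_deadline(attempt, notify, give_up_after=180.0, first_delay=1.0,
                        max_delay=30.0, clock=time.monotonic, sleep=time.sleep):
    """Retry `attempt` until it succeeds or `give_up_after` seconds would be exceeded.

    `attempt` returns normally on success and raises OSError on a network failure.
    `notify` receives status strings for a non-modal indicator.
    """
    started = clock()
    delay = first_delay
    notified = False
    while True:
        try:
            attempt()
            if notified:
                notify("recovered")      # the indicator can now disappear on its own
            return True
        except OSError:
            if not notified:
                notify("reconnecting")   # shown once; no input required from the user
                notified = True
        if clock() - started + delay > give_up_after:
            notify("gave_up")            # "Sorry, we tried hard but cannot do anything"
            return False
        sleep(delay)
        delay = min(delay * 2, max_delay)  # backoff is an assumption of this sketch

class FakeClock:
    """Deterministic stand-in for time.monotonic/time.sleep, used only by the demo."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

outcomes = [OSError("refused"), OSError("reset"), None]  # fail twice, then succeed

def flaky_connect():
    outcome = outcomes.pop(0)
    if outcome is not None:
        raise outcome

def always_down():
    raise OSError("unreachable")

fake = FakeClock()
status = []
ok = retry_with_deadline(flaky_connect, status.append, clock=fake, sleep=fake.sleep)

fake2 = FakeClock()
status2 = []
ok2 = retry_with_deadline(always_down, status2.append, give_up_after=10.0,
                          clock=fake2, sleep=fake2.sleep)
```

The user sees at most one “reconnecting” indication per outage and is never asked to press anything; only when the deadline is exhausted does the client admit defeat.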

5. DO provide Error Messages which are Meaningful for User

From the end-user’s perspective, there is no real difference between “Network not reachable”, “Connection refused”, and “Connection has been terminated”; if you could – you MIGHT want to tell him that his network cable is unplugged, or your server is down, or there is something in between, but cluttering his space with meaningful-for-you-but-meaningless-for-him details such as above – is generally a Bad Idea. Even worse is to hide these technical details behind messages which try to “translate” it into such things as “server is having a little problem” and “you have lost connection”, while at the same time having more than one such message (even worse is having different messages with differently looking UIs).

By all means, do “translate” error messages from your space into the end-user-space, but as long as the error is indistinguishable from the end-user’s point of view – make the error message look the same (with a non-intrusive display of error code or ‘more info’ button to make life of support people easier).
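One possible way to keep such “translation” consistent is a single function through which every network error passes on its way to the UI. The sketch below is only an assumption about how it might look: the message text and the names `CONNECTION_PROBLEM`, `USER_MESSAGES` and `user_message()` are invented for illustration.

```python
import errno

# Technical errors which the user cannot tell apart are mapped to ONE message.
CONNECTION_PROBLEM = "Connection problem. Trying to reconnect..."

USER_MESSAGES = {
    errno.ENETUNREACH: CONNECTION_PROBLEM,    # "Network not reachable"
    errno.ECONNREFUSED: CONNECTION_PROBLEM,   # "Connection refused"
    errno.ECONNRESET: CONNECTION_PROBLEM,     # "Connection has been terminated"
}

def user_message(error):
    """Return (text for the user, small print for support) for an OSError."""
    text = USER_MESSAGES.get(error.errno, CONNECTION_PROBLEM)
    code = errno.errorcode.get(error.errno, "UNKNOWN")
    return text, "code: " + code

refused = user_message(OSError(errno.ECONNREFUSED, "connection refused"))
unreachable = user_message(OSError(errno.ENETUNREACH, "network unreachable"))
```

Both calls produce the same user-facing text, while the second element of the tuple stays different and can be shown in small print or behind a ‘more info’ button for the support people.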



6. DO support Multiple Platforms

Given the modern gaming landscape, single-platform game engines are not generally attractive. Even if your engine is intended for one single app, are you sure that it won’t be ported to any other platform, never ever? If you’re sure, you shouldn’t be.

In practice, making networking code cross-platform is much less of a problem than for graphical stuff (that is, unless you’re so crazy about one specific technology that you ignore everything else out there – which is a Really Bad Thing), so there are no real reasons to make your network layer single-platform (unless your whole game engine is already single-platform).

For reference – I’ve personally seen my own networking library working pretty much without changes on Windows, Linux, Mac OS X, FreeBSD, iOS, and even Android (on the last one from within NDK). Heck, it was even ported to Java in a line-to-line manner, but this is a different story.

6a. DO use Berkeley Sockets on the Client Side

Berkeley sockets: “... is a computing library with an API for internet sockets and Unix domain sockets, used for inter-process communication (IPC).” (Wikipedia)

If you’re implementing your network engine in C/C++ and think of your application as “Windows-only”, it may be tempting to use Windows-specific functions (those with WSA*() prefix) for communications. Don’t do it, use Berkeley sockets instead (those socket() / connect() / send() / recv() functions; for details on their usage – Google it, for further details – refer to [Stevens]).

For other programming languages, which provide their own cross-platform APIs, choosing a portable network library is usually much less of a problem.
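For a feel of the socket() / connect() / send() / recv() sequence, here is a small loopback example using Python’s `socket` module, which is a thin wrapper over the platform’s Berkeley-style sockets API; the calls map closely onto the C functions of the same names. The one-shot echo server exists only so that the sketch has something to talk to, and the blocking calls are made from the main thread here purely for brevity (see item #3 for why a real client shouldn’t do this).

```python
import socket
import threading

def echo_once(server_sock):
    """Accept one connection and echo back whatever arrives first."""
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# Server side: socket() / bind() / listen() on a loopback port picked by the OS.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_once, args=(server,), daemon=True).start()

# Client side: socket() / connect() / send() / recv().
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5.0)                 # never wait forever in a sketch
client.connect(("127.0.0.1", port))
client.sendall(b"hello")

reply = b""
while len(reply) < len(b"hello"):      # TCP is a stream: recv() may return less
    chunk = client.recv(1024)
    if not chunk:                      # peer closed the connection
        break
    reply += chunk

client.close()
server.close()
```

Nothing in this sequence is specific to one operating system, which is exactly the reason to prefer it over the WSA*() family.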

7. DO consider providing a way to Auto-Update App

Usually, automated updating is not considered a part of game engine. However, IMHO there is a good case for it to be included into the network layer. The reasoning behind is the following:

users MAY want to get some optional stuff (from themes to DLCs)

they WILL appreciate if they can download it while playing

they WON’T like if downloads are interfering with gameplay

by keeping optional downloads within your own network layer, in certain cases you CAN prioritize traffic and minimize impact of downloads on the game (some relevant tricks will be discussed in item #17 of Part IIb)

as QoS doesn’t work on the Internet (see item #17b of Part IIb), two parallel connections are much more likely to interfere with each other

if you support optional downloads – they need to be auto-updated too, so integration of optional downloads and auto-updates is a good thing to have

therefore, the whole auto-update stuff is better implemented as a part of the networking engine

as an additional benefit, you’ll be able to download auto-updates while users are playing, maximizing their playing time

The reasoning above is very far from being absolute, but – such a system has been implemented and I’ve seen it to work extremely well.
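One simple way to picture “prioritizing traffic within your own network layer” is a single outgoing queue in which game messages always go ahead of download chunks. This is merely a sketch of the general idea under my own assumptions, not necessarily one of the tricks from Part IIb; `OutgoingQueue`, `GAME` and `DOWNLOAD` are hypothetical names.

```python
import heapq
import itertools

GAME, DOWNLOAD = 0, 1      # lower number = higher priority (hypothetical constants)

class OutgoingQueue:
    """Single outgoing queue in which game messages jump ahead of download chunks."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # keeps FIFO order within one priority

    def push(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._order), message))

    def pop(self):
        """Return the next message to send, or None if there is nothing to send."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

out = OutgoingQueue()
out.push(DOWNLOAD, b"chunk-1")
out.push(DOWNLOAD, b"chunk-2")
out.push(GAME, b"player-move")
sent = [out.pop(), out.pop(), out.pop()]
```

In this sketch the game message is sent first even though it was queued last. For this to help in practice, download chunks have to be kept small, so that a game message never waits behind one large chunk which is already being written to the socket.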

One note of caution: despite integration with the network library, you SHOULD implement the initial auto-update (the one which is launched before the game app starts) over HTTP (and not over your own protocol); this doesn’t add too much to the complexity, but does allow you to change your network protocols drastically.

The rest of the auto-update topic is quite complicated, so I will probably write a separate article on it.

To Be Continued…

To avoid getting hit by a TL;DR syndrome, the article has been split into several parts. Stay tuned for Part II. Protocols and APIs.

EDIT: The series has been completed, with the following parts published:

Part IIa. Protocols and APIs

Part IIb. Protocols and APIs (continued)

Part IIIa. Server-Side (Store-Process-and-Forward Architecture)

Part IIIb. Server-Side (deployment, optimizations, and testing)

Part IV. Great TCP-vs-UDP Debate

Part V. UDP

Part VI. TCP

Part VIIa. Security (TLS/SSL)

Part VIIb. Security (concluded)

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.