

Author: “No Bugs” Hare

Job Title: Sarcastic Architect

Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

I hate Mondays — Garfield the Cat —

We continue our series of posts on network development for game engines.

In today’s Part IIIb, we’ll concentrate on those server-side issues which were not addressed in Part IIIa. Mostly, our discussion will be related to deployment (and front-end servers), optimizations, and testing.

Upcoming parts include:

Part IV. Great TCP-vs-UDP Debate

Part V. UDP

Part VI. TCP

Part VIIa. Security (TLS/SSL)

Part VIIb. Security (concluded)



20. DO support “front-end” easily replaceable servers

If you’ve followed the advice in item #16 of Part IIb, you have implemented those publisher/subscriber interfaces. And if you’ve followed the advice in item #14, you have your own addressing schema. Now it’s time to reap the benefits. These two protocol-level decisions open up an opportunity to implement one wonderful thing – “front-end servers”. The idea is to support deploying a gaming system as follows:

a few game servers -> a few dozen front-end servers -> a million end-users

This approach (as has been observed in practice) does help to take the load off the game servers quite significantly. The thing is that handling TCP connections is quite a burden. And if you need to support any kind of encryption, it becomes much worse (especially if we’re speaking of TLS, as public-key crypto is one hell of a CPU-cycle-eater, especially at the moment when your million users reconnect after BGP convergence or something). And if we have this kind of burden that we can offload to cheap-and-easily-replaceable servers, why not do it?

In such deployments, game-servers and front-end servers are very different:

| Game Servers | Front-End Servers |
| --- | --- |
| Process game logic | Process only connections, no game logic is processed here |
| Do carry game state | Don’t carry any game state |
| Mission-critical (i.e. failure causes lots of trouble) | Easily replaceable on the fly (at the cost of automated reconnect of the players) |
| Expensive (such as a 4-socket $50K box and up) | Cheap (such as a 2-socket $10K box and down) |

If you try to serve half a million players directly from 5 game servers, it will cost you a whole damn lot to buy such servers (especially if each game server needs to talk to all of the half a million users). Serving the same half a million from 5 or so game servers + 15 front-end servers in the architecture above is easy and relatively cheap; I’ve seen it myself.

Ideally, each player should be served by only one front-end server at any given time, even if the player interacts with multiple game servers. This single-client-single-front-end-server approach tends to help in quite a few ways, usually reducing overall traffic (due to better compression and less overhead) and overall server load. It plays well with item #47 (single-TCP-connection) from Part VI, but doesn’t really require it, and can be implemented for both TCP- and UDP-based communications.

20a. CDN-like and Cloud Deployments

“CDN is a large distributed system of servers deployed in multiple data centers across the Internet.” (Wikipedia)

One additional benefit of “front-end server” deployment architectures is that they allow CDN-like deployments, where your game servers sit in one central datacenter, and front-end servers are distributed over several datacenters all over the world. If you can afford a good connection (such as frame relay with a good SLA) between your datacenters, you can improve latencies for your distant users and make the game significantly fairer. While this is not a common option because of the associated costs (though they can be as low as $20K/month), for one special kind of games known as “stock exchanges” it is an option to be taken into account.

As an alternative to (or together with) CDN-like deployments, you can deploy such an architecture into the cloud, keeping in mind that SLAs for the front-end cloud servers can be significantly worse (and therefore significantly cheaper) than SLAs for game servers.

21. DO start with Berkeley-socket-based version, improve later

For servers, there are tons of different technologies which work with sockets: WaitForMultipleObjects(), completion ports and APC on Windows, poll/epoll on Linux, etc. It might be very tempting to say “hey, everything except for <your-favorite-technology-here> sucks, let’s do the latest greatest thing based on <this-technology>”. However, way too often it represents a bad case of premature optimization. In the experiments which I’ve observed, a good Berkeley-socket-based implementation (the one with multiple sockets per thread, see item #21a below) was about on par (within the margin of error) with Completion-Port- and APC-based ones. On the other hand, using shared memory for same-machine communications has been observed to provide a 20% performance improvement (which is not much to start with, but at least makes some sense). While your mileage may certainly vary, the point here is that it is usually not a good idea to start with a network engine optimized for a very specific API (in the worst case, affecting your own APIs, which would prevent you from changing the underlying technology, and from cross-porting your library in the future).

I’m not claiming that all these “newer and better” technologies are useless (in particular, things may be substantially different for serving large files, or it may be that we’ve implemented them poorly for our experiments); what I’m saying is that it is very unlikely that starting with a good Berkeley-socket-based implementation will make a “life-or-death” difference for your project, and that if you have a Berkeley-socket-based implementation, you’ll be able to change it to <whatever-technology-works-best-for-you> later, without changing all the game logic around it. Also, a Berkeley-socket implementation has another advantage: it will run both on Windows and Linux, which may come in handy (see also item #22 below).



21a. Optimization: DO use multiple-sockets-per-thread

The very first thing on my list of optimizations is making sure that you can support multiple sockets per networking thread. Context switches are damn expensive (if we take cache poisoning into account, the cost of a context switch is of the order of 10,000 clock cycles (!)), and if you have a single socket per thread with typical gaming traffic, your program will spend much more time on switching back and forth between threads than on actual work within those threads. The exact effect of switching from single-socket-per-thread to multiple-sockets-per-thread heavily depends on the game specifics, but I’ve seen around 2x improvement, which is by far the largest effect I’ve observed from any optimization at this level.

This optimization works extremely well with those front-end servers (see item #20 above), especially in gaming environments. This is because for games, per-socket traffic usually consists of small and very sparse packets, so with one-socket-per-thread, threads are heavily underutilized, causing too many of those extremely expensive context switches. Multiple-sockets-per-thread helps to mitigate this problem.

“Packet Delay Variation is the difference in end-to-end one-way delay between selected packets in a flow with any lost packets being ignored. The effect is sometimes referred to as jitter, although the definition is an imprecise fit.” (Wikipedia)

I’ve heard arguments that multiple-sockets-per-thread introduces unnecessary latency and is therefore unfair to players. Let’s take a closer look at this potential issue. First of all, even if multiple-sockets-per-thread does introduce observable latency, it is still fair as long as the distribution of players to threads is random (which it usually is, as long as you’re not giving anybody preferential treatment). Second, as long as the number of sockets per thread is small (such as 32-64 or so – this is enough to get the optimum speedup), and the number of threads is still much larger than the number of cores, latency patterns will most likely be indistinguishable from the single-thread-per-user model.

What we can say latency-wise about multiple-sockets-per-thread (assuming that there is always a free CPU to run the thread when the packet comes in) is that it exhibits an occasional single-digit-microsecond-range delay (due to queueing within the thread) and an occasional microsecond-range speedup (due to the lack of a context switch). Given that typical over-the-Internet jitter is at least of the order of single-digit milliseconds (which is orders of magnitude larger than those microseconds), the effect of those occasional latency changes will be lost beyond detection in the much larger jitter noise. Moreover, for most games, microsecond-range deviations, even if theoretically detectable, are well within the acceptable range (as long as it is fair).

If you’re satisfied that this issue is out of the way for your specific game, let’s see what we have implementation-wise. The good news in this department is that multiple-sockets-per-thread can be easily supported by a Berkeley-socket-based implementation (via non-blocking sockets). Even more good news is that as long as you have a strict Berkeley-socket implementation (using select()), changing it to poll()/epoll()/WaitForMultipleObjects() is next-to-trivial, so you can easily try any of them as soon as you’ve got a basic select()-based one working. A bit of bad news is that it still might be a premature optimization (or it might not be).

Oh, and if you need security/TLS: OpenSSL can be made to work in a multiple-sockets-per-thread environment (via non-blocking BIOs), though it is quite a bit of work.

Bottom line: multiple-sockets-per-thread is the only optimization which I’d consider for your first implementation. However, if you do need to support OpenSSL and/or non-trivial over-the-network protocols – it might not be worth the trouble; in such cases it might be better to start with classical single-socket-per-thread model and improve it later.
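To make the idea a bit more concrete, here is a minimal sketch of one networking thread serving several non-blocking sockets. It is only an illustration under stated assumptions: it targets a POSIX system and uses poll() rather than select() (on Windows you would use select() or WSAPoll() instead), and the names MultiSocketLoop, OnData, add(), and run_once() are made up for this example rather than taken from any library.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <functional>
#include <poll.h>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <utility>
#include <vector>

// Hypothetical helper for illustration: one thread, many sockets.
class MultiSocketLoop {
public:
    using OnData = std::function<void(int fd, const std::string& bytes)>;

    explicit MultiSocketLoop(OnData on_data) : on_data_(std::move(on_data)) {}

    // Takes ownership of a connected socket and makes it non-blocking.
    void add(int fd) {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
        pollfd p{};
        p.fd = fd;
        p.events = POLLIN;
        fds_.push_back(p);
    }

    std::size_t size() const { return fds_.size(); }

    // One loop iteration: wait up to timeout_ms, then drain every socket
    // that reported an event. Returns poll()'s result
    // (0 = timeout, -1 = error, otherwise the number of sockets with events).
    int run_once(int timeout_ms) {
        int ready = poll(fds_.data(), static_cast<nfds_t>(fds_.size()), timeout_ms);
        if (ready <= 0) return ready;
        std::size_t i = 0;
        while (i < fds_.size()) {
            bool closed = (fds_[i].revents != 0) && drain(fds_[i].fd);
            if (closed) {
                close(fds_[i].fd);
                fds_[i] = fds_.back();  // swap-remove; the moved entry is checked next
                fds_.pop_back();
            } else {
                ++i;
            }
        }
        return ready;
    }

private:
    // Reads until the socket would block.
    // Returns true if the peer closed the connection or a hard error occurred.
    bool drain(int fd) {
        char buf[4096];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n > 0) {
                on_data_(fd, std::string(buf, static_cast<std::size_t>(n)));
                continue;
            }
            if (n == 0) return true;        // orderly shutdown by the peer
            if (errno == EINTR) continue;   // interrupted, retry
            return !(errno == EAGAIN || errno == EWOULDBLOCK);
        }
    }

    OnData on_data_;
    std::vector<pollfd> fds_;
};
```

In a real server, the thread would call run_once() in a loop and the callback would only parse the bytes and enqueue a message for the game logic; with 32-64 sockets per such loop you get the layout discussed above.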

21b. Optimization: DO consider shared-memory for intra-server communications

In my experience, when going along the “start with a socket implementation, improve later” route, one of the most significant improvements comes from re-implementing intra-server communications with shared memory (the overall gain was around 20%). It is not too much (and YMMV for different systems), and you can live without it for a while, but in a hunt for the absolute-best performance, it is extremely difficult to beat shared memory.

21c. Optimization: DO consider non-blocking and half-blocking queues

If you’re following the advice from item #19 in Part IIIa and building your system around a Store-Process-and-Forward architecture, you have quite a few queues (FIFO-style ones) in your system. At a certain point (when you’ve already optimized your engine enough) you’re likely to find that those queues start to take significant time (due to waits on the queues and the associated context switches).

The solution is to avoid those expensive context switches by using non-blocking data structures (based on Compare-and-Swap, a.k.a. CAS, operations, which correspond to atomics in C++11). Such “lock-free” structures are available as a Windows API (Interlocked SList [MSDN]) or as the boost::lockfree:: family. Both do as advertised – they provide fully lock-free data structures.

However, what we often need for inter-thread communication is not exactly a fully non-blocking queue, but a queue where the writer never needs to block, while the reader has an option to block on the queue (ideally, only if the queue is empty).

This kind of “no-writer-blocks” blocking queue can be implemented using non-blocking primitives such as those mentioned above (NB: the code below is based on Windows and the Interlocked*() functions, but a similar thing can be implemented using boost::lockfree stuff):

    #include <windows.h>

    class NoWriterBlockQueue {
      PSLIST_HEADER slist; // must be allocated with MEMORY_ALLOCATION_ALIGNMENT
                           //   and initialized with InitializeSListHead()
      HANDLE event;

      public:
      //constructor goes here
      //  event is initialized as an auto-reset event

      void noblock_push(PSLIST_ENTRY item) {
        //having PSLIST_ENTRY in queue API is ugly,
        //  done here only to demonstrate the idea,
        //  SLIST_ENTRY SHOULD be wrapped for any practical implementation
        PSLIST_ENTRY prev = InterlockedPushEntrySList(slist,item);
        if(prev==NULL) //list was empty, the reader may be waiting
          SetEvent(event);//not exactly optimal
                          //  but still more or less non-blocking
      }

      PSLIST_ENTRY noblock_pop() {
        //returns a list of data to be processed (NULL if there is none);
        //  SList is a stack, so the most recently pushed entry comes first
        //  and the reader needs to reverse the list to get FIFO order
        return InterlockedFlushSList(slist);
      }

      PSLIST_ENTRY wait_for_pop() {
        //returns a list of data to be processed (same order as noblock_pop())
        for(;;) {
          PSLIST_ENTRY dequeued = InterlockedFlushSList(slist);
          if(dequeued!=NULL)
            return dequeued;
          WaitForSingleObject(event,INFINITE);
          //spurious returns (with slist empty)
          //  are possible but harmless
        }
      }
    };

While the implementation above is not exactly optimal (it causes not-strictly-necessary SetEvent() calls and spurious wake-ups from WaitForSingleObject()), it might be a reasonably good starting point for further optimizations in this field.
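For those who are not on Windows, the same idea can be sketched with nothing but C++11 atomics. This is my own illustrative translation, not the implementation discussed above: it assumes std::condition_variable as a stand-in for the auto-reset event, which means the writer takes a very short mutex lock, but only when the queue goes from empty to non-empty (comparable in spirit to the not-exactly-optimal SetEvent() call). The template and its internals are hypothetical names; the method names mirror the Windows version.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative sketch: writers push with a CAS loop, the reader grabs the
// whole list with a single atomic exchange.
template <class T>
class NoWriterBlockQueue {
    struct Node {
        T value;
        Node* next;
    };

    std::atomic<Node*> head_{nullptr};
    std::mutex m_;
    std::condition_variable cv_;

public:
    NoWriterBlockQueue() = default;
    NoWriterBlockQueue(const NoWriterBlockQueue&) = delete;
    NoWriterBlockQueue& operator=(const NoWriterBlockQueue&) = delete;
    ~NoWriterBlockQueue() { noblock_pop(); }  // frees any leftover nodes

    void noblock_push(T v) {
        Node* n = new Node{std::move(v), head_.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak reloads the current head into n->next.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
        if (n->next == nullptr) {
            // Queue was empty, so the reader may be asleep. Taking the mutex
            // before notifying prevents a lost wake-up.
            { std::lock_guard<std::mutex> lk(m_); }
            cv_.notify_one();
        }
    }

    // Returns everything queued so far in FIFO order (possibly nothing).
    std::vector<T> noblock_pop() {
        Node* n = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<T> out;
        while (n != nullptr) {
            out.push_back(std::move(n->value));
            Node* dead = n;
            n = n->next;
            delete dead;
        }
        std::reverse(out.begin(), out.end());  // the list is LIFO, callers want FIFO
        return out;
    }

    // Blocks only while the queue is empty.
    std::vector<T> wait_for_pop() {
        for (;;) {
            std::vector<T> out = noblock_pop();
            if (!out.empty()) return out;
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] {
                return head_.load(std::memory_order_acquire) != nullptr;
            });
        }
    }
};
```

A typical use would be one or more writer threads calling noblock_push() and a single processing thread looping over wait_for_pop(); because the reader always takes the whole list at once, the classic ABA problem of lock-free stacks does not arise in this particular sketch.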

21d. Optimization: DO experiment further

The list of optimizations above is certainly not exhaustive; there can be many more things which you’ll be able to optimize. In particular, I don’t want to discourage you from trying all those latest-and-greatest I/O technologies (such as completion ports/APC/whatever else). The key, however, is to have a working and stable system first, and to improve it later.



22. DO consider Linux for servers

Even if your game engine is Windows-only, you should consider making your server side support Linux. On the one hand, Windows (contrary to popular belief) is capable of running huge loads (such as hundreds of millions of messages per day) for months without a reboot, so it is not that bad. On the other hand, contrary to another popular belief, protecting Windows from attacks is still significantly more difficult than protecting Linux. In particular, for a non-money-related game I would even be ready to commit the ultimate security fallacy – to run it on a Linux server wide-open to the Internet without a hardware firewall; for Windows I would still consider it suicide.

One more reason which makes Linux somewhat better suited for production servers than Windows is the ability to run tcpdump (without installing anything and at almost-zero performance cost) right there on the server, and to analyze the results offline (see item #25 below for further details). On Windows, to achieve the same thing, you’d need to install 3rd-party software (such as Wireshark) on your production server, and the less 3rd-party stuff you have on the server, the better (both from the reliability and from the performance point of view).

22a. DO fight dependencies, especially on Windows

When I said in the previous item that Windows is capable of running huge loads, I meant it, but there is a caveat. Windows is indeed capable of running huge loads reliably, but only as long as you severely curtail your dependencies.

The fewer services you need to run on your production server, the better; the fewer DLLs/.so’s you need to link, the better; and so on. This applies to any platform, and helps to improve both reliability and security, for Windows and for Linux alike. However, for Windows programs, there are usually many more dependencies than for Linux programs (which means that the Windows team has done a good job promoting vendor lock-in), so dependencies tend to be much more of a problem for Windows-based programs. In any case, you should keep a list of all your dependencies, and each and every new dependency needs to be discussed before being accepted.

A tiny real-world story about dependencies: the best thing I’ve ever seen in this regard was a server-side Windows process which directly linked exactly one DLL – kernel32.dll (which, of course, indirectly linked ntdll.dll). The process used shared memory to communicate with the other processes on the same machine, and the whole thing was able to run for months without any issues. While having even fewer dependencies is certainly not an option, I’m usually trying at least to come as close to this “absolute dependency minimum” as possible; it tends to help quite a bit in the long run.

23. DO implement application-level balancing

When you have those front-end servers (see item #20 above), there is a question of “how different clients should reach those different front-end servers?” There are three answers to this question – two are classical ones, and one is unorthodox but quite handy for our specific task of “game-with-our-own-client”.

The first classical answer to the balancing problem is “use hardware load balancer”. This is a box (and a very expensive one) which sits in front of your front-end servers and balances the load between them. This is what your network admins will push you (and very hard at that) to do. It might even work for you, but there are inherent problems with this approach:

each such balancer needs to handle all your traffic (sic!); as it is inherently overloaded, it might start dropping packets, which in turn will degrade your player experience

these boxes are damn expensive

such a box is either a single point of failure, or redundancy will make it even more expensive; moreover, redundancy implementations of the balancers have been observed to be a source of failures themselves

it is yet another box which might go wrong (either hardware may fail, or it may get misconfigured)

it can’t possibly work to balance across different locations

On the positive side, the only thing I can see is that hardware balancers can, at least in theory, provide better balancing in cases when a single player can eat up 10+% of a single server’s load (which I don’t see happening in practice, but if your game requires it, by all means take it into account).

As you can see, I’m not exactly an avid fan of hardware load balancers. This is not to say that they’re useless in general – there are cases when you don’t have any better options. One big example is when you need to balance web traffic, and then it is either DNS round robin described below, or hardware load balancers, with both options being rather ugly. Fortunately, for our game-with-client purposes we don’t need to decide which of them is worse (see ‘unorthodox-but-my-favorite approach’ below).

The second classical solution for the balancing problem is “DNS round robin”. It is hacking your own DNS to return different addresses to different clients in a random manner. Unfortunately, this approach has One Big Fat Problem: it doesn’t provide for easy failover, so if one of your round-robin servers goes dead, some of your clients won’t be able to play until you fix the problem. Ouch. Certainly not recommended.

An unorthodox-but-my-favorite approach to the balancing problem is to have balancing performed by the client (it is our own client anyway). The idea is to have a list of servers embedded into the client (probably in some config file), and to try them in a random manner until a good one is found. It addresses most of the problems with the two approaches above, handles all kinds of failures in front-end servers in a very simple and natural way (the only thing you need to do in your client is to detect that the connection has failed, and try another random server from the list), and in practice achieves almost-perfect balancing. While 99% of network engineers out there will tell you that application-level balancing is a Bad Idea (preferring hardware “load balancers” instead), you still should implement it. In addition to overcoming those disadvantages of hardware load balancers described above, there are several additional reasons to prefer your own application-level balancing:

With application-level balancing, you can balance between different datacenters. Depending on the nature of your application and deployment architecture, it might help you to deal with DDoS attacks, and help a lot.

With application-level balancing, you can easily balance between different cloud nodes residing wherever-they-prefer-to-reside.

With application-level balancing, you don’t risk that the “load balancer” itself becomes a bottleneck and introduces some packet loss etc., which in turn might affect end-user experience (while this is a bit of a repetition of the disadvantages of hardware load balancers, it is important enough to be mentioned again).

Bottom line: just implement application-level balancing as described above. It will take you two hours to do it (including testing), and in the extreme case, your clients just won’t use it.
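A minimal sketch of such client-side balancing might look like the code below. Everything named here is an assumption made for illustration: FrontEndAddress stands for one entry of the embedded server list, and try_connect is a hypothetical callback which wraps whatever connect-and-handshake logic your client already has and returns true on success.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <string>
#include <vector>

// One entry of the server list embedded into the client (hypothetical layout).
struct FrontEndAddress {
    std::string host;
    unsigned short port;
};

// Hypothetical callback: returns true if a connection to the server succeeded.
using TryConnect = std::function<bool(const FrontEndAddress&)>;

// Tries the servers in random order until one accepts the connection.
// The list is taken by value, so shuffling does not disturb the caller's copy.
// Returns false if every server in the list failed.
template <class Rng>
bool connect_to_any(std::vector<FrontEndAddress> servers,
                    const TryConnect& try_connect,
                    Rng& rng,
                    FrontEndAddress& chosen) {
    std::shuffle(servers.begin(), servers.end(), rng);
    for (const FrontEndAddress& server : servers) {
        if (try_connect(server)) {
            chosen = server;
            return true;
        }
    }
    return false;
}
```

When an established connection fails later, the client simply calls connect_to_any() again; seeding the generator differently on each client (for example, from std::random_device) is what spreads the players across the front-end servers.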

23a. DO use both DNS-based addresses and numeric IP addresses

When implementing your application-level balancing as described right above, store both DNS-based and number-based addresses of the same server within your client-side app. While this advice is somewhat controversial (and once again, network engineers will bash you for doing it), it allows you to handle scenarios when the end-user has Internet access, but his ISP’s DNS server is down (which does happen rather often).

By adding number-based IP addresses to the mix, you’ll make your app able to work when your competition (which is usually about any other game out there) is not working for that specific user. It won’t happen often (around 0.x% of the time, with x being a small integer), so this difference might be insignificant. However, if your game is the only one working when all the competition is down, it improves user perception of your app a lot (at essentially zero cost for you). And with modern app updates being much more frequent than IP address changes (plus you should keep the DNS address too, there is no reason not to do it), all the arguments against using numeric-IPs-which-might-change-all-of-a-sudden become pretty much insignificant.
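As a rough illustration, the per-server fallback could be sketched as follows. The order of preference (DNS name first, numeric address only when resolution fails) is my assumption rather than something prescribed above; FrontEndEntry, Resolver, and pick_address() are hypothetical names, with Resolver standing in for a wrapper around your platform’s name resolution that returns an empty string on failure. The addresses come from the ranges reserved for documentation.

```cpp
#include <functional>
#include <string>

// One server as stored in the client: both forms of the same address.
struct FrontEndEntry {
    std::string dns_name;    // e.g. "fe1.example.com"
    std::string numeric_ip;  // e.g. "192.0.2.10"
};

// Hypothetical resolver: returns the resolved IP as text,
// or an empty string if DNS resolution failed.
using Resolver = std::function<std::string(const std::string& dns_name)>;

// Prefers the DNS name (it follows IP changes), but falls back to the
// stored numeric address when the user's DNS is down.
std::string pick_address(const FrontEndEntry& entry, const Resolver& resolve) {
    if (!entry.dns_name.empty()) {
        std::string ip = resolve(entry.dns_name);
        if (!ip.empty()) return ip;
    }
    return entry.numeric_ip;
}
```

With this in place, a DNS outage on the user’s side degrades into “connect to the last known numeric address” instead of “cannot play”.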



24. DO test your engine/game over a Bad Connection and over Trans-Atlantic Connection

Way too often I have seen network applications which worked perfectly within a LAN, and failed badly when moved to the Internet. If your application is targeted at the Internet, test it over the Internet from the very beginning of development. I really really recommend it. This kind of testing will save you a lot of effort (and a lot of embarrassment) later.

Moreover, you should test your application not on “just any Internet connection”, but should have “the very worst connection you want to support” (which is often “the very worst connection you can find”) just for testing purposes. In one of my large projects back in 2000 or so, we used AOL dial-up for testing purposes, and it worked like a charm. I don’t mean that the connection was good; on the contrary, it was pretty bad, but it meant that after we made our application work over this pretty bad connection, it worked without any issues over any connection.

Another set of tests which you SHOULD do is testing your game over a trans-Atlantic connection. Once I saw a (business) application which worked ok in a LAN, but when deployed over a trans-Atlantic connection, opening a form began to take 20+ minutes. The problem was an obviously excessive number of round-trips; rewriting it to use a different network protocol improved the performance 400x, bringing opening a form down to an acceptable few-seconds level. Writing it this way from the very beginning would have saved a few months of work and a lot of embarrassment when showcasing it to the customers.

With these two test setups (one being “the worst connection you want to support”, another being trans-Atlantic), you can be reasonably confident that your game won’t have too many problems when deployed to the real world.

In addition to testing your engine as described above, you should also encourage developers who write games on top of your engine to do the same. Quite a few networking issues can easily apply not only to the engine but also to the game itself, and sorting them out from the very beginning is invariably a Good Thing.



25. DO analyze your traffic with Wireshark

“Wireshark is a free and open-source packet analyzer.” (Wikipedia)

There is a wonderful tool out there which every network developer should use: Wireshark. If you haven’t done it yet, take a look at your application’s traffic with Wireshark. In the ideal case, it won’t tell you anything new about your game engine, but chances are it will show you something different from what-you’d-expect, and you might be able to improve things as a result. In addition, experience with Wireshark comes in very handy when you need to debug network problems on a live production server. If your server is a Linux one, you can make a tcpdump of the traffic of the user who has problems, get it to your local PC, and use Wireshark to analyze what is happening to this unfortunate user. Neat and very useful in practice!

And if you’re developing a game engine intended for lots of games, consider developing a Wireshark plugin, so that game developers are able to analyze your traffic. While this may be at odds with security-by-obscurity, let’s face it: if your engine is popular, all your formats and protocols will be well-known anyway.

26. DO Find Metrics to Measure your Player Experience Network-Wise

This one is a bit tricky, but the idea is the following. When deploying a large system, there is always a question about system health. And for an over-the-Internet game system, a question of network health is a Really Important one. While you can use different metrics for this purpose, practice has shown that the best metrics are those which are observable in user space.

For example, if you’re using TCP as a transport, you can use “number of non-user-initiated disconnects per user per hour”; if you’re using UDP as a transport – it can be something like “percentage of packets lost” and/or “jitter”. In any case, what is really important – is to have some way to see “how changes in deployed system affect user experience”.
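To show what such a metric could look like in code, here is a small sketch that accumulates “non-user-initiated disconnects per user per hour” separately for each front-end server, so that different boxes (or ISPs, or datacenters) can be compared. The class and its methods are hypothetical names; how you detect that a disconnect was not initiated by the user is left to your protocol.

```cpp
#include <map>
#include <string>

// Illustrative sketch: disconnects per user-hour, grouped by an arbitrary key
// (front-end server name here, but it could be ISP or datacenter as well).
class DisconnectMetric {
    struct Totals {
        double user_hours = 0.0;         // sum of connected time over all sessions
        unsigned long disconnects = 0;   // non-user-initiated disconnects only
    };
    std::map<std::string, Totals> per_server_;

public:
    // Records one finished session.
    // dropped == true means the session ended with a non-user-initiated disconnect.
    void add_session(const std::string& server, double hours_connected, bool dropped) {
        Totals& t = per_server_[server];
        t.user_hours += hours_connected;
        if (dropped) ++t.disconnects;
    }

    // Disconnects per user per hour for the given server; 0 if there is no data.
    double rate(const std::string& server) const {
        auto it = per_server_.find(server);
        if (it == per_server_.end() || it->second.user_hours <= 0.0) return 0.0;
        return static_cast<double>(it->second.disconnects) / it->second.user_hours;
    }
};
```

Comparing rate() between groups of servers before and after a change is then enough to see whether the change made life of your players better or worse.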

Why is this so important? Because it allows you to analyze lots and lots of things which are very difficult to find out otherwise. Just one practical example. At some point, I’ve seen a “great new blade server” installed instead of a bunch of front-end servers, for a large multiplayer game. So far so good, but it was observed that those users connected via this “great new blade server” were experiencing a bit more disconnects per hour than those connected to older-style 1U boxes.

An investigation led to missing flow control on the specific model of blade chassis hub (!), which of course was promptly replaced. While this one single example didn’t make much overall difference from the player’s perspective, over the course of several years there were dozens of such issues (including comparisons with-hardware-balancer vs without-hardware-balancer, comparisons of different ISPs and inter-ISP peerings, comparisons before/after datacenter migration, etc.). I feel that having these metrics has significantly contributed to the reputation of “the best connectivity out there” which has been enjoyed by the game in question.

BTW, the changes which can affect such metrics are not restricted to hardware stuff – certain software changes (such as protocol changes) were also observed to affect user experience metrics. It means that these metrics will be good both for admins-who-deploy-your-game, and for you as a developer to see whether your recent changes have made the life of your players worse. Most importantly, however, this approach allows you to keep your players happy – and this is one thing which really matters (it is players who’re paying the bill, whether we as developers like it or not [NoBugs2011]).

To be Continued…

This post concludes Part III of the article. Stay tuned for Part IV, Great TCP-vs-UDP Debate.

EDIT: The series has been completed, with the following parts published:

Part IV. Great TCP-vs-UDP Debate

Part V. UDP

Part VI. TCP

Part VIIa. Security (TLS/SSL)

Part VIIb. Security (concluded)

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.