

Author: “No Bugs” Hare
Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek






Chapter 9(a) from “beta” Volume III

After drawing all those nice client-side QnFSM-based diagrams, we need to describe our server architecture. The very first thing we need to do is to start thinking in terms of “how are we going to deploy our servers when our game is ready?” Yes, I really mean it – architecture starts not in terms of classes, and for the server-side – not even in terms of processes or FSMs; it starts with the highest-level meaningful diagram we can draw, and for the server-side this is a deployment diagram with servers being its main building blocks. If deploying to the cloud, these may be virtual servers, but the concept of a “server” as a “more or less self-contained box running our server-side software” still remains very central to the server-side software. If you don’t think about clear separation between the pieces of your software, you can easily end up with a server-side architecture that looks nice while you program it, but falls apart on the third day after deployment, exactly when you’re starting to think that your game is a big success.

Deployment Architectures, Take 1

In this Chapter we’ll discuss only “basic” deployment architectures. These architectures are “basic” in a sense that they’re usually sufficient to deploy your game and run it for several months, but as your game grows, further improvements may become necessary. Fortunately, these improvements can be done later, when/if the problems with basic deployment architecture arise; these improvements will be discussed in Chapter [[TODO]].

Also note that for your very first deployment, you may have far fewer physical/virtual boxes than shown on the diagram, by combining quite a few of them together. On the other hand, you should be able to increase the number of your servers quickly, so your software needs to be able to work in the basic deployment architecture from the very beginning. This is important, as demand for more servers can develop very soon if you’re successful. We’ll discuss your very first deployment in Chapter [[TODO]].

First, let’s start with an architecture you shouldn’t do.

Don’t Do It: Naïve Game Deployment Architectures

Quite often, when faced with developing their very first multi-player game, developers start with something like the following Fig VI.1:

It is dead simple: there is a server, and there is a database to store persistent state. And later on, as one single Game World server proves to be insufficient, it naturally evolves into something like the diagram on Fig VI.2:

with each of Game World servers having its own database.

My word of advice about such naïve deployment architectures:

DON’T DO THIS!

Such a naïve approach won’t work well for the vast majority of games. The problem here (usually ranging from near-fatal to absolutely-fatal depending on the specifics of your game) is that this architecture doesn’t allow for interaction between players coming from different servers. In particular, such an architecture becomes absolutely deadly if your game allows some way for a player to choose who they’re playing with (or if you have some kind of merit-based tournament system), in other words – if you’re not allowed to arbitrarily separate your players (and in most cases you will need some kind of interaction at least because of the social network integration, see Chapter II for further discussion in this regard).

CSR: Customer service representatives interact with customers to provide answers to inquiries involving a company's product or services. (Wikipedia)

For the naïve architecture shown on Fig VI.2, any interaction between separate players coming from separate databases leads to huge mortgage-crisis-size problems. Inter-DB interaction, while possible (and we’ll discuss it in Chapter [[TODO]]), won’t work well along these lines and between completely independent databases. You’re going to have lots and lots of problems, ranging from delays due to improperly implemented inter-DB transactions (apparently this is not that easy), to your CSRs going crazy because of two different users having the same ID in different databases. Moreover, if you start like this, you will even have trouble merging the databases later (the very first problem you will face will be collisions in user names between different DBs, with much more to follow).

To summarize the relevant discussion from Chapter II and from the present Chapter:

A. You WILL need inter-player interaction between arbitrary players. If not now, then later.

B. Hence, you SHOULD NOT use the “naïve” architecture shown above.

Fortunately, there are relatively simple and practical architectures which allow you to avoid the problems typical for the naïve approaches shown above.

Web-Based Game Deployment Architecture

If your game satisfies two conditions:

first, it is reeeeallyyyy sloooow-paaaaaced (in other words, it is not an MMOFPS and even not a poker game) and/or “asynchronous” (as defined in Chapter I, i.e. it doesn’t need players to be present simultaneously),

and second, it has little interaction between players (think farming-like games with only occasional inter-player interaction),

then you might be able to get away with Web-Based server-side architecture, shown on Fig VI.3:

Web-Based Deployment Architecture: How It Works

The whole thing looks along the lines of a heavily-loaded web app – with lots of caching, both at the front-end (to cache pages) and at the back-end. However, there are also significant differences (special thanks to Robert Zubek for sharing his experiences in this regard, [Zubek2016]).

The question “which web server to use” is not that important here. On the other hand, there exists an interesting and not-so-well-known web server which went the extra mile to improve communications in game-like environments. I’m speaking about [Lightstreamer]. I didn’t try it myself, so I cannot vouch for it, but what they’re doing with regards to improving interactivity over TCP is really interesting. We’ll discuss some of their tricks in Chapter [[TODO]].

Peculiarities in Web-Based Game architectures are mostly about the way caching is built. First, on Fig VI.3 both front-end caching and back-end caching are used. Front-end caching is your usual page caching (like nginx in reverse-proxy mode, or even a CDN), though there is a caveat. As your current-game-data changes very frequently, you normally don’t want to cache it, so you need to make an effort and clearly separate your static assets (.SWFs, CSS, JS, etc.), which can (and should) be cached, from dynamic pages (or AJAX) with current game state data, which changes too frequently to bother about caching it (and which will likely go directly from your web servers) [Zubek2010].
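To make the static/dynamic separation above a bit more tangible, here is a minimal sketch of a routing helper which picks a Cache-Control header per request path. Everything in it is an assumption made for illustration: the function name `cache_control_for`, the list of extensions, and the one-day max-age are hypothetical and not taken from [Zubek2010]; in real life this kind of rule would more likely live in your reverse proxy or CDN configuration.

```python
import posixpath

# Hypothetical list of static asset types; adjust to whatever your game ships.
STATIC_EXTENSIONS = {".swf", ".css", ".js", ".png", ".jpg"}


def cache_control_for(path):
    """Return a Cache-Control value for a request path (illustrative only)."""
    path_only = path.split("?", 1)[0]          # ignore the query string
    ext = posixpath.splitext(path_only)[1].lower()
    if ext in STATIC_EXTENSIONS:
        # Static assets: safe to cache at nginx/CDN (one day is an assumed value).
        return "public, max-age=86400"
    # Current game state: changes too often to be worth caching.
    return "no-store"


print(cache_control_for("/assets/farm.swf"))
print(cache_control_for("/api/farm/state"))
```

The point of the sketch is only that the decision is made by asset type, so that dynamic game state never ends up in the front-end cache by accident.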

CAS: Compare-And-Swap is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value. (Wikipedia)

At the back-end, the situation is significantly more complicated. According to [Zubek2016], for games you will often want not only to use your back-end cache as a cache to reduce the number of DB reads, but also to make it a write-back cache (!), to reduce the number of DB writes. Such a write-back cache can be implemented either manually over memcached (with web servers writing to memcached only, and a separate daemon writing ‘dirty’ pages from memcached to DB), or a product such as Redis or Couchbase (formerly Membase) can be used [Zubek2016].
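The “web servers write to cache only, a separate daemon writes dirty entries to DB” idea can be sketched as follows. This is a toy in-memory model, not memcached or Redis code: the class `WriteBackCache`, its methods, and the key `farm:42` are all hypothetical names, and the “database” is just a dict.

```python
class WriteBackCache:
    """In-memory stand-in for a write-back cache sitting in front of a DB."""

    def __init__(self, db):
        self.db = db            # dict standing in for the real database
        self.cache = {}         # key -> latest value
        self.dirty = set()      # keys modified since the last flush
        self.db_writes = 0      # counts simulated DB transactions

    def read(self, key):
        # Serve from cache; go to the DB only on a cache miss.
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

    def write(self, key, value):
        # Web servers write to the cache only; the DB is not touched here.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # What the separate daemon would do periodically: one DB write
        # per dirty key, however many times the key was modified.
        for key in sorted(self.dirty):
            self.db[key] = self.cache[key]
            self.db_writes += 1
        self.dirty.clear()


db = {"farm:42": {"carrots": 0}}
cache = WriteBackCache(db)
for _ in range(50):                      # 50 state-modifying clicks
    farm = dict(cache.read("farm:42"))
    farm["carrots"] += 1
    cache.write("farm:42", farm)
cache.flush()
print(db["farm:42"], cache.db_writes)    # {'carrots': 50} 1
```

Fifty modifications end up as a single DB write. The price is that anything not yet flushed is lost if the cache goes down, which is exactly why the flush interval is a design decision and not a detail.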

Taming DB Load: Write-Back Caches and In-Memory States

One Big Advantage of having a write-back cache (and of the in-memory state of the Classical deployment architecture described below) is the huge reduction in the number of DB updates. For example, if we needed to save each and every click on the simulated farm with 25M daily users (each coming twice a day and doing 50 modifying-farm-state clicks each time in a 5-minute session), we could easily end up with 2.5 billion DB transactions/day (which is infeasible, or at least non-affordable). On the other hand, if we keep a write-back cache and write the cache into DB only once per 10 minutes, we reduce the number of DB transactions 50-fold, bringing it to a much more manageable 50 million/day.

For faster-paced games (usually implemented as a Classical Architecture described below, but facing the same challenge of DB being overloaded), the problem surfaces even earlier. For example, to write each and every movement of every character in an MMORPG, we’d have a flow of updates of the order of 10 DB-transactions/sec/player (i.e. for 10’000 simultaneous players we’d have 100’000 DB transactions/second, or around 10 billion DB transactions/day, once again making it infeasible, or at the very least non-affordable). On the other hand, with game states kept in-memory-only (and saving to DB only major events such as changing zones, or gaining a level) – we can reduce the number of DB transactions by 3-4 orders of magnitude, bringing it down to a much more manageable 1M-10M transactions/day.
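The back-of-the-envelope numbers for both the farm example and the MMORPG example can be re-derived in a few lines. All inputs come from the text; the only assumption added here is that a 5-minute session with a 10-minute flush interval results in roughly one DB write per session.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Farm-like game (numbers from the text).
daily_users = 25_000_000
sessions_per_user = 2
clicks_per_session = 50
farm_tx_per_click = daily_users * sessions_per_user * clicks_per_session
# Assumption: with a 10-minute flush and 5-minute sessions,
# each session costs about one DB write.
farm_tx_write_back = daily_users * sessions_per_user

# MMORPG (numbers from the text).
players = 10_000
tx_per_sec_per_player = 10
mmo_tx_per_sec = players * tx_per_sec_per_player
mmo_tx_per_day = mmo_tx_per_sec * SECONDS_PER_DAY

print(farm_tx_per_click)                        # 2500000000
print(farm_tx_write_back)                       # 50000000
print(farm_tx_per_click // farm_tx_write_back)  # 50
print(mmo_tx_per_sec)                           # 100000
print(mmo_tx_per_day)                           # 8640000000
# Reduction by 3 and by 4 orders of magnitude:
print(mmo_tx_per_day // 1_000, mmo_tx_per_day // 10_000)  # 8640000 864000
```

The exact MMORPG figure is 8.64 billion/day, which the text rounds to “around 10 billion”; cutting it by 3-4 orders of magnitude lands at roughly 0.9M to 8.6M per day, consistent with the quoted 1M-10M range.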

As an additional benefit, such write-back caches (as long as you control write times yourself) and in-memory states also tend to play well with handling server failures. In short: for multi-player games, if you disrupt a multi-player “game event” (such as match, hand, or fight) for more than a few seconds, you won’t be able to continue it anyway because you won’t be able to get all of your players back; therefore, you’ll need to roll your “game event” back, and in-memory states provide a very natural way of doing it. See “Failure Modes & Effects” section below for detailed discussion of failure modes under Classical Game Architecture.

A word of caution for stock exchanges. If your game is a stock exchange, you generally do need to save everything in DB (to ensure strict correctness even in case of Game Server loss), so in-memory-only states are not an option, and the DB savings described above do not apply. However, even for stock exchanges, at least the Classical Game architecture described below has been observed to work very well despite DB transaction numbers being rather large; on the other hand, for stock exchanges transaction numbers are usually not as high as for MMORPGs, and the price of the hardware is generally less of a problem than for other types of games.

Write-Back Caches: Locking

As always, having a write-back cache has some very serious implications, and will cause lots of problems whenever two of your players try to interact with the same cached object. To deal with it, there are three main approaches: “optimistic locking”, “pessimistic locking”, and transactions. Let’s consider them one by one.

Optimistic Locking. This one is directly based on memcached’s CAS operation. The idea of using CAS for optimistic locking goes along the following lines. To process some incoming request, Web Server does the following:

reads the whole “game world” state as a single blob from memcached, together with a “cas token”. The “cas token” is actually a “version number” for this object.

we’re optimists! 🙂 So Web Server processes the incoming request, ignoring the possibility that some other Web Server also got the same “game world” and is working on it. At this stage, Web Server is NOT allowed to send any kind of reply back to the user (yet).

Web Server issues a cas operation with both the new value of the “game world” blob and the same “cas token” which it has received.

if the “cas token” is still valid (i.e. nobody has written to the blob since the current Web Server read it), memcached writes the new value and returns ok. Then our Web Server may send the reply back to whoever requested it.

if, however, there was a second Web Server which has managed to write after we’ve read our blob – memcached will return a special error. In this case, our Web Server MUST discard all the prepared replies. In addition, it MAY read the new value of the “game world” state (with a new “cas token”) and try to re-apply the incoming request to it. This is perfectly valid: it is just “as if” the incoming request had come a little bit later (which can always happen).
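The read / process / cas / retry loop described above can be sketched like this. It is a simulation under stated assumptions: `FakeCasStore` is a hypothetical in-memory stand-in that only mimics memcached-like gets/cas semantics (it is not a memcached client), and `handle_request`, `plant` and the key `world:1` are made-up names. The collision is forced artificially by letting a “rival” write in the middle of the first attempt.

```python
class FakeCasStore:
    """Tiny in-memory stand-in with memcached-like gets/cas semantics."""

    def __init__(self):
        self._data = {}   # key -> (value, version)

    def set(self, key, value):
        _, version = self._data.get(key, (None, 0))
        self._data[key] = (value, version + 1)

    def gets(self, key):
        # Returns the value together with its "cas token" (a version number).
        return self._data[key]

    def cas(self, key, value, token):
        # Writes only if nobody has written since `token` was handed out.
        _, version = self._data[key]
        if version != token:
            return False
        self._data[key] = (value, version + 1)
        return True


def handle_request(store, key, process_event, event, max_attempts=5):
    for _ in range(max_attempts):
        state, token = store.gets(key)
        new_state, reply = process_event(state, event)
        if store.cas(key, new_state, token):
            return reply          # only now may the reply be sent
        # cas failed: the prepared reply is discarded, and we retry
        # against the fresh state.
    raise RuntimeError("too many collisions")


def plant(state, event):
    return {**state, "carrots": state["carrots"] + event["amount"]}, "ok"


store = FakeCasStore()
store.set("world:1", {"carrots": 0})
attempts = {"n": 0}


def plant_with_rival(state, event):
    attempts["n"] += 1
    if attempts["n"] == 1:
        # Simulate another Web Server writing between our read and our cas.
        store.set("world:1", {"carrots": 100})
    return plant(state, event)


reply = handle_request(store, "world:1", plant_with_rival, {"amount": 1})
print(reply, attempts["n"], store.gets("world:1")[0])
```

The first attempt is thrown away (its result was computed from stale state), and the second attempt is applied on top of the rival’s write, so the request behaves as if it had simply arrived a bit later.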



Optimistic locking is simple, is lock-less (which is important, see below why), and has only one significant drawback for our purposes. That is, it works fine as long as collision probability (i.e. the probability of two Web Servers working on the same “game world” at the same time) is low, but as soon as this probability grows (beyond, say, 10%) – you will start getting a significant performance hit (from processing the same message twice, three times, and so on). For slow-paced asynchronous games it is very unlikely to become a problem, and therefore by default I’d recommend optimistic locking for web-based games, but you still need to understand the limitations of the technology before using it.

Pessimistic Locking. This is pretty much classical multi-threaded mutex-based locking, applied to our “how to handle two concurrent actions from two different Web Servers over the same ‘game world’” problem.

In this case, game state (usually stored as a whole in a blob) is protected by a sorta-mutex (so that two web servers cannot access it concurrently). Such a mutex can be implemented, for example, over something like memcached’s CAS operation [Zubek2010]. For pessimistic locking, Web Server acts as follows:

obtains lock on mutex associated with our “game world” (we’re pessimists 🙁 , so we need to be 100% sure before processing that we’re not processing in vain). If mutex cannot be obtained – Web Server MAY try again after waiting a bit

reads “game world” state blob

processes it

writes “game world” state blob

releases lock on mutex

This is a classical mutex-based schema and it is very robust when applied to classical multi-thread synchronization. However, when applying it to web servers and memcached, there is a pretty bad caveat 🙁 . The problem here is related to the “how to detect a hung/crashed web server – or process – which didn’t remove the lock” question, as such a lock will effectively prevent all future legitimate interactions with the locked game world (which reminds me of the nasty problems from the early-90ish pre-SQL FoxPro-like file-lock-based databases).

For practical purposes, such a problem can be resolved via timeouts, effectively breaking the lock on mutex (so that if original mutex owner of the broken mutex comes later, he just gets an error). However, allowing to break mutex locks on timeouts, in turn, has significant further implications, which are not typical for usual mutex-based inter-thread synchronizations:

first, if we’re breaking mutex on timeout – there is a problem of choosing the timeout. Set it too low, and we can end up with fake timeouts; set it too high, and we’ll end up with frustrated users

second, it implies that we’re working EXACTLY according to the pattern above. In particular: having more than one memcached object per “game world” is not allowed, and “partially correct” writes of “game state” are not allowed either, even if they’re intended to be replaced “very soon” under the same lock
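A minimal sketch of the lock / read / process / write / release pattern with lock-breaking on timeout might look as follows. It is a toy model and not memcached code: `FakeLockStore`, `try_lock` and `write_and_unlock` are hypothetical names, time is passed in explicitly to keep the example deterministic, and the sketch simply assumes that “check owner, write blob, release lock” happens atomically (which a real implementation would have to ensure).

```python
import uuid


class FakeLockStore:
    """In-memory stand-in: one blob per game world plus an expiring mutex."""

    def __init__(self):
        self.blobs = {}    # key -> whole "game world" state blob
        self._locks = {}   # key -> (owner, expires_at)

    def try_lock(self, key, now, ttl):
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now:
            return None                    # held and not yet expired
        # Free, or expired: an expired lock is broken right here.
        owner = uuid.uuid4().hex
        self._locks[key] = (owner, now + ttl)
        return owner

    def write_and_unlock(self, key, owner, blob):
        holder = self._locks.get(key)
        if holder is None or holder[0] != owner:
            # Our lock was broken on timeout: the late owner just gets an error.
            raise RuntimeError("lock was broken on timeout; write rejected")
        self.blobs[key] = blob             # one whole-blob write, never partial
        del self._locks[key]


store = FakeLockStore()
store.blobs["world:1"] = {"carrots": 0}

slow = store.try_lock("world:1", now=0.0, ttl=5.0)      # server A locks, then hangs
blocked = store.try_lock("world:1", now=1.0, ttl=5.0)   # server B has to wait
fast = store.try_lock("world:1", now=6.0, ttl=5.0)      # timeout passed: B breaks the lock
store.write_and_unlock("world:1", fast, {"carrots": 1})

try:
    store.write_and_unlock("world:1", slow, {"carrots": 999})   # A wakes up too late
    late_error = None
except RuntimeError as exc:
    late_error = str(exc)

print(blocked, store.blobs["world:1"], late_error)
```

Note how both restrictions from the list above show up: the state is a single object written in one go, so a broken lock can never leave the world half-updated, and the `ttl` value is exactly the “too low vs. too high” knob.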



In practice, these issues rarely cause too many problems when using memcached for mutex-based pessimistic locking. On the other hand, as for memcached we’d need to simulate mutex over CAS, I still suggest optimistic locking (just because it is simpler and causes fewer memcached interactions).

Transactions. Classical DB transactions are useful, but dealing with concurrent transactions is really messy. All those transaction isolation levels (with interpretations subtly different across different databases), locks, and deadlocks are not a thing which you really want to think about.

Fortunately, Redis transactions are completely unlike classical DB transactions and come without all this burden. In fact, a Redis transaction is merely a sequence of operations which are executed atomically. It means no locking, and an ability to split your “game world” state into several parts to deal with traffic. On the other hand, I’d rather suggest staying away from this additional complexity as long as possible, using Redis transactions only as a means of optimistic locking as described in [Redis.CAS]. Another way of utilizing capabilities of Redis transactions is briefly mentioned in “Web-Based Deployment Architecture: FSMs” section below.

Web-Based Deployment Architecture: FSMs

You may ask: how can finite state machines (FSMs) possibly be related to the web-based stuff? They seem to be as different as night and day, don’t they?

Actually, they’re not. Let’s take a look at both optimistic and pessimistic locking above. Both are taking the whole state, generating new state out of it, and storing this new state. But this is exactly what our FSM::process_event() function from Chapter V does! In other words, even for web-based architecture, we can (and IMHO SHOULD) write processing in an event-driven manner, taking state and processing inputs, producing state and issuing replies as a result.

As soon as we’ve done it this way, the question “Should we use optimistic locking or pessimistic one”, becomes a deployment implementation detail

In other words, if we have an FSM-based (a.k.a. event-driven) game code, we can change the wrapping infrastructure code around it, and switch it from optimistic locking to pessimistic one (or vice versa). All this without changing a single line within any of FSMs!
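To illustrate the point, here is a sketch with one event-processing function and two interchangeable pieces of infrastructure code around it. All names are hypothetical (`process_event` here is a free function standing in for FSM::process_event(), and `OptimisticInfra` / `PessimisticInfra` are made up for this example), and both wrappers keep state in process memory instead of memcached; the only thing being demonstrated is that the game logic is identical under both.

```python
import threading


def process_event(state, event):
    """Game logic: takes state and input, returns new state and replies.
    It knows nothing about locking or storage."""
    new_state = dict(state)
    crop = event["crop"]
    new_state[crop] = new_state.get(crop, 0) + 1
    return new_state, [f"planted {crop}"]


class OptimisticInfra:
    """Infrastructure code: version check on write, retry on collision."""

    def __init__(self, state):
        self.state, self.version = state, 0
        self._guard = threading.Lock()   # only makes the fake read/cas atomic

    def _read(self):
        with self._guard:
            return self.state, self.version

    def _cas(self, new_state, seen_version):
        with self._guard:
            if self.version != seen_version:
                return False
            self.state, self.version = new_state, self.version + 1
            return True

    def handle(self, event):
        while True:
            state, seen = self._read()
            new_state, replies = process_event(state, event)
            if self._cas(new_state, seen):
                return replies


class PessimisticInfra:
    """Infrastructure code: take the mutex, then read, process, write."""

    def __init__(self, state):
        self.state = state
        self._mutex = threading.Lock()

    def handle(self, event):
        with self._mutex:
            self.state, replies = process_event(self.state, event)
            return replies


events = [{"crop": "carrot"}, {"crop": "carrot"}, {"crop": "corn"}]
optimistic, pessimistic = OptimisticInfra({}), PessimisticInfra({})
for ev in events:
    optimistic.handle(ev)
    pessimistic.handle(ev)
print(optimistic.state, pessimistic.state)
```

Swapping one wrapper for the other touches no line of `process_event`, which is what makes the locking choice a deployment-time decision.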

Moreover, if using FSMs, we can even change from Web-Based Architecture to Classical one and vice versa without changing FSM code

If by any chance reading the whole “game world” state from cache becomes a problem (which it shouldn’t, but you never know), it MIGHT still be solved via FSMs together with Redis-style transactions mentioned above. Infrastructure code (the one outside of FSM) may, for example, load only a part of the “game world” state depending on the type of input request (while locking all the other parts of the state to avoid synchronization problems), and also MAY implement some kind of on-demand exception-based state loading along the lines of on-demand input loading discussed in [[TODO]] section below.

Web-Based Deployment Architecture: Merits

Unlike the naïve approach above, Web-Based systems may work. Their obvious advantage (especially if you have a bunch of experienced web developers on your team) is that they use familiar and readily-available technologies. Other benefits are also available, such as:

easy-to-find developers

simplicity and being relatively obvious (that is, until you need to deal with locks, see above)

web servers are stateless (except for caching, see below), so failure analysis is trivial: if one of your web servers goes down, it can be simply replaced

can be easily used both for the games with downloadable client and for browser-based ones

Web-Based Architecture (as well as any other one), of course, also has downsides, though they may or may not matter depending on your game:

there is no way out of web-based architecture; once you’re in – switching to any other one will be impossible. It might not be that important for you, but keep it in mind.

it is pretty much HTTP-only (with an option to use Websockets); migration to plain TCP/UDP is generally not feasible.

as everything will work via operations on the whole game state, different parts of your game will tend to be tightly coupled. Not a big problem if your game is trivial, but may start to bite as complexity grows.

as the number of interactions between players and game world grows, Web-Based Architecture becomes less and less efficient (as distributed-mutex-locked accesses to retrieve whole game state from the back-end cache and write it back as a whole, don’t scale well). Even medium-paced “synchronous” games such as casino multi-players, are usually not good candidates for Web-Based Architecture.

you need to remember to keep all the accesses to game objects synchronized; if you miss one – it will work for a while, but will cause very strange-looking bugs under heavier load.

you’ll need to spend A LOT of time meditating over your caching strategy. As the number of players grows, you’re very likely to need a LOT of caching, so start designing your caching strategies ASAP. See above about peculiarities of caching when applied to games (especially the write-back part and mutexes), and do your own research.

as the load grows, you will be forced to spend time on finding a good and really-working-for-you solution for that nasty web-server-never-releases-mutex problem mentioned above. While not as hopeless as ensuring consistency within pre-SQL DBF-like file-lock-based databases, expect quite a chunk of trouble until you get it right.

Still,

if your game is rather slow/asynchronous and inter-player interactions are simple and rather far between, Web-Based Architecture may be the way to go

While Classical Architecture described below (especially with Front-End Servers added, see [[TODO]] section) can also be used for slow-paced games, implementing it yourself just for this purpose is a Really Big Headache and might easily not be worth the trouble if you can get away with the Web-Based one. On the other hand,

even for medium-paced synchronous multi-player games (such as casino-like multi-player games) Web-Based Architecture is usually not a good candidate

(see above).