

Author: “No Bugs” Hare  Job Title: Sarcastic Architect  Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs, Calling a Spade a Spade, Keeping Tongue in Cheek

[Chapter 8(b) from “beta” Volume III]

The Myth of Stateless-Only Scalability

For quite a long time (especially in the webdev world), there has existed a perception that to achieve scalability, all our request handlers need to be as stateless as possible (in particular, RESTful web services don’t allow for in-memory state1). In the world of the all-popular Docker containers, this means that all the app containers need to be not only immutable, but also ephemeral (a.k.a. disposable).2

From a practical perspective, it translates into the following observation:

It is widely (MIS)believed that stateless Server-Side Apps are The Only Way™ to scale Server-Side.

In this statement, it is “The Only” part which I am arguing against. Sure, having perfectly stateless request processing is a Good Thing™ – but only as long as you can afford it <sad-face />.

Pushing Scalability Problem to the Database

I am innocent of the blood of this just person: see ye to it. — Pontius Pilate, Matthew 27:24 —

The problem with stateless processing (such as RESTful-style stateless request handlers) is that

If our functional specification requires storing the state on Server-Side, 3 and we’re using stateless request handlers – then all the state inevitably ends up in the database.

This, in turn, means that by going for stateless request processing:

There is no scalability problem on the request processing side anymore. In other words, we DO have perfect scalability for request processing apps, yay!

The whole scalability problem rears its ugly head on the database level. And as we’ll see below – it becomes much worse at that point, up to the point of being completely unmanageable <sad-face />.



In other words –

keeping our request handlers stateless does NOT really solve the scalability problem; instead – it merely pushes it to the database.

Sure, if we’re working in a classical Huge-Company environment, we (as app developers) can say “it is not our problem anymore” (washing our hands of the matter Pilate-style). However, if our aim is not merely to cover our ***es and keep our current job while the project goes down, but rather (in a true DevOps spirit) to make sure that the whole system succeeds – we need to think a bit further. Most importantly, we need to realize that pushing the problem from us to the DBAs isn’t the end of the scalability problems; instead – we should ask ourselves:

with the kind of load we’ll be throwing at the database, will it be feasible to scale the database (that is, without spending millions on hardware/licenses/maintenance)?

Databases and Scalability

As mentioned above – we as app-developers DO need to think about how much load we want to throw at the database. In this department, there is some pretty bad news for us. In spite of what your DBA may (and your database salesman will) tell you – in general,4 databases certainly do NOT scale trivially in a linear manner. Worse than that –

In pretty much any serious real-world interactive system, it is the database which is The Bottleneck™.

I remember a discussion with some very knowledgeable architect-level guys from a pretty large company as early as 2000; during the discussion, we had quite a few disagreements, but one thing was very obvious to everybody involved: scaling everything-besides-the-database is trivial; it is the database which is going to cause trouble scalability-wise. Since then, I’ve seen (and built) quite a few serious systems – and haven’t seen anything which might have changed my opinion about it.

When speaking about real-world OLTP databases in 2017,5 the following very practical observations usually stand (we’ll discuss them in more detail in Vol. VI’s chapter on Databases):6

Pretty much any kind of load up to approximately 10 write-ACID-transactions/second is trivial

When you need loads on the order of 100 write-ACID-transactions/second – it is usually doable with the usual database optimizations (indexes, caches, physical layout, BBWC RAID, etc.).

Getting to 1000 write-ACID-transactions/second becomes severely non-trivial. If the database structure and loads don’t allow for trivial sharding (in particular – if we need to allow players to play with anybody-they-want-to-play-with) – things start to get ugly. We’ll discuss one way to do it, in Vol. VI’s chapter on Databases.

For a non-trivially-shardable database, 100 billion write-transactions/year inevitably becomes The Ultimate Nightmare™ for DBAs.

Everything above this is both (a) very rarely really necessary,7 and (b) in practice, works only without ACID requirement, or for trivially-shardable scenarios.

In other words –

Increasing DB load by a factor of 10x can easily kill the whole thing.

This, in turn, means that

When writing our apps, we MUST care about DB load.

As we’ll see below – app-level in-memory state can easily reduce the number of write-ACID-transactions by a factor of 10x to 1000x(!), which makes a huge practical difference; as a result – at the very least we should include stateful apps in our consideration.

NoSQL to the rescue? Not really

When speaking about databases and scalability – these days we’re often told “hey, there are lots of NoSQL databases which can handle Big Data and scale to infinity”, implying that the whole scalability problem goes away.

Unfortunately, this is not the case. While NoSQL databases indeed shine in certain scenarios where we need to perform read-only queries – they tend to have very significant problems when dealing with OLTP-like loads, with lots of writes and very high coherency requirements. We’ll discuss these issues in detail in Vol. VI’s chapter on Databases, but for now let’s note that the vast majority of NoSQL databases do not support multi-object database transactions with ACID guarantees (and the proposed equivalents either don’t scale, or aren’t usable, or both). Yes – however sad it might sound, ACID support in most of the NoSQL DBs out there is limited to single-object ACID transactions (which is not enough for 99% of OLTP processing tasks); moreover – extending this support to multiple objects under traditional NoSQL architectures would be at odds with their scalability.

BTW, I don’t mean that NoSQL is a Bad Thing™; it is just that each technology should be used within its own applicability realm. In particular, these days OLTP is still better performed over a traditional RDBMS; however – as we’ll see in Vol. VI’s chapter on Databases – it is perfectly possible to have an eventually-consistent replica of this OLTP DB in a Big-Data-oriented NoSQL database, to process all kinds of historical queries there.

Scaling: Stateless vs Stateful Server-Side Apps

Let’s compare two approaches to scaling our MOG (or any other Server-Side-centric interactive distributed system for that matter). One scaling approach will use classical Stateless Server-Side Apps, and the other will use Stateful Server-Side Apps (with an in-memory state).

We’ll be comparing these two approaches to scaling from several different angles; in particular – from the perspectives of performance, durability, and scalability.

Performance Perspective

As noted above, in the case of Stateless Server-Side Apps, we’re bound to store everything in the DB. And for the vast majority of game-like systems, this is going to be prohibitively expensive. A few examples from different genres:

For a virtual world simulation, writing everything to the database is going to be a non-starter; as the state of each player usually changes 20 times per second, making this many transactions per player is going to kill the whole thing even for a very modest number of players.

For a casino-like game such as poker, we’ll need to write every single player action to DB. This means making on average about 20 DB transactions per hand.

Even for a social farming game, we could easily end up with several dozen clicks per player-currently-using-farm, per minute.

Let’s compare this to an app with an in-memory state, one which writes changes to the Game World State to DB only at the end of the Game Event (as discussed in detail in the [[TODO]] section above):

For a virtual world simulation, we can write changes to player state only at the end of Game Events such as fights (conversations having consequences, etc.). This will often allow us to write things to DB once per minute or so (which is a 1200x(!) improvement compared to the stateless approach above).

For a poker game, we’ll need to write only the outcome of each hand to DB, corresponding to approximately 20x savings compared to the naïve stateless approach.

For a farming game, most of the time we can make an artificial Game Event (which ends either on a Really Important Achievement™, or after a certain timeout). In practice – we can easily save up to 10x DB-activity-wise (compared to the stateless app).
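The write-at-end-of-Game-Event pattern above can be sketched in a few lines. The sketch below (all names hypothetical, with a mock in place of a real DB) contrasts the two approaches for the poker example: one DB transaction per player action vs. one transaction per hand.

```python
class MockDb:
    """Counts write transactions instead of talking to a real DB."""
    def __init__(self):
        self.write_tx_count = 0

    def write(self, data):
        self.write_tx_count += 1

def play_hand_stateless(db, actions):
    # Stateless handler: no in-memory state survives between requests,
    # so every single action must be persisted immediately.
    for action in actions:
        db.write(action)

def play_hand_stateful(db, actions):
    # Stateful handler: accumulate the hand in memory, and write only
    # the outcome when the Game Event (the hand) is over.
    in_memory_state = []
    for action in actions:
        in_memory_state.append(action)   # not durable until the hand ends
    db.write({"outcome": in_memory_state[-1],
              "actions": len(in_memory_state)})

actions = [f"action-{i}" for i in range(20)]   # ~20 actions per hand

db1, db2 = MockDb(), MockDb()
play_hand_stateless(db1, actions)
play_hand_stateful(db2, actions)
print(db1.write_tx_count, db2.write_tx_count)  # 20 vs 1 -> ~20x savings
```

The same shape applies to the simulation and farming examples; only the definition of “Game Event” (fight, timeout, Really Important Achievement™) changes.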

As we can see –

Server-Side Apps with an In-Memory State can easily save us 10x-1000x of database load.

Moreover, as it is the DB which is usually The Bottleneck™ – it means that we’re saving this enormous amount of load exactly where it really matters.

BTW, this observation goes far beyond traditional games. Some of us remember that epic migration of Uber first from MySQL to PostgreSQL in 2013, only to migrate back to MySQL (with a custom extension) 3 years later.

However, I have heard an opinion that Uber would have fared much better if they’d avoided the supposedly perfectly-scalable stateless-app architecture, and kept the most common update (reported by the same source as storing the “current position of the car”) as mostly-in-memory only (writing to DB at large intervals, like “write the whole history of the car positions once per hour per car”8) – pretty much along the lines discussed above. Sure, all the trips would still need to be saved immediately (they have direct financial implications, and do need to be durable even if the Server-Side App crashes), but with the mere 1 million trips per day which Uber has (that’s just 30 transactions/second even accounting for intra-day load variations) – writing them down is trivial even for a single-writing-DB-connection OLTP system (in fact, a single-writing-DB-connection system has been seen handling 50M+ real-world transactions/day9); for details – see Vol. VI’s chapter on Databases.
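A quick back-of-envelope check of the numbers above: only the 1 million trips/day figure comes from the text; the 2.5x peak-hour factor is my own assumption for illustrating the intra-day load variation.

```python
# Sanity-check the trip-writing load: 1M trips/day is a tiny OLTP load.
trips_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60

avg_tps = trips_per_day / seconds_per_day   # average transactions/second
peak_tps = avg_tps * 2.5                    # assumed peak-hour factor

print(round(avg_tps, 1), round(peak_tps))   # 11.6 29
```

Roughly 12 tx/second on average and about 30 tx/second at peak – comfortably within reach of a single-writing-DB-connection system.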

Sure, as I didn’t do this analysis myself – I cannot really vouch for it, but I have to say that given the numbers above, it looks quite plausible. Moreover, I did see real-world systems (which I unfortunately cannot name here) which experienced exactly this kind of problem (and exactly due to making everything stateless, effectively increasing database load 10x+-fold, and causing lots of trouble for the database, the DBAs, and ultimately – the end-users).

Durability Perspective

Of course, these performance benefits of Stateful Server-Side Apps don’t come for free (nothing does). The currency we’ll be paying with for this drastically improved performance is Lack of Durability. In other words – if our Stateful Server-Side App crashes, we’ll lose all the state which hasn’t been saved to the DB yet.

At first glance, this lack of Durability may look Really Bad™, but on the other hand, for most MOGs – it is exactly the behavior we want (see the [[TODO]] section above for discussion). Moreover, even for non-gaming interactive systems such as Uber, going along these lines is perfectly acceptable at least for some of the data. Using Uber data as an example – if in case of an App crash we lose trips, it would be a significant problem; but losing half an hour of historical data about positions of the cars won’t be noticeable (as this historical data is used only for statistical purposes – losing a very minor random portion of it won’t change the stats).

Alternatively – there is a possibility to make our Stateful Apps Fault-Tolerant (for a relevant discussion – see Chapter 10). Still, to be honest, as Fault Tolerance doesn’t prevent software-bug-induced crashes – for fast-changing business- and game-like apps I’d rather not risk relying on it (in other words – in the business world,10 crash costs and crash prevention costs are balanced in a way that implies that sooner or later, crashes will happen).

Scaling Perspective

[Bart] – Well, you’re damned if you do,

[Chorus] – Deep, deep trouble.

[Bart] – Well, you’re damned if you do,

And you’re damned if you don’t. — The Simpsons —

As already noted above, performance != scaling, so let’s take a look at our Stateful-vs-Stateless Server-Side Apps from the scaling perspective too.

Scaling Stateless-App-Based System

As discussed above, when speaking about a system based on Stateless Apps, scalability is trivially achievable: we just need to create a new (or use an existing) instance of our Stateless App – and bingo! – we’ve got our scalability.

A very high-level diagram of this approach is shown on Fig 8.1:

It all looks very simple: when Clients come to our Load Balancer asking for a service from App 1 – they’re randomly directed to one of the instances of App 1 (these instances may run on one Server Box or may be spread over several different ones); the same happens for App 2 (or any other type of App).

In this model, all the Apps are perfectly stateless (i.e. they carry no meaningful state between requests), and therefore they can be created/destroyed as necessary. From the point of view of scaling Apps – it is a perfect scenario.
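The create/destroy-as-necessary property can be sketched as follows (a toy dispatch along the lines of Fig 8.1; all names are hypothetical and a real Load Balancer would of course sit in front of network sockets, not Python objects):

```python
import random

class StatelessApp:
    """Carries no state between requests, so any instance can serve anyone."""
    def handle(self, request):
        return f"processed {request}"

class LoadBalancer:
    def __init__(self):
        self.instances = []

    def add_instance(self, app):
        # "Scaling out" is just this - no state to migrate anywhere
        self.instances.append(app)

    def dispatch(self, request):
        # Any instance will do, as there is no per-client state to honor
        return random.choice(self.instances).handle(request)

lb = LoadBalancer()
lb.add_instance(StatelessApp())
result = lb.dispatch("login:alice")   # served by the only instance
lb.add_instance(StatelessApp())       # under load? just add more instances
```

Exactly because instances are interchangeable, both adding and destroying them is trivial – which is the whole point of the Stateless-App model.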

On the other hand, as discussed above, the real-world task is never formulated in terms of scaling only the apps; instead – we need to scale the whole system, including the database; and in this regard Stateless-App-based systems exhibit significant problems.

In particular, as discussed above, for Stateless-App-based architectures, all the scalability work is pushed down to the database. Of course, for some Server-Side developers it merely means pushing the responsibility to somebody else with relief, but we’re currently wearing our architectural hat, so assuming that “somebody will do it for us” (without an understanding of how it will be done) is not really an option.

Moreover, in practice, this DB (which needs to handle all the state updates merely because there is no other place to store them) becomes a very bad white-hot bottleneck. Once, a DBA of such a system told me about a nightmare he had – it was about the servers getting so hot that they started melting. Fortunately, I never found myself in such a position – but I can understand him perfectly:

Pushing unnecessary stuff to DB-which-is-already-The-Bottleneck, is a Pretty Bad Idea™.

As discussed above – scaling DBs is a very well-known huuuuge headache (a.k.a. Deep Trouble™); achieving even 1000 real-world transactions11 per second (which – taking into account usual MOG-like load patterns – corresponds to about 30M transactions/day, or 10 billion transactions/year) is already pretty difficult; going above this number has several very unpleasant consequences:

Non-trivial solutions are required.

Costs go through the roof; and as the dependency between load and costs is highly non-linear, spending more doesn’t help much.12

The job of DBAs becomes extremely difficult.

Overall reliability suffers, starting from a very simple observation: the more DB Server Boxes you need to run – the higher the chances are that at least one of them crashes. This, in turn, leads to convoluted fault-tolerant systems (with fault tolerance further taking its toll both in terms of bugs and in terms of reduced performance). Just to illustrate my point: to handle a million writes per second, Uber uses at least a few dozen boxes (running Cassandra on top of Mesos) [Verma]13 to achieve the desirable reliability of 99.99%; this, in turn, required them to build a full-scale fault-tolerant system with transparent handling of box failures, etc. In an alternative architecture (one using some kind of in-memory caching14) – a similar task of writing one million rows per second in a highly parallelizable environment can be achieved on one single server box running a decent RDBMS, which tends to have comparable reliability15 at least at 10x less cost.



Of course, in some cases simple sharding will do the trick; in particular – for most single-player games sharding is trivial (each player sits in her own shard, with no interactions between the shards). However, as we’re speaking about multiplayer games (where it is usually impossible to restrict “which players are allowed to interact with each other” – see also the discussion in Vol. I’s chapter on GDD) – sharding will rarely work (at least not without significant help from the app level).

Still, as we’ll discuss in Vol. VI’s chapter on Databases – it is usually possible to scale an MOG OLTP DB to 100M DB transactions/day and probably beyond; however – it is a very significant effort which requires lots of complicated work (and while it seems to have no apparent scalability problems – I didn’t see it scaling beyond 100M DB transactions/day, so I cannot really vouch for it).

Based on all the discussion above, we can make one very important conclusion:

if we can avoid (or at least significantly postpone) this effort by reducing DB load by a factor of 10x-100x – we should do it.

Note that I am not arguing for pursuing optimizations which save a mere 20% at the architectural stage; these are usually too small16 to shift the balance from one architecture to another; however, a 10x performance improvement due to a better architecture most of the time qualifies as a game changer (pun intended).

Scaling Stateful-App-Based System

With a system which is based on Stateful Apps (and assuming that our business logic/GDD is ok with relaxed Durability guarantees for the data which we decide to keep in-memory only) – we’re often able to reduce DB load by a factor of 10x-100x (and as observed above, for DB load, this kind of difference usually means a Damn Lot™).

A corresponding diagram is shown in Fig 8.2:

Compared to Stateless-Based approach shown on Fig 8.1, we can see the following differences:

Our Apps got an In-Memory State. In turn, it means that:

This In-Memory State is not durable, and can be lost (as discussed above, this is exactly the behavior we want while a Game Event is in progress).

Achieving balance between different Apps is not that trivial (in practice, I’ve never seen imbalance become a significant problem, but I do know scenarios where it may happen17).

As our Apps are now Stateful, it means that (unlike with stateless web apps) there are potentially two separate aspects to our Load Balancing:

First, we need to make sure that the server-box-which-carries-our-state has enough CPU power; in other words – we need to Load-Balance our Stateful Apps between different Server Boxes. OTOH, for slower-paced games, this can be bypassed by moving state around as necessary; see Chapter 9 for a discussion of Web-Based Architectures as an example. Most of the time, this balancing will be less perfect than Load Balancing of Stateless Apps; from what I’ve seen – it is possible to keep these discrepancies in check, but in certain cases it may become a rather significant headache. For games, usually, this flavor of Load Balancing (of the objects among Server Boxes) is performed by the Matchmaking Server (not shown on Fig 8.2). The idea is that it is the Matchmaking Server which decides to run a game (and where) – so it decides where to create the next instance. A completely separate Load Balancer (which may include moving Game World instances around) is also possible. Interesting variations of such Load Balancing include balancing of “seamless” MMO worlds as discussed in [Beardsley] and [Baryshnikov].

Second – we may need to balance incoming players (or requests) among the Servers. While usually this is not a problem (as each of the Apps tends to serve about the same number of players) – in cases when we’re broadcasting some Very Important Game™ (like some Big Final) to everybody-who-wants-it, this MAY become a problem. For an example of a solution – see the discussion of Front-End Servers in Chapter 9.
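The first flavor of Load Balancing above (placing Stateful Apps on Server Boxes) can be sketched as a Matchmaking-Server-style placement decision. All names below are hypothetical; a real Matchmaking Server would of course weigh CPU, player counts, geography, etc., not just an instance count:

```python
class ServerBox:
    def __init__(self, name):
        self.name = name
        self.game_worlds = []   # Stateful Game World instances pinned here

    @property
    def load(self):
        return len(self.game_worlds)

def place_game_world(boxes, world_id):
    # Unlike a stateless dispatch, this decision is sticky: the Game World
    # (and its In-Memory State) stays on the chosen box afterwards.
    box = min(boxes, key=lambda b: b.load)
    box.game_worlds.append(world_id)
    return box

boxes = [ServerBox("box-1"), ServerBox("box-2")]
place_game_world(boxes, "world-1")        # both boxes empty -> goes to box-1
chosen = place_game_world(boxes, "world-2")
print(chosen.name)                        # box-2, the least loaded at that point
```

The stickiness is exactly what makes this balancing less perfect than the stateless kind: once placed, an instance (and its In-Memory State) cannot be trivially moved.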

The main advantage of this approach is the reduced DB load. In turn, it means that we need to identify those Game Events which allow/require writing to DB18, and to understand the implications of rolling back to the beginning of the Game Event in case of a crash.



Overall, from what I’ve seen in the wild, Stateful-App-Based systems tend to both perform and scale much, much better than Stateless-App-Based ones. Of course, if Durability is a firm requirement, these architectures won’t fly (at least not without ensuring fault tolerance for the Apps, preserving their In-Memory State).

Scaling System Based on Stateless-App plus In-Memory Write-Back Cache

At this point, we have to note that strictly speaking, having an In-Memory State somewhere in the system does NOT necessarily imply that it is our Apps which need to be Stateful.

Instead of using Stateful Apps, to save on the DB load while keeping our Apps stateless, we can have a centralized In-Memory write-back(!) cache sitting between our apps and DB, as shown on Fig 8.3:

From the point of view of scaling, this model is a kind of “hybrid” between the Stateless-App and Stateful-App models. In particular, with such a Stateless-App-plus-In-Memory-Write-Back-Cache model:

Like with Stateful-Apps, we do reduce DB load a lot. As discussed above, this simplifies scaling DB greatly

Like with Stateful-Apps, we do need to identify our Game Events, and to ensure DB writes at the end of Game Events (though with In-Memory Cache, it will be done by write-back cache on instructions of our App)

Like with Stateful-Apps, we do sacrifice Durability between Game Events (i.e. crash of Write-Back Cache kills all the stuff which wasn’t written to DB yet)

Like with Stateless-Apps, we can Load-Balance only the incoming requests (or players) – and there is no need to Load-Balance the Stateful-Apps. Scaling In-Memory Cache is rarely a problem.
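A minimal write-back cache sketch along the lines of Fig 8.3 (hypothetical API, with a mock DB): the Stateless Apps mutate state through the cache, and the DB sees one write per Game Event rather than one per mutation; anything not yet flushed is lost if the cache crashes.

```python
class MockDb:
    def __init__(self):
        self.write_tx_count = 0

    def write(self, row):
        self.write_tx_count += 1

class WriteBackCache:
    def __init__(self, db):
        self.db = db
        self.state = {}      # the In-Memory State, shared by Stateless Apps
        self.dirty = set()   # keys changed since the last flush

    def update(self, key, value):
        self.state[key] = value   # in-memory only - NOT durable yet
        self.dirty.add(key)

    def end_of_game_event(self):
        # Flush everything dirty in a single DB transaction,
        # triggered by the App when the Game Event ends
        self.db.write({k: self.state[k] for k in self.dirty})
        self.dirty.clear()

db = MockDb()
cache = WriteBackCache(db)
for i in range(50):                        # 50 in-game mutations...
    cache.update("player:alice", {"gold": i})
cache.end_of_game_event()
print(db.write_tx_count)                   # ...but only 1 DB write transaction
```

Note that it must be a write-back (not write-through) cache for the DB-load savings to materialize – a write-through cache would still produce one DB write per mutation.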



This approach tends to work pretty well at least for social games in a Web-Based Deployment Architecture (which we’ll discuss in Chapter 9) – and may work for medium-paced games such as casino games too. On the other hand, for really fast-paced games (especially simulations) this model won’t really work, because of (a) the latencies involved in retrieving the state, and (b) the enormous traffic between the Stateless Apps and the In-Memory Cache.

Scaling System Based on Disposable-Stateful-Apps

In some cases (in particular, for stock exchanges) there is a firm requirement to have all modifications to the state of our system Durable (which means that they should go to the DB; there is no way around it).

In such cases, and if latencies are important – a kind of “Disposable-Stateful-App” can be used. The point here is to have a more-or-less usual Stateful App, but one which merely serves as a read-only cache for DB information; as a result – in case of a crash it becomes trivial to restore the data from the DB (which in turn makes such Stateful Apps disposable (in Docker-speak – ephemeral)).

An example of such a system is shown in Fig 8.4:

This approach is very similar to the one shown in Fig 8.1 (the one for Stateless Apps) – except that Apps are no longer stateless <wink />. However, while Apps in Fig 8.4 are Stateful – their state is merely a read-only app-level cache, so in case of App crash (or App relocation/creation) the state can be easily reconstructed from the Database. While this approach does not reduce DB load compared to purely Stateless Apps <sad-face />, it does improve latencies significantly (which can be a Big Plus for stock exchanges etc.).
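The read-only-cache-plus-trivial-rebuild idea can be sketched as follows (hypothetical names, mock DB; a real stock-exchange App would also subscribe to DB-change notifications to keep the cache fresh):

```python
class MockDb:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.read_count = 0

    def read(self, key):
        self.read_count += 1
        return self.rows[key]

class DisposableStatefulApp:
    def __init__(self, db):
        self.db = db
        self.cache = {}            # read-only cache of DB data; safe to lose

    def get(self, key):
        if key not in self.cache:  # cache miss -> one DB round-trip
            self.cache[key] = self.db.read(key)
        return self.cache[key]     # cache hit -> no DB latency at all

db = MockDb({"instrument:ACME": {"last_price": 42}})
app = DisposableStatefulApp(db)
app.get("instrument:ACME")         # warms the cache (1 DB read)
app.get("instrument:ACME")         # served from memory, DB untouched

# "Crash": throw the whole instance away and rebuild from the DB -
# nothing was lost, because the App's state was never authoritative
app = DisposableStatefulApp(db)
print(app.get("instrument:ACME"))  # same data, freshly re-read from DB
```

Because the cache is read-only, all writes still go straight to the DB – which is why this model keeps full Durability (and the DB load) of the stateless approach, while cutting read latencies.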

BTW, in a certain sense this approach shares some ideas with Front-End Servers as discussed in Chapter 9 (in a sense, Front-End Servers can be seen as read-only caches of “master” state published by the source, too).

From a scaling point of view, this model is another “hybrid” between the Stateless-App and Stateful-App models. In particular, with such a Disposable-Stateful-App model:

Like with Stateless-Apps, we cannot reduce DB load <sad-face />

Like with Stateful-Apps, we do reduce latencies

Like with Stateless-Apps, we do NOT sacrifice Durability.

Like with Stateless-Apps, we can Load-Balance only the incoming requests (or players) – and there is no need to Load-Balance the Stateful-Apps (they’re both disposable and interchangeable).

Stateful vs Stateless

With all the different scalability models discussed above (BTW, each of them has its own niche where it is The Right Thing To Do™), it would be nice to have a simple guideline on where to start. While I don’t pretend to have a definitive answer which will work in all scenarios, from my experience, I’d say that the following qualifies as a reasonably good starting point for your analysis:

If limitations to Durability are not a concern, and we can save at least 3-5x of DB load by using In-Memory State – we should go for it!

One obvious solution in this direction is to use Stateful Apps. This approach does work – but has quite a few complications. In a sense – when moving from Stateless Apps to Stateful Apps, we’re trading DB scaling complications (which are typical for Stateless Apps) for App scaling complications (typical for Stateful Apps). From my experience, such a trade-off is well worth it.

On the other hand, as discussed above, in some cases (in particular, if the game is not too fast) we can both reduce DB load and avoid Stateful Apps (via using an In-Memory Write-Back Cache). Still, it is not a silver bullet (and won’t really work for most fast-paced games such as simulations).

If 100% Durability is a requirement (such as for stock exchanges) – then the choice becomes less obvious.

If optimizing latency is a requirement – some kind of Disposable-Stateful-Apps is likely to be necessary. Personally, I’ve architected a stock exchange on top of such Disposable-Stateful-Apps (which can be seen as being along the lines of a usual game, but with DB commits on each trader action) – and with very significant success too.

If optimizing latencies is not really needed (which includes pretty much all polling architectures) – then a classical Client-Server web architecture will do.



[[To Be Continued…

This concludes beta Chapter 8(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with stock exchanges in between)”.

Stay tuned for beta Chapter 25(c), where we’ll discuss System Monitoring]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.