

Author: “No Bugs” Hare Follow: Job Title: Sarcastic Architect Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,

Calling a Spade a Spade, Keeping Tongue in Cheek

[[This is Chapter 20(d) from “beta” Volume VI of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “ Book Beta Testing “. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

After we finished with all the preliminaries, we can now get to the interesting part – implementing our transactional DB and DB Server. We already mentioned implementing DB Server briefly in Chapter VII, but now we need much more detailed discussion on this all-important topic.

“Transactional / operational DB is a place where all the automated decisions are made about your game (stock exchange, bank, etc.)First of all, let’s re-iterate what we’re speaking about. Transactional/operational DB is a place where all the automated decisions are made about your game (stock exchange, bank, etc.). It stores things such as player accounts, with all their persistent attributes etc. etc.; it also stores communications related to payment processing, and so on, and so forth. And “DB Server” is our app handling access to DBMS (as noted in Chapter VII, I am firmly against having SQL statements issued directly by your Game Servers/Game Logic, so an intermediary such as DB Server is necessary).

As discussed above, ACID properties tend to be extremely important for transactional/operational DB. We don’t want money – or that artifact which is sold for real $20K on eBay – to be lost or duplicated. For this and some other reasons, we’ll be speaking about SQL databases for our transactional/operational DB (while it is possible to use NoSQL for transactional/operational DB – achieving strict guarantees is usually difficult, in particular because of lack of multi-object ACID transactions in most of NoSQL DBs out there, see discussion in [[TODO]] section above).

And now, we’re finally ready to start discussing interesting things 🙂 .

Multi-Connection DB Access

As it was mentioned in Chapter VII, there are two very different approaches to organizing the access to your DB: multi-connection one and single-write-connection one. Let’s start with much more usual (and IMO quite deficient) approach of having multiple database connections to your DB.

Multiple DB connections are so common out there, that very few people will even think about considering anything different. The idea is on the surface – we have multiple connection to the DBMS, and throw all the requests we may have, at the database, and it will handle these requests for us. More importantly, scalability is not expected to be our problem – scaling is presented as a mere problem of buying more hardware (though this is not really the case, see below).

In spite of the multi-connection DB access being pretty much ubiquitous, it needs to be noted that it has quite a few shortcomings:

As soon as we get two concurrent transactions (of writing/modifying kind) – we need to deal with transaction isolation (discussed above in [[TODO]] section). “ Dealing with transaction isolation is very far from being a picnic In a sense, dealing with transaction isolation at most common Read Committed level is akin to writing multithreaded code, with all the related problems. In particular; There is a risk of not locking some data you need to lock – leading to rarely-occurring and impossible-to-test invalid data modifications within your DB. Moreover, due to racing/non-deterministic nature of these problems: These problems are non-reproducible and non-testable 🙁 These problems tend to hide for a long while, and then – when the load goes over certain threshold – their probability tends to grow near-exponentially. In other words – there can be a problem (usually several dozens of them) sitting quietly and just waiting to hurt you badly on your Big Day (Tournament of the Year etc.) Tracing the problem later when it occurs in the wild is next-to-impossible too (well, except for thorough code review, but this can take a loooooong while). There is also a risk of deadlocks. Note that DB deadlocks are not THAT devastating as multi-threaded deadlocks (one of deadlocked transactions will be eventually rolled back), but they’re still not a picnic. To deal with them, project usually determines a well-defined order of obtaining locks, and as long as everybody within the project follows this order – you’re more or less ok. 1 On the other hand, if even one query violates this order – you’ll get deadlock sooner or later. And once again this problem is racing/non-deteministic, so the non-reproducible/non-testable/hiding-for-a-long-while/non-debuggable complications listed above – are still hiding in the dark waiting for the worst possible moment to manifest themselves 🙁 BTW, even if you have Serializable isolation level – you will still either experience locks (write/write ones), or one of your transactions which would cause a write/write lock – will be rolled back. “ there are two VERY STRONG prerequisites to handling these problems you DO need a very experienced DB person on your development team. Moreover, this person should have an experience when working with highly loaded OLTP-style DBs (at least at millions-per-day level, with transaction monitors etc. etc.) – and such people are Really Difficult to find 🙁 . And without such OLTP real-world experience – there is a chance that this person simply didn’t run into some (or most) of the problems with highly loaded OLTP DBs (and in practice the list of potential problems goes well beyond those things mentioned above) – which can easily be devastating for your whole project. You absolutely MUST separate your DB code from your Game Logic code (well, as discussed in Chapter VII, you need to do it regardless of relying on multiple connections and transaction isolation levels 🙂 ). Thinking about Game Logic and writing transaction-isolation-aware SQL statements at the same time is a surefire recipe for a mortgage-crisis-size disaster 🙁 . As mentioned in Chapter VII, interface between Game Logic code and DB Server code (so-called DB Server API) MUST be expressed in terms of Game Logic (and not in terms of SQL). Moreover, every request MUST correspond to one single transaction (without any chance to leave some things changed but uncommitted; it would cause too many locks to stay for unspecified time, hitting performance and raising probability for deadlocks dramatically). Having this clean separation between Game Logic Server and DB Server, will, in particular, allow your experienced DB developer to concentrate on DB stuff (including dreaded transaction isolation levels) – while keeping everybody else away from this dangerous field. “ In spite of what marketing materials of your RDBMS will tell you, for real-world systems scalability with multi-connection DB access is never linear. Moreover, usually it is MUCH worse than linear The reason for this phenomenon is that any multi-connection DBMS has quite a few resources which different DB connections are fighting over (one of the most important resources where it is not really possible to avoid fighting for – it is DB log file, which needs to be written sequentially and with flush()/sync(), ouch!) And any fight over resources automatically leads to non-linear scalability. That being said, you MIGHT be able to scale your OLTP DB to multiple servers (again, if you have that OLTP DB guru) – but this is not going to be easy, that’s for sure.

Having too many concurrent requests causes the whole thing to slow down. To address this problem, TP Monitors are usually used (see [[TODO]] section below)

While pretty much each and every DB vendor will say you that with their RDBMS you’ll get perfectly linear scaling – well, it is not really the case (to put it mildly). As a Really Big and Fat Rule of Thumb, real-world DBs do NOT scale in linear fashion; it means that even with perceived scalability being inherent for multi-connection approaches, in practice you will still need to address scalability at app level sooner or later (and from what I’ve seen – this moment won’t be too different from the moment when you need to do it with a single-writing-DB-connection, see discussion on it below).

MTBF Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a system during operation — Wikipedia — 2 and of the order of 1e10-2e10 write transactions between RDBMS crashes for a multiple-connection DB app. This is of course, just an anecdotal evidence (and your RDBMS may differ in this regard) – but on the other hand, it correlates with an observation that in single-connection mode, there is MUCH less potential for the races within RDBMS; with this in mind – well, I wouldn’t write this observation off without any specific evidence to the contrary for your specific DBMS. We’ll discuss more Server-Side MTBFs in Vol. 3 (tentatively Chapter XXXI on Deployment Architecture Take 2).

As you can see – I’m not a big fan of the multi-connection approach for your OLTP database (pretty much for the same reason why I’m not a fan of massive multi-threading, though for DBs these problems are admittedly a tad milder than for multi-threading). Still, I admit that you CAN write OLTP with multi-connections.

That is, as long as you have that OLTP-DB-guru with real-world experience with millions-writing-transactions-per-day DBs.

Without such a person (and no, 20-year experience with multi-terabyte mostly-read DB is not a substitute) – multi-connection DB access for your OLTP database is very risky at the very least. To make things worse (and for reasons described above) – your DB Server will pretend to work while in testing and in “beta”, but after deployment and after reaching certain level of load – it can easily become a never-ending story of lost money/artifacts, player complaints etc. etc. etc. 🙁

On TP Monitors

“if we try to run all those outstanding requests exactly in parallel – we’ll have a severe performance degradation due to excessive thread context switches and fighting for resourcesOne practical note in case if you still want to go for multi-connection DB access: more likely than not, sooner or later you’ll need to run a “transaction processing monitor” a.k.a. TP Monitor. The idea behind TP monitor is simple – if we try to run all those outstanding requests exactly in parallel – we’ll have a severe performance degradation due to excessive thread context switches and fighting for resources. In other words (putting it very roughly) – TP monitor makes sure that only a few requests are executed against DB at any given time (in practice, number of requests to be run in parallel, is related to hardware where your DB is running, such as number of disks and/or number of CPU cores).

These days, two most popular TP monitors out there are Microsoft COM+ and Oracle Tuxedo (formerly BEA Tuxedo).

On the other hand, being a big fan of Reactors, I usually prefer to build my own kinda-TP-monitor from them. This approach is discussed in more detail in Chapter VII, but very shortly it goes along the lines of having several DB Server Working Reactors (with one connection to DB each), and one DB Server Proxy Reactor, which receives all the requests from the rest of the system, and forwards them to a “free” (least loaded) DB Server Working Reactor. Overall, there is no rocket science involved in TP Monitors, and most of the time I feel that DIY is quite appropriate here (though as always, YMMV).

The Ultimate DB Heresy – Single-Write-Connection DB Access

Ok, now as I’ve sufficiently criticized the multi-DB-connection approach ;-), it is time to speak about the single-write-connection one. In original form it is very simple – we have one DB Server app, which keeps a single DB connection; this DB Server app receives incoming requests, chooses relevant prepared SQL statement, binds its parameters and issues an API call to execute this SQL. On receiving the reply, it packages obtained result to form the reply, sends the reply back to requestor, and waits for the next incoming request.

One important thing to be noted here is that we’re speaking about single WRITE connection; in parallel to this single write connection, it is perfectly viable to run any number of read-only connections. On the other hand, for performance reasons (and to avoid locking) you’ll want to use the lowest transaction isolation level provided by your RDBMS; in other words, if you’re using RDBMS with a lock-based concurrency (see discussion in [[TODO]] section above), these parallel read-only connections may need to use Read Uncommitted transaction isolation level; for MVCC-based RDBMS (with a notable exception being MySQL+InnoDB) it will usually be Read Committed. This has quite a few implications, but for requests about historical data – and 99% of all the reporting falls under this category – both these isolation levels will work perfectly fine.

As we can see, with single-write-DB-connection there is no concurrency at all – at every given moment there is one and only one SQL statement executed. It means that all the strange problems with isolation levels mentioned above, go away by themselves with absolutely zero efforts from our side (and regardless of transaction isolation level used(!)).

“Of course, this simplicity comes at a price – and this price is an apparent lack of scalability.Of course, this simplicity comes at a price – and this price is an apparent lack of scalability. In fact, this apparent lack of scalability is a Big Fat Reason why 99% of people out there consider single-write-connection stuff for OLTP DBs as an ultimate DB heresy (i.e. you will be told “it cannot possibly work”). Still, I’ve seen heavily loaded games (and stock exchanges too) working exactly this way, more than once. Moreover, at some point, one of such DBs was referred to as “the most loaded DB/2 instance on Windows we know” – by no less than people from IBM Toronto Labs (this is where DB/2 UDB was developed at the time – and to the best of my knowledge, is still developed). So, in real world things don’t look that grim for single-write-connection DBs; let’s see the reasons why this gap between theory (which tells “it cannot possibly work”) and practice (which proves that this statement is wrong by demonstrating a counter-case ;-)) exists.

Single-Write-DB-Connection: per-connection Performance

First, let’s start with discussing performance of single-write-DB-connection architectures. While, strictly speaking, no kind of performance can be used as a substitute for scalability, it certainly can affect the point when we need to start thinking about scalability.

App-Level Cache – a BIG help performance-wise

As soon as we have exactly one DB connection which can modify DB3 – we can add an app-level cache, and we can make this app-level cache perfectly coherent with DB. This becomes possible because with single-write-connection DB, we have 100% of the information about all the DB modifications – simply because we’re the only ones who is causing these modifications 😉 .

“In practice, I’ve seen these app-level caches to help a LOT with performanceIn practice, I’ve seen these app-level caches to help a LOT with performance; I’ve seen the overall performance improvements from using app-level caches of the order of 5x-10x (!). This result is not that surprising if we take a look at the processes involved in both cases. If we’re using app-level cache (the most popular such cache will be a cache of PLAYERS table), then all we need to do for getting necessary player data (which is needed all the time and then more) – is to calculate hash of the player ID, and then to get the Player in-memory structure by this hash. Overall, we’re speaking about 100-200 CPU clocks (or of the order of 0.1μs).

On the other hand, if going for the same thing to DBMS – we need to (a) bind a prepared statement, (b) to issue an API call, (c) the API call will marshal our data, then (d) it will go to a different process over some IPC (most likely – with usermode-kernelmode-usermode transition on the way, and at least one thread context switch), there our request will be (e) unmarshaled, (f) an execution plan which corresponds to the prepared statement will be found, then (g) execution plan will be executed – getting and parsing (!) several index pages, and at least one data page, then (h) data page will be parsed, (i) data of our user will be retrieved, (j) marshaled, (k) sent back (causing once again usermode-kernelmode-usermode transition and another thread context switch) – then it will be (l) unmarshaled, and (m) delivered to our app. As a result, there should be no surprise that going to DB takes MUCH longer than checking things in-app; in practice, for DB access we’re usually speaking about 10-100μs or so – a 100x-1000x difference from a search in app-level cache. The reason why it doesn’t speed things up more than 5x-10x I’ve mentioned earlier, is because there are other things which need to be done via DB anyway (in particular, DB transactions do need to go through DB synchronously as long as we’re going to provide durability – though see below on kinda-write-back caching).

Still, even 5x-10x of overall improvement which can be observed in practice, is a Very Good Number :-). It should be noted, though, that app-level cache is not the first thing you should do to optimize your DB Server; we’re discussing app-level cache here more as a concept to understand and justify single-write-DB-connection approach as a valid temporary step before full-scale scalability will need to be reached. The whole process of optimizing your DB (including indexes, physical DB layout and RAID levels, denormalisation, and app-level caching) will be discussed in more detail in Vol.3, tentatively Chapter [[TODO]].

Kinda-write-back App-level Cache with Durability

[[TODO: mention equivalence to “Batching Statements” in Hibernate]]

“It is further possible to make this app-level cache a sorta-write-back cache – and providing 100% correct ACID-style Durability too.BTW, this 5x improvement mentioned above, is for write-through app-level cache. It is further possible to make this app-level cache a sorta-write-back cache – and providing 100% correct ACID-style Durability too. This can provide additional performance benefit (however, I didn’t try it myself, so I cannot tell how much additional benefit can be obtained this way).

The idea in this case goes along the following lines:

We have some transaction (let’s name it Larger Transaction) running against DB most of the time; it is not a really logical transaction – but a way to achieve durability when we want it.

When an incoming request comes in (still in perfectly serial fashion), we’re executing it in the context of the Larger Transaction – but are NOT committing it (not yet). In the process, we’re checking pretty much everything from our app-level cache; in particular, we change all the data which corresponds to DB constraints (which BTW means that we can drop quite a few constraints at DB level, optimizing it further). Moreover, as our app-level cache is guaranteed to be coherent, we can check pretty much everything. When we think that transaction succeeds, we also modify our app-level cache.

We DO prepare replies to the requests, but we do NOT send them (again, not yet); they sit as “Delayed Replies”

At some point, we’re deciding to commit the Larger Transaction. This can happen because we just don’t want to Delay Replies any longer or because we’re about to issue a “risky” transaction (the one which can fail at DB level – for example, because we don’t have enough information in cache to check everything we need).

With us performing all the necessary checks in advance from app-level cache, 99.9999% of the time the Larger Transaction will succeed. Then, if the transaction succeeds – we can release Delayed Replies to their respective receivers and go ahead If the transaction doesn’t succeed (for whatever reason, but extremely rare) – we can: Drop all the Delayed Replies. Revert changes to app-level cache made since the start of Larger Transaction – to do this, we may need to keep a kind of app-level “rollback list” with “old” values of the items-modified-since-the-start-of-Larger-Transaction. retry processing of the incoming messages (the ones which were a part of the Larger Transaction, and which have their respective replies delayed) one by one with transactions one by one.



This whole thing provides strict Durability, because we don’t release replies until appropriate transactions are committed and Durable at the DB level (in a sense, Delayed Replies are just “tentative replies”). On the other hand, in this processing model we’re saving a LOT on committing transactions – and the cost of commit is very high latency-wise (see discussion below). As noted above – I haven’t tried this particular kinda-write-back-cache model in real world, so I cannot provide an estimate of “how much it may help”; on the other hand, I’d make a not-so-uneducated guess that (a) there is a reasonable chance of saving 1.5x-2x, and (b) that most likely, there isn’t much sense in combining more than 5-10 requests into one Larger Transaction.

[[TODO: rename “Larger Transaction” => “Group Commit”; also refer to Infrastructure-Level implementation of DIY Group Commits, discusssed in Vol. VII’s chapter on Database Optimizations Take 1]]

[[TODO: DIY fault tolerance]]

Performance Caveat: Latencies and More Latencies

“single-write-connection DB configurations are very sensitive to latencies.When speaking about (IMO Really Good ;-)) performance of single-write-connection DBs, it would be unfair not to mention one important caveat of such configurations: single-write-connection DB configurations are very sensitive to latencies.

Latencies we’ll be talking about, are two-fold: communication latency and “DB log flush()/sync() latency”. The first one is simple – it is a latency of communication between our DB Server app and DBMS; dealing with it is also trivial – let’s just put our DB Server app onto the same hardware box as DBMS, play a bit with connectivity options to find the best one, and we’re done.

The second latency (“DB log flush()/sync()” one) requires more elaborate explanation. To understand it, we’ll need to describe how your usual production-ready RDBMS works with transactions:

while we’re issuing SQL statements within a transaction (i.e. before COMMIT), RDBMS is writing them into two places: (a) to the DB pages which actually hold data; 4 and (b) to the all-important DB log file (a.k.a. transactional log). It should be noted that neither of these writes requires syncing with disk (yet); in other words – both these writes are “lazy” and do NOT require for the app-requesting-SQL-statement-to-be-executed, to wait for disk write. Hence, no additional latency due to disk syncing/flushing is involved here.

and (b) to the all-important DB log file (a.k.a. transactional log). It should be noted that neither of these writes requires syncing with disk (yet); in other words – both these writes are “lazy” and do NOT require for the app-requesting-SQL-statement-to-be-executed, to wait for disk write. Hence, no additional latency due to disk syncing/flushing is involved here. So far so good, but at the moment of COMMIT, situation becomes very different. While writes to DB pages can be (and usually are) still “lazy”, DB log has to be flush()ed/sync()ed with disk. This tends to represent a significant slowdown for single-write-DB-connection apps – that is, unless we’re careful with our choice of underlying hardware. In particular: BBWC Battery-backed write cache, a technique for accelerating disk writes — Wikipedia — flush()/sync() with SSD (again, without a BBWC RAID) – depends on SSD heavily, but I’ve seen the order of magnitude for DB transactions committed to SSD (and it is also consistent with [steveshaw]), is <= 1ms. Much better, but it can be still improved 🙂 flush()/sync() with BBWC RAID. BBWC (=”Battery-Backed Write Cache”) is a very nice hardware feature – it allows to have write-back cache on your RAID PCIe card, and with all the guarantees against crashes and power-downs (if your server crashes, then “Battery-Backed” part comes into play, preserving your data for 2-3 days). What’s important from our perspective, it allows to have very minimal latencies for your sync()s: BBWC RAID card reports data as written as soon as it is written to that (battery-backed) on-card RAM. In fact, we’re speaking about PCIe latency and not much more than that; it is extremely difficult to get faster than that, and DB transactions committed to BBWC RAID card, are within a few hundred microseconds; similar results should be achievable when using NVMe (though I haven’t try it myself yet 🙁 ). Oh, BTW – if BBWC RAID is in use, it doesn’t matter (at least for write latency purposes) whether you’re using SSD or HDD behind it. SAN A storage area network (SAN) is a network which provides access to consolidated, block level data storage. — Wikipedia —



As a result, when working with high-performance single-write-connection DBs, we should aim for in-server HDDs/SSDs (or direct-attached SCSI/SATA storage), and a BBWC RAID for our server hardware (more on RAIDs in [[TODO]] section below); fortunately, it is rarely a problem (BBWC RAID cards are readily available from all major server manufacturers, and are often – though not universally – available from hosting ISPs too). With cost of BBWC RAID cards being of the order $1K-$2K (that’s a price from major server manufacturers), and you need it only for a very few DB Servers, they’re not going to eat too much out of your budget.

[[TODO: refer to group commits, including DIY group commits in Vol. VII’s chapter on DB Optimizations Take 1]]

Single-Write-DB-Connection: Real-World Performance

Ok, all this theorizing is good – but what about real-world numbers? I have a few of them up my sleeve to share ;-).

I’ve seen a few real-world systems (with hundreds of different real-world OLTP transactions happening, most of transactions were modifying multiple rows, adding even more rows for audit trails, etc. etc. etc.) based on single-write-DB-connections.

One of such real-world systems was consistently processing over 30M real-world write transactions/day over one single DB connection (and without an in-memory DB), supporting ~100K simultaneous players

(that is, after optimizing DB and adding app-level cache for USERS table). Average time per transaction was in the range of 800 μs (this would allow to process 86400 seconds/day*1000 milliseconds/second /0.8 milliseconds/write transaction ~= 100M write transactions / day if load is even, but due to intra-day load variations and not-100%-load during peak time, it was only 30-50M write DB transactions per day in practice). Number of different SQL statements was in hundreds, and average number of rows modified and/or added per ACID transaction, was (very roughly) around 10 (these included rather complicated inter-player interactions, and various audits). In short – it was a kind of system you can expect to run yourself (opposed to artificial tests such as TPC-C which are not exactly representative of real-world systems); in fact, it was serving a game with hundreds of thousands of simultaneous players – and BTW was by far the most stable among direct competition too.

And knowing how it was implemented and the data involved – I feel that there was nothing too special about the data they processed. In other words –

I’m pretty sure that your OLTP DB can achieve the same order of magnitude of performance on a single-write-DB-connection.

Of course, YMMV (let’s say, your results can easily differ by half an order of magnitude5), and of course, achieving this kind of performance will take a lot of effort (including both DB-level optimizations and app-level cache) – but I’m confident that for most of OLTP DBs out there it is perfectly doable.

Single-Write-DB Connection: Scalability

“Of course, Performance (even pretty good one as mentioned above) is NOT a substitute for ScalabilityOf course, as noted above, Performance (even pretty good one as mentioned above) is NOT a substitute for Scalability. In other words – at some point, even 30M DB transactions/day won’t be enough. On the other hand, there is a way (which has been used in real-world too) to implement Scalable system based on multiple Single-Write-DB-Connection(s) in a share-nothing fashion; in a sense – it can be seen as a rather close cousin of three microservices-related patterns: Database-Per-Service pattern (see [Fowler] and [Richardson.DatabasePerService]), accompanied by Event-Driven Architecture (see [Richardson.EventDrivenArchitecture]) and Application Publishing Events (see [Richardson.ApplicationEvents]). Moreover, there is a way (also shown to be successful on a serious real-world system) to migrate from simplistic single-write-DB-connection architecture to this Share-Nothing lots-of-single-write-DB-connections one gradually. We’ll discuss all of it in a jiff :-).

[[TODO: refer to https://abdullin.com/sku-vault/2017-07-15-high-availability-and-performance/]]

[[TODO: OLTP and app-level batch processing; snapshots]]

[[To Be Continued…

This concludes beta Chapter 20(d) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 20(e), where we’ll continue discussion of single-write-connection DBs – in particular, with respect to achieving scalability.]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.