

Author: “No Bugs” Hare Follow: Job Title: Sarcastic Architect Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,

Calling a Spade a Spade, Keeping Tongue in Cheek

[[TODO: move this section into “NoSQL vs SQL for MOGs”]]

NB: this picture implicitly refers to a 1st round of comparison between SQL and NoSQL, available in a previous post NoSQL vs SQL for MOGs.

[[This is Chapter 20(i) from “beta” Volume VI of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “ Book Beta Testing “. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

When comparing SQL to NoSQL – we need to set a certain number of criteria for comparison; otherwise – these DBs are soooo much different that we won’t be able to compare them properly. OTOH, we should try to come up with more or less practical criteria. Saying “hey, it is not document-oriented” or “hey, it is not normalized” are just synonyms of “{NoSQL|SQL} is the best one” and won’t be of any use; on the other hand – saying “hey, it leads to O(N^2) vs O(N)” is a very practical argument, which (IF it happens) should be taken into consideration.

So, what do we really want from our DB? I would say that from a 50’000 feet point of view – and speaking about firm requirements – we want just four things:

Identification. We certainly want to identify individual records in our DB. While having a table without a primary key is technically possible in SQL – it is universally recognized as a Bad Idea (and for a good reason too). Overall – pretty much nobody in his right mind argues with having a unique identifier for each record.

We certainly want to identify individual records in our DB. While having a table without a primary key is technically possible in SQL – it is universally recognized as a Bad Idea (and for a good reason too). Overall – pretty much nobody in his right mind argues with having a unique identifier for each record. No Undue Duplication. The next thing we want is to store data without unnecessary duplication. 1 Let’s note that we’re not speaking about formalities such as normal forms (though they’re one of the ways to implement this “no undue duplication” rule). Rather, we’re speaking about two very practical issues which arise from undue duplication and which we want to avoid: First of all, saving the same piece of data a million times wastes space. Pretty obvious – and practical too. “ while duplication MAY indeed improve performance – undue duplication also MAY hit performance pretty badly However, even more important is the second issue: if there is undue data duplication – we can easily get into consistency issues. I.e. if we have N separate copies which are meant to be consistent – we will need to update our DB in N places (and not to forget any of them). Ouch, and double-ouch.

The next thing we want is to store data without unnecessary duplication. Let’s note that we’re not speaking about formalities such as normal forms (though they’re one of the ways to implement this “no undue duplication” rule). Rather, we’re speaking about two very practical issues which arise from undue duplication and which we want to avoid: Perform reasonably well for those queries we want to throw at it. This requirement is a rather obvious one; however, I want to emphasize that it is all defined by specific queries we have against our DB (and those we MIGHT have in the future). If some queries are impossible as a matter of principle – we don’t really care whether there is a possibility to perform them well. Such situations are quite common specifically in OLTP databases (where the set of queries tends to be very limited) and admittedly much more rare for reporting/analytics (especially if we’re speaking about ad-hoc queries).

Most importantly, as far as I can tell – this is quite an exhaustive list (let me know if you think I am missing something!). As for ACID transactions – it is an all-important property for OLTP DBs; on the other hand – for reporting DBs it can be usually relaxed to BASE quite easily. Overall, ACID properties and areas where they’re necessary, are discussed in detail in [[TODO]] section, so we won’t address them for the time being.

Litmus Use Case – Player and His Inventory

Now, we need some kind of a more-or-less-practical use case to see how these criteria can be addressed for different types of SQL/NoSQL DBMS. As we’re speaking about games – I’ll give up on classical “library” example (the one which is usually done via a link table between AUTHORS and BOOKS in classical Codd-style SQL) – and will consider a more relevant example of storing PLAYERS instead.[[TODO: think about extending example to add that there other entities (shops, ground) where inventory items can reside]]

Let’s take a look at a system where each of the PLAYERS has:

ID (for our current purposes – we won’t argue about it being surrogate vs natural)

DISPLAYNAME

MONEY (doesn’t matter for us whether it is in-game money or real money)

INVENTORY, an (unordered) set of INVENTORYITEMS, each of which contains: DISPLAYNAME TYPE POWER WEIGHT VALUE INVENTORYSLOT



NB for SQL developers: do NOT say I am crazy (at least not yet): this is NOT intended as a DB structure, but rather as a high-level description of “what we want to store”.

Let’s further assume that some of the INVENTORYITEMS are pre-defined (i.e. taken from the list rather than generated on the fly – of course, except for INVENTORYSLOT field), and some are generated (and moreover, that they’re random enough so they cannot be enumerated).

As for types of queries we want – first of all, we obviously need to manipulate and retrieve the PLAYER (and her INVENTORY) by PLAYER ID. However, other than that – let’s assume that we never want to query our DB by fields of the INVENTORYITEMS, with one exception. Let’s consider scenario when we DO want to find out who owns very few very-rare artifacts (i.e. there is a pre-defined list of very-rare artifacts – and we want to query who owns them; all other INVENTORYITEMS are of no interest for such a query). NB: while this query may look artificial and unimportant – you can be pretty sure that sooner or later somewhere in your DB you will have some queries of similar nature – i.e. going beyond pure key-based access.

“we’re speaking about a DB with millions of PLAYERS (and taking gigabytes in size), so any execution plans which require O(N) operations – such as table scans (or equivalent) – don’t satisfy our “perform reasonably well” requirement.Oh, and just to clarify – we’re speaking about a DB with millions of PLAYERS (and taking gigabytes in size), so any execution plans which require O(N) operations – such as table scans (or equivalent) – don’t satisfy our “perform reasonably well” requirement.

Armed with this use case and the criteria above – we can start comparing different approaches to DB design under different DMBS – from classical Codd-style SQL to Unstructured SQL (with SQL-with-XML and Structured NoSQL in between).

Classical Codd-style SQL

The key, the whole key, and nothing but the key, so help me Codd

— Unknown

First of all, a disclaimer: the models discussed below, are not the only ones possible for the DBMS we’re discussing; however – they are intended to be rather typical given the current traditions and practices.

For classical Codd-style SQL (the one which dominated database development at least from 70s to 90s), it would look as follows:

PLAYERS table with PLAYERID being PRIMARY KEY, and DISPLAYNAME and MONEY as fields

ITEMDICTIONARY table, with a list of all the pre-defined ITEMS. Has ITEMID as a PRIMARY KEY, and DISPLAYNAME, TYPE, POWER, WEIGHT, and VALUE as fields; also has ISRAREARTIFACT field to define whether it is an artifact which can be queried. This ITEMDICTIONARY table contains all pre-defined items – and doesn’t really change (that is, unless we’re changing game rules).

“ PLAYERINVENTORYITEMS table is a typical 'link table' saying which pre-defined items are owned by PLAYER 2 saying which pre-defined items are owned by PLAYER. As it is typical for “link tables”, PLAYERINVENTORYITEMS contains PLAYERID and ITEMID as fields (and has PRIMARY KEY as a combination of both fields). In a not-so-typical-for-link-tables-manner, it also contains an INVENTORYSLOT field (to identify where specific ITEMID item goes within inventory of player PLAYERID). NB for those not-so-familiar with SQL: “link tables” such as PLAYERINVENTORYITEMS are routinely used in classical SQL for many-to-many relationships such as PLAYERS-to-ITEMDICTIONARY. Very briefly – “link tables” are necessary within classical SQL to comply with “no undue duplication” requirement (known as “normalization” in SQL-world).

saying which pre-defined items are owned by PLAYER. As it is typical for “link tables”, PLAYERINVENTORYITEMS contains PLAYERID and ITEMID as fields (and has PRIMARY KEY as a combination of both fields). In a not-so-typical-for-link-tables-manner, it also contains an INVENTORYSLOT field (to identify where specific ITEMID item goes within inventory of player PLAYERID).

Ok, by this time we dealt with those ITEMS which are coming from pre-defined lists. However, what about generated ITEMS? To support them, we’ll need to add one more table:

GENERATEDITEMS table, with PLAYERID, INVENTORYSLOT, as well as DISPLAYNAME, TYPE, POWER, WEIGHT, and VALUE as fields. PRIMARY KEY is (PLAYERID, INVENTORYSLOT).3

If we take a look at this structure – we’ll be able to observe that:

It can store everything we need

Each record is uniquely identified (well, we do have all those PRIMARY KEYs)

It is free of undue duplication4 (this one is not that obvious, but after honestly trying to find duplicate data in such a DB – we aren’t likely to find it, except for very rare purely random collisions).

However, to meet “performing reasonably well” requirement – we will need to analyse a bit further. As it was defined in our use case – we need to provide two “performing reasonably well” queries:

Access by PLAYERID. This type of queries (assuming that all PRIMARY KEYS are also indexes, which stands for all the RDBMS I know) is already completely index-based – and so (most likely) already qualifies as “performing reasonably well”.

“ for such a query to be efficient – we need to make sure that we have an index by ITEMID when specifying our PRIMARY KEY for PLAYERINVENTORYITEMS, we need to make sure that: btree index (and not hash index) is used to implement PRIMARY KEY, and PRIMARY KEY has ITEMID field as the first one two these properties combined make sure that index-used-to-implement-PRIMARY-KEY, will also work as an index-on-ITEMID-alone to have a separate index for PLAYERINVENTORYITEMS by ITEMID having this index separate may also allow to take it as a “sparse” index (also known as “filtered index”) – further improving performance of this particular query. 5 It is when implementing such “sparse”/”filtered” index, ISRAREARTIFACT field can become necessary.



Phew. It wasn’t too easy, but we managed to do it . While we used 4 separate tables to implement it – 4 tables are certainly not considered a big deal for an SQL DB (and it will perform well too, both for OLTP modifications, and for those queries which we’ve defined).

SQL with XML

Now let’s take a look what changes if our SQL-based RDBMS allows our tables to have XML fields. Actually, we could try more NoSQL-style with only two tables:

PLAYERS with the following fields: PLAYERID being PRIMARY KEY DISPLAYNAME and MONEY as usual fields INVENTORY is an XML field. This field could contain a list of the INVENTORYITEMS, with each INVENTORYITEM having the following properties: For pre-defined items: <ITEMID> (pointing to ITEMDICTIONARY table), and <INVENTORYSLOT> For generated items: <INVENTORYSLOT>, <DISPLAYNAME>, <TYPE>, <POWER>, <WEIGHT>, and <VALUE>

ITEMDICTIONARY table, exactly the same as for plain SQL (described above).

Again, we can more-or-less easily observe that we can represent everything we need, that we can identify things, and that we don’t have undue redundancy. So, the only thing we still need to do to be certain that the model above achieves our Holy Grail of complying to our requirements – is to make sure that our queries perform reasonably well.

As for the first type of queries – accessing everything via PLAYERID – it will work even better than classical Codd-style SQL.6

“However, the second type of query (finding which of the PLAYERS have specific artifacts) is not that simple.However, the second type of query (finding which of the PLAYERS have specific artifacts) is not that simple. In absence of indexes (other than PLAYERID) any such query over XML will require a table-scan (with parsing each of XMLs on the way – double-ouch!). To avoid this O(N) table-scan – we need an index over the <ITEMID> property of XML INVENTORY field. If our RDBMS allows it – bingo! We’ve complied with our requirements, and can handle queries-we’re-interested-in, without table scans too.

SQL-with-XML: Current Implementation Issues

As we can see, at least theoretically SQL with XML fields and indexes can do what we need it to do (and also we can see that in theory SQL-with-XML can perform not worse and probably even better than classical SQL).

However, all XML-related database technology is still in its infancy by RDBMS standards, and it seems that as of 2016, support and performance of XML queries still lacks . In particular, there are reports that SQL-with-XML performs incredibly poor compared to competition; for example, [SuarezEtAl] measures SQL Server with an XML field losing around 100x7 compared to MongoDB (and references to SQL Server having all XML indexes based on full-text index look very annoying performance-wise too). OTOH, as I look at DB/2 and Oracle documentation ([IBM] and [Oracle] respectively), their respective “XML indexes” and “function-based indexes on XMLType data” seem to be much more like traditional indexes over XML properties, so I’d probably try to use them before claiming that they won’t work fast enough.

To summarize the above:

At this point I don’t see why SQL-with-XML should be necessarily anywhere slower than any alternative technology 8

On the other hand, as of 2016 quality of XML implementations (and especially XML indexing) tends vary significantly

On the third hand  – I expect this performance handicap of SQL-with-XML to be reduced with time.

In other words – while with time I’d expect XML-based SQL to become a very useful and widely used addition to SQL – for the time being, make sure to test XML-based SQL including its performance (and in conditions-resembling-real-world ones too) before using it.

Structured NoSQL

In [[TODO]] section above, we defined Structured NoSQL (such as Mongo-DB). Very shortly – it is a kind of document storage with the document/object stored as a whole, but individually addressable9 fields of the document/object.

“NoSQL will usually call for another denormalisation on top of what we’ve described above for SQL-with-XML.NoSQL will usually call for another denormalisation on top of what we’ve described above for SQL-with-XML. For example, if NoSQL DBMS we’re using, is storing documents as JSON (or, like MongoDB, as BSON) – we can store our data within following tables/collections:

PLAYERS, with PLAYERS documents with the following fields: PLAYERID being a JSON field which is also a key (_id in MongoDB-speak) DISPLAYNAME and MONEY as regular JSON fields INVENTORY is a JSON array consisting of INVENTORYITEMS. Each of INVENTORYITEMS contains: For pre-defined items: ‘ITEMID’ (pointing to ITEMDICTIONARY table), and ‘INVENTORYSLOT’ For generated items: ‘INVENTORYSLOT’, ‘DISPLAYNAME’, ‘TYPE’, ‘POWER’, ‘WEIGHT’, and ‘VALUE’

ITEMDICTIONARY table, with each of the documents containing ITEMID as a key JSON field, and ‘DISPLAYNAME’, ‘TYPE’, ‘POWER’, ‘WEIGHT’, and ‘VALUE’ as regular JSON fields.

Let’s note that in quite a few NoSQL DBs relying on relations from one document to another one (such as ‘ITEMID’ referring to ITEMDICTIONARY table) are quite dangerous because of the lack of the multi-document ACID transactions (more on it in [[TODO]] section); however, in our case it is ok (as long as ITEMDICTIONARY is an essentially constant table, questions of inter-document ACID consistency won’t arise).

Let’s also note that while the above structure was expressed in term of JSON – pretty much the same would work for a column store.

Now about complying to our requirements. Once again, the only problem on the way of compliance – is performance of that query of “which PLAYERS have those rare artifacts?” And once again – it can be resolved by an appropriate index, this time on INVENTORY.ITEMID field (or column) within PLAYER document; once again, having Sparse/Partial/Filtered indexes is a plus performance-wise.

On Further Denormalisation

In quite a few NoSQL databases “best practices” tell that it is better to avoid references (such as an ITEMID reference to a (near-)constant table ITEMDICTIONARY) completely – arguing that it is better to denormalise in the name of speed.

I’d say that at least in theory this logic shouldn’t apply to references to small near-constant tables (as such tables, again at least in theory, can be replicated to all the shards, which would mean no extra work to retrieve them, and they will be cached all the time, so costs of joining will be very very minimal). Whether your NoSQL DB allows to do this kind of replication (and how fast it will work) – is up to you to figure out. If it doesn’t work – it is indeed possible to denormalise the structure above further (actually – denormalise it completely), and to get rid of ITEMDICTIONARY table entirely, incorporating ITEMDICTIONARY data within each of the PLAYERS instead. This complete denormalisation will have its own drawbacks:

First of all, it will increase the size of your data. And (in spite of claims by certain NoSQL projects that disk costs nothing these days and implying that it is the only cost of data being larger) this increase will also hit your caches and as a result – your read performance too.

Second, if you’ll want to change the properties of all the items of one single type – you will be facing quite a challenge. I don’t want to say that you will need it – but if it happens, it is going to be rather ugly.

“Despite these drawbacks, this completely denormalised data model will work (though you’ll need to figure out its performance yourself on your specific NoSQL DB).Despite these drawbacks, this completely denormalised data model will work (though you’ll need to figure out its performance yourself on your specific NoSQL DB).

Unstructured (Key-Value) NoSQL

If speaking about key-value NoSQL – we still can use exactly the same data models as described above for Structured NoSQL. However, when trying to optimize that non-key-based query to find those PLAYERS who currently have that rare artifact – we’ll run into table scan, which in turn fails to satisfy “perform reasonably well” requirement 🙁 .

On the other hand – “pure” key-value stores are rather rare these days, and functionality to implement secondary indexes MAY be provided in one way or another by a primarily-key-value data store. Just as one example – Redis does provide machinery to implement your own secondary indexes (albeit in a rather cumbersome manner), see, for example, [Redis].

Summary

As we’ve seen above – for our litmus test case, we can comply with our requirements on a really wide range of DBMS (ranging from classical SQL-based RDBMS to primarily-key-value Redis, with pretty much everything else in between).

Or, in other words, I tend to rephrase it as follows –

everything we’ll need, can be expressed (and can perform reasonably well) for all types of the DBMS: from classical normalized SQL DBs on one side of spectrum, through SQL-with-XML 10 and Structured NoSQL, and into Key-Value NoSQL on another side of spectrum. 11

It means, in turn, that the differences between data models (discussed above) are not likely to be decisive when making a choice between SQL and NoSQL; the real difference lies elsewhere. In particular, as discussed in [[TODO]] section, for OLTP purposes multi-row ACID guarantees are usually very important; on the other hand, for analytical queries over huge amounts of historical (and eventually-consistent) data tend to help significantly.

[[To Be Continued…

This concludes beta Chapter 20(i) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 20(j), where we’ll conclude Chapter 20 and will discuss how to put the whole DB thing together (for the time being – and also to provide roadmap into the future)]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.