MongoDB’s model architecture in a nutshell

Before taking a tour of the model architecture, it is important to understand the motivation behind it. According to Eliot Horowitz, MongoDB's CTO and co-founder, MongoDB wasn't built from scratch, but rather as an attempt to improve on an existing relational DB product.

“…the way I think about MongoDB is that, if you take MySQL, and change the data model from relational to document-based, you get a lot of great features...” - Eliot Horowitz

The great features he mentions include the ability to store embedded data as sub-documents, thereby reducing the number of JOIN operations, which in turn leads to faster queries.

Moreover, the development process becomes more agile, thanks to the dynamic schema feature and the ability to scale horizontally.

BSON

BSON is a binary extension of the regular JSON format that includes additional data types such as int, long, date, floating point, Decimal128, and more.

A BSON document can include one or more fields (i.e., "columns"), where each field holds a specific data type, including arrays, binary data, objects, or another sub-document.

A BSON document may look like a JSON-formatted file, but it is actually serialized and stored in a binary representation, which reduces disk usage.
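To make the binary representation concrete, here is a minimal hand-rolled sketch that encodes a flat document of int32 fields following the public BSON spec. It is an illustration only — real drivers (e.g. PyMongo's bson module) handle the full type system; the function name is mine.

```python
import struct

def bson_encode_int32_doc(fields):
    """Hand-encode a flat document of int32 fields per the BSON spec.
    Illustration only; real drivers support all BSON types."""
    body = b""
    for name, value in fields.items():
        # element: type byte 0x10 (int32) + cstring key + little-endian int32
        body += b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    # document: int32 total length (self-inclusive) + elements + trailing null
    total = 4 + len(body) + 1
    return struct.pack("<i", total) + body + b"\x00"

doc = bson_encode_int32_doc({"a": 1})
# {"a": 1} serializes to just 12 bytes, with the int stored natively
```

Note how the length prefix and native integer encoding let the server skip fields without parsing text, which is where the disk and CPU savings come from.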

When inserting a new BSON document (i.e., a "row") into a collection (i.e., a "table"), two more important fields are added automatically: "_id" (by MongoDB itself) and "_class" (by an ORM mapping layer such as Spring Data).

The _id field's purpose is straightforward, so there is no need to explain it; however, I will expand on how it is generated later in this article.

The _class field stores the fully qualified name of the ORM entity the document maps to (for those of you who are familiar with OOP concepts, the _class field must point to a concrete class; interfaces aren't allowed here due to deserialization issues).

Document Validation

Unlike many other NoSQL stores, MongoDB provides document validation, removing another responsibility from the developers.

That is, by defining validation rules that are applied on inserts and updates, the developer can enforce which fields are mandatory, which data types are allowed, what ranges values may take, and how the data is structured.
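To show what such rules enforce, here is a small pure-Python mimic of the checks a MongoDB $jsonSchema validator performs server-side (required fields, type, numeric range). The function and schema names are mine; in MongoDB you would attach an equivalent $jsonSchema document to the collection instead.

```python
def validate(doc, schema):
    """Minimal client-side mimic of $jsonSchema-style checks.
    In MongoDB these rules run on the server at insert/update time."""
    types = {"string": str, "int": int}  # tiny subset of BSON type aliases
    for field in schema.get("required", []):
        if field not in doc:
            return False  # mandatory field missing
    for field, rule in schema.get("properties", {}).items():
        if field not in doc:
            continue
        value = doc[field]
        if "bsonType" in rule and not isinstance(value, types[rule["bsonType"]]):
            return False  # wrong data type
        if "minimum" in rule and value < rule["minimum"]:
            return False  # out of allowed range
    return True

schema = {
    "required": ["name", "age"],
    "properties": {"name": {"bsonType": "string"},
                   "age": {"bsonType": "int", "minimum": 0}},
}
```

A document missing "age", or carrying a negative one, would be rejected just as MongoDB would reject the insert.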

ObjectId

ObjectId is a unique identifier generated automatically when a new BSON document is inserted into MongoDB.

The ObjectId generation mechanism guarantees that each identifier differs from any identifier generated before it.

An ObjectId is composed of four parts. TS: the generation timestamp; ID: derived from the network id of the current machine; PID: the id of the running client library process; Count: a counter automatically incremented by the client.
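The four parts can be sketched directly. This is a toy assembly of a 12-byte ObjectId-like value, not a driver implementation (newer versions of the official spec replace the machine-id and PID bytes with a 5-byte per-process random value); the random machine-id stand-in is my assumption.

```python
import os
import struct
import time
import itertools

MACHINE = os.urandom(3)  # stand-in for a hash of the machine's network id
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

def fake_object_id():
    """Assemble a 12-byte ObjectId-like value from the four parts above."""
    ts = struct.pack(">I", int(time.time()))                 # 4 bytes: TS
    pid = struct.pack(">H", os.getpid() & 0xFFFF)            # 2 bytes: PID
    count = (next(_counter) & 0xFFFFFF).to_bytes(3, "big")   # 3 bytes: Count
    return (ts + MACHINE + pid + count).hex()                # 24 hex chars

oid = fake_object_id()
```

Because the counter moves on every call and the machine/process bytes differ across clients, collisions are avoided without any coordination through the server.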

Queries optimization

MongoDB's storage engine (WiredTiger) does not rest on its laurels. Rather, "MongoDB automatically optimizes queries to make evaluation as efficient as possible…". For example, the server includes a component called the "query optimizer" that periodically runs alternative query plans and selects the index with the best response time for each query type.

So what is the "efficient evaluation" they are talking about? Evaluation normally includes selecting data based on predicates and sorting it based on the sort criteria provided. The best result of this empirical test is stored as a cached query plan and is updated periodically.
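The idea of an empirical trial with a cached winner can be sketched in a few lines. This is a toy model of the behavior described above, not MongoDB's actual planner; the plan names and cache-by-query-shape scheme are my illustration.

```python
import time

def choose_plan(plans, query, cache={}):
    """Time each candidate plan on the query once, cache the winner
    per query 'shape' (the set of queried fields), reuse it afterwards."""
    shape = tuple(sorted(query))
    if shape in cache:
        return cache[shape]          # cached plan, no re-trial
    timings = {}
    for name, plan in plans.items():
        start = time.perf_counter()
        plan(query)
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    cache[shape] = best
    return best

docs = [{"x": i, "y": i % 10} for i in range(1000)]
plans = {
    "collscan": lambda q: [d for d in docs if d["x"] == q["x"]],  # full scan
    "index": lambda q: [docs[q["x"]]],  # pretend point lookup via an index
}
best = choose_plan(plans, {"x": 42})
```

MongoDB's real planner similarly races candidate plans, keeps the winner in its plan cache, and re-evaluates the choice periodically as data and workload change.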

Covered queries

A subset of query optimizations, called "covered queries", is characterized by their return results: in MongoDB, a query whose results contain only indexed fields is answered without reading the source documents at all.
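A hypothetical in-memory model makes the mechanics visible: the index itself stores the keyed fields, so a query that projects only those fields never touches the documents. The structures below are my illustration, not a driver API.

```python
# Toy model: documents live in one dict, the compound index on
# (user, age) stores its keyed fields as the index entries.
documents = {1: {"_id": 1, "user": "ada", "age": 36, "bio": "long text..."},
             2: {"_id": 2, "user": "bob", "age": 41, "bio": "long text..."}}

user_age_index = {"ada": {"user": "ada", "age": 36},
                  "bob": {"user": "bob", "age": 41}}

def covered_find(user):
    """A 'covered query': answered entirely from the index entry,
    without ever reading the `documents` store."""
    return user_age_index.get(user)
```

In real MongoDB, the same effect requires that both the filter and the projection use only indexed fields (and that _id is excluded from the projection unless it is part of the index).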

"Covered queries" are a mixed blessing, as far as features go.

Even though they reduce response time by returning results directly from the index, they can also expose inconsistent data.

Imagine you have a multi-threaded application that performs many CRUD operations against the DB. Some of the READ operations are "covered queries" and return results directly from the index.

In such a case, two threads use MongoDB at the same time: the first one writes/updates/deletes a document, while the second performs a covered query.

Well, you may run into an inconsistent-data problem because of the index rebuilding mechanism.

While one thread has inserted/updated/deleted a document, the index serving the covered query may not yet have been rebuilt with the new data changes.

Embedded document vs. separate collection

"Should I create a new collection for this data, or perhaps embed it as a sub-document?"

Many developers have a hard time answering this question.

Here are a few guidelines to help you reach the conclusion that is right for you, including the rule of thumb that you should duplicate your data for higher speed, and reference it for more integrity.

Denormalized (a.k.a. "embedded") data has its own advantages, such as speed, readability, the ability to index sub-fields, etc.

It also reduces very costly operations, such as aggregations and JOINs.

However, you have to pay attention to the data you are going to store, and decide how tolerant you are of it becoming inconsistent.

Another very important guideline is future-proofing.

If you are planning to query this data in different ways in the future, you may want to consider normalizing it (the problem with denormalized data is that it is limited to the context it is embedded in).

Another way to look at this guideline is to ask whether you'll query the information in a given field by itself, or only in the context of the larger document.

The last guideline I want to mention: do not embed fields that have unbounded growth.

You may embed 100 or 1,000,000 sub-documents, but the sizing decision has to be made up front. Given how MongoDB stores data (the WiredTiger engine), it would be fairly inefficient to constantly append information to the end of an array.
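One common way to honor this guideline is the "bucket" pattern: cap each embedded array at a fixed size chosen up front, and open a new document when a bucket fills. This is a minimal sketch of that pattern in plain Python; the names and the cap of 100 are my assumptions.

```python
BUCKET_SIZE = 100  # cap chosen up front, per the guideline above

def append_event(buckets, event):
    """Append to the last bucket document, opening a new one when full,
    so no single embedded array grows without bound."""
    if not buckets or len(buckets[-1]["events"]) >= BUCKET_SIZE:
        buckets.append({"bucket": len(buckets), "events": []})
    buckets[-1]["events"].append(event)
    return buckets

buckets = []
for i in range(250):
    append_event(buckets, i)
# 250 events land in three documents: 100 + 100 + 50
```

Each bucket stays at a predictable size, so documents never need the repeated in-place growth that the storage engine handles poorly.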

Indexing

The MongoDB WiredTiger engine provides a wide spectrum of indexing options, including unique, compound, and array indexes, as well as more specialized index options like TTL, geospatial, partial, sparse, and text-search indexes.

Unique indexes: When an index is declared unique, MongoDB rejects inserts of new documents that carry an existing value in the field for which the unique index was created.

Compound indexes: This kind of index should be used for queries that specify multiple predicates. An additional benefit of compound indexes is that any leading prefix of the indexed fields can be used on its own.
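Why a leading prefix works falls out of the sort order: entries sorted by the full compound key are also sorted by any prefix of it, so a range scan on the prefix alone is still efficient. A toy demonstration with sorted tuples (the field names are my example):

```python
from bisect import bisect_left, bisect_right

# toy compound index on (last_name, first_name), kept sorted like a B-tree
entries = sorted([("smith", "anna"), ("smith", "zoe"),
                  ("jones", "bob"), ("adams", "eve")])

def scan_prefix(index, last_name):
    """Use only the leading field of the compound key: since entries are
    sorted by the full key, a contiguous range scan on the prefix works."""
    lo = bisect_left(index, (last_name,))
    hi = bisect_right(index, (last_name, "\uffff"))
    return index[lo:hi]
```

A query on last_name alone hits this index; a query on first_name alone would not, because those values are scattered throughout the sort order.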

Array indexes: For fields that contain an array, each array value is stored as a separate index entry.

Partial indexes: By specifying a filtering expression — a condition established during the index creation — a user can instruct MongoDB to include only documents that meet the desired condition.

Sparse indexes: This kind of index contains entries only for documents that contain the specified field.