7 min read

On Databases

Databases come in many forms. The simplest definition of a database is any system of storing, organizing, and retrieving data. With this definition, things like memory, hard drives, file systems, files on those file systems (stored in plain text, tab-delimited, XML, JSON, or even BDB formats), and even applications like MySQL, PostgreSQL, and Oracle are considered databases.

Databases allow users to:

Store Data

Organize Data

Retrieve Data

It is important to keep a broad perspective on what data and databases really are so that you can always choose the best solution for your particular problem.

The SQL databases (MySQL, PostgreSQL, Oracle, and others) are remarkable because of the flexibility and performance they provide. In my work, I look to them first when developing an application, with an eye towards getting the data model right before optimization. Once the application is solid, and once I fully understand what parts of the data system are too slow or fast enough, then I can start building my own database on top of the file system or other existing technologies that will give me the kind of performance I need.

PostgreSQL: Free, BSD -licensed popular database.

http://postgresql.org/

http://postgresql.org/ MySQL: Free, GPL-licensed popular database.

http://mysql.org

http://mysql.org Oracle: Commercial industrial database.

http://oracle.com

http://oracle.com SQL Server: Microsoft’s commercial database.

http://www.microsoft.com/SQL/default.mspx

Among the SQL databases, which one is best? There are many criteria I use to evaluate SQL databases, and the one I pay attention to most is how they comply (if at all) with the ACID model.

And given the technical merits of the various SQL databases, I consistently choose PostgreSQL above all other SQL databases when given a choice. Allow me to explain why.

The ACID Model

ACID is an acronym, standing for the four words Atomicity, Consistency, Isolation, and Durability. These are fancy words for some very basic and essential concepts.

Atomicity means that you either do all of the changes you want, or none of them, without leaving the database in some weird in-between state. When you take into account catastrophes like power failures or corruption, atomicity isn’t as simple as it first seems.

Consistency means that any state of the database will be internally consistent with the rules that constrain the data. That is, if you have a table with a primary key, then that table will not contain any violations of the primary key constraints after any transaction.

Isolation means that you can be modifying many different parts of the database at the same time without affecting each other. (As a higher feature, there is Serialization, which requires that transactions occur one after the other, or at least the results of transactions.)

Durability means that once a transaction completes, it is never lost, ever.

Atomicity: All or nothing

Consistency: Rules kept

Isolation: No partials seen

Durability: Doesn’t disappear

ACID compliance isn’t rocket science, but it isn’t trivial either. These requirements form a minimum standard absolutely necessary to provide a database for a reasonable application.

That is, if you can’t guarantee these things, then the users of your application are going to be frustrated since they assume, naturally, that the ACID model is followed. And if the users of the application get frustrated, then the developers of the application will get frustrated as they try to comply with the user’s expectations.

A lot of frustration can be avoided if the database simply complies with the principles of the ACID model. If the database gets it right, then the rest of the application will have no problem getting it right as well. Our users will be happy since their expectations of ACID compliance will be met.

Remember: Users expect ACID!

What Violating the ACID Model Looks Like

To consider the importance of the ACID model, let’s examine, briefly, what happens when the model is violated.

When Atomicity isn’t adhered to, users will see their data partially committed. For instance, they might find their online profile only partially modified, or their bank transfer partially transferred. This is, of course, devastating to the unwary user.

When Consistency is violated, the rules that the data should follow aren’t adhered to. Perhaps the number of friends shown doesn’t match the friends they actually have in a social networking application. Or perhaps they see their bank balance doesn’t match what the numbers add up to. Or worse, perhaps your order system is counting orders that don’t even exist and not counting orders that do.

When Isolation isn’t guaranteed, they will either have to use a system where only one person can change something at a time, locking out all others, or they will see inconsistencies throughout the world of data, inconsistencies resulting from transactions that are in progress elsewhere. This will make the data unreliable just like violating Atomicity or Consistency. A bank user, for instance, will believe their transfer of funds was successful when in reality their money was simultaneously being withdrawn by another transaction.

When Durability is lost, then users will never know if their transaction really went through, and won’t mysteriously disappear down the road with all the trouble that entails.

I am sure we have all had experiences dealing with data systems that didn’t follow the ACID model. I remember the days when you had to save your files frequently, and even then you still weren’t ensured that all of your data would be properly saved. I also recall applications that would make partial changes, or incomplete changes, and expose these inconsistent states to the user.

In today’s world, writing applications with faults like the above is simply inexcusable. There are too many tools out there that are readily available that make writing ACID compliant systems easy. One of those tools, probably the most popular of all, is the SQL database.

Satisfying ACID with Transactions

The principle way that databases comply with ACID requirements is through the concept of transactions.

Ideally, each transaction would occur in an instant, updating the database according to the state of the database at that moment. In reality, this isn’t possible. It takes time to accumulate the data and apply the changes.

Typical transaction SQL commands:

BEGIN: Start a new transaction

COMMIT: Commit the transaction

ROLLBACK: Roll back the transaction in progress

Since multiple sessions can each be creating and applying a transaction simultaneously, special precautions have to be taken to ensure that the data that each transaction “sees” is consistent, and that the effects of each transaction appear all together or not at all. Special care is also taken to ensure that when a transaction is committed, the database will be put in a state where catastrophic events will not leave the transaction partially committed.

Contrary to popular belief, there are a variety of ways that databases support transactions. It is well worth the time to read and understand PostgreSQL’s two levels of transaction isolation and the four possible isolation levels in Section 12.2 of the PostgreSQL documentation.

Note that some of the inferior levels of transaction isolation violate some extreme cases of ACID compliance for the sake of performance. These edge cases can be properly handled with appropriate use of row-locking techniques. Row-locking is an issue beyond this article.

Keep in mind that the levels of transaction isolation are only what appear to users of the database. Inside the database, there is a remarkable variety of methods on actually implementing transactions.

Consider that while you are in a transaction, making changes to the database, every other transaction has to see one version of the database while you see another. In effect, you have to have copies of some of the data lying around somewhere. Queries to that data have to know which version of the data to retrieve the copy, the original, or the modified version (and which modified version?) Changes to the data have to go somewhere the original, a copy, or some modified version (again, which?) Answering these questions leads to the various implementations of transactions in ACID compliant databases.

For the purposes of this article, I will examine only two: Oracle’s and PostgreSQL’s implementations. If you are only familiar with Oracle, then hopefully you will learn something new and fascinating as you investigate PostgreSQL’s method.