It’s 2016, and contrary to popular belief, “works on my machine” is still a prevalent phenomenon. We’ve built all kinds of tools, from configuration management to VMs to Linux containers, in an attempt to bridge the gap between the developer’s machine and production servers. Thanks to tools like Git, Vagrant, Docker, Packer, Ansible, Chef, and others, we can now be fairly confident that the state of a developer’s machine is similar to what’s running in production. In fact, Docker containers remove state entirely from the delivery process by encouraging us to ship fresh new containers on every deployment and dispose of the old ones.

# run version1 of our app

docker run -d myorg/myapp:version1

# deploy fresh new version of our app in a fresh state

docker run -d myorg/myapp:version2

# kill old version

docker ps -q --filter ancestor=myorg/myapp:version1 | xargs docker kill

But it’s not all rainbows 🌈 and unicorns 🦄. We still have a lot of work to do when it comes to state management of databases. Here’s what I think our current challenges are and what we can do as an industry to establish standard solutions to combat those challenges.

TL;DR: Dev and Prod are not the same because of…

Database State Entropy: data on prod does not match data on dev

Unexpected User Interactions: real users are unpredictable

Database State Entropy

The gold standard today for making incremental changes to a database is migration scripts. Whether you use Rails (Ruby), Yii (PHP), Sails.js (Node) or Django (Python) for your app, you’re more than likely using one of the baked-in solutions to migrate and maintain your schema as well as your data. Every time we deploy our app, we run all the migration scripts that have not been applied yet and hope for the best.
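To make the "run everything that was not applied yet" step concrete, here is a minimal sketch of what such a migration runner does under the hood. It is not how any of the frameworks above actually implement it; the `schema_migrations` table, the migration names, and the `apply_pending` helper are all assumptions made up for illustration, and SQLite stands in for a real database.

```python
import sqlite3

# Hypothetical ordered list of (version, SQL) pairs; real frameworks
# usually discover these from files on disk.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    ("002_add_updated_at", "ALTER TABLE users ADD COLUMN updated_at TEXT"),
]


def apply_pending(conn, migrations):
    """Run every migration not yet recorded in the bookkeeping table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    ran = []
    for version, sql in migrations:
        if version in applied:
            continue  # already applied on a previous deploy
        # `with conn` commits on success and rolls back the open
        # transaction if an exception is raised.
        with conn:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
        ran.append(version)
    return ran


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(apply_pending(conn, MIGRATIONS))  # first deploy: runs both migrations
    print(apply_pending(conn, MIGRATIONS))  # second deploy: nothing left to run
```

Note that the runner only knows which scripts ran, not what state the data was in when they ran, which is exactly where the "hope for the best" part comes from.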

Unfortunately, over time, production databases are almost guaranteed to increase in complexity, quantity, and entropy.

en·tro·py — lack of order or predictability; gradual decline into disorder.

This is especially true in a world where Continuous Delivery (CD) is becoming the standard way to ship software. In short, CD is a practice of shipping features to production multiple times a day, including experimental and incomplete features. This means turning features ON and OFF and slowly releasing them to users of your software while monitoring changes in stability as well as user behavior.

Entropy accumulates primarily because production data endures unexpected events such as crashes, human error from bad code, unintentional data mutation, and buggy migrations. Our production data has the potential to reach states that are very difficult to anticipate on our developer machines. The reality is that the databases running on our laptops will almost never be exposed to the same conditions production data goes through. On top of that, most dev environments reset their databases regularly. Long-lived data on dev is unheard of.

In order to tackle data entropy, we need to…

Make the database in DEV ~== PROD

We can do so by regularly copying production data onto developer machines. Aside from the challenges involved and all the people who are already yelling at me about how terrible this idea could be, I think there’s a way we can do it safely. I mean, the benefits here could be huge. By getting the state of databases on dev machines closer to production, we increase our chances of not only fixing bugs that were hard to reproduce but also catching bugs we were not aware of. Developers get to run their automated tests on prod-like data, which is in my opinion invaluable, not to mention running local migrations on prod-like data.

For the people yelling the following at me

The data is too sensitive

The first concern that comes up is usually about the security risks involved in placing sensitive data on dev machines. The answer to that is “Data Tampering” (often called data masking or anonymization elsewhere): the practice of obfuscating or transforming row values into different data such that, if the data were to fall into the wrong hands, nobody could do anything with it. This includes incrementing or decrementing integer values, generating random password hashes, changing email addresses, and so on. The main catch here is to preserve the structure of the values. For example, if an email address is “charles.barkley@ymail.com”, we could replace it with “mike.tyson@gmail.net”. This process can of course be adapted to match the security risks involved in your own software.
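As a rough sketch of what Data Tampering could look like in practice, the snippet below transforms one row at a time while keeping the shape of each value. Everything here is an assumption for illustration: the column names (`email`, `age`, `password_hash`), the helper names, the salt, and the choice to swap in reserved `example.*` domains rather than real mail providers. A real tool would need a per-column policy that matches your own risk profile.

```python
import hashlib
import random
import secrets

# Reserved documentation domains, so tampered addresses can never reach a real inbox.
FAKE_DOMAINS = ["example.com", "example.net", "example.org"]


def tamper_email(email: str, salt: str) -> str:
    """Replace an email with a fake one that keeps the first.last@domain shape."""
    local, _, _domain = email.partition("@")
    # Hash each dot-separated piece with a secret salt: the same input always
    # maps to the same output, but the original cannot be read back.
    pieces = [
        hashlib.sha256((salt + piece).encode()).hexdigest()[: max(len(piece), 3)]
        for piece in local.split(".")
    ]
    digest = hashlib.sha256((salt + email).encode()).digest()
    return ".".join(pieces) + "@" + FAKE_DOMAINS[digest[0] % len(FAKE_DOMAINS)]


def tamper_int(value: int, rng: random.Random, max_shift: int = 5) -> int:
    """Increment or decrement an integer by a small, non-zero random amount."""
    return value + rng.choice([-1, 1]) * rng.randint(1, max_shift)


def tamper_password_hash() -> str:
    """Generate a random 64-character hex string in place of the real hash."""
    return secrets.token_hex(32)


def tamper_row(row: dict, salt: str, rng: random.Random) -> dict:
    """Return a copy of the row with the (hypothetical) sensitive columns tampered."""
    out = dict(row)
    out["email"] = tamper_email(row["email"], salt)
    out["age"] = tamper_int(row["age"], rng)
    out["password_hash"] = tamper_password_hash()
    return out


if __name__ == "__main__":
    row = {"id": 1, "email": "charles.barkley@ymail.com", "age": 30, "password_hash": "abc"}
    print(tamper_row(row, salt="keep-this-secret", rng=random.Random()))
```

Hashing with a salt instead of picking random names is a design choice, not a requirement: it keeps the tampering deterministic, so the same production email maps to the same fake email across tables and across refreshes.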

There’s too much data

Another concern is that production data can be extremely large and hard to fit on mainstream laptop hard drives. To tackle that, we can use what I’m calling “Bad Data Sampling”, for lack of better terminology. Bad Data Sampling is the process of extracting a selection of records with distinct characteristics from our production database. In other words, we want to find outliers and bring them into our dev environments. Imagine you had a bug in production for a while that resulted in a lot of records not storing the updated_at field for a specific table. Or maybe some records were stored with bad formatting. These records could be causing unexpected bugs, and developers benefit from being exposed to them on a daily basis.

Instead of extracting all of those records, we can take a few sample records and place them as part of our data. The idea here is not just to pull outliers and uncommon records but to actually get a feel on dev for what our production data looks like. If we had 5,000,000 rows in a table, and only 1,000 were bad/corrupt rows, our data sampling tool could perhaps extract 5,000 records for dev machines, 1 of which is a bad record. The tool would maintain not only the ratio of good vs bad data but also the frequency of outliers.
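The arithmetic behind that example: 1,000 bad rows out of 5,000,000 is a ratio of 1 in 5,000, so a 5,000-row sample should contain 5,000 × (1,000 / 5,000,000) = 1 bad row. Below is a minimal sketch of a sampler that preserves this ratio. The function name, the `is_bad` predicate, and the rule of always keeping at least one bad row are assumptions for illustration, not an existing tool. It also loads every row into memory, which a real tool working against a large table would avoid.

```python
import random


def stratified_sample(rows, is_bad, sample_size, rng):
    """Sample rows while keeping the production ratio of good vs bad records."""
    good, bad = [], []
    for row in rows:
        (bad if is_bad(row) else good).append(row)

    total = len(good) + len(bad)
    if total == 0:
        return []
    sample_size = min(sample_size, total)

    # Number of bad rows proportional to their share of the full table.
    n_bad = round(sample_size * len(bad) / total)
    if bad:
        # Assumption: always keep at least one outlier so developers see it.
        n_bad = max(1, n_bad)
    n_bad = min(n_bad, len(bad))
    n_good = min(sample_size - n_bad, len(good))

    return rng.sample(bad, n_bad) + rng.sample(good, n_good)


if __name__ == "__main__":
    # Scaled-down stand-in for production: 50,000 rows, 10 missing updated_at.
    table = [
        {"id": i, "updated_at": None if i % 5000 == 0 else "2016-01-01"}
        for i in range(50_000)
    ]
    sample = stratified_sample(
        table, lambda r: r["updated_at"] is None, 5_000, random.Random()
    )
    print(len(sample), sum(1 for r in sample if r["updated_at"] is None))
```

In practice there would be one predicate per kind of outlier (missing timestamps, bad formatting, and so on), each sampled as its own group.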

Unpredictable user interactions can result in bad data

Apart from guessing and anticipating what the user might do while interacting with your app, Continuous Delivery can really help through gradual rollouts of new features and learning how users interact with them.

Monitor and learn about how your user interacts with different features

Test unusual behavior and edge cases as part of your test suite to prevent them in the future

Deliver experimental features to production that you can turn ON and OFF while slowly exposing a bigger and bigger portion of your users to those features over time. (Continuous Delivery)
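The "turn ON and OFF while slowly exposing a bigger portion of your users" part can be sketched as a percentage-based feature flag. This is only one common way to do it, under assumptions: the `is_enabled` function, the feature name, and the user IDs are hypothetical, and a real setup would read the percentage and the switch from configuration rather than from arguments.

```python
import hashlib


def is_enabled(feature: str, user_id: str, rollout_percent: int, switched_on: bool = True) -> bool:
    """Decide whether a user sees a feature during a gradual rollout."""
    if not switched_on:
        return False  # kill switch: the feature is OFF for everyone
    # Hash feature + user into a stable bucket from 0 to 99. A user stays in
    # the same bucket, so raising the percentage only ever adds users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (0, 10, 50, 100):
        exposed = sum(is_enabled("new-checkout", u, percent) for u in users)
        print(f"{percent}% rollout -> {exposed} of {len(users)} users exposed")
```

Including the feature name in the hash means each feature gets a different slice of users, so the same early adopters are not the guinea pigs for every experiment.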

Conclusion

I strongly believe that taming the database is the next step in bridging the gap between DEV and PROD. To get there, we can…

Make developer databases almost identical to production using methods such as “Data Sampling” and “Data Tampering”

Learn how your user behaves through iterations à la Continuous Delivery.

There’s obviously a huge void of tools for achieving such goals, which is exciting for any Open Source enthusiast like myself.

I highly recommend this presentation on Continuous Delivery by Mike Brittain which goes in-depth on the gap between Dev and Prod.

“Your Dev environment is not the same as Production. And if you think that it is, you’re gonna constantly be surprised.” — Mike Brittain (VP, Engineering at Etsy) at the 36:10 mark
