Storage

Storage, or where and how to physically store big data, may seem like the easiest challenge to tackle, but it isn't that simple to solve. It's true that provisioning more storage space is now easier and cheaper than ever. Cloud providers simplify the process and help you build out your own data farm. Databases can also be autoscaled up or down as needed, making it easier to index the data. But when building or refactoring a big data storage architecture, we must consider five factors: data compression, data search, short-term storage, long-term (cold) storage, and data movement.

Data compression: compress, compress, compress. Compressing data before storing it is key. It saves a great deal of space in the long run and reduces the load on hardware. Numerous compression libraries are available that will do the job, and you can even build your own if the data is unusual enough to warrant it.

However, there are some cases where the data cannot be compressed without losing quality. This usually happens when the data consists mostly of images, videos, or audio, which typically arrive already encoded in lossy formats; recompressing these artifacts degrades quality further, and whether to do so should be decided case by case. For most other, text-based data, though, compressing before storing is a must.
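As a minimal sketch of the compress-before-store step, here is what it might look like with Python's standard gzip module (a dedicated library such as zstandard would follow the same pattern; the file name is illustrative):

```python
import gzip
from pathlib import Path

def store_compressed(payload: bytes, dest: Path) -> None:
    """Compress a text-based payload before writing it to storage."""
    dest.write_bytes(gzip.compress(payload, compresslevel=6))

def load_compressed(src: Path) -> bytes:
    """Read a stored payload back and decompress it."""
    return gzip.decompress(src.read_bytes())

# Text-based records compress well; already-compressed media (JPEG, MP4,
# MP3) generally does not, so gate this step on content type.
record = b'{"event": "page_view", "user": 42}\n' * 10_000
store_compressed(record, Path("events.json.gz"))
assert load_compressed(Path("events.json.gz")) == record
```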

Data search: with large amounts of data, the key to success is an overarching, consistent strategy for indexing the data so that it is searchable. Here, it's best to adopt a search-first methodology, and this is where databases become useful: data should first be enriched with metadata and properly indexed, and only then compressed and stored. The metadata itself can live in a dedicated database cluster, built on NoSQL and/or relational databases.
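To make the search-first flow concrete, here is a hedged sketch using SQLite as a stand-in for the metadata cluster; the schema, fields, and paths are illustrative assumptions, not a prescribed layout:

```python
import gzip
import sqlite3
from pathlib import Path

# Illustrative schema: in production the index would live in a dedicated
# metadata cluster (NoSQL and/or relational), not a local SQLite file.
db = sqlite3.connect("metadata.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS documents (
           id INTEGER PRIMARY KEY,
           source TEXT,
           created_at TEXT,
           tags TEXT,
           blob_path TEXT
       )"""
)

def ingest(payload: bytes, source: str, created_at: str, tags: str) -> None:
    """Enrich with metadata and index first, then compress and store."""
    cur = db.execute(
        "INSERT INTO documents (source, created_at, tags, blob_path) "
        "VALUES (?, ?, ?, ?)",
        (source, created_at, tags, ""),
    )
    Path("blobs").mkdir(exist_ok=True)
    blob_path = f"blobs/{cur.lastrowid}.gz"
    Path(blob_path).write_bytes(gzip.compress(payload))
    db.execute("UPDATE documents SET blob_path = ? WHERE id = ?",
               (blob_path, cur.lastrowid))
    db.commit()

# Searches hit the small metadata index, never the compressed blobs:
ingest(b"...log lines...", "web-01", "2024-05-01", "logs,prod")
rows = db.execute(
    "SELECT blob_path FROM documents WHERE tags LIKE '%prod%'"
).fetchall()
```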

Short-term storage: depending on how data is accessed, it's a good idea to have two storage bins, short- and long-term. The short-term bin is where data is held temporarily. Once in that bin, the data can be aged out based on business needs, for example with a first-in-first-out (FIFO) queue: the queue holds data for n days, and on day n+1 the oldest data is moved to long-term storage. If n is small enough, and storage space and/or processing speed are not an issue, the short-term bin can hold plain, uncompressed data, which allows for even better search capabilities.
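A minimal sketch of that time-based aging job, assuming a directory per tier and a daily run (the paths and retention window are illustrative):

```python
import shutil
import time
from pathlib import Path

N_DAYS = 30  # illustrative retention window; tune to business needs

def age_out(short_term: Path, long_term: Path, n_days: int = N_DAYS) -> None:
    """Move files older than n days from short-term to long-term storage."""
    cutoff = time.time() - n_days * 86_400
    long_term.mkdir(parents=True, exist_ok=True)
    for entry in short_term.iterdir():
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            shutil.move(str(entry), long_term / entry.name)

# Run daily (e.g. from cron); anything written more than N_DAYS ago migrates.
age_out(Path("storage/short_term"), Path("storage/long_term"))
```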

The short-term tier gives you better control over the data and lets newer entries be accessed much faster. In general, recent data is accessed most frequently, and the probability of a given piece of data being used again drops as it ages. Of course, the demotion policy can get more intricate than a simple time-based FIFO queue: you can add signals such as number of times accessed or search hits, so that a piece of data that is used constantly, relative to all other data points, stays in the short-term tier longer.
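One possible shape for such a policy is a retention score; the weights and signals below are illustrative assumptions, not a prescribed formula:

```python
import time
from dataclasses import dataclass

@dataclass
class Entry:
    path: str
    stored_at: float    # epoch seconds
    access_count: int   # times read since ingestion
    search_hits: int    # times surfaced by the search layer

def retention_score(e: Entry, now: float | None = None) -> float:
    """Higher scores stay in short-term storage longer.

    Illustrative weighting: recency decays linearly, while repeated
    access and search relevance push an entry back up the list.
    """
    if now is None:
        now = time.time()
    age_days = (now - e.stored_at) / 86_400
    return 2.0 * e.access_count + 1.0 * e.search_hits - 0.5 * age_days

def demote_candidates(entries: list[Entry], threshold: float = 0.0) -> list[Entry]:
    """Entries whose score falls below the threshold move to cold storage."""
    return [e for e in entries if retention_score(e) < threshold]
```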

Long-term (cold) storage: this is where data that does not need to be acted on as often is kept. There must still be a way to perform every activity on it, just as in short-term storage; the difference is that any operation against long-term storage will take much longer to complete.
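A small sketch of what "same activities, just slower" can look like in practice: a read path that tries the short-term tier first and falls back to decompressing from cold storage. The directory layout and gzip format are carried over from the earlier sketches and remain assumptions:

```python
import gzip
from pathlib import Path

SHORT_TERM = Path("storage/short_term")  # plain files, fast reads
LONG_TERM = Path("storage/long_term")    # gzip-compressed, slower reads

def read(name: str) -> bytes:
    """Same read operation on both tiers; the cold path just costs more."""
    hot = SHORT_TERM / name
    if hot.exists():
        return hot.read_bytes()
    cold = LONG_TERM / f"{name}.gz"
    if cold.exists():
        return gzip.decompress(cold.read_bytes())  # extra decompress step
    raise FileNotFoundError(name)
```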

Data movement: when terabytes, petabytes, or even more data is created daily, moving it from cluster to cluster becomes a problem in its own right. A simple backup of both short- and long-term storage will inevitably take a long time, and data movement must be kept to a minimum to ease the load on the network. One option is to maintain a central data repository from which all other services (search, end-user display, sorting, indexing, etc.) operate. This strategy also helps eliminate data duplication and makes scheduled backups and disaster recovery (DR) easier to manage.
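As one hedged sketch of such a repository, content-addressed storage writes each unique payload exactly once, so services pass around small keys instead of copying bytes between clusters; the hash choice and paths are illustrative:

```python
import hashlib
from pathlib import Path

REPO = Path("storage/central_repo")  # single repository all services read from

def put(payload: bytes) -> str:
    """Store a payload keyed by content hash; duplicates cost nothing.

    Services exchange these keys rather than the data itself, which
    keeps movement between clusters to a minimum.
    """
    key = hashlib.sha256(payload).hexdigest()
    REPO.mkdir(parents=True, exist_ok=True)
    dest = REPO / key
    if not dest.exists():  # identical data is written only once
        dest.write_bytes(payload)
    return key

def get(key: str) -> bytes:
    return (REPO / key).read_bytes()
```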