Introduction

Companies and organizations generate data and are increasingly using this data to generate additional values. While traditionally this was a task for business administration analysts, today data plays an important role in all aspects and divisions of the organizations. To enable companies for this change, an efficient and long term data architecture is required. Here we are going to discuss many aspects and technical challenges that need to be addressed to build such a data architecture.

The motivation for this article came from the observation that data platforms often are reduced to the database component which is a huge oversimplification of the whole data lifecycle. We want to illustrate the amount of functionality that is required for a basic, long term data strategy in a company.

What Is a Data Platform?

A data platform is a service that allows you to ingest, process, store, access, analyze, and present data. These are considered the defining features of what we call a Data Platform. This platform can be broken down into theree parts:

Data Warehouse: ingest, process, store, access.

Business Intelligence: analyze and present.

Data Science: Statistics and Artificial Intelligence (a special form of analysis).

Although storing and processing data is at the heart of a data platform it doesn’t stop there. To get a good overview of the overall data management tasks, we compiled a list of 17 criteria concerning the whole data engineering and analytics lifecycle.

Data Architecture (Infrastructure, scaling, database) Import Interfaces Data Transformation (ETL) Performance Process Automation Monitoring Data Historization Data Versioning Surrogate Key Management Analysis/Reporting Data Science Work Area External Access/API Usability Multiuser Development Process In-platform Documentation Documentation Security Costs

In the following sections, we will give a short introduction to each of these criteria but will not go too much into technical details.

1. Data Architecture

The core data architecture is one of the most important aspects in a data platform architecture, but by far not the only one. The goal is to find a suitable storage and database solution to meet your requirements. There are three basic options to choose from with impact on all the other topics below:

Relational Database

Very mature database systems with a lot of built-in intelligence to handle large data sets efficiently. These systems have the most expressive analysis tools, but often are more complex with respect to maintenance. Nowadays, these systems are also available as distributed systems and can handle extremely large data sets. Examples are PostgreSQL with Citus or Greenplum cluster solution, MariaDB/MySQL with Galera Cluster, Amazon Redshift, Oracle, MSSQL, …

NoSQL Style Sharding Database

These systems sacrifice some of the classical features of relational databases for more power in other areas. These systems offer much less analytical power by themselves, but are easier to manage, have high availability and easy backups. They are designed to handle extremely large datasets. There are efforts to mimic the analytical power of SQL that is available in the relational database world with tools on top (Impala, Hive). If you have huge amounts of data or specific requirements for streaming or real-time data you should take a look at these specialized data systems. Examples are Cassandra, the Hadoop Ecosystem, Elasicsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, neo4j, and Dgraph.

File-Based Systems

It is possible to design a data strategy solely on files. File structures like the Parquet file standard enable you to use very cheap storage to store very large data sets distributed over many storage nodes or on a Cloud Object Store like Amazon S3. The main advantage is that the data storage system suffices to respond to data access requests. The two systems above need to run services on extra compute nodes to respond to data queries. With a solution like Apache Drill you can query parquet files with similar comfort that is known from SQL.

When looking out for the hardware architecture to support your chosen data architecture a few basic options are on the table.

You can try and build your platform relying on services offered by major cloud vendors. On cloud platforms like AWS, Azure, or Google Cloud you can plug together a selection of rather simple services to create a data platform that covers our criteria list. This might look simple and cheap at small scale, but can turn out to be quite complex and quite expensive when you scale out and need to customize. In contrast there are platforms based on self operated hardware including cloud virtual machines and individual software stacks. Here you have a maximum of flexibility but also need to answer many of the criteria in our list by creating your own code and custom solutions. As a last category we would mention more complete and dedicated independent cloud data platforms like Repods, Snowflake, Panoply or Qubole (Disclaimer: The author is from the Repods platform). These platforms handle more or less of the criteria on our list out of the box.

2. Performance

This criteria is often a large influencer in the platform choice. Performance is mostly influenced by the database subsystem you choose. As a rule of thumb, the higher your performance requirements are, the more specialized your database system choice should be. We don’t want to go into more detail and instead focus on the less known topics of a data platform.

3. Import Interfaces

We categorize import interfaces into three different sections.

Files — Still the most common form of data. Web Services — Plenty available on the net with relevant data. Databases — Although in many organizations a lot of data is stored in traditional databases, in most cases direct database access is not exposed to the internet and therefore not available for Cloud Data Platforms. Web services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of ssh-tunneling over secure jump hosts. Real-time streams — Real-time data streams as delivered by messaging routers (speaking WAMP, MQTT, AMQP, …) and are not used very much today but are going to become more important with the rise of IoT.

4. Data Transformation ETL

Data imported into the data platform usually has to undergo some data transformations to be usable for analysis down the road. This process is traditionally called ETL (Extract, Transform, Load). These processes usually create a table from the raw data, assign data types, filter values, join existing data, create derived columns/rows, and apply all kinds of custom logic to the raw data. Creating and managing ETL processes is sometimes called data engineering and is the most time consuming task in any data environment. In most cases this task takes up 80% of the overall human efforts. Larger data warehouses can contain thousands of ETL processes with different stages and dependencies and processing sequences.

5. Process Automation

When you have many sources, targets, and data transformation processes in between you also have many dependencies and with this a certain run schedule logic. The automation of processes is part of every data warehouse and involves a lot of complexity. There are dedicated tools (Apache Airflow, Automate, Control-M, Luigi, …) to handle only the scheduling of processes.

Process automation also requires you to manage the selection of data chunks to process, i.e. in an incremental load scenario every execution of a process needs to incrementally pick specific chunks of source data to pass on to the target. This “Data Scope Management” is usually implemented by a metadata driven approach, i.e. there are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.

6. Monitoring

Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing the data flow. Errors appearing at runtime are almost unavoidable and many of them have to be taken care of by manual intervention. With this amount of complexity you definitely need a way to monitor what is going on in the platform.

7. Data Historization

The need to manage longer histories of data is at the core of each data warehouse effort. In fact, the data warehousing task itself could be summarized as the task of merging seperate chunks of data into a homogenous data history. Data is naturally generated over time and as such there arises the need to increment an existing data stock with new data. To robustly manage data histories, “data historization” is typically used. Technically, time ranges in tables are tracked using dedicated time range columns. This is different to data versioning in the sense that historization is concerned with real life timestamps whereas versioning is usually concerned with technical insert timestamps (See section below).

8. Data Versioning

By versioning data you can track data corrections over time for later recovery of old analysis. Versioning allows you to apply non-destructive corrections to existing data. When comparing versioning capabilities you have to consider the ease of creating versions and the ease of recovering or querying versions. Versioning can be handled on different levels of the system:

Create version snapshots on the storage subsystem (similar to backups).

The underlying database system might come with support for version tracking.

Versioning might be handled by the data warehouse system.

Versioning can be implemented as a custom transformation logic in the user space.

9. Surrogate Key Management

Data warehouses are used to consolidate data from many sources with different identifiers for the respective objects. This requires us to create new key ranges for the imported objectes and maintain them throughout consecutive imports. These new keys are called surrogate keys and it is no simples task to create and maintain these keys in an efficient manner.

10. Analysis/Reporting

A data platform’s purpose is to prepare raw data for analysis and store this data for longer periods of time. Analysis can be conducted in many flavors.

There are many tools referred to as Business Intelligence tools (BI tools) concerned only with creating analytical and beautiful/human readable data visualizations. To prepare consumable data chunks for presentation, a data platform provides features to create data extracts and aggregates from the larger data stock.

Answering specific business questions by intelligently querying the data stores requires a great deal of proficiency with analytical query languages. BI tools aim to simplify these tasks by providing point and click interfaces to answer basic questions like, “Number of visitors per month in store” or “Sum of revenue in region X.” These tools also enable users to visualize the information in easily consumable graphics. In almost all cases, power users still want to be able to bypass these tools and conduct their own queries. Popular examples for BI tools are Tabelau, Qlik, Looker, Chartio, Superset, and many others.

11. Data Science

Nowadays, training machine learning models is a new requirement that data platforms have to serve. The more sophisticated methods are implemented using Python or R together with a wide variety of specialized libraries like NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch, or even more specialized libraries for natural language processing or image recognition.

Since these tasks can be computationally demanding, extra compute hardware is required in addition to the existing analytics hardware. While this opens up a large choice of tools for you to pick from, you again are facing the challenge of hosting and managing compute resources to back very demanding machine learning jobs.

12. External Access/API

All the collected data in the platform is there to be consumed for different purposes. Possible channels that are considered here are:

SQL access for direct analysis or via BI tools.

API Access (REST request) as a service for web sites or apps.

Notfications via push notifications or email for end users or administrators.

File exports for further processing or data delivery to other parties.

13. Usability

The usability of the platform depends very much on the targeted audience. Usability is a very broad and subjective category and considers things such as how easy it is to create and manage objects (users, data warehouses, tables, transformations, reports, etc.) in the platform. Often, there exists a trade off between the level of control a user gets and the level of simplicity. Here, we have to distinguish between the functionality the platform provides and the user generated content inside the platform. In most cases, the user generated content requires the use of code, since the whole subject of data engineering and analysis is by nature complex and requires a large amount of expressiveness.

14. Multiuser Workflow

This category evaluates the support for user interaction and sharing of work and data. This aspect involves realtime updates of user action, collaborative work, sharing, and role assignment. And also a way to discuss and comment on the platform.

15. In-Platform Documentation

A data platform is used to implement a lot of custom complexity with many participating users, over a longer period of time.

This requires proper documentation of the user provided content. Here, we assess how the platforms support this task. Of course, documentation can always be prepared outside of the platform, but this implies the risk of divergence of information. External documentation quickly becomes outdated and therefore just as quickly looses trust from the users.

16. Documentation

All platforms require a certain proficiency by the user. Data engineering is no casual task. Proper documentation on the details of the platform features is therefore required for professional use of a platform.

17. Security

Data Platform Security can be separated into security of storage (data at rest), interaction (data in transport), and access control.

18. Cost Structure

We identify three major cost drivers of a data platform:

Licenses

Infrastructure (Hardware)

Staff

Today, most software stacks can be implemented in high quality using open source and/or free software. Licensed software or services usually require less maintenance effort and low level system know how.

Compute hardware can be used by cloud providers on a pay per use basis. The same holds basically for storage infrastructure.

To estimate your hardware costs, you need to consider the infrastructure to cover the following components:

Database

Data Transformations

Analytics

Data Science

Automation

Monitoring

Hosting of Content

Even though the database is usually the largest component it is by far not the only one.

Conclusion

Data platforms should not be reduced to the underlying core database but instead should be considered as an ecosystem of services that need to be balanced.

The above list provides a basic entry point for evaluating data platforms for their fitness as a long term, manageble data platform for organizations with the primary purpose of aggregating longer histories of data for more significant statistics and forecasts. This list applies only in part if your objective is to solve very specific data related problems for a shorter time period.