In a previous post, we covered how to create a custom analytics solution over Amazon Redshift. One of the biggest advantages of utilizing Redshift for analytics instead of using a third-party off-the-shelf analytics tool (such as Google Analytics, Mixpanel, Localytics, etc.) is the ability to join multiple data sources. For example: joining analytic events coming from your mobile app with user data stored in your MySQL database. This allows answering deeper questions like “Which of my customers are using my product the most but spending the least?”, or “What sequence of user actions leads to the highest likelihood of that user becoming a premium user?”.

We will begin with a high-level overview and then dive into the technical details, since this topic has some inherent complexity.

At a glance, this article covers:

Why companies replicate their MySQL databases to Amazon Redshift

The different approaches for MySQL replication

Detailed real-world example implementation of MySQL to Amazon Redshift replication

Common pitfalls to avoid when building your replication solution

Why replicate MySQL to Amazon Redshift?

It’s very common these days to have multiple replicas of your data in several different types of data stores. Different parts of your applications have different requirements from your data store. By replicating your data to multiple data stores, each optimized for a different use-case, you are able to optimize the performance of your system.

Perhaps the most common and straightforward example is read replicas for your transactional database. But many applications also replicate their transactional data to Elasticsearch to enjoy its search speed and rich feature set, store the raw data in S3 for backup, or keep hot data in Memcached for caching.

Back to our use-case, data teams tend to replicate their MySQL data to Redshift for two main reasons:

Joining multiple data sources - The main motivation to move to Redshift is to be able to join and intersect multiple data sources. For example, joining billing data and customer service data with user behavior data. This kind of analysis allows you to ask questions such as: “Which features lead to higher retention rates and more conversions to paid plans?” and “What user behaviors increase churn?”, and to perform predictions such as “Which users are likely to become paying customers?”

Analytical query performance - While MySQL is optimized for reading, inserting and updating a relatively small number of records, Redshift is geared towards aggregating and analyzing large amounts of data by taking advantage of its columnar data store and massively parallel processing architecture. These kinds of analytical queries tend to perform very poorly in MySQL, and risk overloading and blocking your operational transactional database. By replicating MySQL to Amazon Redshift, developers are able to improve their analytical query speed by orders of magnitude. In addition, separating the operational database from the analytical database means that analytical queries will never affect the performance of the production database.

Different approaches for MySQL replication

There are several main approaches for database replication. In this section we’ll describe the three most common approaches and compare the advantages and drawbacks of each of them.

Full dump and load

In this approach, periodically, the MySQL tables are fully dumped, the corresponding Amazon Redshift tables are dropped and recreated, and the full dumps of the MySQL tables are loaded to Redshift.

The advantage of this approach is that it is very simple and straightforward. The disadvantage is that dumps are very resource intensive and can slow down the MySQL database while they run. To get a consistent dump of your table you might even need to lock the database for the duration of the dump. In addition, a full dump of very large tables incurs high latency.

Therefore, this approach is mostly recommended if you have very small tables.

If you do choose to use this approach, it would be best to use a replica instead of your master database, since it might block your database and interfere with your production application.

Incremental dump and load

In this approach, the MySQL table is periodically queried for updates since the last query. The updates are then loaded into Amazon Redshift and a consolidation query reconstructs the original table.

SELECT * FROM my_table WHERE last_update > #{last_import}

The main advantage of this approach over a full dump and load is that in each iteration, only the updates are extracted and loaded which usually significantly reduces the load, since normally only a small portion of the database’s rows are updated between two iterations.

Perhaps the most significant disadvantage of this approach is that it cannot capture row deletions, since deleted rows will never be returned in a query. Similarly, table alterations are also not captured unless actively queried in each iteration. Another minor disadvantage is that it requires the replicated tables to contain an “updated at” column that the periodical query can use. The necessary consolidation step also adds some complexity to the implementation.

Similarly to the full dump and load approach, it’s best to use a replica and not the master database for the incremental dump and load to avoid blocking the master database.

Binlog replication

This approach (sometimes referred to as change data capture, or CDC) utilizes MySQL’s binlog, which keeps an ordered log of every operation performed by the database. After a dump of the initial state of the MySQL database, the binlog is continuously streamed and loaded into Amazon Redshift.

We at Alooma found this to be the best approach, and the only one that allows near real-time replication at scale. Its main advantages: it doesn’t lock or affect the performance of the database; it supports both deletions and table alterations, and therefore enables exact one-to-one replication; it imposes no requirements on the structure of the tables; and it fits naturally with the stream processing paradigm, allowing transformations and near real-time performance. Not surprisingly, this is also the most complex approach to implement.

Implementing binlog replication

In this section we provide a detailed real-world example of how to implement MySQL to Amazon Redshift replication using MySQL’s binlog.

We assume that you already created your Amazon Redshift cluster and that you know how to load data to Amazon Redshift. If not, check out Amazon Redshift’s getting started guide.

We will use an example table to demonstrate the different steps of the suggested implementation. Let’s imagine that we have a products MySQL table that we would like to replicate in Amazon Redshift and has the following structure:

| id | name | price |
|----|-----------|-------|
| 1 | Beard oil | 1.00 |
| 2 | Balm | 2.00 |
| ... | ... | ... |

Each of the products in our example products table has an id, a name and a price. We would like to reconstruct this table in our Redshift cluster on a continuous basis.

General overview of the process

Step 1 - Creating the Redshift tables

Step 2 - Initial dump and load - Captures the initial state of the table

Step 3 - Continuous binlog streaming - Captures the updates to the table continuously

Step 4 - Consolidation - Updates the replicated table using the binlog and the previous version of the replicated table

Step 1 - Creating the Redshift tables

Before pulling data from MySQL and loading it to Amazon Redshift, you will need to create the target tables in Redshift, since some properties such as column types and table keys can only be defined once.

Two tables will need to be created: the first will hold the latest reconstructed state of the original MySQL table; the second will hold all the updates to the original MySQL table since the last consolidation. Both tables will have the same structure so that they can be unioned for the consolidation step. This also means that the reconstructed table might hold a few columns that don’t exist in the original table, but these should be disregarded in analysis.

There are two main types of discrepancies between the MySQL and Redshift tables that you will have to consider: column type discrepancies and key discrepancies. Since column types and keys cannot be changed after table creation in Amazon Redshift, you need to define column types and table keys in advance upon table creation.

Column type conversion

MySQL and Redshift have slightly different column types. Redshift has an equivalent for the most common column types, and most other columns can be converted to a Redshift column type.

For example, MySQL’s INT and FLOAT are direct equivalents of INTEGER and REAL in Amazon Redshift. MySQL’s TIME, on the other hand, has no direct equivalent and has to be loaded into Redshift as VARCHAR(40); UNSIGNED INT has no equivalent in Redshift; and BLOB and VARCHAR values longer than 65535 bytes must be truncated.
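As a concrete sketch of these conversions, here is a hypothetical MySQL table using the problematic types, alongside a matching Redshift target (the table and column names are illustrative; the type choices follow the suggestions above):

```sql
-- Hypothetical MySQL source table (names are illustrative)
CREATE TABLE measurements (
    id        INT UNSIGNED,  -- no UNSIGNED types in Redshift
    taken_at  TIME,          -- no direct TIME equivalent in Redshift
    value     FLOAT,         -- maps directly to REAL
    raw_blob  BLOB           -- must fit in (or be truncated to) 65535 bytes
);

-- Matching Amazon Redshift target table
CREATE TABLE measurements (
    id        BIGINT,         -- widened so all unsigned 32-bit values fit
    taken_at  VARCHAR(40),    -- TIME loaded as text
    value     REAL,
    raw_blob  VARCHAR(65535)  -- truncated on load if longer
);
```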

You can find a full list of suggested column type conversions here: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshift.html

Primary key, distribution key and sort keys

Primary key

The main discrepancy between MySQL and Amazon Redshift regarding the primary key, is that in Redshift the primary key constraint is not enforced. Thus, two rows can have an identical primary key. Primary keys are only used as a hint by the Amazon Redshift query planner to optimize your queries. Primary keys should be enforced by your ETL process.

Note: Having multiple rows with an identical primary key can cause your queries to return incorrect results, as Amazon Redshift assumes that your primary keys are unique and valid.
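Since Redshift will not enforce uniqueness for you, a simple sanity check (a sketch, using the products table from our running example) is to periodically look for duplicated primary keys:

```sql
-- Find primary key values that appear more than once in the replicated table
SELECT id, COUNT(*) AS copies
FROM products
GROUP BY id
HAVING COUNT(*) > 1;
```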

Distribution key

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node slices according to the table's distribution key. A good distribution key allows for an even distribution of data across all nodes and minimizes data movement during query execution.

Usually, the table’s primary key is a good candidate for Redshift’s distribution key. If your MySQL table has a compound primary key, Amazon suggests using the first column of the primary key as your distribution key.

Sort keys

Since this solution uses MySQL’s binlog for replication, the column used most often for ordering in your queries is the primary key, which is heavily used when table rows get updated. By defining it as the table’s sort key, you can optimize those queries’ performance.
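Putting the key choices together, the two Redshift tables for our products example might be created as follows. This is a sketch: the products_binlog name follows the binlog table shown later in this post, the columns are simplified to those used by the consolidation step, and the VARCHAR/DECIMAL sizes are assumptions.

```sql
-- Reconstructed table: holds the latest state of the MySQL products table
CREATE TABLE products (
    id            BIGINT,
    name          VARCHAR(255),
    price         DECIMAL(10,2),
    -- extra replication columns, shared with the binlog table
    log_position  BIGINT,
    deleted       BOOLEAN,
    PRIMARY KEY (id)          -- a planner hint only; not enforced
)
DISTKEY (id)                  -- distribute rows by primary key
SORTKEY (id);                 -- consolidation partitions and orders by id

-- Binlog table: holds all updates since the last consolidation
CREATE TABLE products_binlog (
    id            BIGINT,
    name          VARCHAR(255),
    price         DECIMAL(10,2),
    log_position  BIGINT,
    deleted       BOOLEAN
)
DISTKEY (id)
SORTKEY (id, log_position);
```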

Step 2 - Initial dump and load

After you have created the destination Redshift tables, you need to perform the initial dump from MySQL and load it into your Redshift table. All subsequent updates to the table will be applied using data coming from the binlog stream.

First, you need to make sure that your database is configured for row-based binlog (we’ll describe the different binlog formats and explain why we chose row-based later in this post). It’s important to set the binlog format before the initial dump: to avoid data loss or inconsistencies, you need to know exactly at what moment you took the table’s “snapshot” and to have all of the updates that followed.

You can find detailed instructions for configuring your MySQL for row-based binlog in Alooma’s documentation.

Back up the relevant tables and get the binlog position. This is required because the binlog replication sends events from a specific point in time. This step requires a user with the following permissions: SELECT, LOCK TABLES.

Lock the tables. Important - do not close the session when locking the tables, in order to be able to release the lock:

mysql> FLUSH TABLES WITH READ LOCK;

In a different session, dump your table to disk (e.g. using mysqldump), and get the current binlog position by executing:

mysql> SHOW MASTER STATUS;

Save the output, which should look like:

+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000003 |     1507 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

From the session where you locked the tables, unlock them:

mysql> UNLOCK TABLES;

Now that you have your initial dump, you load it to Amazon Redshift.

Step 3 - Continuous binlog streaming

The binlog holds an ordered log of all the updates of the MySQL table. You can stream it continuously to a binlog table in Redshift and use it to recreate the original MySQL table in Redshift.

MySQL has three binlog formats:

STATEMENT - every query that refers to the table is documented in a binlog line. Even if multiple table rows are affected by a query, only one line will be written to the binlog. For example, a single query that updates all the rows of the table will appear as a single entry in the binlog.

ROW - every update to a table row is documented in a binlog line. If multiple rows are affected by a single query, a line will be written to the binlog for every affected table row. For example, a single query that updates all the rows of the table will create an entry for each row in the binlog.

MIXED - both statements and row updates are written to the binlog.

We recommend using the ROW format, since it allows you to directly stream and load the binlog into the binlog table in Redshift. The reason is that to reconstruct the state of a given row from a statement-based binlog, you would have to keep track of the state of the entire table, whereas with a row-based binlog you only need to track the state of individual rows.

For example, the binlog table for a “products” table could look like this:

| timestamp | type | log_file | log_position | id | name | price | deleted |
|------------|--------|------------|--------------|----|-----------|-------|---------|
| 1449397348 | insert | OFF.003935 | 212230965 | 1 | Beard oil | 1.00 | false |
| 1449397348 | insert | OFF.003935 | 212230966 | 2 | Balm | 2.00 | false |
| 1452173072 | update | OFF.003935 | 212230967 | 1 | Beard oil | 0.99 | false |

Step 4 - Consolidation

As we mentioned above, the consolidation table stores the latest reconstructed state of your MySQL table, while the binlog table stores all the updates ever since. Every few minutes (or hours, depending on your specific needs and the size of the table) you can run a consolidation query on Redshift that updates the reconstructed table using the binlog table, and clears the binlog table.

The consolidation process first unions the consolidated table together with the binlog table, and then extracts the latest state of each row. The result is a reconstructed one-to-one version of the original MySQL table, since every row in the new table represents the most recently updated version of the corresponding row in MySQL.
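As a sketch, assuming both tables share the simplified column layout from the table-creation step and that log_position increases monotonically (i.e. within a single binlog file), the consolidation could look like this:

```sql
-- Rebuild the latest state of every row: union the current consolidated
-- table with the binlog table, keep the newest version of each id,
-- and drop rows whose latest operation was a delete.
CREATE TABLE products_new AS
SELECT id, name, price, log_position, deleted
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY log_position DESC) AS rn
    FROM (
        SELECT id, name, price, log_position, deleted FROM products
        UNION ALL
        SELECT id, name, price, log_position, deleted FROM products_binlog
    ) u
) ranked
WHERE rn = 1 AND deleted = false;

-- Swap the new table in atomically
BEGIN;
DROP TABLE products;
ALTER TABLE products_new RENAME TO products;
COMMIT;
```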

Near real-time consolidation

You can also define the consolidation query as a Redshift view. Such a query will take longer to run than a query against the consolidated table, but analysts and data applications that query the view will always see the up-to-date version of the replicated table.

Common pitfalls to avoid

There are many pitfalls to avoid when building a MySQL to Amazon Redshift replication pipeline, and we’ve already touched on some of them above. Below is a more exhaustive list we curated. We will expand on how we handle more of them in future posts.

Conversion between MySQL and Redshift types - As mentioned above, MySQL and Amazon Redshift have slightly different column types. Column types should be chosen wisely to avoid data loss.

Choosing the right primary key, distribution key and sort keys - MySQL and Amazon Redshift have different key definitions. Defining the Redshift keys correctly can greatly affect performance, both during consolidation and analysis.

Update of a primary key column of a row - If an update query modifies the primary key column of a certain row, the consolidation process described above will not work, as it will not identify that the update relates to the same row (since we partition by the primary key). We might discuss how we handle this case in a later post.

Table structure modifications - When columns are added, deleted or modified in the original MySQL table, the Redshift queries need to adapt accordingly to avoid breaking the pipeline and losing data.

Logs arriving out of order - Keeping the order of the logs is not always easy at large scale, but it is critical for correctly replicating the table. Your data pipeline should be built to preserve ordering as much as possible, but also to handle cases where logs arrive out of order. For example, a row update might arrive after a delete (even though the original order was update and then delete). If not handled correctly, this might cause the deleted row to reappear in the replicated table.

Downtime when consolidating - The consolidation process can take a couple of minutes (or more, depending on the size of the table and the binlog updates). How can you minimize the downtime for the consumers who query the Redshift table?

Deleting the binlog - While consolidation runs, records continue to be loaded into the binlog table. When you clear the binlog table, you need to make sure not to delete records that haven’t been consolidated yet.

Error handling - Certain components of your pipeline will have errors from time to time. Your pipeline needs to be able to handle network errors, Redshift loading errors and more without data loss. You should also set up monitoring that alerts you when such errors occur.
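For the binlog-deletion pitfall specifically, one common safeguard (a sketch, reusing the names from our running example) is to capture a watermark before consolidating and delete only up to it, so that rows loaded during consolidation survive until the next run:

```sql
-- 1. Capture a watermark of the newest binlog row that will be consolidated
SELECT MAX(log_position) FROM products_binlog;
--    (feed this value into the consolidation query as an upper bound)

-- 2. After consolidation succeeds, delete only up to the watermark;
--    rows that arrived in the meantime stay for the next run
DELETE FROM products_binlog
WHERE log_position <= 212230967;  -- the watermark captured in step 1
```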

Should I build my own solution or use a third-party solution?

It is definitely possible to build your own MySQL to Redshift solution. It especially makes sense to do so when starting out with small data volumes using a full dump and load approach. However, if your solution requires high throughput, high availability, low latency or frequent schema changes, then you’re probably better off using a third-party solution.

As a rule of thumb, once you have tables that are bigger than 1GB or once you have more than a couple million daily updates to your tables, third party solutions start to become the more attractive alternative. For instance, when dealing with tons of data flowing through your system, the Alooma Data Pipeline Restream Queue feature ensures that you never lose an event and all errors are caught in order to be re-streamed, saving valuable engineering time.

Need more help building your MySQL to Redshift solution? Contact us and tell us about your use-case.