
Users are signing up in droves and your app is producing tons of data. MySQL has worked well as a production database, but your analysis queries are starting to run slowly. So you decide to test out Redshift as a data warehouse.

Before you can start testing Redshift, you need to move your data from MySQL into Redshift.

Creating an IAM User

The easiest way to get data into Redshift begins with uploading CSVs to Amazon S3. In the AWS Identity and Access Management (IAM) console, create a user with access to an S3 bucket.

Save the new user's Access Key ID and Secret Access Key. Redshift needs these keys to retrieve files from your S3 bucket.

Now your user needs an inline policy to define what resources it can access. The JSON blob below is a sample policy we created with the policy generator for S3 resources.

Change “my-s3-bucket” in the JSON blob to your S3 bucket name and add it as a Custom Policy under Inline Policies.

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my-s3-bucket/*"],
      "Sid": "Stmt1234567"
    }
  ]
}

Then make sure your bucket and Redshift cluster are in the same region, or you’ll incur data transfer charges.

Once your bucket is ready, it’s time to create a table.

Creating the Table

Redshift doesn’t support as many column types as most row-store databases. For supported types and their value ranges, take a look at Redshift’s documentation on data types.

With a simple table, the column type translation is pretty straightforward. For JSON, BINARY, and other column types that Redshift doesn't support, you can store the values as NULL to keep the schemas consistent, or as varchar(max) if you need the data.
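As a starting point, the translation can be captured in a small lookup table. The mapping below is an illustrative sketch, not an official translation table; the specific choices (including sending unsupported types to varchar(max)) are assumptions you should check against Redshift's data type documentation for your own schema.

```python
# A rough MySQL -> Redshift column type mapping (illustrative
# assumptions -- verify each entry against Redshift's documented types).
MYSQL_TO_REDSHIFT = {
    "tinyint":   "smallint",          # Redshift has no 1-byte integer
    "smallint":  "smallint",
    "mediumint": "integer",
    "int":       "integer",
    "bigint":    "bigint",
    "float":     "real",
    "double":    "double precision",
    "decimal":   "decimal",
    "date":      "date",
    "datetime":  "timestamp",
    "timestamp": "timestamp",
    "char":      "char",
    "varchar":   "varchar",
    "text":      "varchar(max)",      # no unbounded text type in Redshift
    "json":      "varchar(max)",      # unsupported: keep raw, or NULL it out
    "blob":      "varchar(max)",      # unsupported: same trade-off as JSON
}

def translate(mysql_type):
    """Return a Redshift type for a MySQL type, defaulting to varchar(max)."""
    return MYSQL_TO_REDSHIFT.get(mysql_type.lower(), "varchar(max)")
```

A helper like this makes it easy to generate the Redshift CREATE TABLE from a MySQL information_schema dump.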

Dumping from MySQL

To get our data out of MySQL, we start by escaping control characters and delimiters with a slash, and separate fields with a comma.

Note that with these escape characters, MySQL will output NULLs as \N. Use mysqldump to get data out of MySQL.

mysqldump --tab=. --fields-escaped-by=\\ --fields-terminated-by=, dbname tablename

This creates a dump file. Upload your file to S3, create the table in Redshift, and load the data using the following command:

COPY schema.table FROM 's3://path/to/dump.csv' WITH CREDENTIALS 'aws_access_key_id=<your-access-key-goes-here>;aws_secret_access_key=<your-secret-key-goes-here>' NULL as '\\N' ESCAPE;
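If you ever need to produce a compatible file from application code rather than mysqldump, the escaping convention above (a backslash before the delimiter, backslashes, and newlines, with NULL written as \N) can be sketched in Python. The function names here are made up for illustration:

```python
def escape_field(value):
    """Escape one field roughly the way mysqldump does with
    --fields-escaped-by=\\ --fields-terminated-by=,
    None becomes \\N; backslash, comma, and newline get a
    backslash prefix so Redshift's ESCAPE option can undo it."""
    if value is None:
        return "\\N"
    out = []
    for ch in value:
        if ch in ("\\", ",", "\n"):
            out.append("\\")
        out.append(ch)
    return "".join(out)

def escape_row(row):
    """Join escaped fields with the comma delimiter."""
    return ",".join(escape_field(f) for f in row)
```

This is a rough emulation of mysqldump's behavior, not a byte-for-byte reimplementation; mysqldump also handles NUL bytes and enclosure characters.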

Dumping from Postgres

Postgres can dump CSV-like files from psql using \copy, which writes a local file. Just run this command in psql:

\copy table to 'filename' csv header null as '\N'

Upload your dump file to S3, create the table in Redshift, and load the data with the following command:

COPY schema.table FROM 's3://path/to/dump.csv' WITH CREDENTIALS 'aws_access_key_id=<your-access-key-goes-here>;aws_secret_access_key=<your-secret-key-goes-here>' CSV;
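If you're issuing these COPY statements from code, a small helper keeps the credentials string out of hand-assembled SQL literals. This is a minimal sketch; the function and parameter names are made up, and in production you'd want IAM-role-based credentials rather than inline keys:

```python
def build_copy(table, s3_path, access_key, secret_key, options="CSV"):
    """Assemble a Redshift COPY statement. `options` is the trailing
    format clause: "CSV" for the Postgres dump above, or
    "NULL as '\\N' ESCAPE" for the mysqldump-style file."""
    creds = "aws_access_key_id={};aws_secret_access_key={}".format(
        access_key, secret_key)
    return "COPY {} FROM '{}' WITH CREDENTIALS '{}' {};".format(
        table, s3_path, creds, options)
```

You can then run the returned string through your usual Redshift client connection.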

stl_load_errors

As you load tables, you might run into an error or two. The Redshift stl_load_errors table contains most of the recent errors that occurred during a COPY. Since stl_load_errors is a very wide table, we recommend you use \x auto to enable the extended display.

For example, this handy query will show you the most recent two errors:

select starttime, filename, err_reason, line_number,
       colname, type, col_length, position,
       raw_field_value, raw_line, err_code
from stl_load_errors
order by starttime desc
limit 2;

If you don’t see your COPY command in the results, check the Loads tab in the AWS console or try again. The errors are fairly useful, but if you don’t see what you’re looking for, look up the err_code value in Amazon’s documentation.

Most of the bad data format issues can be resolved by going through the options in the Having Trouble? section below. Some errors, though, point to deeper problems in the file.

Any of the no delimiter or missing column errors can indicate an unescaped newline or control character. Truncated lines in the dump file can indicate an unescaped NUL character, which Redshift cannot process, even in quotes.
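One way to catch these before COPY even runs is to scan the dump file for NUL bytes and for lines whose field count doesn't match the table. This is a crude heuristic (the naive comma split doesn't honor backslash escapes or quoted fields), but it's often enough to locate the offending line:

```python
def find_bad_lines(path, expected_fields):
    """Flag lines containing NUL bytes, or with an unexpected number of
    comma-separated fields (naive split: ignores escapes and quoting).
    Returns a list of (line_number, reason) tuples."""
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if b"\x00" in raw:
                bad.append((lineno, "NUL byte"))
            elif raw.rstrip(b"\n").count(b",") + 1 != expected_fields:
                bad.append((lineno, "field count mismatch"))
    return bad
```

Pairing the line numbers this reports with the line_number column in stl_load_errors narrows things down quickly.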

Having Trouble?

You may run into the following gotchas while loading:

For invalid characters, add ACCEPTINVCHARS to the COPY command. ACCEPTINVCHARS replaces each invalid byte with a “?”, so the byte length is unchanged.

For out-of-range datetimes, use ACCEPTANYDATE. These are common when exporting from MySQL, since “0000-00-00 00:00:00” is a valid datetime there, but not in Redshift.

If your CSV has column headers, use IGNOREHEADER 1.

For out-of-range numbers, NUL (‘\0’) characters, or other data that Redshift cannot ingest, you have to fix it at the source.
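To preview what ACCEPTINVCHARS would do to a suspect field, here is a rough Python emulation of the per-byte replacement. This is an approximation of the documented behavior, not Redshift's actual code:

```python
def accept_inv_chars(raw, replacement="?"):
    """Roughly emulate ACCEPTINVCHARS: decode UTF-8, substituting each
    invalid byte with a one-byte replacement character so the field's
    byte length is preserved."""
    # Python's UTF-8 decoder with errors="replace" emits one U+FFFD
    # per invalid byte; swap each for the single-byte replacement.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", replacement)
```

Running your raw dump fields through this shows exactly which values COPY would alter.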

Now you’ve got your data in Redshift. Happy testing!