Redshift Elastic Resize: The Good And The Not-so-good

Bhuvanesh · Apr 8

We were very excited when AWS announced that Elastic Resize can resize a cluster, including changing its node type (e.g. dc2.xlarge to dc2.8xlarge), in 10–15 minutes, with perhaps 2–3 minutes of downtime while the endpoints and IPs are swapped. This is awesome, right? We were thrilled and wanted to see how the resize progresses and how much it interrupts running queries. Recently we used Elastic Resize to move a 16 TB cluster from 8 ds2.xlarge nodes to 2 ds2.8xlarge nodes. Here are the good and not-so-good things about it.

How Does Elastic Resizing Work?

Credits: AWS

1. Redshift receives the resize request and prepares for it.
2. It adds the required number of nodes, with storage, using the newly configured node type (or the existing type, if only the node count changes).
3. It distributes the configuration to all the new nodes.
4. There is a short downtime while the endpoints are swapped. Existing connections are put on hold, and no new connections are accepted.
5. The snapshots in S3 are updated (like a new incremental snapshot).
6. The cluster becomes available for reads and writes.
7. It starts serving data from S3.
8. It starts moving the data from S3 into Redshift storage.

How we did the resize:

Generally, it is a best practice to take a snapshot before any major maintenance activity. After that, we went to the console, selected the Resize option, and then picked the elastic resize option.

Node type: ds2.8xlarge

Nodes: 2

Then we clicked Resize.
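We did all of this from the console, but the same two steps (snapshot first, then elastic resize) can also be scripted. Here is a minimal sketch, assuming `client` is a boto3 Redshift client (for example `boto3.client("redshift")`). The function name and the snapshot identifier are made up for illustration, and this is not what we actually ran.

```python
def snapshot_then_elastic_resize(client, cluster_id, snapshot_id,
                                 node_type, number_of_nodes):
    """Take a manual snapshot, wait for it, then request an elastic resize.

    `client` is assumed to be a boto3 Redshift client; it is passed in
    so the function can be exercised without touching a real cluster.
    """
    # Safety net first: a manual snapshot of the current cluster.
    client.create_cluster_snapshot(
        SnapshotIdentifier=snapshot_id,
        ClusterIdentifier=cluster_id,
    )
    # Block until the snapshot is usable before changing the cluster.
    client.get_waiter("snapshot_available").wait(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )
    # Classic=False asks for an elastic resize rather than a classic one.
    return client.resize_cluster(
        ClusterIdentifier=cluster_id,
        ClusterType="multi-node",
        NodeType=node_type,
        NumberOfNodes=number_of_nodes,
        Classic=False,
    )


# Example call (hypothetical snapshot name), matching the settings above:
# import boto3
# snapshot_then_elastic_resize(boto3.client("redshift"), "my-cluster",
#                              "pre-resize-snapshot", "ds2.8xlarge", 2)
```

Waiting for the snapshot before resizing is a deliberate choice here: if anything goes wrong during the resize, there is a known-good restore point that you created yourself.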

What happened after this?

10:00 PM - Resize activity started.

10:02 PM - I already had a session open in my Redshift client, so I ran a simple query (select sysdate), but it kept running for a long time. I then opened a new session and ran the same query there, with no luck.

10:05 PM - Got the event notification from Redshift.

A resize for Amazon Redshift cluster 'my-cluster' was started at 2020-04-08 16:35 UTC. The cluster will be in read-only mode while resizing is in progress.

10:06 PM - I ran the query again and this time it executed, but writes didn’t work.

10:09 PM - The cluster was down.

DB=# select current_timestamp;

SSL SYSCALL error: EOF detected

The connection to the server was lost. Attempting reset: Failed.

!>

Cluster 'resize-354529-target' began restart at 2020-04-08 16:39 UTC.

10:10 PM - Restart done.

Cluster 'resize-354529-target' completed restart at 2020-04-08 16:40 UTC.

10:24 PM - After this, the cluster was fine for some time, but about 15 minutes later it went down again. The events showed another restart.

Cluster 'my-cluster' began restart at 2020-04-08 16:54 UTC.

Cluster 'my-cluster' completed restart at 2020-04-08 16:55 UTC.

10:25 PM - Finally, I got the success notification in the events. Until then, the console showed the cluster as Resizing. It took 25 minutes for the cluster to reach the available state.

The resize for Amazon Redshift cluster 'my-cluster' completed at 2020-04-08 16:55 UTC, and the cluster is available for reads and writes. The resize was initiated at 2020-04-08 16:30 UTC and took 0 hours 24 minutes to complete.
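To line the event timestamps (UTC) up with the wall-clock times in the timeline, a little arithmetic helps. The sketch below uses only the timestamps quoted in the notifications above. The UTC+05:30 offset is an assumption, inferred from 10:00 PM matching the 16:30 UTC initiation time. Because the timestamps are minute-granular, the total can be off by a minute from what the notification itself reports (24 minutes).

```python
from datetime import datetime, timedelta, timezone

# Event timestamps copied from the notifications above (all UTC).
FMT = "%Y-%m-%d %H:%M"
EVENTS = {
    "resize_initiated": "2020-04-08 16:30",
    "first_restart_began": "2020-04-08 16:39",
    "first_restart_done": "2020-04-08 16:40",
    "second_restart_began": "2020-04-08 16:54",
    "second_restart_done": "2020-04-08 16:55",
    "resize_completed": "2020-04-08 16:55",
}

# Assumption: the timeline's wall-clock times are UTC+05:30.
LOCAL = timezone(timedelta(hours=5, minutes=30))


def parse_utc(text):
    """Parse a minute-granular event timestamp as an aware UTC datetime."""
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start_key, end_key):
    """Whole minutes elapsed between two named events."""
    delta = parse_utc(EVENTS[end_key]) - parse_utc(EVENTS[start_key])
    return int(delta.total_seconds() // 60)


def local_clock(key):
    """Render an event time as a local 12-hour clock string."""
    return parse_utc(EVENTS[key]).astimezone(LOCAL).strftime("%I:%M %p")


total = minutes_between("resize_initiated", "resize_completed")
restart_downtime = (
    minutes_between("first_restart_began", "first_restart_done")
    + minutes_between("second_restart_began", "second_restart_done")
)
print(total, restart_downtime, local_clock("first_restart_began"))
```

Note that the restart windows only cover the two restarts reported in the events. They do not include the time when queries were hanging or writes were rejected, so the real disruption was longer than this number suggests.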

The Good Things:

- Compared to the classic resize, it is better and saves time.
- The new ability to change the node type is another time saver.
- The endpoint and the leader node IP are retained. (We missed noting down the compute nodes’ IPs.)
- No config changes are needed: sit back and carry on with your routine processes.

The Not-so-good Things and the Workarounds:

- The console says the resize takes 10–15 minutes and that the cluster is in read-only mode until it finishes. For us it took 25 minutes, possibly because of the amount of data. During that time queries were erratic: sometimes they ran, sometimes they did not. It is safer to plan a 1-hour maintenance window for this.
- You are going to lose many of your system tables and views, such as the STL tables and views. This is a big drawback: we did the resize to compare performance before and after, and now there is nothing left to validate against. Take a backup of all your important system tables and views and export them to S3 beforehand.
- The old cluster’s monitoring metrics are no longer available on the Redshift console (the disk metric may still be there). You can get the monitoring metrics from the CloudWatch console, but query history, query plans and so on are gone unless you manually export them to S3 before the activity.
- Database statistics are lost. Run vacuum and analyze after the resize.
- I expected the snapshot taken during this activity to be available on the console, but I didn’t see it there. Only a snapshot taken after the activity, at 10:26 PM, was present. So take a snapshot before resizing.
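The export and the post-resize maintenance can be prepared ahead of time. Below is a minimal sketch that only builds the SQL text. The bucket prefix, the IAM role ARN and the choice of system tables are all placeholders (assumptions, not what we ran), and it assumes your user is allowed to UNLOAD from those tables. Run the resulting statements with whatever SQL client you normally use.

```python
# Hypothetical values: replace with your own bucket prefix and IAM role.
S3_PREFIX = "s3://my-bucket/redshift-pre-resize"
IAM_ROLE = "arn:aws:iam::123456789012:role/my-redshift-unload-role"

# Example system tables covering query history, query text and query plans.
SYSTEM_TABLES = ["stl_query", "stl_querytext", "stl_explain", "stl_wlm_query"]


def unload_statement(table, s3_prefix=S3_PREFIX, iam_role=IAM_ROLE):
    """Build an UNLOAD statement that copies one system table to S3."""
    return (
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO '{s3_prefix}/{table}/part_' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP ALLOWOVERWRITE;"
    )


def pre_resize_statements(tables=SYSTEM_TABLES):
    """One UNLOAD per system table, to run before starting the resize."""
    return [unload_statement(t) for t in tables]


# After the resize: re-sort and reclaim space, then rebuild statistics.
# VACUUM cannot run inside a transaction block, so use autocommit.
POST_RESIZE_STATEMENTS = ["VACUUM;", "ANALYZE;"]

for sql in pre_resize_statements() + POST_RESIZE_STATEMENTS:
    print(sql)
```

Keeping each table under its own S3 prefix makes it easier to load the files back into regular tables later and compare query runtimes from before and after the resize.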

These are initial observations and I am sure that this service will continue to improve. Any comments, clarifications — please leave a response.

Hope you found this useful!