In this post I'll walk through extracting data from a database of one million contemporary popular music tracks, loading them into AWS Redshift via S3 and running some simple analysis via PostgreSQL's interactive terminal psql.

Installing Requirements The following requirements were all installed on a fresh Ubuntu 15.10 installation. $ sudo apt update $ sudo apt install \ python-pip \ python-virtualenv \ postgresql-client-9.4 \ sqlite3 \ parallel For the AWS CLI tool and s3cmd I'll install them via pip into a virtual environment. $ virtualenv redshift $ source redshift/bin/activate $ pip install \ awscli \ https://github.com/s3tools/s3cmd/archive/v1.6.1.zip#egg = s3cmd

Fetching the Metadata The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The dataset is the result of a collaboration between The Echo Nest and LabROSA at Columbia University, supported in part by the NSF. Within this dataset there are several subsets of data. The one I'm interested in is the million track metadata extract in SQLite3 format. The following will download it from Columbia University's Servers. $ wget -c http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db

Exploring the Metadata Before I load the data into Redshift I'll first explore the shape of the data and value distributions. $ sqlite3 track_metadata.db The database has a single table within it. Most of the columns aren't very granular in terms of data types so I'll need to explore how small I can make each column when designing the Redshift table schema later on. sqlite > . schema CREATE TABLE songs ( track_id text PRIMARY KEY , title text , song_id text , release text , artist_id text , artist_mbid text , artist_name text , duration real , artist_familiarity real , artist_hotttnesss real , year int , track_7digitalid int , shs_perf int , shs_work int ); ... There are indeed one million records within this table. sqlite > select count ( * ) from songs ; 1000000 I'll examine a sample record to get an idea of the nature of the data stored. sqlite > . mode line sqlite > SELECT * FROM songs LIMIT 1 ; track_id = TRMMMYQ128F932D901 title = Silent Night song_id = SOQMMHC12AB0180CB8 release = Monster Ballads X-Mas artist_id = ARYZTJS1187B98C555 artist_mbid = 357ff05d-848a-44cf-b608-cb34b5701ae5 artist_name = Faster Pussy cat duration = 252.05506 artist_familiarity = 0.649822100201 artist_hotttnesss = 0.394031892714 year = 2003 track_7digitalid = 7032331 shs_perf = -1 shs_work = 0 I'll then check the cardinatily of the track and song identifier columns. The track_id column is unique where a song_id can have multiple track_ids. sqlite > SELECT count ( * ) cnt , track_id FROM songs GROUP BY track_id ORDER BY cnt desc LIMIT 1 ; cnt = 1 track_id = TRAAAAK128F9318786 sqlite > SELECT count ( * ) cnt , song_id FROM songs GROUP BY song_id ORDER BY cnt DESC LIMIT 1 ; cnt = 3 song_id = SOBPAEP12A58A77F49 I'll see how long the various identifier columns are and the maximum values of the two integer fields I'm interested in. Knowing these values will help me pick the smallest column types for the Redshift table schema later on. sqlite > . mode csv sqlite > . headers off sqlite > SELECT MAX ( LENGTH ( artist_id )) FROM songs ; 18 sqlite > SELECT MAX ( LENGTH ( artist_familiarity )) FROM songs ; 16 sqlite > SELECT MAX ( LENGTH ( artist_hotttnesss )) FROM songs ; 15 sqlite > SELECT MAX ( LENGTH ( track_id )) FROM songs ; 18 sqlite > SELECT MAX ( LENGTH ( duration )) FROM songs ; 10 sqlite > SELECT MAX ( year ) FROM songs ; 2011 sqlite > select MAX ( duration ) FROM songs ; 3034 . 90567

Exporting Data of Interest I've created a query to pull six fields of interest from every record the table and I'll export that data out into a CSV file. $ sqlite3 track_metadata.db <<! .headers off .mode csv .output track_metadata.csv SELECT track_id, artist_id, artist_familiarity, artist_hotttnesss, duration, year FROM songs ; !

Split and Compress The fastest way to load data into Redshift is by breaking up your data into separate files and compressing them. Below I'll break the one-million-row CSV file up into twenty CSV files of 50K rows each and compress them. $ split -l 50000 \ --additional-suffix = .csv \ track_metadata.csv \ track_metadata_ $ gzip track_metadata_* The resulting GZIP files are about 1.8MB each in size. $ ls -lh track_metadata*.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_aa.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ab.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ac.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ad.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ae.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_af.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ag.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ah.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ai.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_aj.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ak.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_al.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_am.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_an.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ao.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ap.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_aq.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_ar.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_as.csv.gz -rw-rw-r-- 1 mark mark 1 .8M Jan 23 08 :48 track_metadata_at.csv.gz

Uploading in Parallel to S3 You want to make sure your S3 bucket is created in the same region that you'll be launching your Redshift instance in. Amazon's us-east-1 region has the cheapest price on Redshift clusters so I'll use that region to create my S3 bucket in. $ s3cmd --configure ... Default Region [ US ] : US ... $ s3cmd mb s3://track_metadata_example Bucket 's3://track_metadata_example/' created The following will use GNU's parallel command to upload eight files at a time to Amazon S3. $ find track_metadata_*gz | \ parallel -j8 \ s3cmd put { / } s3://track_metadata_example/ Redshift needs a manifest of the files we'll be loading in. $ vi songs.manifest { "entries" : [ { "url" : "s3://track_metadata_example/track_metadata_aa.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ab.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ac.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ad.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ae.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_af.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ag.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ah.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ai.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_aj.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ak.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_al.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_am.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_an.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ao.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ap.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_aq.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_ar.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_as.csv.gz" , "mandatory" : true }, { "url" : "s3://track_metadata_example/track_metadata_at.csv.gz" , "mandatory" : true } ] } The manifest itself will also need to live on S3. $ s3cmd put songs.manifest s3://track_metadata_example/

Launching the Redshift Cluster The cheapest Redshift cluster I could find on Amazon's pricing page was a dc1.large in us-east-1 at $0.25 / hour. It comes with 2 vCPUs, 15 GB of RAM, 0.16 TB of SSD-backed storage and throughput support of 0.20GB / second. Below I'll set the environment variables for my AWS access credentials and set the master username and password for my Redshift instance. $ read AWS_ACCESS_KEY_ID $ read AWS_SECRET_ACCESS_KEY $ export AWS_ACCESS_KEY_ID $ export AWS_SECRET_ACCESS_KEY $ read MASTER_USERNAME $ read MASTER_PASSWORD $ export MASTER_USERNAME $ export MASTER_PASSWORD Before creating the Redshift cluster make sure the AWS CLI tool's default region is the same region your S3 bucket is located in. You can run the configure command to update any settings and ensure they're as you expect. $ aws configure ... Default region name [ None ] : us-east-1 ... This will create a Redshift cluster with your default security group. $ aws redshift create-cluster \ --db-name songs \ --cluster-type single-node \ --node-type dc1.large \ --master-username $MASTER_USERNAME \ --master-user-password $MASTER_PASSWORD \ --publicly-accessible \ --cluster-identifier song-data \ --availability-zone us-east-1a { "Cluster" : { "ClusterVersion" : "1.0" , "AvailabilityZone" : "us-east-1a" , "NodeType" : "dc1.large" , "PubliclyAccessible" : true , "Tags" : [], "MasterUsername" : "mark" , "ClusterParameterGroups" : [ { "ParameterGroupName" : "default.redshift-1.0" , "ParameterApplyStatus" : "in-sync" } ], "Encrypted" : false , "ClusterSecurityGroups" : [ { "Status" : "active" , "ClusterSecurityGroupName" : "default" } ], "AllowVersionUpgrade" : true , "VpcSecurityGroups" : [], "NumberOfNodes" : 1 , "AutomatedSnapshotRetentionPeriod" : 1 , "ClusterStatus" : "creating" , "ClusterIdentifier" : "song-data" , "DBName" : "songs" , "PreferredMaintenanceWindow" : "mon:09:00-mon:09:30" , "PendingModifiedValues" : { "MasterUserPassword" : "****" } } } Once you execute that command Amazon will set about creating the cluster. In my experience it usually takes a few minutes. I tend to run a watch command to keep an eye on the cluster status and wait till I can see the cluster's endpoint address and see the status set to "available". $ watch -n10 aws redshift describe-clusters { "Clusters" : [ { "PubliclyAccessible" : true , "MasterUsername" : "mark" , "VpcSecurityGroups" : [], "ClusterPublicKey" : "..." , "NumberOfNodes" : 1 , "PendingModifiedValues" : {}, "ClusterVersion" : "1.0" , "Tags" : [], "AutomatedSnapshotRetentionPeriod" : 1 , "ClusterParameterGroups" : [ { "ParameterGroupName" : "default.redshift-1.0" , "ParameterApplyStatus" : "in-sync" } ], "DBName" : "songs" , "PreferredMaintenanceWindow" : "mon:09:00-mon:09:30" , "Endpoint" : { "Port" : 5439 , "Address" : "song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com" }, "AllowVersionUpgrade" : true , "ClusterCreateTime" : "2016-01-23T16:59:38.615Z" , "ClusterSecurityGroups" : [ { "Status" : "active" , "ClusterSecurityGroupName" : "default" } ], "ClusterIdentifier" : "song-data" , "ClusterNodes" : [ { "NodeRole" : "SHARED" , "PrivateIPAddress" : "10.234.73.99" , "PublicIPAddress" : "54.157.227.115" } ], "AvailabilityZone" : "us-east-1a" , "NodeType" : "dc1.large" , "Encrypted" : false , "ClusterRevisionNumber" : "1019" , "ClusterStatus" : "available" } ] } In the above you can see the security group is set to my default one. Within that group I've allowed my local IP address to connect on port 5439. I tend to set my security groups up via the AWS console rather than the CLI tool so I don't have a CLI example to copy-and-paste. Once you've enabled access from your IP address to the Redshift cluster via port 5439 you'll be able to access it via the PostgreSQL interactive terminal. Below I'm using version 9.4.5. I've seen 9.3 connect to Redshift just fine as well. $ psql --version psql ( PostgreSQL ) 9 .4.5 $ PGPASSWORD = $MASTER_PASSWORD \ psql -h song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com \ -p 5439 \ -U $MASTER_USERNAME songs

Loading Compressed CSV data into Redshift I'll create an exploratory schema for the track data to load into. CREATE TABLE tracks ( track_id VARCHAR ( 18 ) NOT NULL DISTKEY ENCODE LZO , artist_id VARCHAR ( 18 ) NOT NULL ENCODE LZO , artist_familiarity DECIMAL ( 16 , 15 ) ENCODE MOSTLY8 , artist_hotttnesss DECIMAL ( 16 , 15 ) ENCODE MOSTLY8 , duration DECIMAL ( 12 , 8 ) ENCODE MOSTLY8 , year SMALLINT ENCODE MOSTLY8 , primary key ( track_id ) ) sortkey ( year ); I'll then issue a command to load in the data from the GZIP'ed CSV files on S3. This command has place holders where your AWS access key identifier and secret access key need to go. Please change these before executing this command. COPY tracks FROM 's3://track_metadata_example/songs.manifest' CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' DELIMITER ',' EMPTYASNULL ESCAPE GZIP MANIFEST MAXERROR 1000 REMOVEQUOTES TRIMBLANKS TRUNCATECOLUMNS ; When I ran this command 77MB of data (when uncompressed) was loaded in 9.92 seconds (~7.76 MB / second). The networking overhead of fetching so many small files probably contributed to the slow performance of this operation. INFO: Load into table 'tracks' completed, 1000000 record(s) loaded successfully. COPY Time: 9921.177 ms I'll then run the VACUUM command to make sure the data is sorted properly. VACUUM tracks ;

Querying data on Redshift Now that the data is loaded I can run some SQL commands to explore the data. SELECT year , AVG ( duration ) FROM tracks WHERE year > 1970 GROUP BY year ORDER BY year ; year | avg ------+-------------- 1971 | 245.59519935 1972 | 249.72693913 1973 | 250.97967923 1974 | 248.27482999 1975 | 258.27489716 1976 | 255.10848195 1977 | 250.99925862 1978 | 245.41075019 1979 | 245.40586793 1980 | 237.43758797 1981 | 238.90122831 1982 | 232.92932395 1983 | 239.75413748 1984 | 243.27543189 1985 | 250.00972298 1986 | 244.71989363 1987 | 237.88305477 1988 | 242.14960836 1989 | 241.54657340 1990 | 240.88123135 1991 | 242.51786532 1992 | 245.00212516 1993 | 246.61140532 1994 | 246.56816199 1995 | 250.44964826 1996 | 247.59096758 1997 | 252.23722730 1998 | 252.05911886 1999 | 251.08285731 2000 | 251.41572952 2001 | 254.42115771 2002 | 249.59224630 2003 | 246.78189993 2004 | 248.30687510 2005 | 245.18002133 2006 | 247.90531512 2007 | 249.69939469 2008 | 250.91283764 2009 | 249.94171984 2010 | 250.41501357 2011 | 185.05098000 (41 rows)