S4QS-RS reads S3 object creation events from SQS queues and copies data from S3 to Redshift using COPY with S3 manifests (see Redshift's COPY documentation for more information.)

See the S3 documentation for more information about setting up the object creation events. Note that if you publish the events to SNS and then subscribe your SQS queue to the SNS topic, you must set the "Raw Message Delivery" subscription attribute to "True".

Data can be either copied into monolithic tables, or time series tables with configurable rotation periods and retention, and multiple rolling views with different periods. View creation is done so that adding or removing columns from time series tables doesn't break the view: missing columns are filled in with NULL as [missing_col_name] in the view queries.

More details about how S4QS-RS works in the configuration section.

Installation

npm install -g s4qs-rs

Configuration

S4QS-RS is configurable with JSON files located in either the ./config subdirectory of the current working directory, or at $NODE_CONFIG_DIR . ./config/default.json is always loaded and used for default settings, and $NODE_ENV is used to determine environment-specific configuration files: For example NODE_ENV=production would cause ./config/production.json to be loaded and merged with ./config/default.json .

NODE_ENV -specific configuration files will always override the default configuration. If ./config/default.json is, for example

{ " SQS " : { " params " : { " QueueUrl " : " https://sqs.us-east-1.amazonaws.com/12345/prod-queue " , " MaxNumberOfMessages " : 10 } } , " S3Copier " : { " other " : " stuff here " } }

and ./config/development.json is

{ " SQS " : { " params " : { " QueueUrl " : " https://sqs.us-east-1.amazonaws.com/12345/dev-queue " } } }

only SQS.params.QueueUrl would be overridden.

You can also use Javascript files as configuration files. See the config NPM package documentation for more information.

S4QS-RS defaults to NODE_ENV=development if NODE_ENV is omitted.

AWS credentials are loaded by the Node AWS SDK. See the SDK's documentation for more information. S4QS-RS needs IAM permissions to perform redshift:DescribeClusters .

Configuration structure as a JSON file

{ " SQS " : { " poller " : { " parallelPolls " : 5 , " pollIntervalSeconds " : 2 , " visibilityTimeoutUpdater " : { " visibilityTimeoutSeconds " : 300 , " visibilityUpdateIntervalSeconds " : 100 } , } , " region " : " us-east-1 " , " params " : { " WaitTimeSeconds " : 20 , " MaxNumberOfMessages " : 10 } } , " S3Copier " : { " healthCheck " : { " etcdServers " : [ " etcd:4001 " ] , " key " : " /s4qs/all_ok " , " matchAgainst " : " OK " } , " Redshift " : { " clusterAvailCheckInterval " : 300 , " connStr " : " postgres://username:password@example.com:5439/schema " , " region " : " us-east-1 " , " params " : { " ClusterIdentifier " : " mycluster " } } , " S3 " : { " params " : { " ACL " : " bucket-owner-full-control " } } , " manifestUploader " : { " maxToUpload " : 25 , " maxWaitSeconds " : 600 , " mandatory " : true , " bucket " : " manifest-bucket " , " prefix " : " s4qs/manifests/ " , " retries " : 5 } , " tablePostfix " : " _devel " , " copyParams " : { " maxParallel " : 2 , " schema " : " myschema " , " table " : " /s3: \/ \/ .*? \/ whatevs \/ (.*?) \/ /i " , " args " : [ " GZIP " , " TRUNCATECOLUMNS " ] , " withParams " : { " REGION " : " us-east-1 " , " MAXERROR " : 100 , " TIMEFORMAT " : " auto " } } , " tableConfig " : { " some_table_name " : { " SQS " : { " params " : { " QueueUrl " : " https://sqs.us-east-1.amazonaws.com/123456789/another-queue-name " , } } , " copyParams " : { " withParams " : { " JSON " : " s3://bukkit/jsonpaths_for_some_table_name.json " } } , " timestampCol " : " ts " , " deduplication " : { " dropOlderThan " : " 45 minutes " , " dedupSetAge " : " 2 hours " } , " columns " : [ " SOURCE INT NOT NULL ENCODE BYTEDICT " , " ID CHAR(24) PRIMARY KEY ENCODE LZO " , " ... " ] , " tableAttrs " : [ " DISTKEY(ID) " , " SORTKEY(ID) " ] } , " other_table_name " : { " SQS " : { " params " : { " QueueUrl " : " https://sqs.us-east-1.amazonaws.com/123456789/another-queue-name " , } } , " period " : 86400 , " maxTables " : 30 , " tablesInView " : [ 1 , 5 , 15 , 30 ] , " deduplication " : { " dropOlderThan " : " 30 minutes " , " dedupSetAge " : " 1 hours " } , " columns " : [ " AHOY INT PRIMARY KEY NOT NULL ENCODE LZO " , " DERR INT ENCODE DELTA " , " ... " ] , " tableAttrs " : [ " DISTKEY(AHOY) " , " SORTKEY(DERR) " ] } } } , " statsd " : { " prefix " : " s4qs. " , " host " : " statsd-server " , " port " : 1234 } , " zabbix " : { " server " : " zabbix.example.com " , " hostname " : " s4qs-rs " } , }

Environment variables

S4QS-RS uses the debug NPM package for outputting debug messages to stdout and error messages to stderr. To see any error messages, you must run S4QS-RS with the env DEBUG=s4qs-rs:*:error . To see debug messages, run with DEBUG=s4qs-rs:* See debug's documentation for more information.

You can map environment variables to configuration parameters by creating a file with the name ./config/custom-environment-variables.json . For example

{ " S3Copier " : { " Redshift " : { " connStr " : " RS_CONN " } } }

would map $RS_CONN to S3Copier.Redshift.connStr . Values set with environment variables always override configuration file values.

Running S4QS-RS

After setting up your configuration, you can run S4QS-RS with DEBUG=*:error s4qs-rs or DEBUG=sq4s-rs:* s4qs-rs to see debugging information.

S4QS-RS traps SIGTERM and SIGINT , and aborts in-flight transactions when terminated.

Errata