Architecture: AWS Kinesis + appbase.io

We will cover two up-and-coming technologies for building our streaming analytics pipeline.

AWS Kinesis — a realtime data ingestion service by AWS.

appbase.io — a database API for filtering streaming data with Elasticsearch’s rich query language.

We will live-tail the data as it is ingested by Kinesis into an ETL pipeline built with transporter, and sink it into appbase.io. Here’s what the data flow looks like:

Twitter <FS checkins> Kinesis <ETL> appbase.io <Live queries> UI

Step 1: Twitter < .. > AWS Kinesis

In Step 1, we will build a worker process that uses Twitter’s streaming API to filter Foursquare check-in tweets, resolves the location data with the Foursquare API, and ingests the results into AWS Kinesis.

Image: Filtering tweets related to swarmapp (aka Foursquare’s user facing app)

/* Fetching tweets like the above via Twitter’s streaming API */

const stream = T.stream('statuses/filter', { track: 'swarmapp', language: 'en' });
stream.on('tweet', (tweet) => { /* process each matching check-in tweet */ });
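Each matching tweet carries a Swarm short link that we later resolve against the Foursquare API. As a rough sketch (the helper name and URL pattern are our own assumptions, not part of the original worker), pulling that short ID out of a tweet’s text might look like:

```javascript
// Hypothetical helper: extract the Swarm short-link ID from a tweet's text.
// Check-in tweets typically embed a URL like https://www.swarmapp.com/c/<shortId>.
function extractSwarmShortId(text) {
  const match = /swarmapp\.com\/c\/([A-Za-z0-9]+)/.exec(text);
  return match ? match[1] : null;
}

// Example usage on a tweet body:
const id = extractSwarmShortId("I'm at Joe's Pizza! https://www.swarmapp.com/c/abc123XYZ");
// id === 'abc123XYZ'
```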

Getting the data from the Foursquare API

Getting the details of a check-in

Resolved check-in response from Foursquare’s checkins endpoint
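To give a sense of the shape of that call, here is a sketch of building the request URL for Foursquare’s v2 `checkins/resolve` endpoint from a Swarm short ID. The helper itself is our own illustration, and the token and version values are placeholders:

```javascript
// Illustrative sketch: build a Foursquare v2 checkins/resolve request URL
// for a Swarm short ID. The oauth token and API version date are placeholders.
function buildResolveUrl(shortId, oauthToken, apiVersion) {
  const base = 'https://api.foursquare.com/v2/checkins/resolve';
  return base + '?shortId=' + encodeURIComponent(shortId) +
         '&oauth_token=' + encodeURIComponent(oauthToken) +
         '&v=' + apiVersion;
}

// e.g. buildResolveUrl('abc123XYZ', 'TOKEN', '20160801')
```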

Writing data to Kinesis streams

Before writing the data to Kinesis, we stringify the JSON object, attach a unique partition key and specify a stream name to write the data to. The entire worker code can be found here.

If you haven’t used AWS Kinesis before, it’s a realtime data ingestion service designed for firehoses and other realtime data sources. It’s equivalent to a managed Apache Kafka service. Since it’s designed for ingestion use-cases, it can process thousands of records per second. A good primer on the topic would be AWS’s official documentation guide. Note: Data ingested into Kinesis is retained for up to 24 hours.

If Kinesis is designed for ingesting realtime data, why can’t we just use it for our analytics jobs? Well, Kinesis is a message queue; it doesn’t have any support for ad-hoc queries.

There are a number of data systems that are designed for ad-hoc querying: from OLTP databases (think Mongo, Postgres, Elasticsearch) to OLAP data warehouses (Hadoop, Google Big Query, et al). In the spirit of most effectively utilizing our data flow pipeline, we will be using appbase.io — a realtime datastore built on top of Elasticsearch’s distributed search engine. It supports the full gamut of Elasticsearch’s query language: full-text search, geo location, filters, and aggregations.

In our streaming analytics workflow, appbase.io will form the yang to Kinesis’s yin.

Step 2: Kinesis < .. > appbase.io, a tale of live tailing

Now that we have our data inside Kinesis, how do we move it over to appbase.io? Instead of using Kinesis’s API to read the data and appbase.io’s API to write it, we will use an ETL tool called transporter.

While you don’t need to know transporter in depth for this tutorial, it’s a modern ETL tool for moving data between different data systems. Transporter uses two kinds of adaptors:

Source adaptor — connects with the source data system (Kinesis in this case) and puts the data into transporter’s pipeline.

Sink adaptor — connects with the sink data system (appbase.io in this case) and additionally allows data transformation via lambda functions before indexing it in the sink.

A tool like transporter can come in handy when you want to work with a variety of data systems reliably. It saves you the time and headache of writing drivers and adaptors for each transport scenario. It already has adaptors for MongoDB, Elasticsearch, Influx, RethinkDB, Kinesis (developed by Rishi Shah, co-author of this post) and appbase.io (a community adaptor built and maintained by us).

Enough talk, let’s get into the soy (hi vegans!) of the action.

One last thing before diving in — remember, we said that you don’t need to know the inner workings of transporter. We have built a docker image that you can use directly for live tailing by passing the respective connection parameters for Kinesis and appbase.io as environment variables.

Pull the docker image

docker pull rishiloyola/streaming-transporter

Run the following command to start the container

docker run -d -e TRANSPORTPIPE="Source({type: \"kinesis\", awsaccesskey: \"XXX\", awssecretkey: \"XXX\", streamname: \"test\"}).save({type: \"appbase\", username: \"XXX\", password: \"XXX\", namespace: \"test.appbase_test\"})" rishiloyola/streaming-transporter

You can pass all your configuration parameters in the TRANSPORTPIPE variable. We pass a string-escaped function of the following format:

Source({
  type: "kinesis",
  awsaccesskey: "XXX",
  awssecretkey: "XXX",
  streamname: "test"
}).save({
  type: "appbase",
  username: "XXX",
  password: "XXX",
  namespace: "test.appbase_test"
})

That’s it: not a single line of code written for this step. Transporter also lets you define a transform() method, similar to save() but placed before it, if one wishes to transpose, enrich, or change the data before it is saved in the sink (an appbase.io app in this case).
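As an illustration, a pipeline with a transform step might take the following shape. This is a sketch only: the document confirms that transform() sits before save(), but the exact argument it takes (here, an inline function) and the field being dropped are our own assumptions.

```javascript
// Hypothetical shape of a pipeline with a transform step between source and sink.
// The function body is illustrative; transporter would apply it to each message.
Source({
  type: "kinesis",
  awsaccesskey: "XXX",
  awssecretkey: "XXX",
  streamname: "test"
}).transform(function(doc) {
  delete doc.raw_tweet;  // e.g. drop a field we don't want indexed (illustrative)
  return doc;
}).save({
  type: "appbase",
  username: "XXX",
  password: "XXX",
  namespace: "test.appbase_test"
})
```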

Step 3: Live Queries

appbase.io comes with an open-source web UI for browsing data called dejavu. You can provide your app credentials (the ones you used in the transport pipe above) and voila, all the data that is being streamed should show up live.

Image: If everything is going correctly, you should see your data being populated in realtime in dejavu

This is how an individual data row’s JSON looks:

Image: Final data stored in appbase.io

You can query the data in dejavu itself. For instance, to filter for check-ins happening in New York City, just apply a “New York” filter on the city field.

Seeing data in a data browser is one thing; seeing it live is quite another. We will write a query to live-stream check-ins happening across the world.
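A minimal sketch of such a live query, assuming the appbase-js client and its streaming search call (the app name, type, and credentials below are placeholders, not values from this tutorial):

```javascript
// Elasticsearch query body: stream every new check-in document as it arrives.
const liveCheckinsQuery = {
  query: { match_all: {} }
};

// With the appbase-js client (assumed API; credentials are placeholders):
//   const appbaseRef = new Appbase({
//     url: 'https://scalr.api.appbase.io',
//     appname: 'test', username: 'XXX', password: 'XXX'
//   });
//   appbaseRef.searchStream({ type: 'appbase_test', body: liveCheckinsQuery })
//     .on('data', (checkin) => console.log(checkin));
```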