Update: We have published a follow-up post outlining how we built our own Hadoop cluster here

At Viki, we use data to power product, marketing and business decisions. We use an in-house analytics dashboard to expose all the data we collect to various teams through simple table and chart based reports. This allows them to monitor all our high level KPIs and metrics regularly.

Data also powers more heavy duty stuff - like our data-driven content recommendation system, or predictive models that help us forecast the value of content we're looking to license. We're also constantly looking at data to determine the success of new product features, tweak and improve existing features and even kill stuff that doesn't work. All of this makes data an integral part of the decision making process at Viki.

To support all these functions, we need a robust infrastructure below it, and that's our data warehouse and analytics infrastructure.

This post is the first of a series about our data warehouse and analytics infrastructure. In this post we'll cover the high-level pipeline of the system and go into detail about how we collect and batch-process our data. Expect plenty of detail-level discussion.

About Viki: We're an online TV site, with fan-powered translations in 150+ different languages. To learn more about what Viki is, watch this short video (2 minutes). It's funny, I promise.

Part 0: Overview (or TL;DR)

Our analytics infrastructure, following a common-sense approach, is broken down into 3 steps:

Collect and Store Data

Process Data (batch + real-time)

Present Data

Collect and Store Data

Logs (events) are sent by different clients to a central log collector. The log collector forwards events to a Hydration Service, where they get enriched with more time-sensitive information; the results are stored to S3.

Batch Processing Data

An hourly job takes data from S3, applies further transformations (read: cleaning up bad data) and stores the results to our cloud-based Hadoop cluster. We run multiple MapReduce (Hive) jobs to aggregate data from Hadoop and write the results to our central analytics database (Postgres). Another job takes a snapshot of our production databases and restores it into our analytics database.

Presenting Data

All analytics/reporting-related activities are then done in our master analytics database. Results meant for report presentation are sent to our Reporting DB. We run an internal reporting dashboard app on top of the Reporting DB; this is where end users log in to see their reports.

Real-time Data Processing

The data from the Hydration Service is also multiplexed and sent to our real-time processing pipeline (using Apache Storm). In Storm, we write custom jobs that do real-time aggregation of important metrics, as well as a real-time alerting system to inform us when traffic goes bad.

Part 1: Collecting, Pre-processing and Storing Data

We use fluentd to receive event logs from different platforms (web, Android, iOS, etc.) over HTTP. We run a cluster of 2 dedicated servers, each running multiple fluentd instances inside Docker, load-balanced through HAProxy. When a message hits our endpoint, fluentd buffers the messages and batch-forwards them to our Hydration Service. At the moment, our cluster handles 100M messages a day.

A word about fluentd: It's a robust open-source log collecting software that has a very healthy and helpful community around it. The core is written in C so it's fast and scalable, with plugins written in Ruby, making it easy to extend.

What data do we collect? Everything we think is useful for the business: a click event, an ad impression, an ad request, a video play, etc. We call each record an event, and it's stored as a JSON string like this:

{
  "time": 1380452846,
  "event": "video_play",
  "video_id": "1008912v",
  "user_id": "5298933u",
  "uuid": "80833c5a760597bf1c8339819636df04",
  "app_id": "100004a",
  "app_ver": "2.9.3.151",
  "device": "iPad",
  "stream_quality": "720p",
  "ip": "99.232.169.246",
  "country": "ca",
  "city_name": "Toronto",
  "region_name": "ON"
}
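To make the collection step concrete, here is a hedged sketch of what a client-side send could look like. fluentd's in_http input accepts an event as an HTTP POST to /<tag>, with the record JSON in the "json" form parameter; the host, port and tag below are hypothetical, and we only build the request body rather than actually sending it:

```ruby
require 'json'
require 'uri'

# A sample event, mirroring the record format above (trimmed for brevity).
event = {
  'time'     => 1380452846,
  'event'    => 'video_play',
  'video_id' => '1008912v',
  'user_id'  => '5298933u'
}

# Hypothetical collector endpoint; the path component is the fluentd tag.
endpoint = URI('http://log-collector.example.com:8888/events.video_play')

# in_http reads the event record from the "json" form parameter.
body = URI.encode_www_form('json' => event.to_json)
```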

Pre-processing Data - Hydration Service

Data Hydration Service

After collection, the data is sent to a Hydration Service for pre-processing. Here, each message is enriched with time-sensitive information.

For example: when a user watches a video (and thus a video_play event is sent), we want to know whether it's a free user or a paid user. Since a user could be free today and upgrade to paid tomorrow, the only way to correctly attribute the play event to the free/paid bucket is to inject that status right into the message when it's received. In short, the service translates this:

{ "event":"video_play", "user_id":"1234" }

into this:

{ "event":"video_play", "user_id":"1234", "user_status":"free" }
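Boiled down, the hydration step is a lookup-and-merge. A minimal sketch, where the in-memory USER_STATUS hash is a hypothetical stand-in for our real user-status store:

```ruby
require 'json'

# Hypothetical stand-in for the real user-status store.
USER_STATUS = { '1234' => 'free', '5678' => 'paid' }

# Enrich a raw event with the user's status at the moment of receipt,
# so later reprocessing can't misattribute it after an upgrade.
def hydrate(raw_event)
  event = JSON.parse(raw_event)
  event['user_status'] = USER_STATUS.fetch(event['user_id'], 'unknown')
  event.to_json
end

hydrate('{"event":"video_play","user_id":"1234"}')
```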

For non-time-sensitive operations (fixing typos, getting country from IP, etc.), there is a different process (discussed below).

Storing to S3

From the Hydration Service, the message is buffered and then stored to S3 - our source of truth. The data is gzip-compressed and stored into hourly buckets, making it easy and fast to retrieve data for any one-hour period.
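One way to picture the hourly bucketing; the key scheme below is an illustrative assumption, not our actual S3 layout:

```ruby
require 'time'

# Hypothetical S3 key scheme: one gzip-compressed file per UTC hour bucket,
# so fetching an hour of data is a single prefix lookup.
def hourly_key(epoch_seconds)
  Time.at(epoch_seconds).utc.strftime('events/%Y/%m/%d/%H/events.json.gz')
end

hourly_key(1380452846)
```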

Part 2: Batch-processing Data

The processing layer has 2 components, the batch-processing and the real-time processing component.

This section focuses mainly on our batch-processing layer - our main process of transforming data for reporting/presentation. We'll cover our real-time processing layer in another post.

Batch-processing Data Layer

Cleaning Data Before Importing Into Hadoop

Those who have worked with data know this: cleaning data takes a lot of time. In fact, cleaning and preparing data will take most of your time - not the actual analysis.

What is unclean data (or bad data)? Data that is logged incorrectly. It comes in many different forms, e.g.:

Typos: sending a 'clickk' event instead of 'click'

Clients sending an event twice, or forgetting to send it

When bad data enters your system, it stays there forever - unless you purposely find a way to clean it out.

So how do we clean up bad data?

Previously, when we received a record, we wrote it directly to our Hadoop cluster and made a backup to S3. This made it difficult to correct bad data, due to the append-only nature of Hadoop.

Now all the data is first stored in S3, and an hourly process takes data from S3, applies cleanup/transformations and loads it into Hadoop (insert-overwrite).

Storing Data, Before and After

The process is similar in nature to the hydration process, but this time we look at one-hour blocks rather than individual records. This approach has many great benefits:

The pipeline is more linear, which prevents data discrepancies (between S3 and Hadoop).

The data is not tied to being stored in Hadoop. If we want to load our data into other data storage, we just write another process that transforms the S3 data and dumps it somewhere else.

When bad logging happens and causes unclean data, we can modify the transformation code and rerun the data from the point where the bad logging started. Because the process is idempotent, we can reprocess as many times as we want without double-counting the data.
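A minimal sketch of why insert-overwrite makes the hourly load idempotent. The fake clients and the clean_record rule are illustrative stand-ins for our real components:

```ruby
# Illustrative in-memory stand-ins for the real S3 and Hadoop clients.
class FakeS3
  def initialize(buckets)
    @buckets = buckets
  end

  def fetch_hour(hour)
    @buckets.fetch(hour, [])
  end
end

class FakeHadoop
  attr_reader :partitions

  def initialize
    @partitions = {}
  end

  # Insert-overwrite: replace the whole hour partition, never append.
  def insert_overwrite(hour, records)
    @partitions[hour] = records
  end
end

# A hypothetical cleanup rule: fix a known event-name typo.
def clean_record(rec)
  rec.merge('event' => rec['event'] == 'clickk' ? 'click' : rec['event'])
end

# Reprocessing an hour overwrites its partition, so running it twice
# (or twenty times) yields the same result.
def reprocess_hour(hour, s3:, hadoop:)
  cleaned = s3.fetch_hour(hour).map { |r| clean_record(r) }
  hadoop.insert_overwrite(hour, cleaned)
end
```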

Our S3 to Hadoop Transform and Load Process

If you've read the article The Log by the data engineering folks at LinkedIn, you'll notice that our approach is very similar (replacing Kafka with S3, and per-message processing with per-hour processing). Indeed our paradigm is inspired by The Log architecture. However, given our needs, we chose S3 because:

When it comes to batch (re)processing, we do it per time period (e.g. process 1 hour of data). Kafka uses natural numbers to order messages, so with Kafka we'd have to build another service to translate [ beg_timestamp , end_timestamp ) into [ beg_index , end_index ).

Kafka can only retain up to X days of data due to disk-space limitations. Since we want the ability to reprocess data further back, employing Kafka would mean we need another strategy for those cases.

Aggregating Data from Hadoop into Postgres

Once the data gets into Hadoop, daily aggregation jobs aggregate it into fewer dimensions and port the results into our master analytics database (PostgreSQL).

For example, to aggregate a table of video starts data together with some video and user information, we run this Hive query (MapReduce job):

-- The Hadoop events table contains 2 fields: time (int), v (json)
SELECT
  SUBSTR(FROM_UNIXTIME(time), 0, 10) AS date_d,
  v['platform'] AS platform,
  v['country'] AS country,
  v['video_id'] AS video_id,
  v['user_id'] AS user_id,
  COUNT(1) AS cnt
FROM events
WHERE time >= BEG_TS
  AND time <= END_TS
  AND v['event'] = 'video_start'
GROUP BY 1, 2, 3, 4, 5

and load the results into an aggregated.video_starts table in Postgres:

       Table "aggregated.video_starts"
  Column   |          Type          | Modifiers
-----------+------------------------+-----------
 date_d    | date                   | not null
 platform  | character varying(255) |
 country   | character(3)           |
 video_id  | character varying(255) |
 user_id   | character varying(255) |
 cnt       | bigint                 | not null

Further querying and reporting of video_starts will be done out of this table. If we need more dimensions, we either rebuild this table with more dimensions, or build a new table from Hadoop.

If it's a one-time ad-hoc analysis, we'd just run the queries directly against Hadoop.

Table Partitioning:

Also, we make use of Postgres' table inheritance feature to partition our data into multiple monthly tables, with a parent table on top of them all. Your query just needs to hit the parent table, and the engine knows which underlying monthly tables to scan for your data.
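As a hedged sketch of what one monthly child table looks like under table inheritance; the helper and naming scheme below are illustrative, not our actual library code:

```ruby
require 'date'

# Generate DDL for one monthly child table. The CHECK constraint lets the
# Postgres planner (with constraint_exclusion enabled) skip irrelevant
# months when a query hits the parent table.
def monthly_partition_ddl(parent, month_start)
  child = "#{parent}_#{month_start.strftime('%Y_%m')}"
  <<~SQL
    CREATE TABLE #{child} (
      CHECK (date_d >= DATE '#{month_start}' AND date_d < DATE '#{month_start >> 1}')
    ) INHERITS (#{parent});
  SQL
end

puts monthly_partition_ddl('aggregated.video_starts', Date.new(2014, 6, 1))
```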

This makes our data very easy to maintain, with small indexes and a better rebuild process. We keep recent tables on fast (SSD) drives, and move older ones to slower (but bigger) drives for semi-archival purposes.

Centralizing All Data

We dump our production databases into our master analytics database on a daily basis.

Also, we use a lot of 3rd party vendors (Adroll, DoubleClick, Flurry, GA, etc). For each of these services, we write a process to ping their API and import the data into our master analytics database.

This data, together with the aggregated data from Hadoop, allows us to produce meaningful analyses combining multiple sources.

For example, to break down our video starts by genre, we write a query that joins the prod.videos table with the aggregated.video_starts table:

-- Video starts by genre
SELECT V.genre, SUM(cnt)
FROM aggregated.video_starts VS
LEFT JOIN prod.videos V ON VS.video_id = V.id
WHERE VS.date_d = '2014-06-01'
GROUP BY 1
ORDER BY 1

The above is possible because we have both sources of data (event tracking data + production data) in one place.

Centralizing data is a very important concept in our pipeline because it makes it simple and pain-free for us to connect, report and corroborate our numbers across many different data sources.

Managing Job Dependencies

We started with a simple crontab to schedule our hourly/daily jobs. As the jobs grew more complicated, we ended up with a very long crontab:

Crontab also doesn't support graph-based job flow (e.g. run A and B at the same time; when both finish, run C).
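To illustrate what "graph-based job flow" means, here is a toy dependency resolver; the job names and graph structure are purely illustrative:

```ruby
# Toy graph-based job flow: each job lists the jobs it depends on.
# A and B have no dependencies; C runs only after both finish.
DEPS = { 'A' => [], 'B' => [], 'C' => %w[A B] }

# Depth-first walk that emits jobs in a dependency-respecting order.
def run_order(deps)
  order = []
  visit = lambda do |job|
    next if order.include?(job)
    deps[job].each { |dep| visit.call(dep) }
    order << job
  end
  deps.each_key { |job| visit.call(job) }
  order
end

run_order(DEPS)
```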

So we looked around for a solution. We considered Chronos (by Airbnb), but its use case is more complicated than what we needed, plus it requires setting up ZooKeeper and all that.

We ended up using Azkaban by LinkedIn. It has everything we need: crontab-style scheduling with graph-based job flow, plus the runtime history of each job. And when a job flow fails, you can restart it, running only the tasks that failed or haven't run.

It's pretty awesome.

Making Sure Your Numbers Tie

One thing I see discussed less often in analytics infrastructure talks and blogs is making sure your data doesn't get dropped halfway through transport, resulting in inconsistencies between storages.

We have a process that runs after every data transport: it counts the number of records in both the source and destination storage and prints errors when they don't match. These check-total processes sound a little tedious, but they have proven crucial to our system; they give us confidence in the accuracy of the numbers we report to management.
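The check-total idea in miniature. How the two counts are obtained (SELECT COUNT(*), S3 object line counts, ...) depends on the storage and is elided here; only the comparison step is shown:

```ruby
# After each transport step, compare record counts at source and destination,
# and fail loudly on any mismatch so it can't go unnoticed.
def check_totals!(step, source_count, dest_count)
  return true if source_count == dest_count
  raise "Check-total failed at #{step}: source=#{source_count}, dest=#{dest_count}"
end
```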

Case in point: we had a process that dumps data from Postgres to CSV, compresses and uploads it to S3, and loads it into Amazon Redshift (using the COPY command). So technically we have the exact same table in both Postgres and Redshift. One day our analyst pointed out that Redshift held significantly less data than Postgres. Upon investigation, we found a bug that caused the CSV file to be truncated and thus not fully loaded into the Redshift tables - and for this particular process we didn't have check-totals in place.

Using Ruby as the Scripting Language

When we started we were primarily a Ruby shop, so going with Ruby was a natural choice. "Isn't it slow?", you might ask. But we don't use Ruby to process data (i.e. it rarely holds any large amount of data); we use it to facilitate and coordinate the process.

We have written an entire library to support building data pipelines in Ruby. For example, we extended the pg gem to provide a more object-oriented interface to Postgres. It allows us to do table creation, table hot-swapping, upserts, insert-overwrites, copying tables between databases, etc., all without having to touch SQL code. It has become a nice, productive abstraction on top of SQL and Postgres. Think ORM for data-warehousing purposes.

Example: the code below creates a data.videos_by_genre table holding the result of a simple aggregation query. The process works on a temporary table and eventually performs a table hot-swap with the main one; this avoids the data disruption that would occur if we worked on the main table directly.

columns = [
  {name: 'genre', data_type: 'varchar'},
  {name: 'video_count', data_type: 'int'},
]
indexes = [{columns: ['genre']}]
table = PGx::Table.new 'data.videos_by_genre', columns, indexes

table.with_temp_table do |temp_t|
  temp_t.drop(check_exists: true)
  temp_t.create

  connection.exec <<-SQL.strip_heredoc
    INSERT INTO #{temp_t.qualified_name}
    SELECT genre, COUNT(1)
    FROM prod.videos
    GROUP BY 1
  SQL

  temp_t.create_indexes
  table.hotswap
end

(The above example could also be done with a MATERIALIZED VIEW, by the way.)

This set of libraries has proven critical to our data pipeline, since it allows us to write extensible and maintainable code that performs all sorts of data transformations.

Technology

We rely mostly on free and open-source technologies. Our stack is:

fluentd (open-source) for collecting logs

Cloud-based Hadoop + Hive (TreasureData - 3rd party vendor)

PostgreSQL + Amazon Redshift as central analytics database

Ruby as scripting language

NodeJS (worker process) with Redis (caching)

Azkaban (job flow management)

Kestrel (message queue)

Apache Storm (real-time stream processing)

Docker for automated deployment

HAproxy (load balancing)

and lots of SQL (huge thanks to Postgres, one of the best relational databases ever made)

Conclusion

The above post went through the overall architecture of our analytics system, and into the details of the collecting and batch-processing layers. In later blog posts we'll cover the rest, specifically:

Our Data Presentation layer, and how Stuti, our analyst, built our funnel analysis and fan-in/fan-out tools, all with SQL. And it updates automatically (very funnel. wow!)

Our Real-time traffic alert/monitoring system (using Apache Storm)

turing: our feature roll-out, A/B testing framework

What is your analytics infrastructure like? Please share with us in the comments!

Subscribe to our RSS feed/ Twitter to get updates of future blog posts.

We're hiring:

We are looking for a Platform Engineer (Singapore Office - HQ) to work on our scalable distributed API infrastructure. If you want a taste of what modern Asia feels like, why not give Singapore (and Viki) a try? (Protip: it's not in China.) Check this out for more job openings.

If you're a student, we have a few software engineering internship projects available. Write to us at huy at viki dot com and we'll tell you more.