
Introduction

In the Big Data landscape, data engineers are always striving to come up with efficient and accurate methods to handle the volume, variety, and velocity of data, so that it can serve as a strong skeleton for the Data Scientists conducting analysis on top of it. This is necessary since new-age technologies like Machine Learning and Artificial Intelligence heavily depend on Big Data, as mentioned in a recent Forbes post.

However, Big Data deployment in the Cloud came with its own difficulties, such as under-utilization, improper utilization, and over-utilization of compute resources at various periods. Serverless Architecture came to the rescue by abstracting these problems away. Netflix, Mapbox, and the New York Times are some of the prominent organizations using Serverless Architecture and unleashing its potential in Real-time Data Processing.

Moreover, Big Data is evolving rapidly, and with this evolution comes a plethora of tools that help enterprises deal with Real-time Data Processing. Live data needs special attention and is critical to handle: if it is not handled properly, its value is directly affected. To tackle huge chunks of data in a time-sensitive environment, Serverless Architecture is proving to be of great help.

Serverless Architecture & Big Data

The essence of Serverless Computing lies in the fact that it doesn't require IT infrastructure management. It also integrates easily with third-party services, for example Backend-as-a-Service and Function-as-a-Service offerings.

Consider the prevalent scenario: the typical infrastructure for real-time processing consists of Hadoop clusters and Data Warehouse solutions where Data Scientists have to run batch jobs to generate reports. This falls well short of real time, and here's where Serverless Architecture fits more appropriately.

Here’s why it fits and fulfills the Big Data requirements:

Multiple event sources: Direct sync and async API calls, integration with various services of the Cloud platform, and compatibility with third-party triggers.

Ephemeral: Code is stateless, with freedom of runtime language, since multiple languages like Node.js, Java, Python, and Go are supported.

No Infrastructure Management: Serverless provides you with on-demand scalability, similar to Amazon EC2 with Auto Scaling groups.

Efficient Performance: Functions are easy to write, deploy, and maintain, giving you the space to focus solely on your business logic.

Cost-effective: Capacity is automatically matched to the request rate, and compute is billed in 100 ms increments.

What is Serverless Real-time Data Processing?

More and more people are taking advantage of Big Data with the help of Serverless functions. In a recent article, Grid Dynamics discussed how they built an Analytics Platform for a startup whose mobile game acts as a digital advertising platform.

For an organization to build such a platform, many considerations need to be evaluated, which we will discuss in this section. In such platforms, event data is captured from multiple sources, processed in real time with the help of functions, and pushed out to data lakes. This works under the following two models:

Push Model

In this model, data is mapped and pushed from the event sources to the functions, either synchronously or asynchronously, invoking the functions via the Event Source API. Though it provides robust concurrency, you will be required to manage resource-based policy permissions.

Pull Model

In this model, the data is mapped by the function itself after polling the data stream coming from, for example, databases. Whenever new data is found on the stream, a function is invoked to process it further. This model works only through synchronous invocations.
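To make the pull model concrete, here is a minimal sketch of a Lambda handler consuming a batch of Kinesis records. The event shape follows the standard Kinesis event format (payloads arrive base64-encoded); the field values in the fake batch are invented for illustration.

```python
import base64
import json

def handler(event, context):
    """Pull-model consumer sketch: Lambda receives a batch of Kinesis
    records and decodes each payload before processing it."""
    processed = []
    for record in event["Records"]:
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        processed.append(json.loads(payload))
    return processed

# Invoking the handler locally with a hypothetical one-record batch:
fake_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            b'{"user": "a", "score": 1}').decode()}}
    ]
}
result = handler(fake_event, None)
```

In a real deployment Lambda builds this event itself from the stream; the local invocation above is only to show the decoding step.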

Here we will take Amazon Kinesis as an example to better understand what type of services are required for Real-time Data Processing. Amazon Kinesis is mainly used along with AWS Lambda to offer managed services for streaming data ingestion and processing.

Source: Amazon Web Services

The streaming data is sent to Amazon Kinesis and stored in shards. Each shard ingests (writes) data at up to 1 MB/sec and emits (reads) data at up to 2 MB/sec. Each shard stores data for 24 hours, extendable to 7 days.
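These per-shard limits determine how many shards a stream needs. A small helper, using only the 1 MB/sec in / 2 MB/sec out figures above, can estimate the minimum shard count for a given workload:

```python
import math

# Per-shard limits taken from the text: 1 MB/sec ingest, 2 MB/sec egress.
SHARD_INGEST_MB_S = 1.0
SHARD_EGRESS_MB_S = 2.0

def shards_needed(ingest_mb_per_sec, egress_mb_per_sec):
    """Minimum number of shards so that neither the write nor the
    read limit of any single shard is exceeded."""
    return max(math.ceil(ingest_mb_per_sec / SHARD_INGEST_MB_S),
               math.ceil(egress_mb_per_sec / SHARD_EGRESS_MB_S))
```

For example, a stream writing 5 MB/sec and read by consumers at 6 MB/sec is write-bound and needs 5 shards, while a 1 MB/sec stream fanned out to 8 MB/sec of reads is read-bound and needs 4.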

With the help of the "fan-out" technique, multiple Lambda functions can be triggered to handle different processing tasks. When configuring Lambda functions, note that CPU allocation is proportional to the memory configured; increasing the memory makes your code execute faster only if it is CPU-bound. More memory also allows larger record sizes to be processed.

After processing, AWS Lambda can store the data in DynamoDB and S3. You can also use services like Amazon Elasticsearch Service for analytics by aggregating the data from AWS Lambda.

For metrics and monitoring, Lambda sends event data and function metrics to Amazon CloudWatch.

Steps to Follow

In this section, we will discuss the workflow of setting up the processing streams.

#1. Configuring Event Source

First, you need to set the Amazon Kinesis stream as an event source for your Lambda function. Second, configure the batch size: the maximum number of records that Lambda will send to one invocation. The effective batch size, evaluated every 250 ms, is calculated as MIN(records available, batch size, 6 MB). Increasing the batch size allows fewer Lambda function invocations with more data processed per function.
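The MIN() rule above can be sketched as a small function. This is an illustration of the formula only, with the 6 MB payload cap converted into a record count via an average record size (a simplifying assumption, since real batches mix record sizes):

```python
PAYLOAD_CAP_BYTES = 6 * 1024 * 1024  # 6 MB invocation payload cap

def records_per_invocation(records_available, batch_size, avg_record_bytes):
    """Effective batch per invocation, per the text:
    MIN(records available, configured batch size,
        records that fit under the 6 MB payload cap)."""
    fit_in_payload = PAYLOAD_CAP_BYTES // avg_record_bytes
    return min(records_available, batch_size, fit_in_payload)
```

So with 1 KB records, a configured batch size of 500 is the binding limit; with 1 MB records, the 6 MB payload cap limits each invocation to 6 records regardless of configuration.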

Polling and processing occur concurrently on a per-shard basis. Lambda polls every 250 ms if no new records are found and grabs as much data as possible in the GetRecords call. Batches are passed to the Lambda invocation through function parameters. Batch size may impact duration if the Lambda function takes longer to process more records.

#2. Tuning Throughput

The amount of data ingested and processed should stay in a continuous flow. If the put/ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind. Here is how you can calculate the throughput rates:

Maximum theoretical throughput: # shards * 2 MB / Lambda function duration (s)

Effective theoretical throughput: # shards * Batch size (MB) / Lambda function duration (s)
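The two formulas above translate directly into code. A quick sketch, useful for back-of-the-envelope capacity checks (the example numbers are invented):

```python
def max_theoretical_throughput(shards, duration_s):
    """# shards * 2 MB / Lambda function duration (s), in MB/s."""
    return shards * 2.0 / duration_s

def effective_theoretical_throughput(shards, batch_size_mb, duration_s):
    """# shards * batch size (MB) / Lambda function duration (s), in MB/s."""
    return shards * batch_size_mb / duration_s

# A 10-shard stream whose function runs for 2 s per 1 MB batch:
max_tp = max_theoretical_throughput(10, 2)        # 10.0 MB/s ceiling
eff_tp = effective_theoretical_throughput(10, 1, 2)  # 5.0 MB/s effective
```

If your put rate exceeds `eff_tp`, processing falls behind; the usual levers are more shards, a larger batch size, or a shorter function duration.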

As for failed executions, a Lambda function retries on execution failure until the record expires. Throttles and errors increase duration and directly impact throughput. During low throughput, the effective batch size decreases below the configured value, and vice versa.

#3. Monitoring with CloudWatch

CloudWatch offers robust monitoring facilities for your real-time data processing applications. Here are the three things you need to monitor closely:

Kinesis Streams: The metrics to monitor are GetRecords (effective throughput), PutRecords (bytes, latency, records, etc.), and GetRecords.IteratorAgeMilliseconds (this tells you how old your last processed records were).

Lambda Functions: The metrics to monitor are Invocation count (the number of times your function has been invoked), Duration (the execution/processing time for your function), Error count, Throttle count, and Iterator Age (the time elapsed between a batch being received and the final record being written to the stream).

Debugging: For effective and accurate debugging, review all the metrics, view the raw records consumed, write custom logs, and search the log events.
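As a sketch of how the stream-side check might be wired up, the following builds the parameter dictionary for a CloudWatch `get_metric_statistics` query on the iterator-age metric. In a real setup you would pass this dictionary to a boto3 CloudWatch client; the stream name and one-hour window are assumptions for illustration:

```python
from datetime import datetime, timedelta

def iterator_age_query(stream_name, hours=1):
    """Parameters for a CloudWatch get_metric_statistics call that
    reports the maximum iterator age (ms) of a Kinesis stream over
    the last `hours` hours, sampled per minute."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,            # one datapoint per minute
        "Statistics": ["Maximum"],
    }

query = iterator_age_query("my-stream")
```

A sustained rise in this metric is the clearest signal that processing is falling behind ingestion.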

Take Away!

Jesse Anderson, Data Engineer and Managing Director at the Big Data Institute, mentioned in his video that Real-time Data Processing is a critical task for businesses and their customers who are looking to expand their horizons and jump into the next-gen era of business.

That said, more and more companies are looking for ways to integrate an in-stream Real-time Data Processing pipeline to support their existing products and platforms. In such cases, Serverless tools are proving to be an easy approach.

As long as vendor lock-in and infrastructure outsourcing are not an issue, a Serverless solution can greatly shorten your time to market. Time and again, we've observed that delivering a Serverless Real-time Data Processing solution takes roughly one-tenth the time of a similar solution based on a serverful version.