Sid Anand is an architect at Agari, and a former technical architect at eBay, Netflix, and LinkedIn. He has 15 years of data infrastructure experience at scale, is a PMC for Apache Airflow, and is also a program committee chair for QCon San Francisco and QCon London .

Wesley Reisz talks to Sid Anand , a data architect at cybersecurity company Agari, about building cloud-native data pipelines. The focus of their discussion is around a solution Agari uses that is built from Amazon Kinesis Streams, serverless functions, and auto scaling groups.

Show Notes

What are the problems that Agari solves?

1:25 Agari is an e-mail security SASS company, which has its operations in the AWS cloud, and we’re trying to solve the problem of spear fish detection and elimination.

1:40 Most of you have received e-mail which is unsolicited, typically considered spam.

1:50 That problem has been solved, for the most part, through secure e-mail gateways in enterprise companies. These detect spam campaigns.

2:05 The problem we’re trying to deal with is when there is a targeted campaign against a specific individual, such as a CFO or CEO.

2:10 If they click on a link or follow instructions in such an e-mail then they compromise themselves or their company.

2:20 Since there won’t be an e-mail like that before or afterwards, we have to detect a spear fish message with no prior history.

How do you go about approaching something like this?

2:30 A lot of approaches in the security space use a fingerprint of a known attack pattern, which becomes an arms race.

2:40 You have to be aware of all the attack patterns, through fingerprint detection and elimination.

2:45 But when a new attack rears its head, you have a very short period of time to shut down that attack. That’s the negative model.

2:50 We take a different approach, a positive model. We look for patterns of good behaviour, and if you match that, you’re good.

3:00 If you don’t match, then you’re in the range of suspicious to untrusted.

How do you match that behaviour?

3:10 We have been in the business for a long time, and we have captured e-mail traffic and flow from some of the largest inbox providers like Yahoo, GMail and Office 365.

3:20 We have a good model of all the IP addresses of the mail senders in the world, and we look where your e-mail address has been.

3:30 We use that to determine if you’ve gone through a known compromised server at some point.

3:40 We have to process a huge volume of mail flow, and we can use that to build a reputation model of IP addresses and sending domains, and we can apply those models to mark real-time mail flow.

How have data pipelines changed over the last 5-10 years?

4:15 Traditionally, data pipelineing have fallen into two buckets.

4:20 The first bucket is companies such as technical, legal or medical industries was storing mail in a database that you wanted to run a report on. The problem is, those databases weren’t built for running reports.

4:40 You would hire a data warehouse engineer and then extract, transform and load the data into an OLAP or reporting database (typically referred to as a data mart or data warehouses), and the reports could be run against those databases.

5:00 In the last ten to fifteen years, websites and interactions with the world require data pipelines for enrichment, scoring and ranking, which require different sorts of data sciences and machine learning on flowing data or data at rest to come up with recommendations for you, such as movie recommendations on Netflix or music suggestions on Spotify.

5:30 It can also be used for credit card fraud detection or data collection from IoT devices.

5:45 Since reports may be run periodically, such as once a day or once an hour, batch processing may have been fine.

5:50 But since we’re moving into more intelligent systems, we need real time answers, and the old ETL systems don’t work.

What does a real-time data pipeline look like today?

6:10 Most of the technology that has been prevalent in the last few years has been batch oriented: Hadoop; Pig; Hive; - they all take batch approaches for processing data.

6:20 More recently, technologies such as Spark and Flink have come around although they are a little bit more micro batch oriented; they can reduce the latency, but they still have them.

6:40 You can also handle one event at a time, although that’s more for custom solutions - there aren’t any products which deal with that model at the moment.

Can you explain what a micro batch is?

6:50 Spark has a programmatic model that does transformations on sets of data.

7:00 To be efficient, it has to have a set of data. So it might have to wait for 5 seconds to collect a set of data before it can process it. That’s a micro batch.

7:10 What if you need per-message or per-event processing? It’s not efficient for that.

7:25 When we moved from batch to near real-time, because we were familiar with Apache Spark, we altered the existing Spark jobs to be streaming.

7:35 The problem we had was because of the micro-batching and the rate of data coming we couldn’t meet sub-second SLAs. It was still taking - in some cases - minutes to run computations.

7:45 So we abandoned Spark and wrote custom apps that operate per-message, and now we’re able to operate sub second end to end.

So what does it look like?

8:00 We have a custom app that reads messages off of AWS Kinesis, and it will get a batch of messages - but messages that have been received in the last few milliseconds.

8:20 For those messages, it will score it and then send the message downstream the next hop in the pipeline.

8:30 We don’t have to wait to fill a batch, which is Spark’s model. We are constantly polling Kinesis in milliseconds and processing whatever is returned.

Does that work because each event is discrete?

9:10 We had to solve the windowing problem - we have Elasticache, which is based on Redis, and we store data as we get it in the cache.

9:25 If we need history, we can consult the cache to get those counts and apply that history when scoring that event, which is actually much faster than happens in Spark.

What are the design decisions when choosing a data pipelining approach?

9:55 If you’re doing batch processing - for example, we still do nightly batch processing to do model building.

10:00 Machine learning and training needs to happen on batch sets.

10:10 For those use cases, batch is still the right approach and Spark is the right tool for the job, whether you’re in the cloud or your own data centre.

10:20 If you’re building a near real-time data pipeline, I don’t believe Spark streaming is ever going to be the right solution for you.

10:30 The promise of Spark is that you would write one application for both batch processing and streaming.

10:35 The reality is that your batch processing needs and streaming needs are very different.

10:45 If you try to use the same application for both, you’re not going to get the same performance for both.

10:50 Batch applications have the ability to do more complex calculations and take more time to do so.

10:55 Stream processing is all about timing. You have to sometimes sacrifice some of the complex operations when you take that decision.

11:10 In the finance industry, you’re not going to see Spark streaming. They have some of the shortest latencies for their data pipelines anywhere.

11:20 We’ve taken that approach - building small applications that can do their operations very fast. As a result we couldn’t use the Spark streaming model.

What were the design considerations in creating this solution?

11:50 We have a small team - about 20 engineers, maybe half of them are working on the data pipeline.

11:55 We also have new customers that are being on-boarded every week.

12:00 Our scale keeps changing - not in a continuous fashion, but in a step fashion.

12:10 Every new customer on board results in a step change operation.

12:15 We can’t have engineers involved in scaling our system up.

12:20 We need a system that has a few characteristics.

12:25 It needs to be operable. It should be self-healing, auto-scaled, highly-available, whatever the latency needs should be enforced, monitored for end-to-end health, and should be very easy to deploy and roll-back.

12:45 It should also be dynamic in terms of cost. For example, our traffic patterns are diurnal.

12:50 Most customers are in the United States, with some in Europe but none in Asia.

13:00 The cost footprint should be dynamic.

13:10 The data should be correct: it should never lose data, never corrupt data, never send duplicate data.

13:20 The data should be timely; if the data is late, it is useless. If we don’t get a response within a few seconds, it’s as if we lost the data.

What does the data pipeline look like?

13:30 We have a data pipeline with three key components.

13:40 We have streams, such as Kinesis or Kafka.

13:50 Processors consume from a stream, do some calculation, and then produce to an outgoing stream.

14:00 Persistent stores take the data and store it, such as Elasticache, Elasticsearch, DynamoDB, PostgresMS3 and follow the polyglot persistence model where we use the right tool for the right job.

Can you talk about the streams?

14:25 We decided to go with Kinesis streams. Kinesis is a managed version of Kafka, and has the advantage that it doesn’t depend on ZooKeeper, which tends to be the Achilles Heel of Kafka, especially in the cloud where one of the three ZooKeeper nodes could go down because the machine is being retired.

14:40 Kinesis doesn’t depend on ZooKeeper; it uses DynamoDB for the back end.

What is your experience with Kinesis?

15:00 Kinesis, like Dynamo, follows a pre-provisioned model. So you provision for the IO, allocating shards ahead of time. You get consistent and guaranteed behaviour from that shard.

15:20 In the world of Kafka, you decide the number of shards you have, but they don’t have guaranteed IO characteristics.

15:25 Kinesis is ahead in that case, but with an API you can auto-scale your cluster, just by provisioning more shards or shrinking.

15:45 This allows you to cost control, such as shrinking your cluster during low traffic hours.

What do the processors look like?

15:50 The processing nodes receive data, process it, and then export the data.

16:05 We either use AWS Lambda, or AWS auto-scaling groups.

16:15 Our processors are written in Python, Node, Ruby and Java.

16:20 Their job is to score a message as trustworthy or not trustworthy.

16:25 We also have nodes that perform annotation and enrichment; they take third party data sets and from our caches and annotate messages as they enter the pipeline.

16:30 We also have importer nodes, which persist messages in our data stores.

Why the different languages?

17:00 We have a team of engineers that like to write in different languages.

17:10 We started in Ruby, and as we brought data scientists on board, Python and Java started to be used more.

17:15 Once in a while, we borrow code from the open source community. A lot of the AWS lambda code is in Node.

How do you decide between AWS lambda and auto-scaling groups?

17:30 Lambda only supports Node, Java and Python - so our early Ruby code can’t use them, they have to use auto-scaling groups.

How do you manage deployments?

17:50 We have two approaches to building auto-scaling groups.

18:00 The first is to provision a standing Soloist node, essentially Terraform and Ansible.

18:30 Every time we deploy a new gem, we snapshot the soloist to get a new AMI and then do a rolling deploy to an auto-scaling group.

18:40 The other approach is to use Packr. It will spin up a ephemeral Soloist node, it will use Ansible to set the machine up, and then deploy the artefact to that box, generate an AMI and then tear it down.

19:00 It will then use Terraform to do a blue/green deployment to that auto-scaling group.

How do you choose between your different back end stores?

19:20 We started with a single Postgres node, and were using it for a variety of things.

19:30It was being ingested message by message into Postgres, detailed information.

19:35 We were also generating aggregates, because our tool shows analytics to users.

19:45 Thirdly, we wanted our users to discover data, which meant some kind of search functionality.

19:50 Postgres supports each of these but isn’t necessarily the right tool for all of them.

19:55 We off-boarded search and aggregation to ElasticSearch.

20:05 The big innovation/insight there is that typically with data pipelines is that they’ll pre-aggregate data on fixed interval: for example, generating a summary every hour.

20:10 The problem is that the summaries are always wrong; they’re off, and stale, and don’t take into account late data.

What else is important?

20:40 The important part is to decouple through both streams and a data format that supports schema evolution.

20:55 They’re sending JSON, the cloud providers are using JSON, and everyone is breaking.

21:15 The idea is that you should be able to accept old and new versions of the data, using something that supports evolution like Protobuf or Avro.

21:30 We chose Avro because it was supported in the four languages we use.

21:40 In Protobuf and Thrift you have to do code generation and use an IDL. Avro has a generic interface which works much better with dynamic languages.

22:00 Another key design idea is to limit index use in databases.

22:30 Indexes are used to search for data, but as you add new indexes you slow down the insertion rate.

22:35 That starts hurting when ingesting real-time data.

22:40 You can optimise data ingestion on your main data store, but then have a change data stream that can replicate the data down to other data stores which potentially have more indexes set up.

23:00 This split allows you to ingest at the lowest latency to start, but have additional queries and more data later on.

23:20 We ingest data into our primary data store, into Postgres, and a change data capture stream going into Kinesis which then goes into ElasticSearch.

23:30 The other benefit is that we’re consuming data that has already been transformed and cleaned up into the database.

23:45 The data cleaning is performed on ingestion into the database, and downstream the data is clean, so it doesn’t need to be cleaned again for search or graphing.

Why isn’t everyone using the cloud?

24:00 When you get to a certain scale - like LinkedIn or Facebook - the cost to transfer the data into the cloud may be prohibitive.

24:15 Netflix considered itself a drop in the ocean of Amazon AWS - but it turned out that Netflix was a significant portion of the resources on AWS.

24:30 So larger scale environments on LinkedIn or Facebook have a similar scale to AWS at the moment.

What’s Apache Airflow?