This is part 2 of the “Getting Data to Data Lake from Microservices” series. For part 1, see Getting Data to Data Lake from Microservice — Part 1: From Databases.

Some data is not stored in a database at all; instead, it is treated as a log. Usually this is because there is no need to keep the activity as state for an app feature to work properly. For example, Amazon has no reason to store the keywords a user searched for in a transactional database, but it has to store which items were added to the cart so that the user can proceed to checkout.

Specifically in the e-commerce world, there is a term called clickstream, which basically means every click a user makes in the browser. Clickstream data is very important for understanding conversion rates: are we doing well or not? For example, if users add items to the cart but rarely proceed to checkout, the cart-to-checkout conversion rate is low and something is wrong at that specific step; maybe the payment is hard to complete.

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time — Jay Kreps

Illustration of the log (picture taken from the blog)

The quote above is, I think, the best way to describe a log. It is taken from the famous blog post about the concept of the log and why LinkedIn built Kafka.

Now back to our main topic, how to get log data to the data lake?

The Obvious: Aggregate Existing Logs

Again we start with the obvious. Let’s take a web-based e-commerce site as a specific example. Usually a web-based e-commerce site uses a load balancer like nginx as part of the effort to make the front end horizontally scalable, and engineers will log all HTTP requests coming to it to a log file or a centralized logging system.

Another example is your services logging behavior at several levels: INFO, WARN, ERROR. Usually you log those into a file or a centralized logging system like CloudWatch Logs or Stackdriver.

The simplest way to start, if you already have those kinds of logs, is to ingest them into the data lake right away. There are many log aggregators you could use, like Apache Flume or Talend. However, using a log aggregator poses several issues:

Data comes in various formats; sometimes it’s just a raw log where every line is hard to parse.

There is no clear contract; if the log format changes, we incur another maintenance cost.

Each service needs to install an agent that aggregates logs to the centralized log system.

To avoid problems with formats and format changes, it is very helpful to standardize logs across the organization. One format that works quite well is JSON (the files need to be well compressed, since JSON is verbose and costly to parse).
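As a sketch of what a standardized JSON log could look like, here is one JSON object per line (JSON Lines style) with a few agreed fields. The field names here are just an illustration, not a real standard:

```python
import json
import time

def make_log_line(service, event, payload):
    """Serialize one activity record as a single JSON line."""
    record = {
        "service": service,        # which microservice emitted the event
        "event": event,            # event name, e.g. "add_to_cart"
        "timestamp": time.time(),  # epoch seconds; a real standard would pin this down
        "payload": payload,        # event-specific fields
    }
    return json.dumps(record)

line = make_log_line("cart-service", "add_to_cart", {"item_id": "sku-123", "qty": 1})
parsed = json.loads(line)  # any consumer can parse it back with a standard JSON parser
```

The upside of a shape like this is that every consumer needs only a JSON parser plus knowledge of a handful of shared top-level fields, while event-specific details stay inside `payload`.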

Wait, is there any other alternative?

Microservices Actively Push Activity to a Unified Log

The unified log concept, proposed by Jay Kreps in the blog post I mentioned above, has proven to be part of a must-have architecture nowadays. It has been adopted by many companies, and even the cloud giants have created their own unified logs, such as AWS Kinesis.

A unified log simplifies the architecture from a many-to-many dependency disaster like this

Unified log: before (picture taken from the blogpost)

to this.

Unified log: after (picture taken from the blogpost)

This alternative doesn’t solve every problem mentioned above about aggregating logs, but please bear with me; this design has more advantages.

Basically, each microservice actively pushes data to this unified log, published in an agreed format. Other systems can subscribe to the log and do whatever they want with it. Both publisher and subscriber agree on a specific schema for each log stream.
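To make the publish/subscribe contract concrete, here is a minimal in-memory sketch of a unified log. Real systems like Kafka or Kinesis work very differently internally; this only illustrates the append-and-subscribe idea and the per-stream schema agreement, with class and method names I made up:

```python
class UnifiedLog:
    """Toy append-only log with one schema (a set of required fields) per stream."""

    def __init__(self):
        self.streams = {}  # stream name -> list of records, in append order
        self.schemas = {}  # stream name -> required field names

    def register_stream(self, name, required_fields):
        self.streams[name] = []
        self.schemas[name] = set(required_fields)

    def publish(self, stream, record):
        # Enforce the agreed schema at publish time.
        missing = self.schemas[stream] - record.keys()
        if missing:
            raise ValueError(f"record violates schema, missing: {missing}")
        self.streams[stream].append(record)

    def read(self, stream, offset=0):
        """Subscribers read from an offset; the log itself is never mutated."""
        return self.streams[stream][offset:]

log = UnifiedLog()
log.register_stream("clickstream", ["user_id", "action"])
log.publish("clickstream", {"user_id": "u1", "action": "add_to_cart"})
events = log.read("clickstream")
```

Note that subscribers track their own read offset; the log only appends, which is exactly the append-only, totally-ordered property from the Kreps quote earlier.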

In this case, what we want is to subscribe to the unified log and write the data to our data lake. To do that, we can schedule a job that reads data in bulk from the unified log and writes it to the data lake. Taking Apache Kafka as an example of a unified log, we could use Spark Streaming to write data to the data lake periodically, or we could use something built specifically for this job, like Gobblin. Other managed-service alternatives are GCP Pub/Sub or AWS Kinesis.
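The scheduled bulk job can be sketched as a consumer that remembers the last offset it processed and writes each new batch as one file to the lake. Here the “lake” is just a local directory and the offset is a plain return value; tools like Spark Streaming or Gobblin handle this, plus partitioning and fault tolerance, for real:

```python
import json
import os
import tempfile

def run_batch(records, last_offset, lake_dir):
    """Read everything after last_offset and write it as one JSON Lines file."""
    batch = records[last_offset:]
    if batch:
        name = f"batch_{last_offset}_{last_offset + len(batch)}.jsonl"
        with open(os.path.join(lake_dir, name), "w") as f:
            for record in batch:
                f.write(json.dumps(record) + "\n")
    return last_offset + len(batch)  # new offset, persisted for the next run

lake = tempfile.mkdtemp()
log_records = [{"action": "add_to_cart"}, {"action": "checkout"}]
offset = run_batch(log_records, 0, lake)       # writes both records as one file
offset = run_batch(log_records, offset, lake)  # nothing new, so no file is written
```

Persisting the offset between runs is what makes the job resumable: if it crashes after writing but before saving the offset, the next run rewrites the same batch, which is an at-least-once behavior.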

Stream use cases, taken from a Spark meetup slide

What is powerful about the unified log is that it enables real-time data processing. This is mostly why it is so widely adopted: things are getting fast, and businesses nowadays need to process data and make decisions in real time. Take fraud detection as an example: we need to detect patterns as soon as possible so that we can take the right action while the fraud is taking place. Another use case is trend analysis, so that we can detect anomalies in real time.

The Details of Unified Log

Using the unified log concept is better than aggregating logs. However, there are details that need to be figured out when adopting it:

It needs to be highly available and performant. We introduce the unified log as the center of our architecture, which means it needs to be a very resilient component.

Understand your data requirements (performance, completeness, accuracy, etc.) in order to determine what kind of delivery guarantee you need: at most once, at least once, or exactly once. Achieving exactly-once delivery while keeping high performance under high load is a lot more costly.

Understand client library behavior. Sending data to the unified log shouldn’t take much effort from the client’s point of view. To achieve that, the client usually doesn’t send each record individually; instead, it buffers records up to a certain amount and eventually flushes them to the unified log. This is good for performance but also a risk of data loss: if data is still in the buffer and the client is not shut down gracefully (flushing before shutdown), that data is lost.

Centralized schema. This is a must, so that we can ensure subscribers will always be able to read and process the data sent by publishers. Leaving the data schemaless invites disaster in the future, like broken subscribers.

Unified log as a service. One thing to consider is having a service that wraps the unified log physically. The benefit is that we decouple microservices from a direct dependency on the physical storage, such as Kafka. But it introduces overhead, not to mention that the service needs to be insanely scalable and highly available.
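The client-side buffering behavior described above can be sketched like this: records sit in memory until the buffer fills or the producer is closed, so killing the process before a flush loses whatever is still buffered. The class and batch size here are made up for illustration:

```python
class BufferedProducer:
    """Buffers records client-side and flushes them to the log in batches."""

    def __init__(self, log, batch_size=3):
        self.log = log              # anything list-like that supports .extend()
        self.batch_size = batch_size
        self.buffer = []

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.log.extend(self.buffer)  # one network round trip in a real client
        self.buffer.clear()

    def close(self):
        self.flush()  # graceful shutdown flushes; a crash here means data loss

unified_log = []
producer = BufferedProducer(unified_log, batch_size=3)
for i in range(4):
    producer.send({"seq": i})
# At this point 3 records were flushed automatically; 1 is only in the buffer.
producer.close()  # without this, the fourth record would be lost on exit
```

This is the trade-off to internalize: bigger batches mean fewer round trips and better throughput, but also more records at risk between flushes.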

Database Ingestion vs The Log

If you think carefully, changes to a database can also be treated as a log. Take this example: in an e-commerce system we keep the state of an item in the cart, and the states are ADDED_TO_CART → PAID → DELIVERED. We could also keep this as a log: 1) an add-to-cart entry, 2) a payment entry, 3) a delivery entry. Databases themselves usually store their actions as a log too; take Postgres, which has the WAL, as an example.

My point is, whether the data comes from database ingestion or from a clickstream, both can be stored as a log. However, for data coming from database ingestion, doing so requires a redesign. One way is to put an observer in the part of the code that writes to the database. This pattern is usually called the event-sourcing architectural pattern.
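A minimal event-sourcing sketch, using the cart states above: instead of overwriting a status column in place, we append one event per transition and derive the current state by replaying the log. The event shapes and item id are invented for illustration:

```python
# Each state transition is appended as an event instead of overwriting a row.
events = [
    {"item_id": "sku-123", "type": "ADDED_TO_CART"},
    {"item_id": "sku-123", "type": "PAID"},
    {"item_id": "sku-123", "type": "DELIVERED"},
]

def current_state(event_log, item_id):
    """Replay the log: the last event seen for the item is its current state."""
    state = None
    for event in event_log:
        if event["item_id"] == item_id:
            state = event["type"]
    return state

state = current_state(events, "sku-123")  # the database row is derivable from the log
```

Replaying a prefix of the log gives the state at any earlier point in time, which is exactly what a state-only database row cannot do.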

There are several things to consider when adopting this pattern, but the most important point always boils down to data correctness: how can we ensure the integrity of the data that lives in the database as state versus the data that exists as a log in the data lake?