If you work in the SaaS world and you operate high-availability systems, you need to monitor your systems closely. Monitoring is one of OpsGenie’s critical internal business continuity operations.


Why do we monitor our systems? There are two main reasons. First, we need to be aware of any issues in our system, and monitoring tells us what is happening. Second, awareness alone is not enough: we also need to identify the root cause of issues. To do that, we need to collect as much monitoring data from our application as we can.

The next question is: what is monitoring data? The answer really depends on the application. For our systems (and in general), we are interested in three types of monitoring data:

Audit (aka trace)

Stat (or metric)

Logs

There are many products, frameworks, and tools for monitoring classical web applications. But if you implement your system on AWS Lambda and want to monitor your Lambda functions, almost no monitoring tool (at least for Java) tells you what is going on inside your Lambda function. AWS provides some metrics, such as how long an invocation took, but they don’t say much about what is happening under the hood. For monitoring Lambda functions in more detail, AWS XRay support was added. But while AWS XRay provides metrics about calls to external services such as DynamoDB, HTTP endpoints, etc., it doesn’t expose internal metrics. You still have no idea what is going on inside.

Let me explain why you need more detailed monitoring data for Lambda functions. Assume you have a Lambda function that takes a message in the request, processes it, and then saves it to DynamoDB. Let’s call this Lambda function “message-handler”. The function was designed to be used by different callers, so the request contains a regex and a message body text; each caller invokes the function with a regex specific to its own logic. Here, the regex is used to normalize the message body before saving it to DynamoDB. You deploy your Lambda function and it works like a charm. So far so good. But then you notice that some Lambda invocations take a few minutes, and some unlucky ones fail with a timeout. The only data you have is the invocation time; you have no idea which part of the code takes most of the time during an invocation. So you enable AWS XRay support. Its metrics show that for the long-running Lambda invocations, the DynamoDB calls are not the problem. Then what is the next part to blame?

This is one of the worst cases you might face with a Lambda function in a production environment. Here, the problem might be the regex operation, which causes catastrophic backtracking for some problematic regex and input text combinations, but you have no metric to verify that this is the problem. You cannot debug your Lambda function, you cannot profile it, and you cannot collect more detailed metrics. In this case, you would have to be a magician developer to find the problem by guessing with static code analysis.
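The catastrophic-backtracking pitfall can be made concrete. The sketch below is illustrative only (the pattern, inputs, and timeout value are not from our production code): it uses the classic `(a+)+$` pattern and wraps the match in a timeout so a pathological input cannot hang the whole invocation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class RegexBacktrackingDemo {

    // (a+)+$ is a classic catastrophic-backtracking pattern: for a non-matching
    // input of n 'a's followed by 'b', the engine tries exponentially many ways
    // to split the run of 'a's between the inner and outer groups.
    static final Pattern RISKY = Pattern.compile("(a+)+$");

    // Run the match on a separate (daemon) thread with a timeout so a
    // pathological input cannot block the caller indefinitely.
    static boolean matchesWithTimeout(Pattern p, String input, long millis) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a runaway match must not keep the JVM alive
            return t;
        });
        try {
            Future<Boolean> f = ex.submit(() -> p.matcher(input).matches());
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // treat as no-match; real code should also log/alert
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Benign input matches quickly.
        System.out.println(matchesWithTimeout(RISKY, "aaaa", 200));
        // Pathological input: 30 'a's plus 'b' can never match, but without the
        // timeout the match attempt may run for seconds or minutes.
        String bad = "a".repeat(30) + "b";
        System.out.println(matchesWithTimeout(RISKY, bad, 200));
    }
}
```

Without per-method audit metrics, the only externally visible symptom of the second case is a long invocation time, which is exactly the blind spot described above.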

At OpsGenie, we are implementing new applications as microservices and splitting our existing monolithic architecture into microservices, as we mentioned in our previous blog posts.

So, we decided to use AWS Lambda as a FaaS platform for deploying and running our microservices. But before going to production with our new Lambda architecture, the problem mentioned above was one of the key challenges we needed to handle.

Thundra — Let the Monitoring Begin

For the reasons mentioned above, we implemented our in-house monitoring infrastructure Thundra (named after a character from Aladdin) to collect more detailed metrics from Lambda functions for analyzing issues. Note that Thundra is not specific to AWS Lambda; it provides a general monitoring (audit + stat + log) infrastructure, and we adapted it to our Lambda infrastructure Sirocco (also a character from Aladdin) by using its plugin support.

All of the monitoring data types (audit + stat + log) can be published via the “MonitorDataPublisher” API. Currently, there are CloudWatch, Firehose, and Kinesis based publisher implementations, and others can be implemented when needed. In the Lambda environment, to minimize the overhead of publishing monitoring data, we publish it through CloudWatch logs. Monitoring data is written to CloudWatch logs (the log streams of the Lambda function’s log group), and another Lambda, “monitor-lambda”, is subscribed to the function’s log group with a filter pattern so it receives only monitor data (audit + stat + log) events. The monitor data events are then demultiplexed and forwarded to their targets (NewRelic/ElasticSearch + Graylog).
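As a rough sketch of this publishing seam (the type and method names below are illustrative, not the actual Thundra API), a CloudWatch-based publisher on Lambda can simply print a marked line to stdout; that line lands in the function’s CloudWatch log stream, where a subscription filter pattern can match it and route it to the monitor-lambda:

```java
import java.util.Map;

// Hypothetical publisher interface, sketched from the description above.
interface MonitorDataPublisher {
    void publish(String type, Map<String, Object> data); // type: "audit" | "stat" | "log"
}

// On Lambda, anything written to stdout ends up in the function's CloudWatch
// log stream, so "publishing" can be as cheap as printing a line marked with a
// token (here "MONITOR_DATA") that the subscription filter pattern matches.
class CloudWatchLogPublisher implements MonitorDataPublisher {
    @Override
    public void publish(String type, Map<String, Object> data) {
        StringBuilder json = new StringBuilder("{\"type\":\"").append(type).append("\"");
        for (Map.Entry<String, Object> e : data.entrySet()) {
            json.append(",\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"");
        }
        json.append("}");
        System.out.println("MONITOR_DATA " + json);
    }
}

public class PublisherDemo {
    public static void main(String[] args) {
        new CloudWatchLogPublisher().publish("stat", Map.<String, Object>of("cpuLoad", 0.42));
    }
}
```

The monitor-lambda then only has to parse the JSON payload after the marker and demultiplex events by their `type` field.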

The following diagram shows our Lambda monitoring architecture:

OpsGenie’s Lambda monitoring architecture

Audit and stat data are published to the following systems:

NewRelic: Currently we are using NewRelic’s Insights product (not APM, because we have our own APM solution for the AWS Lambda platform) only to query and visualize our audit and stat data.

Firehose: We also send our audit and stat data to Firehose for storing it in batches on both Amazon S3 and ElasticSearch.

- S3 is currently used as a backup of the raw audit and stat data. As a possible further improvement, we are also thinking of analysing the audit and stat data on S3 via Elastic MapReduce.

- ElasticSearch is used for querying and visualising (with Kibana and Grafana) audit and stat data, much like NewRelic Insights. Currently we use both NewRelic Insights and ElasticSearch for similar requirements because each has advantages and disadvantages over the other.

Log data is published to Graylog. At first, we were sending log data to Graylog over HTTPS in GELF format. But we then switched to TCP-level communication via SSL sockets (based on Netty), still in GELF format, due to the performance overhead of application-layer (HTTPS) communication and intermittent SSL handshake errors when using HTTPS.

Let me give an overview of what Thundra provides for monitoring. As mentioned above, it has three basic monitoring features:

Auditing and tracing

Collecting stats

Monitoring and enriching logs

Audit

Auditing is used for tracing executions (method calls), collecting metrics, and publishing the audit data to be analyzed later. An audit starts, collects audit metrics, and then finishes. It is generally used for tracing execution during request/response based communication.

In the auditing infrastructure, the key concept is the “AuditContext”, which represents a logical scope that is started and later finished. The scope depends on your logic and needs: it might be the whole request, a single method call, or multiple method calls. Scopes can contain child scopes, so “AuditContext”s are hierarchical.
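A minimal sketch of such a hierarchical, timed scope might look like the following (greatly simplified compared to the real “AuditContext”; the names and structure here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a hierarchical audit scope: it records a start time
// when opened, a duration when closed, and holds its child scopes.
class AuditContext {
    final String name; // e.g. "MessageService::processMessage"
    final List<AuditContext> children = new ArrayList<>();
    private final long startNanos = System.nanoTime();
    private long durationMillis = -1;

    AuditContext(String name) { this.name = name; }

    // Open a child scope under this one (scopes are hierarchical).
    AuditContext open(String childName) {
        AuditContext child = new AuditContext(childName);
        children.add(child);
        return child;
    }

    // Finish the scope and record how long it was open.
    void close() {
        durationMillis = (System.nanoTime() - startNanos) / 1_000_000;
    }

    long durationMillis() { return durationMillis; }
}

public class AuditContextDemo {
    public static void main(String[] args) throws InterruptedException {
        AuditContext root = new AuditContext("MessageHandler::handleRequest");
        AuditContext service = root.open("MessageService::processMessage");
        Thread.sleep(50); // stand-in for real work
        service.close();
        root.close();
        System.out.println(root.name + " children=" + root.children.size()
                + " childClosed=" + (service.durationMillis() >= 50));
    }
}
```

The tree of closed contexts, each with its own duration, is exactly the shape of audit data the publisher ships off for later analysis.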

Let me explain in detail with the “message-handler” example mentioned before. There are three tiers, as in a typical web application: the Controller, Service, and Repository layers.

The Lambda handler can be considered the Controller layer. Let’s call our Lambda function handler the “MessageHandler” class. “MessageHandler” unpacks the request and passes it to the Service layer via the “MessageService” class.

Processing the message body with the supplied regex is done in the Service layer, in the “MessageService” class. “MessageService” uses the “MessageNormalizer” class to normalize the given message body with the given regex, and passes the result to the Repository layer through the “MessageRepository” class.

The normalized message is then saved to DynamoDB by the Repository layer, in the “MessageRepository” class.
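The three layers above can be sketched as follows (this is an illustrative skeleton, not our actual code; the DynamoDB call is replaced by an in-memory stub so the example stays self-contained):

```java
import java.util.HashMap;
import java.util.Map;

// Normalizes a message body with a caller-supplied regex.
class MessageNormalizer {
    String normalize(String body, String regex) {
        return body.replaceAll(regex, " ").trim();
    }
}

// Repository layer: stands in for the DynamoDB table.
class MessageRepository {
    final Map<String, String> table = new HashMap<>();
    void save(String id, String message) { table.put(id, message); }
}

// Service layer: normalizes the body, then hands it to the repository.
class MessageService {
    final MessageNormalizer normalizer = new MessageNormalizer();
    final MessageRepository repository = new MessageRepository();
    void process(String id, String body, String regex) {
        repository.save(id, normalizer.normalize(body, regex));
    }
}

// Controller layer: stand-in for the Lambda function handler.
public class MessageHandler {
    public static void main(String[] args) {
        MessageService service = new MessageService();
        service.process("msg-1", "hello***world", "\\*+");
        System.out.println(service.repository.table.get("msg-1"));
    }
}
```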

Getting the request and returning the response is the root audit context: it starts with the request and ends with the response. Since this is a very high-level audit metric, for detailed analysis we need to collect finer-grained audit data at deeper audit context levels. Therefore, we should also trace the most important classes, which are the “MessageHandler”, “MessageService”, “MessageNormalizer” and “MessageRepository” classes. We can represent audit context names as “<class_name>::<method_name>”, and each sub-method call is associated with its caller as a sub audit context. Here are the durations of the audit contexts:

Invocation trace of “message-handler” Lambda function

With the help of this deep-level auditing, it is easy to notice that the problematic part is the normalization of the message body with the given regex. If we only had the invocation duration (5100 milliseconds), we couldn’t say which part (normalization or save) is the long-running action. You may think that since we can see the latency of the Repository layer via AWS XRay and DynamoDB is not the problem, it is obvious that the normalization is the problematic part. That might work for this case, but the more parts/modules there are in your system, the harder it is to find the culprit by guessing. You need concrete metrics to blame a specific part of your system.

This is all good, but manually opening and closing audit contexts is error-prone and clutters the code. Auditing should be a cross-cutting operation applied at configured/specified points (classes/methods). With this motivation, we introduced an instrumentation plugin for our audit infrastructure. Using it, you can specify the classes/methods to be audited, and their bytecode is instrumented to automatically open/close “AuditContext”s and inject their audit metrics.

Here are some of the highlighted features of our auditing infrastructure:

“AuditContext”s can be opened and closed manually or automatically.

“AuditContext”s are hierarchical so there might be sub-contexts.

“AuditContext”s are inheritable (via thread locals), so the audit trees of child threads are attached to the audit tree of their parent thread as sub audit trees.

Custom properties can be added into “AuditContext” to be shown in the metrics.

Audits are published via “AuditDataPublisher” API.

The classes/methods to be audited can be specified/configured programmatically (“@Auditable” annotation or “AuditPointProvider” API) or by system property.

Method arguments, return values and errors thrown from a method can be audited.

You can audit methods line by line so you can see which line takes how many milliseconds.

Custom actions can be defined to be executed before/after method call or when an exception is thrown.

Custom actions can be defined to be triggered when a method call exceeds a defined time limit; in this case, it is considered a long-running method. Currently, CPU profiling is supported if an invocation exceeds its defined long-run limit. So if a method call takes more time than expected (the long-run limit), CPU profiling kicks in, and at the end of the method call we have profiling information that shows which parts of the application are hot.

The audit infrastructure is integrated with AWS XRay. When there is an audit context, a mapped XRay sub-segment is created, so you can monitor and query audit metrics on AWS XRay. With this integration, we are on the road to complete tracing by combining distributed tracing with local tracing.

The audit Java agent can also be activated (attached to the JVM) on the fly. This is especially necessary in the AWS Lambda environment, because the agent cannot be specified as a VM argument to a Lambda function. Therefore, on Lambda, our audit agent is activated at startup and applies instrumentation to the audit points (classes/methods).
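The real infrastructure does this with bytecode instrumentation, which can’t be shown in a few lines. As a simplified, runnable stand-in for the same idea, a JDK dynamic proxy can open and close a timed scope around methods marked with a hypothetical “@Auditable” annotation (all names here are illustrative):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;

// Hypothetical marker annotation: methods carrying it get audited.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Auditable {}

interface MessageNormalizer {
    @Auditable
    String normalize(String body, String regex);
}

public class AutoAuditDemo {

    // Wrap a target in a proxy that times calls to @Auditable methods.
    @SuppressWarnings("unchecked")
    static <T> T audited(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface},
            (proxy, method, args) -> {
                boolean audit = method.isAnnotationPresent(Auditable.class);
                long start = System.nanoTime();
                try {
                    return method.invoke(target, args);
                } finally {
                    if (audit) {
                        System.out.println(iface.getSimpleName() + "::" + method.getName()
                                + " took " + (System.nanoTime() - start) / 1_000_000 + " ms");
                    }
                }
            });
    }

    public static void main(String[] args) {
        MessageNormalizer normalizer = audited(MessageNormalizer.class,
                (body, regex) -> body.replaceAll(regex, " "));
        System.out.println(normalizer.normalize("a-b-c", "-"));
    }
}
```

Bytecode instrumentation achieves the same effect without requiring an interface or a proxy, which is why the production agent takes that route.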

The following screenshots show how we use our collected audit data in production:

Lambda cold start occurrences for each function:

Lambda cold start occurrences for each Lambda function

Integration of our audit infrastructure with AWS XRay (so we have full tracing by connecting local and distributed trace data):

Integration of Thundra with AWS XRay

Stat

The stat infrastructure collects stats (cumulative or instantaneous) and publishes the stat data to be analyzed later. Stats can be application/environment specific (CPU stats, memory stats, etc …), module/layer specific (cache stats, DynamoDB stats, etc …) or domain specific (user stats, …).

Here are some of the important features of stat infrastructure:

Stats are collected through “StatInfoProvider” API on demand or periodically.

The triggering of “StatInfoProvider”s can be configured and controlled by the “StatInfoProviderStrategy” API. Trigger behaviour can be count based, time based, custom, or a composition of any of these with AND/OR operators.

Stats are published via “StatDataPublisher” API.
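The trigger-strategy idea above can be sketched as follows (the interface and method names are illustrative, not the real Thundra API): count-based and time-based strategies that can be composed with AND/OR.

```java
// Hypothetical trigger strategy: decides whether stats should be collected
// right now, with AND/OR composition of strategies.
interface StatTriggerStrategy {
    boolean shouldCollect();

    default StatTriggerStrategy and(StatTriggerStrategy other) {
        return () -> this.shouldCollect() && other.shouldCollect();
    }

    default StatTriggerStrategy or(StatTriggerStrategy other) {
        return () -> this.shouldCollect() || other.shouldCollect();
    }

    // Count-based trigger: fires on every n-th call.
    static StatTriggerStrategy everyN(int n) {
        int[] count = {0};
        return () -> ++count[0] % n == 0;
    }

    // Time-based trigger: fires when at least the given interval has elapsed.
    static StatTriggerStrategy everyMillis(long interval) {
        long[] last = {System.currentTimeMillis()};
        return () -> {
            long now = System.currentTimeMillis();
            if (now - last[0] >= interval) {
                last[0] = now;
                return true;
            }
            return false;
        };
    }
}

public class StatTriggerDemo {
    public static void main(String[] args) {
        StatTriggerStrategy every3 = StatTriggerStrategy.everyN(3);
        System.out.println(every3.shouldCollect()); // 1st call: does not fire
        System.out.println(every3.shouldCollect()); // 2nd call: does not fire
        System.out.println(every3.shouldCollect()); // 3rd call: fires
    }
}
```

A composite such as `everyN(100).or(everyMillis(60_000))` would then collect stats every 100th call or at least once a minute, whichever comes first.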

And these are the currently available stats:

CPU Stats: Process/system CPU load, system load average, processor count, process CPU time, total/free physical memory sizes, total/free swap space sizes, committed virtual memory size, etc …

Memory Stats: Heap and non-heap memory usages (init/used/committed/max), memory usages of each memory pool (Eden space, Survivor spaces, Tenured space).

GC Stats: Minor/Major GC counts/durations, GC CPU percentages, etc …

Thread Stats: Current thread count, peak thread count, total created thread count, metrics (CPU/user time, blocked time, waited time) of most CPU intensive threads.

Custom Stats: Custom statistics (cache statistics, DynamoDB statistics, etc …) published to “StatContext” for the thread-local stat session (ex. during the request).

For example, the following screenshot shows CPU percentages of each function:

CPU percentages for each Lambda function

Log

The logging infrastructure provides a log4j logger (“org.apache.log4j.Logger”) which decorates logs with application/environment information (application name, type, id, version, profile, hostname, host IP, etc …) and with provided domain-specific custom log properties (customer name, user name, etc …).

After decoration, the generated log messages are published via the “MonitoredLogPublisher” API (through CloudWatch, Firehose, Kinesis, etc.) to log repositories such as Graylog/ElasticSearch. With the help of the decorated properties, it is possible to identify where each log message was collected.
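The decoration step can be sketched roughly as follows (the field names and values below are illustrative; the real implementation plugs into log4j and resolves the environment information at startup):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LogDecoratorDemo {

    // Enrich a raw log message with application/environment fields plus any
    // domain-specific custom properties before it is published.
    static Map<String, Object> decorate(String message, Map<String, Object> customProps) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("message", message);
        // Application/environment information; in real code these would be
        // resolved once at startup, not hardcoded.
        record.put("applicationName", "message-handler");
        record.put("applicationVersion", "1.0.0");
        record.put("hostName", "lambda-host"); // placeholder for a real hostname lookup
        record.putAll(customProps);            // domain-specific custom properties
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = decorate("message saved",
                Map.<String, Object>of("customerName", "acme"));
        System.out.println(rec);
    }
}
```

A log repository such as Graylog can then filter and group messages by any of the decorated fields.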

In addition, log data can hold a reference to the active audit context, if there is one, so you can access the details and metrics of the execution in which the log was created. If a log message is unexpected, the referenced audit context can be analysed to see the big picture and understand why the code branched to the place where the log was created.