At Magnet we have many projects going on in different sectors and with a diverse range of customers. One of them, which we are very excited about, is a data monitoring tool for a client in the renewable energy sector. This type of project relies on many IoT devices sending large amounts of data to our applications, and that can be a big challenge.

In one of our most recent projects we had to manage the persistence and visualization of the data coming from solar panel devices. The tricky part of this project is the data ingestion, because if there is an error in that process the data is lost forever. Therefore, we want to ensure that our systems are resilient enough to avoid this issue, and for that, having a good monitoring system is a must.

OKSOL solar panel by Orkli

Defining the Metrics

Before starting with the monitoring system itself, it is important to define what we want to measure. Hence, we began by choosing the metrics we considered critical during data ingestion:

- Elapsed time: this is the most obvious one, since we wanted to measure the time our system needed to complete the whole request. The goal was to spot timeouts and possible computational inefficiencies.

- API Request Status: this indicator is easy to gather and gives very useful information. It allows us to see whether the IoT devices are sending malformed data or whether our system is behaving incorrectly by returning server errors.

- Data Ingestion Volume: this is an interesting metric to know the amount of data we get with each request (the device sends data in batches) and also to see the data volume over time.

Measuring

Once we had a clear view of what we wanted to do, we designed a strategy to perform the measurements, to save them in an external system and to visualize them properly.

The first part was the easiest, as we only had to make a wrapper over the data ingestion functionality. This way we could monitor without touching any of the existing code, and by doing so we avoided any possible errors that could break the data ingestion.
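As a rough sketch of that wrapper idea (the names here are illustrative, not the actual project code), the existing handler can be timed and its response status captured without modifying it, and any failure in the monitoring code itself is swallowed so it can never break the ingestion:

```clojure
;; Hypothetical sketch: `handler` stands in for the existing
;; data-ingestion function; `report-fn` sends the measurements
;; to an external system.
(defn wrap-with-monitoring
  "Wraps `handler` so that each call is timed and its response
  status is captured, without touching the handler's own code."
  [handler report-fn]
  (fn [request]
    (let [start    (System/nanoTime)
          response (handler request)
          elapsed  (/ (- (System/nanoTime) start) 1e6)] ; milliseconds
      ;; A monitoring failure must never break the ingestion itself.
      (try
        (report-fn {:elapsed-ms elapsed
                    :status     (:status response)})
        (catch Exception _))
      response)))
```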

Saving

We chose AWS CloudWatch to save the measured data. The main reasons for opting for CloudWatch were that our stack already relies on AWS, as well as its affordability, scalability and easy integration with other systems.

Since our stack is Clojure-based, we went for the Amazonica library, a wrapper around the AWS Java SDK. Sending a metric is straightforward, with a few points to note:

- Namespace: a container for CloudWatch metrics. Each namespace is isolated from the others. There is no need to create them by hand; they are created the first time you send a metric. In our case we defined a namespace for each environment: development, testing and production.

- Metrics: we can send more than one measurement in a single call. These are the required fields for each metric:

  - Name: an arbitrary name that identifies the metric.

  - Value: the measurement, which has to be a number.

  - Unit: the type of the value: Count, Seconds, Milliseconds, Bytes…

  - Dimensions: every metric has specific characteristics that describe it, and you can think of dimensions as categories for those characteristics. Each dimension has a name and a value; the latter can be anything. For example, we use a dimension to send the API request status alongside the data ingestion metric.
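Putting those pieces together, a minimal Amazonica call could look like the following sketch (the namespace, metric name and values are examples, not the project's real ones):

```clojure
(ns monitoring.metrics
  (:require [amazonica.aws.cloudwatch :as cloudwatch]))

;; Illustrative example: send one elapsed-time measurement,
;; tagged with the HTTP status of the request as a dimension.
(cloudwatch/put-metric-data
  {:namespace   "production"
   :metric-data [{:metric-name "elapsed-time"
                  :value       127.3
                  :unit        "Milliseconds"
                  :dimensions  [{:name  "status"
                                 :value "200"}]}]})
```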

More examples on how we send the data:
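The original snippets are not reproduced here, but as a hedged reconstruction of the idea (names and values are assumptions), the ingestion volume can be sent twice in the same call, once plain and once with the status dimension:

```clojure
;; Sketch: `batch` and `status` would come from the surrounding
;; ingestion code. The same measurement is sent with and without
;; the status dimension.
(cloudwatch/put-metric-data
  {:namespace   "production"
   :metric-data [{:metric-name "ingestion-volume"
                  :value       (count batch)
                  :unit        "Count"}
                 {:metric-name "ingestion-volume"
                  :value       (count batch)
                  :unit        "Count"
                  :dimensions  [{:name  "status"
                                 :value (str status)}]}]})
```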

As you probably noticed, the last two metrics look very similar; the only difference is that one of them carries an extra dimension. You might wonder why sending just the dimensioned one is not enough. The problem is that CloudWatch doesn't let us aggregate across dimensions. Making a SQL analogy, the GROUP BY clause doesn't exist in this case. The reason is that CloudWatch treats each unique combination of dimensions as a separate metric.

Visualization

After gathering and saving the data, there is one last step left: visualization. Visualizing the metrics properly is essential in order for these to be useful. CloudWatch already provides a built-in dashboard system, but we think it’s quite limited for our needs at the moment.

We use Grafana in most of our IoT projects (mostly for the BI part), and as the integration with CloudWatch is rather easy, we decided to give it a try. In just a few clicks we could configure CloudWatch as a data source, and all the metric data was available for use. Finally, we created the dashboard using the graph builder provided by Grafana. In the following pictures you can see the result:
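Although we set it up through the UI, the same data source can also be provisioned from a file; a minimal sketch (the region is an example, and credentials are assumed to come from the default AWS chain):

```yaml
# grafana/provisioning/datasources/cloudwatch.yaml
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default      # use the default AWS credentials chain
      defaultRegion: eu-west-1
```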

Grafana dashboard with some of the Metric graphs

Grafana Elapsed time metric graph. Note how the API follows the “fail fast” philosophy.

Choosing the right tool for the job is never an easy task, but I am confident that AWS CloudWatch + Grafana is a killer combination that we are going to leverage in the coming years.

This work was done in collaboration with Lucas Sousa de Freitas