1. Introduction

In this article, we will look at one of the many ways to keep an application’s behaviour under close watch: monitoring. Monitoring an application can be crucial, even in a development environment.

Let’s have a look at a typical case:

When an application has been developed, it is still very risky to assume that we can move straight to production just because the unit tests and the integration tests did not detect any problem. This holds even when an experienced tester has made their best effort to pen test the application and has made all the necessary business logic checks. BDD (Behaviour Driven Development) often comes into the equation as a means to prevent uncomfortable situations. It adds an extra element of reliability, given that BDD checks the behaviour of a running environment. Of course, TDD (Test Driven Development) is assumed to be followed under the hood. In the best-case scenario, we have about 90% code coverage, we have covered the logic that the IDE (Integrated Development Environment) did not flag, the BDD tests are fine, and we have a superb and ultra-professional DDD (Domain Driven Design) expert who assures us that the implementation matches all the expectations of the Product Owner. We are also very proud to have a team where everyone knows all the software design patterns by heart and uses and thinks about them in their daily development. Furthermore, everyone knows how to apply the SOLID (Single responsibility, Open-Closed, Liskov substitution, Interface Segregation, Dependency Inversion) and ACID (Atomicity, Consistency, Isolation, Durability) principles.

Everything seems fine, we go to production, and then we realize that the application isn’t really doing what it was supposed to do. Furthermore, it displays all sorts of unexpected behaviour. In these scenarios, the problem may well be that we paid no attention to CPU, GPU and memory usage, garbage collection, security, and the list goes on. We didn’t check the application’s downtime, resilience, availability or robustness. We also didn’t figure out how to handle concurrency and multiple users, or how much data actually needs to flow in and out of our application.

Monitoring can offer a lot of help to prevent these situations.

Please note that this article will undergo revisions in the future. The code is available on BitBucket. The basic code for this article can be found via tag 2.0.0. If you are interested in seeing further development and improvements, please check the codebase at a later time. It will change, but the changes will only be improvements and extra functionality.

For this article, we must have some basic experience with WebFlux and Spring. We also need to have some basic understanding of Architectural concepts and BASH scripting. We will stay focused on the domain model, code curiosities and monitoring.

2. Requirements

For this article, some experience with and understanding of Spring and Spring Boot is expected, although it is not essential. What’s really important is to have the following installed on your machine:

Java 14 (any Java 14 distribution will do. My best advice is to use SDKMAN!)

Latest Gradle (at least version 6.5 in order to be compatible with Java 14)

A good IDE (IntelliJ for example)

Docker Desktop (This is essential while running the helper bash scripts I’ve prepared for you)

3. Goals

When talking about monitoring, we usually make a few assumptions: it’s easy, it doesn’t take much time to set up, and we just want to check that the system is OK. There is also a general assumption that we shouldn’t complicate it too much. These are all valid points. On the other hand, if we think about what actually needs to be done to make monitoring a reality in our environment, we quickly realize that there are quite a few moving parts. Let’s think about what we need at a very high level:

1. We want metrics
2. We want to have those metrics in our system
3. We need to fetch metrics
4. We need to store metrics
5. We need to visualize these metrics
6. We need an alert system
7. Our system should not be impaired in any way in terms of performance

Looking at point 1, we are sure that we need metrics. Now comes the hard work of deciding which metrics we need. Do we need to monitor garbage collection? Resources? Memory usage? Maybe we need all of these, or maybe we already know exactly what we want to monitor. It’s good to have an idea beforehand.

For points 2 and 3, we need to decide between a push and a pull mechanism. In this article we are going to look at the pull mechanism. Its biggest advantage is that, once our application is running, we can control at any time what we want to receive. A push mechanism implies that the application is constantly sending metrics over the wire, even when we want to scale down our metrics fetching. In other words, monitoring will always make a dent in application performance, but we can reduce that dent if we use a pull mechanism.

Next, in point 4, we need to store our data. Fetching mechanisms often come with ephemeral storage engines, so called because very simple operations, such as a system restart or a system update, can reset them. For this reason, we need another party in our system to make sure that the longevity of our data is determined by us and not by the system.

Once our data is stored, we need a clean and easy way to visualize it. That is point 5. In many systems, the visualization software also lets us configure what we need for point 6, since alert systems usually go hand in hand with the visualization interface.

Point 7 can be a fallacy in itself: there is always a small decrease in performance. We could argue that if the decrease isn’t perceptible, then it effectively does not exist. That is a fair point, but it’s still important to think about it. Misconfigured monitoring solutions can bring a system down, cause latency issues and make the application unresponsive.
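To make the pull idea a bit more concrete, here is a minimal JDK-only sketch. It is not code from the project: the class PullRegistry and the metric name used below are hypothetical. The application only updates local counters, and the values are rendered in the Prometheus text format when a scraper asks for them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory registry: values are only read when a scraper asks for them. */
final class PullRegistry {
    // Sorted map so that the rendered page has a stable order.
    private final Map<String, LongAdder> counters = new ConcurrentSkipListMap<>();

    /** Cheap, local update; nothing is sent over the wire here. */
    void increment(String name) {
        counters.computeIfAbsent(name, key -> new LongAdder()).increment();
    }

    /** Called once per scrape; renders all counters in the Prometheus text format. */
    String scrape() {
        StringBuilder page = new StringBuilder();
        counters.forEach((name, value) -> page
                .append("# TYPE ").append(name).append(" counter\n")
                .append(name).append(' ').append(value.sum()).append('\n'));
        return page.toString();
    }
}
```

With this shape, the cost of monitoring is mostly paid when scrape() is called, so lowering the scrape frequency on the fetcher side lowers the load on the application without redeploying it.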

4. Implementation

Let’s first make an application! The case we are going to look at is a system that finds airports by means of a search word. Further, we will get information about webcams around those airports. RapidAPI is a REST API provider which makes several APIs available, and I was looking for something relatively complex. Then I thought about making a request that would depend on another: if we get the coordinates of an airport, we can use them, together with a desired radius, to get all the webcams in that area. For this case I will be using the Airport Finder API and the Webcams Travel API.

To make things a bit more interesting, I’ve created two REST applications:

International-airports-sst: the SST (Single Source of Truth) service. This is the service that takes the raw data coming from the RapidAPI interfaces and produces a more filtered dataset for our system. This service is protected via OAuth2. It runs on port 8081.

International-airports-rest: this one serves the front-end application and faces the user. It needs no authentication from the user. It does authenticate, though, in order to communicate with the SST service. It runs on port 8082.

Both REST applications have been created with Spring Boot and WebFlux. Both have been implemented in a reactive way.

I have also created a GUI (Graphical User Interface). In this case I am using Angular, and for the UX (User eXperience) design implementation I am using Angular Material. We will run this application on a NodeJS server.

I created two REST services because I find it a great way of showing monitoring in action. One service depends on the other, so we can analyze a wider variety of data behaviour. I have implemented OAuth2 for the same reason: I wanted one service to behave differently from the other. We will see a glimpse of that at the end of this article.

4.1. Application Setup

In this section, we’ll have a look at the application overview, how everything is set up and the containerized environment.

Let’s start with the application system.

In this simplified overview, we can see that, before we get any data, we go through at least 4 services. We have an Angular service running on NodeJS and two REST Spring Boot applications, and then the last one, the SST service, communicates with the outside via the RapidAPI interfaces.

Let’s have a closer look at what the containerized environment looks like. In our previous enumeration of requirements, we realized that we need at least 4 things: an application environment, a fetcher, a permanent persistence mechanism and a visualization environment. In our case these are our application runtime, Prometheus, InfluxDB and Grafana respectively. We could run everything separately on our local machine, but luckily we can speed up the development process by using Docker images and Docker Compose. Besides speed, the other big advantage is that we can run the whole environment with one single command in a single docker-machine. The source code will use the default localhost docker-machine provided by Docker Desktop.

Let’s have a look at our simplified version of a containerized environment with docker-compose:

Localhost configuration using Docker Desktop

As seen in the diagram, all our moving parts are containerized in one single machine. This is localhost. No extra configuration is needed when using Docker Desktop. It’s important to notice from the image that the only ports that are supposed to be part of the public domain are 3000 and 8080. All other ports are inside the local docker-machine domain. The containers are aware of each other by the names we have given them. We’ll see how to translate this system to docker-compose further on in the article.

4.2. Domain

Now that we have designed the architecture, let’s step into our domain and have a look at the most important aspects of the application. For this article it’s not my concern to explain the full length of the code and the practices I used. What’s important for this article is that we understand on a domain level what is happening.

Rapid API uses an authentication system based on keys and host paths. They also provide a ready-made code using OkHttp libraries. The implementation is very straightforward. Regarding OAuth2, I am assuming that we have basic knowledge of how it works. The same is said for Spring. Let’s have a look at the domain itself.

From RapidAPI we can get Airport information based on a search term and based on the code. These are the only endpoints we need for the airports in our case. They do provide more options. In this scenario we will be using two endpoints:

Be sure to check their API for further information about these endpoints.

In any of these cases our airport objects will have a similar structure to the following example:

When looking for airports by term, the result is, as we can see, an array of airports. It’s important to notice that, with each airport we find, we also get its longitude and latitude. Also notice that, although getting the airport by code suggests that we would get a single airport object, it still returns an array of airports, with one single element in it.

As we now have the coordinates, we can process the rest of the information and get the nearby webcams. In this case, we will use the provided endpoint for the Webcams Travel API:

Via this endpoint, we will get as expected an array. For the purpose of demonstration, I’m just showing one element of this array as if I’d got just one camera:

Notice how gigantic this JSON actually is. Our array is not the root element: it is the value of the webcams property, which sits at the root.

4.3. Code Dive

Let’s have a quick more in-depth rundown of the code in the example. This is a very academically implemented code and it is highly modularized. Before continuing I think it’s important that we know beforehand what each module actually does. Let’s see them from a bottom-up perspective. First the SST REST Service implementation:

International-airports-sst-client-common: library shared between all the clients of the RapidAPI interfaces.

International-airports-sst-client-airports: the REST client or consumer of the Airport Finder API interface.

International-airports-sst-client-webcam: the REST client or consumer of the Webcams Travel API.

International-airports-sst-data: the DTOs (Data Transfer Objects) used to expose the retrieved data.

International-airports-sst-live: the SST service which provides the first data filtering.

International-airports-sst-mock: should anything happen to the live service, or should RapidAPI no longer be available for some reason, we will still be able to run the complete exercise described in this article using this replacement for the International-airports-sst-live service. It is not implemented in version 1.0.1. Subsequent versions will have this implemented.

Then the International Airports REST Service:

International-airports-model: the data model used to communicate with the SST. It follows exactly the same implementation as the DTOs in International-airports-sst-data.

International-airports-service-api: home of all the service and repository interfaces.

International-airports-data: the DTOs used to expose the data retrieved from the SST.

International-airports-rest-api: home of all the controller interfaces.

International-airports-rest-service: the REST service which runs a data aggregation and provides the combined data from the airports and the webcams.

Finally for the front-end we have an Angular 8 application which resides in module international-airports-gui.

At this point we should have a pretty good grasp of what we want to do and how the domain is set up.

Let’s make a quick dive into the code. I will only highlight the aspects I think are most important.

The purpose of the implemented code on the SST service, is really to filter out this information and the purpose of REST service is finally to aggregate this information. For this exercise, I’ve used Lombok quite extensively. I’ve settled the model entities to be implemented as final but I’ve also made them serializable with a bit of help from Lombok magic. Since we are dealing with immutable data, it makes perfect sense to implement the data model as such. To achieve this we need to add this file to the root of every project where the code for the model needs to be final:

This will add all properties to the constructor in the same way it is done with @java.beans.ConstructorProperties. This allows deserialization of entities with final properties and no empty constructor. To generate final entities, instead of using the keyword final ourselves, we use Lombok’s @FieldDefaults and @AllArgsConstructor to make Lombok generate that class for us:
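As a rough, hand-written illustration of what such a generated class amounts to, consider the sketch below. The class name CoordinatesDto and its two fields are hypothetical and are not taken from the project’s model; the point is only the combination of final fields, an all-arguments constructor and @ConstructorProperties.

```java
import java.beans.ConstructorProperties;

/** Hypothetical DTO, written by hand to mirror what Lombok is expected to generate. */
final class CoordinatesDto {
    private final double latitude;
    private final double longitude;

    // Lists the constructor parameter names, so that a deserializer can match
    // JSON properties to arguments without needing an empty constructor.
    @ConstructorProperties({"latitude", "longitude"})
    CoordinatesDto(double latitude, double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    double getLatitude() {
        return latitude;
    }

    double getLongitude() {
        return longitude;
    }
}
```

Because every field is final and there are no setters, an instance cannot change after construction, which is what we want for data that is only read and passed along.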

The rest of the SST service is very straightforward to understand. Simply put, it is just a common REST service, protected with an OAuth2 Client Credentials grant type authentication. It reads data from a source and exposes the filtered data via another port.

The International Airports REST Service is also very straightforward to understand. There is, however, one particularly curious fact. I would like to share it in this article, given that it raises an interesting point about the purpose of the different Flux and Mono publishers in a WebFlux architecture. Let’s have a look at the aggregator service:

Looking at these two methods we quickly realize that we are using Pair and that the method getAirportByCode returns a Flux in spite of actually only returning one single AirportDto. Whereas in the getAirportsBySearchTerm method it makes perfect sense to return a Flux, it may be a bit confusing for the getAirportByCode method. In the end, we are only returning one AirportDto, so we could be easily led to believe that this should be a Mono.

We are chaining two publishers. First there is a Mono that comes directly out of the airportsService: a publisher for a single AirportDto. After that, we pair the AirportDto with the output of a webcam publisher, which we get via the method getCamsByLocationAndRadius of the webCamService. At this point, the only thing we are doing is setting up a new webcam publisher. This new publisher returns a Flux of cameras, and each camera has to be paired with the found AirportDto. It is because of this pairing that we have to return a Flux. The curious behavior I noticed with WebFlux is that, if I had returned a Mono, I would still get one single AirportDto, but with only one single WebCamDto and not all the webcams found in the REST call. We are using share to multicast all the AirportDtos found to the next webcam mapper. To put this into perspective, we can look at the following unit test cases:

Let’s look at the difference between both test methods. In the first one, we convert our Flux result into an Iterable, which is the same as saying that we block the reactive stream flow. In the second one, we do the same thing, but before we block the flow we convert our publisher from a Flux to a Mono. Under the hood, this means that the stream is reduced to the first element emitted. This is the reason why we get 10 webcams in the first example but only 1 in the second.
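The same effect can be imitated without Reactor. The sketch below uses java.util.stream as a stand-in, so it is an analogy and not the WebFlux code of the project; the class PairingSketch, its methods and the sample identifiers are all hypothetical. Pairing one airport with each of its webcams yields as many pairs as there are webcams, and keeping only the first element is loosely what happens when the many-element publisher is collapsed into a single-element one.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical, Reactor-free analogy of pairing one airport with many webcams. */
final class PairingSketch {

    /** Pairs one airport code with every webcam id found around it. */
    static List<Map.Entry<String, String>> pairAll(String airport, List<String> webcams) {
        return webcams.stream()
                .map(cam -> Map.entry(airport, cam))
                .collect(Collectors.toList());
    }

    /** Keeps only the first pair, loosely like reducing a many-element stream to one element. */
    static List<Map.Entry<String, String>> firstOnly(String airport, List<String> webcams) {
        return pairAll(airport, webcams).stream()
                .limit(1)
                .collect(Collectors.toList());
    }

    /** Generates sample webcam ids: cam-1, cam-2, and so on. */
    static List<String> sampleWebcams(int count) {
        return IntStream.rangeClosed(1, count)
                .mapToObj(i -> "cam-" + i)
                .collect(Collectors.toList());
    }
}
```

Calling pairAll with ten sample webcams gives ten pairs that all share the same airport, while firstOnly gives a single pair, which mirrors the 10 versus 1 result of the two tests.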

5. Monitoring

In previous steps we have looked into how the architecture is designed and we have looked at our domain. We have also had a look at the code and took a dive into some interesting aspects of the implementation. We should have a very clear view of what the system is doing at the moment and how all the moving parts are working. We will now go step-by-step through all the elements of the monitoring system. In this part of the article, it’s important to bear in mind that this system can be run on a local machine. All moving parts are part of a docker-compose setup.

In this article, I wanted to make an example that would be easy to understand and to modify and this is why every single component has its own custom Docker image.

Also, let’s keep in mind that the examples I’m showing happen after we have successfully run docker-compose. Further on in this article, we’ll go through the details. For now, let’s take note that there is a build.sh script with which we can start the system in one go. Let’s also keep in mind that the docker-machine is expected to be called dev.

5.1. Setting up InfluxDB

One of the greatest hurdles when working with any architecture is the persistence layer. Prometheus’s local storage is not meant to be a durable, long-term storage system. This is to be expected, given that Prometheus is primarily a fetcher and data management tool. This is why we have databases such as InfluxDB, a storage system specifically designed for time series data such as metrics. Let’s have a look at how we can connect to it.

InfluxDB does not need to be connected to Prometheus in the strict sense of the word. We can have it running and just use it within our system. To check that it’s actually up and running we can just run the following command:

influx -host localhost -port 8086

For this, we need to have the influx client installed. If we then run the right commands, we should experience the following command flow:

influx -host localhost
Connected to http://localhost version 1.7.9
InfluxDB shell version: v1.7.9
> use prometheus
Using database prometheus
> show measurements
name: measurements
name
----
app_version
expressjs_number_of_open_connections
http_server_requests_seconds_count
http_server_requests_seconds_max
(...)
tomcat_sessions_active_max_sessions
tomcat_sessions_alive_max_seconds
tomcat_sessions_created_sessions_total
tomcat_sessions_expired_sessions_total
tomcat_sessions_rejected_sessions_total
up

5.2. Setting up Prometheus

The Prometheus setup is probably the most complicated and elaborate of the whole monitoring setup. Here we have two concerns. One is that our applications have open REST interfaces that Prometheus can read. The other is to make sure that Prometheus can store this data in InfluxDB.

In this article we are going to have a look at two cases. These are Node applications and Spring Boot applications.

Let’s have a look at how Prometheus needs to be set up in our Spring Boot applications. First we need to establish which endpoints we need to open. We can do this by using Prometheus-specific properties. We make them available in our project by adding the following libraries:

implementation('org.springframework.boot:spring-boot-starter-actuator:2.2.2.RELEASE')
implementation('io.micrometer:micrometer-core:1.3.3')
implementation('io.micrometer:micrometer-registry-prometheus:1.3.3')

The actuator is needed because it provides important metrics that come out of the box in Spring Boot and can already be indirectly read by Prometheus. The micrometer properties from the io.micrometer core and prometheus libraries provide metrics using the same model as Prometheus. This allows for a much easier data reading in the Prometheus processes from the metrics endpoints. This will still not open the endpoints though. These are actually just the libraries that will allow the endpoints to be generated.

Let’s have a look at how we can enable the endpoints. Looking at the modified application.properties, we will find this new list:

management.endpoints.web.exposure.include=*
management.endpoint.shutdown.enabled=true
management.endpoint.metrics.enabled=true
management.endpoint.prometheus.enabled=true
management.endpoint.httptrace.enabled=true
management.metrics.export.prometheus.enabled=true
management.trace.http.enabled=true

With these simple instructions we are enabling the endpoints and also activating other tracing mechanisms. All of these management endpoints are part of the Spring Boot Actuator. However, the Prometheus actuator endpoint is only auto-configured when the micrometer-registry-prometheus library is present alongside the Actuator. Hence the need for the Micrometer libraries.

Prometheus has its own scrapers. They run against our Spring Boot applications and scrape all of those metrics periodically. We can also make our own metrics and make them available through the actuator. There are many different ways of doing this, and one example is to play a bit with trace. In our metrics we can find this endpoint:

/iairports/actuator/metrics/http.server.requests

Looking at the result we have:

Here we have COUNT, TOTAL_TIME and MAX. These are, essentially, the number of requests made to the application, the total time they took to respond, and the maximum time a single request took to get a response. These are all important metrics. With COUNT, we get an idea of how many requests reach our web application, which is handy for usage ratings. The total time is also important, since we want to monitor the performance of our application. With Prometheus we have much better options than this, but since we have both count and total time, we can derive the average response time for a given endpoint. The maximum response time is useful too: if one endpoint exceeds a certain time to respond, then that endpoint has a latency issue and some action must be taken to improve it.
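The arithmetic behind these two derived checks is small enough to sketch. The class ResponseTimes, its method names and the idea of a time budget are hypothetical names chosen for this illustration, and the values are assumed to be in seconds.

```java
/** Hypothetical helpers deriving simple indicators from COUNT, TOTAL_TIME and MAX. */
final class ResponseTimes {

    /** Mean seconds per request, or 0 when no request has been recorded yet. */
    static double averageSeconds(long count, double totalTimeSeconds) {
        return count == 0 ? 0.0 : totalTimeSeconds / count;
    }

    /** True when the slowest observed request took longer than the given budget. */
    static boolean exceedsBudget(double maxSeconds, double budgetSeconds) {
        return maxSeconds > budgetSeconds;
    }
}
```

For example, 4 requests with a total time of 2.0 seconds give an average of 0.5 seconds per request. Keep in mind that an average hides outliers, which is why MAX is worth watching next to it.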

What we don’t have in these metrics is the minimum amount of time it took for an endpoint to reply. We just don’t have data for that, because at no point have we registered the time of each individual request. Here is an alternative without Prometheus:

This creates an implementation of the HTTP tracing, which has been enabled by the HTTP tracing flag we saw before: management.trace.http.enabled=true. We can already notice that this implementation also carries a repository. Simply put, I just created a FIFO (First In, First Out) queue. In other words, I’m keeping just 10 elements in the repository at a time. Also, this code only records GET requests. We can change it to perform the same or another operation for other request types.
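The bounded FIFO idea itself does not depend on Spring. Below is a plain-JDK sketch of it; the class BoundedTraceStore is a hypothetical stand-in and not the project’s trace repository, and it stores only the request URI to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical plain-JDK stand-in for a trace repository that keeps only the newest entries. */
final class BoundedTraceStore {
    private final int capacity;
    private final Deque<String> traces = new ArrayDeque<>();

    BoundedTraceStore(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records GET requests only; the oldest entry is evicted once the store is full. */
    synchronized void add(String method, String uri) {
        if (!"GET".equals(method)) {
            return;
        }
        if (traces.size() == capacity) {
            traces.removeFirst();
        }
        traces.addLast(uri);
    }

    /** Returns a snapshot, oldest entry first. */
    synchronized List<String> findAll() {
        return new ArrayList<>(traces);
    }
}
```

A store created with a capacity of 10 never holds more than the 10 most recent GET requests, so its memory use stays constant no matter how long the application runs.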

The focus here is just on the fact that we can change metrics the way we please and reprogram some of them to report customized data directly into Prometheus or any other monitoring fetching tool. Traces will carry the following format as an example:

If we look at the last element of this request, we find timeTaken. This is the response time of each GET request. Our MIN could in this case be an implementation of the minimum timeTaken of this array. It’s important to notice that in this case, we are measuring milliseconds.
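Such a MIN could be computed as in the sketch below, assuming the timeTaken values of the stored traces have already been collected into a list of milliseconds. The class TraceStats and its method are hypothetical names.

```java
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical helper computing the missing MIN metric from recorded traces. */
final class TraceStats {

    /** Smallest timeTaken in milliseconds; empty when no trace has been recorded. */
    static OptionalLong minTimeTaken(List<Long> timeTakenMillis) {
        return timeTakenMillis.stream()
                .mapToLong(Long::longValue)
                .min();
    }
}
```

Returning an OptionalLong makes the "no requests yet" case explicit instead of reporting a misleading 0. Note that, with a repository holding only the last 10 traces, this is the minimum over a sliding window and not over the lifetime of the application.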

We have gone through the essentials for the Spring Boot back end support for Prometheus. Let’s have a look at how NodeJS can provide the same endpoint for Prometheus.

In our example, we have a fully developed application in Angular 8. We could have used NGINX to deploy it directly. In that case, however, NGINX itself would have had to be configured in order to provide the endpoints to Prometheus. NodeJS offers a very interesting alternative by means of two libraries: prometheus-api-metrics and prom-client.

There is a server.ts example in the code, but our focus should only be at this point on two crucial lines of that code:

The whole server.ts is an implementation of a simple NodeJS service which runs on express.

To get Prometheus running, we finally need to materialize all of what we have discussed into the prometheus.yml file:

Notice that, at the end, the configuration points to our InfluxDB. We are setting our scraping interval to 15 seconds and we evaluate each scrape for the past 30 seconds.

With these metrics in place, we are ready to jump to the Grafana configuration.

5.3. Setting up Grafana

Grafana has a different way of configuring than Prometheus. Here we don’t need any code. Grafana is based purely on the contract it has with Prometheus. Let’s recap what we have so far:

2 Spring Boot applications and 1 NodeJS application

Persistent Measurements Database InfluxDB

Metrics Fetcher and management with Prometheus

We will use Prometheus with Grafana to generate our graphs. A detailed discussion of how graphs are built with Grafana is a topic on its own and would take the focus away from this article. What’s important to understand is how we can make our Grafana configuration persistent as well. We want to keep our dashboard configuration persistent so that we can use it in any Grafana environment.

When we look at Grafana for the first time, we become immediately aware that we have to create a data source. With this action we also create a Dashboard provider. Dashboard providers are kept in YAML files located in /etc/grafana/provisioning/dashboards/ of Grafana. Let’s look at our example on dashboard.yml:

By doing this, we are telling Grafana that Prometheus is the name of its first dashboard provider. At the same time, we are letting Grafana know that we will load all dashboards located in /etc/grafana/provisioning/dashboards. Although we have already established a data provider for our dashboards, we still need the actual endpoint to Prometheus. This is done in our datasource.yml file:

After defining the data source and the dashboard data provider we finally need to place all our dashboards in /etc/grafana/provisioning/dashboards. Grafana provides ways to save these dashboards in a JSON format. They do not need to be referenced. As long as they reside in that folder, we will see that Grafana will read all of them.

5.4. Docker Compose orchestration

Let’s bring all the moving parts together. Fortunately, this is exceptionally easy to do with docker-compose. Looking into docker-compose.yml we can see the following:

This is precisely a configuration file that represents our first seen diagram. All ports in this example are opened. If we look at the example in the repo, all the ports are closed except for Grafana and the application itself on ports 3000 and 8080 respectively.

In the project, we can find a few build.sh and build-standalone.sh scripts. These have been made to provide a quick introduction to running everything with Docker Compose. The first runs everything and the latter runs just the container it refers to.

If we have a docker-machine named dev, then the build.sh should build and run everything in one go. By default, all unnecessary ports to the user are isolated. Let’s uncomment everything and run this script. InfluxDB has a command line as we have already seen before.

The three other important aspects of all of this setup are the application, Prometheus, and Grafana. Let’s have a look at them.

This is the look of our page:

This is what our Prometheus interface looks like:

Our JVM processes dashboard in Grafana:

And finally our NodeJS process dashboard in Grafana:

6. Conclusion

We have finally reached the end of this article. It is quite long, but I hope to have conveyed the essential points that I found inspiring in this sort of monitoring architecture. I didn’t want it to become too complicated, but I also wanted to make it a fun article with a few surprises.

We have gone through some curious cases, namely the Flux implementation with an Airport with multiple webCams and the implementation of an Http trace store.

In this article we have seen how to set up a very basic configuration of Prometheus, Grafana and InfluxDB applied to Spring Boot and NodeJS applications.

We have seen images as examples of how the Dashboards look and feel.

The end result of a complete setup is that we interact with Grafana. If we delve deep into these graphs, we realize how much detail they carry and how much we can learn from them. For example, we can examine in detail how garbage collection is working. We can examine GC algorithms like Shenandoah. We can look at how much is held in the Tenured space, Eden space and Survivor space. We can look at how many log entries of each type have been produced. We can measure how many requests were made and how fast they were performed. A classic example is that our SST service always performs more requests than our user-facing service. This is expected: our user-facing service aggregates requests, so it will respond with fewer requests. However, latency may not differ that much between them. We can also monitor the state of our CPU with Grafana, and therefore detect patterns between CPU usage, number of files opened, and GC activity. With Grafana we can also set up alerts and detect whether our GC has made a long pause and whether that is a problem for us.

Let’s imagine a situation. We are in a team. The product has been developed and everyone is excited about it. At the same time, the product isn’t used very often. A week goes by and no user has logged in yet. Just before the weekend, someone does. The unexpected happens. The application crashes and gives no warning to the user. A transaction never happened, and the user thinks it did. The whole weekend goes by and finally the user notices. The rest of the story is probably not a positive one. The point is that, with Grafana and similar systems, we can use all metrics to our benefit. We can prevent situations like this. We can detect patterns and set up alarms that fire once these patterns are detected. This situation would have been easy to prevent just by detecting an Exception, an Error, or a system crash. The guardian developer, and perhaps the whole team or part of it, would have taken action much earlier.

I have placed all the source code of this application on BitBucket.

I hope that you have enjoyed this article as much as I enjoyed writing it.

I’d love to hear your thoughts on it, so please leave your comments below.

Thanks in advance for your help and thank you for reading!

Take care, stay interested, stay logical, stay safe!

7. References

Percona — Using Prometheus with InfluxDB for metrics storage