What is your worst nightmare? For me, it is a customer telling us on social media that our service is down before we even know about it. To prevent this, we set up monitoring as part of the deployment process, and I hope you do the same. Monitoring tells us when things go wrong, helps us debug, and gives insight into the underlying issues. It can notify us when CPU or memory usage stays high for a sustained period. We want to continuously monitor our instances and services for anomalies in behavior, CPU usage, memory usage, disk space, network usage, and so on. Monitoring also lets us identify long-term trends, analyze performance and visualize system behavior.

Monitoring includes collecting, processing, aggregating and displaying data about a system, along with alerting based on that data. This lets us query error rates, request durations, latency and so on. An alert tells us what is breaking, which is the symptom; digging into further data helps us find the cause, which in turn helps us fix the issue.

Let’s see how we fix an issue with monitoring:

1. Get an alert for an error, or notice on the error graphs that something is wrong.
2. Look at the graphs to identify the time frame of the error.
3. Search the log files to pinpoint the error logs in that time frame.
4. Fix the error and make sure everything is back to normal afterwards.

Let’s summarize why we need to do monitoring:

- Get alerted when something goes wrong.
- Investigate and diagnose issues.
- Visualize system behavior using dashboards.
- Analyze long-term trends.
- Compare system behavior across multiple time frames.
- Conduct retrospective analysis.

Prometheus

Prometheus is an open-source monitoring tool built on a pull-based model: it scrapes metrics from targets, lets us query them and build dashboards from them, and raises alerts based on alerting rules. It uses PromQL (Prometheus Query Language) for querying metrics.

Prometheus has a central component, the Prometheus server, which monitors servers called targets. There can be a single target or many, each monitored for metrics such as CPU usage and memory usage. The Prometheus server scrapes its targets at a configured interval and stores the collected metrics in a time-series database. We can run PromQL queries against this data to get metrics for a specific target. By default, Prometheus also pulls metrics about its own instance. For third-party systems we can use exporters, which convert a system's metrics into the Prometheus metrics format. Prometheus provides its own dashboard for visualization. It also has an Alertmanager component that sends alerts via email, Slack or other alerting tools. We define rules that the Prometheus server reads, and it fires an alert when a defined condition is met.
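To make the idea of alert rules concrete, here is a minimal sketch of a rule file. It is not taken from the repository for this post: the file name, group name, alert names, thresholds and durations are all assumptions chosen for illustration. The `up` metric is produced by Prometheus itself for every scrape target, and `node_cpu_seconds_total` is the CPU metric exposed by recent node_exporter releases.

```yaml
# Hypothetical file, e.g. /etc/prometheus/alert.rules.yml.
# It only takes effect if it is listed under `rule_files:` in the Prometheus config.
groups:
  - name: instance_alerts            # hypothetical group name
    rules:
      # Fires when a target has failed its scrapes for 5 minutes straight.
      - alert: InstanceDown
        expr: 'up == 0'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"

      # Fires when average CPU usage on an instance stays above 80% for 10 minutes.
      - alert: HighCpuUsage
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} has been above 80% for 10 minutes"
```

The `for:` clause is what turns "CPU is high right now" into "CPU has been high for a certain time". One caveat if such a file is shipped with Ansible: the `{{ $labels.instance }}` placeholders would clash with Jinja2 in the `template` module, so the `copy` module (or a `{% raw %}` block) would be the safer choice.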

We have already talked about exporters, which provide information about many different components such as infrastructure, databases and web servers. There are lots of exporters available; two examples are node_exporter and blackbox_exporter. Node exporter produces metrics about infrastructure, including CPU, memory, disk and network stats and a lot more. Blackbox exporter produces metrics by probing endpoints over protocols such as HTTP and HTTPS to measure their availability, response time and so on. In this post, we are going to set up node_exporter and blackbox_exporter to work with the Prometheus server.

Let’s first set up Prometheus using Ansible. First, we create a “prometheus” user and group, which isolates ownership of the Prometheus server and improves security. This user cannot log in to the instance and is only used to run the service inside it. We define all parameters in a variable file for better control.

- name: Creating prometheus user group
  group: name="{{groupId}}"
  become: true

- name: Creating prometheus user
  user:
    name: "{{userId}}"
    group: "{{groupId}}"
    system: yes
    shell: "/sbin/nologin"
    comment: "{{userId}} nologin User"
    createhome: "no"
    state: present
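The tasks above read `userId` and `groupId` from the variable file, and the later tasks also use `version`. The post does not show that file, so the following is only a sketch of what it might look like; the file location and the version value are assumptions.

```yaml
# Hypothetical variable file for the prometheus role, e.g. vars/main.yml
userId: prometheus      # system user that owns the binary
groupId: prometheus     # group for that user
version: "2.45.0"       # example value only; pin the release you actually deploy
```

Keeping these values in one place means the node_exporter and blackbox_exporter roles can reuse the same task structure with their own names and versions.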

The next step is to download and install Prometheus, owned by the prometheus user. Once it is installed, we clean up the temporary files created during setup.



- name: Install prometheus
  unarchive:
    src: "https://github.com/prometheus/prometheus/releases/download/v{{ version }}/prometheus-{{ version }}.linux-amd64.tar.gz"
    dest: /tmp/
    remote_src: yes

- name: Copy prometheus file to bin
  copy:
    src: "/tmp/prometheus-{{ version }}.linux-amd64/prometheus"
    dest: "/usr/local/bin/prometheus"
    owner: "{{userId}}"
    group: "{{groupId}}"
    remote_src: yes
    mode: 0755

- name: Delete prometheus tmp folder
  file:
    path: '/tmp/prometheus-{{ version }}.linux-amd64'
    state: absent

Next, we copy the config file used by the Prometheus server.

- name: config file
  template:
    src: prometheus.conf.j2
    dest: /etc/prometheus/prometheus.conf
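Ansible's `template` module does not create missing parent directories. Unless `/etc/prometheus` is created elsewhere in the role (that part is not shown in this post), a task along these lines would need to run before the config file is copied. This is a sketch, and the ownership and mode are assumptions.

```yaml
- name: Create prometheus config directory
  file:
    path: /etc/prometheus
    state: directory
    owner: "{{userId}}"
    group: "{{groupId}}"
    mode: "0755"
```

The same consideration applies to `/data/blackbox_exporter`, which is used for the blackbox config file later on.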

Once everything needed to run Prometheus is in place, we copy the systemd unit file and enable the service, so that Prometheus is running again after every reboot.

- name: Copy systemd init file
  template:
    src: init.service.j2
    dest: /etc/systemd/system/prometheus.service
  notify: systemd_reload

- name: Start prometheus service
  service:
    name: prometheus
    state: started
    enabled: yes
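The `notify: systemd_reload` line refers to a handler that is not shown in this post. A minimal sketch of such a handler, assuming it lives in the role's `handlers/main.yml`, could look like this:

```yaml
# Hypothetical handlers/main.yml
# The handler name must match the string used in `notify:`.
- name: systemd_reload
  systemd:
    daemon_reload: yes
```

Handlers run only when the notifying task reports a change, so systemd re-reads its unit files only when the unit file was actually modified.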

With the service started, Prometheus begins working from its config file. We then check that it is running by sending an HTTP request to port 9090 and confirming that we get a 200 response.



- name: Check if prometheus is accessible
  uri:
    url: http://localhost:9090
    method: GET
    status_code: 200

With all this, we have our Prometheus server up and running and collecting data from its own instance.

We also want infrastructure metrics from other instances, so we need to set up node_exporter on them. The Prometheus server scrapes these instances through node_exporter and shows the details in its dashboard. The steps are similar to the Prometheus setup above, again using Ansible.

We create a user and group named “node_exporter”, which isolates ownership of node_exporter and improves security.

- name: Creating node_exporter user group
  group: name="{{groupId}}"
  become: true

- name: Creating node_exporter user
  user:
    name: "{{userId}}"
    group: "{{groupId}}"
    system: yes
    shell: "/sbin/nologin"
    comment: "{{userId}} nologin User"
    createhome: "no"
    state: present

The next step is to download and install node_exporter by placing its binary in “/usr/local/bin”. Once setup is done, we remove the leftover files.



- name: Install prometheus node exporter
  unarchive:
    src: "https://github.com/prometheus/node_exporter/releases/download/v{{ version }}/node_exporter-{{ version }}.linux-amd64.tar.gz"
    dest: /tmp/
    remote_src: yes

- name: Copy prometheus node exporter file to bin
  copy:
    src: "/tmp/node_exporter-{{ version }}.linux-amd64/node_exporter"
    dest: "/usr/local/bin/node_exporter"
    owner: "{{userId}}"
    group: "{{groupId}}"
    remote_src: yes
    mode: 0755

- name: Delete node exporter tmp folder
  file:
    path: '/tmp/node_exporter-{{ version }}.linux-amd64'
    state: absent

We copy the systemd unit file so that node_exporter starts on every reboot, then start the node_exporter service so it can begin collecting data.

- name: Copy systemd init file
  template:
    src: init.service.j2
    dest: /etc/systemd/system/node_exporter.service

- name: Start node_exporter service
  service:
    name: node_exporter
    state: started
    enabled: yes

Once everything is done, we are going to verify that it is working as expected.



- name: Check if node exporter emits metrics
  uri:
    url: http://127.0.0.1:9100/metrics
    method: GET
    status_code: 200

Once node_exporter is running, let’s set up blackbox_exporter to do blackbox monitoring of different services by probing their endpoints over protocols such as HTTP. Blackbox exporter takes a module and a target URL as parameters through its “/probe” API. Modules are configured in the blackbox config file; the default config includes an http_2xx module, which performs an HTTP probe and reports success on a 2xx response.

As with Prometheus and node_exporter, we create a user and group in the same way.

- name: Creating blackbox_exporter user group
  group: name="{{groupId}}"
  become: true

- name: Creating blackbox_exporter user
  user:
    name: "{{userId}}"
    group: "{{groupId}}"
    system: yes
    shell: "/sbin/nologin"
    comment: "{{userId}} nologin User"
    createhome: "no"
    state: present

The next step is to download and install blackbox_exporter by placing its binary in “/usr/local/bin”. Once setup is done, we remove the leftover files.
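The download task itself does not appear in the listing for this exporter. A sketch that mirrors the node_exporter task is shown here, on the assumption that the blackbox_exporter release archives follow the same naming pattern on GitHub; check the releases page for the version you use.

```yaml
# Assumed to mirror the node_exporter download task; not copied from the post.
- name: Install prometheus blackbox exporter
  unarchive:
    src: "https://github.com/prometheus/blackbox_exporter/releases/download/v{{ version }}/blackbox_exporter-{{ version }}.linux-amd64.tar.gz"
    dest: /tmp/
    remote_src: yes
```

With the archive unpacked under `/tmp/`, the copy and cleanup tasks can find the binary at the path they expect.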

- name: Copy prometheus blackbox exporter file to bin
  copy:
    src: "/tmp/blackbox_exporter-{{ version }}.linux-amd64/blackbox_exporter"
    dest: "/usr/local/bin/blackbox_exporter"
    owner: "{{userId}}"
    group: "{{groupId}}"
    remote_src: yes
    mode: 0755

- name: Delete blackbox exporter tmp folder
  file:
    path: '/tmp/blackbox_exporter-{{ version }}.linux-amd64'
    state: absent

Then we copy the blackbox config file and the systemd unit file to their respective destinations, and start blackbox_exporter so it can begin accepting requests.

- name: Copy blackbox exporter config file
  template:
    src: blackbox.yml.j2
    dest: /data/blackbox_exporter/blackbox.yml
    owner: "{{userId}}"
    group: "{{groupId}}"

- name: Copy systemd init file
  template:
    src: init.service.j2
    dest: /etc/systemd/system/blackbox_exporter.service
  notify: systemd_reload

- name: Start blackbox_exporter service
  service:
    name: blackbox_exporter
    state: started
    enabled: yes

Lastly we verify that blackbox_exporter is working fine.



- name: Check if blackbox_exporter is accessible
  uri:
    url: http://localhost:9115
    method: GET
    status_code: 200

We use the HTTP module for our probe, which is configured in the blackbox.yml config file.

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ipv4"
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      # An empty list means the default: any 2xx status counts as success
      valid_status_codes: []
      method: GET
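With the http_2xx module configured, the “/probe” API can be exercised directly before Prometheus is pointed at it. The two tasks below are a sketch of such a check and are not part of the original playbook; the task names and the `probe_result` variable are hypothetical. The exporter's response includes a `probe_success` metric, which is 1 when the probe passed.

```yaml
# Ask blackbox_exporter to probe a target with the http_2xx module.
- name: Probe a target through blackbox_exporter
  uri:
    url: "http://localhost:9115/probe?module=http_2xx&target=www.google.com"
    method: GET
    return_content: yes
    status_code: 200
  register: probe_result

# The exporter answers 200 even for failed probes, so check the metric itself.
- name: Confirm the probe succeeded
  assert:
    that:
      - "'probe_success 1' in probe_result.content"
```

This is the same request that Prometheus issues on each scrape of the blackbox job, with the module and target passed as query parameters.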

At this point, we have the Prometheus server and blackbox_exporter running on one instance and node_exporter running on another. We now update the Prometheus server configuration (in our example, probing http://www.google.com) and restart the server so it can start fetching data.

prometheus.conf template file

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets:
{% for host in groups['all'] %}
{% if inventory_hostname != host %}
        - '{{ host }}:9100'
{% endif %}
{% endfor %}

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://www.google.com
    relabel_configs:
      # Pass the listed target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape blackbox_exporter itself instead of the target
      - target_label: __address__
        replacement: localhost:9115

Next, we go to Prometheus dashboard and confirm that everything is up and running.

All targets monitored using Prometheus

Querying up state of all nodes using node_exporter

Probing http://www.google.com using blackbox_exporter

The complete code can be found in this git repository: https://github.com/MiteshSharma/PrometheusWithAnsible
