In the first blog post of this series I’ve bored everyone with details about syslog RFCs, formats, and transports. But I also presented an effective and fast way to parse syslog messages using the go-syslog library.

Building on that, this story intends to present the full pipeline to extract metrics from logs with the InfluxData stack.

For this purpose this blog post comes with some companion code (mostly YAML 😒) that demonstrates how to detect and count the number of out-of-memory (OOM) killed processes in Kubernetes.

Namely it shows:

how to parse syslog messages into metrics by ingesting them via Telegraf's syslog input plugin (and storing them into InfluxDB)

how to visualize and plot such metrics automatically and without hassle via Chronograf's new log viewer and/or dashboards

how to elicit new meaningful metrics (e.g., the number of OOM killed processes) through a Kapacitor UDF.

A nice sketch representing our log processing pipeline with mascots 🐯🐦🦒🐨.

Let’s proceed step by step.

Baking the environment 🍮

Before diving into it we need to set up our local environment, since we want to experiment with the pipeline locally.

So first of all we need a local Kubernetes.

But how do we tell it to forward its log messages to the first piece of our pipeline? We could point its Docker syslog logging driver at the port of our syslog receiver (i.e., a Telegraf syslog input plugin configured to listen over TCP on port 6514) with the right syslog message format (i.e., --log-opt syslog-format=rfc5424 ).
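For the record, a rough sketch of that approach, expressed as Docker daemon flags (the address is just a placeholder pointing at the Telegraf receiver we will configure later), would be:

dockerd \
  --log-driver=syslog \
  --log-opt syslog-address=tcp://127.0.0.1:6514 \
  --log-opt syslog-format=rfc5424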

Nevertheless such an idea will not work in practice, since we want our Telegraf to be run by Kubernetes and the Docker syslog logging driver is not able to wait for it to be up without breaking everything. Furthermore such a driver is not able to transport logs with the octet counting framing, and we want that because we love complicating our life by doing complicated things. 😌

So let’s arrange a plan for this …

The plan I came up with is to have Kubernetes (and its host's kernel) forward logs to journald. Unfortunately journald does not emit RFC 5424 syslog messages. So we'll use an rsyslog sidecar with the following setup to solve everything:

the imjournal module to grab messages from journald

the mmkubernetes module to convert them into valid RFC 5424 syslog messages

the omfwd module to send them to our syslog receiver with the octet counting technique.

These 3 steps translate into the code that I've put in an appendix at the end of this post for the brave.
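For orientation, here is a stripped-down sketch of what that rsyslog setup boils down to (target and port are placeholders, and most module parameters are omitted; the working configuration is the one in the appendix):

# grab messages from journald
module(load="imjournal")

# enrich them with Kubernetes metadata
module(load="mmkubernetes")
action(type="mmkubernetes")

# forward them as RFC 5424 syslog with octet counting framing
action(type="omfwd"
       target="telegraf" port="6514" protocol="tcp"
       template="RSYSLOG_SyslogProtocol23Format"
       TCP_Framing="octet-counted")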

For now we just need to start our minikube telling it to use journald as logging driver. That’s it.

minikube start --docker-opt log-driver=journald
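If you want to double-check that the option actually took effect, something along these lines (assuming the minikube VM runs Docker) should print journald:

minikube ssh -- docker info --format '{{.LoggingDriver}}'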

Yay!✌

Syslog input plugin for Telegraf

As already said, the first InfluxData component of our pipeline is Telegraf. It is a plugin-driven server agent for collecting and reporting metrics.

We created a syslog input plugin for it, which uses the go-syslog library under the hood. The intent is to use it as a syslog receiver that listens for syslog messages transmitted over UDP, TCP, or TLS, with or without the octet counting framing.

The downside is that we support (for now) only syslog messages formatted according to RFC 5424. Thus we do not support RFC 3164 legacy messages (yet) for the reasons explained in the first blog post of this series.

Anyway, given an input syslog message, this plugin will create a point in the syslog measurement with the following tags and fields.

Tags:

severity (its short level form as per the Linux kernel) — string

facility (its short keyword as per the syslog(3) man page) — string

hostname — string

appname — string

Fields:

version — integer

severity_code (as per table 2, section 6.2.1 of RFC 5424) — integer

facility_code (as per table 1, section 6.2.1 of RFC 5424) — integer

timestamp (the time recorded in the syslog message) — integer

procid — string

msgid — string

message — string

sdid (whether the structured data has an ID or not) — bool

structured data — string

Now, briefly, some things worth explaining to save you headaches and dizziness.

The timestamp of the point corresponds to the time the message was received. It is different from the timestamp field listed above, which in fact represents the one (if any) carried in the syslog message itself.

Furthermore the structured data (if present in the log message) produces field keys by combining the structured data identifier (i.e., sdid) with the structured data parameter name, joined by the config field sdparam_separator (which is _ by default).

Finally, reminding you that the message part may contain UTF-8, I invite you to use emojis in your syslog messages to make all of this more inviting! 😎

By the way, an example is worth a thousand words. Thus, the following syslog message (transported with octet counting framing) …

173 <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] An application event log entry ✨

… would lead to this outcome in terms of line protocol:

syslog,appname=evntslog,facility=local4,hostname=mymachine.example.com,severity=notice exampleSDID@32473_eventID="1011",exampleSDID@32473_eventSource="Application",exampleSDID@32473_iut="3",facility_code=20i,message="An application event log entry ✨",msgid="ID47",severity_code=5i,timestamp=1065910455003000000i,version=1i 1538421339749472344

We are all set now to configure this plugin for our Telegraf. It has various configuration options but we want to stay simple: we want to receive syslog messages over TCP on port 6514 and to parse them with best effort mode on. So here is the core part of its configuration.

[[inputs.syslog]]
  server = "tcp://:6514"
  best_effort = true
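For completeness, the collected points also have to reach InfluxDB; a minimal output section (the URL assumes an InfluxDB service reachable as influxdb in the same namespace) could look like this:

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "telegraf"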

Counting OOMs looking at syslogs 🧮

We want to be able to count the number of out-of-memory (OOM) killed processes simply by looking at the stream of incoming syslog messages, so that we can plot them in a dashboard.

What we know is that OOMs will generate logs in the host's kernel (which we'll receive, too) whose message part will contain a phrase of this kind:

... Kill process <PID> (<PROCESS NAME>) score <OOM SCORE> or sacrifice child

The plan (very simple and for demonstration purposes only) is to have a user defined function (UDF) that, given an input stream of logs, detects the ones whose messages match that format and counts their occurrences per process, also storing the OOM score.

To simplify: our streaming processor (i.e., Kapacitor) needs to accept a stream of syslog messages and, through a UDF, retain those that match the following regular expression.

(?m).*Kill process (?P<pid>\d+) (?P<proc>\(.*\)) score (?P<score>\d+)

Easy peasy lemon squeezy. 🍋🍑

Let’s write some pseudo-code to illustrate the logic of our example UDF.

// the OOM regular expression shown above
r := regexp.MustCompile(`(?m).*Kill process (?P<pid>\d+) (?P<proc>\(.*\)) score (?P<score>\d+)`)

message, ok := p.FieldsString["message"]
if ok {
    m := r.FindStringSubmatch(message)
    if m != nil {
        data := mapSubexpNames(m, r.SubexpNames())
        proc := strings.Trim(data["proc"], "()")

        state := h.state[proc]
        if state == nil {
            // initialize state and index it by process
            state = &myState{Counter: 0}
            h.state[proc] = state
        }
        state.Counter++

        // create point
        newpoint := &agent.Point{
            Time: time.Now().UnixNano(),
            Tags: map[string]string{
                "proc": proc,
                "pid":  data["pid"],
            },
            FieldsInt: map[string]int64{
                "count": state.Counter,
            },
        }

        // send point ...
    }
}
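The mapSubexpNames helper used above simply pairs the regular expression's named groups with the corresponding submatches; a minimal sketch of it (the real implementation is in the code linked below) could be:

// mapSubexpNames builds a map like {"pid": "1234", "proc": "(nginx)", "score": "42"}
// from the submatches and the named subexpressions of the regular expression.
func mapSubexpNames(matches, names []string) map[string]string {
    data := map[string]string{}
    for i, name := range names {
        if i == 0 || name == "" || i >= len(matches) {
            continue
        }
        data[name] = matches[i]
    }
    return data
}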

In case you want to dig deeper into the Kapacitor UDF, you can find its code here.

Finally we need:

to define the source of the input points for our UDF (i.e., database telegraf, retention policy autogen, measurement syslog);

to restrict them to the set we care about (appname equal to kernel and message containing the word sacrifice);

to define where it should stream back the resulting output points (i.e., database telegraf, measurement k8s).

Here is the TICK script that translates these requirements.

dbrp "telegraf"."autogen"

stream

|from()

.measurement('syslog')

.truncate(1ms)

.where(lambda: "appname" == 'kernel')

.where(lambda: "message" =~ /sacrifice/)

@example()

|influxDBOut()

.database('telegraf')

.measurement('k8s')
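One detail worth remembering: the @example() node refers to a UDF that Kapacitor only knows about if it is declared in its configuration. A minimal, process-based declaration (the binary path is a placeholder) would look roughly like this:

[udf]
[udf.functions]
    [udf.functions.example]
        prog = "/usr/local/bin/oom-udf"
        args = []
        timeout = "10s"

After that, the TICK script can be registered and enabled with the usual kapacitor define and kapacitor enable commands.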

I know, I know what you are thinking … There would be many other (better) use cases for a UDF to fit into this pipeline. But clearly this is just a demo and I care about the concepts, not the specific use case I picked to present them, so please don't blame me or my limited imagination. 🙏

Deploying the pipeline

Let’s put all the pieces together now to count the OOMs happening in our fake Kubernetes cluster.

Me doing YAML rather than doing serious things as in the past. 🥺

For simplicity we'll do everything in a single Kubernetes namespace (named logging). Apply the following namespace YAML file.

---
# Section: logging namespace
apiVersion: v1
kind: Namespace
metadata:
  name: logging

From now on, since YAML files are really verbose, I'll simply report the commands to apply them. You can look at the YAML files by clicking on their names (which link to the companion git repository).

We also want to store our metrics, so we need to deploy our fabulous time-series database.

Now it's time for Telegraf plus rsyslog (at this point you'll surely have understood why) with all the extra bits (e.g., sending metrics also to InfluxDB, sharing journald paths, roles to make mmkubernetes able to call the Kubernetes APIs, and so on).



kubectl apply -f
kubectl apply -f telelog.yaml
kubectl apply -f roles.yaml

Nothing special is needed to set up the log viewer.

Deploying Chronograf will suffice. Then we can manually create as many beautiful dashboards as we want.

At this point we can point our browser at localhost:8888 and enjoy the show of syslog metrics coming in and building up an awesome histogram.

By the way, do not forget to port forward the Chronograf service if needed.

kubectl port-forward svc/chronograf -n logging 8888:80

You should be able to see something similar to the image in the heading of this post!

Furthermore you could (and you should) play with the explorer to dig into the metrics that the syslog messages are generating.