With containers springing up and down in minutes and virtual machines coming and going in hours, some sysadmins have neglected their system logs. Log files still provide invaluable insight into how systems are operating! Here’s what you still need to know.

When I was a young whippersnapper, we had to work our server logs 29 hours a day for tuppence a month, and the CIO would beat us around the head and neck with a broken bottle, if we were lucky! But, sheesh, you tell that to young sysadmins today, and they won’t believe you!

With the rise of containers and virtual machines, some system administrators have been neglecting their system logs. That’s a mistake.

Even if your containerized applications spin up and down several times an hour, you still need to keep and analyze logs. To find the root cause of a failure or to track down a system attack, you must be able to review what happened, when it happened, and what components of your software and hardware stack were affected. Otherwise, you’ll waste time looking for problems in the wrong place — time that you don’t have to spare in an emergency. Or, worse still, you may miss hidden issues such as performance problems, security violations, or costly use of system resources.

Without system logs, you’re not administering a system; you’re running a black box and hoping for the best. That’s no way to run servers, whether they are physical, virtual, or containerized.

So, here are some of the basics to keep in mind as you approach server logging in the 21st century. These are all practices that I either use myself or picked up from other sysadmins, including many from the invaluable r/sysadmin community on Reddit.

Centralize your logs

First, decide where to keep the log files. While there’s still a time and place for keeping logs in /var/log, in the case of Linux/Unix, or in %WINDIR%\Panther, in the case of Windows, most sysadmins and auditors prefer a centralized logging system. After all, who wants to search for a problem over hundreds of log files on as many servers? In addition to simplifying your log analysis, you can more easily secure log data if it’s in one location.

Also, centralized logs make it easier to conform with regulatory compliance mandates imposed under HIPAA, SOX, GDPR, etc.

There are numerous centralized logging programs, such as Splunk, Fluentd, and Graylog. Judging by the sheer number of mentions I’ve seen from sysadmins, however, a plurality build their own management tools on the ELK stack, a.k.a. the Elastic Stack.

ELK is made up of three open-source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that imports data from multiple sources simultaneously, transforms it, and then sends it to a “stash” like Elasticsearch. Kibana enables its users to visualize data in Elasticsearch. This approach enables IT departments (and their admins) to create a custom logging and monitoring mix that fits their company’s precise needs.
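To make the ingest, transform, and stash stages a little more concrete, here is a minimal Python sketch. It is not Logstash and uses none of its configuration syntax or APIs. The line format, the field names, and the plain list standing in for Elasticsearch are all assumptions made purely for illustration.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical line format: "<ISO-8601 timestamp> <host> <service>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+(?P<service>[^:\s]+):\s+(?P<message>.*)$"
)


def parse_line(line):
    """Ingest step: turn one raw log line into a dict, or None if it doesn't match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def transform(record):
    """Transform step: normalize the timestamp to UTC and flag likely errors."""
    parsed = datetime.fromisoformat(record["timestamp"])
    record["timestamp"] = parsed.astimezone(timezone.utc).isoformat()
    record["is_error"] = "error" in record["message"].lower()
    return record


def run_pipeline(lines, stash):
    """Stash step: append every parseable, transformed line as a JSON document."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            stash.append(json.dumps(transform(record), sort_keys=True))
    return stash


# Made-up sample lines from two hypothetical hosts in different time zones.
raw_lines = [
    "2024-03-01T09:15:00+01:00 web01 nginx: upstream timed out, error 504",
    "this line is garbage and gets skipped",
    "2024-03-01T08:16:30+00:00 db01 postgres: checkpoint complete",
]
stash = run_pipeline(raw_lines, [])
for document in stash:
    print(document)
```

The point of the sketch is the shape of the pipeline: unparseable input is dropped at the door, and everything that reaches the stash has the same fields and the same time zone, which is what makes searching across servers practical.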

How long should you keep log data?

Once you start centrally collecting your data, the next eternal question is “How long should you keep your log data?” The answer is: “It depends.”

Don’t just rely on your gut feeling to make this call. You need to consider the legal requirements for retention or destruction, your company’s own data retention policies, and how long the logs may be useful.

How much room they take up is also an issue, but in today’s world of cheap storage, it should be the last one on your list. In addition, big data log analysis tools such as VMware’s vRealize Log Insight and programs based on the SMACK stack can wring useful analysis from even the largest collection of log files.

That said, when you go metal detecting on a beach, you keep the coins; you don’t keep all the sand as well.
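One way to keep the coins and lose the sand is to write the retention decision down as a rule rather than leave it to gut feeling. The Python sketch below assumes a made-up set of log categories and retention periods. The real numbers have to come from your legal requirements and your company’s policy, not from this example.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days; these values are placeholders only.
RETENTION_DAYS = {
    "audit": 365 * 7,
    "security": 365,
    "application": 90,
    "debug": 14,
}
DEFAULT_RETENTION_DAYS = 30  # assumed fallback for categories not listed above


def is_expired(category, created_on, today):
    """Return True once a log of this category has outlived its retention period."""
    keep_for = timedelta(days=RETENTION_DAYS.get(category, DEFAULT_RETENTION_DAYS))
    return today - created_on > keep_for


def select_for_deletion(logs, today):
    """Return the names of the logs that are past their retention period."""
    return [name for name, category, created_on in logs
            if is_expired(category, created_on, today)]


# Made-up inventory of (file name, category, creation date).
today = date(2024, 6, 1)
logs = [
    ("audit-2019.log", "audit", date(2019, 1, 1)),
    ("app-2024-01.log", "application", date(2024, 1, 15)),
    ("app-2024-05.log", "application", date(2024, 5, 20)),
    ("debug-2024-05.log", "debug", date(2024, 5, 1)),
]
print(select_for_deletion(logs, today))
```

With these sample dates, the five-year-old audit log stays because its category demands it, while the January application log and the month-old debug log are selected for deletion.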

Monitoring basics

Once you have all that log data securely snuggled away, you need programs to help you monitor the data (and turn it into information). There are a multitude of choices, but before you choose one, you need to define what your organization needs.

Consider questions such as: How many servers? Using which network gear? Cloud? Containers? More than one data center? What configuration management tools are you using? What’s your budget? The list goes on and on. But, until you know the answers, you’re not ready to commit to a monitoring tool.

Generally speaking, as one high-level sysadmin at a major retailer recommends:

Anything backed by a relational database does not scale adequately. Choose a time series database whenever possible.

You must ensure high availability.

A single tool that does two things adequately is better than two platforms, each of which does only one thing really well. In the real world, you don’t want granular data (e.g., “One of your switches has rejected 79 percent of packets on interface 401-ew-13-gw-west”). You’d rather know that a switch is down, which explains why 12 application servers can’t talk to your back end.

Data granularity matters. “If you only poll data every 3 to 5 minutes, there’s a lot that can happen during that time that you will never catch.”

The sysadmin’s specific recommendations: “For a shop running 100 machines in an office? Sure, Zabbix is probably fine. You’re 100% Windows? Great, use SCOM. You only care about network equipment? PRTG or Solarwinds is probably fine. Web performance is all you care about? Site24x7/Pingdom are great.”

Many others agree with these recommendations. Other programs that people in the field are pleased about include DataDog, Prometheus, and LogicMonitor.
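The earlier point about polling granularity is easy to demonstrate with a toy simulation. The metric below is entirely made up: a CPU that idles at 20 percent and spikes to 95 percent for 90 seconds. The sketch simply compares what a five-minute poller and a ten-second poller would each record.

```python
def cpu_percent(second):
    """Hypothetical metric: idle at 20%, with a 90-second spike to 95%."""
    return 95 if 400 <= second < 490 else 20


def peak_seen(poll_interval, duration=900):
    """Return the highest value observed when polling every poll_interval seconds."""
    return max(cpu_percent(t) for t in range(0, duration, poll_interval))


print("polling every 300 s sees a peak of", peak_seen(300))
print("polling every 10 s sees a peak of", peak_seen(10))
```

The five-minute poller samples at 0, 300, and 600 seconds, so it never lands inside the spike and reports a quiet 20 percent. The ten-second poller catches the 95 percent peak.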


You may have noticed I’ve not mentioned older-style system monitoring programs such as Nagios. It’s not that Nagios is no longer useful or being used. But Nagios was designed before companies were deploying across multiple data centers or the cloud. It doesn’t scale well across today’s platforms.

Nagios is also check-based. That is, the software checks: “Is the CPU over 70%? Then the check has failed; take some action.” Generally, when a check fails, Nagios sends an alert. Prometheus, a newer program, monitors a metric over time; if the metric doesn’t meet a configurable threshold or baseline, the application sends an alert.
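The difference between the two styles can be sketched in a few lines of Python. This is neither Nagios plugin code nor Prometheus rule syntax. The 70 percent threshold comes from the example above, while the five-sample window and the sample data are assumptions.

```python
def point_in_time_check(latest_cpu, threshold=70):
    """Check-style test: fail as soon as the most recent reading is over the limit."""
    return latest_cpu > threshold


def sustained_breach(samples, threshold=70, window=5):
    """Metric-style test: fire only if the last `window` samples all exceed the limit."""
    recent = samples[-window:]
    return len(recent) == window and all(value > threshold for value in recent)


# Made-up CPU readings, oldest first.
brief_spike = [30, 35, 32, 91, 33, 31, 34, 88]
real_problem = [30, 72, 75, 81, 85, 90, 92, 95]

for name, series in (("brief spike", brief_spike), ("real problem", real_problem)):
    print(name,
          "| check fails:", point_in_time_check(series[-1]),
          "| sustained alert:", sustained_breach(series))
```

Both series end above 70 percent, so the point-in-time check fails for each. Only the second series stays high across the whole window, so only it triggers the sustained alert. Which behavior you want depends on whether a momentary spike is something you would act on.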

Both approaches have their virtues, however. There’s still lots of life left in Nagios.

When it comes to logging and monitoring tools, don’t search for a silver bullet. There is none. Each tool has advantages and disadvantages to weigh against your previously itemized company needs.

Before you buy or set up any software, be wary of programs that promise unlimited logging with little effort and no need for agents or syslogs. Also be careful of programs that offer “value-added functionality,” such as automatically shutting down a rogue device, which may cause more trouble than it’s worth.

Tips and tricks of the logging trade

Another important point with log keeping, and one I feel almost embarrassed to mention, is that you must synchronize your servers’ clocks with the Network Time Protocol (NTP) and set them to Zulu time, that is, Coordinated Universal Time (UTC), still widely called Greenwich Mean Time. This way your logs carry consistent timestamps and can be correlated with one another, no matter where the data is gathered. Without a common time reference, you can’t reliably use timestamps to correlate events.

You might think that this is a “duh” point to mention, but I recently encountered an article suggesting that users set their Amazon Web Services instances to local time. No!
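The same rule applies to the applications that write the logs. As a minimal sketch, here is one way to make Python’s standard `logging` module stamp records in Zulu time regardless of the server’s local time zone. The logger name, the message, and the in-memory buffer standing in for a log file are assumptions for the example.

```python
import io
import logging
import time

formatter = logging.Formatter(
    fmt="%(asctime)sZ %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # render timestamps in UTC rather than local time

buffer = io.StringIO()  # stands in for a log file or a log shipper
handler = logging.StreamHandler(buffer)
handler.setFormatter(formatter)

logger = logging.getLogger("utc_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in the buffer only

logger.info("backup finished")
print(buffer.getvalue(), end="")
```

This only controls how the timestamp is rendered. It does nothing about clock drift, which is why you still need time synchronization on the host itself.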

Set alert thresholds appropriately from the start. If you don’t, you will get awakened in the middle of the night by a default-level alarm that you don’t really care about. That’s not a threat. It’s a promise.

This is important. Alert burnout is real, and it can cause you to begin ignoring alarms, which eventually will be catastrophic. If the information you’re monitoring isn’t going to change what you are doing, you’re wasting your time. Alerts need to be actionable. Otherwise they’re just useless noise.

My rule of thumb is simple: Alarms should be alarming. Everything that comes to my phone should make me shoot out of bed to fix the problem. Everything else should be relegated to email (subject to various inbox rules) or, if really low-priority, a summary report. If you set your phone alerts so that you get 10 messages a day, you will stop checking your phone after two days.
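That rule of thumb translates into a very small routing function. The sketch below is one possible encoding, not the interface of any particular monitoring product. The severity levels and channel names are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels for this example."""
    CRITICAL = 3
    WARNING = 2
    INFO = 1


def route_alert(severity, actionable):
    """Pick a delivery channel; anything nobody can act on is dropped as noise."""
    if not actionable:
        return "drop"
    if severity is Severity.CRITICAL:
        return "phone"           # worth getting out of bed for
    if severity is Severity.WARNING:
        return "email"           # can wait until morning
    return "summary_report"      # low priority, reviewed in bulk


print(route_alert(Severity.CRITICAL, actionable=True))
print(route_alert(Severity.WARNING, actionable=True))
print(route_alert(Severity.INFO, actionable=True))
print(route_alert(Severity.CRITICAL, actionable=False))
```

The check for actionability comes first on purpose: an alert that won’t change what you do is noise no matter how severe it looks.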

You’re better off monitoring a small number of carefully chosen things that matter than a large number of things that don’t. For example, people often monitor network utilization on all servers as their first step, rather than checking on the most critical services. The result? Stuff is still down, and they have huge volumes of useless data collected. Understand where your critical problems are most likely to happen, and check those servers first.

You should also use your captured data for more than just troubleshooting. For example, use it to predict trends and future hardware investment.

The usefulness will become clear when it comes time to negotiate about departmental budgets. Management will listen better if you say, “Over the past year our storage-area network disk usage has increased by 5TB. Based on that, I expect to need another 5TB for the coming year, so we will need additional storage to accommodate that growth.” That’s more credible than, “I think we need more storage this year.”
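If you have a year of monthly usage readings, you can back that kind of statement with a simple trend line. The sketch below fits a least-squares slope to made-up readings that grow from 20TB to 25TB over 12 months, matching the 5TB of growth in the example above, and projects a year ahead. A straight line is an assumption too; real growth may not be linear.

```python
def monthly_growth(readings_tb):
    """Least-squares slope of usage (TB) against month index."""
    n = len(readings_tb)
    mean_x = (n - 1) / 2
    mean_y = sum(readings_tb) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings_tb))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def forecast(readings_tb, months_ahead):
    """Project usage forward from the latest reading along the fitted trend."""
    return readings_tb[-1] + monthly_growth(readings_tb) * months_ahead


# Hypothetical data: 13 month-end readings rising steadily from 20TB to 25TB.
readings_tb = [20 + 5 * month / 12 for month in range(13)]

print(f"growth per month: {monthly_growth(readings_tb):.2f} TB")
print(f"projected usage in 12 months: {forecast(readings_tb, 12):.1f} TB")
```

With these sample numbers the projection lands at roughly 30TB, which is the “another 5TB” from the budget conversation, only now with the arithmetic to show for it.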

Get the picture? Logging and monitoring remain just as important as ever. You ignore them at your risk and your company’s peril.

Logging: Lessons for leaders