The art of logging

Best practices for writing, accessing and managing log files in a distributed environment.

Designing how the various services in a distributed environment write their logs, and how a team can efficiently access them, requires thoughtful consideration of many factors. With hundreds of instances, the volume of logs written per day can be huge, and the number of different components and services makes things even more complex. In such environments, the simple go-to-the-host-and-check-the-logs approach doesn't work. In this post, I'll talk about the best practices we have found helpful.

Use a log aggregator: once you have more than a handful of hosts, you need a central place for accessing logs. Otherwise, you end up doing a lot of grunt work for simple things, especially when you need to correlate events to debug an issue. That's why we ship all logs to a central place. Many open source and commercial tools solve centralized log management; a few open source options are Elastic Stack, Graylog, and syslog.

Standardize the log format: use a standard format for every component your team has created. We like JSON because we can feed it to a log aggregator and query based on various properties. Also, use standard key names. For example, if you are logging a user ID along with a log message, use the same key name everywhere. Don't mix "user_id" and "userID". It will save a lot of time when you need to query logs for a particular user across components.
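As a minimal sketch of this idea, the standard library's `logging` module can be given a custom formatter that emits one JSON object per line with fixed key names (the service name "billing-api" and the exact key set here are assumptions for illustration):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with standardized key names."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "severity": record.levelname,
            "service": "billing-api",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Always the same key ("user_id", never "userID") across all services.
        if hasattr(record, "user_id"):
            entry["user_id"] = record.user_id
        return json.dumps(entry)

logger = logging.getLogger("billing-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# "extra" attaches custom properties to the record.
logger.info("payment processed", extra={"user_id": "u-123"})
```

The `extra` mechanism keeps call sites short while the formatter enforces the shared key names in one place.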

Have filtering capability: another reason we like JSON is that filtering relevant logs becomes a simple task. We can easily filter logs for a particular user, role, service, source file, etc. JSON has a few great advantages: 1) we can introduce any number of custom properties when it makes sense. For example, "job_id" might be present only in the logs written by a background worker, while for web services "http_status_code" might be an important attribute to log. 2) JSON is easy for humans to read, which saves loads of time during debugging sessions when we want to create complex filters. 3) `jq` is an excellent tool for working with JSON on the CLI.
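To illustrate how simple filtering becomes with line-delimited JSON, here is a small sketch that matches entries against arbitrary field/value pairs (the sample log lines are invented):

```python
import json

def filter_logs(lines, **criteria):
    """Yield parsed log entries whose fields match all given criteria,
    e.g. filter_logs(lines, user_id="u-123", severity="ERROR").
    Roughly what `jq 'select(.user_id == "u-123")'` does on the CLI."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip stray non-JSON lines
        if all(entry.get(key) == value for key, value in criteria.items()):
            yield entry

logs = [
    '{"severity": "ERROR", "user_id": "u-123", "message": "timeout"}',
    '{"severity": "INFO", "user_id": "u-456", "message": "ok"}',
]
print(list(filter_logs(logs, user_id="u-123")))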

Log the most commonly used filters: user ID, job ID, request ID, service name, role of the instance, hostname of the instance, log message, severity level (info, warning, critical), and timestamp. Basically, log anything you think you'll need during debugging, minus the sensitive data. Also log the data that most commonly aids debugging: error number, error message, stack trace, source file, and line number.
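As one possible sketch of capturing those debugging fields, a helper can turn a caught exception into a JSON entry carrying the error number (when the exception has one), message, stack trace, and source location:

```python
import json
import traceback

def error_entry(exc, service="worker"):
    """Build a JSON log entry with common debugging fields:
    error number, error message, stack trace, source file and line."""
    tb = exc.__traceback__
    frame = traceback.extract_tb(tb)[-1] if tb else None
    return json.dumps({
        "severity": "ERROR",
        "service": service,  # hypothetical service name
        "error_no": getattr(exc, "errno", None),  # set for OSError and friends
        "error_message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, tb),
        "source_file": frame.filename if frame else None,
        "line": frame.lineno if frame else None,
    })

try:
    open("/no/such/file")
except OSError as err:
    print(error_entry(err))
```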

Convert logs written by 3rd party applications: besides the services we write ourselves, most of us also use applications like Nginx, MySQL, and PHP. These third party tools don't usually log JSON, so we convert their logs to JSON before sending them to the aggregator. Most log aggregator agents (fluentd, Logstash) can use regexes to parse these logs and convert them to JSON before shipping. Standard OS logs like /var/log/messages should be converted too.
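A sketch of what such a regex-based conversion does, using the Nginx/Apache "combined" access log format as an example (the sample line is invented; an agent like fluentd or Logstash applies the same idea via its own parser config):

```python
import json
import re

# Regex for the Nginx/Apache "combined" access log format.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def nginx_to_json(line):
    """Convert one combined-format access log line to a JSON string."""
    m = COMBINED.match(line)
    if not m:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    return json.dumps(entry)

line = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /api/v1/users HTTP/1.1" 200 512 "-" "curl/8.0"')
print(nginx_to_json(line))
```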

Clean up sensitive data: we should be careful about what we log, and remove any sensitive data before logging. For example, user credentials, session data, and user content should not be present in the log files.
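One way to enforce this is a small redaction pass over each entry before it is written; the list of sensitive key names here is a hypothetical starting point:

```python
import json

# Hypothetical set of key names treated as sensitive; extend as needed.
SENSITIVE_KEYS = {"password", "token", "session", "credit_card", "authorization"}

def redact(entry):
    """Return a copy of a log entry with sensitive values masked,
    recursing into nested objects."""
    clean = {}
    for key, value in entry.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean

entry = {"user_id": "u-123", "password": "hunter2", "meta": {"token": "abc"}}
print(json.dumps(redact(entry)))
```

Key-based masking like this catches the common cases; free-text message fields still need discipline at the call site.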

Storing and purging: a large system can generate hundreds of GBs of logs per day, and storing all of them on disk is not feasible. After a certain period, logs can be pushed to a service like AWS S3 for a duration, then to AWS Glacier, and finally deleted from Glacier altogether. AWS lifecycle policies make this very easy.
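The tiering described above can be expressed as a single S3 lifecycle rule, in the JSON shape accepted by `aws s3api put-bucket-lifecycle-configuration` (the `logs/` prefix and the 30/365-day thresholds are assumptions; pick periods that fit your retention policy):

```json
{
  "Rules": [
    {
      "ID": "archive-and-expire-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```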

Use dashboards: we have dashboards for each component, like the APIs, sync engine, and push notifications. Additionally, it's great if each developer can create their own dashboards and filter the logs relevant to them, so they can quickly focus on their components when debugging an issue. We use Kibana as it works well alongside Elasticsearch.

Include a request ID: including a unique ID with each request and logging it at every component lets you trace a particular request across all the components. Generate this unique ID on the client side. As an example scenario: if the client has logged a timeout for a request ID but that ID isn't logged anywhere on your server, it was probably a network issue, provided other requests were being served normally at that time.
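A minimal sketch of the propagation, assuming the ID travels in an `X-Request-ID` header (a common convention, not a standard) and that each service logs it and forwards it downstream:

```python
import uuid

def new_request_id():
    """The client generates this once, before its first request."""
    return str(uuid.uuid4())

def handle(request_headers, log):
    """Each service logs the incoming request ID and passes it along
    in the headers of any downstream calls it makes."""
    request_id = request_headers.get("X-Request-ID") or new_request_id()
    log({"severity": "INFO", "request_id": request_id,
         "message": "request received"})
    return request_id  # forward downstream

entries = []
client_id = new_request_id()
handle({"X-Request-ID": client_id}, entries.append)
```

With every component logging the same `request_id` key, one filter on that value reconstructs the whole request path.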

Useful things we can do with such a logging system are:

Each log entry has a severity category. We can use this to monitor the error rate for each module. Monitoring error rates is vital for alerting.
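Since every entry carries `service` and `severity` keys, a per-module error rate is a simple aggregation; a sketch (the sample entries are invented):

```python
from collections import Counter

def error_rates(entries):
    """Compute per-service error rate (errors / total) from structured
    log entries, counting ERROR and CRITICAL severities as errors."""
    totals, errors = Counter(), Counter()
    for e in entries:
        totals[e["service"]] += 1
        if e["severity"] in ("ERROR", "CRITICAL"):
            errors[e["service"]] += 1
    return {svc: errors[svc] / totals[svc] for svc in totals}

entries = [
    {"service": "api", "severity": "INFO"},
    {"service": "api", "severity": "ERROR"},
    {"service": "sync", "severity": "INFO"},
]
print(error_rates(entries))  # {'api': 0.5, 'sync': 0.0}
```

An alerting rule then just compares these rates against per-service thresholds.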

Another nice use case of error-rate monitoring is when rolling out an update. We can roll out the update to a percentage of our instances and monitor error rates only for those instances. If there is an increase in the error rate, we probably have a problem. Otherwise, we can start rolling out to more instances.
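The canary check amounts to comparing the error rate of the updated instances against the rest of the fleet; a sketch, where the 1% tolerance is an assumed threshold:

```python
def canary_regressed(baseline_entries, canary_entries, tolerance=0.01):
    """Return True if the canary group's error rate exceeds the baseline's
    by more than `tolerance`, signalling the rollout should pause."""
    def rate(entries):
        if not entries:
            return 0.0
        errors = sum(1 for e in entries
                     if e["severity"] in ("ERROR", "CRITICAL"))
        return errors / len(entries)
    return rate(canary_entries) > rate(baseline_entries) + tolerance

baseline = [{"severity": "INFO"}] * 99 + [{"severity": "ERROR"}]
canary = [{"severity": "INFO"}] * 9 + [{"severity": "ERROR"}]
print(canary_regressed(baseline, canary))  # 0.10 vs 0.01 -> True
```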

When debugging an issue for a user, we can filter all the relevant logs by that user's ID and identify the component at fault by correlating logs from different modules.

Tech support can use logs when replying to tickets. Create tools and dashboards that make it easy for the support team to get a "status report summary" for a user; they will be able to give more meaningful replies instead of robotic ones. Train them to use these tools.
