You should have a logging strategy

2021-05-22

Every software product you create deserves a thought-through logging strategy: a shared answer to what tools to use for logging, what to log, how much to log, where to log, and how to collect, view, and retain those logs.

What is a Logging Strategy?

Once a software system moves off the developer's laptop and into a QA, stage, or prod environment, the only way to understand or make sense of what is going on inside the application is to look at its logs. A logging strategy therefore informs or guides the development team on what tools to use for logging, what to log, how much to log, where to log, how to collect these logs, how to view them, and how long to keep them.

When a project lacks a good logging strategy, developers log in ways that may not be helpful when troubleshooting issues in environments other than the development environment. When this happens you often hear teams asking questions like: "We are not sure what is impacting this user", or "We can't figure out why we have not received any orders for the past two days."

Log Generation

Don't reinvent the wheel: use an established logging library rather than writing your own.
Make use of proper log levels, so that the verbosity of logs can be tuned per environment.
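
As a minimal sketch (assuming Java with SLF4J and a hypothetical OrderService class), proper use of levels looks like this:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    // Logger named after the class, so the log category follows the package name.
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId, int quantity) {
        log.debug("Validating order {} with quantity {}", orderId, quantity); // detail, usually off in prod
        if (quantity > 1000) {
            log.warn("Unusually large quantity {} on order {}", quantity, orderId); // odd but recoverable
        }
        try {
            // ... actual order processing would go here ...
            log.info("Order {} accepted", orderId); // normal business event
        } catch (RuntimeException e) {
            log.error("Failed to process order {}", orderId, e); // needs attention; stack trace included
        }
    }
}
```

With levels applied consistently, production can run at INFO while a developer machine runs at DEBUG, without touching the code.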

What to log?

Make use of a proper log category; this typically maps to package names or namespaces, such as com.somehealth.someservice.someclass or somehealth.someservice.someclass, and makes searching for specific types of errors easier. Write meaningful log messages: avoid cryptic messages and avoid short forms. Be precise and clear, using simple English.

Examples
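
Two illustrative statements (assuming the SLF4J logger from the sketch above; the identifiers are hypothetical):

```java
// Cryptic: short forms and missing context make this hard to search or understand later.
log.error("ord fail u42");

// Meaningful: plain English, with the identifiers someone will need while investigating.
log.error("Order placement failed for userId={} orderId={} because payment was declined",
        userId, orderId);
```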

Make use of fixed formatting

Stick to one format; avoid using different formats in different services. Prefer a machine-parseable format: one that is easy to parse programmatically but is still human readable.
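
For example, with Logback (assuming Java services; file paths and names here are hypothetical), every service could share one encoder pattern:

```xml
<!-- logback.xml: one pattern shared by every service -->
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>/var/log/someservice/app.log</file>
    <encoder>
      <!-- timestamp | level | category | correlation id | message -->
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} | %-5level | %logger{36} | %X{correlationId} | %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="FILE" />
  </root>
</configuration>
```

The pipe-separated fields keep each line readable while remaining trivial to split programmatically; the %X{correlationId} field is explained in the microservices section below.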

Don't log sensitive information

Passwords, access tokens, payment card numbers, and other personally identifiable information should never appear in log files; mask or omit such values before logging.
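
A minimal sketch of masking (the helper name and regex are illustrative, not from any particular library):

```java
// Hypothetical helper: keep only the last four digits of a card number.
static String maskCardNumber(String cardNumber) {
    return cardNumber.replaceAll("\\d(?=\\d{4})", "*");
}

// Log the masked value; the raw number never reaches the log file.
log.info("Charging card {} for orderId={}", maskCardNumber(cardNumber), orderId);
```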

Logging to a database

This one is a debatable topic. The general recommendation is to write to files. Database logging is typically used for the following reasons:

Things to consider when deciding to log to a database

Logging challenges in distributed systems or microservices

  1. Mismatched logging formats. When dealing with microservices we have multiple services, each generating its own logs. We then have to look through the logs of all these different services, and if the formats differ it becomes difficult to troubleshoot or to employ a single unifying tool to view the logs. So it is important to maintain consistency in the logging done by different services: a great deal of pain can be averted by using the same style and formatting for logging across all of them.

  2. Stick to the same logging library. Try to use the same library for logging across your different services; if the services run on the same platform, there is rarely a reason to use different ones.

  3. How do we access and view multiple log files? When dealing with multiple services hosted on multiple hosts, it is not practical to go through the logs by connecting to each host and reading its log files. This is when you want to think about log collection: ship your logs to a central location, and use log consolidation and a log viewer to go through them.

  4. How do you consolidate and order logs across multiple log files? Most loggers record a timestamp for each log line. Make sure that all the hosts running the services have synchronized clocks (for example via NTP). With synchronized clocks, timestamped logs, and log consolidation, we can trace activity across multiple log files in the correct order.

  5. How do you relate logs to requests flowing through different services? To solve this problem we introduce correlation IDs, also called request IDs. A correlation ID is created at the very first entry point, normally as a GUID or some other form of unique ID. Once created, the ID is passed down to all subsequent services and systems, so that we can trace how the request flowed through the various systems by relating it back to the logs. Make sure the correlation ID is recorded in error logs as part of the context (a sketch follows this list).
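
A minimal sketch of the entry-point side, assuming a Java servlet filter and SLF4J's MDC (the header name is a common convention, not a standard):

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Creates (or reuses) the correlation ID at the entry point and exposes it
// to every log line in this request via the MDC.
public class CorrelationIdFilter implements Filter {
    private static final String HEADER = "X-Correlation-Id"; // assumed header name

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String id = ((HttpServletRequest) req).getHeader(HEADER);
        if (id == null || id.isEmpty()) {
            id = UUID.randomUUID().toString(); // first entry point: create the ID
        }
        MDC.put("correlationId", id);  // picked up by %X{correlationId} in the log pattern
        try {
            chain.doFilter(req, res);  // also forward the header on any outgoing calls
        } finally {
            MDC.remove("correlationId"); // don't leak the ID into the next request
        }
    }
}
```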

Log Collection

Log collection becomes important when we want to analyze data that has been logged on the servers. It is not good practice to log on to the server machines and start looking into the log files directly; as discussed above, this is impractical when dealing with multiple hosts and services. Copying log files over manually to look into them is also not the right approach. We should plan on proven log collection technologies, such as the Beats, Logstash, and Graylog options discussed below.

Log Aggregation

Centralised log aggregation is the process of bringing all logs together in one place. This is slightly different from just collecting logs: here we are dealing with the challenges of volume and variety. As we accumulate large volumes of log files in varied formats, we need to aggregate and normalise what has been collected. The tools for log collection and log aggregation overlap. For example, in ELK, Logstash does both collection and aggregation, but the Elastic Stack uses Beats to collect logs and then Logstash to aggregate or transform the data before feeding it into Elasticsearch. Graylog does it all: collection, aggregation, and visualization.
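
A minimal sketch of the collection end of that Elastic Stack flow (paths and hostname are hypothetical):

```yaml
# filebeat.yml: ship each service's local log files to a central Logstash
filebeat.inputs:
  - type: log
    paths:
      - /var/log/someservice/*.log

output.logstash:
  hosts: ["logstash.internal:5044"]  # central aggregation host
```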

Log Visualization

It is not enough to collect and aggregate logs; we also need some form of visualization, with the ability to analyse and search through the logs and check the details of the errors logged. This is where visualization tools come into play. In ELK and the Elastic Stack, Kibana provides visualization, search, and analysis capabilities over the logs collected and indexed in Elasticsearch. Graylog also provides visualization and analysis capabilities.
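
This is also where the correlation ID pays off. Assuming the logs were parsed into fields named level and correlationId (field names depend on how your pipeline parses the format), a single Kibana KQL query pulls up one request's trail across every service:

```
level : "ERROR" and correlationId : "9f2b4c6d-1e3a-4f5b-8c7d-0a1b2c3d4e5f"
```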

Storage and Retention

The last thing you need to worry about is where, and for how long, you will store these logs. Usually you need access to logs for two purposes: day-to-day troubleshooting and analysis by the team, and audit requirements.

Depending on which of these is the primary need, you can arrive at a reasonable duration for which the logs should be retained.

The second aspect, the storage medium, again depends on the two needs above. For the team's needs we should store logs in a centrally accessible location with quick access and the ability to analyse and search through them. For audit purposes you might archive logs to a tape drive or similar cold storage, since longer retrieval times are acceptable there.

Remember to set up some form of clean-up task on the hosts where services generate physical log files, in case your log collection strategy only copies the log files out and does not remove them.
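
On Linux hosts, logrotate is a common way to handle this (the path and retention below are hypothetical; make sure rotation runs only after the files have been shipped):

```
# /etc/logrotate.d/someservice: rotate and clean up local log files
/var/log/someservice/*.log {
    daily
    rotate 7        # keep at most 7 rotated files on the host
    compress
    missingok
    notifempty
    copytruncate    # rotate without restarting the service
}
```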

Some options to consider for Centralised Logging

Based on the tools discussed above, the ELK/Elastic Stack (Beats, Logstash, Elasticsearch, Kibana) and Graylog are both worth evaluating.