Every software product you create needs a well-thought-out logging strategy: a plan for what tools to use for logging, what to log, how much to log, where to log, and how to collect, view, and retain those logs.
What is a Logging Strategy?
Once a software system moves off the developer's laptop and into a QA, Stage, or Prod environment, the only way to understand what is going on inside the application is by looking at its logs. So a logging strategy informs or guides the development team on what tools to use for logging, what to log, how much to log, where to log, how to collect these logs, how to view these logs, and how long to keep these logs.
When a project lacks a good logging strategy, developers log in ways that may not be helpful when troubleshooting issues in environments other than the development environment. When this happens you often hear teams asking questions like "We are not sure what is impacting this user" or "We can't figure out why we have not received any orders for the past two days".
Log Generation.
Don't reinvent the wheel.
- This means do not go about using printf, writeLine, or console.log calls to create logs yourself. Make use of a simple and popular library (a minimal sketch follows this list).
- .NET - Make use of Serilog, log4net, or NLog.
- Java - Make use of Logback or Log4j2.
- Python - Make use of the built-in logging module, or python-json-logger.
- Node - Make use of Winston, Bunyan, or Pino.
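As a minimal illustration of the point above, here is a hedged sketch using Python's built-in logging module (one of the options listed); the logger name is a placeholder:

```python
import logging

# One-time configuration at application startup; the library takes care of
# formatting, levels, and output destinations so you do not have to.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)

logger = logging.getLogger("somehealth.someservice")
logger.info("Service started")  # instead of print("Service started")
```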
Make use of proper log levels.
- TRACE Level - Can be used in development but should never be committed into a repository and should never reach production.
- DEBUG Level - Used for debugging in development; only a minimal number of logs at this level should be committed into the repository. It should not be active in production by default, only in exceptional situations and for a short duration.
- INFO Level - Log user-driven or system-specific actions at this level.
- WARN Level - Log all events that could potentially become an error, such as DB calls taking longer than a threshold or missing information.
- ERROR Level - Every error should be logged at this level (a short usage sketch follows this list).
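A hedged sketch of how these levels might be used in code, again with Python's logging module; the logger name, IDs, and the 500 ms threshold are illustrative assumptions:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("somehealth.orderservice")

logger.debug("Raw order payload received")                     # DEBUG: development use, off in production
logger.info("Order 1042 placed by user 87")                    # INFO: a user-driven action
logger.warning("DB call took 780 ms, above the 500 ms limit")  # WARN: could become an error

try:
    raise ConnectionError("fulfillment service unreachable")   # stand-in for a real failure
except ConnectionError:
    logger.exception("Failed to notify fulfillment for order 1042")  # ERROR, with stack trace
```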
What to log?
Make use of proper log categories; these are typically package names or namespaces like com.somehealth.someservice.someclass or somehealth.someservice.someclass. This will make searching for specific types of errors easier. Write meaningful log messages. Avoid cryptic log messages and short forms. Be precise and clear, using simple English.
Examples
- "There was a problem" can be "This folder name is already taken. Try a different name"
- "Something went wrong" can be "We are unable to take your request. Please contact our customer service" or "We are unable to take your request now. Please try again later"
- "Failed to Sched. Bio. Apt." can be "We were unable to schedule your biometric appointment. Please try again later or contact our customer service"
- "Lab rejected." can be "Request for Lab was rejected as it is missing Provider information"
- Make sure to log the context of the error: which request it was for, which user, and which field or data point. This is extremely important, as without context error messages become meaningless (a short sketch follows this list).
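Putting categories, meaningful messages, and context together, a hedged sketch using Python's logging module; the logger name, function, and field names are hypothetical:

```python
import logging
from typing import Optional

# The category mirrors the namespace, which makes it easy to search
# for errors coming from this specific part of the system.
logger = logging.getLogger("somehealth.labservice.orders")

def submit_lab_request(request_id: str, user_id: str, provider: Optional[str]) -> bool:
    if not provider:
        # Clear message plus context: which request, which user, which field.
        logger.error(
            "Request for Lab was rejected as it is missing Provider information "
            "(request_id=%s, user_id=%s, field=provider)",
            request_id, user_id,
        )
        return False
    logger.info("Lab request accepted (request_id=%s, user_id=%s)", request_id, user_id)
    return True
```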
Make use of fixed formatting.
Stick to one format; avoid using different formats in different services. Prefer a machine-parseable format, one that is easy to parse programmatically but is still human readable.
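One way to get a format that is both machine parseable and human readable is structured (JSON) logging, with the same fields emitted by every service. A minimal sketch using only the Python standard library; the field names are assumptions, and the python-json-logger package mentioned earlier provides the same idea out of the box:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log line as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("somehealth.someservice").info("Order 1042 placed")
# -> {"timestamp": "...", "level": "INFO", "logger": "somehealth.someservice", "message": "Order 1042 placed"}
```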
Don't log sensitive information.
- Do not log patient health information (PHI). This means we should NOT log the symptoms, conditions, or diseases of a patient, nor a patient's health history. Doing so would violate HIPAA compliance.
- Do not log personally identifiable information (PII). This means we should NOT log information like SSNs, first and last names, email IDs, phone numbers, and addresses. Doing so would also violate HIPAA compliance.
- Do not log financial information. This means we should NOT log credit or debit card numbers, bank account numbers, or bank routing numbers.
- Do not log security data. This means do not log passwords, auth tokens, security tokens, API keys, connection strings, and environment settings (a redaction sketch follows this list).
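As a safety net on top of code review, a hedged sketch of a logging filter that masks values assumed to be sensitive; the patterns below are illustrative, not an exhaustive or production-ready list:

```python
import logging
import re

# Illustrative patterns only; a real deployment needs a reviewed, exhaustive list.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # SSN-like values
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),            # card-number-like values
    re.compile(r"(password|token|api[_-]?key)=\S+", re.IGNORECASE),
]

class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern in SENSITIVE_PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, None  # replace the already-formatted message
        return True                              # keep the (now redacted) record

handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("somehealth").info("login attempt password=hunter2")
# -> login attempt [REDACTED]
```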
Logging to a database.
This one is a debatable topic; the general recommendation would be to write to files. Database logging is used for the following reasons:
- It is very easy to set up, log, and query in the database.
- Logging becomes structured thanks to the database schema.
- It is very easy to query and to extend search by enabling full-text search.
Things to consider when deciding to log to a database
- The additional network hop will add to latency.
- The chance of errors increases, in the form of database connection failures, schema errors, query errors, and network issues.
- You need to set up a fallback to file logging in case logging to the database fails (a sketch follows this list).
- Avoid fire-and-forget async calls for logging, as you might then not be able to fall back to a file during failures.
- You need to set up an additional process to transfer those file logs and load them into the database.
- Avoid mixing application data and log data in the same database.
- Take into account the product needs, scale, and cost.
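If you do decide to log to a database, here is a hedged sketch of the fallback idea from the list above: a synchronous handler that tries the database first and falls back to a file handler when the write fails. SQLite and the file paths are stand-ins for a real database and location:

```python
import logging
import sqlite3

class DatabaseHandler(logging.Handler):
    """Write log records to a database table; fall back to a file when that fails."""

    def __init__(self, db_path: str, fallback_path: str) -> None:
        super().__init__()
        self.db_path = db_path
        self.fallback = logging.FileHandler(fallback_path)
        with sqlite3.connect(self.db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS logs (created REAL, level TEXT, logger TEXT, message TEXT)"
            )

    def emit(self, record: logging.LogRecord) -> None:
        try:
            # Synchronous write: with a fire-and-forget async call we could not
            # detect the failure here and fall back.
            with sqlite3.connect(self.db_path) as conn:
                conn.execute(
                    "INSERT INTO logs VALUES (?, ?, ?, ?)",
                    (record.created, record.levelname, record.name, record.getMessage()),
                )
        except sqlite3.Error:
            self.fallback.emit(record)  # the log line is not lost, only demoted to a file

logger = logging.getLogger("somehealth.someservice")
logger.addHandler(DatabaseHandler("logs.db", "fallback.log"))
logger.setLevel(logging.INFO)
logger.info("Order 1042 placed")
```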
Logging challenges in distributed systems or microservices.
- Mismatch in logging formats. With microservices you have multiple services, each generating its own logs. You will have to look through the logs of all these different services, and if the formats differ it becomes difficult to troubleshoot or to employ a single unifying tool to view these logs. A great deal of pain can be averted by using the same style and formatting for logging across the different services.
- Stick to the same logging library. Try to use the same library for logging across your different services; if the services are built on the same platform, use the same logging library for all of them.
- How do we access and view multiple log files? When dealing with multiple services hosted on multiple hosts, it is not practical to go through the logs by connecting to each host and opening its log files. This is when you want to think about log collection: ship your logs to a central location and use some log consolidation and a log viewer to go through them.
- How do we consolidate and order logs across multiple log files? Most loggers record a timestamp for each log line. Make sure that all the hosts running the services are date-time synced. With date-time syncing, timestamped logs, and log consolidation, we should be able to trace logs across multiple log files.
- How do we relate logs to requests flowing through different services? To solve this problem we introduce correlation IDs or request IDs. These are IDs created at the very first entry point, normally a GUID or some other form of unique ID. Once created, the ID is passed down to all subsequent services or systems so that we can trace how the request has flowed through the various systems and relate it to the logs. Make sure that this request ID or correlation ID is recorded in error logs as part of the context (a sketch follows this list).
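A hedged sketch of the correlation ID idea within one Python service: a contextvars variable holds the ID for the current request and a logging filter stamps it onto every line. How the ID is received from and forwarded to other services, for example via an X-Request-ID header, depends on your framework and is not shown here:

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()  # expose the ID to the formatter
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(name)s - %(message)s")
)
logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger("somehealth.someservice")

def handle_request(incoming_request_id: Optional[str] = None) -> None:
    # Reuse the ID created at the first entry point, or create one if we are that entry point.
    request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    logger.info("Scheduling biometric appointment")  # every line now carries the correlation ID

handle_request()                         # first entry point: a new ID is generated
handle_request("id-from-upstream-call")  # downstream service: the upstream ID is reused
```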
Log Collection
Log collection becomes important if we want to analyze the data that has been logged on the server. It is not good practice to log on to the server machines and look into the log files directly, and it becomes impractical when dealing with multiple hosts and services, as we just discussed above. Copying log files over manually to look into them is also not the right approach. We should plan on using proven log collection technologies like:
- Logstash - If you are using the ELK or Elastic Stack (with Beats), this becomes the de facto log collection option. Logstash, though, is much more than just a log collection tool.
- Graylog - One of the leading names in the industry, it offers much more than just log collection. It is a centralised logging system with visualization capabilities.
- Fluentd - It can pick up data from any production environment and push it to any analytics platform.
- Flume - An Apache open source project; consider this if you are dealing with very large data sets to be pushed into Hadoop.
Log Aggregation
Centralised log aggregation is the process of bringing all logs together in one place. This is slightly different from just collecting logs: here we are dealing with the challenges of volume and variety. As we come across large volumes of log files in varied formats, we need to aggregate the logs we have collected. There is an overlap between log collection and log aggregation on the tools side. For example, Logstash does both collection and aggregation in ELK, but when you move to the Elastic Stack it uses Beats to collect logs and then Logstash to aggregate or transform the data to feed into Elasticsearch. Graylog does it all: collection, aggregation, and visualization.
Log Visualization
It is not enough to collect and aggregate logs; we also need to provide some form of visualization. We should have the ability to analyse and search through these logs and check the details of the errors logged. This is where visualization tools come into play. In the ELK and Elastic Stack, Kibana provides visualization, search, and analysis capabilities on the logs collected and indexed in Elasticsearch. Graylog also provides visualization and analysis capabilities.
Storage and Retention
The last thing you need to worry about is where, and for how long, you will store these logs. Usually you need access to these logs for two purposes.
- Team Needs - The immediate need is for teams to troubleshoot issues recorded in the logs. In this case we do not need to keep all of the historical logs; a short duration of, say, a month or a few months can suffice.
- Audit Needs - These needs arise over a period of time, and audits usually end up asking for logs from much older periods. The industry standard would be to keep logs for a year.
So, depending on the primary need, you can arrive at a reasonable duration for which the logs should be retained.
The second aspect, the storage medium, again depends on the two needs above. For team needs we would store logs in a centrally accessible location with quick access and the ability to analyse and search through them. For audit needs you might want to archive them to a tape drive, as longer retrieval times are acceptable.
Remember to set up some form of clean-up task on the hosts where services generate physical log files, in case your log collection strategy only copies the log files out and does not move them permanently.
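A hedged sketch of such a clean-up task, assuming log collection has already shipped the files; the directory, file pattern, and 14-day window are placeholders to match your own setup (a cron-scheduled logrotate achieves the same thing):

```python
import pathlib
import time

LOG_DIR = pathlib.Path("/var/log/somehealth")  # assumed log directory
MAX_AGE_DAYS = 14                              # assumed local retention window

def clean_old_logs() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
    for log_file in LOG_DIR.glob("*.log*"):
        # Delete files whose last modification is older than the window.
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()

if __name__ == "__main__":
    clean_old_logs()
```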
Some options to consider for Centralised Logging
- ELK Stack or Elastic Stack - Free and open source; provides log collection using Beats, aggregation using Logstash, indexing and persistence using Elasticsearch, and visualization using Kibana.
- Datadog - If you are operating at cloud scale, Datadog is a great choice, though it is very expensive. It offers a complete solution and is often the choice when working with AWS: log management, APM, security monitoring, and much more.
- Graylog - Good for enterprise use but expensive; it also has a free, open source, lightweight version.
- Splunk - Another enterprise-grade solution. It offers real-time visibility with infrastructure monitoring, application performance monitoring, and log investigation.