Microservice Observability - Logging
Making sense of everything happening in your system through its logs
Logging is one of the most important parts of software systems. Whether you have just started working on a new piece of software, or your system is running in a large scale production environment, you’ll always find yourself seeking help from log files. Logs are the first thing developers look for when something goes wrong, or something doesn’t work as expected.
Logging the right information in the right way makes the life of developers so much easier. Two things that developers need to be aware of to get better at logging are what and how to log. In this article, we’ll take a look at some of the basic logging practices that can get us the best out of our logs.
The What and How of Logging
Let’s take the example of an e-commerce system and take a look at logging in Order Management Service (OMS).
Suppose a customer order fails due to an error from Inventory Management Service (IMS), a downstream service that OMS uses to verify the available inventory.
Let’s assume that OMS has already accepted an order, but during the final verification step, IMS returns the following error because the product is no longer available in the inventory.
404 Product Not Available
What to Log
Normally, you would log the error in this way
log.error("Exception in fetching product information - {}", ex.getResponseBodyAsString())
This will output a log in the following format
[2020-09-27T18:54:41,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in fetching product information - Product Not Available
Well, there isn’t much information available in this log statement, is there? A log like this doesn’t serve much purpose because it lacks any contextual information about the error.
Can we add more information to this log to make it more relevant for debugging? How about adding the Order Id and Product Id?
log.error("Exception in processing Order #{} for Product #{} due to exception - {}", orderId, productId, ex.getResponseBodyAsString())
This will output a log in the following format
[2020-09-27T18:54:41,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing Order #182726 for Product #21 due to exception - Product Not Available
Now this makes a lot more sense! Looking at the logs, we can understand that an error occurred while processing Order #182726 because Product #21 was not available.
How to Log
While the above log makes perfect sense to us humans, it may not be the best format for machines. Let’s take an example to understand why.
Suppose there is some issue in the availability of a certain product (say Product #21) due to which all orders containing that product are failing. You have been assigned the task to find all the failed orders for this product.
You happily do a grep for Product #21 in your logs and excitedly wait for the results. When the search completes, you get something like this
[2020-09-27T18:54:41,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing Order #182726 for Product #21 due to exception - Product Not Available
[2020-09-27T18:53:29,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing Order #972526 for Product #217 due to exception - Product Not Available
[2020-09-27T18:52:34,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing Order #46675754 for Product #21 due to exception - Product Not Available
[2020-09-27T18:52:13,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing Order #332254 for Product #2109 due to exception - Product Not Available
Not quite what we were expecting, right? Notice that the search also matched Product #217 and Product #2109, because a plain text search cannot tell where the product id ends. So how can we improve this? Structured logging to the rescue.
Structured Logging
Structured logging solves these common problems and allows log analysis tools to provide additional capabilities. Logs written in a structured format are more machine-friendly, i.e. they can be easily parsed by a machine. This can be helpful in the following scenarios:
- Developers can search logs and correlate events, which is invaluable both during development as well as for troubleshooting production issues.
- Business teams can parse these logs and perform analysis over certain fields (e.g. unique product count per day) by extracting and summarising these fields.
- You can build dashboards (both business and technical) by parsing the logs and performing aggregates over relevant fields.
Let’s use our earlier log statement and make a small change to make it structured.
log.error("Exception in processing OrderId={} for ProductId={} due to Error={}", orderId, productId, ex.getResponseBodyAsString())
This will output a log in the following format
[2020-09-27T18:54:41,500+0530]-[ERROR]-[InventoryValidator]-[13] Exception in processing OrderId=182726 for ProductId=21 due to Error=Product Not Available
Now this log message can be easily parsed by a machine by using “=” as a delimiter to extract the OrderId, ProductId, and Error fields. We can now do an exact search over ProductId=21 to accomplish our task.
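As a rough illustration of how simple this parsing is (the class and method names below are my own, not part of any library), a key=value line could be broken into fields like this:

```java
import java.util.HashMap;
import java.util.Map;

public class LogFieldParser {

    // Extracts key=value pairs from a log line by splitting on whitespace.
    // Tokens without "=" (the free-text part of the message) are skipped.
    // Note: values containing spaces (like Error=Product Not Available)
    // would need quoting or a proper log pipeline; this sketch only
    // handles single-token values.
    public static Map<String, String> parseFields(String logLine) {
        Map<String, String> fields = new HashMap<>();
        for (String token : logLine.split("\\s+")) {
            int idx = token.indexOf('=');
            if (idx > 0) {
                fields.put(token.substring(0, idx), token.substring(idx + 1));
            }
        }
        return fields;
    }
}
```

With the fields extracted as a map, an exact match on `ProductId` no longer confuses `21` with `217` or `2109`.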
This also allows us to perform more advanced analytics on the logs, such as preparing a report of all the orders that failed with this error.
If you use a log management system like Splunk, the query Error="Product Not Available" | stats count by ProductId can now produce the following result
+-----------+-------+
| ProductId | count |
+-----------+-------+
| 21 | 5 |
| 27 | 12 |
+-----------+-------+
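Even without Splunk, the same aggregation is straightforward to reproduce once the fields are extractable. A minimal sketch (the class name and regex-based extraction are illustrative, not a prescribed approach):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailureCounter {

    // Matches the full numeric id after "ProductId=", so ProductId=21
    // is never confused with ProductId=217.
    private static final Pattern PRODUCT_ID = Pattern.compile("ProductId=(\\d+)");

    // Counts occurrences of each ProductId across the given log lines,
    // mimicking Splunk's "stats count by ProductId".
    public static Map<String, Integer> countByProductId(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = PRODUCT_ID.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```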
We could also use a JSON layout to print our logs in the JSON format
{
"timestamp":"2020-09-27T18:54:41,500+0530",
"level":"ERROR",
"class":"InventoryValidator",
"line":"13",
"OrderId":"182726",
"ProductId":"21",
"Error":"Product Not Available"
}
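In practice, you would usually configure a JSON layout in your logging framework (Logback and Log4j2 both support JSON output) rather than build the JSON by hand, but a hand-rolled sketch shows the idea. Everything here, including the class name and field order, is illustrative, and a real implementation would use a JSON library such as Jackson:

```java
public class JsonLogLine {

    // Minimal JSON escaping for backslashes and quotes only;
    // a JSON library would handle all control characters properly.
    private static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Renders the structured fields from our example as one JSON log line.
    public static String format(String level, String clazz, String orderId,
                                String productId, String error) {
        return String.format(
            "{\"level\":\"%s\",\"class\":\"%s\",\"OrderId\":\"%s\","
          + "\"ProductId\":\"%s\",\"Error\":\"%s\"}",
            esc(level), esc(clazz), esc(orderId), esc(productId), esc(error));
    }
}
```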
It’s important to understand the approach behind structured logging. There is no fixed standard, and it can be done in many different ways. The biggest challenge in structured logging, however, is establishing a standard set of fields for your software. This can be achieved by following a custom logging model or a centralised logging library, which ensures that all developers use the same fields in their log messages.
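One way to enforce such a standard set of fields is a thin builder that every team uses instead of hand-writing keys. The shape below is just an illustration of the idea, not a prescribed API:

```java
public class OrderLogFields {

    private final StringBuilder sb = new StringBuilder();

    // Appends a field in the agreed KeyName=value format.
    private OrderLogFields field(String key, Object value) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(key).append('=').append(value);
        return this;
    }

    // Standard field names agreed across services, so every log line
    // uses exactly the same keys, spelled exactly the same way.
    public OrderLogFields orderId(long id)      { return field("OrderId", id); }
    public OrderLogFields productId(long id)    { return field("ProductId", id); }
    public OrderLogFields error(String message) { return field("Error", message); }

    @Override
    public String toString() { return sb.toString(); }
}
```

A developer would then write something like `log.error("Exception in processing {}", new OrderLogFields().orderId(182726).productId(21).error(msg))`, and misspelled or inconsistent field names simply cannot occur.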
Conclusion
In this article, we saw the pitfalls of unstructured logging and the benefits offered by structured logging. Log management systems such as Splunk benefit hugely from well-structured log messages and can offer easy search and analytics on log events.
In the next article of my Microservice Observability Series, we will cover more about Monitoring, the second pillar of observability.
Thank you for staying with me so far. Hope you liked the article. You can connect with me on LinkedIn where I regularly discuss technology and life. Also take a look at some of my other articles and my YouTube channel. Happy reading. 🙂