Demonstration of Observability in Chaos
Aravind begins the demonstration by showing a Slack notification titled "High Mean Transaction Duration for the Advert Service", explaining that it is an automated alert generated for one of the deployed application microservices. He then walks through the example: an e-commerce web application of which the "Advert Service" microservice is a part. Across the website, users can browse the different products on sale, read the description of any particular product, add one or more products to the cart, and finally place an order. The entire application is built on a microservice architecture, as seen on an Elastic Kibana APM dashboard.
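The kind of alert shown can be sketched as a small payload builder for a Slack incoming webhook. This is a hypothetical illustration, not the demo's actual configuration: the `build_slack_alert` function, the metric values, and the threshold are all assumptions made for the example.

```python
# Hypothetical sketch of how an automated latency alert, like the one shown
# in the demo, might format a Slack webhook message. The service name,
# observed value, and threshold below are illustrative assumptions.

def build_slack_alert(service: str, metric: str, observed_ms: float,
                      threshold_ms: float) -> dict:
    """Build a Slack incoming-webhook payload for a latency alert."""
    return {
        "text": (
            f"High {metric} for the {service}: "
            f"{observed_ms:.0f} ms observed (threshold {threshold_ms:.0f} ms)"
        )
    }

# In a real setup this dict would be POSTed to a Slack incoming-webhook URL.
payload = build_slack_alert("Advert Service", "Mean Transaction Duration",
                            observed_ms=1450, threshold_ms=500)
print(payload["text"])
```

In practice the monitoring system fills in the observed value and threshold from its rule configuration; here they are hard-coded only to show the resulting message shape.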
The architecture consists of a front-end layer connected to a Node.js application layer that exposes the APIs, such as the Java advert service, the Golang checkout service, and the Python recommendation service, which in turn connect to further backing services. Aravind explains that every large enterprise seeks to perform a scale test before targeted campaigns or events, which draw many potential customers and stress the application infrastructure.
Aravind then drills into the details of the advertService, where one can observe the service's anomaly score, its anomaly timeline, and several other observability metrics that can be configured to drive different alerts. He then opens the transactions view for the advertService to diagnose the particular anomaly, where one can see the methods being called by the service along with their timelines and other relevant details.
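The anomaly score over the transaction timeline can be illustrated with a much simpler stand-in than Elastic's machine-learning jobs: a rolling z-score that flags points far from the recent mean. This is a minimal sketch of the idea only; the sample durations and the window size are assumptions, not data from the demo.

```python
# Minimal sketch of scoring anomalies on a transaction-duration timeline.
# Elastic's ML-based anomaly detection is far more sophisticated; this
# rolling z-score only illustrates the underlying idea.
from statistics import mean, stdev

def anomaly_scores(durations_ms, window=5):
    """Score each point against the mean/stdev of the preceding window."""
    scores = []
    for i, value in enumerate(durations_ms):
        history = durations_ms[max(0, i - window):i]
        if len(history) < 2:
            scores.append(0.0)  # not enough history to score yet
            continue
        mu, sigma = mean(history), stdev(history)
        scores.append(abs(value - mu) / sigma if sigma else 0.0)
    return scores

timeline = [102, 98, 105, 101, 99, 100, 480]  # sudden latency spike at the end
scores = anomaly_scores(timeline)
print(scores[-1] > 3)  # the spike scores far above normal variation
```

A real detector would also model seasonality and multi-bucket effects; the point here is only that a score quantifies how unusual each point on the timeline is, which is what the alert thresholds are configured against.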
The APM services dashboard also lets us analyze a specific service's error rates and execution durations, to better understand the error in the system, the performance impact of a specific load placed on the system, and how the system is responding overall. It also gives access to application-specific logs and runtime-specific metrics, for example JVM metrics such as CPU usage, system memory usage, heap memory usage, garbage collection rate, and so on.
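A simple way to see why heap-usage metrics like these matter for diagnosis is to check what fraction of sampled values sit above a pressure threshold. The samples, the 1024 MB maximum heap, and the 80% threshold below are illustrative assumptions, not figures from the demo.

```python
# Hypothetical sketch of evaluating JVM heap-usage samples such as those
# shown on the APM metrics tab. Sample values and the 80% threshold are
# illustrative assumptions.

def heap_pressure(samples_mb, max_heap_mb, threshold=0.8):
    """Return the fraction of samples where heap usage exceeds the threshold."""
    over = [s for s in samples_mb if s / max_heap_mb > threshold]
    return len(over) / len(samples_mb)

heap_used_mb = [512, 530, 610, 850, 910, 930]  # heap usage sampled over time
pressure = heap_pressure(heap_used_mb, max_heap_mb=1024)
print(f"{pressure:.0%} of samples above 80% heap")
```

Sustained pressure like this, paired with a rising garbage-collection rate, is the kind of signal that points from a latency anomaly toward a memory problem in the service.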
Another approach to resolving the system anomaly is to use the Logs, Metrics, and APM dashboards, where one can check exactly when the application restarted. Similarly, multiple metrics can be observed to spot any anomaly, and deployment-specific errors can be inspected to understand its nature. Based on this information, the application logs can be examined for the relevant time period to better understand how the application executed. An alert can then be attached to this log stream using a set of desired conditions.
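The kind of condition attached to a log stream can be sketched as a count-over-window rule: fire when more than a set number of matching lines appear. The `ERROR` pattern, the limit, and the sample log lines are assumptions for illustration, not the demo's configuration.

```python
# Sketch of a log-stream alert condition: fire when more than `limit`
# lines in the evaluation window match `pattern`. Field values below
# are made up for illustration.

def should_alert(log_lines, pattern="ERROR", limit=3):
    """True when the window contains more than `limit` matching lines."""
    matches = [line for line in log_lines if pattern in line]
    return len(matches) > limit

window = [
    "INFO  request served in 40ms",
    "ERROR connection refused to ads backend",
    "ERROR connection refused to ads backend",
    "ERROR timeout contacting ads backend",
    "ERROR connection refused to ads backend",
]
print(should_alert(window))  # four ERROR lines exceed the limit of 3
```

Real log alerting rules typically add a time window and a notification channel on top of this count condition, but the threshold logic is the same.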
Apart from the application viewpoint, one can also approach system anomalies from the infrastructure point of view, inspecting infrastructure-specific details such as the running Kubernetes pods or Docker containers and analyzing metrics like memory and CPU usage. It is also possible to check the uptime of different services, verifying their ping, availability, and status to determine their performance across different time durations, geo-locations, and so on. This concludes incident analysis from the observability point of view: how to search for changes in the system and compare them.
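The availability figure behind such an uptime view is just the fraction of successful pings over a period. A minimal sketch, with made-up service names and ping results rather than real monitoring data:

```python
# Sketch of computing service availability from uptime ping results,
# similar to the status view described above. Services and results
# below are illustrative, not real monitoring data.

def availability(pings):
    """Fraction of successful pings (True = up)."""
    return sum(pings) / len(pings)

results = {
    "advertService":   [True, True, True, False, True],
    "checkoutService": [True, True, True, True, True],
}
for service, pings in results.items():
    print(f"{service}: {availability(pings):.0%} available")
```

Grouping the same computation by probe location or time bucket gives the per-geo-location and per-duration breakdowns mentioned above.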