Dashboard
Recently, I ran into a situation where the dashboard page our team uses to monitor our system was not good enough for us to detect and react to abnormalities. We only found out that our system was broken several hours after the incident happened, and the scary part is that we only found out because another activity required us to look at the system as a mandatory first step.
Setup for failure
So what does the dashboard look like, you may ask? It is a generic one, with charts counting things over the last 24 hours and large tiles showing percentages. By looking at it, we can see what the system has achieved: the good part. This is an “interesting” design choice by the team. The dashboard shows us past events and metrics, but only for happy scenarios. It seems to me that it was designed to be looked at and to make us feel good, rather than to show us the system’s status and trends.
In my opinion, for monitoring system health, this is a setup for failure, because it gives us no signal when something goes wrong. Say we look at the charts and see low traffic during business hours: we can’t tell whether this is normal or not! Or we see a sudden spike or dip, and we still can’t confirm anything. The situation only gets worse once we bring the business perspective into the picture.
I think I need my own “good enough” starting point for a monitoring dashboard, as below…
- Monitor the system entry points where customers consume our services
- Monitor the health of internal sub-systems and solution components
- Monitor business logic
System entry points monitoring
This is extremely important! I set it as the top priority regardless of the channel. If your system can be consumed over HTTP, make sure that your server is running and healthy. We need to track our responses in the 4xx and 5xx ranges of the HTTP status codes. Any spike in those errors needs to be triaged and acted on as fast as possible, because it affects our customers directly. Depending on the traffic shape, we can define a sliding time window and apply a count-based metric so that we can react faster. A sensitive configuration produces more noisy alerts and wastes engineering effort, while a configuration tuned for fewer alerts means we react later, so we are trading effort against the impact of an incident.
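To make the sliding window idea concrete, here is a minimal sketch in Python. The class name `SlidingWindowErrorCounter` and the parameters `window_seconds` and `max_errors` are my own illustrative choices, not part of any monitoring product, and the values you pass in should come from your own traffic shape.

```python
from collections import deque


class SlidingWindowErrorCounter:
    """Counts 4xx/5xx responses seen within the last `window_seconds`."""

    def __init__(self, window_seconds, max_errors):
        self.window_seconds = window_seconds  # assumed window length
        self.max_errors = max_errors          # assumed alert threshold
        self._error_times = deque()           # timestamps of recent errors

    def record(self, status_code, now):
        """Record one response; return True when the alert should fire."""
        if 400 <= status_code <= 599:
            self._error_times.append(now)
        # Drop errors that have slid out of the window.
        cutoff = now - self.window_seconds
        while self._error_times and self._error_times[0] <= cutoff:
            self._error_times.popleft()
        return len(self._error_times) > self.max_errors


# Usage: alert when more than 3 errors occur within any 60 seconds.
counter = SlidingWindowErrorCounter(window_seconds=60, max_errors=3)
fired = [counter.record(500, now=t) for t in (0, 1, 2, 3)]
```

In this usage the first three errors stay under the threshold and the fourth one trips the alert. Lowering `max_errors` or widening the window makes the alert noisier, which is exactly the trade-off described above.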
If we don’t have any clue about the traffic shape, we can switch to a percentage-based metric. Let’s say we set an alert for a 5% error rate within 5 minutes.
The idea is to have starting values for our alerts and then tune them to a threshold at which the team can monitor and react to alerts. This must become routine, and we need to tune them again and again as the system load grows or shrinks.
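A percentage-based check could be sketched like this. The defaults mirror the 5% within 5 minutes example; the `min_requests` guard is an assumption of mine, added so that a single failure during a very quiet period does not count as a 100% error rate.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the share of 4xx/5xx responses in the window is too high."""

    def __init__(self, window_seconds=300.0, max_error_rate=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests  # assumed guard against tiny samples
        self._events = deque()            # (timestamp, is_error) pairs
        self._errors = 0

    def record(self, status_code, now):
        """Record one response; return True when the alert should fire."""
        is_error = 400 <= status_code <= 599
        self._events.append((now, is_error))
        self._errors += is_error
        # Forget responses older than the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= was_error
        total = len(self._events)
        if total < self.min_requests:
            return False
        return self._errors / total > self.max_error_rate


# Usage: 24 successes and 1 error is 4%, so no alert; a second error is about 7.7%.
alert = ErrorRateAlert()
for t in range(24):
    alert.record(200, now=t)
first = alert.record(500, now=24)
second = alert.record(503, now=25)
```

Because the rate is relative to traffic, the same settings keep working while the load grows or shrinks, which is why this is a reasonable starting point before the traffic shape is known.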
Monitor solution components
Once the entry points are well monitored, we need to care about infrastructure health. That means not only the application server itself but also every mandatory component of the solution: the database, the file server, the Kafka broker, and health checks for the external services our solution needs to consume (REST APIs, messages, etc.). These mandatory components are crucial for our solution to function well. Without them, our solution cannot complete its magic business logic; without them, our solution cannot fulfill its promise to our customers. Ask what your solution cannot work without on both counts, and you will have your starting list of things to monitor.
Most of the time, if we own the components, we care about their CPU and memory consumption. If we don’t own them, we need to check their health through an agreed, lightweight health poll. And last but not least, when neither option is available, we need to monitor the responses from the component: set a threshold to detect a bad trend and react to that event. In this case we need to engage the component owner for support, because without proper engagement it is very hard to trace and monitor incidents.
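As a rough sketch of the polling and threshold idea, the monitor below takes a probe function and fires after a number of consecutive failures. Everything here is hypothetical: `ComponentHealthMonitor`, the `probe` callable, and the default of three failures are illustrative, and in a real setup the probe would call whatever health endpoint you agreed on with the component owner.

```python
from typing import Callable


class ComponentHealthMonitor:
    """Polls a component and flags a bad trend after repeated failures."""

    def __init__(self, name: str, probe: Callable[[], bool],
                 max_consecutive_failures: int = 3) -> None:
        self.name = name
        self.probe = probe  # returns True when the component is healthy
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def poll_once(self) -> bool:
        """Run one health poll; return True when the alert should fire."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, refused connection) is unhealthy.
            healthy = False
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.max_consecutive_failures


# Usage with a fake probe that replays a fixed sequence of poll results.
results = iter([True, False, False, False, True])
monitor = ComponentHealthMonitor("database", probe=lambda: next(results))
alerts = [monitor.poll_once() for _ in range(5)]
```

Counting consecutive failures is only one way to define a bad trend. It tolerates a single blip but reacts once the component stays down, and the counter resets as soon as a poll succeeds.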
Monitor business logic
After monitoring entry points and solution component health, we also need to monitor our secret sauce: business logic. From this perspective, we care about what is going on in our system, both the good and the bad. Because we do business with customers, we have a firm set of rules for how business should work between our solution and them. Any error spike here almost certainly exposes a mismatch in the contract between us and a customer. We need to triage these case by case.
Remember, our business has strong contracts with its customers. Neither side can just fire unreasonable requests at the other, hence this type of monitoring requires deep knowledge of the solution. We need to decide where we want alerts and also have a plan to resolve them when they occur. Again, this requires engagement with the customer on a case-by-case basis. I will come back to this type of monitoring in part 2 with more detail.