Recently, I worked on the problem of how to calculate a microservice's response time as a baseline for a service level agreement (SLA) between the microservice and other teams. The microservice is very basic: a web application with an HTTP interface.
When a request arrives at the service, it goes through roughly these steps:
- It validates the request
- It runs the required business logic
- It persists data into a SQL database
- It publishes an event with the request processing outcome to notify upstream services
With a basic test setup, the team chose to measure the HTTP response time at the exposed endpoint. The result was pretty good: all p99 and p99.95 response times were far better than the service's requirements. But there is a catch: the team uses asynchronous code for the notification (the fourth step above).
When the team measures only the HTTP response time, we ignore the fact that, from a business perspective, our microservice's contract with the outside world is not only the HTTP request and response but also the outcome of the processing. Because the notification is asynchronous, the microservice returns to the client as soon as the business logic completes, which leaves the network cost of sending the Kafka message out of our calculation. During normal operation the difference is minor, given a well-configured Kafka producer (one that sends messages more frequently).
However, when Kafka is not stable, the Kafka producer's default retry mechanism is exposed. Requests made during this period go through multiple retries. The time between the caller sending an HTTP request and receiving the response from the service is still very good, but the caller does not see the Kafka message yet. So the gap between the HTTP request-response time and the time the message is actually sent can grow from a few hundred milliseconds to a few minutes.
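The gap between the two measurements can be reproduced with a small simulation. This is only a sketch, not the service's real code: the function `measure` and the delays are hypothetical, and a `time.sleep` stands in for the Kafka send and its retries.

```python
import threading
import time


def measure(send_delay_s):
    """Return (http_response_s, message_sent_s) for one simulated request."""
    timings = {}
    start = time.monotonic()

    def notify():
        # Stand-in for the Kafka send, including any retries (hypothetical delay).
        time.sleep(send_delay_s)
        timings["message_sent"] = time.monotonic() - start

    # Validation, business logic and persistence would run here.
    worker = threading.Thread(target=notify)
    worker.start()  # the notification continues in the background
    timings["http_response"] = time.monotonic() - start  # the client gets its reply now
    worker.join()  # wait only so the sketch can report both numbers
    return timings["http_response"], timings["message_sent"]


for label, delay in [("healthy", 0.02), ("retrying", 0.3)]:
    http_s, message_s = measure(delay)
    print(f"{label}: http={http_s * 1000:.1f} ms, message={message_s * 1000:.1f} ms")
```

In both runs the HTTP number stays tiny, while the message number follows the simulated send delay. That is the part an endpoint-only measurement never sees.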
This hidden asynchronous mechanism breaks the SLA when Kafka is unhealthy, which can happen for a number of reasons, such as:
- brokers have internal problems
- network problems between the microservice and Kafka
- planned maintenance for Kafka (patching, upgrades, etc.)
This led the team to rework the microservice's test code. The right way is to measure the response time of the whole logic: from the moment the HTTP request is received to the timestamp of the Kafka message.
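A minimal sketch of that end-to-end measurement follows. It assumes the test harness can record, per request id, the time the HTTP request was received and the timestamp of the matching Kafka message, both in epoch milliseconds from synchronized clocks. The function names and the sample numbers are hypothetical, and the percentile uses the simple nearest-rank method. Note that whether a Kafka message timestamp reflects creation time or log append time depends on the topic's `message.timestamp.type` setting.

```python
import math


def end_to_end_latencies_ms(request_received_ms, message_timestamp_ms):
    """Latency per request id: Kafka message timestamp minus request receipt time."""
    return {
        request_id: message_timestamp_ms[request_id] - received
        for request_id, received in request_received_ms.items()
        if request_id in message_timestamp_ms  # skip requests with no message seen yet
    }


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up sample: request "c" was caught in a period of producer retries.
received = {"a": 1_000, "b": 2_000, "c": 3_000}
sent = {"a": 1_040, "b": 2_055, "c": 93_000}

latencies = end_to_end_latencies_ms(received, sent)
print(latencies)
print("p50:", percentile(list(latencies.values()), 50), "ms")
print("p99:", percentile(list(latencies.values()), 99), "ms")
```

With real data, requests that never produce a message should probably be reported separately rather than silently skipped, since they are SLA violations too.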