Streamline latency attribution via service dashboards
This came out of a conversation I had with @sarahwalker about https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18634.
## Background
During that incident, it took quite a while to attribute the issue to a badly performing Gitaly node.
One approach I follow for latency attribution involves several steps that I will outline here.
- Go to the service dashboard, in this case `web` on the `cny` stage.
- Find the `rails_requests` SLI and click the "Kibana: Rails sum latency aggregated" link.
- This brings us to Kibana and shows the overall latency impact.
- Add additional y-axes with `sum` of the sub-components `db_duration_s`, `redis_duration_s`, and `gitaly_duration_s`, which are all subsets of `duration_s`. We can also add `cpu_s`, though that one is a bit less intuitive.
- We now have a clear latency breakdown and can see that, in this case, Gitaly was the main contributor to overall duration.
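The steps above boil down to one Kibana visualization, which can be sketched as an Elasticsearch aggregation over the Rails request logs. The sub-component field names come from the list above; the index pattern, the `json.` field prefix, and the one-minute bucket interval are assumptions for illustration, not the exact production setup.

```python
# Sketch of the aggregation behind the Kibana visualization: per-minute
# sums of overall request duration and each sub-component duration.
components = ["duration_s", "db_duration_s", "redis_duration_s",
              "gitaly_duration_s", "cpu_s"]

query = {
    "size": 0,  # we only want aggregation buckets, not individual log documents
    "aggs": {
        "per_minute": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
            # One sum sub-aggregation (y-axis) per duration component.
            "aggs": {c: {"sum": {"field": f"json.{c}"}} for c in components},
        }
    },
}
```

Since `db_duration_s`, `redis_duration_s`, and `gitaly_duration_s` are subsets of `duration_s`, plotting them together shows at a glance which component dominates a latency spike.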
Having a well-documented guide for this method would be useful, but the method still requires too many clicks; having something more readily available would go a long way.
## Proposal
We can streamline this by adding two items to the service dashboards.
- Add a Kibana link to SLIs that already has the additional latency sources pre-filled as y-axes.
- Add a panel that includes the metrics-based version of this same Kibana query, with the Kibana drill-down linked.
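The second proposal item could look something like the sketch below: a panel that stacks one query per sub-component and carries the pre-filled Kibana link in its description. The PromQL metric names are placeholders (the real ones would come from whatever recording rules the dashboards use), and `kibana_link` stands in for the short URL produced by the first proposal item.

```python
# Sketch of a dashboard panel pairing the metrics-based breakdown with the
# Kibana drill-down. Metric names and label selectors are placeholders.
def latency_breakdown_panel(kibana_link):
    components = ["db", "redis", "gitaly"]
    return {
        "title": "Rails request latency by sub-component",
        "description": f"Kibana drill-down: {kibana_link}",
        "targets": [
            {
                "legendFormat": component,
                # Placeholder PromQL: rate of the summed per-component duration.
                "expr": (
                    f"sum(rate(rails_{component}_duration_seconds_sum"
                    '{env="gprd", stage="cny"}[5m]))'
                ),
            }
            for component in components
        ],
    }
```

The point of keeping both in one panel is that the metrics view gives the at-a-glance attribution, while the linked Kibana query is already set up for the deeper per-request drill-down.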
This would make troubleshooting gnarly apdex issues considerably easier.