
Streamline latency attribution via service dashboards

This came out of a conversation I had with @sarahwalker about https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18634.

Background

During that incident, it took quite a while to attribute the source of the issue to a badly performing Gitaly node.

One approach I follow for latency attribution involves several steps that I will outline here.

  1. Go to the service dashboard, in this case web on the cny stage.
  2. Find the rails_requests SLI and click the "Kibana: Rails sum latency aggregated" link.
  3. This brings us to Kibana, which shows the overall latency impact.
  4. Add additional y-axes with the sum of the sub-components db_duration_s, redis_duration_s, and gitaly_duration_s, which are all subsets of duration_s. We can also add cpu_s, though that one is a bit less intuitive (a sketch of the equivalent query follows this list).
  5. We now have a nice latency breakdown and can see that in this case, Gitaly was the main contributor to the overall duration.
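
The breakdown in step 4 boils down to a date histogram with one sum aggregation per duration field. Below is a minimal Python sketch of the equivalent Elasticsearch query; the index pattern, the json. field prefix, and the timestamp field name are assumptions and may not match the actual Rails log indices in our Kibana.

```python
import json

# Duration fields from the Rails request logs; db/redis/gitaly are
# subsets of the overall duration_s, and cpu_s is tracked separately.
DURATION_FIELDS = [
    "duration_s",
    "db_duration_s",
    "redis_duration_s",
    "gitaly_duration_s",
    "cpu_s",
]

# Assumed index pattern and field prefix -- adjust to the real
# Rails log index and mapping used in Kibana.
INDEX_PATTERN = "pubsub-rails-inf-gprd-*"
FIELD_PREFIX = "json."
TIME_FIELD = FIELD_PREFIX + "time"  # assumed timestamp field


def latency_breakdown_query(interval="1m"):
    """Build a search body: sum of each duration field per time bucket."""
    return {
        "size": 0,
        # In practice a time range filter (and e.g. a stage filter)
        # would be added here.
        "aggs": {
            "per_interval": {
                "date_histogram": {
                    "field": TIME_FIELD,
                    "fixed_interval": interval,
                },
                "aggs": {
                    field: {"sum": {"field": FIELD_PREFIX + field}}
                    for field in DURATION_FIELDS
                },
            }
        },
    }


if __name__ == "__main__":
    # Paste the output into Kibana Dev Tools as the body of a search
    # against INDEX_PATTERN.
    print(json.dumps(latency_breakdown_query(), indent=2))
```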

Having a well-documented guide for this method would be useful, but it still requires too many clicks; having something more readily available would go a long way.

Proposal

We can streamline this by adding two items to the service dashboards.

  1. Add a Kibana link to SLIs that already has the additional latency sources pre-filled as y-axes.
  2. Add a panel that includes the metrics-based version of this same Kibana query, with the Kibana drill-down linked (see the sketch after this list).
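
As a rough illustration of item 2, the panel would carry one time series per latency source, with the pre-filled Kibana link from item 1 attached as the drill-down. The sketch below is a minimal Python mock-up of what such panel targets could look like; the metric names and label selectors are placeholders, not confirmed metrics, and the real panel would presumably be defined through the usual dashboard tooling rather than generated like this.

```python
# Placeholder metric names -- swap in whatever the Rails latency
# sub-components are actually exported as.
SOURCES = {
    "total":  "rails_total_duration_seconds_sum",   # placeholder
    "db":     "rails_db_duration_seconds_sum",      # placeholder
    "redis":  "rails_redis_duration_seconds_sum",   # placeholder
    "gitaly": "rails_gitaly_duration_seconds_sum",  # placeholder
}


def panel_targets(env="gprd", stage="cny", window="5m"):
    """One PromQL expression per latency source, summed over the fleet."""
    return [
        {
            "legendFormat": name,
            "expr": f'sum(rate({metric}{{env="{env}",stage="{stage}"}}[{window}]))',
        }
        for name, metric in SOURCES.items()
    ]


for target in panel_targets():
    print(target["legendFormat"], "=>", target["expr"])
```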

This can go a long way towards making troubleshooting gnarly apdex issues easier.