Investigate and optimize slow Prometheus queries

Started originally in !37924 (merged):

  • @nolith started a discussion:

    Thanks @mkaeppler 💚

    I'm leaving a comment to make sure we have a follow-up to evaluate the benefits of this change and consider lowering those numbers again.

    I do agree with you, those numbers are really generous.

After using the stricter timeouts in PrometheusClient for a while, other issues cropped up, and we have since moved those default timeouts further down the stack, into Gitlab::HTTP. It appears, however, that PrometheusService and Clusters::Applications::Prometheus, which sit on top of PrometheusClient and are used e.g. by GitLab Self-Monitoring, are still experiencing timeouts, so I have rolled back the feature flag for now.
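For context, the general pattern is to define conservative default timeouts once, at the shared HTTP layer, so every caller inherits them unless it explicitly overrides. Below is a minimal sketch of that idea using plain Net::HTTP; the module name and the defaults are made up for illustration and are not the actual Gitlab::HTTP implementation.

```ruby
require 'net/http'

# Illustrative only: names and defaults are hypothetical, not Gitlab::HTTP.
module IllustrativeHTTP
  # Conservative defaults applied to every request unless overridden.
  DEFAULT_TIMEOUTS = { open_timeout: 10, read_timeout: 20 }.freeze

  def self.get(url, options = {})
    opts = DEFAULT_TIMEOUTS.merge(options)
    uri  = URI.parse(url)

    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: opts[:open_timeout],
                    read_timeout: opts[:read_timeout]) do |http|
      http.request(Net::HTTP::Get.new(uri))
    end
  end
end

# Callers inherit the defaults; a stricter caller can still pass its own.
IllustrativeHTTP.get('http://localhost:9090/-/healthy', read_timeout: 2)
```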

A common offender appears to be DeploymentQuery, as seen here: https://sentry.gitlab.net/gitlab/gitlabcom/issues/1748090/?query=is%3Aunresolved%20ReactiveCachingWorker

But there could be others, especially since a lot of application code uses try_get, a non-throwing variant of the HTTP GET call that rescues errors, logs them, and returns nil instead of raising. I have already extended our logging logic to capture all occurrences, regardless of which variant is used.
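To make that behaviour concrete, here is a hedged sketch of the non-throwing pattern described above; the method names, rescued error classes, and logger are illustrative and not the real PrometheusClient code.

```ruby
require 'net/http'
require 'logger'

LOGGER = Logger.new($stdout)

# Throwing variant: timeouts and connection errors propagate to the caller.
def get(url, open_timeout: 10, read_timeout: 20)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: open_timeout,
                  read_timeout: read_timeout) do |http|
    http.request(Net::HTTP::Get.new(uri))
  end
end

# Non-throwing variant: the same call, but errors are rescued, logged and
# turned into nil, so callers see a missing result rather than an exception.
def try_get(url, **opts)
  get(url, **opts)
rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError => e
  LOGGER.error("GET failed: #{e.class}: #{e.message} url=#{url}")
  nil
end
```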

I think we should have the respective teams investigate why these queries time out so often, and improve them so that they complete in about a second or less.
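As a starting point for that investigation, a team could reproduce a suspect query directly against the Prometheus HTTP API and time it. The sketch below uses a placeholder PromQL expression and a local Prometheus address, not the actual expression issued by DeploymentQuery.

```ruby
require 'net/http'
require 'json'

# Placeholder address and PromQL -- substitute the real instance and the
# expression behind the slow query.
prometheus = URI('http://localhost:9090/api/v1/query')
prometheus.query = URI.encode_www_form(
  'query'   => 'sum(rate(http_requests_total[5m]))',
  'timeout' => '2s' # server-side evaluation timeout supported by the query API
)

started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
response = Net::HTTP.get_response(prometheus)
elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format('HTTP %s in %.2fs', response.code, elapsed)
puts JSON.parse(response.body)['status'] if response.is_a?(Net::HTTPSuccess)
```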
