Skip subsequent topology Prometheus queries if a timeout occurs

Qingyu Zhao requested to merge 232786-respect-stop-query-failures into master

What does this MR do?

The "Query duration_s" dashboard shows some very long duration_s values (540 seconds).

These long-duration failures are:

  • Net::ReadTimeout: 540 seconds (60 seconds/query * 9 queries)
  • Net::OpenTimeout: 540 seconds (60 seconds/query * 9 queries)
  • Errno::ETIMEDOUT: 283 seconds (31.5 seconds/query * 9 queries)

It seems that once one of these failures occurs, all subsequent queries will likely fail for the same reason.

MR !37924 (merged) reduces these totals to:

  • when Net::OpenTimeout: 45 seconds (5 seconds/query * 9 queries)
  • when Net::ReadTimeout: 90 seconds (10 seconds/query * 9 queries)
  • when Errno::ETIMEDOUT: no more than 45 seconds.
    • Less than 45 seconds if Errno::ETIMEDOUT is raised sooner than the Net::OpenTimeout limit (5 seconds).
    • 45 seconds if Errno::ETIMEDOUT would take as long as or longer than the Net::OpenTimeout limit (5 seconds).
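The totals above are all a per-query timeout multiplied by the 9 queries. A minimal sketch of that arithmetic follows. The method names are hypothetical and exist only for illustration, and the Errno::ETIMEDOUT case assumes the effective per-query wait is capped by the 5-second open timeout.

```ruby
# Illustrative arithmetic only; these helpers are not part of the MR.
QUERY_COUNT = 9

# Worst case when every query waits for the full per-query timeout.
def worst_case_total(per_query_seconds, query_count = QUERY_COUNT)
  per_query_seconds * query_count
end

# Errno::ETIMEDOUT case: assumes the per-query wait is the smaller of the
# OS-level timeout and the configured open timeout (5 seconds).
def etimedout_total(os_timeout_seconds, open_timeout_seconds = 5, query_count = QUERY_COUNT)
  [os_timeout_seconds, open_timeout_seconds].min * query_count
end

worst_case_total(60)  # before !37924: 540 seconds
worst_case_total(5)   # Net::OpenTimeout after !37924: 45 seconds
worst_case_total(10)  # Net::ReadTimeout after !37924: 90 seconds
etimedout_total(31.5) # capped by the open timeout: 45 seconds
```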

This MR skips all subsequent queries once one of these failures is encountered, which reduces the total duration_s further:

  • when Net::OpenTimeout: 5 seconds
  • when Net::ReadTimeout: 10 seconds
  • when Errno::ETIMEDOUT: no more than 5 seconds
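The skip behaviour described above could be sketched roughly as follows. This is not the MR's actual code: `run_queries`, `TIMEOUT_ERRORS`, and the hash-of-lambdas input are assumptions made for illustration. It only shows the idea of running no further queries after the first timeout-type failure.

```ruby
require 'net/http' # loads Net::OpenTimeout and Net::ReadTimeout

# Hypothetical list of failures that should stop all further queries.
TIMEOUT_ERRORS = [Net::OpenTimeout, Net::ReadTimeout, Errno::ETIMEDOUT].freeze

# Runs each query in order. After the first timeout-type failure,
# the remaining queries are skipped and their results left as nil.
def run_queries(queries)
  results = {}
  failures = []
  aborted = false

  queries.each do |name, query|
    if aborted
      results[name] = nil # skipped: a previous query already timed out
      next
    end

    begin
      results[name] = query.call
    rescue *TIMEOUT_ERRORS => e
      failures << e.class.name
      results[name] = nil
      aborted = true
    end
  end

  { results: results, failures: failures }
end
```

With this shape, the worst-case total is a single per-query timeout rather than nine of them, because at most one query waits for the timeout to expire.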

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

Refs #232786

Edited by Peter Leitzen