Skip subsequent topology Prometheus queries if timeout occur (!38293)

Qingyu Zhao requested to merge 232786-respect-stop-query-failures into master Jul 30, 2020

What does this MR do?

The Query duration_s dashboard shows some very long duration_s values (540 seconds).

These long-duration failures are:

Net::ReadTimeout 540 seconds (60 seconds/query * 9 queries)
Net::OpenTimeout 540 seconds (60 seconds/query * 9 queries)
Errno::ETIMEDOUT 283 seconds (31.5 seconds/query * 9 queries)

It seems that once one of these failures occurs, all subsequent queries are likely to fail for the same reason.
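The totals above follow from each of the nine queries waiting out its full timeout before failing. A minimal sketch of that arithmetic; the method name and its parameters are illustrative and not taken from the GitLab codebase:

```ruby
# Hypothetical helper (not GitLab's actual code): worst-case total duration
# when every query waits out its full timeout before failing.
def worst_case_duration(seconds_per_query, query_count = 9)
  seconds_per_query * query_count
end

worst_case_duration(60)   # Net::ReadTimeout / Net::OpenTimeout => 540
worst_case_duration(31.5) # Errno::ETIMEDOUT => 283.5
```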

MR !37924 (merged) improves this to:

when Net::OpenTimeout : 45 seconds (5 seconds/query * 9 queries)
when Net::ReadTimeout : 90 seconds (10 seconds/query * 9 queries)
when Errno::ETIMEDOUT : no more than 45 seconds
- < 45 seconds, if Errno::ETIMEDOUT < Net::OpenTimeout (5 seconds)
- 45 seconds, if Errno::ETIMEDOUT >= Net::OpenTimeout (5 seconds)
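One way to read the Errno::ETIMEDOUT cases: each query gives up at whichever limit is reached first, the system-level timeout or the 5-second Net::OpenTimeout limit. A rough sketch under that assumption; the method name and parameters are hypothetical:

```ruby
# Hypothetical illustration of the Errno::ETIMEDOUT bound described above.
# system_timeout: seconds until the OS reports ETIMEDOUT (assumed input).
def etimedout_worst_case(system_timeout, open_timeout = 5, query_count = 9)
  # Each query stops at whichever limit is reached first.
  [system_timeout, open_timeout].min * query_count
end

etimedout_worst_case(3)    # => 27 (less than 45 seconds)
etimedout_worst_case(31.5) # => 45 (capped by the 5-second open timeout)
```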

This MR skips all remaining queries once one of these failures is encountered, which reduces the total duration_s further:

when Net::OpenTimeout : 5 seconds
when Net::ReadTimeout : 10 seconds
when Errno::ETIMEDOUT : no more than 5 seconds
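The skip behaviour can be sketched roughly as follows. This is an illustration of the idea only, not the code in this MR: `timeout_errors`, `run_queries`, and the query names are hypothetical.

```ruby
require 'net/http' # provides Net::OpenTimeout and Net::ReadTimeout

# Hypothetical sketch; names and structure are illustrative, not the MR's code.
def timeout_errors
  [Net::OpenTimeout, Net::ReadTimeout, Errno::ETIMEDOUT]
end

# queries: Hash of name => callable that performs one Prometheus query.
# Returns [results, failures]. After the first timeout, the remaining
# queries are not executed and are recorded as skipped.
def run_queries(queries)
  results = {}
  failures = []
  timed_out = false

  queries.each do |name, query|
    if timed_out
      failures << "#{name}: skipped"
      next
    end

    begin
      results[name] = query.call
    rescue *timeout_errors => e
      timed_out = true
      failures << "#{name}: #{e.class}"
    end
  end

  [results, failures]
end

# Usage with stand-in queries (the second one simulates a read timeout).
queries = {
  app_requests: -> { 42 },
  node_memory:  -> { raise Net::ReadTimeout },
  node_cpus:    -> { 8 }
}
results, failures = run_queries(queries)
# results  => { app_requests: 42 }
# failures => ["node_memory: Net::ReadTimeout", "node_cpus: skipped"]
```

With this shape, only the first failing query pays the timeout cost, which matches the single-timeout totals listed above.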

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
[-] Tested in all supported browsers
[-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

Refs #232786

Edited Aug 13, 2020 by Peter Leitzen
