Investigate Agent QA test failures on Staging
Problem
The incident that happened on gitlab-com/gl-infra/production#6139 (closed) shows that KAS is a deployment blocker for GitLab. This means for that the Agent QA test should run either on the :smoke
or :reliable
staging test suites, so that we get notified for Agent problems before deploying a new version of GitLab. But before we can do this, we need to fix the test which is currently failing consistently on staging.
Moving the spec to the proper test suites will happen after we fix it and it's being tracked here: #351218 (closed)
This issue is aiming to understand why the test fails on staging. We have a hunch that it might have to do with the new ingress changes that had to be rolled back from gprd as described in the incident. The ingress changes are still live on gstg, so this is one of the things we should look into. Since these changes are necessary, we probably should investigate what KAS needs for it to work well with them.
Kas runs on staging and is part of the GitLab deployment escalation process.
On staging there is currently a testing gap where if Kas becomes unresponsive it doesnt trigger an error in a QA test.
Example of Kas becoming unresponsive during a deployment to staging: https://dashboards.gitlab.net/d/kas-main/kas-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&from=now-2d&to=now
This gap was identified during this incident
Definition of done
-
Pin point what exactly happened that caused the incident disruption -
Assess whether this problem was related to #349708 (closed) -
If the failure on staging is related, define the root cause for the failure
-
-
Create a separate issue with a proposal to fix the problem