June-7-2023: GitLab-bot not reacting to webhook events
Reported by @leetickett-gitlab in a Slack thread: gitlab-bot commands had stopped working about 2 hours earlier.
I also confirmed this problem in #1337 by adding the maintenance::refactor label and noticing that @gitlab-bot failed to apply type::maintenance. See the comments below for the investigation steps.
Timeline
See #1352 (comment 1429068952)
Culprit
The gitlab-org group webhook was automatically disabled after triage-ops returned 503 errors. See #1352 (comment 1422999805)
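A quick way to spot this state is to inspect the group's hooks and flag any that GitLab has stopped executing. The snippet below is a minimal sketch, not the check that was actually used during the incident. It assumes each hook is a dict shaped like an entry returned by the GitLab group hooks API, and that the entry exposes `alert_status` and `disabled_until` fields; both field names should be verified against the GitLab version in use. The sample data and URLs are hypothetical.

```python
from typing import Dict, Iterable, List


def find_disabled_hooks(hooks: Iterable[Dict]) -> List[Dict]:
    """Return the hooks that look disabled.

    Assumption: a healthy hook reports alert_status == "executable"
    and has no disabled_until timestamp.
    """
    disabled = []
    for hook in hooks:
        status = hook.get("alert_status", "executable")
        if status != "executable" or hook.get("disabled_until"):
            disabled.append(hook)
    return disabled


# Hypothetical sample data, standing in for a real API response.
sample_hooks = [
    {"id": 1, "url": "https://triage-ops.example.com/",
     "alert_status": "disabled", "disabled_until": None},
    {"id": 2, "url": "https://other.example.com/hook",
     "alert_status": "executable", "disabled_until": None},
]

disabled_ids = [hook["id"] for hook in find_disabled_hooks(sample_hooks)]
print(disabled_ids)
```

Running a check like this on a schedule would surface a disabled webhook without waiting for a user report.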
Corrective actions
- As much as possible, ensure that the entire triage-ops is not down if a single processor is misconfigured.
  - Triage-ops was not down because of a misconfiguration, but rather because of a node restart that made it answer with HTTP 50X errors for more than 30s, after which gitlab.com disabled the webhooks going to triage-ops.
- Add holistic health checks/monitoring in GCP so that we always know whether triage-ops is working.
- ✅ Document what to do when adding an environment variable to triage-ops.
  - Will be done as part of #1352 (comment 1422744611). MR: gitlab-org/quality/engineering-productivity/team!114 (merged)
- Consider making triage-ops more resilient to node outages. For this, we could deploy several replicas scheduled on separate nodes.
- Consider running a dry-run deploy job to try to catch those issues (this cannot easily be done with the current setup: see the next corrective action).
  - Will be done in #1358.
- Migrate triage-ops to a helm chart instead of kubectl commands (and add chart linting to it).
  - Will be done in #1356
- Configure liveness/readiness probes for the triage deployment.
  - Will be done in #1357
- Log the user agent, or find some other way to distinguish uptime-check requests from regular requests (see discussion). Issue: gitlab-org/quality/engineering-productivity/team#228 (closed)
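For the last item, one possible approach is to tag each logged request according to its User-Agent header. The snippet below is only a sketch of that idea in Python, not triage-ops code. It assumes that GCP uptime checks send a User-Agent containing `GoogleStackdriverMonitoring-UptimeChecks`; that marker, the function names, and the log format are all illustrative assumptions and should be checked against real traffic.

```python
from typing import Mapping

# Assumption: substring present in the User-Agent of GCP uptime checks.
UPTIME_CHECK_MARKER = "GoogleStackdriverMonitoring-UptimeChecks"


def get_user_agent(headers: Mapping[str, str]) -> str:
    """Look up the User-Agent header case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "user-agent":
            return value
    return ""


def classify_request(headers: Mapping[str, str]) -> str:
    """Return "uptime_check" for monitoring probes, "regular" otherwise."""
    if UPTIME_CHECK_MARKER in get_user_agent(headers):
        return "uptime_check"
    return "regular"


def format_log_line(method: str, path: str, headers: Mapping[str, str]) -> str:
    """Build a log line that records both the source and the raw user agent."""
    source = classify_request(headers)
    user_agent = get_user_agent(headers)
    return f'{method} {path} source={source} user_agent="{user_agent}"'


# Hypothetical requests for illustration.
print(format_log_line("GET", "/", {
    "User-Agent": "GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)",
}))
print(format_log_line("POST", "/", {"user-agent": "GitLab/16.1"}))
```

Logging the raw user agent alongside the derived `source` field keeps the logs useful even if the marker assumption turns out to be wrong.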
Edited by David Dieulivol