June-7-2023: GitLab-bot not reacting to webhook events

Reported by @leetickett-gitlab in a Slack thread: gitlab-bot commands have not worked for the past 2 hours:

gitlab-org/gitlab-runner!4135 (comment 1422520944)
gitlab-org/gitlab!122948 (comment 1422621173)

I also confirmed this problem in #1337 by adding the maintenance::refactor label and noticing that @gitlab-bot failed to apply type::maintenance. See the comments below for the investigation steps.

Timeline

See #1352 (comment 1429068952)

Culprit

The gitlab-org group webhook was automatically disabled after triage-ops returned 503 errors. See #1352 (comment 1422999805)

Corrective actions

As much as possible, ensure that a single misconfigured processor cannot take down all of triage-ops
- Triage-ops did not go down because of a misconfiguration. A node restart made it return HTTP 5xx errors for more than 30 seconds, and gitlab.com then disabled the webhooks going to triage-ops.
Add holistic health checks/monitoring in GCP so that we always know whether triage-ops is working
- https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure/-/merge_requests/391 and !2283 (merged)
✅ Document what to do when adding an environment variable to triage-ops
- Will be done as part of #1352 (comment 1422744611). MR: gitlab-org/quality/engineering-productivity/team!114 (merged)
Consider making triage-ops more resilient to node outages, for example by deploying several replicas scheduled on separate nodes
- Will be done in https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure/-/issues/95
Consider running a dry-run deploy job to try to catch those issues (not easy to do with the current setup: see the next corrective action)
- Will be done in #1358.
Migrate triage-ops to a Helm chart instead of kubectl commands (and add chart linting to it)
- Will be done in #1356
Configure liveness/readiness probes for the triage deployment
- Will be done in #1357
Log the user agent, or find another way to uniquely distinguish uptime check requests from regular requests (see discussion). Issue: gitlab-org/quality/engineering-productivity/team#228 (closed)
- https://gitlab.com/gitlab-org/quality/engineering-productivity-infrastructure/-/merge_requests/391 and !2283 (merged)
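To illustrate the kind of check the monitoring corrective action is after, here is a minimal Python sketch that scans a series of health probe results and reports any window in which the service returned HTTP 5xx for longer than a threshold. The 30-second default mirrors the duration mentioned above; the `Probe` type, the `find_outages` function, and the input format are hypothetical and are not part of triage-ops or of the GCP monitoring setup.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Hypothetical probe record: (time of the probe, HTTP status it received).
Probe = Tuple[datetime, int]


def find_outages(
    probes: Iterable[Probe],
    threshold: timedelta = timedelta(seconds=30),
) -> List[Tuple[datetime, datetime]]:
    """Return (first_bad, last_bad) windows of consecutive 5xx probes
    whose span is strictly longer than `threshold`."""
    outages: List[Tuple[datetime, datetime]] = []
    start = None     # first 5xx probe of the current run
    last_bad = None  # most recent 5xx probe of the current run

    for ts, status in sorted(probes):
        if 500 <= status <= 599:
            if start is None:
                start = ts
            last_bad = ts
        else:
            # A healthy probe ends the current run of failures.
            if start is not None and last_bad - start > threshold:
                outages.append((start, last_bad))
            start = None
            last_bad = None

    # Handle a run of failures that is still ongoing at the end of the data.
    if start is not None and last_bad - start > threshold:
        outages.append((start, last_bad))
    return outages
```

A check like this only sees what the probes see, so the probe interval has to be well under the threshold for a window to be measured accurately.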

Edited Jul 06, 2023 by David Dieulivol
