Create issues from production alerts using external Prometheus
Problem
Today, incident management (opening an issue when an alert triggers) is available only with the internal Prometheus. This functionality does not exist when an external Prometheus is used.
Proposal
Allow an external Prometheus to use incident management and open an issue when an alert triggers. The MR will document the change users need to make to the alertmanager.yml file of their external Prometheus, and will use a webhook to open incidents automatically.
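To make the webhook flow concrete, here is a minimal sketch of how a receiver could turn an Alertmanager webhook notification into issue drafts. It is illustrative only and is not GitLab's implementation: the function name `issue_drafts_from_webhook` and the shape of the returned drafts are hypothetical, and the sample payload is made up. It assumes the standard Alertmanager webhook body, which carries an `alerts` list whose entries have `status`, `labels`, `annotations` and `startsAt`.

```python
import json


def issue_drafts_from_webhook(body: str) -> list[dict]:
    """Build one issue draft per firing alert in an Alertmanager webhook body.

    The draft fields (title, severity, description, started_at) are a
    hypothetical shape chosen for this sketch.
    """
    payload = json.loads(body)
    drafts = []
    for alert in payload.get("alerts", []):
        # Only firing alerts should open an incident; resolved ones are skipped.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        drafts.append({
            "title": labels.get("alertname", "Unnamed alert"),
            "severity": labels.get("severity", "unknown"),
            "description": annotations.get("description", ""),
            "started_at": alert.get("startsAt"),
        })
    return drafts


# Made-up sample notification containing one firing and one resolved alert.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "High5xxRate", "severity": "s1"},
            "annotations": {"description": "5xx responses above threshold"},
            "startsAt": "2019-09-01T12:00:00Z",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "LowApdex", "severity": "s2"},
            "annotations": {},
            "startsAt": "2019-09-01T11:00:00Z",
        },
    ],
})

if __name__ == "__main__":
    for draft in issue_drafts_from_webhook(SAMPLE_BODY):
        print(draft)
```

On the Alertmanager side, the matching change would be a receiver with a webhook_configs entry pointing at the project's endpoint; the exact URL and authorization are what the MR is meant to document.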
Original issue
Attempt to connect to our Thanos endpoint. We have two servers for redundancy:
thanos-query-01-inf-ops.c.gitlab-ops.internal:10902
thanos-query-02-inf-ops.c.gitlab-ops.internal:10902
If running Prometheus queries against Thanos does not work properly, we will have to deal with data spread across multiple Prometheus servers, which may make it harder to succeed with this. For the purpose of starting out, here are the endpoints for those servers:
prometheus-01-inf-gprd.c.gitlab-production.internal:9090
prometheus-02-inf-gprd.c.gitlab-production.internal:9090
prometheus-app-01-inf-gprd.c.gitlab-production.internal:9090
prometheus-app-02-inf-gprd.c.gitlab-production.internal:9090
prometheus-db-01-inf-gprd.c.gitlab-production.internal:9090
prometheus-db-02-inf-gprd.c.gitlab-production.internal:9090
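The "try Thanos first, then fall back to individual Prometheus servers" idea can be sketched as below. This is a rough sketch under assumptions: it uses the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), which Thanos Query also serves; the plain `http://` scheme, the helper names, and the injectable `fetch` parameter are choices made for this example, and only a subset of the hosts listed above is included.

```python
import json
import urllib.parse
import urllib.request

# Hosts copied from the lists above; the http:// scheme used below is an assumption.
THANOS_QUERY_HOSTS = [
    "thanos-query-01-inf-ops.c.gitlab-ops.internal:10902",
    "thanos-query-02-inf-ops.c.gitlab-ops.internal:10902",
]
PROMETHEUS_HOSTS = [
    "prometheus-01-inf-gprd.c.gitlab-production.internal:9090",
    "prometheus-02-inf-gprd.c.gitlab-production.internal:9090",
]


def query_url(host: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"http://{host}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def http_get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def first_successful_query(hosts, promql, fetch=http_get_json):
    """Try each host in order; return (host, result) for the first success.

    Returns None when every host fails or answers with a non-success status.
    """
    for host in hosts:
        try:
            reply = fetch(query_url(host, promql))
        except (OSError, ValueError):
            # Network errors and undecodable bodies: move on to the next host.
            continue
        if reply.get("status") == "success":
            return host, reply["data"]["result"]
    return None
```

Calling `first_successful_query(THANOS_QUERY_HOSTS + PROMETHEUS_HOSTS, "up")` would prefer Thanos and only reach the individual Prometheus servers if both Thanos hosts fail. Note that a single Prometheus server only holds part of the data, as described above.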
Start up a project and connect to one of the above on ops.gitlab.net. Then we'll have some custom metrics fed into the project. For example:
- From https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s:
  - 5xx Responses
- From https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&var-type=git&from=now-30d&to=now:
  - Latency Apdex
  - Error Ratios
  - Service Availability
And configure alerts to create issues.
- https://gitlab.com/gitlab-com/runbooks/blob/master/alertmanager/alertmanager.yml.erb - the template in which we define where alerts go based on environment, severity, etc.
- https://gitlab.com/gitlab-com/runbooks/tree/master/rules - all of our custom rules and alert definitions
  - Searching in this directory for severity: s1 would be a good start for figuring out which rules the Infrastructure team looks at.
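The search for severity: s1 suggested above could be scripted roughly as follows. This is a sketch that assumes a local checkout of the runbooks repository and that the rule files use a `.yml` extension; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Matches a whole line such as "severity: s1" (optionally quoted), but not "s10" or "s2".
SEVERITY_S1 = re.compile(r"^[ \t]*severity:[ \t]*['\"]?s1['\"]?[ \t]*$", re.MULTILINE)


def mentions_s1(text: str) -> bool:
    """Return True if the text contains a 'severity: s1' line."""
    return SEVERITY_S1.search(text) is not None


def files_with_s1(rules_dir: str) -> list[str]:
    """List rule files (relative paths) under rules_dir that mention severity s1."""
    root = Path(rules_dir)
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*.yml")
        if mentions_s1(path.read_text(encoding="utf-8"))
    )
```

For example, `files_with_s1("runbooks/rules")` would return the rule files worth reviewing first, where `runbooks/rules` is the assumed path of the local checkout.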
This description is taken from the following: &1859 (comment 214112227)