Monitor the Monitor group's demo environments
What does this MR do?
Overview
The Monitor team would like to be alerted when Prometheus stops reporting metrics in our demo environments. For now this code will run on only 2-3 projects on .com and staging, but in the future it may run on more projects.
Part of: gitlab-org/monitor/general#58
Technical implementation
New column in the cluster_applications_prometheus table named healthy.
A scheduled job that runs every x minutes and checks health via Prometheus.
This job stores a boolean in the healthy column: true if healthy, false if not.
Sends an alert to our generic alert endpoint when healthy=true changes to healthy=false, but not if the column already had that value.
This MR contains only the service; the worker will come later.
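The transition rule above (alert only when healthy flips from true to false) can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this MR: `PrometheusApp`, `HealthCheckService`, `check_health`, and `send_alert` are hypothetical names, the health probe and the alert endpoint are injected as plain callables, and treating a failed probe as unhealthy is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrometheusApp:
    """Hypothetical stand-in for a row in cluster_applications_prometheus."""
    project: str
    healthy: bool = True  # the new boolean column


class HealthCheckService:
    """Runs one health check and alerts only on a healthy -> unhealthy change."""

    def __init__(
        self,
        check_health: Callable[[PrometheusApp], bool],
        send_alert: Callable[[PrometheusApp], None],
    ) -> None:
        # Both callables are illustrative placeholders for the real
        # Prometheus probe and the generic alert endpoint.
        self.check_health = check_health
        self.send_alert = send_alert

    def execute(self, app: PrometheusApp) -> None:
        was_healthy = app.healthy
        try:
            currently_healthy = bool(self.check_health(app))
        except Exception:
            # Assumption: a probe that raises counts as unhealthy.
            currently_healthy = False

        # Always store the latest result in the column.
        app.healthy = currently_healthy

        # Alert only when the stored value changes from true to false,
        # not when it was already false.
        if was_healthy and not currently_healthy:
            self.send_alert(app)
```

With this shape, a scheduled worker would only need to call `execute` once per tracked Prometheus instance on each run; repeated failures after the first produce no further alerts until the instance has recovered and failed again.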
Screenshots
Does this MR meet the acceptance criteria?
Conformity
- Changelog entry
- [-] Documentation (if required)
- Code review guidelines
- Merge request performance guidelines
- [-] Style guides
- Database guides
- Separation of EE specific content
Availability and Testing
- Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
- [-] Tested in all supported browsers
- [-] Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
- [-] Label as security and @ mention @gitlab-com/gl-security/appsec
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team