Help service owners onboard initial services using the Alert Playbook Template
Create some MRs for service owners containing the Alert Playbook Template and some initial information to get them started with creating playbooks. The intention here is not to fill out the entire template ourselves but to walk the service owners through filling out the details.
For each alert that we identify as needing a playbook:
- Create an MR containing the alert template in the right location and named using the alert name
- Above the MR description edit box, select "Choose a Template" and choose `alert-playbook-template`, then paste in `Related to https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/25386` at the bottom and save
- Fill in all information that we are able to on our own
- Update the `runbook` link in the alert definition
- Assign the MR to both an Ops team member and a member of the team owning the service
- Work with that identified service owner to help them gather the information they need to complete the template
- Before merging, each MR will require running `make generate` to update the links. It's best to save this step for last.
Here is the list of alerts to choose from. When starting work on one, please paste the URL of the MR here as soon as you create it. Then check off the checkbox when the MR is merged.
- ApdexSLOViolation 👉 gitlab-com/runbooks!7501
- TrafficAbsent 👉 gitlab-com/runbooks!7502
- ErrorSLOViolation 👉 gitlab-com/runbooks!7503
- KubeContainersWaitingInError 👉 gitlab-com/runbooks!7415 (merged)
- component_saturation_slo_out_of_bounds:kube_persistent_volume_claim_disk_space
- BlackboxProbeFailures 👉 gitlab-com/runbooks!7609
- SidekiqQueueTooLarge 👉 gitlab-com/runbooks!7433 (merged)
- CloudSQLDatabaseDown
- WALGBaseBackupFailed 👉 gitlab-com/runbooks!7550 (merged)
- PostgreSQL_UnusedReplicationSlot
- PostgreSQL_CommitRateTooLow
- walgBaseBackupDelayed 👉 gitlab-com/runbooks!7438 (merged)
- HAProxyServerDown 👉 gitlab-com/runbooks!7484
- GitalyServiceGoserverTrafficCessationSingleNode 👉 gitlab-com/runbooks!7608
- PostgreSQLPossibleFailover
- PatroniGCSSnapshotDelayed 👉 gitlab-com/runbooks!7568 (merged)
- GitalyVersionMismatch 👉 gitlab-com/runbooks!7520
- GCPScheduledSnapshotsDelayed 👉 gitlab-com/runbooks!7497
- GitalyFileServerDown 👉 gitlab-com/runbooks!7595
- ComponentResourceRunningOut_disk_space 👉 gitlab-com/runbooks!7596
- ChefClientErrorCritical 👉 gitlab-com/runbooks!7475 (merged)
- component_saturation_slo_out_of_bounds:pgbouncer_single_core
- component_saturation_slo_out_of_bounds:gcp_quota_limit
- PvsServiceHttpTrafficCessation 👉 gitlab-com/runbooks!7597
- PostgresSplitBrain
- PostgreSQL_ReplicationLagTooLarge
- AlertmanagerNotificationsFailing
- AiGatewayServiceRunwayIngressTrafficCessationRegional 👉 gitlab-com/runbooks!7607
- PatroniLongRunningTransactionDetected 👉 gitlab-com/runbooks!7033 (merged) gitlab-com/runbooks!7436 (merged)
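
To get a quick overview of progress, the list above can be summarized with a small script. The following is only a rough sketch, not part of the runbooks tooling: it assumes each alert sits on its own line in the form `AlertName 👉 gitlab-com/runbooks!NNNN (merged)`, and the names `AlertStatus`, `parse_line`, and `summarize` are made up for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Matches MR references as written in the list, e.g. "gitlab-com/runbooks!7501 (merged)".
MR_PATTERN = re.compile(r"(gitlab-com/runbooks!\d+)(\s+\(merged\))?")


@dataclass
class AlertStatus:
    """Hypothetical record for one line of the alert list."""
    alert: str
    open_mrs: list = field(default_factory=list)
    merged_mrs: list = field(default_factory=list)

    @property
    def state(self) -> str:
        # "merged" only when every MR listed for the alert is merged.
        if self.merged_mrs and not self.open_mrs:
            return "merged"
        if self.open_mrs or self.merged_mrs:
            return "in progress"
        return "unclaimed"


def parse_line(line: str) -> AlertStatus:
    """Parse one list line (assumed format: '- Alert 👉 MR-ref [(merged)] ...')."""
    text = line.strip()
    if text.startswith("-"):
        text = text[1:].strip()
    name, _, rest = text.partition("👉")
    status = AlertStatus(alert=name.strip())
    for match in MR_PATTERN.finditer(rest):
        ref, merged = match.group(1), match.group(2)
        (status.merged_mrs if merged else status.open_mrs).append(ref)
    return status


def summarize(lines) -> Counter:
    """Count alerts per state, skipping blank lines."""
    return Counter(parse_line(line).state for line in lines if line.strip())


if __name__ == "__main__":
    sample = [
        "- ApdexSLOViolation 👉 gitlab-com/runbooks!7501",
        "- KubeContainersWaitingInError 👉 gitlab-com/runbooks!7415 (merged)",
        "- CloudSQLDatabaseDown",
    ]
    for state, count in sorted(summarize(sample).items()):
        print(f"{state}: {count}")
```

Alerts reported as "unclaimed" have no MR reference yet and are the ones still available to pick up.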
Edited by Shreya Shah