Investigate ContainerRegistryPrimaryDatabaseCPUSaturation alerts firing every other day
Alert Details
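The ContainerRegistryPrimaryDatabaseCPUSaturation alert is firing every other day in production (gprd). Per the alert expression linked below, it fires when the one-hour peak of CPU pressure (node_pressure_cpu_waiting_seconds_total) on the patroni-registry primary database node is more than three standard deviations above its one-day average.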
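To make the threshold concrete, here is a minimal Python sketch of the z-score check, as read from the alert expression under Relevant Links. The function names and the sample-list inputs are assumptions for illustration only. The real evaluation happens in Prometheus over 5-minute rates of node_pressure_cpu_waiting_seconds_total, restricted to the primary node.

```python
from statistics import fmean, pstdev


def cpu_pressure_zscore(rates_1d, rates_1h, floor=0.01):
    """Z-score of the 1-hour peak rate against the 1-day baseline.

    rates_1d / rates_1h are hypothetical lists of 5-minute rate samples.
    Returns None when the peak is below the floor (the `>= 0.01` filter in
    the expression) or when the baseline has no variation.
    """
    peak = max(rates_1h)
    if peak < floor:
        return None
    mean = fmean(rates_1d)
    # Population standard deviation, which is what stddev_over_time computes.
    sd = pstdev(rates_1d)
    if sd == 0:
        return None
    return (peak - mean) / sd


def alert_fires(rates_1d, rates_1h, threshold=3.0):
    """True when the peak is more than `threshold` deviations above average."""
    z = cpu_pressure_zscore(rates_1d, rates_1h)
    return z is not None and z > threshold
```

Because the baseline is a rolling one-day window, a short spike on an otherwise quiet node can exceed three standard deviations even when absolute CPU pressure is low. This may be worth keeping in mind when judging how severe each firing is.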
Timeline
- Alert started firing after a new replica was added to the production cluster
- Related incident last month: gitlab-com/gl-infra/production-engineering#27942
- Unclear if replica addition is coincidental or causal
Investigation Notes
The previous incident raised a potential connection with storage usage calculation queries, which are known to be slow and to time out for large namespaces. These queries should now be routed to replicas rather than the primary, so they should not contribute to CPU load on the primary. However, this needs verification.
Relevant Links
- Runbook: https://runbooks.gitlab.com/patroni/primary_db_node_cpu_saturation/
- Alert graph: https://alerts.gitlab.net/graph?g0.expr=%28%28%28max_over_time%28rate%28node_pressure_cpu_waiting_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D%22patroni-registry%22%7D%5B5m%5D%29%5B1h%3A%5D%29+%3E%3D+0.01%29+-+avg_over_time%28rate%28node_pressure_cpu_waiting_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D%22patroni-registry%22%7D%5B5m%5D%29%5B1d%3A%5D%29%29+%2F+stddev_over_time%28rate%28node_pressure_cpu_waiting_seconds_total%7Benv%3D%22gprd%22%2Ctype%3D%22patroni-registry%22%7D%5B5m%3A%5D%29%5B1d%3A%5D%29+and+on+%28fqdn%29+%28pg_replication_is_replica+%3D%3D+0%29%29+%3E+3&g0.tab=1
- Database dashboard: https://dashboards.gitlab.net/d/postgres-ai-NEW_postgres_ai_04/cfd803b?from=now-6h&to=now&timezone=utc&var-prometheus=mimir-gitlab-gprd&var-environment=gprd&var-type=patroni-registry&var-wait_type=$__all&var-wait_event=$__all
- Related replica scaling issue: gitlab-com/gl-infra/data-access/dbo/dbo-issue-tracker#617 (closed)
- Previous incident: gitlab-com/gl-infra/production-engineering#27942
Next Steps
- Analyze query patterns on primary vs replicas to identify what's causing CPU saturation
- Verify that storage usage calculation queries are properly routed to replicas
- Determine whether the replica addition correlates with the onset or frequency of the alert
- Decide whether additional tuning or query optimization is needed
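
One possible way to approach the correlation step is to compare how often the alert fired in equal windows before and after the replica was added. The sketch below assumes the firing dates have already been exported from the alert history. The function names, the window length, and the input format are all hypothetical.

```python
from datetime import timedelta


def firing_rate(firing_days, start, end):
    """Fraction of days in [start, end) on which the alert fired at least once."""
    total_days = (end - start).days
    if total_days <= 0:
        raise ValueError("end must be after start")
    # A set collapses multiple firings on the same day into one.
    days_fired = {d for d in firing_days if start <= d < end}
    return len(days_fired) / total_days


def compare_before_after(firing_days, replica_added, window_days=14):
    """Return (rate_before, rate_after) around the replica addition date."""
    window = timedelta(days=window_days)
    before = firing_rate(firing_days, replica_added - window, replica_added)
    after = firing_rate(firing_days, replica_added, replica_added + window)
    return before, after
```

A clear jump from near 0 to about 0.5 (every other day) would support the timing correlation, but it would not by itself show that the replica addition is the cause. Other changes that landed in the same window would need to be ruled out.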