The changes introduce a new feature to monitor the health of RDS (Relational Database Service) instances in GitLab. It adds several saturation points to track CPU utilization, database connections, disk space, and memory usage. These metrics help identify potential issues and ensure the RDS instances are operating within optimal parameters. Additionally, the code includes configuration options to enable or disable RDS monitoring and specify instance-specific parameters like maximum allocated storage and RAM. These enhancements improve the overall monitoring and management of RDS instances within GitLab.
Adds AWS RDS as a service that we can monitor. This MR adds the service declaration and a few saturation metrics:
- CPU Utilization
- Freeable Memory
- Swap Usage
- Disk Utilization
AWS RDS autoscales storage up to a limit configured when the instance is initially built. RDS first builds the database with the smallest storage size requested and then autoscales storage whenever we breach either the 10% or the 10GB remaining threshold, again up to the absolute maximum. Because the configured maximum is stored elsewhere, we make it a configurable value. We then take the metric `pg_database_size_bytes`, summed across all databases we have access to on a given RDS instance, and build a saturation metric from that sum relative to the size specified in the configuration. For Dedicated, an example of how this is configured can be seen here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2374/diffs#09001fc5abaa8142a88e86b14626ff73453d3f37_14_14
# HELP pg_database_size_bytes Disk space used by the database
# TYPE pg_database_size_bytes gauge
On a sandbox where only QA has run a few times, I've got a database consuming 2GB. Other relations bring this to roughly 2.024GB of space used. As seen from the screenshot, where `1000` is the absolute limit for RDS, we're consuming roughly `0.002` (about 0.2%) of our saturation threshold.
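The arithmetic behind that figure can be sketched as follows. This is a minimal illustration and not the expression used in this MR: the function name `disk_saturation` and the parameter `max_allocated_storage_gib` are hypothetical, and the sketch assumes the configured limit is expressed in GiB.

```python
GIB = 1024 ** 3  # bytes per GiB; assumes the configured limit is given in GiB


def disk_saturation(database_sizes_bytes, max_allocated_storage_gib):
    """Return summed database size as a ratio of the configured storage ceiling.

    database_sizes_bytes: one pg_database_size_bytes sample per database.
    max_allocated_storage_gib: hypothetical name for the configured maximum.
    """
    if max_allocated_storage_gib <= 0:
        raise ValueError("max_allocated_storage_gib must be positive")
    used_bytes = sum(database_sizes_bytes)
    ratio = used_bytes / (max_allocated_storage_gib * GIB)
    # Clamp to [0, 1] so the result behaves like other saturation ratios.
    return min(max(ratio, 0.0), 1.0)


# Sandbox numbers from the text: about 2 GB plus about 0.024 GB of other relations.
sizes = [2 * GIB, int(0.024 * GIB)]
ratio = disk_saturation(sizes, 1000)
print(f"{ratio:.6f} ({ratio:.2%})")
```

With the sandbox numbers, the ratio works out to roughly 0.002, which is about 0.2% of the configured maximum.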
Uses the native CPU Utilization metric from CloudWatch for RDS as the CPU saturation signal. We do some minor work to make this metric friendlier and consistent with the rest of our metrics system. For example, we leverage clamp min and max in various areas and use the float `0.99` to represent 99%. The metric from AWS comes in as a plain percentage number (0 to 100), so I've added a simple `X / 100` to turn it into a float that works with our observability stack.
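The conversion described above can be sketched like this. The function name `cpu_saturation` is hypothetical, and the clamping shown here only mirrors the "clamp min and max" convention mentioned in the text rather than reproducing the actual recording rule.

```python
def cpu_saturation(cpu_utilization_percent):
    """Convert a 0-100 CPU percentage into a 0-1 saturation ratio."""
    ratio = cpu_utilization_percent / 100
    # Clamp so that out-of-range samples cannot leave the [0, 1] interval.
    return min(max(ratio, 0.0), 1.0)


print(cpu_saturation(99))    # a 99% sample becomes a 0.99 ratio
print(cpu_saturation(120))   # an out-of-range sample is clamped to 1.0
```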
Memory is another fun one. We alert when we are low on freeable memory and when swap is being leveraged. When we start paging RAM, the database slows down, which hinders the performance of GitLab. As with disk utilization above and connections below, the instance's RAM needs to be configured elsewhere before the freeable memory alert can work. This is considered a lower priority alert. An example of how we may configure memory for the freeable metric can be seen here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2374/diffs?commit_id=1603b0abbb1af23cbfb1199c1f4a13f63a66d4dc#09001fc5abaa8142a88e86b14626ff73453d3f37_15_15
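One way to turn freeable memory into a saturation ratio is sketched below. This is an assumption about the shape of the calculation, not the expression from this MR: `memory_saturation` and `total_ram_gib` are hypothetical names, and the sketch assumes freeable memory is reported in bytes while the instance RAM is configured in GiB.

```python
GIB = 1024 ** 3  # bytes per GiB


def memory_saturation(freeable_memory_bytes, total_ram_gib):
    """Return the fraction of configured RAM that is no longer freeable."""
    if total_ram_gib <= 0:
        raise ValueError("total_ram_gib must be positive")
    free_fraction = freeable_memory_bytes / (total_ram_gib * GIB)
    # Saturation rises as freeable memory shrinks; clamp to [0, 1].
    return min(max(1.0 - free_fraction, 0.0), 1.0)


def swap_in_use(swap_usage_bytes):
    """Any non-zero swap usage is treated as a sign that RAM is being paged."""
    return swap_usage_bytes > 0


print(memory_saturation(4 * GIB, 16))  # 4 GiB freeable out of 16 GiB configured
print(swap_in_use(0))
```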
RDS, like all DBs, has a limit on how many connections are allowed. This MR uses alerting defaults similar to those in our own metrics system and concentrates on creating the ability to derive a metric using a pattern similar to the one used for disk utilization. A value needs to be specified before this metric begins collecting and alerting at all. An example of how this could be done is over here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/commit/b40b98f1ffae849bca5e89e061056e1a9d4015ff#09001fc5abaa8142a88e86b14626ff73453d3f37_15_16
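The "only active once configured" behaviour can be sketched as follows. The names `connection_saturation` and `max_connections` are hypothetical, and returning `None` is just one way to model a metric that is not emitted when no limit has been specified.

```python
from typing import Optional


def connection_saturation(current_connections: int,
                          max_connections: Optional[int]) -> Optional[float]:
    """Return connections in use as a ratio of the configured limit.

    Returns None when no limit is configured, modelling a metric that is
    neither collected nor alerted on until a value is supplied.
    """
    if max_connections is None:
        return None
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = current_connections / max_connections
    # Clamp to [0, 1] to match the other saturation ratios.
    return min(max(ratio, 0.0), 1.0)


print(connection_saturation(50, None))  # no limit configured, so no metric
print(connection_saturation(50, 200))   # 50 of 200 allowed connections in use
```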
These metrics come from the CloudWatch exporter that was recently added to Dedicated via: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/2890