Adds initial RDS metrics for saturation for Dedicated

John Skarbek requested to merge jts/add-db-related-metrics-dedicated into master

What

Adds AWS RDS as a service that we can monitor. Adds the service declaration and a few saturation metrics:

  • CPU Utilization
  • Freeable Memory
  • Swap Usage
  • Disk Utilization
Disk Utilization

AWS RDS autoscales storage up to a limit configured when the instance is initially built. RDS initially builds the database with the smallest storage size requested, then autoscales storage when free space drops below either the 10% or the 10GB remaining threshold, again up to the absolute maximum. Since the configured maximum is stored elsewhere, we make it a configurable value. We then use the metric pg_database_size_bytes, summed across all databases we have access to on a given RDS instance, and divide it by the configured size to create a saturation metric. For Dedicated, an example of how this is configured can be seen here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2374/diffs#09001fc5abaa8142a88e86b14626ff73453d3f37_14_14

From prometheus-postgres-exporter:

# HELP pg_database_size_bytes Disk space used by the database
# TYPE pg_database_size_bytes gauge
pg_database_size_bytes{datname="gitlabhq_production"} 2.191012399e+09
pg_database_size_bytes{datname="postgres"} 8.409647e+06
pg_database_size_bytes{datname="template0"} 8.233475e+06
pg_database_size_bytes{datname="template1"} 8.393263e+06

On a sandbox where only QA has run a few times, I've got a database consuming 2GB. Other relations bring this to roughly 2.024GB of space used. As seen from the screenshot, where 1000 is the absolute limit for RDS, our saturation ratio is roughly 0.002, or about 0.2% of the limit:

image
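To make the disk calculation concrete, here is a minimal sketch in Python of the ratio described above. It is not the code in this MR: the function names, the `max_storage_gb` parameter, and the choice of 1 GB = 10^9 bytes are assumptions for illustration only. The sample values are the ones quoted from the exporter.

```python
# Exporter sample quoted in this description.
SAMPLE = """\
# HELP pg_database_size_bytes Disk space used by the database
# TYPE pg_database_size_bytes gauge
pg_database_size_bytes{datname="gitlabhq_production"} 2.191012399e+09
pg_database_size_bytes{datname="postgres"} 8.409647e+06
pg_database_size_bytes{datname="template0"} 8.233475e+06
pg_database_size_bytes{datname="template1"} 8.393263e+06
"""


def total_database_bytes(exposition: str) -> float:
    """Sum pg_database_size_bytes across every database in the sample."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        if series.startswith("pg_database_size_bytes"):
            total += float(value)
    return total


def disk_saturation(used_bytes: float, max_storage_gb: float) -> float:
    """Ratio of used storage to the configured autoscaling maximum.

    Assumption: the configured maximum is given in GB and 1 GB = 10**9 bytes.
    """
    if max_storage_gb <= 0:
        raise ValueError("max_storage_gb must be positive")
    return used_bytes / (max_storage_gb * 10**9)


used = total_database_bytes(SAMPLE)
ratio = disk_saturation(used, max_storage_gb=1000)
print(f"used={used:.0f} bytes, saturation={ratio:.4f}")
```

With a configured maximum of 1000, the summed sample gives a ratio close to 0.002, in line with the sandbox observation.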

CPU Utilization

Uses the native CPU Utilization metric from CloudWatch for RDS for CPU saturation. We do some minor work to make this metric friendlier and consistent with the rest of our metrics system. For example, we leverage clamp min and max in various areas and use the float 0.99 to represent 99%. The metric from AWS comes in as a plain percentage number (99 for 99%). Thus I've added a simple X / 100 to turn it into a ratio that works with our observability stack.
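As a rough illustration of the normalization, the sketch below divides the CloudWatch percentage by 100 and clamps the result to the 0 to 1 range. The function name is hypothetical, and the clamp bounds are an assumption about how the rest of the metrics system treats ratios; the MR itself only states the X / 100 conversion.

```python
def cpu_saturation(cpu_utilization_percent: float) -> float:
    """Convert a CloudWatch CPU percentage (e.g. 99) into a ratio (0.99)."""
    ratio = cpu_utilization_percent / 100.0
    # Clamp so that out-of-range samples cannot push the ratio outside [0, 1].
    return min(max(ratio, 0.0), 1.0)


print(cpu_saturation(99))
print(cpu_saturation(150))
```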

Memory Use

Memory is another fun one. We alert when freeable memory is low and when swap is being used. Once the instance starts paging RAM, the database slows down, which in turn hinders the performance of GitLab. As with disk utilization above and connections below, the threshold for freeable memory needs to be configured elsewhere. This is considered a lower priority alert. An example of how we may configure memory for the freeable metric can be seen here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/merge_requests/2374/diffs?commit_id=1603b0abbb1af23cbfb1199c1f4a13f63a66d4dc#09001fc5abaa8142a88e86b14626ff73453d3f37_15_15
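One way to picture the two memory signals is the sketch below. It assumes the configured value is the instance's total memory in bytes and that saturation is expressed as the share of memory that is not freeable; both are assumptions for illustration, as are the function names, and the actual expression lives in the linked configuration.

```python
def memory_saturation(freeable_bytes: float, configured_memory_bytes: float) -> float:
    """Share of configured memory that is no longer freeable, clamped to [0, 1].

    Assumption: configured_memory_bytes is the instance's total memory.
    """
    if configured_memory_bytes <= 0:
        raise ValueError("configured_memory_bytes must be positive")
    ratio = 1.0 - (freeable_bytes / configured_memory_bytes)
    return min(max(ratio, 0.0), 1.0)


def is_swapping(swap_usage_bytes: float) -> bool:
    """Any swap in use is treated as a sign that RAM is being paged."""
    return swap_usage_bytes > 0


GIB = 2**30
print(memory_saturation(2 * GIB, 8 * GIB))
print(is_swapping(0))
```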

Connections

RDS, like all DBs, has a limit on how many connections are allowed. This MR uses alerting defaults similar to those in our own metrics system and concentrates on making it possible to derive the metric with the same pattern used for disk utilization. A value must be specified for this metric to begin collecting and alerting at all. An example of how this could be done is over here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/commit/b40b98f1ffae849bca5e89e061056e1a9d4015ff#09001fc5abaa8142a88e86b14626ff73453d3f37_15_16
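The opt-in behaviour described here can be sketched as follows: without a configured maximum, no saturation value is produced at all. The function and parameter names are hypothetical and only illustrate the pattern, not the code in this MR.

```python
from typing import Optional


def connection_saturation(
    current_connections: float,
    max_connections: Optional[float],
) -> Optional[float]:
    """Ratio of open connections to the configured limit.

    Returns None when no limit is configured, mirroring the idea that the
    metric is neither collected nor alerted on until a value is supplied.
    """
    if max_connections is None:
        return None
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return min(max(current_connections / max_connections, 0.0), 1.0)


print(connection_saturation(50, 200))
print(connection_saturation(50, None))
```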


These metrics come from the CloudWatch exporter that was recently added to Dedicated via: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/2890

Addresses: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/2563

Reference:

Edited by John Skarbek
