Towards long-term capacity planning and alerting

For the past month we have been fighting fires on GitLab.com.

Much of this has been caused by overuse of resources by the application and many of the fixes have involved improving resource utilisation.

As an infrastructure team, we are responsible for communicating with the application development teams, to help them understand that the resources on GitLab.com infrastructure are not a magic carpet bag that will expand to handle everything that gets thrown in it.

Capacity planning is one process that can help frame this communication and, importantly, help us prioritise the work that will avoid resource exhaustion.

Now that we are recording general saturation metrics, we can start using them for long-term trend analysis to answer the following questions:

How long until a particular resource is exhausted?
How fast is our utilization of a particular resource increasing?
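
Both questions reduce to fitting a trend line through recent saturation samples. The following is a minimal sketch, assuming a straight-line fit over daily utilization samples expressed in percent; the function names and sample data are hypothetical and are not taken from the runbooks merge requests.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_exhausted(days, utilization_pct, limit_pct):
    """Days from the last sample until the fitted trend reaches limit_pct.

    Returns None when utilization is flat or falling, because the trend
    then never reaches the limit.
    """
    slope, intercept = fit_line(days, utilization_pct)
    if slope <= 0:
        return None
    current = intercept + slope * days[-1]  # fitted value at the last sample
    return max(0.0, (limit_pct - current) / slope)


# Hypothetical samples: utilization rising one percentage point per day.
days = [0, 1, 2, 3, 4]
utilization_pct = [50, 51, 52, 53, 54]

growth_per_day, _ = fit_line(days, utilization_pct)
remaining = days_until_exhausted(days, utilization_pct, limit_pct=90)
```

The slope answers "how fast", and dividing the remaining headroom by the slope answers "how long". A straight line is the simplest possible model; resources that grow in steps or seasonally would need something more careful.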

With this in mind, I've started experimenting with ways of monitoring and alerting on long-term resource utilization.

As a proof of concept, I've put together a dashboard which I hope will help us determine which services will have exhausted their resources one month from now.

The next step would be to start alerting on these events.

For example, if our Redis CPU utilization grows by 10 percentage points a month, and we have 90% CPU utilization as our limit, then when utilization reaches 80% we should get a notification that Redis CPU will be exhausted in one month's time.
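
The alert condition in this example can be sketched as a simple projection check. This is an illustration only, assuming linear growth measured in percentage points per month; the function name and parameters are hypothetical, not the rule defined in the runbooks.

```python
def will_exhaust_within(current_pct, growth_pct_per_month, limit_pct,
                        horizon_months=1.0):
    """True if the linear projection reaches limit_pct within the horizon."""
    predicted = current_pct + growth_pct_per_month * horizon_months
    return predicted >= limit_pct


# Redis CPU at 80%, growing 10 points a month, with a 90% limit:
# the projection reaches the limit in one month, so the alert fires.
fires_at_80 = will_exhaust_within(80, 10, 90)

# At 79% the one-month projection is 89%, still under the limit.
fires_at_79 = will_exhaust_within(79, 10, 90)
```

In Prometheus, the built-in `predict_linear` function performs a comparable extrapolation over a range vector, so a rule of this shape could be written directly as an alerting expression.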

Ideally, we would have a single dashboard, with associated alerts, that tells us which resources we need to increase now, before they are exhausted next month.

https://dashboards.gitlab.net/d/TeJU3AIWz/capacity-planning?orgId=1

Related Merge Requests

Add saturation as a general metric gitlab-com/runbooks!1188 (merged)
Add saturation graphs to platform dashboards gitlab-com/runbooks!1190 (merged)
Adds disk space saturation metric gitlab-com/runbooks!1210 (merged)
Add linear prediction values for our saturation metrics. gitlab-com/runbooks!1211 (merged)
Add hard+soft saturation limits for single_threaded_cpu gitlab-com/runbooks!1212 (merged)
Long-term saturation forecasting gitlab-com/runbooks!1213 (merged)
Add memory saturation and SLO goals for memory gitlab-com/runbooks!1214 (merged)
Additional saturation levels gitlab-com/runbooks!1215 (merged)

Edited Aug 09, 2022 by 🤖 GitLab Bot 🤖