Towards long-term capacity planning and alerting

For the past month we have been fighting fires on GitLab.com.

Much of this has been caused by overuse of resources by the application and many of the fixes have involved improving resource utilisation.

As an infrastructure team, we are responsible for communicating with the application development teams, to help them understand that the resources on GitLab.com infrastructure are not a magic carpet bag that will expand to handle everything that gets thrown in it.

One process that can help frame this communication and, importantly, help us prioritise the work needed to avoid resource exhaustion is capacity planning.

Now that we are recording general saturation metrics, we can start using them for long-term trend analysis and answer the following questions:

  1. How long until a particular resource is exhausted?
  2. How fast is our utilization of a particular resource increasing?
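As a rough illustration of how both questions could be answered from saturation samples, here is a minimal sketch that fits a straight line through the samples and extrapolates to a limit. It assumes linear growth, which may not hold for every resource. The function names, the sample format (days against a utilization fraction between 0 and 1), and the 0.9 limit in the usage note are all hypothetical and not taken from our existing tooling.

```python
from typing import Optional, Sequence


def growth_per_day(days: Sequence[float], utilization: Sequence[float]) -> float:
    """Least-squares slope of utilization over time (utilization fraction per day).

    Assumption: growth is roughly linear over the sampled window.
    """
    n = len(days)
    if n < 2 or n != len(utilization):
        raise ValueError("need at least two samples and equal-length inputs")
    mean_x = sum(days) / n
    mean_y = sum(utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, utilization))
    return sxy / sxx


def days_until_exhausted(current: float, slope: float, limit: float) -> Optional[float]:
    """Days until `current` reaches `limit` at a constant `slope`.

    Returns 0.0 if the limit is already reached, and None if utilization
    is flat or falling (no exhaustion predicted under the linear model).
    """
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope
```

For instance, feeding in four daily samples that rise by 0.02 per day answers question 2 directly through `growth_per_day`, and passing that slope with the latest sample and a limit of 0.9 to `days_until_exhausted` answers question 1.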

With this in mind, I've started experimenting with ways of monitoring and alerting on long-term resource utilization.

As a proof of concept, I've put together a dashboard which I hope will help us determine which services will have exhausted their resources one month from now.

The next step would be to start alerting on these events.

For example, if our Redis CPU utilization grows by 10 percentage points a month, and we have 90% CPU utilization as our limit, we should get a notification when utilization reaches 80% that Redis CPU will be exhausted in one month's time.
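The alert condition in that example can be sketched as a simple projection check. This is only an illustration: the function name and parameters are hypothetical, growth is read as percentage points per month, and it is assumed to stay constant over the horizon.

```python
def exhausted_within(
    current_pct: float,
    growth_points_per_month: float,
    limit_pct: float,
    horizon_months: float = 1.0,
) -> bool:
    """Return True if the linear projection reaches the limit within the horizon.

    Assumption: growth is constant, expressed in percentage points per month.
    """
    projected_pct = current_pct + growth_points_per_month * horizon_months
    return projected_pct >= limit_pct


# Redis CPU example: 10 points a month of growth, 90% limit.
print(exhausted_within(80, 10, 90))  # at 80% we are one month from the limit
print(exhausted_within(70, 10, 90))  # at 70% we are still two months away
```

With the numbers above, the first call fires the alert and the second does not.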

Ideally, we would just have a dashboard, with associated alerts, that tells us which resources we need to increase now, before they are exhausted next month.


https://dashboards.gitlab.net/d/TeJU3AIWz/capacity-planning?orgId=1
