Create a table from https://dbt.gitlabdata.com/#!/source/source.gitlab_snowflake.thanos.periodic_queries to calculate the percentage of horizontally and non-horizontally scalable metrics forecast to be at risk of hitting their capacity thresholds in the next 90 days.
Thanos Link
Existing Measures / Systems (not exhaustive)
Overall instrumentation and observability via the many dashboards and alerts
Thanks @glopezfernandez, this is really interesting. Your approach here is a per-service metric, which makes sense. I could see rolling those metrics up to some aggregated score, but then the indicative power of individual problem services can be washed out.
What you have here is a great idea and maybe something to pick up again in the future, but it isn't a "one KPI" measure. Great context though, thank you!
We already model many of these metrics in the saturation monitoring framework, which is used for monitoring and alerting. There are a few metrics that are not included in this framework, but this is not for technical reasons. All of the database saturation metrics could be (and should be) modelled in the saturation monitoring framework.
Each saturation metric is labelled with additional attributes, such as severity, horizontal vs. vertical scalability, service, etc.
This framework is mature and is the primary source of saturation monitoring inside the application. Each saturation resource generates a dashboard such as the one below, and is also integrated into the Service Overview dashboards.
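As a rough illustration, the saturation data this framework produces can be queried directly against Thanos. This is only a sketch: the recording-rule name `gitlab_component_saturation:ratio` and its labels are assumptions based on the framework described above, not confirmed identifiers.

```
# Hypothetical sketch: current saturation ratio (0.0-1.0) per service/component,
# assuming the framework exposes a recording rule named as below.
max by (type, component) (
  gitlab_component_saturation:ratio{env="gprd"}
)
```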
We've discussed this in the past, and I've raised concerns around distilling so many different signals, each with its own volatility and unique pattern, down into a single metric. However, one option which might work well is generating a score based on Tamland forecasts, where the sooner a resource is predicted to hit capacity, the higher its severity, etc., the higher (or lower) the score.
This approach would be relatively easy to add on top of Tamland. It would also be relatively easy to understand, since the source of the predictive data is already in the Tamland report.
One last thing: we send a notification to the #infra-capacity_planning Slack channel on a weekly basis as a reminder to review the latest report; that channel is also where potential capacity issues are discussed.
@sloyd one additional point: GitLab.com is a highly complex, interconnected system. I would caution against putting too much emphasis on a capacity planning metric, especially a single value. It's worth keeping in mind that the primary cause of the availability issues over the past two weeks has been a bug in the Postgres 11 query planner, with a hardcoded threshold very deep within a branch of the query optimiser. While there were many other extenuating circumstances, no capacity planning, no matter how well designed, could have predicted the query planner switching over to such a poor plan and severely degrading our capacity as it did.
@andrewn Thanks for this point and for the Tamland approach you articulated above. What you're expressing here gets at the core of my concern about presenting a single metric: one metric for such a diverse set of potential problems could just end up masking the existence of the next problem and supporting false confidence.
I wanted to get others' take on this and see if there was a potential way to do this that I wasn't considering, though.
@sloyd I'd like to get a better understanding of what problem a single metric is trying to solve. What questions is this metric trying to answer, and what would this metric be used for? I've read through the description but I'm not sure that the ideas and challenges listed reflect how this metric would be used.
@rnienaber my understanding of the ask for any new KPI is that it should better represent the high-level health of an area of our business, in this case how heavily our systems are loaded. Here both "systems" and "load" are meant in the more generic sense.
The primary way I've seen this done with some (limited) success is by picking a representative demand scaling unit (a transaction, for example) and articulating what level the current system can sustain vs. the demand it currently experiences. However, I don't believe that is what has been asked for or envisioned here. I mostly wanted to see if there were other ideas on this from the team.
@andrewn With the recent updates to Tamland (https://gitlab.slack.com/archives/C01AHAD2H8W/p1618474069019600), what do you (or anyone else) think of instead tracking just the number of resources forecast, at 80% confidence, to violate their capacity threshold within 6 months (or similar)? This would create a capacity-related PI that brings attention to the continuing demands, but wouldn't try to tie all the disparate metrics into an aggregated direct measure.
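As a rough sketch of that idea, and assuming the Tamland forecasts were exported to Prometheus as a per-resource "days until violation" series (the metric and label names below are illustrative assumptions, not existing series), the count could be a single PromQL expression:

```
# Number of resources whose 80%-confidence forecast predicts a hard-threshold
# breach within 6 months (~180 days); metric and label names are assumptions.
count(
  tamland_forecast_violation_days{env="gprd", threshold="hard", confidence="80%"} < 180
)
```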
@sloyd I propose that we start collecting some metrics from Tamland in Prometheus.
The metrics would look something like:
# 26 days until patroni/pgbouncer_sync_replica_pool breaches "hard" threshold at 80% confidence interval
tamland_forecast_violation_days{env="gprd",type="patroni",component="pgbouncer_sync_replica_pool",threshold="hard",confidence="80%"} 26

# 46 days until patroni/pgbouncer_sync_replica_pool breaches "hard" threshold on mean confidence interval
tamland_forecast_violation_days{env="gprd",type="patroni",component="pgbouncer_sync_replica_pool",threshold="hard",confidence="mean"} 46

# Repeat for each service/component pair
tamland_forecast_violation_days{env="gprd",type="redis",component="redis_primary_cpu",threshold="hard",confidence="80%"} 120
tamland_forecast_violation_days{env="gprd",type="redis",component="redis_primary_cpu",threshold="hard",confidence="mean"} NaN # Not predicted
Once we start collecting this data, we can look at ways to summarize it.
Since each saturation resource also has attributes for severity and horizontallyScalable, we could look to summarize along those dimensions, either counting the number of forecast violations as you suggested, and/or taking the median/mean number of days until violation.
As a first step, we should start collecting this data. Once we have some data to experiment with, it'll be much easier to figure out the best approach to summarising it.
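As a very rough sketch of what those summaries could look like, assuming the Tamland attributes were exported as Prometheus labels on the series (they are not in the example series above, so the severity label here is an assumption):

```
# Number of forecast violations (hard threshold, 80% confidence) within 90 days,
# grouped by a hypothetical severity label.
count by (severity) (
  tamland_forecast_violation_days{env="gprd", threshold="hard", confidence="80%"} < 90
)

# Mean number of days until violation per severity.
avg by (severity) (
  tamland_forecast_violation_days{env="gprd", threshold="hard", confidence="80%"}
)
```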
Here is a query which shows the percentage of horizontally and non-horizontally scalable metrics forecast to be at risk of hitting their capacity thresholds in the next 90 days.
At present, 20% of non-horizontally scalable resources are at risk, versus 11% of horizontally scalable resources.
I prefer percentages over absolute counts as adding new saturation sources won't skew the data as much over time.
At the moment I've just broken it down by horizontally scalable vs. non-horizontally scalable, as I think this keeps it simple for a headline figure. We could also break it down by service, severity, or other labels, if that made sense.
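For illustration, assuming a horizontally_scalable label were available on the series, the headline percentage could be expressed along these lines (a sketch only, not the exact query referenced above):

```
# Share of resources forecast to breach the hard threshold within 90 days,
# split by a hypothetical horizontally_scalable label. Note: a group with
# zero at-risk resources produces no sample rather than 0.
count by (horizontally_scalable) (
  tamland_forecast_violation_days{env="gprd", threshold="hard", confidence="80%"} < 90
)
/
count by (horizontally_scalable) (
  tamland_forecast_violation_days{env="gprd", threshold="hard", confidence="80%"}
)
```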
We're still working on the path from GCS -> Snowflake in https://gitlab.com/gitlab-data/analytics/-/issues/7713. So we've gotten the data out of the protected environment already, but not yet into the data warehouse.
For future reference, for when we next need this, here is a slightly improved query:
This has been implemented and refined through the capacity planning process, Tamland, and the continuing work from the Scalability teams. Closing this out.