Set up a full performance monitoring solution for all environments and test results

With the first iteration of the framework, performance test pipelines now run frequently against several reference environments. For results collection and monitoring, they currently report into Slack.

I think expanding this into a full monitoring solution (Grafana/Prometheus/Influx) that collects and presents all relevant information in one centralised location would create a lot of value moving forward.

The solution would look something like this:

  • Collect relevant Prometheus metrics (CPU, memory, I/O) from all environments we test against. The environments we currently use are staging, onprem, 10k, pre and pre-puma.
  • Set each load testing tool to report its test results into a cloud location. Artillery and SiteSpeed both appear to support Influx. The results would include things such as response times and requests per second (rps).
  • Set up Grafana dashboard(s) that present all of the data above in meaningful ways.

With this solution, the aim is that we can view all relevant metrics and results in one location for every load test run moving forward. It should also let teams across GitLab check how performance is going by visiting this location.

To start, a POC was created in GCP that allows for full access and experimentation. The next task is to move it over to our central dashboards.gitlab.net instance.

After some experimentation with local setups and dashboards.gitlab.net, the solution will likely be a separate setup for Quality: an aggregator that pulls metrics from all the environments' disparate Prometheus servers into one location via Federation, since we've found Grafana works better that way.
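
To illustrate the Federation idea, here is a minimal sketch that builds the `/federate` URL an aggregator would scrape on each environment's Prometheus server. The hostnames and the `match[]` selector are assumptions for illustration only; the real values depend on where each server lives and how it labels its jobs.

```python
from urllib.parse import urlencode

# Hypothetical hostnames; only the environment names come from this issue.
ENVIRONMENTS = {
    "staging": "http://prometheus.staging.example:9090",
    "pre": "http://prometheus.pre.example:9090",
}

# Illustrative selector; each match[] entry picks the series to pull.
MATCHERS = ['{job="node"}']


def federate_url(base_url: str, matchers: list) -> str:
    """Build the /federate URL with one match[] query parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


for env, base in ENVIRONMENTS.items():
    print(env, federate_url(base, MATCHERS))
```

In an actual setup these URLs would not normally be fetched by hand. The aggregating Prometheus server would have a scrape job with `metrics_path` set to `/federate` and the same `match[]` params, and Grafana would then use that one server as its data source.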

Edited by Grant Young