Inventory of self-managed environments with performance scoring mechanism
This epic is part of the work for the [Self-managed Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/self-managed-scalability/).

This epic will cover:

- Inventory of self-managed customers with scores for their environments

Questions:

- What is the score scale, and what is the scoring based on?
- How do we obtain the score?

@wchandler @tpazitny and I had a quick conversation about scoring customers' environments. We looked not only at the testing Quality is doing but also at the tools Will has built, namely [fast-stats](https://gitlab.com/gitlab-com/support/toolbox/fast-stats), "a tool with minimal memory use to quickly create and compare performance statistics from and between GitLab logs."

Currently, Quality uses Artillery (soon, k6) to run performance tests against a predefined dataset and predefined endpoints. This is very useful for establishing a baseline, not only for the reference environment itself but also for seeing how GitLab performance changes between releases. From a customer perspective, this type of test verifies that their environment, when built to the same specs as our reference environment, performs identically. When it doesn't, we know the differences are likely due to network, disk speed, or other variables specific to their environment, and we can start tracking down those problems from the beginning.

However, there's still the question of how well GitLab will perform under the customer's real-world use: Git repos of various dimensions, unique workloads, and so on. This is where Will's fast-stats tool comes in. Quality can also run fast-stats against the logs from the reference environment and store the results as a benchmark. After customers have been using their environment for a while, we run fast-stats against their logs to see how they compare to the reference environment.
The output will show us how many times slower their environment is at X (a controller, a Gitaly method, etc.); see the fast-stats README for examples. The data may lead to different outcomes. It's certainly possible that a customer's instance is slower with real-world traffic because they're seeing higher RPS than expected, or because IOPS are higher and the disk is too slow. *Or* the problem could be that GitLab itself isn't performing well in a certain place; for example, the `MergeRequestsController` might be slower due to the unique dimensions of their Git repos. In that case we can investigate further and improve things on the GitLab side.

Having this two-step approach to testing a customer's environment against our own will give us lots of information and hopefully set us up for success, with a good feedback loop. Tanya mentioned that there's probably a third iteration here, where we identify what changes should be made to a customer's environment based on the data. For example, if we see that Gitaly is slower in the customer's environment, what does that mean, and what are the steps to correct it?
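To make the "how many times slower" idea concrete, here is a minimal sketch of that kind of comparison. This is not fast-stats itself, just an illustration; it assumes log lines shaped like GitLab's `production_json.log` entries (with `controller`, `action`, and `duration_s` fields).

```python
import json
from collections import defaultdict

def mean_durations(log_lines):
    """Aggregate the mean request duration (seconds) per controller#action
    from production_json.log-style JSON lines."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of durations, count]
    for line in log_lines:
        entry = json.loads(line)
        key = f"{entry['controller']}#{entry['action']}"
        totals[key][0] += entry["duration_s"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def slowdown_ratios(reference, customer):
    """For each controller#action seen in both environments, report how many
    times slower the customer environment is than the reference benchmark."""
    return {key: customer[key] / reference[key]
            for key in customer if key in reference}

# Toy example: one endpoint, customer requests take 0.75s vs 0.25s reference.
reference = mean_durations(
    ['{"controller": "Projects::MergeRequestsController", "action": "show", "duration_s": 0.25}']
)
customer = mean_durations(
    ['{"controller": "Projects::MergeRequestsController", "action": "show", "duration_s": 0.75}']
)
print(slowdown_ratios(reference, customer))
# {'Projects::MergeRequestsController#show': 3.0} — 3x slower than the benchmark
```

A ratio well above 1.0 for a specific controller or Gitaly method is the signal to dig deeper, whether into the customer's infrastructure or into GitLab itself.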