Performance instrumentation of PPI/SMAU
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
*This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->

## Problem

Slow performance is one of the most common user complaints we hear. It was by far the number one piece of feedback from our most recent [NPS survey (GitLab employees only)](https://docs.google.com/spreadsheets/d/1kPthsAfZVppJHsAkM54MiRLc9R-waf6rdjyyJViuZfY/edit#gid=195588039), and it comes up frequently on Hacker News and other social media. A competitor also published a website with performance results: https://forgeperf.org/. From this data, it is clear that GitLab is significantly slower than GitHub, among others. We need to address the performance challenges we face, in particular on GitLab.com, in order to improve user satisfaction.

### Current state

We have a [robust set of backend performance metrics](https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s) available from GitLab.com, and are working to improve our understanding of self-managed: https://gitlab.com/groups/gitlab-org/-/epics/3209. We also run Sitespeed.io against a set of GitLab.com web pages, which tracks page load performance.

While we have made progress in improving performance, it is taking too long. GitLab is still demonstrably slower than other products in this space, nearly 2x slower than GitHub in multiple tests.
While efforts are underway with the Ecosystem (Frontend Foundation) and Memory teams to solve specific issues, there is a long tail of 1,000+ performance issues that have been open for some time: https://gitlab.com/groups/gitlab-org/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=performance.

## Proposed solution

Performance is everyone's responsibility, and based on the current state, performance issues are not being prioritized well. In order to truly solve this problem long-term, we will need to solve the prioritization problem. While individual teams like Memory and Ecosystem can work to address major/foundational items, they cannot solve this problem on their own.

### Elevating performance to a feature

Currently we are working to understand the adoption of various parts of GitLab, via North Stars and other Performance Indicators: https://gitlab.com/gitlab-com/Product/-/issues/880. In many cases, we are looking at active users by tracking specific high-value actions. For example, capturing interactions with issues for gitlab~10690691: https://gitlab.com/gitlab-data/analytics/-/issues/4291. These PIs are reviewed monthly, along with a set of supporting metrics.

Currently, for most groups, we track individual feature usage but do not track the perceived performance of these features by end users. We should elevate performance to be a "feature", and track it as a PI on each group's dashboard. This should increase awareness and mind share, and give us the ability to correlate improved performance with additional adoption. For mature groups these may be highly correlated, while less mature areas may see more return on building out the initial feature set. But the critical aspect here is to ensure the data is represented, to close the feedback loop.

### Understanding feature / workflow performance

Currently we have backend performance metrics, along with individual page load times. This is helpful, but users don't accomplish a task by loading a single page.
Often it takes a series of interactions to complete what you were trying to do. For example:

* Actioning a todo: load the todo page, pick a todo, and load the issue/MR
* Reviewing an MR: click on MRs assigned to you, pick the first MR, load all comments, open the diff and expand all files, and leave a comment
* Issue boards: load the Issue Boards page, load a board, and move an issue from one column to another
* etc.

All of these require multiple page loads, and in some cases dynamic loading within a page, for example when expanding a file. We don't have this instrumentation available today, so it is difficult to understand the time it takes to complete a specific task. The results can be pretty surprising. Loading an MR, expanding the pipeline minigraph, then loading a specific job takes **15 seconds** to complete. If you were to only look at the page load speed of an MR or a job, you might not appreciate the extent of the performance problem.

## Next Steps

### MVC: Elevate performance to a feature by adding it to group PIs

As an initial step, each group should select specific page(s) which are central to their group's feature usage. These can then be added to Sitespeed, and a Grafana dashboard specific to each group can be created. This way each PM/EM can see the performance of key pages over time for their specific group. Note that this can be an overview page, without all of the detail present in Sitespeed. If performance needs to be looked into, detailed metrics can then be pulled up on a separate dashboard.

### Next: Expand from page load to workflow (AMAU)

The next goal would be to shift from a single page load to a series of scripted page loads performed while completing an action. Specifically helpful would be the workflows that we track for various AMAU. We explored a few solutions, like scripted Sitespeed, which are potential candidates for this scripting, but have not made a final determination on tooling.
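To make the idea concrete, a workflow like "actioning a todo" could be recorded as a sequence of named, timed steps plus a total. The sketch below is illustrative only: the `WorkflowTimer` class and step names are hypothetical, and the step bodies are stubs standing in for real browser interactions (e.g. a scripted Sitespeed or Capybara session).

```ruby
require 'benchmark'

# Hypothetical workflow timer: records wall-clock seconds per step and a
# total for the whole task. Step bodies here are sleep stubs; a real
# harness would drive a browser through each interaction.
class WorkflowTimer
  attr_reader :name, :steps

  def initialize(name)
    @name = name
    @steps = {}
  end

  def step(label, &block)
    @steps[label] = Benchmark.realtime(&block).round(3)
  end

  def total
    @steps.values.sum.round(3)
  end
end

timer = WorkflowTimer.new('action_a_todo')
timer.step('load_todos_page')  { sleep 0.01 } # stub for: visit the todos page
timer.step('open_first_todo')  { sleep 0.01 } # stub for: click the first todo
timer.step('load_issue_or_mr') { sleep 0.01 } # stub for: wait for the target page
puts "#{timer.name}: #{timer.steps} total=#{timer.total}s"
```

Recording per-step durations, not just the total, is what would let us see that (in the minigraph example above) most of the 15 seconds comes from one step rather than being spread evenly across the workflow.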
### Next +1: Incorporate data into SiSense

SiSense is where most of our key product metrics are stored, so representing this data there would provide the most visibility and mind share. This way PMs and EMs wouldn't have to check another dashboard. While these runs generate limited performance data, the key measure we want is the time to complete the defined task. We can measure this by simply measuring how long it takes to complete the Capybara run(s). If one needs to do additional debugging/analysis, they can then leverage other tools to find the root cause and fix it.
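Since the only number SiSense needs is the end-to-end duration, the measurement itself can be very thin: wrap the run and emit one record. A minimal sketch, assuming a hypothetical `run_review_mr_workflow` method standing in for the actual Capybara scenario:

```ruby
require 'benchmark'
require 'json'

# Stand-in for a Capybara scenario; a real run would drive a browser
# through the whole workflow (load the MR, expand the diff, comment, ...).
def run_review_mr_workflow
  sleep 0.02
end

# The single number we want to chart in SiSense: wall-clock seconds for
# the end-to-end run, emitted as a record a data pipeline could pick up.
elapsed = Benchmark.realtime { run_review_mr_workflow }
record = { workflow: 'review_mr', duration_seconds: elapsed.round(3) }
puts record.to_json
```

Keeping the emitted record to a workflow name and a duration keeps the SiSense side simple; any deeper analysis stays in the detailed dashboards mentioned above.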