WIP: Client Timing Statistics
What does this MR do?
This is a fully functioning proof-of-concept showing how we can easily record real-world usage timing statistics for GitLab.com (and other instances) with very few code changes.
Why do this?
Currently GitLab.com performance is measured using Pingdom and blackbox testing of a small set of known URLs.
The response times are used as an indicator of user experience, but in my opinion they're a terrible proxy for real data:
- The small set of URLs used in the tests does not reflect what our users actually do.
- Blackbox tests measure only a single component of the time it takes for a browser to render a page.
Here is a list of things that have a huge impact on user experience but are not reflected in our current testing:
- Size of response. Most blackbox testing occurs from the cloud to the cloud. Real users have slower connections so large HTML payloads are far slower in reality.
- Related: compressed responses.
- Usage of, and performance of, CDN for serving static content.
- Best practices around resource loading (asynchronous loading, etc.)
- Correct caching headers
- Preloading (DNS, resources, HTTP/2 push in future, etc.)
All of these components have a major influence on the time it takes to render a page on GitLab.com, but currently we have no data that reflects whether they're good or bad - and more importantly whether they're getting better or worse.
Our current approach is to optimise for a single set of URLs. This makes it easy to optimise for a local maximum while ignoring the bigger picture.
What does it do?
This MR measures the actual response times on GitLab.com of real-world users and stores the data in Prometheus histograms.
It uses the timing data recorded by the browser itself, which is a much better metric to optimise against because it includes every factor that affects browser rendering time.
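As a rough illustration of what "data recorded by the browser" can look like, here is a minimal TypeScript sketch that turns a Navigation Timing entry into a few durations in seconds. It is not the code in this MR: the `NavigationTimingLike` and `ClientTimingSample` shapes, the `toSample` helper and the endpoint path in the comment are all hypothetical names chosen for the example. The only assumption about the browser is that it exposes the standard `PerformanceNavigationTiming` fields used below.

```typescript
// Structural subset of the browser's PerformanceNavigationTiming entry.
// All values are milliseconds on the same clock as startTime.
interface NavigationTimingLike {
  startTime: number;
  responseStart: number;
  domContentLoadedEventEnd: number;
  loadEventEnd: number;
}

// Hypothetical payload shape; the field names are illustrative only.
interface ClientTimingSample {
  timeToFirstByteSeconds: number;
  domContentLoadedSeconds: number;
  pageLoadSeconds: number;
}

// Convert a navigation entry into durations measured from navigation start.
function toSample(entry: NavigationTimingLike): ClientTimingSample | null {
  // loadEventEnd stays 0 until the load event has finished, so skip early reads.
  if (entry.loadEventEnd <= 0) {
    return null;
  }
  const seconds = (ms: number): number => (ms - entry.startTime) / 1000;
  return {
    timeToFirstByteSeconds: seconds(entry.responseStart),
    domContentLoadedSeconds: seconds(entry.domContentLoadedEventEnd),
    pageLoadSeconds: seconds(entry.loadEventEnd),
  };
}

// In a browser this could be wired up roughly as follows (not executed here;
// the URL is a placeholder, not an endpoint defined by this MR):
//   const [entry] = performance.getEntriesByType("navigation");
//   const sample = toSample(entry as PerformanceNavigationTiming);
//   if (sample) navigator.sendBeacon("/hypothetical/timing", JSON.stringify(sample));
```

Keeping the conversion in a pure function means it can be unit tested without a browser, and the server only has to feed the reported durations into its histograms.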
Graph of Results
This graph shows the average amount of time it takes for a webpage to be fully loaded. The data is obviously very coarse since I don't have a great deal of data locally.
sum(rate(client_browser_timing_sum[10m])) / sum(rate(client_browser_timing_count[10m]))
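To see why dividing the rate of `client_browser_timing_sum` by the rate of `client_browser_timing_count` gives an average, here is a small TypeScript sketch of a histogram that tracks the same three kinds of series Prometheus exposes (cumulative buckets, a running sum and a running count). The `TimingHistogram` class, `meanOverWindow` helper and bucket bounds are assumptions made for this example, not the client library's API. The sketch also ignores counter resets and the extrapolation that `rate()` performs.

```typescript
// One cumulative bucket: counts every observation less than or equal to `le`.
interface Bucket {
  le: number;
  count: number;
}

// Running totals, corresponding to the _sum and _count series.
interface Snapshot {
  sum: number;
  count: number;
}

// Minimal stand-in for a Prometheus histogram (illustrative only).
class TimingHistogram {
  readonly buckets: Bucket[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    this.buckets = upperBounds.map((le) => ({ le, count: 0 }));
  }

  observe(seconds: number): void {
    // Buckets are cumulative, so an observation lands in every bucket it fits.
    for (const bucket of this.buckets) {
      if (seconds <= bucket.le) {
        bucket.count += 1;
      }
    }
    this.sum += seconds;
    this.count += 1;
  }

  snapshot(): Snapshot {
    return { sum: this.sum, count: this.count };
  }
}

// Increase in sum divided by increase in count over a window. The window
// length appears in both rates in the query, so it cancels out here.
function meanOverWindow(start: Snapshot, end: Snapshot): number {
  const observations = end.count - start.count;
  return observations === 0 ? NaN : (end.sum - start.sum) / observations;
}
```

Taking a snapshot, recording a few page loads, and taking a second snapshot lets `meanOverWindow` return the mean load time for just that window, which is what the query computes over each 10 minute range.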