Provide enhanced production metrics dashboard

For operations teams and managers, it is important to ensure that a service meets its key targets. For example, a company may have SLAs written into agreements that guarantee a specific level of uptime, or it may have an internal goal that 99% of all requests are processed within 200ms.
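
As a rough illustration of the internal-goal example, the check below computes the share of requests that finished within a latency threshold and compares it to a target. This is only a sketch: the function names, the default values, and the sample data are hypothetical and not part of any GitLab API.

```python
def fraction_within(latencies_ms, threshold_ms=200.0):
    """Return the fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return within / len(latencies_ms)


def meets_target(latencies_ms, threshold_ms=200.0, target=0.99):
    """True when at least `target` of the requests finished within threshold_ms."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 99 fast requests and 1 slow one.
sample = [120.0] * 99 + [450.0]
```

With this sample, `meets_target(sample)` holds because exactly 99 of the 100 requests finished within 200ms; one more slow request would miss the goal.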

With GitLab and our monitoring capabilities, we can help companies track these important metrics. Initially, the browser performance tests (https://gitlab.com/gitlab-org/gitlab-ee/issues/3046) could act both as a Pingdom-like service and as a way to exercise multiple parts of an application.

This would measure the actual latency as perceived by a browser and show whether requests were being answered at all. Further, since these tests run on Runners, a company could place Runners in various parts of the world for even deeper levels of testing.

A few key metrics that could be tracked:

Uptime & current SLA
p95/p99 response times
Error rates
Failed authentications
Current & Max Sessions
Deploys, rollbacks (?)
Cluster health / resources
Component Status (Red/Yellow/Green)
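
A few of the metrics above (uptime, error rate, p95/p99 response times) can be derived from a plain list of probe results. The sketch below is one possible way to do that. It assumes uptime is approximated as the share of successful probes and uses the nearest-rank method for percentiles; `Sample`, `percentile`, and `summarize` are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    ok: bool           # the probe received a successful response
    latency_ms: float  # latency observed by the probe


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Summarize probe samples into uptime, error rate, and latency percentiles."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples")
    failures = sum(1 for s in samples if not s.ok)
    # Assumption: latency percentiles only consider successful probes.
    latencies = [s.latency_ms for s in samples if s.ok]
    return {
        "uptime_pct": 100 * (total - failures) / total,
        "error_rate_pct": 100 * failures / total,
        "p95_ms": percentile(latencies, 95) if latencies else None,
        "p99_ms": percentile(latencies, 99) if latencies else None,
    }
```

Counting successful probes is a simplification; a contractual SLA may define uptime over time windows instead, in which case the aggregation would need to change.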

These could also extend into business metrics:

Conversion rates
Daily Active Users
New user registrations
Infrastructure cost/scaling
etc...

This could serve as an excellent executive dashboard, providing an overview of the health of a service across a few different dimensions. Initially, this could serve as an internal dashboard within GitLab, but some of these building blocks could also power a public-facing SaaS service status page.

The general workflow could be:

Track the uptime of / over time
Support additional URLs (https://gitlab.com/gitlab-org/gitlab-ee/issues/3540)
Incorporate additional performance metrics like response times, error rates
Add support for business metrics
Add CI/CD data like deploys, rollbacks
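
The first step of this workflow, tracking the uptime of /, could start from a single-request probe like the sketch below. It uses only the Python standard library and is not how the browser performance tests work. The `probe` function, its `fetch` parameter (added so the probe can be exercised without network access), and the rule that 2xx/3xx responses count as "up" are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Request url once and return (ok, status_code, latency_ms)."""
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            code = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        code = err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, timeout, etc.
        code = None
    latency_ms = (time.monotonic() - start) * 1000
    # Assumption: any 2xx or 3xx response counts as "up".
    ok = code is not None and 200 <= code < 400
    return ok, code, latency_ms
```

Calling `probe` on a schedule (for example from a scheduled pipeline job) and storing each result would yield the time series that the later steps build on; supporting additional URLs then amounts to probing a list instead of a single path.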

Edited Jul 29, 2019 by Kenny Johnston