Add Monitoring Support for Additional Metrics (Errors, Throughput, Latency)

Description

As part of the Prometheus project, we will add support for the metrics that k8s exposes out of the box once Prometheus is integrated. This is a great set of baseline metrics that provides insight into CPU, memory, and a handful of other statistics.

While this is a great beginning, there is significant value in monitoring additional time series. The high-value ones we do not support today are error rates, throughput, and request duration (latency). This is largely because k8s does not provide them, so they must come from another Prometheus-supported service.

Proposal

For customers whose applications support it, we should allow these metrics to be consumed and monitored. In most cases these are the "parakeets": key metrics that can indicate a problem below the service without relying on the end user to instrument a large number of low-level services.

Since these metrics are not provided by k8s, we need a way to identify their names, which will vary with the customer's app. I suggest that we ship with standard default names, which we can recommend to end users. However, if a customer has already instrumented their code for these metrics, we should also provide a way to override the defaults with custom names.
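The default-plus-override behavior could look roughly like the sketch below. It is only an illustration: the category keys, the default metric names, and the function resolve_metric_names are all hypothetical and are not specified anywhere in this proposal.

```python
# Hypothetical default metric names, one per category. These are
# placeholders for illustration; the proposal does not define them.
DEFAULT_METRIC_NAMES = {
    "error_rate": "http_requests_errors_total",
    "throughput": "http_requests_total",
    "latency": "http_request_duration_seconds",
}


def resolve_metric_names(overrides=None):
    """Return the metric name to query for each category.

    Customer-supplied overrides win over the shipped defaults.
    Empty override values are ignored so the default stays in place.
    """
    names = dict(DEFAULT_METRIC_NAMES)
    for category, name in (overrides or {}).items():
        if category not in DEFAULT_METRIC_NAMES:
            raise KeyError(f"unknown metric category: {category}")
        if name:
            names[category] = name
    return names


# A customer who already instrumented latency under a custom name
# overrides only that category and keeps the other defaults.
customer_names = resolve_metric_names({"latency": "app_latency_seconds"})
```

Rejecting unknown categories is one possible design choice here; it surfaces typos in a customer's configuration early instead of silently ignoring them.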

Once collected, these metrics should also be shown on the Environment page and the Merge Request page.
