Add Rails error rate by controller and action metric
Problem
From this discussion, it turns out that it's impossible to aggregate the request error rate by controller and action. The error rate definition widely used across services counts requests answered with a 5xx status, or requests that raise an exception (and are eventually answered with a 500 status). Currently, all aggregations are done by type (git/web/api/...) or feature category. Two Prometheus metrics are exported to track HTTP requests from the Rails layer:
- http_requests_total (including method and status)
- gitlab_transaction_duration_seconds (including controller and action).
Both gitlab_transaction_duration_seconds and http_requests_total are used in the runbook. In some cases, gitlab_transaction_duration_seconds is used because http_requests_total doesn't carry the equivalent information. http_requests_total seems to provide generic information, while gitlab_transaction_duration_seconds tends to be used for detailed breakdowns at the Rails layer only. The cardinality of gitlab_transaction_duration_seconds is much lower. Therefore, the error rate should be tracked by this metric instead.
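To make the error rate definition above concrete, here is a minimal Ruby sketch of the aggregation this issue is after: the share of 5xx responses per controller and action. It does not query Prometheus; the `Sample` struct and the `error_rate_by_endpoint` helper are hypothetical names used only for illustration, and the sample counts are made up.

```ruby
# Each sample mimics one series of a counter that carries controller,
# action, and status labels (hypothetical data shape, not a GitLab API).
Sample = Struct.new(:controller, :action, :status, :count)

# Returns { [controller, action] => error_rate }, where an error is any
# request answered with a 5xx status.
def error_rate_by_endpoint(samples)
  samples.group_by { |s| [s.controller, s.action] }.transform_values do |group|
    total = group.sum(&:count)
    errors = group.select { |s| s.status.between?(500, 599) }.sum(&:count)
    total.zero? ? 0.0 : errors.to_f / total
  end
end

# Made-up counts for illustration only.
samples = [
  Sample.new('ProjectsController', 'show', 200, 90),
  Sample.new('ProjectsController', 'show', 500, 10),
  Sample.new('IssuesController', 'index', 200, 50)
]

rates = error_rate_by_endpoint(samples)
```

With these made-up counts, `rates` maps ProjectsController#show to an error rate of 0.1 and IssuesController#index to 0.0. The same grouping is only possible on a metric that carries controller, action, and status together.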
Proposal
- Add status into gitlab_transaction_duration_seconds in Gitlab::Metrics::RackMiddleware.
- Add Rails error rules to the Prometheus rules in the runbook. The rules include error count and error rate aggregated over 1m and 5m.
- Add the Rails error rate to the Rails controller dashboards (web:Rails controller, api:Rails controller, and git:Rails controller).
- Update dashboards for stage groups to adhere to the controller/action filters (see gitlab-com/runbooks!3100 (comment 484283837) for more information).
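The first item of the proposal could look roughly like the sketch below: a Rack middleware that times each request and records the response status as a label next to controller and action. This is not the actual Gitlab::Metrics::RackMiddleware code. `SketchHistogram`, `StatusLabelMiddleware`, and the `sketch.controller` / `sketch.action` env keys are hypothetical stand-ins, and treating an unhandled exception as status 500 is an assumption based on the error definition in the Problem section.

```ruby
# Hypothetical in-memory stand-in for a Prometheus histogram; the real
# metric is not reproduced here.
class SketchHistogram
  attr_reader :observations

  def initialize
    @observations = Hash.new { |hash, labels| hash[labels] = [] }
  end

  # Record one duration (in seconds) under the given label set.
  def observe(labels, seconds)
    @observations[labels] << seconds
  end
end

# Hypothetical Rack middleware: times the request and attaches the
# response status as a label alongside controller and action.
class StatusLabelMiddleware
  def initialize(app, histogram)
    @app = app
    @histogram = histogram
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status = 500 # assumed label value when the app raises before responding
    begin
      status, headers, body = @app.call(env)
      [status, headers, body]
    ensure
      # Runs on success and on exceptions, so failed requests are counted too.
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @histogram.observe(labels_for(env, status), elapsed)
    end
  end

  private

  def labels_for(env, status)
    # 'sketch.controller' and 'sketch.action' are made-up env keys.
    {
      controller: env.fetch('sketch.controller', 'unknown'),
      action: env.fetch('sketch.action', 'unknown'),
      status: status.to_s
    }
  end
end

histogram = SketchHistogram.new
app = ->(_env) { [200, {}, ['ok']] }
middleware = StatusLabelMiddleware.new(app, histogram)
middleware.call({ 'sketch.controller' => 'ProjectsController', 'sketch.action' => 'show' })
```

Recording the observation in an `ensure` block means requests that raise are still measured and labelled, which is what lets an error rate rule select on a 5xx status for a given controller and action.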