Make it possible to define custom request duration thresholds in the Rails application
We currently use Gitlab::WithFeatureCategory to set a feature category for an endpoint. Perhaps we could extend that interface to also include a request duration threshold.
We should update the specs for all controller endpoints and all API endpoints to check that the configured threshold is < 10 seconds. This is a very rudimentary test to make sure that we don't configure excessive request durations.
The default of 1 and the maximum of 10 are based on what we currently use as the SLI for all requests: https://gitlab.com/gitlab-com/runbooks/blob/838f9171e8445192d3c9f57c0cf1933e6f6207eb/metrics-catalog/services/api.jsonnet#L158-159. The error budget is currently capped at 1 second, so this configuration would allow some requests to be configured more leniently.
We will probably need to iterate on this: if very busy endpoints get set to 10s, we'd have a bad time. So perhaps we should think of a process to keep an eye on this? Perhaps require a scalability review for each threshold configured?
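The rudimentary check described above could be sketched roughly as follows. This is only an illustration in plain Ruby: the `excessive_thresholds` helper, the constant name, and the shape of the input hash (endpoint name mapped to a threshold in seconds) are assumptions, not existing GitLab code.

```ruby
# Hypothetical maximum, in seconds, taken from the "< 10" rule above.
MAX_REQUEST_DURATION_SECONDS = 10

# Returns the names of endpoints whose configured threshold is not
# strictly below the maximum. `thresholds` is assumed to map an
# endpoint name to its threshold in seconds.
def excessive_thresholds(thresholds, max: MAX_REQUEST_DURATION_SECONDS)
  thresholds.select { |_endpoint, seconds| seconds >= max }.keys
end
```

A spec could then assert that this list is empty for every controller and API class.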
Proposal
One good way of doing this is to extend the recent feature_category DSL to include action options. Fortunately, the recent DSL signature declares the actions as an array instead of splatting them: def feature_category(category, actions = []). We can easily add the action options as keyword arguments. I think the new interface would look like this:
feature_category :code_review # Default is 1s - medium
feature_category :code_review, expected_duration: :slow
feature_category :accessibility_testing, [:accessibility_reports] # Default is 1s - medium
feature_category :infrastructure_as_code, [:terraform_reports], expected_duration: :fast
feature_category :continuous_integration, [:pipeline_status, :exposed_artifacts], expected_duration: :slow
# Grape API, declare at API handler level
feature_category :users, [
  '/users/:id/custom_attributes',
  '/users/:id/custom_attributes/:key'
], expected_duration: :fast
# Grape API, declare at handler level
post '/two_factor_config', feature_category: :authentication_and_authorization, expected_duration: :very_fast do
  # Blah blah
end
I don't use the term "threshold" because it is confusing about how an endpoint with this option behaves. Developers may get the impression that the endpoint has a timeout and returns a 408 status. The term "expected duration" conveys that we expect an endpoint to be fast or slow, and that it's not the end of the world if the endpoint does not meet the expectation. Stage groups can set an expectation for each endpoint. It looks like a good name for the option.
Technically, they could pass a number to set an arbitrary duration. But I don't think we should allow that at this stage, nor do I think any stage group would find it useful. The new options consist only of very fast (0.25s), fast (0.5s), medium (1s), and slow (5s).
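To make the proposed interface concrete, here is a minimal sketch of how the keyword argument and the four named durations might fit together. It is not the actual Gitlab::WithFeatureCategory implementation: the module name `ExpectedDurationSketch`, the `expected_duration_for` lookup, and the example controller class are all hypothetical.

```ruby
# Hypothetical sketch; names are illustrative, not GitLab's actual code.
module ExpectedDurationSketch
  # Named durations from the proposal, in seconds.
  DURATIONS = {
    very_fast: 0.25,
    fast: 0.5,
    medium: 1,
    slow: 5
  }.freeze

  DEFAULT_DURATION = :medium

  # Declares a feature category for the given actions (or for every
  # action when the list is empty), with an optional expected duration.
  def feature_category(category, actions = [], expected_duration: DEFAULT_DURATION)
    unless DURATIONS.key?(expected_duration)
      raise ArgumentError, "unknown expected_duration: #{expected_duration.inspect}"
    end

    config = { feature_category: category, expected_duration: expected_duration }

    if actions.empty?
      @default_endpoint_config = config
    else
      actions.each { |action| endpoint_configs[action.to_s] = config }
    end
  end

  # Expected duration in seconds for an action, falling back to the
  # class-wide declaration and then to the 1s default.
  def expected_duration_for(action)
    config = endpoint_configs[action.to_s] || @default_endpoint_config
    DURATIONS.fetch(config ? config[:expected_duration] : DEFAULT_DURATION)
  end

  private

  def endpoint_configs
    @endpoint_configs ||= {}
  end
end

# Hypothetical controller using the sketched DSL.
class PipelinesControllerSketch
  extend ExpectedDurationSketch

  feature_category :code_review
  feature_category :continuous_integration, [:pipeline_status], expected_duration: :slow
end
```

Rejecting unknown symbols with an ArgumentError keeps the option restricted to the four named durations, in line with not accepting arbitrary numbers at this stage.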
In the future, I think we can implement a Danger bot to post a polite recommendation and tag the scalability team as a reviewer if an endpoint's expected duration is set to slow.
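As a rough idea of what such a bot would look for, the sketch below scans a unified diff for newly added lines that declare a slow expected duration. The `slow_duration_additions` helper and the regular expression are assumptions for illustration; how this would be wired into a Dangerfile, and how the reviewer would be tagged, is left out.

```ruby
# Hypothetical pattern matching a declaration such as
# `expected_duration: :slow` anywhere on a line.
SLOW_DURATION_PATTERN = /expected_duration:\s*:slow\b/

# Returns the added lines of a unified diff (lines starting with "+",
# excluding the "+++" file header) that set a slow expected duration.
def slow_duration_additions(diff_text)
  diff_text.each_line.map(&:chomp).select do |line|
    line.start_with?('+') &&
      !line.start_with?('+++') &&
      line.match?(SLOW_DURATION_PATTERN)
  end
end
```

If the returned list is non-empty, the bot could post its recommendation and request a scalability review.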
I'll update the description and move forward.