Use http read total timeout by default (logging errors rather than raising)
What does this MR do?
Related to https://gitlab.com/gitlab-org/gitlab/-/issues/334930 (this is a confidential issue but it was decided that we don't have to follow the security workflow)
In https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/1379 (internal only) we introduced a total read timeout for HTTP requests. This changed a fundamental part of the system, but we couldn't put it behind a feature flag and roll it out slowly while making sure it doesn't break anything. Instead, we added an opt-in option and used the new timeout only where it was necessary. Now it looks like the timeout is stable, so we can make it the default behavior.
This MR changes:
- all requests default to using the read total timeout, but only the existing calls that pass
use_read_total_timeout
will raise an exception (current behaviour) all other calls will instead log the error. This allows us to monitor impacts of making the read total timeout the default. - adds
skip_read_total_timeout
opt-out option. Skipping the total timeout is needed in some cases where a request is expected to be streamed, like downloading a file. -
response.to_s
toresponse.body
in several places becauseto_s
is not guaranteed to returnbody
as a string (see httparty/response.rb#L88-L94)
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Merge request reports
Activity
changed milestone to %14.2
added backend devopscreate groupecosystem [DEPRECATED] typemaintenance labels
added typefeature label
1 Message CHANGELOG missing: If you want to create a changelog entry for GitLab FOSS, add the
Changelog
trailer to the commit message you want to add to the changelog.If you want to create a changelog entry for GitLab EE, also add the
EE: true
trailer to your commit message.If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.
Reviewer roulette
Changes that require review have been detected!
Please refer to the table below for assigning reviewers and maintainers suggested by Danger in the specified category:
Category Reviewer Maintainer backend Sashi Kumar Kumaresan ( @sashi_kumar
) (UTC+5.5, 3.5 hours ahead of@Andysoiron
)Kerri Miller ( @kerrizor
) (UTC-7, 9 hours behind@Andysoiron
)To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot, based on their timezone. Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.
If needed, you can retry the
danger-review
job that generated this comment.Generated by
Dangeradded sectiondev label
added 99 commits
-
0c0573c1...3b85ee3a - 98 commits from branch
master
- ad624948 - Use http read total timeout by default
-
0c0573c1...3b85ee3a - 98 commits from branch
added 233 commits
-
e7b8d3e1...c83cf1f8 - 231 commits from branch
master
- 2e847531 - Use http read total timeout by default
- 59416ecf - Fix usage of response.to_s
-
e7b8d3e1...c83cf1f8 - 231 commits from branch
added 82 commits
-
59416ecf...6e700f55 - 79 commits from branch
master
- 9093149d - Use http read total timeout by default
- 7f435006 - Fix usage of response.to_s
- 170b1cac - Skip read total timeout for Jira client
Toggle commit list-
59416ecf...6e700f55 - 79 commits from branch
marked the checklist item I have included changelog trailers, or none are needed. (Does this MR need a changelog?) as completed
marked the checklist item I have added/updated documentation, or it's not needed. (Is documentation required?) as completed
marked the checklist item I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) as completed
marked the checklist item I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) as completed
marked the checklist item I have self-reviewed this MR per code review guidelines. as completed
marked the checklist item This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) as completed
marked the checklist item I have followed the style guides. as completed
marked the checklist item This change is backwards compatible across updates, or this does not apply. as completed
marked the checklist item I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) as completed
marked the checklist item I have tested this MR in all supported browsers, or it's not needed. as completed
marked the checklist item I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed. as completed
added 305 commits
-
170b1cac...2f5a0d41 - 304 commits from branch
master
- 95c8e4d2 - Use http read total timeout by default
-
170b1cac...2f5a0d41 - 304 commits from branch
- Resolved by Andy Schoenen
Hello @pskorupa
can you do the initial review?
requested review from @pskorupa
- Resolved by Andy Schoenen
removed review request for @pskorupa
- Resolved by Luke Duncalfe
@pskorupa, thanks for approving this merge request.Please consider starting a new pipeline if:
- This is the first time the merge request is approved, or
- The merge request is ready to be merged, and there has not been a merge request pipeline in the last 2 hours.
For more info, refer to the guideline.
added 1 commit
- 73ac5227 - Use named subject in download_blob_serivce_spec
Hey @.luke would you be up for the maintainer review?
requested review from @.luke
- Resolved by Luke Duncalfe
- Resolved by Luke Duncalfe
Skipping the total timeout is needed in some cases where a request is expected to be streamed, like downloading a file.
@Andysoiron What happens to a streaming request when total timeout is applied that means we can't apply it?
I'm assuming this means that those parts of the app are "vulnerable" to https://gitlab.com/gitlab-org/gitlab/-/issues/334930, but because both parts of the app that are skipping the total read timeout are interacting with external services that can't be controlled by an attacker (LFS object storage and wherever we store container images, unlike webhooks) then this is okay.
I've left one suggestion! When I've approved I'll also send this to a second maintainer for extra
.
removed review request for @.luke
requested review from @.luke
- Resolved by Luke Duncalfe
@mhenriksen could you do a security review here?
This is applying the total read timeout introduced in https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/1379 to all HTTP request except for downloadBlobService and LfsDownloadService because those need to download larger files that could be over the timeout limit. Since it's hard to come up with a timeout, and the download server can't be modified by a user, is it okay to skip the timeout in those cases?
requested review from @mhenriksen
- Resolved by Luke Duncalfe
@mhenriksen, thanks for approving this merge request.Please consider starting a new pipeline if:
- This is the first time the merge request is approved, or
- The merge request is ready to be merged, and there has not been a merge request pipeline in the last 2 hours.
For more info, refer to the guideline.
- Resolved by Andy Schoenen
- Resolved by Luke Duncalfe
Thanks, @Andysoiron!
I've left a couple of suggestions!
I'm beginning to wonder if we should deploy an intermediary step, which might increase the overall time to deliver this feature, where we continue to raise the exception for the HTTP calls that currently use
use_read_total_timeout
(the calls that we know were vulnerable due to the server being external), and for all other calls, we log to Sentry rather than raise.Then after a couple of weeks, we can see if there should be some parts of the app that should be exempt, before finalising the feature.
This is because I wonder if a possibility exists that there is a part of the app that performs non-streaming requests, and that previously was able to initiate long network requests, that might now find some of those requests fail? If I understand,
read_timeout
doesn't apply once the remote server is sending data.It could be a safer way to deploy this change in a way that doesn't disrupt services.
What do you think?
I think the change would look something like this:
diff --git a/lib/gitlab/http.rb b/lib/gitlab/http.rb index 7e45cd216f5..2ba3184aad1 100644 --- a/lib/gitlab/http.rb +++ b/lib/gitlab/http.rb @@ -43,16 +43,22 @@ def self.perform_request(http_method, path, options, &block) options end - unless options.has_key?(:use_read_total_timeout) + if options[:skip_read_total_timeout] return httparty_perform_request(http_method, path, options_with_timeouts, &block) end start_time = Gitlab::Metrics::System.monotonic_time read_total_timeout = options.fetch(:timeout, DEFAULT_READ_TOTAL_TIMEOUT) + read_total_timeout_logged = false httparty_perform_request(http_method, path, options_with_timeouts) do |fragment| elapsed = Gitlab::Metrics::System.monotonic_time - start_time - raise ReadTotalTimeout, "Request timed out after #{elapsed} seconds" if elapsed > read_total_timeout + + if options.has_key?(:use_read_total_timeout) + raise ReadTotalTimeout, "Request timed out after #{elapsed} seconds" if elapsed > read_total_timeout + else + # Log here unless read_total_timeout_logged + end block.call fragment if block
removed review request for @.luke
added 974 commits
-
73ac5227...139c8fb9 - 973 commits from branch
master
- d778157b - Use http read total timeout by default
-
73ac5227...139c8fb9 - 973 commits from branch
Thank you @.luke! I've applied your suggestions. Can you have another look?
requested review from @.luke and removed review request for @mhenriksen