Fix HTTP retries not working properly
Previously, if GitLab rate-limited an HTTP request to the runner API, the runner would attempt to retry it but omit the body. The job log would show a message such as:
```
couldn't execute PUT against https://gitlab.example.com/api/v4/jobs/123: Put "https://gitlab.example.com/api/v4/jobs/123": http: ContentLength=972 with Body length 0
```
This happened because the rate limiter requester would reuse the `http.Request`, but the `Body` had already been consumed by the first attempt. This prevented HTTP retries from working altogether. The GitLab server displayed EOF errors when this occurred.
Go supplies a way to handle this: the `GetBody` function can be set on an `http.Request`. This function is expected to return a fresh copy of the body each time it is called.
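As a minimal illustration of the mechanism (standard library only, not the runner's actual code; the function and package names are made up for the example):

```go
package network

import (
	"bytes"
	"io"
	"net/http"
)

// newReplayableRequest builds a request whose body can be recreated for
// retries. http.NewRequest already sets GetBody for *bytes.Reader bodies;
// the explicit assignment below just shows what the mechanism does.
func newReplayableRequest(method, url string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(method, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}

	// GetBody returns a fresh copy of the body, so the request can be
	// re-sent after the first attempt has consumed req.Body.
	req.GetBody = func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(payload)), nil
	}
	req.ContentLength = int64(len(payload))

	return req, nil
}
```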
This commit changes the way that the request body is passed to the network client. If the bytes to send are known and the content length is predefined, clients can pass in a `BytesProvider`. For example, this is used for job log traces and PUT API requests that update the job status. `BytesProvider` will always produce an `io.ReadCloser` with the given set of bytes.
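A rough sketch of what such a provider can look like (the method names `GetReader` and `GetContentLength` are illustrative; the runner's actual interface may differ):

```go
package network

import (
	"bytes"
	"io"
)

// BytesProvider (sketch): the bytes and their length are known up front,
// so every retry can produce an identical body.
type BytesProvider struct {
	Data []byte
}

// GetReader returns a fresh reader over the same bytes on each call.
func (p BytesProvider) GetReader() (io.ReadCloser, error) {
	return io.NopCloser(bytes.NewReader(p.Data)), nil
}

// GetContentLength is predefined for byte slices.
func (p BytesProvider) GetContentLength() int64 {
	return int64(len(p.Data))
}
```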
Things get a little more complicated for artifact uploads, which work like this (a simplified sketch follows the list):

- `ArtifactsUploaderCommand` creates a stream for archiving a list of files.
- `GitLabClient` creates a read-write pipe and a goroutine that reads from that archive stream and writes to the pipe.
- `GitLabClient` then hands the pipe's read `io.ReadCloser` to the `http.Request`, which reads from that pipe and writes the body to the network.
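In standard-library terms, that upload flow looks roughly like this (the function and variable names are illustrative, not the runner's actual code):

```go
package network

import (
	"io"
	"net/http"
)

// uploadArchive sketches the pipe pattern described above (simplified).
// The archive is produced concurrently while the HTTP client reads the
// request body from the other end of the pipe.
func uploadArchive(client *http.Client, uploadURL string, archiveStream io.Reader) (*http.Response, error) {
	pr, pw := io.Pipe()

	go func() {
		// Stream the archive into the write end of the pipe; closing with
		// the error (or nil) propagates the result to the reader side.
		_, err := io.Copy(pw, archiveStream)
		pw.CloseWithError(err)
	}()

	// The read end becomes the request body.
	req, err := http.NewRequest(http.MethodPost, uploadURL, pr)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```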
`BytesProvider` doesn't work here because the content is not known ahead of time. For this case, this commit introduces a `StreamProvider` that wraps these pipes and streams. The caller is responsible for providing a factory that wraps the building of the archive and produces an `io.ReadCloser` for each call to `GetReader()`.
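Sketched in the same illustrative style as `BytesProvider` above (again, not the runner's actual definitions):

```go
package network

import "io"

// StreamProvider (sketch): the content is not known ahead of time, so the
// caller supplies a factory that rebuilds the stream, e.g. by re-running
// the archive-and-pipe setup shown above.
type StreamProvider struct {
	ReaderFactory func() (io.ReadCloser, error)
}

// GetReader invokes the factory so each attempt gets a brand-new stream.
func (p StreamProvider) GetReader() (io.ReadCloser, error) {
	return p.ReaderFactory()
}

// GetContentLength is unknown for streamed content; -1 makes net/http fall
// back to chunked transfer encoding.
func (p StreamProvider) GetContentLength() int64 {
	return -1
}
```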
With both `BytesProvider` and `StreamProvider`, the runner can now set the `GetBody` function. For retries to work, the rate-limit requester must call it manually and set `Content-Length` when the length is known.
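The retry path then looks roughly like this (a minimal sketch, not the runner's actual rate-limit requester; it omits details such as honoring the `Retry-After` header):

```go
package network

import "net/http"

const maxAttempts = 3 // illustrative value

// doWithRetries is a simplified sketch of a rate-limit-aware requester. It
// assumes req.GetBody was populated (e.g. from a BytesProvider or
// StreamProvider) before the request is handed over.
func doWithRetries(client *http.Client, req *http.Request) (*http.Response, error) {
	for attempt := 1; ; attempt++ {
		res, err := client.Do(req)
		if err == nil && res.StatusCode != http.StatusTooManyRequests {
			return res, nil
		}
		if attempt >= maxAttempts || req.GetBody == nil {
			return res, err
		}
		if res != nil {
			res.Body.Close()
		}

		// The previous attempt consumed req.Body; rebuild it before
		// re-sending, otherwise the retry goes out with ContentLength set
		// but an empty body.
		body, bodyErr := req.GetBody()
		if bodyErr != nil {
			return nil, bodyErr
		}
		req.Body = body
	}
}
```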
Note that a side effect of setting `GetBody` is that Go will automatically and quietly handle 307 and 308 redirections (https://go-review.googlesource.com/c/go/+/31733). Previously, the runner manually handled redirects and displayed a message in the job log. With this change, however, users will no longer see an indication that a redirection is happening behind the scenes.
Relates to #38651 (closed)
How to test locally
I tested this locally by hacking my GitLab instance:
- Enable Rack Attack: in `/admin` -> Settings -> Network, check `Enable unauthenticated API request rate limit`, and click `Save`.
- Be careful: lower the limit of unauthenticated API requests to a low value (e.g. 5). Note that this might lock you out. 😄
- In `lib/gitlab/rack_attack/request.rb` (Omnibus instance file: `/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/rack_attack/request.rb`), I changed `throttle_unauthenticated_api?` to this:
```ruby
def throttle_unauthenticated_api?
  if path =~ %r{api/v4/jobs/\d+} && !get?
    if rand(100) < 75
      Rails.logger.info "throttling #{path}"
      return true
    end
  end
  ...
```
- `gdk restart rails-web` for the changes to take effect.
- Hit `curl http://gdk.test:3000/api/v4/users` at least N times until you see `Retry later`.
- Run a CI job and observe. You should no longer see these HTTP errors. Instead, you should see something like:
```
Updating job... bytesize=4030 checksum=crc32:207b9dca job=447 runner=xXVaTgQHw
Sleeping due to rate limit context=ratelimit-requester-gitlab-request duration=40.704992558s method=PUT url=https://gitlab.example.com/api/v4/jobs/447
Dialing: tcp gitlab.example.com:443 ...
Submitting job to coordinator...ok bytesize=4030 checksum=crc32:207b9dca code=200 job=447 job-status=success runner=xXVaTgQHw update-interval=0s
```
- Disable Rack Attack.