Fix HTTP retries not working properly

Previously, if GitLab rate-limited an HTTP request to the runner API, the runner would attempt a retry but omit the request body. The job log would show a message such as:

    couldn't execute PUT against https://gitlab.example.com/api/v4/jobs/123: Put "https://gitlab.example.com/api/v4/jobs/123": http: ContentLength=972 with Body length 0

This happened because the rate limit requester would reuse the same http.Request, but its Body had already been consumed by the first attempt. As a result, HTTP retries did not work at all, and the GitLab server logged EOF errors when this occurred.

Go provides a way to handle this: an http.Request can carry a GetBody function, which is expected to return a fresh copy of the body each time it is called so the request can be replayed.
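
For illustration, here is a minimal, self-contained example of how GetBody behaves (the URL and payload are made up). When a request is built from an in-memory reader, Go sets ContentLength and GetBody automatically, and GetBody hands back a fresh reader over the same bytes:

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        payload := []byte(`{"state":"success"}`)

        // http.NewRequest recognizes *bytes.Reader (as well as *bytes.Buffer
        // and *strings.Reader) and fills in both ContentLength and GetBody.
        req, err := http.NewRequest(http.MethodPut,
            "https://gitlab.example.com/api/v4/jobs/123", bytes.NewReader(payload))
        if err != nil {
            panic(err)
        }

        // GetBody returns a brand-new io.ReadCloser over the same bytes, so the
        // request can be replayed even after the first attempt drained req.Body.
        body, err := req.GetBody()
        if err != nil {
            panic(err)
        }
        defer body.Close()

        replayed, _ := io.ReadAll(body)
        fmt.Println(req.ContentLength, string(replayed))

        // With an arbitrary io.Reader (an io.Pipe, for example) GetBody stays
        // nil and the request cannot be retried without help from the caller.
    }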

This commit changes the way the request body is passed to the network client. If the bytes to send are known and the content length is predefined, callers can pass in a BytesProvider; this is used, for example, for job log traces and the PUT API requests that update the job status. A BytesProvider always produces an io.ReadCloser over the given set of bytes.
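
As a rough sketch of the idea (the interface name ContentProvider and the method GetContentLength are assumptions for illustration; only BytesProvider and GetReader() are named in this description):

    package network

    import (
        "bytes"
        "io"
    )

    // ContentProvider (name assumed) captures the abstraction: anything that
    // can hand out a fresh reader for the request body on demand, which is
    // exactly what http.Request.GetBody needs.
    type ContentProvider interface {
        GetReader() (io.ReadCloser, error)
    }

    // BytesProvider wraps a byte slice whose length is known up front.
    type BytesProvider struct {
        Data []byte
    }

    // GetReader returns a new reader over the same bytes on every call, so the
    // body can be rebuilt for each retry.
    func (p BytesProvider) GetReader() (io.ReadCloser, error) {
        return io.NopCloser(bytes.NewReader(p.Data)), nil
    }

    // GetContentLength (name assumed) exposes the predefined Content-Length.
    func (p BytesProvider) GetContentLength() int64 {
        return int64(len(p.Data))
    }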

Things get a little more complicated for artifact uploads. This is how they work (a sketch follows the list):

  1. ArtifactsUploaderCommand creates a stream for archiving a list of files.
  2. GitLabClient creates an in-memory pipe and a goroutine that reads from the archive stream and writes to the pipe.
  3. GitLabClient then hands the pipe's read end (an io.ReadCloser) to the http.Request, which reads from the pipe and writes the body to the network.
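
The pipe step looks roughly like this (a sketch; streamArchive is a made-up helper name, not the runner's own function):

    package network

    import "io"

    // streamArchive illustrates step 2: a goroutine copies the archive stream
    // into the write end of an io.Pipe, and the read end is what ends up as
    // the http.Request body.
    func streamArchive(archive io.Reader) io.ReadCloser {
        pr, pw := io.Pipe()

        go func() {
            // Any error from the copy is propagated to the reading side.
            _, err := io.Copy(pw, archive)
            pw.CloseWithError(err)
        }()

        return pr
    }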

BytesProvider doesn't work here because the content is not known ahead of time. For this case, this commit introduces a StreamProvider that wraps these pipes and streams. The caller is responsible for supplying a factory that wraps the building of the archive and produces an io.ReadCloser for each call to GetReader().
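
A sketch of what that could look like; apart from StreamProvider and GetReader(), the names here are illustrative:

    package network

    import "io"

    // StreamProvider wraps a factory that rebuilds the body stream on demand.
    type StreamProvider struct {
        // ReaderFactory (name assumed) starts or restarts the archive build
        // and returns the read end of the resulting stream.
        ReaderFactory func() (io.ReadCloser, error)
    }

    // GetReader produces a fresh io.ReadCloser on every call, so a retried
    // request re-runs the archive build instead of reusing a drained stream.
    func (p StreamProvider) GetReader() (io.ReadCloser, error) {
        return p.ReaderFactory()
    }

Used together with the pipe helper sketched above, the factory would re-run the archive build and wrap the fresh stream for every attempt, rather than handing the same exhausted pipe to each retry.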

With both BytesProvider and StreamProvider, the runner can now set the GetBody function. For retries to work, the rate limit requester must call GetBody manually to rebuild the body and must set Content-Length when it is known.
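
Roughly, the retry path has to do something like the following before replaying a request (a sketch, not the runner's actual requester; the back-off here is simplified):

    package ratelimit

    import (
        "errors"
        "net/http"
        "time"
    )

    // do sketches the retry loop. Before every replay the body is rebuilt from
    // GetBody, because the previous attempt has already drained req.Body.
    func do(client *http.Client, req *http.Request) (*http.Response, error) {
        for attempt := 0; ; attempt++ {
            if attempt > 0 && req.Body != nil {
                if req.GetBody == nil {
                    return nil, errors.New("cannot retry: request has no GetBody")
                }
                body, err := req.GetBody()
                if err != nil {
                    return nil, err
                }
                req.Body = body
                // ContentLength was set by the caller when the size is known;
                // streamed bodies leave it as "unknown".
            }

            resp, err := client.Do(req)
            if err != nil {
                return nil, err
            }
            if resp.StatusCode != http.StatusTooManyRequests || attempt >= 5 {
                return resp, nil
            }

            resp.Body.Close()
            // The real requester waits for the reset time advertised by the
            // server; a fixed sleep keeps this sketch short.
            time.Sleep(time.Second)
        }
    }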

Note that a side effect of setting this function is that Go will automatically and quietly follow 307 and 308 redirects (https://go-review.googlesource.com/c/go/+/31733). Previously the runner handled redirects manually and displayed a message in the job log; with this change, users will no longer see any indication that a redirection is happening behind the scenes.

Relates to #38651 (closed)

How to test locally

I tested this locally by hacking my GitLab instance:

  1. Enable Rack Attack: in /admin -> Settings -> Network, check Enable unauthenticated API request rate limit, and click Save.
  2. Be careful: lower the unauthenticated API request limit to a low value (e.g. 5). Note that this might lock you out. 😄
  3. In lib/gitlab/rack_attack/request.rb (on an Omnibus instance: /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/rack_attack/request.rb), I changed throttle_unauthenticated_api? to this:
      def throttle_unauthenticated_api?
        if path =~ %r{api/v4/jobs/\d+} && !get?
          if rand(100) < 75
            Rails.logger.info "throttling #{path}"
            return true
          end
        end

        ...
  4. Run gdk restart rails-web for the changes to take effect.
  5. Hit curl http://gdk.test:3000/api/v4/users at least N times until you see Retry later.
  6. Run a CI job and observe. You should no longer see these HTTP errors. Instead, you should see something like:
      Updating job...                                     bytesize=4030 checksum=crc32:207b9dca job=447 runner=xXVaTgQHw
      Sleeping due to rate limit                          context=ratelimit-requester-gitlab-request duration=40.704992558s method=PUT url=https://gitlab.example.com/api/v4/jobs/447
      Dialing: tcp gitlab.example.com:443 ...
      Submitting job to coordinator...ok                  bytesize=4030 checksum=crc32:207b9dca code=200 job=447 job-status=success runner=xXVaTgQHw update-interval=0s
  7. Disable Rack Attack.