Runner verification fails with panic when receiving invalid JSON response

Summary

Our runner verification code has a fallback to the legacy text-based response format when JSON parsing fails. The fallback path does not populate the response object before returning it to verifyRunner in network/gitlab.go.

This causes the runner to log a panic message even when the runner verification is possibly successful. This is confusing, because it can lead to situations where the log is contradictory.

This issue was brought to my attention by a GitLab Premium customer.

Steps to reproduce

Unfortunately, I haven't been able to reproduce this bug, and it only happened once to the customer who brought it to my attention. It seems like it might have been a random network hiccup in their case.

Actual behavior

The runner will log both ... is valid and a PANIC: Failed to verify the runner. message, which is somewhat confusing. In the case I saw, they were able to manually retry verification which made the runners verify properly.

Expected behavior

If we get an invalid JSON response and hit the legacy code path, I'd expect one or more of the following:

  1. A warning that we got invalid JSON or that there was a JSON parsing failure, explaining that we're trying the legacy path.
  2. The legacy path should populate the response with something that we can handle. Right now I think it's just passing along the zero-initialized VerifyRunnerResponse which causes the confusing panic.
  3. I'm assuming that we might actually be succeeding when we hit the legacy path. However, because we panic if the response has a 0 ID later on, we never actually finish the verification.

Relevant logs and/or screenshots

Here's what showed up in the runner logs I got from the customer (anonymized):

PS D:\GitLab-Runner> .\gitlab-runner.exe --debug --debug register --config D:\GitLab-Runner\config.toml --non-interactive --executor shell --shell powershell --name REDACTED --url https://gitlab.example.com --registration-token REDACTED --template-config D:\GitLab-Runner\config.toml.template --output-limit 20000
Runtime platform                                    arch=amd64 os=windows pid=2960 revision=50bc0499 version=18.2.2
Checking runtime mode                               GOOS=windows uid=-1
Merging configuration from template file "D:\\GitLab-Runner\\config.toml.template"
Dialing: tcp gitlab.example.com:443 ...
Dialing: tcp gitlab.example.com:443 ...
Verifying runner... is valid                        correlation_id=01K6GKWAHFGK48PWQ8CTEWC7H2 runner=kNPcd47jP
PANIC: Failed to verify the runner.

The confusing part here is these two lines. It feels like we shouldn't say the runner is valid and then immediately panic, which caused some confusion for the customer:

Verifying runner... is valid                        correlation_id=01K6GKWAHFGK48PWQ8CTEWC7H2 runner=kNPcd47jP
PANIC: Failed to verify the runner.

Environment description

This issue only seemed to occur on the customer's Windows runners. Here's the config.toml snippet they sent:

concurrent = 1
check_interval = 0
listen_address = "0.0.0.0:9254"

I also have some of their template config (anonymized):

[[runners]]
  environment = ["no_proxy=.example.com"]

As you can see, their setup is pretty normal. There's not much that I think could be hitching them here.

Used GitLab Runner version

18.2.2

Possible fixes

I'm not 100% sure on an actual fix, but here's the bits of code that I think might be responsible. First, in network/gitlab.go:

  var response common.VerifyRunnerResponse   // <- zeroed by default?
  //nolint:bodyclose
  // body is closed with closeResponseBody function call
  result, statusText, resp := n.doJSON(
    context.Background(),
    &runner,
    http.MethodPost,
    "runners/verify",
    http.StatusOK,
    headers,
    &request,
    &response,
  )
  if result == -1 {
    // if server is not able to return JSON, let's try plain text (the legacy response format)
    //nolint:bodyclose
    // body is closed with closeResponseBody function call
    result, statusText, resp = n.doJSON(
      context.Background(),
      &runner,
      http.MethodPost,
      "runners/verify",
      http.StatusOK,
      headers,
      &request,
      nil,      // <- pass nil instead of &response, which makes doJSON try plain text.
    )
  }
  defer closeResponseBody(resp, false)

  logger := runner.Log().WithField("correlation_id", getCorrelationID(resp, correlationID))

  switch result {
  case http.StatusOK:
    // this is expected due to fact that we ask for non-existing job
    if TokenIsCreatedRunnerToken(runner.Token) {
      logger.Println("Verifying runner...", "is valid")
    } else {
      logger.Println("Verifying runner...", "is alive")
    }
    return &response     // <- if we hit the legacy path, response is still zeroed here!
  case http.StatusForbidden:
    if TokenIsCreatedRunnerToken(runner.Token) {
      logger.Println("Verifying runner...", "is not valid")
    } else {
      logger.WithField("status", statusText).Errorln("Verifying runner...", "is removed")
    }
    return nil
  case clientError:
    logger.WithField("status", statusText).Errorln("Verifying runner...", "client error")
    return &response
  default:
    logger.WithField("status", statusText).Errorln("Verifying runner...", "failed")
    return &response
  }

That's all called from commands/register.go:

func (s *RegisterCommand) verifyRunner() {
  // If a runner authentication token is specified in place of a registration token, let's accept it and process it as
  // an authentication token. This allows for an easier transition for users by simply replacing the
  // registration token with the new authentication token.
  result := s.network.VerifyRunner(s.RunnerCredentials, s.SystemID)

  // result is &response from the code above. since it's zero-initialized, 
  //  it will trigger the panic even on success.
  if result == nil || result.ID == 0 {
    logrus.Panicln("Failed to verify the runner.")
  }
  s.ID = result.ID
  s.TokenObtainedAt = s.timeNowFn().UTC().Truncate(time.Second)
  s.TokenExpiresAt = result.TokenExpiresAt
  s.registered = true
}