Runner not picking up jobs after upgrade to 17.10 or 17.11

If you upgrade to GitLab 17.10 or 17.11, you may find that runners receive a 404 response when checking for jobs:

May 15 03:30:32 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed                runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:30:40 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed                runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:30:48 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed                runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:31:32 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed                runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)

This also happens when you run gitlab-runner verify:

$ gitlab-runner verify
Runtime platform                                    arch=amd64 os=linux pid=101033 revision=3e653c4e version=18.0.1
WARNING: Running in user-mode.
WARNING: The user-mode requires you to manually start builds processing:
WARNING: $ gitlab-runner run
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...

ERROR: Verifying runner... failed                   runner=Siov-TtCq status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed                   runner=vgXEBWXnw status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed                   runner=U1JRrAbw8 status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed                   runner=GuRfT2ATy status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)

Explanation

This 404 appears to occur because the data in the ci_runner_machines table was not properly migrated. The initial suspicion was that the background migration BackfillCiRunnerMachinesPartitionedTable, introduced in GitLab 17.7 via !171901 (merged), did not finish before the upgrade to GitLab 17.11.2. UPDATE: This migration was zeroed out because it caused issues with other background migrations. It is therefore expected that the ci_runner_machines and ci_runner_machines_archived tables are out of sync.
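
One way to check whether an instance is in this state is to count archived rows that have no counterpart in ci_runner_machines. The following is a minimal sketch for the Rails console, not an official diagnostic. The helper name out_of_sync_count and the constant OUT_OF_SYNC_SQL are made up for illustration, and the query assumes both tables have the runner_id and system_xid columns seen in the SQL logs further down.

```ruby
# SQL that counts archived rows with no matching (runner_id, system_xid)
# row in ci_runner_machines. Table and column names are taken from the
# logs in this issue; the query itself is an illustrative assumption.
OUT_OF_SYNC_SQL = <<~SQL
  SELECT COUNT(*)
  FROM ci_runner_machines_archived archived
  WHERE NOT EXISTS (
    SELECT 1
    FROM ci_runner_machines machines
    WHERE machines.runner_id = archived.runner_id
      AND machines.system_xid = archived.system_xid
  )
SQL

# Hypothetical helper: `connection` is any object that responds to
# `select_value(sql)`, such as an ActiveRecord connection.
def out_of_sync_count(connection)
  connection.select_value(OUT_OF_SYNC_SQL).to_i
end
```

In gitlab-rails console this could be called as out_of_sync_count(ActiveRecord::Base.connection). The logs below show db_config_name:ci, so on installations with a separate CI database you may need a connection to that database instead (for example Ci::ApplicationRecord.connection, if that class is available in your version). A non-zero result suggests the tables are out of sync.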

Workaround

The data in ci_runner_machines is not critical because it is refreshed every time a runner requests a new job.

Option 1: Resync the data from ci_runner_machines_archived to ci_runner_machines

  1. Create the script below as /tmp/sync.rb.
  2. Run gitlab-rails console.
  3. Run load '/tmp/sync.rb'.
ActiveRecord::Base.logger = Logger.new(STDOUT)

class ArchivedRunnerMachine < ActiveRecord::Base
  self.table_name = 'ci_runner_machines_archived'
end

class RunnerMachine < ActiveRecord::Base
  self.table_name = 'ci_runner_machines'
end

# Find all archived records whose runner IDs don't exist in the main table
archived_records = ArchivedRunnerMachine.where.not(
  runner_id: RunnerMachine.select(:runner_id)
)

# Disable the trigger before starting
ActiveRecord::Base.connection.execute("ALTER TABLE ci_runner_machines DISABLE TRIGGER table_sync_trigger_bc3e7b56bd;")

begin
  archived_records.find_each(batch_size: 1000) do |archived_machine|
    attributes = archived_machine.attributes.except('id')

    runner = Ci::Runner.find(archived_machine.runner_id)
    attributes['sharding_key_id'] = runner.sharding_key_id

    # Create the record in the main table
    RunnerMachine.create!(attributes)
  end
ensure
  # Re-enable the trigger after all operations
  ActiveRecord::Base.connection.execute("ALTER TABLE ci_runner_machines ENABLE TRIGGER table_sync_trigger_bc3e7b56bd;")
end

Option 2a: Delete the data in the archived table

While the data isn't critical, it does record the history of stale/offline runners.

I found that the easiest workaround is to:

  1. Back up your database.
  2. Delete all rows from ci_runner_machines_archived. Run gitlab-psql and then:
DELETE FROM ci_runner_machines_archived;

Option 2b: Upgrade to GitLab 18.0

Alternatively, upgrade to GitLab 18.0, which drops the ci_runner_machines_archived table altogether.

Why would this lead to a 404 in the /api/v4/runners/verify and other runner endpoints?

It turns out that this happens because:

  1. When the runner authenticates, it attempts to update the entry in ci_runner_machines: https://gitlab.com/gitlab-org/gitlab/-/blob/5457b685b4eb1bf52d8b06697eba6cdbb5ce5710/lib/api/ci/helpers/runner.rb#L23
  2. There is no entry there, so the call to current_runner&.ensure_manager(system_xid) attempts to insert an entry via ApplicationRecord.safe_find_or_create_by!: https://gitlab.com/gitlab-org/gitlab/-/blob/b1e30cf78cd7794691b6ec9bfdc343624edc5e14/app/models/application_record.rb#L61-67
  3. A trigger function (table_sync_function_e438f29263) replicates every insert into ci_runner_machines to ci_runner_machines_archived.
  4. However, since the entry already exists in ci_runner_machines_archived, PostgreSQL rejects this insert with a unique constraint error. From the SQL logs:
2025-05-17_22:14:02.13834 LOG:  execute <unnamed>: /*application:web,correlation_id:01JVG4ZD0YJBY0P7N88C7EXNAZ,endpoint_id:POST /api/:version/runners/verify,db_config_database:gitlabhq_production,db_config_name:ci*/ INSERT INTO "ci_runner_machines" ("runner_id", "sharding_key_id", "created_at", "updated_at", "runner_type", "system_xid") VALUES (1, 1, '2025-05-17 22:14:02.135543', '2025-05-17 22:14:02.135543', 3, 's_8f8333054652') RETURNING "id"
2025-05-17_22:14:02.14081 ERROR:  duplicate key value violates unique constraint "index_ci_runner_machines_on_runner_id_and_system_xid"
2025-05-17_22:14:02.14083 DETAIL:  Key (runner_id, system_xid)=(1, s_8f8333054652) already exists.
2025-05-17_22:14:02.14083 CONTEXT:  SQL statement "INSERT INTO ci_runner_machines_archived ("id",
2025-05-17_22:14:02.14083           "runner_id",
2025-05-17_22:14:02.14083           "executor_type",
2025-05-17_22:14:02.14084           "created_at",
2025-05-17_22:14:02.14084           "updated_at",
2025-05-17_22:14:02.14084           "contacted_at",
2025-05-17_22:14:02.14084           "version",
2025-05-17_22:14:02.14084           "revision",
2025-05-17_22:14:02.14084           "platform",
2025-05-17_22:14:02.14085           "architecture",
2025-05-17_22:14:02.14085           "ip_address",
2025-05-17_22:14:02.14085           "config",
2025-05-17_22:14:02.14085           "system_xid",
2025-05-17_22:14:02.14085           "creation_state",
2025-05-17_22:14:02.14085           "runner_type",
2025-05-17_22:14:02.14086           "sharding_key_id",
2025-05-17_22:14:02.14086           "runtime_features")
2025-05-17_22:14:02.14086         VALUES (NEW."id",
2025-05-17_22:14:02.14086           NEW."runner_id",
2025-05-17_22:14:02.14086           NEW."executor_type",
2025-05-17_22:14:02.14087           NEW."created_at",
2025-05-17_22:14:02.14088           NEW."updated_at",
2025-05-17_22:14:02.14088           NEW."contacted_at",
2025-05-17_22:14:02.14088           NEW."version",
2025-05-17_22:14:02.14088           NEW."revision",
2025-05-17_22:14:02.14089           NEW."platform",
2025-05-17_22:14:02.14089           NEW."architecture",
2025-05-17_22:14:02.14089           NEW."ip_address",
2025-05-17_22:14:02.14089           NEW."config",
2025-05-17_22:14:02.14089           NEW."system_xid",
2025-05-17_22:14:02.14089           NEW."creation_state",
2025-05-17_22:14:02.14090           NEW."runner_type",
2025-05-17_22:14:02.14090           NEW."sharding_key_id",
2025-05-17_22:14:02.14090           NEW."runtime_features")"
2025-05-17_22:14:02.14090       PL/pgSQL function table_sync_function_e438f29263() line 25 at SQL statement
  5. This failure causes the inner safe_find_or_create_by (https://gitlab.com/gitlab-org/gitlab/-/blob/b1e30cf78cd7794691b6ec9bfdc343624edc5e14/app/models/application_record.rb#L62) to return an empty record, which makes safe_find_or_create_by! raise ActiveRecord::RecordNotFound.
  6. This ActiveRecord::RecordNotFound is returned to the runner as a 404.
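
The chain above can be reproduced with a toy model in plain Ruby. This is only a sketch of the mechanism: ToyTable, SyncedTable, find_or_create, find_or_create! and verify_status are invented for illustration and are not GitLab classes or methods, and the rescue-and-retry shape of find_or_create is an assumption based on the description in the list, not a copy of application_record.rb.

```ruby
require 'set'

class UniqueViolation < StandardError; end
class RecordNotFound < StandardError; end

# Toy table with a unique index on (runner_id, system_xid).
class ToyTable
  def initialize(rows = [])
    @rows = Set.new(rows)
  end

  def exists?(runner_id, system_xid)
    @rows.include?([runner_id, system_xid])
  end

  def insert!(runner_id, system_xid)
    raise UniqueViolation if exists?(runner_id, system_xid)

    @rows << [runner_id, system_xid]
  end

  def clear
    @rows.clear
  end
end

# Main table whose inserts are mirrored into the archive, like the sync trigger.
# The mirror insert runs first so that a failure leaves the main table
# untouched, the same net effect as PostgreSQL rolling back the statement.
class SyncedTable < ToyTable
  def initialize(archive)
    super()
    @archive = archive
  end

  def insert!(runner_id, system_xid)
    @archive.insert!(runner_id, system_xid)
    super
  end
end

# Assumed shape of the non-bang lookup: find, else insert, and on a
# unique violation retry the find (which may still come back empty).
def find_or_create(table, runner_id, system_xid)
  return :found if table.exists?(runner_id, system_xid)

  table.insert!(runner_id, system_xid)
  :created
rescue UniqueViolation
  table.exists?(runner_id, system_xid) ? :found : nil
end

# Bang variant: an empty result becomes RecordNotFound.
def find_or_create!(table, runner_id, system_xid)
  find_or_create(table, runner_id, system_xid) || raise(RecordNotFound)
end

# The API layer maps RecordNotFound to a 404.
def verify_status(table, runner_id, system_xid)
  find_or_create!(table, runner_id, system_xid)
  200
rescue RecordNotFound
  404
end

archive  = ToyTable.new([[1, 's_8f8333054652']])
machines = SyncedTable.new(archive)
puts verify_status(machines, 1, 's_8f8333054652') # 404: row exists only in the archive
archive.clear                                      # what Option 2a does
puts verify_status(machines, 1, 's_8f8333054652') # 200: both inserts succeed
```

The model also shows why emptying the archive (Option 2a) or filling the main table (Option 1) both clear the error: either way the mirrored insert no longer collides, or the initial lookup succeeds.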

Edited by Stan Hu