Runner not picking up jobs after upgrade to 17.10 or 17.11
If you upgrade to GitLab 17.10 or 17.11, you may find that runners receive a 404 response when checking for jobs:
May 15 03:30:32 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:30:40 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:30:48 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
May 15 03:31:32 git2 gitlab-runner[702]: WARNING: Checking for jobs... failed runner=**** status=POST https://****/api/v4/jobs/request: 404 Not Found (404 Not found)
This also happens when you run gitlab-runner verify:
$ gitlab-runner verify
Runtime platform arch=amd64 os=linux pid=101033 revision=3e653c4e version=18.0.1
WARNING: Running in user-mode.
WARNING: The user-mode requires you to manually start builds processing:
WARNING: $ gitlab-runner run
WARNING: Use sudo for system-mode:
WARNING: $ sudo gitlab-runner...
ERROR: Verifying runner... failed runner=Siov-TtCq status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed runner=vgXEBWXnw status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed runner=U1JRrAbw8 status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
ERROR: Verifying runner... failed runner=GuRfT2ATy status=POST https://gitlab.example.com/api/v4/runners/verify: 404 Not Found (404 Not found)
Explanation
This 404 appears to occur because the data in the ci_runner_machines table was not properly migrated, most likely because the background migration BackfillCiRunnerMachinesPartitionedTable, introduced in GitLab 17.7 via !171901 (merged), did not finish before the upgrade to GitLab 17.11.2. UPDATE: This migration was zeroed out because it caused issues with other background migrations. Therefore it is expected that the ci_runner_machines and ci_runner_machines_archived tables are out of sync.
Workaround
The data in ci_runner_machines is not that important because the runner sends it again every time it requests a new job.
Option 1: Resync the data from ci_runner_machines_archived to ci_runner_machines
- Create the script below as /tmp/sync.rb.
- Run gitlab-rails console.
- Run load '/tmp/sync.rb'.
ActiveRecord::Base.logger = Logger.new(STDOUT)

class ArchivedRunnerMachine < ActiveRecord::Base
  self.table_name = 'ci_runner_machines_archived'
end

class RunnerMachine < ActiveRecord::Base
  self.table_name = 'ci_runner_machines'
end

# Find all archived records whose runner IDs don't exist in the main table
archived_records = ArchivedRunnerMachine.where.not(
  runner_id: RunnerMachine.select(:runner_id)
)

# Disable the trigger before starting
ActiveRecord::Base.connection.execute("ALTER TABLE ci_runner_machines DISABLE TRIGGER table_sync_trigger_bc3e7b56bd;")

begin
  archived_records.find_each(batch_size: 1000) do |archived_machine|
    attributes = archived_machine.attributes.except('id')
    runner = Ci::Runner.find(archived_machine.runner_id)
    attributes['sharding_key_id'] = runner.sharding_key_id

    # Create the record in the main table
    RunnerMachine.create!(attributes)
  end
ensure
  # Re-enable the trigger after all operations
  ActiveRecord::Base.connection.execute("ALTER TABLE ci_runner_machines ENABLE TRIGGER table_sync_trigger_bc3e7b56bd;")
end
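Before running the script above, it may help to see which rows are actually out of sync. The following is a minimal sketch, not part of the original workaround: missing_machine_keys is a hypothetical helper that compares (runner_id, system_xid) pairs, the columns named in the unique index from the logs. Note that the sync script itself filters on runner_id only, so the two can disagree when a runner has several machines. The pluck calls in the comments assume the two ad hoc model classes defined in the script and have not been run against a real instance.

```ruby
require 'set'

# Hypothetical helper: returns the [runner_id, system_xid] pairs that exist in
# the archived table but are missing from the main table.
def missing_machine_keys(archived_keys, main_keys)
  main = main_keys.to_set
  archived_keys.reject { |key| main.include?(key) }
end

# In a Rails console the inputs could be loaded like this (assumption):
#   archived_keys = ArchivedRunnerMachine.pluck(:runner_id, :system_xid)
#   main_keys     = RunnerMachine.pluck(:runner_id, :system_xid)
# Sample data is used here so the snippet runs on its own.
archived_keys = [[1, 's_8f8333054652'], [2, 's_aaaaaaaaaaaa']]
main_keys     = [[2, 's_aaaaaaaaaaaa']]

puts missing_machine_keys(archived_keys, main_keys).inspect
```

An empty result would suggest the two tables already agree on these keys and that the 404 has another cause.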
Option 2a: Delete the data in the archived table
While the data isn't critical, it does record the history of stale/offline runners.
I found that the easiest workaround is to:
- Back up your database.
- Drop the data in ci_runner_machines_archived. Run gitlab-psql and then:
DELETE FROM ci_runner_machines_archived;
Option 2b: Upgrade to GitLab 18.0
Alternatively, you can upgrade to GitLab 18.0, which drops the ci_runner_machines_archived table altogether.
Why would this lead to a 404 in the /api/v4/runners/verify and other runner endpoints?
It turns out that this happens because:
- When the runner authenticates, it attempts to update the entry in ci_runner_machines: https://gitlab.com/gitlab-org/gitlab/-/blob/5457b685b4eb1bf52d8b06697eba6cdbb5ce5710/lib/api/ci/helpers/runner.rb#L23
- There is no entry there, so the call to current_runner&.ensure_manager(system_xid) attempts to insert an entry via ApplicationRecord.safe_find_or_create_by!: https://gitlab.com/gitlab-org/gitlab/-/blob/b1e30cf78cd7794691b6ec9bfdc343624edc5e14/app/models/application_record.rb#L61-67
- There is a trigger (table_sync_function_e438f29263) that replicates all inserts into ci_runner_machines to ci_runner_machines_archived.
- However, since the entry already exists in ci_runner_machines_archived, PostgreSQL rejects this insert with a unique constraint error. From the SQL logs:
2025-05-17_22:14:02.13834 LOG: execute <unnamed>: /*application:web,correlation_id:01JVG4ZD0YJBY0P7N88C7EXNAZ,endpoint_id:POST /api/:version/runners/verify,db_config_database:gitlabhq_production,db_config_name:ci*/ INSERT INTO "ci_runner_machines" ("runner_id", "sharding_key_id", "created_at", "updated_at", "runner_type", "system_xid") VALUES (1, 1, '2025-05-17 22:14:02.135543', '2025-05-17 22:14:02.135543', 3, 's_8f8333054652') RETURNING "id"
2025-05-17_22:14:02.14081 ERROR: duplicate key value violates unique constraint "index_ci_runner_machines_on_runner_id_and_system_xid"
2025-05-17_22:14:02.14083 DETAIL: Key (runner_id, system_xid)=(1, s_8f8333054652) already exists.
2025-05-17_22:14:02.14083 CONTEXT: SQL statement "INSERT INTO ci_runner_machines_archived ("id",
2025-05-17_22:14:02.14083 "runner_id",
2025-05-17_22:14:02.14083 "executor_type",
2025-05-17_22:14:02.14084 "created_at",
2025-05-17_22:14:02.14084 "updated_at",
2025-05-17_22:14:02.14084 "contacted_at",
2025-05-17_22:14:02.14084 "version",
2025-05-17_22:14:02.14084 "revision",
2025-05-17_22:14:02.14084 "platform",
2025-05-17_22:14:02.14085 "architecture",
2025-05-17_22:14:02.14085 "ip_address",
2025-05-17_22:14:02.14085 "config",
2025-05-17_22:14:02.14085 "system_xid",
2025-05-17_22:14:02.14085 "creation_state",
2025-05-17_22:14:02.14085 "runner_type",
2025-05-17_22:14:02.14086 "sharding_key_id",
2025-05-17_22:14:02.14086 "runtime_features")
2025-05-17_22:14:02.14086 VALUES (NEW."id",
2025-05-17_22:14:02.14086 NEW."runner_id",
2025-05-17_22:14:02.14086 NEW."executor_type",
2025-05-17_22:14:02.14087 NEW."created_at",
2025-05-17_22:14:02.14088 NEW."updated_at",
2025-05-17_22:14:02.14088 NEW."contacted_at",
2025-05-17_22:14:02.14088 NEW."version",
2025-05-17_22:14:02.14088 NEW."revision",
2025-05-17_22:14:02.14089 NEW."platform",
2025-05-17_22:14:02.14089 NEW."architecture",
2025-05-17_22:14:02.14089 NEW."ip_address",
2025-05-17_22:14:02.14089 NEW."config",
2025-05-17_22:14:02.14089 NEW."system_xid",
2025-05-17_22:14:02.14089 NEW."creation_state",
2025-05-17_22:14:02.14090 NEW."runner_type",
2025-05-17_22:14:02.14090 NEW."sharding_key_id",
2025-05-17_22:14:02.14090 NEW."runtime_features")"
2025-05-17_22:14:02.14090 PL/pgSQL function table_sync_function_e438f29263() line 25 at SQL statement
- This failure causes the inner safe_find_or_create_by (https://gitlab.com/gitlab-org/gitlab/-/blob/b1e30cf78cd7794691b6ec9bfdc343624edc5e14/app/models/application_record.rb#L62) to return an empty record, which in turn causes safe_find_or_create_by! to raise the ActiveRecord::RecordNotFound error.
- This ActiveRecord::RecordNotFound gets returned as a 404.
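The chain above can be reproduced in a small in-memory model. This is a simplified sketch based only on the steps described here, not GitLab's actual code: FakeRunnerMachines, verify_status, and the two exception classes are hypothetical stand-ins, the trigger is modeled as a plain Ruby check, and the 200/404 return values are illustrative.

```ruby
# Stand-ins for the ActiveRecord exceptions (hypothetical, illustration only).
class RecordNotUnique < StandardError; end
class RecordNotFound < StandardError; end

# In-memory model of the two tables plus the sync trigger.
class FakeRunnerMachines
  def initialize(main:, archived:)
    @main = main          # rows in ci_runner_machines
    @archived = archived  # rows in ci_runner_machines_archived
  end

  def find_by(key)
    @main.find { |row| row == key }
  end

  # An insert into the main table fires the trigger, which copies the row to
  # the archived table. A duplicate there aborts the whole insert.
  def create(key)
    raise RecordNotUnique, 'duplicate key in archived table' if @archived.include?(key)

    @archived << key
    @main << key
    key
  end

  # Find, else create; on a uniqueness error, look the row up again.
  def safe_find_or_create_by(key)
    find_by(key) || create(key)
  rescue RecordNotUnique
    find_by(key) # still nil: the duplicate lives in the archived table only
  end

  def safe_find_or_create_by!(key)
    record = safe_find_or_create_by(key)
    raise RecordNotFound if record.nil?

    record
  end
end

# Models the API layer turning RecordNotFound into a 404.
def verify_status(table, key)
  table.safe_find_or_create_by!(key)
  200
rescue RecordNotFound
  404
end

key = [1, 's_8f8333054652']
out_of_sync = FakeRunnerMachines.new(main: [], archived: [key])
in_sync     = FakeRunnerMachines.new(main: [], archived: [])

puts verify_status(out_of_sync, key) # prints 404
puts verify_status(in_sync, key)     # prints 200
```

The point of the model is that the retry lookup after the uniqueness error only searches the main table, so it cannot find the row that caused the conflict.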
Related issues: