Upgrading self-managed GitLab to v18.5 leads to a failing DB migration when there is an instance-level Slack integration

Summary

Upgrading self-managed GitLab CE to v18.5 leads to a failing DB migration when there is an instance-level Slack integration.

Steps to reproduce

Have a self-managed GitLab CE v18.4 instance that was previously affected by the bug described in !202790 (merged), that is, an instance that has a (now disabled) instance-level Slack integration.
Initiate an Omnibus package upgrade of gitlab-ce to v18.5.

Example Project

N/A. Demonstrating this requires a complete GitLab instance.

What is the current bug behavior?

The GitLab instance goes down: sudo gitlab-ctl status shows most services as down.

A DB post-migration fails (for example during gitlab-ctl reconfigure). The failure also blocks the remaining migrations required for v18.5, leaving GitLab unusable with most services in the down state.

What is the expected correct behavior?

sudo gitlab-ctl status shows all services up and running.

Relevant logs and/or screenshots

Most services are down:

sudo gitlab-ctl status
down: alertmanager: 46771s, normally up; run: log: (pid 922) 1861459s
down: crond: 46771s, normally up; run: log: (pid 913) 1861459s
run: gitaly: (pid 3969406) 46668s; run: log: (pid 920) 1861459s
down: gitlab-exporter: 46770s, normally up; run: log: (pid 938) 1861459s
down: gitlab-workhorse: 46770s, normally up; run: log: (pid 912) 1861459s
down: logrotate: 46769s, normally up; run: log: (pid 905) 1861459s
down: nginx: 46769s, normally up; run: log: (pid 923) 1861459s
down: node-exporter: 46769s, normally up; run: log: (pid 897) 1861459s
down: postgres-exporter: 46768s, normally up; run: log: (pid 907) 1861459s
run: postgresql: (pid 15276) 1861060s; run: log: (pid 921) 1861459s
down: prometheus: 46768s, normally up; run: log: (pid 900) 1861459s
down: puma: 46765s, normally up; run: log: (pid 903) 1861459s
run: redis: (pid 926) 1861459s; run: log: (pid 910) 1861459s
down: redis-exporter: 46765s, normally up; run: log: (pid 898) 1861459s
down: registry: 46764s, normally up; run: log: (pid 924) 1861459s
down: sidekiq: 46760s, normally up; run: log: (pid 902) 1861459s
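
For anyone triaging a similar outage, the status listing above can be summarized mechanically. This is only a rough sketch: the regular expression is inferred from the line format shown in this report, and the helper name down_services is made up for illustration.

```python
import re

# Matches the leading "<state>: <service>: ..." part of a status line as it
# appears in this report, e.g. "down: nginx: 46769s, normally up; ..." or
# "run: redis: (pid 926) 1861459s; ...". The format is inferred from the
# sample output only.
STATUS_LINE = re.compile(
    r"^(?P<state>run|down): (?P<service>[\w-]+): (?:\(pid \d+\) )?(?P<seconds>\d+)s"
)

def down_services(status_output):
    """Return the names of services whose main process is reported as down."""
    down = []
    for line in status_output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match and match.group("state") == "down":
            down.append(match.group("service"))
    return down

# A few lines copied from the status output in this report.
sample = """\
down: nginx: 46769s, normally up; run: log: (pid 923) 1861459s
run: postgresql: (pid 15276) 1861060s; run: log: (pid 921) 1861459s
down: puma: 46765s, normally up; run: log: (pid 903) 1861459s
run: redis: (pid 926) 1861459s; run: log: (pid 910) 1861459s
"""

print(down_services(sample))  # ['nginx', 'puma']
```

In the full listing, only gitaly, postgresql and redis are still running.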

A DB post-migration fails. Running gitlab-ctl reconfigure, for example, produces this error:

...

bash_hide_env[migrate gitlab-rails database] action run
      [execute] Skipping Topology Service health check due to the cell being disabled
                Running db:migrate rake task
                main: == [advisory_lock_connection] object_id: 68660, pg_backend_pid: 4052022
                main: == 20250922093672 IntegrationsValidateMultipleColumnNotNullConstraint: migrating 
                main: -- execute("SET statement_timeout TO 0")
                main:    -> 0.0009s
                main: -- execute("ALTER TABLE integrations VALIDATE CONSTRAINT check_2aae034509;")
                main: -- execute("RESET statement_timeout")
                main: == [advisory_lock_connection] object_id: 68660, pg_backend_pid: 4052022
                rake aborted!
                StandardError: An error has occurred, this and all later migrations canceled:
                
                PG::InFailedSqlTransaction: ERROR:  current transaction is aborted, commands ignored until end of transaction block

...

ActiveRecord::StatementInvalid: PG::CheckViolation: ERROR:  check constraint "check_2aae034509" of relation "integrations" is violated by some row

The failing migration was added in !204744 (merged): IntegrationsValidateMultipleColumnNotNullConstraint in db/post_migrate/20250922093672_integrations_validate_multiple_column_not_null_constraint.rb.

That MR appears to validate that a required combination of attributes (sharding keys) is present for each row of the integrations DB table.
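
In the log above, the PG::InFailedSqlTransaction message is only a follow-on error. The PG::CheckViolation line names the constraint and table that actually matter. As a small sketch, those two names can be pulled out of captured reconfigure output like this (the helper name find_check_violation is hypothetical, and the pattern is based solely on the error text quoted in this report):

```python
import re

# Pattern based on the PostgreSQL error text quoted in this report.
CHECK_VIOLATION = re.compile(
    r'check constraint "(?P<constraint>[^"]+)" of relation "(?P<relation>[^"]+)" is violated'
)

def find_check_violation(log_text):
    """Return (constraint, relation) for the first check violation, or None."""
    match = CHECK_VIOLATION.search(log_text)
    if match is None:
        return None
    return match.group("constraint"), match.group("relation")

# Two error lines copied from the log in this report.
log = (
    "PG::InFailedSqlTransaction: ERROR:  current transaction is aborted, "
    "commands ignored until end of transaction block\n"
    "ActiveRecord::StatementInvalid: PG::CheckViolation: ERROR:  "
    'check constraint "check_2aae034509" of relation "integrations" '
    "is violated by some row\n"
)

print(find_check_violation(log))  # ('check_2aae034509', 'integrations')
```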

Connect to the DB:

sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/postgresql -d gitlabhq_production

Track down the violators. Exactly one of the three columns project_id, group_id and organization_id should be non-null:

Check for group_id and organization_id both being non-null: 0 results
- SELECT * FROM integrations WHERE group_id IS NOT NULL AND organization_id IS NOT NULL;
Check for project_id and organization_id both being non-null: 0 results
- SELECT * FROM integrations WHERE project_id IS NOT NULL AND organization_id IS NOT NULL;
Check for group_id and project_id both being non-null: 0 results
- SELECT * FROM integrations WHERE group_id IS NOT NULL AND project_id IS NOT NULL;
Check for all 3 being null: 1 violator, the Slack integration
- SELECT id, created_at, updated_at, active, category, instance, inherit_from_id, type_new FROM integrations WHERE group_id IS NULL AND project_id IS NULL AND organization_id IS NULL;
  - 6 | 2025-06-02 09:50:48.681851 | 2025-09-03 09:04:21.757086 | f | chat | t | | Integrations::GitlabSlackApplication
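
The four checks above can be collapsed into a single "exactly one non-null" predicate. Below is a minimal, self-contained sketch of that idea using an in-memory SQLite table from Python. The table is a toy stand-in for integrations, and every row except id 6 is invented for illustration. On the real PostgreSQL database the same condition could probably be written as num_nonnulls(project_id, group_id, organization_id) <> 1. Whether check_2aae034509 is defined exactly that way is an assumption; it has not been verified against the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for the integrations table: only the columns relevant here.
conn.execute(
    "CREATE TABLE integrations ("
    "id INTEGER PRIMARY KEY, type_new TEXT, "
    "project_id INTEGER, group_id INTEGER, organization_id INTEGER)"
)
conn.executemany(
    "INSERT INTO integrations VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ProjectLevel", 10, None, None),   # hypothetical valid row
        (2, "GroupLevel", None, 20, None),     # hypothetical valid row
        (3, "InstanceLevel", None, None, 1),   # hypothetical valid row
        # Mirrors the violating row from this report: all three keys NULL.
        (6, "Integrations::GitlabSlackApplication", None, None, None),
    ],
)

# In SQLite each "IS NOT NULL" evaluates to 0 or 1, so the sum counts the
# non-null columns. Any row where that count is not exactly 1 is a violator,
# whether it has zero keys set or more than one.
VIOLATORS_SQL = """
SELECT id, type_new
FROM integrations
WHERE (project_id IS NOT NULL)
    + (group_id IS NOT NULL)
    + (organization_id IS NOT NULL) <> 1
ORDER BY id
"""

violators = conn.execute(VIOLATORS_SQL).fetchall()
print(violators)  # [(6, 'Integrations::GitlabSlackApplication')]
```

A single predicate has the advantage that it also catches rows with two or three keys set, so the pairwise checks do not have to be enumerated by hand.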

Fixed by deleting the violating record:

DELETE FROM integrations WHERE group_id IS NULL AND project_id IS NULL AND organization_id IS NULL;

After that, sudo gitlab-ctl reconfigure completed successfully.

Output of checks

Results of GitLab environment info

Added output of sudo gitlab-rake gitlab:env:info below

sudo gitlab-rake gitlab:env:info

System information
System: Debian 12
Current User: git
Using RVM: no
Ruby Version: 3.2.8
Gem Version: 3.7.1
Bundler Version:2.7.1
Rake Version: 13.0.6
Redis Version: 7.2.10
Sidekiq Version:7.3.9
Go Version: unknown

GitLab information
Version: 18.5.0
Revision: a2f69d15eba
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 16.10
URL: REDACTED
HTTP Clone URL: REDACTED
SSH Clone URL: REDACTED
Using LDAP: no
Using Omniauth: yes
Omniauth Providers: google_oauth2

GitLab Shell
Version: 14.45.3
Repository storages:
- default: unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell

Gitaly
- default Address: unix:/var/opt/gitlab/gitaly/gitaly.socket
- default Version: 18.5.0
- default Git Version: 2.50.1

Results of GitLab application Check

Did NOT manage to run gitlab:check before resolving the issue.

Workarounds

Manually connect to the DB and delete the violating record for the Slack integration. The connection command below is for an Omnibus self-managed installation on a single node:

sudo -u gitlab-psql /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/postgresql -d gitlabhq_production
DELETE FROM integrations WHERE group_id IS NULL AND project_id IS NULL AND organization_id IS NULL;
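
Because this DELETE has no id filter, it may be worth guarding it: look at the matching rows first, refuse to continue if there are more than expected, and run the delete inside a transaction. The following is a sketch of that pattern only, against a toy in-memory SQLite table. The function name, the expected_max parameter and the sample rows (other than id 6) are assumptions made for illustration. On a real instance the equivalent would be BEGIN; SELECT ...; DELETE ...; COMMIT; in psql, ideally after taking a database backup.

```python
import sqlite3

# Same condition as the workaround: no project, group or organization set.
UNOWNED = "group_id IS NULL AND project_id IS NULL AND organization_id IS NULL"

def delete_unowned_integrations(conn, expected_max=1):
    """Delete rows that have no owner key, but only if few enough rows match.

    Returns the (id, type_new) pairs that were removed.
    """
    rows = conn.execute(
        f"SELECT id, type_new FROM integrations WHERE {UNOWNED}"
    ).fetchall()
    if len(rows) > expected_max:
        raise RuntimeError(
            f"expected at most {expected_max} row(s), found {len(rows)}"
        )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(f"DELETE FROM integrations WHERE {UNOWNED}")
    return rows

# Toy data: one hypothetical valid row plus the violating row from this report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE integrations ("
    "id INTEGER PRIMARY KEY, type_new TEXT, "
    "project_id INTEGER, group_id INTEGER, organization_id INTEGER)"
)
conn.executemany(
    "INSERT INTO integrations VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ProjectLevel", 10, None, None),
        (6, "Integrations::GitlabSlackApplication", None, None, None),
    ],
)
conn.commit()

removed = delete_unowned_integrations(conn)
remaining = conn.execute("SELECT id FROM integrations ORDER BY id").fetchall()
print(removed)    # [(6, 'Integrations::GitlabSlackApplication')]
print(remaining)  # [(1,)]
```

Checking the row count before deleting means a surprise (for example several unowned integrations instead of one) stops the workaround, so nothing is removed unreviewed.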

Possible fixes

The root cause is not yet clear; the previously linked issues need to be analyzed. Possibly !204744 (merged) wrongly assumes that no rows have all 3 attributes set to NULL.

Edited Oct 26, 2025 by 🤖 GitLab Bot 🤖