Schema version check for ClickHouse is preventing new GitLab Dedicated tenants from being created

Context

When support for running ClickHouse migrations was added to the GitLab-Migrations chart (#21436 (closed) and gitlab-org/charts/gitlab!4458 (merged)), a check to ensure that the ClickHouse database schema matches the application version was added to CNG in feat: Add schema version check script for Click... (gitlab-org/build/CNG!2624 - merged) and feat: Add schema version check script for Click... (gitlab-org/build/CNG!2637 - merged). When ClickHouse is enabled for an installation, this schema check prevents webservice pods from starting up if the ClickHouse instance is not accessible, is not configured correctly (missing credentials), or does not have the requisite database.

The schema check is intended to prevent a Rails application from starting up when it cannot connect to a properly configured ClickHouse instance. After start-up, the Rails application makes assumptions about the state of ClickHouse, and users can run into issues when using GitLab if those assumptions do not hold. This schema check was requested by @WarheadsSE when I added support for running ClickHouse migrations to the GitLab-Migrations chart.

The requirement introduced by the schema check is aligned with our ClickHouse setup documentation, where we explicitly ask users to create users, credentials, and a database before adding ClickHouse to GitLab: https://docs.gitlab.com/integration/clickhouse/#run-and-configure-clickhouse

However, the order outlined in the documentation is not the one that is followed by GitLab Dedicated when provisioning a new Dedicated tenant, due to some limitations in GET / Instrumentor.

The tool used by Dedicated (Instrumentor) attempts to deploy webservice pods first, and then initializes and configures ClickHouse. The schema check prevents this deployment from completing successfully. There is currently no environment variable that can be used to side-step this behavior in the schema check.

This issue was discovered when a new Dedicated tenant was created on 2025-10-31. An incident was started by Dedicated team members: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/2107

Root cause analysis

Comment: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/2107#note_2859412491

Dependencies container logs: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/2107#note_2859383391

/home/instrumentor$ kubectl logs gitlab-webservice-default-6bb5b5fbf9-mng94 -n default -c dependencies
Begin parsing .erb templates from /var/opt/gitlab/templates
Writing /srv/gitlab/config/cable.yml
Writing /srv/gitlab/config/click_house.yml
Writing /srv/gitlab/config/database.yml
Writing /srv/gitlab/config/gitlab.yml
Writing /srv/gitlab/config/resque.yml
Writing /srv/gitlab/config/session_store.yml
Begin parsing .tpl templates from /var/opt/gitlab/templates
Copying other config files found in /var/opt/gitlab/templates to /srv/gitlab/config
Copying smtp_settings.rb into /srv/gitlab/config
NOTICE: Bypassing post-migrations for database version checks
Checking: resque.yml, cable.yml
[ClickHouse] INFO: Configuring ClickHouse DB main
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:49.438895 #22]  INFO -- : SELECT version FROM schema_migrations
+ SUCCESS connecting to 'rediss://master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com:6379' from cable.yml, through master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
+ SUCCESS connecting to 'rediss://master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com:6379' from resque.yml, through master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com
Checking: main
Database Schema - main (gitlabhq_production)
WARNING: schema version check bypassed by BYPASS_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:50.517645 #22]  INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:51.528977 #22]  INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:52.539239 #22]  INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:53.549963 #22]  INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main

The error is coming from this line in CNG: https://gitlab.com/gitlab-org/build/CNG/-/blob/5b15f1558128a2dd610f6e59f97967eb3500bfd7/gitlab-rails/scripts/lib/checks/clickhouse.rb#L168

This error causes database_schema_versions to return false, which in turn causes check_schema_version to return false. No BYPASS_* environment variable can currently be used to side-step this behavior: as the logs above show, the check keeps failing and repeating even with BYPASS_CLICKHOUSE_SCHEMA_VERSION='true' set.
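To make the failure mode concrete, here is a minimal Ruby sketch of how such a propagation could look. It is not the actual CNG script: only the method names database_schema_versions and check_schema_version come from the text above, while the client classes, the error class, and the bypass: parameter are hypothetical stand-ins. The control flow is an assumption chosen to be consistent with the log output, and the real script may be structured differently.

```ruby
# Hypothetical sketch, NOT the actual CNG implementation.
# FakeClickHouseError and both client classes are illustrative stand-ins.
class FakeClickHouseError < StandardError; end

# Simulates a ClickHouse instance whose user and credentials do not exist yet.
class UnprovisionedClient
  def select(_sql)
    raise FakeClickHouseError, 'Code: 516. Authentication failed'
  end
end

# Simulates a reachable ClickHouse instance with some applied migrations.
class ProvisionedClient
  def initialize(versions)
    @versions = versions
  end

  def select(_sql)
    @versions
  end
end

# Returns the applied versions, or false when the query itself fails.
def database_schema_versions(client)
  client.select('SELECT version FROM schema_migrations')
rescue FakeClickHouseError => e
  warn "[ClickHouse] FATAL: #{e.message}"
  false
end

# Assumed behavior: the bypass only tolerates pending migrations.
# It does not turn a failed fetch into a passing check.
def check_schema_version(client, expected_versions, bypass: false)
  versions = database_schema_versions(client)
  fetched = versions != false
  pending = expected_versions - (fetched ? versions : [])

  warn "[ClickHouse] INFO: There are #{pending.size} migrations pending."
  warn '[ClickHouse] INFO: schema version check bypassed' if bypass && !pending.empty?

  fetched && (pending.empty? || bypass)
end
```

With expected = %w[20250101 20250102], calling check_schema_version(UnprovisionedClient.new, expected, bypass: true) returns false, whereas check_schema_version(ProvisionedClient.new([]), expected, bypass: true) returns true. In this model the bypass helps only once the credentials and database exist.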

Instrumentor code walk-through

ClickHouse is initialized and configured in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/roles/clickhouse_cloud/tasks/main.yml#L1

… which is called by https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/clickhouse_cloud.yml#L1

… which is in turn called in the first part of the configure stage: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/all.yml#L2

According to Stages | GitLab Dedicated, GET's Ansible runs in the same configure stage as Instrumentor's Ansible. However, it looks like the Toolbox pod is created before ClickHouse is initialized and configured, because the migrations run in the Toolbox pod using a Rake task: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/common/ansible/playbooks/configure/roles/toolbox_pod/tasks/main.yml#L11

I don't know Instrumentor's codebase well enough to figure out why the Ansible play gitlab_charts.yml from GET runs before the ClickHouse initialization playbook.

Comparison with Postgres

The schema check for Postgres fails when the connection to Postgres fails, when the DB has not yet been created, or when the DB has been created and the schema_migrations table exists but no migrations have been executed yet.

Possible solutions

ClickHouse initialization and configuration are both within the clickhouse_cloud Ansible role inside Instrumentor. These could be separated into two different roles, one running before GET's Ansible and one running after it. It would then be possible to create users, credentials, roles, and databases before deploying the webservice pods, and to run migrations after the webservice pods have started, using the Rake task inside the Toolbox pod. Instrumentor would still have to use the BYPASS_* environment variable, because migrations would still be pending when the webservice pods start.

I am not sure if this is technically feasible though, because some concerns were raised about how much control we have over GET and Instrumentor ordering.
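The ordering problem and the proposed split can be illustrated with a small Ruby model. This is only a sketch under stated assumptions: the step names (clickhouse_init, deploy_gitlab_chart, clickhouse_migrate) are hypothetical labels, not real Instrumentor or GET playbook names, and the model assumes the BYPASS_* variable is set so that pending migrations do not block start-up.

```ruby
# Hypothetical model of the provisioning order. Step names are illustrative
# labels and do not correspond to real Instrumentor or GET playbooks.
State = Struct.new(:credentials_ready, :chart_deployed, :migrated)

STEPS = {
  # Creates users, credentials, roles, and databases in ClickHouse.
  clickhouse_init: ->(s) { s.credentials_ready = true },

  # Deploys the webservice and Toolbox pods. Mirrors the schema check:
  # pods cannot start while ClickHouse authentication fails.
  deploy_gitlab_chart: ->(s) {
    raise 'schema check failed: ClickHouse authentication failed' unless s.credentials_ready

    s.chart_deployed = true
  },

  # Runs the ClickHouse migrations through the Rake task in the Toolbox pod.
  clickhouse_migrate: ->(s) {
    raise 'Toolbox pod is not available yet' unless s.chart_deployed

    s.migrated = true
  }
}.freeze

# Runs the steps in the given order and returns the final state.
def provision(order)
  state = State.new(false, false, false)
  order.each { |step| STEPS.fetch(step).call(state) }
  state
end

# Order described in this issue: chart first, then ClickHouse setup.
current_order = %i[deploy_gitlab_chart clickhouse_init clickhouse_migrate]

# Proposed order: initialization split out and moved before the chart.
proposed_order = %i[clickhouse_init deploy_gitlab_chart clickhouse_migrate]
```

In this model, provision(current_order) raises at the chart deployment step, while provision(proposed_order) completes with every field of the state set to true. Whether Instrumentor and GET actually allow this reordering is the open question noted above.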

Links