Schema version check for ClickHouse is preventing new GitLab Dedicated tenants from being created
Context
When support for running ClickHouse migrations was added to the GitLab-Migrations chart
(#21436 (closed) and
gitlab-org/charts/gitlab!4458 (merged)), a check to ensure that the
ClickHouse database schema matches the application version was added to CNG in
feat: Add schema version check script for Click... (gitlab-org/build/CNG!2624 - merged) and
feat: Add schema version check script for Click... (gitlab-org/build/CNG!2637 - merged). This schema check prevents
webservice pods from starting up when ClickHouse is enabled for an installation, but the ClickHouse instance is
not accessible, not configured correctly (missing credentials), or does not have the requisite
database.
The schema check is intended to prevent a Rails application from starting up when it cannot connect to a properly configured ClickHouse instance. After start-up, the Rails application makes assumptions about the state of ClickHouse, and those assumptions can cause issues for users of GitLab when they do not hold. This schema check was requested by @WarheadsSE when I added support for running ClickHouse migrations to the GitLab-Migrations chart.
The requirement introduced by the schema check is aligned with our ClickHouse setup documentation, where we explicitly ask users to create users, credentials, and a database before adding ClickHouse to GitLab: https://docs.gitlab.com/integration/clickhouse/#run-and-configure-clickhouse
However, the order outlined in the documentation is not the one that is followed by GitLab Dedicated when provisioning a new Dedicated tenant, due to some limitations in GET / Instrumentor.
The tool used by Dedicated (Instrumentor) attempts to deploy webservice pods first, and
then initializes and configures ClickHouse. The schema check prevents this deployment from completing successfully. There
is currently no environment variable that can be used to side-step this behavior in the schema check.
This issue was discovered when a new Dedicated tenant was created on 2025-10-31. An incident was started by Dedicated team members: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/2107
Root cause analysis
Dependencies container logs: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/2107#note_2859383391
/home/instrumentor$ kubectl logs gitlab-webservice-default-6bb5b5fbf9-mng94 -n default -c dependencies
Begin parsing .erb templates from /var/opt/gitlab/templates
Writing /srv/gitlab/config/cable.yml
Writing /srv/gitlab/config/click_house.yml
Writing /srv/gitlab/config/database.yml
Writing /srv/gitlab/config/gitlab.yml
Writing /srv/gitlab/config/resque.yml
Writing /srv/gitlab/config/session_store.yml
Begin parsing .tpl templates from /var/opt/gitlab/templates
Copying other config files found in /var/opt/gitlab/templates to /srv/gitlab/config
Copying smtp_settings.rb into /srv/gitlab/config
NOTICE: Bypassing post-migrations for database version checks
Checking: resque.yml, cable.yml
[ClickHouse] INFO: Configuring ClickHouse DB main
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:49.438895 #22] INFO -- : SELECT version FROM schema_migrations
+ SUCCESS connecting to 'rediss://master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com:6379' from cable.yml, through master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
+ SUCCESS connecting to 'rediss://master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com:6379' from resque.yml, through master.ggekstest-redis.lxepwl.use1.cache.amazonaws.com
Checking: main
Database Schema - main (gitlabhq_production)
WARNING: schema version check bypassed by BYPASS_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:50.517645 #22] INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:51.528977 #22] INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:52.539239 #22] INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
I, [2025-10-31T16:04:53.549963 #22] INFO -- : SELECT version FROM schema_migrations
[ClickHouse] FATAL: Unexpected error while fetching the database versions for ClickHouse main DB: Code: 516. DB::Exception: gitlab: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) (version 25.6.2.6261 (official build))
[ClickHouse] NOTICE: Database has not been initialized yet.
[ClickHouse] INFO: There are 135 migrations pending.
[ClickHouse] INFO: schema version check bypassed by BYPASS_CLICKHOUSE_SCHEMA_VERSION='true'
[ClickHouse] INFO: Checking migration schema state for ClickHouse database main
[ClickHouse] INFO: ClickHouse - Database main
The error is coming from this line in CNG: https://gitlab.com/gitlab-org/build/CNG/-/blob/5b15f1558128a2dd610f6e59f97967eb3500bfd7/gitlab-rails/scripts/lib/checks/clickhouse.rb#L168
This error causes database_schema_versions to return False, which in turn
causes check_schema_version to return False. The logs above show
BYPASS_CLICKHOUSE_SCHEMA_VERSION='true' being applied while the check keeps repeating, and there is no
other BYPASS_* environment variable that can be used to side-step this behavior.
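The failure path described above can be modelled roughly as follows. This is a minimal Python sketch, not the CNG Ruby script: only the names database_schema_versions and check_schema_version come from the text. The fetch_versions callable, the AuthenticationFailed class, the required_versions argument, and the exact point at which the bypass flag is consulted are assumptions made for illustration.

```python
class AuthenticationFailed(Exception):
    """Hypothetical stand-in for ClickHouse's 'Code: 516' authentication error."""


def database_schema_versions(fetch_versions):
    """Return the applied migration versions, or False if they cannot be fetched."""
    try:
        return set(fetch_versions())
    except Exception as error:
        print(f"[ClickHouse] FATAL: Unexpected error while fetching the database versions: {error}")
        return False


def check_schema_version(fetch_versions, required_versions, bypass=False):
    """Return True when the pod may start, False when the check has failed."""
    applied = database_schema_versions(fetch_versions)
    if applied is False:
        # Assumption: a fetch failure fails the check regardless of the bypass flag.
        return False
    pending = set(required_versions) - applied
    if pending and bypass:
        print("[ClickHouse] INFO: schema version check bypassed")
        return True
    return not pending


def unauthenticated():
    """Simulate a ClickHouse instance whose user has not been created yet."""
    raise AuthenticationFailed("Authentication failed: password is incorrect")


print(check_schema_version(unauthenticated, {"20250101"}, bypass=True))
print(check_schema_version(lambda: [], {"20250101"}, bypass=True))
```

In this model the first call returns False even with the bypass enabled, because the versions could not be fetched at all. The second call returns True, because an empty but reachable database only has pending migrations, which the bypass covers.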
Instrumentor code walk-through
ClickHouse is initialized and configured in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/roles/clickhouse_cloud/tasks/main.yml#L1
… which is called by https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/clickhouse_cloud.yml#L1
… which is in turn called in the first part of the configure stage:
https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/aws/configure/ansible/all.yml#L2
According to Stages | GitLab Dedicated, GET's Ansible runs in the same configure stage as
Instrumentor's Ansible. However, the Toolbox pod appears to be created before
ClickHouse is initialized and configured, because the migrations run in the Toolbox pod using a Rake
task: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/instrumentor/-/blob/00451fc8dcd381365c6111067148519f2b02a47e/common/ansible/playbooks/configure/roles/toolbox_pod/tasks/main.yml#L11
I don't know Instrumentor's codebase well enough to figure out why the Ansible play gitlab_charts.yml from GET runs before the ClickHouse initialization playbook.
Comparison with Postgres
The schema check for Postgres fails when the connection to Postgres fails, when the DB has not yet been
created, or when the DB has been created and the schema_migrations table exists but no migrations
have been executed yet.
Possible solutions
ClickHouse initialization and configuration are both within the clickhouse_cloud Ansible role
inside Instrumentor. These could be separated into two different roles which run before and after
GET's Ansible. Then, it would be possible to create users, credentials, roles, and databases
before deploying the webservice pods, and run migrations after starting the webservice pods
using the Rake task inside the Toolbox pod. Instrumentor would still have to use the BYPASS_*
environment variable.
I am not sure if this is technically feasible though, because some concerns were raised about how much control we have over GET and Instrumentor ordering.
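The ordering this proposal relies on can be sketched as a small constraint check. The step names below are hypothetical labels for the phases described above, not Instrumentor or GET identifiers, and the "observed" order is a simplification of what the incident showed rather than something read from Instrumentor's code.

```python
# Hypothetical labels for the three phases discussed in this issue.
INIT = "clickhouse_init"            # users, credentials, roles, databases
DEPLOY = "deploy_gitlab_pods"       # GET's Ansible: webservice and Toolbox pods
MIGRATE = "clickhouse_migrations"   # Rake task run inside the Toolbox pod

# Each pair (a, b) means "a must finish before b starts".
CONSTRAINTS = [
    (INIT, DEPLOY),     # the schema check needs working credentials and a database
    (DEPLOY, MIGRATE),  # migrations run inside the Toolbox pod
]


def violated_constraints(order, constraints=CONSTRAINTS):
    """Return the ordering constraints that the given step order breaks."""
    position = {step: index for index, step in enumerate(order)}
    return [(a, b) for a, b in constraints if position[a] > position[b]]


observed_order = [DEPLOY, INIT, MIGRATE]   # simplified view of the incident
proposed_order = [INIT, DEPLOY, MIGRATE]   # clickhouse_cloud split into two roles

print(violated_constraints(observed_order))
print(violated_constraints(proposed_order))
```

The observed order breaks the first constraint, while the proposed order breaks none. In the proposed order the webservice pods still start against a database with pending migrations, which is why the BYPASS_* environment variable would still be needed.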
Links
- Part of this discussion took place in the MR where this schema check was added: gitlab-org/build/CNG!2624 (comment 2858297074)