ClickHouse migrations fail with TABLE_ALREADY_EXISTS when using Replicated database engine due to non-replicated schema_migrations table
Summary
When using ClickHouse with the Replicated database engine in a multi-node cluster, the gitlab:clickhouse:migrate rake task fails with TABLE_ALREADY_EXISTS errors. This (likely) occurs because the schema_migrations table is created with a non-replicated ReplacingMergeTree engine instead of ReplicatedReplacingMergeTree, causing migration state to be inconsistent across replica nodes.
Steps to reproduce
- Set up a ClickHouse cluster with the
Replicateddatabase engine (3+ nodes) - Configure GitLab to connect to the ClickHouse cluster with load balancing across replicas
- Run gitlab-rake
gitlab:clickhouse:migrate:main - Observe migration failures with
TABLE_ALREADY_EXISTSerrors
Example error
== 20240702162000 CreateCiFinishedPipelinesTable: migrating ===================
rake aborted!
StandardError: An error has occurred, all later migrations canceled:
{
"exception": "Code: 57. DB::Exception: Table gitlab_clickhouse_main_replicated.ci_finished_pipelines already exists. (TABLE_ALREADY_EXISTS) (version 25.10.1.3832 (official build))"
}
Current behavior
The schema_migrations table is created with a non-replicated engine:
CREATE TABLE gitlab_clickhouse_main_replicated.schema_migrations
(
`version` LowCardinality(String),
`active` UInt8 DEFAULT 1,
`applied_at` DateTime64(6, 'UTC') DEFAULT now64()
)
ENGINE = ReplacingMergeTree(applied_at)
PRIMARY KEY version
ORDER BY version
This causes:
- Migration records only exist on one replica node (whichever node received the initial migration)
- When GitLab connects to a different replica, it doesn't see the migration records
- GitLab attempts to re-run migrations, causing
TABLE_ALREADY_EXISTSerrors
Expected behavior
When using a Replicated database engine, the schema_migrations table should be created with ReplicatedReplacingMergeTree engine to ensure that the migration state is consistent across all replica nodes.
Root cause
In lib/click_house/migration_support/schema_migration.rb, the ensure_table method creates the schema_migrations table by calling connection.execute(query) directly, which bypasses the ReplicatedTableEnginePatcher that is used in lib/click_house/migration.rb for regular migration files.
Duo Proposed solution
Apply the same replicated engine patching when creating the schema_migrations table:
def ensure_table
return if connection.table_exists?(table_name)
query = <<~SQL
CREATE TABLE #{table_name} (
version LowCardinality(String),
active UInt8 NOT NULL DEFAULT 1,
applied_at DateTime64(6, 'UTC') NOT NULL DEFAULT now64()
)
ENGINE = ReplacingMergeTree(applied_at)
PRIMARY KEY(version)
ORDER BY (version)
SQL
# Apply replicated engine patching if using Replicated database
query = ClickHouse::ReplicatedTableEnginePatcher.patch_replicated(query) if connection.replicated_engine?
connection.execute(query)
end
Current workaround
Temporarily remove replica nodes without schema_migrations data from the load balancer, run migrations against the node that has the data, then re-add the nodes. This is operationally disruptive and defeats the purpose of having a replicated cluster.
Environment
- ClickHouse version: 25.10.1.3832
- ClickHouse setup: Multi-node cluster with
Replicateddatabase engine - Deployment type: Self-managed
Impact
This bug prevents successful GitLab upgrades for users running ClickHouse with replicated databases, which is the recommended setup for high-availability production environments according to the ClickHouse Integration documentation.