Investigate and fix LFK partition sliding problem on staging
The sliding partition strategy is currently broken for the LFK feature on staging. Assumption: the default value of the partition
was not updated properly.
Partitioned table "public.loose_foreign_keys_deleted_records"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------------------------+--------------------------+-----------+----------+----------------------------------------------------------------+----------+--------------+-------------
id | bigint | | not null | nextval('loose_foreign_keys_deleted_records_id_seq'::regclass) | plain | |
partition | bigint | | not null | 49 | plain | |
primary_key_value | bigint | | not null | | plain | |
status | smallint | | not null | 1 | plain | |
created_at | timestamp with time zone | | not null | now() | plain | |
fully_qualified_table_name | text | | not null | | extended | |
consume_after | timestamp with time zone | | | now() | plain | |
cleanup_attempts | smallint | | | 0 | plain | |
Partition key: LIST (partition)
Indexes:
"loose_foreign_keys_deleted_records_pkey" PRIMARY KEY, btree (partition, id)
"index_loose_foreign_keys_deleted_records_for_partitioned_query" btree (partition, fully_qualified_table_name, consume_after, id) WHERE status = 1
Check constraints:
"check_1a541f3235" CHECK (char_length(fully_qualified_table_name) <= 150)
Partitions: gitlab_partitions_dynamic.loose_foreign_keys_deleted_records_49 FOR VALUES IN ('49'),
gitlab_partitions_dynamic.loose_foreign_keys_deleted_records_50 FOR VALUES IN ('50')
In an ideal scenario, when a new partition is created, the default value of the partition
column must be updated to the same value. In this case, the default value should be 50.
The core feature works and it processes deleted records, however new partitions are no longer created which can be a risk on larger installations.
How does the partition switching work?
- Start a transaction.
- Lock the table in
EXCLUSIVE
mode. - Create the new partition(s).
- Query the active partitions, take the last one.
- Change the default value of the
partition
column to the number of the last partition.
The lock on the table should prevent concurrent modification issues. The only suspicious part is the querying of the active partitions. If this happens through a different connection (replica) then we might get stale results and would result in an incorrect partition
value.
Ideas:
- Investigate the LB connection handling and verify that the query for loading the
active_partition
uses the same connection where the transaction was started. - Implement a safety mechanism to automatically correct this problem.