Initialize conversion of events.id to bigint [RUN ALL RSPEC] [RUN AS-IF-FOSS]

Merged Yannis Roussos requested to merge 288004-initialize-events-id-conversion-to-bigint into master

What does this MR do?

Related to #288004 (closed), #292841 (closed) and &5654 (closed)

With this MR we use the migration helper introduced with !49778 (merged) to initialize the conversion of events.id, and of all related foreign keys that reference it, from int to bigint.

  • Create new column events.id_convert_to_bigint, with bigint NOT NULL DEFAULT 0
  • Install sync triggers
  • Create a batched_background_migration using CopyColumnUsingBackgroundMigrationJob jobs to backfill the column
  • Same for the column push_event_payloads.event_id that references events.id
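The initialization step can be sketched roughly as below. This is a sketch only, not the actual diff: the helper names are taken from !49778 and the surrounding details (migration version, down logic) may differ from what this MR ships.

```ruby
# Sketch: assumes the initialize_conversion_of_integer_to_bigint helper
# from !49778; the real migration in this MR may differ in details.
class InitializeConversionOfEventsIdToBigint < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  def up
    # Adds events.id_convert_to_bigint (bigint NOT NULL DEFAULT 0)
    # and installs the BEFORE INSERT OR UPDATE sync trigger.
    initialize_conversion_of_integer_to_bigint :events, :id
  end

  def down
    # Reverses the setup: drop the sync trigger and function, then the column.
    trigger_name = rename_trigger_name(:events, :id, :id_convert_to_bigint)
    remove_rename_triggers_for_postgresql :events, trigger_name
    remove_column :events, :id_convert_to_bigint
  end
end
```

The same helper is invoked for push_event_payloads.event_id, which is why the two initialization migrations in the output below look identical apart from the table and column names.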

Migration Analysis

Previous Analysis from initial iteration

At the moment, events has 921M records (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/snippets/2056787#l001-table-sizes)

While testing with postgres.ai, the only viable size for a single UPDATE seems to be 100 records at a time (the sub-batch size), which takes around 1.5 seconds per update.

Given a worst-case scenario of 1.5 seconds per sub-batch in production, we set the migration to a batch size of 7000 and a sub-batch size of 100 (i.e. 70 sub-batches per job x 1.5 sec --> 105 seconds worst case, with a 2 min delay between jobs)

With the current batch size of 7000 and 921M records in events, we will have to:

  1. Run 131,572 jobs (= 921,000,000 / 7,000)
  2. With a 2 min delay per job, that will require 263,144 minutes (~= 4,385 hours ~= 182 days)

The analysis for push_event_payloads is similar, but as it only has ~665M records (~72% of events), it should finish proportionally sooner (~130 days with the same settings).
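The estimates above can be reproduced with some quick arithmetic (numbers taken directly from the analysis; a sketch for sanity-checking, not production code):

```ruby
# Back-of-the-envelope estimate for the events backfill, using the
# numbers from the analysis above: 921M rows, batch_size 7000,
# sub_batch_size 100, ~1.5s per sub-batch, 2 min delay between jobs.
TOTAL_ROWS     = 921_000_000
BATCH_SIZE     = 7_000
SUB_BATCH_SIZE = 100
SUB_BATCH_SECS = 1.5
JOB_DELAY_MINS = 2

sub_batches_per_job = BATCH_SIZE / SUB_BATCH_SIZE          # 70
worst_case_job_secs = sub_batches_per_job * SUB_BATCH_SECS # 105.0

total_jobs = (TOTAL_ROWS.to_f / BATCH_SIZE).ceil           # 131,572
total_mins = total_jobs * JOB_DELAY_MINS                   # 263,144
total_days = total_mins / 60.0 / 24                        # ~182.7

puts "#{sub_batches_per_job} sub-batches/job, worst case #{worst_case_job_secs}s/job"
puts "#{total_jobs} jobs, ~#{total_days.round(1)} days total"
```

Note that the 2 min inter-job delay, not the 105s of UPDATE time, dominates the total duration, which is why tuning the batch size (and thus the job count) matters more than shaving the per-sub-batch time.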

Query Analysis

As an example of the worst case we found, I am going to pick a sub-batch of 100 records from around the middle of the table:

gitlabhq_production=> SELECT MIN(id), MAX(id) FROM events;
   min    |    max
----------+------------
 71595568 | 1062234093

SELECT id
FROM events
WHERE id < 762234093
ORDER BY id desc
LIMIT 1;
    id
-----------
 762234092

SELECT id
FROM events
WHERE id < 762234093
ORDER BY id desc
LIMIT 1 OFFSET 100;
    id
-----------
 762233991

Queries:

exec ALTER TABLE events ADD COLUMN id_convert_to_bigint bigint NOT NULL DEFAULT 0;

The query has been executed. Duration: 79.000 ms

Find min id for the batch: https://explain.depesz.com/s/laKG

explain SELECT id
FROM events
WHERE id < 762234093
ORDER BY id desc
LIMIT 1 OFFSET 100;

Time: 13.892 ms  
  - planning: 0.202 ms  
  - execution: 13.690 ms  

Find max id for the batch: https://explain.depesz.com/s/owMpw

explain SELECT id
FROM events
WHERE id < 762234093
ORDER BY id desc
LIMIT 1;

Time: 0.307 ms  
  - planning: 0.226 ms  

Worst case found for updating 100 records: https://explain.depesz.com/s/etMG

explain UPDATE events
SET id_convert_to_bigint = id
WHERE id BETWEEN 762233991 AND 762234092

Time: 1.741 s  
  - planning: 0.183 ms  
  - execution: 1.741 s  

Current Approach

Based on the previous analysis, we saw worst-case UPDATE times of 1.5s in database-lab with a sub-batch size of 100 records. With those estimates, the events migration alone would take ~182 days to complete (at the time the analysis was done).

However, we also know that UPDATE queries are particularly slow in database-lab. We expect them to be at least 3x faster on production hardware, and even that is (hopefully) a conservative estimate. Unfortunately, we don't have a good way to do this type of testing on a real production environment.

To solve this, we introduced a new migration framework that tracks execution state in the database and uses a Sidekiq cron worker to run each job. Since the background jobs are not pre-queued, we can adjust the batch_size and sub_batch_size while the migration runs, and the next job will pick up the changes.

This is the first migration using the new framework, so we can tune its configuration and attempt to greatly speed up the int4-to-int8 conversions for some of our largest tables. Due to the unknown nature of the changes and the urgency of migrating the columns on GitLab.com, we're rolling the migration out there first, where we can closely monitor the approach.
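As an illustration, a mid-run adjustment could look like the console session below. The model and column names here are assumptions based on the framework's tables, not code from this MR:

```ruby
# Hypothetical Rails console session: class and attribute names are
# assumed for illustration and may not match the framework exactly.
migration = Gitlab::Database::BackgroundMigration::BatchedMigration.find_by(
  job_class_name: 'CopyColumnUsingBackgroundMigrationJob',
  table_name: 'events'
)

# The next cron-scheduled job reads these values, so the change
# takes effect without re-queueing anything.
migration.update!(batch_size: 14_000, sub_batch_size: 200)
```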

Migration Output

db:migrate:up
rails db:migrate:up VERSION=20210311120152
== 20210311120152 AddMetricsToBatchedBackgroundMigrationJobs: migrating =======
-- add_column(:batched_background_migration_jobs, :metrics, :jsonb, {:null=>false, :default=>{}})
   -> 0.0048s
== 20210311120152 AddMetricsToBatchedBackgroundMigrationJobs: migrated (0.0048s)

rails db:migrate:up VERSION=20210311120153
== 20210311120153 InitializeConversionOfEventsIdToBigint: migrating ===========
-- table_exists?(:events)
   -> 0.0011s
-- column_exists?(:events, :id)
   -> 0.0009s
-- column_exists?(:events, :id)
   -> 0.0009s
-- columns(:events)
   -> 0.0007s
-- add_column(:events, "id_convert_to_bigint", :bigint, {:default=>0, :null=>false})
   -> 0.0016s
-- quote_table_name(:events)
   -> 0.0000s
-- quote_column_name(:id)
   -> 0.0000s
-- quote_column_name("id_convert_to_bigint")
   -> 0.0000s
-- execute("CREATE OR REPLACE FUNCTION trigger_69523443cc10()\nRETURNS trigger AS\n$BODY$\nBEGIN\n  NEW.\"id_convert_to_bigint\" := NEW.\"id\";\n  RETURN NEW;\nEND;\n$BODY$\nLANGUAGE 'plpgsql'\nVOLATILE\n")
   -> 0.0033s
-- execute("DROP TRIGGER IF EXISTS trigger_69523443cc10\nON \"events\"\n")
   -> 0.0006s
-- execute("CREATE TRIGGER trigger_69523443cc10\nBEFORE INSERT OR UPDATE\nON \"events\"\nFOR EACH ROW\nEXECUTE FUNCTION trigger_69523443cc10()\n")
   -> 0.0007s
== 20210311120153 InitializeConversionOfEventsIdToBigint: migrated (0.0178s) ==

rails db:migrate:up VERSION=20210311120154
== 20210311120154 InitializeConversionOfPushEventPayloadsEventIdToBigint: migrating
-- table_exists?(:push_event_payloads)
   -> 0.0010s
-- column_exists?(:push_event_payloads, :event_id)
   -> 0.0009s
-- column_exists?(:push_event_payloads, :event_id)
   -> 0.0010s
-- columns(:push_event_payloads)
   -> 0.0007s
-- add_column(:push_event_payloads, "event_id_convert_to_bigint", :bigint, {:default=>0, :null=>false})
   -> 0.0016s
-- quote_table_name(:push_event_payloads)
   -> 0.0000s
-- quote_column_name(:event_id)
   -> 0.0000s
-- quote_column_name("event_id_convert_to_bigint")
   -> 0.0000s
-- execute("CREATE OR REPLACE FUNCTION trigger_07c94931164e()\nRETURNS trigger AS\n$BODY$\nBEGIN\n  NEW.\"event_id_convert_to_bigint\" := NEW.\"event_id\";\n  RETURN NEW;\nEND;\n$BODY$\nLANGUAGE 'plpgsql'\nVOLATILE\n")
   -> 0.0017s
-- execute("DROP TRIGGER IF EXISTS trigger_07c94931164e\nON \"push_event_payloads\"\n")
   -> 0.0005s
-- execute("CREATE TRIGGER trigger_07c94931164e\nBEFORE INSERT OR UPDATE\nON \"push_event_payloads\"\nFOR EACH ROW\nEXECUTE FUNCTION trigger_07c94931164e()\n")
   -> 0.0007s
== 20210311120154 InitializeConversionOfPushEventPayloadsEventIdToBigint: migrated (0.0139s)

rails db:migrate:up VERSION=20210311120155
== 20210311120155 BackfillEventsIdForBigintConversion: migrating ==============
-- table_exists?(:events)
   -> 0.0011s
-- column_exists?(:events, :id)
   -> 0.0011s
-- column_exists?(:events, :id)
   -> 0.0008s
-- column_exists?(:events, "id_convert_to_bigint")
   -> 0.0008s
== 20210311120155 BackfillEventsIdForBigintConversion: migrated (0.6227s) =====

rails db:migrate:up VERSION=20210311120156
== 20210311120156 BackfillPushEventPayloadEventIdForBigintConversion: migrating
-- table_exists?(:push_event_payloads)
   -> 0.0011s
-- column_exists?(:push_event_payloads, :event_id)
   -> 0.0011s
-- column_exists?(:push_event_payloads, :event_id)
   -> 0.0008s
-- column_exists?(:push_event_payloads, "event_id_convert_to_bigint")
   -> 0.0008s
== 20210311120156 BackfillPushEventPayloadEventIdForBigintConversion: migrated (0.1801s)
db:migrate:down
rails db:migrate:down VERSION=20210311120156
== 20210311120156 BackfillPushEventPayloadEventIdForBigintConversion: reverting
== 20210311120156 BackfillPushEventPayloadEventIdForBigintConversion: reverted (0.0080s)

rails db:migrate:down VERSION=20210311120155
== 20210311120155 BackfillEventsIdForBigintConversion: reverting ==============
== 20210311120155 BackfillEventsIdForBigintConversion: reverted (0.0088s) =====

rails db:migrate:down VERSION=20210311120154
== 20210311120154 InitializeConversionOfPushEventPayloadsEventIdToBigint: reverting
-- execute("DROP TRIGGER IF EXISTS trigger_07c94931164e ON push_event_payloads")
   -> 0.0016s
-- execute("DROP FUNCTION IF EXISTS trigger_07c94931164e()")
   -> 0.0005s
-- remove_column(:push_event_payloads, :event_id_convert_to_bigint)
   -> 0.0009s
== 20210311120154 InitializeConversionOfPushEventPayloadsEventIdToBigint: reverted (0.0031s)

rails db:migrate:down VERSION=20210311120153
== 20210311120153 InitializeConversionOfEventsIdToBigint: reverting ===========
-- execute("DROP TRIGGER IF EXISTS trigger_69523443cc10 ON events")
   -> 0.0018s
-- execute("DROP FUNCTION IF EXISTS trigger_69523443cc10()")
   -> 0.0006s
-- remove_column(:events, :id_convert_to_bigint)
   -> 0.0021s
== 20210311120153 InitializeConversionOfEventsIdToBigint: reverted (0.0046s) ==

rails db:migrate:down VERSION=20210311120152
== 20210311120152 AddMetricsToBatchedBackgroundMigrationJobs: reverting =======
-- remove_column(:batched_background_migration_jobs, :metrics, :jsonb, {:null=>false, :default=>{}})
   -> 0.0024s
== 20210311120152 AddMetricsToBatchedBackgroundMigrationJobs: reverted (0.0056s)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Patrick Bair