
Prevent int4 out-of-range problem in events.id, push_event_payloads.event_id, and ci_build_trace_sections.id – Stage 1 of 2

Nikolay Samokhvalov requested to merge nik_int4_to_int8 into master

What does this MR do?

The goal of this MR is to convert int4 PK in

  • events table (events.id),
  • related column "push_event_payloads"."event_id",
  • and PK in ci_build_trace_sections table (ci_build_trace_sections.id)

to int8 without downtime.

Table sizes (as of January 2019)

Table | Rows ▼ | Total size | Table size | Index(es) Size | TOAST Size
*** TOTAL *** | ~4B | 2698 GB (100.00%) | 1273 GB (100.00%) | 898 GB (100.00%) | 526 GB (100.00%)
... | ... | ... | ... | ... | ...
ci_build_trace_sections | ~551M | 97 GB (3.59%) | 39 GB (3.09%) | 57 GB (6.39%) | n/a
events | ~282M | 74 GB (2.76%) | 18 GB (1.44%) | 56 GB (6.26%) | 8192 bytes (0.00%)
push_event_payloads | ~217M | 31 GB (1.15%) | 26 GB (2.00%) | 5593 MB (0.61%) | 8192 bytes (0.00%)
... | ... | ... | ... | ... | ...

So this change affects ~25% of all records in the database.

Approach

Among various alternatives, after discussions with @abrandl and @Finotto, the most conservative yet reliable method was chosen. Briefly, it can be described as follows:

  • add new int8 column
  • create unique index on it
  • fill new column with values
  • re-define constraints (primary and foreign keys)

The detailed explanation with SQL code snippets is provided in this comment (other methods considered explained there as well): https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5828#note_132625268
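
As a self-contained illustration of the first three steps (the real migration is Postgres-specific and uses Rails migration helpers; the sketch below uses SQLite via Python's sqlite3 purely for demonstration, and the trigger/loop details are assumptions, not the MR's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY);
INSERT INTO events (id) VALUES (1), (2), (3), (4), (5);

-- step 1: add the new (nullable) column; in Postgres this is a cheap
-- metadata-only change
ALTER TABLE events ADD COLUMN id_new INTEGER;

-- keep freshly inserted rows in sync while the backfill runs
-- (the real MR installs a similar trigger in Postgres)
CREATE TRIGGER events_id_new_sync AFTER INSERT ON events
BEGIN
    UPDATE events SET id_new = NEW.id WHERE id = NEW.id;
END;
""")

# step 3: backfill existing rows in small batches to keep transactions short
BATCH_SIZE = 2
max_id = conn.execute("SELECT MAX(id) FROM events").fetchone()[0]
last_id = 0
while last_id < max_id:
    conn.execute(
        "UPDATE events SET id_new = id "
        "WHERE id > ? AND id <= ? AND id_new IS NULL",
        (last_id, last_id + BATCH_SIZE),
    )
    conn.commit()
    last_id += BATCH_SIZE

# a row inserted after the trigger exists gets id_new filled automatically
conn.execute("INSERT INTO events (id) VALUES (6)")
rows = conn.execute("SELECT id, id_new FROM events ORDER BY id").fetchall()
print(rows)  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
```

Step 2 (the unique index) and step 4 (constraint re-definition) are omitted here because they rely on Postgres-only features such as CREATE INDEX CONCURRENTLY.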

Once everything is finished, if the database is significantly large (>>1 GiB), it is recommended (and required for GitLab.com) to run pg_repack to eliminate the heap bloat and re-create indexes on all tables involved. Instructions will be provided.

Discuss and decide: how to deploy it?

Initially, this was implemented in a way requiring a long deployment: step 3, where the new columns get their values (processing is done in chunks of 2000), is expected to take 12-36 hours on GitLab.com.

From Background Migrations documentation:

Since these migrations can take a long time to run it’s possible for new versions to be deployed while they are still running.

This is not what we want here – step 4 must be executed only after step 3 has finished.

So the question is, what is the best approach among the following:

  1. Leave it as is and prepare for the fact that step 4 will last many hours

    • pros: exact order of execution, full control;
    • cons: step 3 is too long (~10-30 hours), which might not be acceptable.
  2. [CHOSEN] Split the whole thing into 2 phases: (1) set up the new column and trigger, and process old rows using post-deployment or background migrations; then, in the next GitLab version, when all new columns already have non-null values, (2) redefine the constraints

    • pros: main part of deployment is shorter;
    • cons: 2 deployments instead of one.
  3. Any other approach to shipping this that is better than the previous two?

Solution and plan

After analysis and discussions, it was decided to split the whole migration into two stages:

  1. [STAGE 1] New column + helper index + background migration

    • add column
    • add trigger to fill new column in freshly inserted rows
    • create helper index
    • schedule background migration to fill new column in existing rows
  2. [STAGE 2] Redefine PK, FK(s) (if any), cleanup // this will be presented in a separate MR; to be released 1 month after this one

    • 🗝 create unique index (basis for the new PK)
    • 🗝 create CHECK constraint with NOT NULL predicate (create invalid constraint + validate)
    • 🗝 if needed, redefine FKs (drop old FKs; create invalid constraint + validate)
    • in a single transaction: 🗝 drop old PK, 🗝 define new PK, rename columns
    • cleanup: drop helper index, drop trigger, drop old column

Here 🗝 marks actions that are required only when a PK column is being converted.

This MR contains only the actions for Stage 1. Stage 2 is to be deployed in a separate MR, 1 month after this one.
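
For illustration, the Stage 2 checklist corresponds to a DDL sequence roughly along these lines, for events.id (a hedged sketch only: all index, constraint, and trigger names here are hypothetical, and the real Stage 2 MR will generate its DDL via Rails migration helpers):

```python
# Ordered DDL that Stage 2 roughly corresponds to, for events.id.
# Every identifier below is an illustrative assumption, not the MR's code.
STAGE2 = [
    # 🗝 unique index: the basis for the new PK (built without long locks)
    "CREATE UNIQUE INDEX CONCURRENTLY index_events_on_id_new ON events (id_new)",
    # 🗝 NOT NULL via CHECK: create as NOT VALID first, then validate,
    # so no long exclusive lock is held
    "ALTER TABLE events ADD CONSTRAINT check_id_new_not_null "
    "CHECK (id_new IS NOT NULL) NOT VALID",
    "ALTER TABLE events VALIDATE CONSTRAINT check_id_new_not_null",
    # 🗝 redefine FKs if needed (for events: push_event_payloads.event_id)
    "ALTER TABLE push_event_payloads ADD CONSTRAINT fk_event_id_new "
    "FOREIGN KEY (event_id_new) REFERENCES events (id_new) NOT VALID",
    "ALTER TABLE push_event_payloads VALIDATE CONSTRAINT fk_event_id_new",
    # single transaction: 🗝 drop old PK, 🗝 define new PK, rename columns
    "BEGIN",
    "ALTER TABLE events DROP CONSTRAINT events_pkey",
    "ALTER TABLE events ADD CONSTRAINT events_pkey "
    "PRIMARY KEY USING INDEX index_events_on_id_new",
    "ALTER TABLE events RENAME COLUMN id TO id_old",
    "ALTER TABLE events RENAME COLUMN id_new TO id",
    "COMMIT",
    # cleanup: drop helper index, trigger, and the old column
    "DROP INDEX CONCURRENTLY index_events_helper",
    "DROP TRIGGER trigger_events_id_new ON events",
    "ALTER TABLE events DROP COLUMN id_old",
]
```

The ordering is the point: the invalid constraint must exist before it can be validated, and the PK swap happens inside a single short transaction, while the lock-heavy operations (index build, validation) happen outside of it.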

Example of sidekiq log:

01:37:05 rails-background-jobs.1 | 2019-03-07T01:37:05.961Z 31365 TID-3rq9vd BackgroundMigrationWorker JID-2394e945057590a498b03f98 INFO: arguments: ["Int4ToInt8Update",["events","id","id_new",120,2500,20,96032410]]
01:37:05 rails-background-jobs.1 | 2019-03-07T01:37:05.966Z 31365 TID-3rq9vd BackgroundMigrationWorker JID-2394e945057590a498b03f98 INFO: done: 0.262 sec

From this log entry we can see that:

  • events table is being processed to fill id_new with values from id,
  • delays between two runs are 120 seconds,
  • 2500 rows are processed in a single batch,
  • in one "shot" (one iteration of processing) we process 20 batches (so 20 * 2500 = 50000 rows),
  • the last processed row (last processed row in the previous iteration) had ID 96032410, and
  • duration of this whole iteration was 262 ms.
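
Decoded positionally, the argument array maps onto the scheduling parameters like this (the field names below are descriptive guesses for readability, not necessarily the names used in the actual code):

```python
# The arguments array from the sidekiq log line above, decoded positionally.
# Field names are descriptive guesses, not the actual parameter names.
args = ["Int4ToInt8Update",
        ["events", "id", "id_new", 120, 2500, 20, 96032410]]

migration_class, (table, source_column, target_column,
                  delay_seconds, batch_size, batches_per_iteration,
                  last_processed_id) = args

rows_per_iteration = batch_size * batches_per_iteration
print(f"{table}: {rows_per_iteration} rows per iteration")
# -> events: 50000 rows per iteration
```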

In this example the following params were used:

  DELAY = 2.minutes.to_i
  BATCH_SIZE = 2_500
  BATCHES_IN_ITERATION = 20
  CONCURRENCY = 5

This gives a processing speed of ~756000 rows per hour. The events table, having 282M rows, will take ~15 days to process, so we might want to adjust the parameters (for example, increasing BATCHES_IN_ITERATION and/or decreasing DELAY) to speed it up.
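
A quick sanity check of that duration estimate (the ~756000 rows/hour figure is the throughput quoted above, taken as given):

```python
# Sanity-check the duration estimate for backfilling the events table.
ROWS_TOTAL = 282_000_000   # ~282M rows in events
ROWS_PER_HOUR = 756_000    # estimated processing speed from the MR

hours = ROWS_TOTAL / ROWS_PER_HOUR
days = hours / 24
print(round(days, 1))  # -> 15.5
```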

What are the relevant issue numbers?

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5828

Does this MR meet the acceptance criteria?

TODO

"Since these migrations can take a long time to run it’s possible for new versions to be deployed while they are still running." -- check / protect from proceeding https://docs.gitlab.com/ee/development/background_migrations.html Looks like background migrations will not help here.

  • Update existing rows in chunks
  • Finalize / cleanup
  • Add if postgres everywhere
  • Handle MySQL separately (regular ALTER)
  • All inline TODOs are done
  • Comments
  • Also convert FK-referencing columns (for events: push_event_payloads.event_id)
  • pg_repack instructions (post-migration notes) moved to stage2
  • abstraction to be able to process other tables/columns (a separate issue?)
  • Same thing for: ci_build_trace_sections
  • progressbar: fix or remove
  • Move helper methods from MigrationHelpers to new place (Int4ToInt8Converter)
  • Specs
  • move "remembering" to the Step 2
  • SPLIT to 2 releases
  • Use background migrations in the Step 3
  • add VACUUM ANALYZE (gentle) before updating existing rows (to update visibility maps and have efficient Index-Only Scans)
  • mysql / schema.rb
    • step 1
    • step 2
Edited by Nikolay Samokhvalov
