
Prevent int4 out-of-range problem in events.id, push_event_payloads.event_id, and ci_build_trace_sections.id – Stage 1 of 2

Nikolay Samokhvalov requested to merge nik_int4_to_int8 into master

What does this MR do?

The goal of this MR is to convert int4 PK in

  • events table (events.id),
  • related column "push_event_payloads"."event_id",
  • and PK in ci_build_trace_sections table (ci_build_trace_sections.id)

to int8 without downtime.

Table sizes (as of January 2019)

Table | Rows ▼ | Total size | Table size | Index(es) Size | TOAST Size
*** TOTAL *** | ~4B | 2698 GB (100.00%) | 1273 GB (100.00%) | 898 GB (100.00%) | 526 GB (100.00%)
... | ... | ... | ... | ... | ...
ci_build_trace_sections | ~551M | 97 GB (3.59%) | 39 GB (3.09%) | 57 GB (6.39%) | n/a
events | ~282M | 74 GB (2.76%) | 18 GB (1.44%) | 56 GB (6.26%) | 8192 bytes (0.00%)
push_event_payloads | ~217M | 31 GB (1.15%) | 26 GB (2.00%) | 5593 MB (0.61%) | 8192 bytes (0.00%)
... | ... | ... | ... | ... | ...

So this change affects ~25% of all records in the database.

Approach

Among various alternatives, after discussions with @abrandl and @Finotto, the most conservative yet reliable method was chosen. Briefly, it can be described as follows:

  • add new int8 column
  • create unique index on it
  • fill new column with values
  • re-define constraints (primary and foreign keys)

The detailed explanation with SQL code snippets is provided in this comment (other methods considered explained there as well): https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5828#note_132625268
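
As a self-contained illustration of the first three steps (the real migration is Postgres-specific and uses Rails migration helpers; the sketch below uses SQLite via Python's sqlite3 purely for demonstration, and the trigger/loop details are assumptions, not the MR's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY);
INSERT INTO events (id) VALUES (1), (2), (3), (4), (5);

-- step 1: add the new (nullable) column; in Postgres this is a cheap
-- metadata-only change
ALTER TABLE events ADD COLUMN id_new INTEGER;

-- keep freshly inserted rows in sync while the backfill runs
-- (the real MR installs a similar trigger in Postgres)
CREATE TRIGGER events_id_new_sync AFTER INSERT ON events
BEGIN
    UPDATE events SET id_new = NEW.id WHERE id = NEW.id;
END;
""")

# step 3: backfill existing rows in small batches to keep transactions short
BATCH_SIZE = 2
max_id = conn.execute("SELECT MAX(id) FROM events").fetchone()[0]
last_id = 0
while last_id < max_id:
    conn.execute(
        "UPDATE events SET id_new = id "
        "WHERE id > ? AND id <= ? AND id_new IS NULL",
        (last_id, last_id + BATCH_SIZE),
    )
    conn.commit()
    last_id += BATCH_SIZE

# a row inserted after the trigger exists gets id_new filled automatically
conn.execute("INSERT INTO events (id) VALUES (6)")
rows = conn.execute("SELECT id, id_new FROM events ORDER BY id").fetchall()
print(rows)  # [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
```

Step 2 (the unique index) and step 4 (constraint re-definition) are omitted here because they rely on Postgres-only features such as CREATE INDEX CONCURRENTLY.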

Once everything is finished, if the database is significantly large (>>1 GiB), it is recommended (and required for GitLab.com) to run pg_repack to eliminate the heap bloat and re-create indexes on all tables involved. Instructions will be provided.

Discuss and decide: how to deploy it?

Initially, this was implemented in a way requiring a long deployment: step 3, where the new columns get their values (processing is done in chunks of 2000), is expected to take 12-36 hours on GitLab.com.

From Background Migrations documentation:

Since these migrations can take a long time to run it’s possible for new versions to be deployed while they are still running.

This is not what we want here – step 4 must be executed only after step 3 has finished.

So the question is, what is the best approach among the following:

  1. Leave it as is and prepare for the fact that step 4 will last many hours

    • pros: exact order of execution, full control;
    • cons: step 3 is too long (~10-30 hours), which might not be acceptable.
  2. [CHOSEN] Split the whole thing into 2 phases: (1) set up the new column and trigger, and process old rows using post-deployment or background migrations; then, in the next GitLab version, when all new columns already have non-null values, (2) redefine the constraints

    • pros: main part of deployment is shorter;
    • cons: 2 deployments instead of one.
  3. Any other approach to shipping this that is better than the previous two?

Solution and plan

After analysis and discussions, it was decided to split the whole migration into two stages:

  1. [STAGE 1] New column + helper index + background migration

    • add column
    • add trigger to fill new column in freshly inserted rows
    • create helper index
    • schedule background migration to fill new column in existing rows
  2. [STAGE 2] Redefine PK, FK(s) (if any), cleanup // this will be presented in a separate MR; to be released 1 month after this one

    • 🗝 create unique index (basis for the new PK)
    • 🗝 create CHECK constraint with NOT NULL predicate (create invalid constraint + validate)
    • 🗝 if needed, redefine FKs (drop old FKs; create invalid constraint + validate)
    • in a single transaction: 🗝 drop old PK, 🗝 define new PK, rename columns
    • cleanup: drop helper index, drop trigger, drop old column

Here 🗝 marks actions that are required only when a PK column is being converted.

This MR contains only the actions for Stage 1. Stage 2 is to be deployed in a separate MR, 1 month after this one.
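
For illustration, the Stage 2 checklist corresponds to a DDL sequence roughly along these lines, for events.id (a hedged sketch only: all index, constraint, and trigger names here are hypothetical, and the real Stage 2 MR will generate its DDL via Rails migration helpers):

```python
# Ordered DDL that Stage 2 roughly corresponds to, for events.id.
# Every identifier below is an illustrative assumption, not the MR's code.
STAGE2 = [
    # 🗝 unique index: the basis for the new PK (built without long locks)
    "CREATE UNIQUE INDEX CONCURRENTLY index_events_on_id_new ON events (id_new)",
    # 🗝 NOT NULL via CHECK: create as NOT VALID first, then validate,
    # so no long exclusive lock is held
    "ALTER TABLE events ADD CONSTRAINT check_id_new_not_null "
    "CHECK (id_new IS NOT NULL) NOT VALID",
    "ALTER TABLE events VALIDATE CONSTRAINT check_id_new_not_null",
    # 🗝 redefine FKs if needed (for events: push_event_payloads.event_id)
    "ALTER TABLE push_event_payloads ADD CONSTRAINT fk_event_id_new "
    "FOREIGN KEY (event_id_new) REFERENCES events (id_new) NOT VALID",
    "ALTER TABLE push_event_payloads VALIDATE CONSTRAINT fk_event_id_new",
    # single transaction: 🗝 drop old PK, 🗝 define new PK, rename columns
    "BEGIN",
    "ALTER TABLE events DROP CONSTRAINT events_pkey",
    "ALTER TABLE events ADD CONSTRAINT events_pkey "
    "PRIMARY KEY USING INDEX index_events_on_id_new",
    "ALTER TABLE events RENAME COLUMN id TO id_old",
    "ALTER TABLE events RENAME COLUMN id_new TO id",
    "COMMIT",
    # cleanup: drop helper index, trigger, and the old column
    "DROP INDEX CONCURRENTLY index_events_helper",
    "DROP TRIGGER trigger_events_id_new ON events",
    "ALTER TABLE events DROP COLUMN id_old",
]
```

The ordering is the point: the invalid constraint must exist before it can be validated, and the PK swap happens inside a single short transaction, while the lock-heavy operations (index build, validation) happen outside of it.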

Example of sidekiq log:

01:37:05 rails-background-jobs.1 | 2019-03-07T01:37:05.961Z 31365 TID-3rq9vd BackgroundMigrationWorker JID-2394e945057590a498b03f98 INFO: arguments: ["Int4ToInt8Update",["events","id","id_new",120,2500,20,96032410]]
01:37:05 rails-background-jobs.1 | 2019-03-07T01:37:05.966Z 31365 TID-3rq9vd BackgroundMigrationWorker JID-2394e945057590a498b03f98 INFO: done: 0.262 sec

From this log entry we can see that:

  • events table is being processed to fill id_new with values from id,
  • delays between two runs are 120 seconds,
  • 2500 rows are processed in a single batch,
  • in one "shot" (one iteration of processing) we process 20 batches (so 20 * 2500 = 50000 rows),
  • the last processed row (last processed row in the previous iteration) had ID 96032410, and
  • duration of this whole iteration was 262 ms.
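
Decoded positionally, the argument array maps onto the scheduling parameters like this (the field names below are descriptive guesses for readability, not necessarily the names used in the actual code):

```python
# The arguments array from the sidekiq log line above, decoded positionally.
# Field names are descriptive guesses, not the actual parameter names.
args = ["Int4ToInt8Update",
        ["events", "id", "id_new", 120, 2500, 20, 96032410]]

migration_class, (table, source_column, target_column,
                  delay_seconds, batch_size, batches_per_iteration,
                  last_processed_id) = args

rows_per_iteration = batch_size * batches_per_iteration
print(f"{table}: {rows_per_iteration} rows per iteration")
# -> events: 50000 rows per iteration
```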

In this example the following params were used:

  DELAY = 2.minutes.to_i
  BATCH_SIZE = 2_500
  BATCHES_IN_ITERATION = 20
  CONCURRENCY = 5

This gives a processing speed of ~756000 rows per hour. The events table, having 282M rows, will take ~15 days to process, so we might want to adjust the parameters (for example, increasing BATCHES_IN_ITERATION and/or decreasing DELAY) to speed it up.
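
A quick sanity check of that duration estimate (the ~756000 rows/hour figure is the throughput quoted above, taken as given):

```python
# Sanity-check the duration estimate for backfilling the events table.
ROWS_TOTAL = 282_000_000   # ~282M rows in events
ROWS_PER_HOUR = 756_000    # estimated processing speed from the MR

hours = ROWS_TOTAL / ROWS_PER_HOUR
days = hours / 24
print(round(days, 1))  # -> 15.5
```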

What are the relevant issue numbers?

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5828

Does this MR meet the acceptance criteria?

TODO

"Since these migrations can take a long time to run it’s possible for new versions to be deployed while they are still running." -- check / protect from proceeding https://docs.gitlab.com/ee/development/background_migrations.html Looks like background migrations will not help here.

  • Update existing rows in chunks
  • Finalize / cleanup
  • Add if postgres everywhere
  • Handle MySQL separately (regular ALTER)
  • All inline TODOs are done
  • Comments
  • Also convert FK-referencing columns (for events: push_event_payloads.event_id)
  • pg_repack instructions (post-migration notes) moved to stage2
  • abstraction to be able to process other tables/columns (a separate issue?)
  • Same thing for: ci_build_trace_sections
  • progressbar: fix or remove
  • Move helper methods from MigrationHelpers to new place (Int4ToInt8Converter)
  • Specs
  • move "remembering" to the Step 2
  • SPLIT to 2 releases
  • Use background migrations in the Step 3
  • add VACUUM ANALYZE (gentle) before updating existing rows (to update visibility maps and have efficient Index-Only Scans)
  • mysql / schema.rb
    • step 1
    • step 2
Edited by Nikolay Samokhvalov
