Corrective actions should be added here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
...
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
...
How many customers were affected?
...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
...
What were the root causes?
...
Incident Response Analysis
How was the incident detected?
...
How could detection time be improved?
...
How was the root cause diagnosed?
...
How could time to diagnosis be improved?
...
How did we reach the point where we knew how to mitigate the impact?
...
How could time to mitigation be improved?
...
What went well?
...
Post Incident Analysis
Did we have other events in the past with the same root cause?
...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
```
fatal: [deploy-01-sv-gstg.c.gitlab-staging-1.internal]: FAILED! => changed=true
  cmd:
  - /usr/bin/gitlab-rake
  - db:migrate
  delta: '0:00:56.040233'
  end: '2021-06-08 19:57:41.193581'
  msg: non-zero return code
  rc: 1
  start: '2021-06-08 19:56:45.153348'
  stderr: |-
    rake aborted!
    StandardError: An error has occurred, all later migrations canceled:
    PG::QueryCanceled: ERROR: canceling statement due to statement timeout
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:507:in `block in update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `loop'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20210526222715_backfill_draft_status_on_merge_requests.rb:17:in `up'
    /opt/gitlab/embedded/bin/bundle:23:in `load'
    /opt/gitlab/embedded/bin/bundle:23:in `<main>'
    Caused by: ActiveRecord::QueryCanceled: PG::QueryCanceled: ERROR: canceling statement due to statement timeout
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:507:in `block in update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `loop'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20210526222715_backfill_draft_status_on_merge_requests.rb:17:in `up'
    /opt/gitlab/embedded/bin/bundle:23:in `load'
    /opt/gitlab/embedded/bin/bundle:23:in `<main>'
    Caused by: PG::QueryCanceled: ERROR: canceling statement due to statement timeout
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:507:in `block in update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `loop'
    /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:484:in `update_column_in_batches'
    /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20210526222715_backfill_draft_status_on_merge_requests.rb:17:in `up'
    /opt/gitlab/embedded/bin/bundle:23:in `load'
    /opt/gitlab/embedded/bin/bundle:23:in `<main>'
    Tasks: TOP => db:migrate
    (See full trace by running task with --trace)
  stderr_lines: <omitted>
  stdout: |-
    == 20210518074332 ScheduleDisableExpirationPoliciesLinkedToNoContainerImages: migrating
    -- Scheduled 1 DisableExpirationPoliciesLinkedToNoContainerImages jobs with a maximum of 30000 records per batch and an interval of 120 seconds. The migration is expected to take at least 120 seconds. Expect all jobs to have completed after 2021-06-08 19:59:04 UTC."
== 20210518074332 ScheduleDisableExpirationPoliciesLinkedToNoContainerImages: migrated (0.0860s) == 20210526222715 BackfillDraftStatusOnMergeRequests: migrating =============== -- transaction_open?() -> 0.0000s -- index_exists?(:merge_requests, :id, {:where=>"draft = false AND state_id = 1 AND ((title)::text ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'::text)", :name=>"tmp_index_merge_requests_draft_and_status", :algorithm=>:concurrently}) -> 0.0386s -- execute("SET statement_timeout TO 0") -> 0.0005s -- add_index(:merge_requests, :id, {:where=>"draft = false AND state_id = 1 AND ((title)::text ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'::text)", :name=>"tmp_index_merge_requests_draft_and_status", :algorithm=>:concurrently}) -> 10.7479s -- execute("RESET ALL") -> 0.0013s -- transaction_open?() -> 0.0000s -- exec_query("SELECT COUNT(*) AS count FROM \"merge_requests\" WHERE \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 0.0389s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1") -> 0.0240s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 3910 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0239s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 3910 AND \"merge_requests\".\"id\" < 149105 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.5358s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 149105 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0248s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 149105 AND \"merge_requests\".\"id\" < 225739 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.3018s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 225739 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0243s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 225739 AND \"merge_requests\".\"id\" < 304889 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.3157s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" 
WHERE \"merge_requests\".\"id\" >= 304889 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0331s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 304889 AND \"merge_requests\".\"id\" < 381103 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1871s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 381103 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0243s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 381103 AND \"merge_requests\".\"id\" < 466040 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1548s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 466040 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0359s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 466040 AND \"merge_requests\".\"id\" < 549247 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1476s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 549247 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0242s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 549247 AND \"merge_requests\".\"id\" < 610890 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 0.9824s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 610890 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0336s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 610890 AND \"merge_requests\".\"id\" < 683391 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1215s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 683391 AND \"merge_requests\".\"state_id\" = 1 AND 
\"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0343s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 683391 AND \"merge_requests\".\"id\" < 754385 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1210s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 754385 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0272s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 754385 AND \"merge_requests\".\"id\" < 826256 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.1223s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 826256 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0307s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 826256 AND \"merge_requests\".\"id\" < 889295 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 0.9920s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 889295 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0316s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 889295 AND \"merge_requests\".\"id\" < 971969 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.0824s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 971969 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0246s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 971969 AND \"merge_requests\".\"id\" < 1041208 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") -> 1.0063s -- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 1041208 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* 
'^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 100") -> 0.0371s -- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 1041208 AND \"merge_requests\".\"id\" < 1077072 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'") stdout_lines: <omitted>
```
From the output, the SELECT and UPDATE statements are not the ones timing out; it's the fact that the post-migration is trying to update all the data inline that is causing the timeout. This should probably be handled as a background migration.
Execution took 75 minutes on gitlab-com-database-testing; it will probably be faster on production. Still, it may be better to execute those updates asynchronously through a background migration instead of executing them inline.
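For illustration, a minimal sketch of what the suggested asynchronous approach could look like, assuming the `queue_background_migration_jobs_by_range_at_intervals` migration helper and a hypothetical `BackfillDraftStatusOnMergeRequests` background migration job class; this is a sketch, not the MR's actual implementation:

```ruby
# Hypothetical sketch of a post-migration that schedules the backfill as a
# background migration instead of updating all rows inline. Class names,
# batch size, and interval are illustrative, not the shipped change.
class ScheduleBackfillDraftStatusOnMergeRequests < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  MIGRATION = 'BackfillDraftStatusOnMergeRequests' # background migration job class
  BATCH_SIZE = 10_000
  DELAY_INTERVAL = 2.minutes

  disable_ddl_transaction!

  class MergeRequest < ActiveRecord::Base
    include EachBatch
    self.table_name = 'merge_requests'
  end

  def up
    # Enqueue one background job per ID range, spaced DELAY_INTERVAL apart,
    # rather than running every UPDATE inside the deployment's db:migrate step.
    queue_background_migration_jobs_by_range_at_intervals(
      MergeRequest,
      MIGRATION,
      DELAY_INTERVAL,
      batch_size: BATCH_SIZE
    )
  end

  def down
    # no-op
  end
end
```

Each enqueued job would then apply the same draft/WIP backfill to its own ID range, so a slow or failing batch no longer blocks the deployment's migration step.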
@kerrizor when resubmitting the changes, keep in mind that the index is already present on staging but not in the production environment.
From the output, the SELECT and UPDATE statements are not the ones timing out; it's the fact that the post-migration is trying to update all the data inline that is causing the timeout.
PG::QueryCanceled: ERROR: canceling statement due to statement timeout /opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/migration_helpers.rb:507:in `block in update_column_in_batches'
@mayra-cabrera I'm unable to access the job logs either, and it is not very clear to me which statement timed out. I agree it would have been better to have the migration run as a background job, but I am not sure I understand how the inline execution triggers the statement timeout. It would be helpful to understand this, as it will help us make sure we do not end up with statement timeouts in the background migration as well. Thank you.
This is just a guess, but I assume that with production-like data we have hit an update query where the number of rows that need to be scanned for the update is very large, thus triggering a timeout. This is to be confirmed with the actual query if we can get it, e.g. something like the sketch below:
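Purely as an illustration of that guess (the ID bounds below are made up, modeled on the batched statements in the log above), the suspicion is that a single per-batch UPDATE can end up with a very wide ID range to scan before it finds its ~100 matching rows:

```ruby
# Hypothetical example of a single per-batch statement that could exceed the
# statement timeout: the ID bounds are illustrative, but the shape matches the
# batched UPDATEs in the log above. If the range between two matching rows is
# very wide, the UPDATE has to scan a huge number of rows to update only ~100.
suspect_update = <<~'SQL'
  UPDATE "merge_requests"
  SET "draft" = TRUE
  WHERE "merge_requests"."id" >= 102000000
    AND "merge_requests"."id" < 102200000
    AND "merge_requests"."state_id" = 1
    AND "merge_requests"."draft" = FALSE
    AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP'
SQL

# Running EXPLAIN (ANALYZE, BUFFERS) on such a statement against a
# production-size replica would confirm whether the wide range scan is what
# hits the timeout.
puts suspect_update
```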
Yeah, I've also tested the migrations and saw the count(*) over the entire merge_requests table being very fast (on the order of milliseconds on a replica), so I did not think the batch update would cause an issue, but obviously updates are heavier than selects.
I think we should:
1. move the migration to a bg job
2. if it takes longer to improve and test the change to update_column_in_batches, then we can just iterate through the entire merge_requests table in batches of, say, 10_000 and apply the `"merge_requests"."draft" = FALSE AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP'` filtering on each batch, which should prevent timeouts I would think (see the sketch below).
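A minimal sketch of option 2, assuming the `EachBatch` helper; the model definition and batch size are illustrative, not shipped code:

```ruby
# Sketch of option 2: walk the whole merge_requests table in fixed-size ID
# batches and apply the draft/WIP title filter inside each batch, so a single
# UPDATE never has to scan an unbounded range between two matching rows.
WIP_REGEX = '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP'.freeze

class MergeRequest < ActiveRecord::Base
  include EachBatch
  self.table_name = 'merge_requests'
end

MergeRequest.each_batch(of: 10_000) do |batch|
  batch
    .where(state_id: 1, draft: false)
    .where("title ~* ?", WIP_REGEX)
    .update_all(draft: true)
end
```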
it takes longer to improve and test the change to update_column_in_batches then we can just iterate through entire merge_requests table in batches of say 10_000 and apply the
It will take a long time to improve or fork update_column_in_batches - so we should go with the background migration approach, avoiding timeouts.
@a_akgun, @acroitor Good call out! I originally suspected the iteration of merge requests caused the post-migration to fail, but it failed when an UPDATE statement was performed, probably due to data distribution on the merge_requests table.
Just wanted to clarify: does this qualify as an incident, given that I don't think we have access to prod-like data in development, so catching the failure on staging prevented a production incident?
As for corrective actions, besides what @a_akgun suggested above, we could probably catch the failure before it gets to staging if we were to run the migration on a production-like DB (data-size wise) before merging the MR, i.e. while in development. Could we have that as a specific pipeline job for MRs that contain data migrations, though? That might help catch some of these timeout issues while still in development.
we could probably catch the failure before it gets to staging if we were to run the migration on a production-like DB (data-size wise) before merging the MR, i.e. while in development
Actually, I also ran a few loops of the SELECT queries from the select/update pairs of update_column_in_batches on the dbarchive replica while questioning update_column_in_batches.
Also, sometimes migrations could run for days - curious what would happen with 1 & 2.
@acroitor this is one excerpt: `"merge_requests"."id" >= 102644687 AND "merge_requests"."id" < 102851619` - checking the complete raw output, it looks like it's gprd with 103 million merge requests.
Isn't the database testing pipeline already for this purpose?
Right, did we run the testing pipeline before merging the MR? I think it could have helped to catch this.
I don't have access to see that pipeline, but if I understand it right, that pipeline is run on a merged MR on staging, isn't it?
We limit access to the testing pipeline (for maintainers mostly). It has all the details including logs and other artifacts. The main purpose however is to execute migrations on a copy of production data and report back on the original MR with a summary including query statistics etc...
Kicking off the testing pipeline and getting results reported back on the MR is available to everyone. We haven't advertised this broadly yet, because we're still ironing out a few things on the backend before we can open this up. To kick it off, there is a manual job in the gitlab pipeline that triggers it.
Feel free to use this even though it's not widely announced yet.
In the case of this MR, AFAIS the testing pipeline executed only after the merge, and this is a bit unexpected from a testing perspective. This is why the result came back empty at first; see gitlab-org/database-team/gitlab-com-database-testing#11 (closed) for context. Had it run before the merge, we would have seen this result, which shows the high runtime for the UPDATE query (up to 30s) in the summary.
Does it make sense to return the result as red in case the max time for any of the queries goes over 15 seconds? Or maybe, in the job output, we should see a red "Job failed" with an error return instead of a "Job succeeded", if that makes sense.
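A rough sketch of the kind of check being suggested here (illustrative only; the summary data structure is an assumption, and this is not the actual gitlab-com-database-testing code):

```ruby
# Fail the testing job with a red "Job failed" when any query's max time in the
# reported summary exceeds the statement-timeout budget (15s, per the discussion
# above). `query_stats` is an assumed structure, e.g.
# [{ query: "UPDATE merge_requests ...", max_time: 30.2 }, ...].
MAX_QUERY_TIME_SECONDS = 15

def enforce_query_time_budget!(query_stats)
  offenders = query_stats.select { |stat| stat[:max_time] > MAX_QUERY_TIME_SECONDS }
  return if offenders.empty?

  offenders.each do |stat|
    warn "Query exceeded #{MAX_QUERY_TIME_SECONDS}s max time (#{stat[:max_time]}s): #{stat[:query]}"
  end
  abort "Migration testing found #{offenders.size} queries over budget" # non-zero exit => red job
end
```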
Also, not for this case, but for longer migrations it may not be realistic to wait for the database testing job. So somehow I was reluctant to run a database migration over a 100-million-row table. Maybe we should clarify this as a policy.
@a_akgun This is the plan - see gitlab-org/database-team/gitlab-com-database-testing#6 (closed), where we're going to add visuals to indicate a potential problem with the migration. What's currently being displayed is just the success/failure of the migration, not whether it is potentially problematic. cc @alexives who's just picked up the topic
Also not for this case but for longer migrations it could be not realistic to wait for the database testing job.
Re 2 - understood. Those should likely become background migrations anyway. We have plans in gitlab-org/database-team/team-tasks#100 to also support testing background migration jobs. We would randomly pick only a few jobs and execute those instead of the full migration.
That's still some time out though, so for today, feel free to trigger the pipeline regardless. If it takes too long, that's a good indicator that the migration is too long, too.
That's still some time out though, so for today, feel free to trigger the pipeline regardless. If it takes too long, that's a good indicator that the migration is too long, too.
@abrandl another corrective action could be to "well define" what a long post-migration or migration is, and maybe "interrupt" the database testing job and return an error.
@a_akgun That sounds like a good idea and something we might want to consider after getting a start at what a "problematic" migration is in the first place (gitlab-org/database-team/team-tasks#100).
I don't think this is a corrective action to this incident though. Had we executed the pipeline, we would have needed to wait for 1-2 hours to get a result and that's at least not a problem for the runners used in CI testing or the thin clones. Is this a problem while reviewing?
It may be nice for a database maintainer to already have the result from the testing pipeline at hand. We perhaps can transition to having database reviewers or developers kick off the pipeline earlier in the process, so the wait time happens before it's sent to final review. We would need to see how stable our testing backend already is in that regard and if it can take the load...
Just wanted to clarify, does this qualify as an incident given that I don't think we have access to prod like data in development, so catching the failure on staging prevented a production incident?
@acroitor it does qualify, as failed migrations block deployments. Being unable to deploy to GitLab.com puts us in a risky position where we may be unable to deploy changes needed to resolve another problem. As such, a deployment blocker is always treated as an ~s2 incident.
database maintainer to already have the result from the testing pipeline at hand. We perhaps can transition to having database reviewers or developers kick off the pipeline earlier in the process
@abrandl yes, it makes sense in the database review process:
1. for every regular & post-migration we should kick off the testing pipeline,
2. wait for the results and check for general failures, among other things,
3. check the Max Time for every query type and compare it to our current 15 seconds timeout; whenever we see probably 5 (M) seconds or so as max time, we can revisit the queries.
Also, if the total migration time is higher than N minutes, we should move the migration to a background migration.
it does qualify, as failed migrations block deployments. Being unable to deploy to GitLab.com puts us in a risky position where we may be unable to deploy changes needed to resolve another problem. As such, a deployment blocker is always treated as an ~s2 incident.
database maintainer to already have the result from the testing pipeline at hand. We perhaps can transition to having database reviewers or developers kick off the pipeline earlier in the process
If this is going to be a manual job, can we have it as a checklist sub-point in the MR description under the database checklist point? Or at least in the database review guidelines? It gets a bit hard to remember all the things that one needs to check while reviewing an MR.
If this is going to be a manual job, can we have that as a checklist sub-point in the MR description under database checklist point ? or at least in the database review guideline
@acroitor Andreas is the best to comment, but I think we're getting there - so database testing shines as the solution to prevent future incidents.
Let me also add that you had a timing calculation and reduced the batch size to 100, which more or less matches exactly what's happening. Catching that one of the queries will go over 15 seconds is quite a challenge.
The part I'm slightly concerned about is that a background migration instead of a post-migration would prevent an "incident" and a production deploy freeze, which is the topic here. But it wouldn't actually, per se, prevent the core issue: with the same implementation, that error would occur in a background job. A batch size of 10 is now closer to not batching at all. As our tables get larger, I think we're having more trouble running data migrations reliably vs. fast.
What's currently being displayed is just the success/failure of the migration, not that it is potentially problematic.
I'm hoping to have improvements on this front in the next couple of days. Mainly, adding a warning table to the beginning of the comment with information on queries or migrations that exceed guidelines.
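For illustration, the proposed warning table could be generated along these lines (an assumed sketch, not the actual reporting code; the `stats` structure and threshold are hypothetical):

```ruby
# Illustrative sketch of the proposed warning table: list queries that exceed
# the per-query timing guideline at the top of the testing-pipeline comment.
PER_QUERY_GUIDELINE_SECONDS = 0.1

def warning_section(stats)
  offenders = stats.select { |s| s[:max_time] > PER_QUERY_GUIDELINE_SECONDS }
  return '' if offenders.empty?

  rows = offenders.map { |s| "| `#{s[:query][0, 80]}` | #{s[:max_time]}s | #{PER_QUERY_GUIDELINE_SECONDS}s |" }

  <<~MARKDOWN
    #### :warning: Queries exceeding timing guidelines

    | Query | Max time | Guideline |
    | ----- | -------- | --------- |
    #{rows.join("\n")}
  MARKDOWN
end
```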
We perhaps can transition to having database reviewers or developers kick off the pipeline earlier in the process
I think we should update the guidelines for preparing for database review to include kicking off that job. Every developer has access, but I'm not sure they know to do it. Maybe we could have danger post a link to kick off the job even?
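For example, a Danger rule along these lines could post the reminder automatically (an illustrative sketch, not the shipped Dangerfile; paths and wording are assumptions):

```ruby
# Dangerfile sketch: if the MR touches migrations, post a reminder to trigger
# the manual database testing job before database review.
MIGRATION_PATHS = %r{\Adb/(migrate|post_migrate)/}

migration_files = (git.added_files + git.modified_files).grep(MIGRATION_PATHS)

unless migration_files.empty?
  message(
    "This MR contains #{migration_files.size} migration(s). Please trigger the " \
    "manual database testing job on this MR's pipeline so query timings against " \
    "a production-like database are available for database review."
  )
end
```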
It may be nice for a database maintainer to already have the result from the testing pipeline at hand. We perhaps can transition to having database reviewers or developers kick off the pipeline earlier in the process, so the wait time happens before it's sent to final review. We would need to see how stable our testing backend already is in that regard and if it can take the load.
I like this idea. When executing the testing pipeline for the original MR, it took ~70 minutes to complete; having the DB reviewer trigger the pipeline before sending the MR to the maintainer could help save some time.
Maybe we could have danger post a link to kick off the job even?
This is a neat idea. We can also start simple and instruct reviewers to kick off this pipeline if the MR has migrations.
I think we should update the guidelines for preparing for database review to include kicking off that job. Every developer has access, but I'm not sure they know to do it. Maybe we could have danger post a link to kick off the job even?
I definitely did not know about the job. On top of knowing about the job, it would be nice to also have some instructions on what to look for within the job's results.
I wonder if we should figure out whether we can somehow get the logs of our migration deployment jobs bubbled up somewhere they can be seen by engineers. Deployments in general do not contain any Red data. It'd be neat if RMs were not the bottleneck for sharing the results of failed deployments.
A corrective action: when defining the batch size, we can attempt to take the stricter timing guidelines for queries into account:
Single queries in a migration: 100ms -> a batch size of 10, or 5 or less, with the current implementation, so we probably need a new implementation like @acroitor described (see the rough estimate below).
Concurrent operations in a migration: 5min -> this looks quite challenging.
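A rough back-of-envelope check of that batch-size estimate, using the UPDATE timings from the testing-pipeline log above as a reference point (assumed representative):

```ruby
# From the log above, a 100-row batch UPDATE took roughly 1.0-1.5s on a
# production-size copy. To keep a single statement near the 100ms guideline:
seconds_per_100_row_batch = 1.2   # observed, approximate
per_query_budget          = 0.1   # 100ms guideline for a single migration query

max_batch_size = (100 * per_query_budget / seconds_per_100_row_batch).round
puts max_batch_size # => 8, i.e. a batch size of roughly 10 or less
```

Which lines up with the "batch size of 10, or 5 or less" estimate above.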
I think we can consider this incident closed. Thanks for all the discussion and suggestions here. If there's anything else we should add please feel free to re-open this incident.