12:19 - jarv declares the incident in Slack using the /incident declare command.
Incident Review
Summary
Service(s) affected:
Team attribution:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
How many customers were affected?
If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
How was the event detected?
How could detection time be improved?
How did we reach the point where we knew how to mitigate the impact?
How could time to mitigation be improved?
Post Incident Analysis
How was the root cause diagnosed?
How could time to diagnosis be improved?
Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
The failing query is the in_batches version of this:
```ruby
blocking_issues_ids = <<-SQL
  SELECT issue_links.source_id AS blocking_issue_id
  FROM issue_links
  INNER JOIN issues ON issue_links.source_id = issues.id
  WHERE issue_links.link_type = 1 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
  UNION
  SELECT issue_links.target_id AS blocking_issue_id
  FROM issue_links
  INNER JOIN issues ON issue_links.target_id = issues.id
  WHERE issue_links.link_type = 2 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
SQL

relation = Issue.where("id IN(#{blocking_issues_ids})") # rubocop:disable GitlabSecurity/SqlInjection
```
This relation was executed in_batches via queue_background_migration_jobs_by_range_at_intervals.
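For context, here is a minimal sketch of what the scheduling post-deploy migration might have looked like, assuming the standard Gitlab::Database::MigrationHelpers batching helper; the migration class, background job class name, delay interval, and batch size are placeholders rather than values taken from the actual migration:

```ruby
# Hypothetical sketch only: class names, interval, and batch size are placeholders.
class ScheduleBlockingIssuesCountBackfill < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  MIGRATION = 'PopulateIssueBlockingIssuesCount' # placeholder job class name
  DELAY_INTERVAL = 2.minutes
  BATCH_SIZE = 1_000

  disable_ddl_transaction!

  # Isolated model so the migration does not depend on the application class;
  # EachBatch is what the batching helper relies on.
  class Issue < ActiveRecord::Base
    include EachBatch

    self.table_name = 'issues'
  end

  def up
    # The UNION query shown in full above, reproduced in compressed form to keep
    # the sketch self-contained.
    blocking_issues_ids = <<-SQL
      SELECT issue_links.source_id AS blocking_issue_id
      FROM issue_links INNER JOIN issues ON issue_links.source_id = issues.id
      WHERE issue_links.link_type = 1 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
      UNION
      SELECT issue_links.target_id AS blocking_issue_id
      FROM issue_links INNER JOIN issues ON issue_links.target_id = issues.id
      WHERE issue_links.link_type = 2 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
    SQL

    relation = Issue.where("id IN(#{blocking_issues_ids})") # rubocop:disable GitlabSecurity/SqlInjection

    # Walking this relation in batches to compute id ranges is where the
    # statement timeout was hit on production.
    queue_background_migration_jobs_by_range_at_intervals(
      relation,
      MIGRATION,
      DELAY_INTERVAL,
      batch_size: BATCH_SIZE
    )
  end

  def down
    # no-op: already-scheduled background jobs are not unscheduled
  end
end
```

If that is roughly the shape of it, the statement timeout was hit by the batching queries issued while scheduling, which matches the later observation that the scheduling logic, rather than the job itself, was the problem.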
Corrective action: For every post-deploy migration that fails, start by reverting the MR that caused it and take it from there. We could automate this step or make it a standard part of our Release Troubleshooting procedure.
This would be a solution for unblocking deploys quickly:
Locate the MR that introduced the post-deploy migration
Delete the migration from the codebase and pick this into auto-deploy
This should be a safe approach for all post-deploy migrations.
It may become a little more complicated when another post-deploy migration depends on the problematic one; in that case we would also have to remove the dependent migrations.
Once we determine this is a correct approach in all cases, we may want to create a runbook for it, so the on-call or delivery team can handle this without needing to rely on dev escalation.
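One detail such a runbook would probably need to cover (hedged sketch below, not an established procedure): whether the failed version ever got recorded in schema_migrations in a given environment. When a migration aborts, Rails does not record its version, so deleting the file is enough there; in environments where it succeeded (like staging here), the fix has to ship under a new version number, which ties into the note further down about not reusing the same schema version. A Rails-console sketch, with a placeholder version number:

```ruby
# Hedged sketch for a Rails console check; the version below is a placeholder,
# not the actual failing migration's version.
version = '20201001000000'

recorded = ActiveRecord::Base.connection.select_value(
  "SELECT 1 FROM schema_migrations WHERE version = '#{version}'"
)

if recorded
  # This environment (e.g. staging) already ran the migration successfully:
  # the fixed migration must use a new version number.
  puts "#{version} is recorded; ship the fix under a new version number."
else
  # The migration never completed here: removing the file from the codebase
  # is enough to unblock deploys.
  puts "#{version} is not recorded; deleting the migration file unblocks deploys."
end
```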
We've decided to remove that migration to unblock the deploy. @abrandl is performing the revert. We'll mark this issue as mitigated once the revert MR is merged.
Corrective action: Given that we've run into failed background migrations quite frequently, one idea @abrandl had was to make this process more efficient, e.g. by auto-reverting failed background migrations.
@felipe_artur Can you please take a look and follow up with a fixed version? Please do not reuse the same schema version number, as this one has already succeeded on staging.
The assumption here is that running the background migration jobs is idempotent, so we can just leave staging in this state. Please let me know if this assumption does not hold.
@abrandl Yes sure. Thanks for taking care of this.
Yes, there should be no problem. We do have some temporary support indexes that were also added in gitlab-org/gitlab!42277 (merged); maybe we can remove them after the fixed version runs?
Yes @felipe_artur, that sounds fine to me. We would ship a fixed version of the post-deploy migration and then remove the temp indexes in a later release, as planned.
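On the idempotency point above, a hypothetical illustration of why re-running the jobs on staging should be harmless: if each job recomputes blocking_issues_count as an absolute value derived from issue_links, executing it a second time over the same id range produces the same end state. The class name and query below are placeholders and deliberately simplified, not the actual background migration:

```ruby
# Illustrative placeholder, not the real background migration: the UPDATE sets
# an absolute value derived from current data, so re-running it over the same
# id range yields the same result (idempotent).
module Gitlab
  module BackgroundMigration
    class PopulateIssueBlockingIssuesCount # placeholder class name
      def perform(start_id, end_id)
        ActiveRecord::Base.connection.execute(<<~SQL)
          -- simplified: only link_type = 1 shown for illustration
          UPDATE issues
          SET blocking_issues_count = grouped.blocking_count
          FROM (
            SELECT issue_links.source_id AS blocking_issue_id, COUNT(*) AS blocking_count
            FROM issue_links
            WHERE issue_links.link_type = 1
            GROUP BY issue_links.source_id
          ) AS grouped
          WHERE issues.id = grouped.blocking_issue_id
            AND issues.id BETWEEN #{Integer(start_id)} AND #{Integer(end_id)}
        SQL
      end
    end
  end
end
```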
This didn't show up on staging because staging is just so much smaller than production in terms of data size; the queries simply succeeded there. The corrective action is to align staging better with production and/or gitlab-org/database-team/team-tasks#82 (closed).
There are a few more issues around this topic (adding as I find more):
Corrective action: The MR review comments indicate the migration had been thoroughly vetted on a dblabs clone. It may be worthwhile to figure out what we can improve with dblabs and how to improve the tooling for testing database migrations.
I did several measurements on postgres.ai and DB-lab. The queries were also executed on a PRD replica where the temporary indexes supporting the query were not present.
If I remember correctly @felipe_artur also did some testing on the PRD rails console.
I think the problem was not running the whole migration locally, just the first few iterations. The UPDATE query from the BG job was taking a long time and my SSH connection was dropping from time to time (I have about 300 ms latency to the postgres.ai server).
Idea: Maybe we should have a dedicated instance with gdk installed, close to the replica, where we can leave these migrations running in a screen session.
We've been discussing similar approaches recently (around gitlab-org/database-team/team-tasks#82 (closed)). It'd be a good start to have a full instance installed that is able to execute a migration fully (with access for db maintainers). We may also want to isolate this instance, e.g. from sending any data to the outside world. cc @craig-gomes since we talked about this yesterday, too.
Regarding "I think the problem was not running the whole migration locally, just the first few iterations": in this case, the problem was the scheduling logic, which gets more involved with background migrations. Ideally, we would have an environment that lets us run full background migrations from it, too.
Alberto Ramos changed title from "Postdeploy migration failure due to statement timeout" to "2020-10-07 Postdeploy migration failure due to statement timeout"
Brent Newton changed title from "2020-10-07 Postdeploy migration failure due to statement timeout" to "2020-10-07: Postdeploy migration failure due to statement timeout"