12:19 - jarv declares the incident in Slack using the /incident declare command.
Incident Review
Summary
Service(s) affected:
Team attribution:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
How many customers were affected?
If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
How was the event detected?
How could detection time be improved?
How did we reach the point where we knew how to mitigate the impact?
How could time to mitigation be improved?
Post Incident Analysis
How was the root cause diagnosed?
How could time to diagnosis be improved?
Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
The failing query is the in_batches version of this:
```ruby
blocking_issues_ids = <<-SQL
  SELECT issue_links.source_id AS blocking_issue_id
  FROM issue_links
  INNER JOIN issues ON issue_links.source_id = issues.id
  WHERE issue_links.link_type = 1 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
  UNION
  SELECT issue_links.target_id AS blocking_issue_id
  FROM issue_links
  INNER JOIN issues ON issue_links.target_id = issues.id
  WHERE issue_links.link_type = 2 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
SQL

relation = Issue.where("id IN(#{blocking_issues_ids})") # rubocop:disable GitlabSecurity/SqlInjection
```
This relation was executed in_batches via queue_background_migration_jobs_by_range_at_intervals.
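For context, here is a minimal sketch of what the scheduling post-deploy migration might have looked like, assuming the standard Gitlab::Database::MigrationHelpers batching helper; the migration class, background job class name, delay interval, and batch size are placeholders rather than values taken from the actual migration:

```ruby
# Hypothetical sketch only: class names, interval, and batch size are placeholders.
class ScheduleBlockingIssuesCountBackfill < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  MIGRATION = 'PopulateIssueBlockingIssuesCount' # placeholder job class name
  DELAY_INTERVAL = 2.minutes
  BATCH_SIZE = 1_000

  disable_ddl_transaction!

  # Isolated model so the migration does not depend on the application class;
  # EachBatch is what the batching helper relies on.
  class Issue < ActiveRecord::Base
    include EachBatch

    self.table_name = 'issues'
  end

  def up
    # The UNION query shown in full above, reproduced in compressed form to keep
    # the sketch self-contained.
    blocking_issues_ids = <<-SQL
      SELECT issue_links.source_id AS blocking_issue_id
      FROM issue_links INNER JOIN issues ON issue_links.source_id = issues.id
      WHERE issue_links.link_type = 1 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
      UNION
      SELECT issue_links.target_id AS blocking_issue_id
      FROM issue_links INNER JOIN issues ON issue_links.target_id = issues.id
      WHERE issue_links.link_type = 2 AND issues.state_id = 1 AND issues.blocking_issues_count = 0
    SQL

    relation = Issue.where("id IN(#{blocking_issues_ids})") # rubocop:disable GitlabSecurity/SqlInjection

    # Walking this relation in batches to compute id ranges is where the
    # statement timeout was hit on production.
    queue_background_migration_jobs_by_range_at_intervals(
      relation,
      MIGRATION,
      DELAY_INTERVAL,
      batch_size: BATCH_SIZE
    )
  end

  def down
    # no-op: already-scheduled background jobs are not unscheduled
  end
end
```

If that is roughly the shape of it, the statement timeout was hit by the batching queries issued while scheduling, which matches the later observation that the scheduling logic, rather than the job itself, was the problem.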
Corrective action: For every post-deploy migration that fails, start by reverting the MR that caused it and take it from there. We could automate this step or make it a standard part of our Release Troubleshooting procedure.
This would be a solution for unblocking deploys quickly:
Locate the MR that introduced the post-deploy migration
Delete the migration from the codebase and pick this into auto-deploy
This should be a safe approach for all post-deploy migrations.
It may become a little more complicated when another post-deploy migration depends on the problematic one; in that case we would also have to remove the dependent migrations.
Once we determine this is a correct approach in all cases, we may want to create a runbook for it, so the on-call or delivery team can handle this without needing to rely on dev escalation.
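One detail such a runbook would probably need to cover (hedged sketch below, not an established procedure): whether the failed version ever got recorded in schema_migrations in a given environment. When a migration aborts, Rails does not record its version, so deleting the file is enough there; in environments where it succeeded (like staging here), the fix has to ship under a new version number, which ties into the note further down about not reusing the same schema version. A Rails-console sketch, with a placeholder version number:

```ruby
# Hedged sketch for a Rails console check; the version below is a placeholder,
# not the actual failing migration's version.
version = '20201001000000'

recorded = ActiveRecord::Base.connection.select_value(
  "SELECT 1 FROM schema_migrations WHERE version = '#{version}'"
)

if recorded
  # This environment (e.g. staging) already ran the migration successfully:
  # the fixed migration must use a new version number.
  puts "#{version} is recorded; ship the fix under a new version number."
else
  # The migration never completed here: removing the file from the codebase
  # is enough to unblock deploys.
  puts "#{version} is not recorded; deleting the migration file unblocks deploys."
end
```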
We've decided to remove that migration to unblock the deploy. @abrandl is performing the revert. We'll mark this issue as mitigated once the revert MR is merged.
Corrective action: Given that we've run into failed background migrations quite frequently, one idea @abrandl had was to make this process more efficient, e.g. by auto-reverting failed background migrations.
@felipe_artur Can you please take a look and follow up with a fixed version? Please do not reuse the same schema version number, as this one has already succeeded on staging.
The assumption here is that running the background migration jobs is idempotent, so we can just leave staging in this state. Please let me know if this assumption does not hold.
@abrandl Yes sure. Thanks for taking care of this.
Yes, there should be no problem. We do have some temporary support indexes that were also added in gitlab-org/gitlab!42277 (merged); maybe we can remove them after the fixed version runs?
Yes @felipe_artur, that sounds fine to me. We would ship a fixed version of the post-deploy migration and then remove the temp indexes in a later release, as planned.
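On the idempotency point above, a hypothetical illustration of why re-running the jobs on staging should be harmless: if each job recomputes blocking_issues_count as an absolute value derived from issue_links, executing it a second time over the same id range produces the same end state. The class name and query below are placeholders and deliberately simplified, not the actual background migration:

```ruby
# Illustrative placeholder, not the real background migration: the UPDATE sets
# an absolute value derived from current data, so re-running it over the same
# id range yields the same result (idempotent).
module Gitlab
  module BackgroundMigration
    class PopulateIssueBlockingIssuesCount # placeholder class name
      def perform(start_id, end_id)
        ActiveRecord::Base.connection.execute(<<~SQL)
          -- simplified: only link_type = 1 shown for illustration
          UPDATE issues
          SET blocking_issues_count = grouped.blocking_count
          FROM (
            SELECT issue_links.source_id AS blocking_issue_id, COUNT(*) AS blocking_count
            FROM issue_links
            WHERE issue_links.link_type = 1
            GROUP BY issue_links.source_id
          ) AS grouped
          WHERE issues.id = grouped.blocking_issue_id
            AND issues.id BETWEEN #{Integer(start_id)} AND #{Integer(end_id)}
        SQL
      end
    end
  end
end
```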
This didn't show up on staging because staging is just so much smaller than production in terms of data size; the queries simply succeeded there. The corrective action is to align staging better with production and/or gitlab-org/database-team/team-tasks#82 (closed).
There are a few more issues around this topic (adding as I find more):
Corrective action: The MR review comments indicate the migration had been thoroughly vetted on a dblabs clone. It may be worthwhile to figure out what we can improve with dblabs and how to improve the tooling for testing database migrations.
I did several measurements on postgres.ai and DB-lab. The queries were also executed on a PRD replica where the temporary indexes supporting the query were not present.
If I remember correctly @felipe_artur also did some testing on the PRD rails console.
I think the problem was not running the whole migration locally, just the first few iterations. The UPDATE query from the BG job was taking a long time and my SSH connection was dropping from time to time (I have about 300 ms latency to the postgres.ai server).
Idea: Maybe we should have a dedicated instance with gdk installed, close to the replica, where we can leave these migrations running in a screen session.
We've been discussing similar approaches recently (around gitlab-org/database-team/team-tasks#82 (closed)). It'd be a good start to have a full instance installed that is able to execute a migration fully (with access for db maintainers). We may also want to isolate this instance, e.g. from sending any data to the outside world. cc @craig-gomes since we talked about this yesterday, too.
Regarding "I think the problem was not running the whole migration locally, just the first few iterations": in this case, the problem was the scheduling logic, which gets more involved with background migrations. Ideally, we would have an environment that lets us run full background migrations from it, too.
Alberto Ramos changed title from "Postdeploy migration failure due to statement timeout" to "2020-10-07 Postdeploy migration failure due to statement timeout"
Brent Newton changed title from "2020-10-07 Postdeploy migration failure due to statement timeout" to "2020-10-07: Postdeploy migration failure due to statement timeout"