2018-07-06: 11.1.0.rc-any exception request for gitlab-org/gitlab-ce!20331

Exception request

  • Merge request to be considered for picking: gitlab-org/gitlab-ce!20331

Why it needs to be picked

This MR schedules the deletion of a lot of data from merge_request_diff_files. This is by far the biggest table on GitLab.com's production database: https://gitlab.com/gitlab-org/gitlab-ce/issues/37639#note_75321934

We originally had a scheduling migration for this in an earlier RC, but we removed it as the actual job scheduling (not the database queries) meant that the migration took too long to run: #300 (comment 85634399)

https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20331 is an MR that rewrites this to schedule in a more efficient way. This MR was largely ready (perhaps even completely ready; I'm writing this on the 6th) before the freeze, but we decided that it would be better if we could try to deploy it in an isolated fashion from other migrations, to reduce the impact if something does go wrong.

Potential negative impact of picking

There are two main potential problems:

  1. We receive application-level errors from the deleted diff files.
  2. The database deletions happen at too high a rate for our replication to keep up.

For the first item, we have been running very similar code in production since the first RC was deployed, as https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/19670 deletes these rows from the database after an MR is merged. We aren't aware of any reports of complaints from this; the application is designed to fetch diffs from Gitaly if they are not in the database, and display a message if that also fails.

(Sean's tedious example section: visit https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/6195/diffs and look in the performance bar for the request to diffs.json. Notice that this does not perform the diff_between call on Gitaly. Now look at https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/6195/diffs?diff_id=18832227. Because the diffs have been deleted from the database, we can see from diffs.json?diff_id=18832227 that we fetch diffs from Gitaly using diff_between.)

For the second item, you can see in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20331#note_86257782 and the following comments that we have tried very hard to strike the right balance between the speed of the initial migration, the speed of the background migration, and managing replication lag.

If we find that the scheduling migration takes too long, we can stop it, revert it, and deploy another RC.

If we find that the background migration is causing issues, we can run this snippet on a production Rails console:

def remove_background_migration(remove_class)
  background_migration_queue = BackgroundMigrationWorker.sidekiq_options['queue']
  enqueued = Sidekiq::Queue.new(background_migration_queue)
  scheduled = Sidekiq::ScheduledSet.new

  [scheduled, enqueued].each do |queue|
    queue.each do |job|
      migration_class, migration_args = job.args

      next unless job.queue == background_migration_queue
      next unless migration_class == remove_class

      job.delete
    end
  end
end

remove_background_migration(Gitlab::BackgroundMigration::DeleteDiffFiles)

Timings

  • The post-migration should take roughly a minute to run, it just adds an straightforward index and make a single write on Redis (schedules ScheduleDiffFilesDeletion)
  • ScheduleDiffFilesDeletion should schedule around 1200 jobs (DeleteDiffFiles) and should take ~20 minutes to do it
  • Since we'll process DeleteDiffFiles in a 5 minute interval, it should take around 4 days to process all jobs
  • We're controlling deadtuples threadshold to avoid replication lag issues there (more on https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20331#note_86894396)

Release manager sign-off

The release manager needs to provide initial approval for this exception request.

  • Release manager

Mention others as necessary during discussion.

Sign-off for RC (delete this section if patch release)

Upon initial approval, the release manager will then ping and assign the following roles:

  • Engineering team leader
  • Engineering department leader (if different from team leader)
  • Quality leader

If you are the last person to provide initial approval, assign this issue to the VPE for his approval:

  • VPE

Sign-off for patch release (delete this section if RC)

Upon initial approval, the release manager will then ping and assign one more role for final approval:

  • Engineering team leader/Engineering department leader/Quality leader/VPE

After approval

Check that the following is accurate before closing this issue:

  • gitlab-org/gitlab-ce!20331 has the correct milestone and label set so that the release managers will pick it.
  • gitlab-org/gitlab-ce!20331 has a comment with a link to this issue:
Edited by Felipe Cardozo