# job traces and CI - document how to prepare self-managed instances' Redis for upgrades / migrations
## Proposal
Support is working with a large self-managed customer whose PaaS Redis is now insufficient to support their workload.
GitLab team members with access can read more in the ticket. This ticket covers two topics: the Redis maintenance plans, and working around the product performance issues caused by the existing Redis running at maximum CPU load.
The customer needs to:
- Split Redis cache from the existing instance (which then becomes the Redis persistent instance)
- Upgrade their existing Redis to a more recent version. The process for their cloud provider is to export from the existing Redis and import to the new one.
The customer is running object storage and has incremental logging enabled via `Feature.enable(:ci_enable_live_trace)`.
This issue focuses on the Redis upgrade, since the process for separating the cache out is relatively simple and uses existing administrative tools:

- Stop Rails.
- Run `gitlab-rake cache:clear` (to empty cache data from the persistent instance).
- Add configuration to Rails for the cache instance (see the sketch below), then run `gitlab-ctl reconfigure`.
- Start Rails.
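A minimal sketch of that configuration step, assuming the Omnibus `redis_cache_instance` setting and a placeholder URL for the new cache instance:

```ruby
# /etc/gitlab/gitlab.rb - sketch only; the URL is a placeholder.
# Point the Rails cache store at the new, separate Redis instance;
# everything else stays on the existing (persistent) instance.
gitlab_rails['redis_cache_instance'] = "redis://:REDIS_PASSWORD@redis-cache.example.com:6379"
```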
## Risks / Issues
If the export/import fails, the customer will go back into service without any data in their new Redis.
This will not be transparent; user sessions, for example, will be lost. But it's workable.
Job traces may still be stored in Redis, and this data would be lost. The customer would like to avoid this data loss.
Additionally, job traces will make the export/import process slower - their environment makes heavy use of CI.
## Draft plan for a Redis migration
- Create the new Redis instance.
- Drain off CI work on the GitLab instance. See Proposed steps for draining off CI for more detail.
- Stop new jobs from running.
- Ensure traces are archived.
- Shut down GitLab.
- Clear the cache: `gitlab-rake cache:clear`
- Export data from the old Redis.
- Import data into the new Redis (a quick verification sketch follows this list).
- Reconfigure GitLab.
- Start GitLab.
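After the import, it's worth sanity-checking that the data survived the move before reconfiguring GitLab. A minimal sketch, assuming `redis-cli` access to both instances (hostnames are placeholders; the exact export/import steps depend on the cloud provider's tooling):

```shell
# Compare key counts between the old and new Redis instances.
redis-cli -h old-redis.example.com -a "$REDIS_PASSWORD" DBSIZE
redis-cli -h new-redis.example.com -a "$REDIS_PASSWORD" DBSIZE
# INFO keyspace gives a per-database breakdown for a closer look.
redis-cli -h new-redis.example.com -a "$REDIS_PASSWORD" INFO keyspace
```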
## Proposed steps for draining off CI
- During preparation for this change, ensure there are no stuck pipeline schedules. This is part of the workaround for an issue that arises when disabling `pipeline_schedule_worker_cron` - read more in the issue.
- In advance of the main change, stop scheduled pipelines from running. This could be done the previous day, for example - depending on how long your pipelines take.
  - Note: any pipelines with a 'next run' prior to the end of your maintenance window will still run; after that, they go inactive. This step is effective only for pipelines which are scheduled to run quite frequently (the console sketch below can help identify these).
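  To see which schedules would still fire during the window, something like this in the Rails console can help (a sketch; the window end is a placeholder to adjust):

  ```ruby
  # Sketch: list active pipeline schedules due to run before the end of
  # the maintenance window (replace the cutoff with your actual window end).
  window_end = 6.hours.from_now
  Ci::PipelineSchedule.where(active: true).where("next_run_at < ?", window_end).find_each do |s|
    puts "schedule: #{s.id} #{s.description} project: #{s.project_id} next run: #{s.next_run_at}"
  end
  ```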
- Modify `gitlab.rb` on all Rails nodes:

  ```ruby
  gitlab_rails['pipeline_schedule_worker_cron'] = ""
  ```
- Apply with `gitlab-ctl reconfigure` on a rolling basis across all Rails nodes, for example with the loop sketched below.
  - This will clear the Redis cache, and restart both Puma and Sidekiq.
  - Ensure that Sidekiq and Puma have time to restart on each node before doing the next.
  - If you roll out the change too fast, CI jobs will get stuck in 'running' status and job logs won't get uploaded.
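  A minimal rolling-apply sketch, assuming SSH access and that `rails-1` through `rails-3` are placeholder hostnames for your Rails nodes:

  ```shell
  # Sketch only: hostnames and the pause length are placeholders.
  for node in rails-1 rails-2 rails-3; do
    ssh "$node" 'sudo gitlab-ctl reconfigure'
    # Give Puma and Sidekiq time to restart, then confirm before moving on.
    sleep 60
    ssh "$node" 'sudo gitlab-ctl status'
  done
  ```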
- At the start of your change, stop new CI traces from being created by preventing runners from asking for new jobs:
  - Modify `gitlab.rb` on all your Rails/web servers. Add:

    ```ruby
    nginx['custom_gitlab_server_config'] = "location = /api/v4/jobs/request {\n deny all;\n return 503;\n }\n"
    ```

  - Apply with `gitlab-ctl reconfigure`.
  - NGINX will reload automatically; this is quick - you might see a handful of HTTP requests get dropped, but in general it's not noticeable.
  - The runners will then get a `503` when asking for new jobs, but will continue to finish existing jobs and upload logs. See the MR description for more details; a quick check is sketched below.
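    To confirm the block is in place, a request to the endpoint should now return `503` (a sketch; the hostname is a placeholder):

    ```shell
    # Expect "503" once the NGINX rule is active.
    curl -s -o /dev/null -w "%{http_code}\n" -X POST "https://gitlab.example.com/api/v4/jobs/request"
    ```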
- Check that jobs have completed.
  - Monitor running jobs trending towards zero:

    ```ruby
    h = Ci::Build.where(status: 'running').where("started_at > ?", 1.hour.ago).count
    d = Ci::Build.where(status: 'running').where("started_at > ?", 1.day.ago).count
    puts "Running jobs (1 hour): #{h}\nRunning jobs (1 day): #{d}"
    ```
- Put up the deploy page: `gitlab-ctl deploy-page up` on all Rails nodes.
  - This will limit access to the instance to some extent; you can confirm the page is active with the status command below.
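  To confirm the page is actually up on each node, `deploy-page` has a status subcommand:

  ```shell
  # Run on each Rails node to confirm the deploy page is active.
  sudo gitlab-ctl deploy-page status
  ```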
- Use the Rails console to determine whether there are build logs still in Redis, and to count total jobs in object storage:

  ```ruby
  puts "total archived artifacts: #{Ci::JobArtifact.where(file_store: 2).count}\nbuild logs in redis: #{Ci::Build.with_live_trace.count}"
  ```
- Archive any traces that are outstanding:

  ```ruby
  caller = "ArchiveTraceService"
  Ci::Build.with_live_trace.find_each(batch_size: 2).with_index do |build, index|
    puts "job: #{build.id} #{build.name} project: #{Project.find_by_id(build.project_id).name} finished: #{build.finished_at}"
    Ci::ArchiveTraceService.new.execute(build, worker_name: caller)
  end
  ```
## Reverting the change after completing the work
- Remove the config entries from `gitlab.rb`, applying with `gitlab-ctl reconfigure`:

  ```ruby
  gitlab_rails['pipeline_schedule_worker_cron'] = ""
  nginx['custom_gitlab_server_config'] = "location = /api/v4/jobs/request {\n deny all;\n return 503;\n }\n"
  ```
- And take down the deploy page: `gitlab-ctl deploy-page down`
- Finally, fix any scheduled pipelines that are now stuck. Read more in the issue; one possible console approach is sketched below.
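  A sketch of one possible fix; it assumes (verify against the linked issue) that asking each stuck schedule to recalculate its next run time is sufficient:

  ```ruby
  # Assumption: schedule_next_run! recalculates next_run_at from the cron
  # expression; confirm this matches the workaround in the linked issue.
  Ci::PipelineSchedule.where(active: true).where("next_run_at < ?", Time.current).find_each do |schedule|
    puts "rescheduling: #{schedule.id} #{schedule.description}"
    schedule.schedule_next_run!
  end
  ```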