# job traces and CI - document how to prepare self-managed instances' Redis for upgrades / migrations
## Proposal
Support is working with a large self-managed customer whose PaaS Redis is now insufficient to support their workload.
GitLab team members with access can read more in the ticket. This ticket covers two topics: the Redis maintenance plans, and working around the product performance issues caused by the existing Redis running at maximum CPU load.
The customer needs to:
- Split Redis cache from the existing instance (which then becomes the Redis persistent instance)
- Upgrade their existing Redis to a more recent version. The process for their cloud provider is to export from the existing Redis and import to the new one.
The customer is running object storage and has incremental logging enabled via `Feature.enable(:ci_enable_live_trace)`.
This issue focuses on the Redis upgrade, since the process for separating the cache out is relatively simple and uses existing administrative tools:

- Stop Rails.
- Run `gitlab-rake cache:clear` (to empty cache data from the persistent instance).
- Add configuration to Rails for the cache instance (see the sketch below), then run `gitlab-ctl reconfigure`.
- Start Rails.
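A minimal sketch of that configuration step, assuming the Omnibus `redis_cache_instance` setting and a placeholder URL for the new cache instance:

```ruby
# /etc/gitlab/gitlab.rb - sketch only; the URL is a placeholder.
# Point the Rails cache store at the new, separate Redis instance;
# everything else stays on the existing (persistent) instance.
gitlab_rails['redis_cache_instance'] = "redis://:REDIS_PASSWORD@redis-cache.example.com:6379"
```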
## Risks / Issues
If the export/import fails, the customer will go back into service without any data in their new Redis.
This will not be transparent; user sessions, for example, will be lost. But it's workable.
Job traces may still be stored in Redis, and this data would be lost. The customer would like to avoid this data loss.
Additionally, job traces will make the export/import process slower - their environment makes heavy use of CI.
## Draft plan for a Redis migration
- Create the new Redis instance.
- Drain off CI work on the GitLab instance. See Proposed steps for draining off CI for more detail.
- Stop new jobs from running.
- Ensure traces are archived.
- Shut down GitLab.
- Clear the cache: `gitlab-rake cache:clear`
- Export data from the old Redis.
- Import data into the new Redis (a quick verification sketch follows this list).
- Reconfigure GitLab.
- Start GitLab.
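After the import, it's worth sanity-checking that the data survived the move before reconfiguring GitLab. A minimal sketch, assuming `redis-cli` access to both instances (hostnames are placeholders; the exact export/import steps depend on the cloud provider's tooling):

```shell
# Compare key counts between the old and new Redis instances.
redis-cli -h old-redis.example.com -a "$REDIS_PASSWORD" DBSIZE
redis-cli -h new-redis.example.com -a "$REDIS_PASSWORD" DBSIZE
# INFO keyspace gives a per-database breakdown for a closer look.
redis-cli -h new-redis.example.com -a "$REDIS_PASSWORD" INFO keyspace
```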
## Proposed steps for draining off CI
- During preparation for this change, ensure there are no stuck pipeline schedules. This is part of the workaround for an issue that arises when disabling `pipeline_schedule_worker_cron` - read more in the issue.
- In advance of the main change, stop scheduled pipelines from running. This could be done the previous day, for example - depending on how long your pipelines take.
  - Note: any pipelines with a 'next run' prior to the end of your maintenance window will still run; after that, they go inactive. This step is effective only for pipelines which are scheduled to run quite frequently (the console sketch below can help identify these).
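  To see which schedules would still fire during the window, something like this in the Rails console can help (a sketch; the window end is a placeholder to adjust):

  ```ruby
  # Sketch: list active pipeline schedules due to run before the end of
  # the maintenance window (replace the cutoff with your actual window end).
  window_end = 6.hours.from_now
  Ci::PipelineSchedule.where(active: true).where("next_run_at < ?", window_end).find_each do |s|
    puts "schedule: #{s.id} #{s.description} project: #{s.project_id} next run: #{s.next_run_at}"
  end
  ```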
- Modify `gitlab.rb` on all Rails nodes:

  ```ruby
  gitlab_rails['pipeline_schedule_worker_cron'] = ""
  ```
- Apply with `gitlab-ctl reconfigure` on a rolling basis across all Rails nodes, for example with the loop sketched below.
  - This will clear the Redis cache, and restart both Puma and Sidekiq.
  - Ensure that Sidekiq and Puma have time to restart on each node before doing the next.
  - If you roll out the change too fast, CI jobs will get stuck in 'running' status and job logs won't get uploaded.
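  A minimal rolling-apply sketch, assuming SSH access and that `rails-1` through `rails-3` are placeholder hostnames for your Rails nodes:

  ```shell
  # Sketch only: hostnames and the pause length are placeholders.
  for node in rails-1 rails-2 rails-3; do
    ssh "$node" 'sudo gitlab-ctl reconfigure'
    # Give Puma and Sidekiq time to restart, then confirm before moving on.
    sleep 60
    ssh "$node" 'sudo gitlab-ctl status'
  done
  ```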
- At the start of your change, stop new CI traces from being created by preventing runners from asking for new jobs:
  - Modify `gitlab.rb` on all your Rails/web servers. Add:

    ```ruby
    nginx['custom_gitlab_server_config'] = "location = /api/v4/jobs/request {\n deny all;\n return 503;\n }\n"
    ```

  - Apply with `gitlab-ctl reconfigure`.
  - NGINX will reload automatically; this is quick - you might see a handful of HTTP requests get dropped, but in general it's not noticeable.
  - The runners will then get a `503` when asking for new jobs, but will continue to finish existing jobs and upload logs. See the MR description for more details; a quick check is sketched below.
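    To confirm the block is in place, a request to the endpoint should now return `503` (a sketch; the hostname is a placeholder):

    ```shell
    # Expect "503" once the NGINX rule is active.
    curl -s -o /dev/null -w "%{http_code}\n" -X POST "https://gitlab.example.com/api/v4/jobs/request"
    ```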
- Check that jobs have completed.
  - Monitor running jobs trending towards zero:

    ```ruby
    h = Ci::Build.where(status: 'running').where("started_at > ?", 1.hour.ago).count
    d = Ci::Build.where(status: 'running').where("started_at > ?", 1.day.ago).count
    puts "Running jobs (1 hour): #{h}\nRunning jobs (1 day): #{d}"
    ```
- Put up the deploy page: `gitlab-ctl deploy-page up` on all Rails nodes.
  - This will limit access to the instance to some extent; you can confirm the page is active with the status command below.
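  To confirm the page is actually up on each node, `deploy-page` has a status subcommand:

  ```shell
  # Run on each Rails node to confirm the deploy page is active.
  sudo gitlab-ctl deploy-page status
  ```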
- Use the Rails console to determine whether there are build logs still in Redis, and to count total jobs in object storage:

  ```ruby
  puts "total archived artifacts: #{Ci::JobArtifact.where(file_store: 2).count}\nbuild logs in redis: #{Ci::Build.with_live_trace.count}"
  ```
- Archive any traces that are outstanding:

  ```ruby
  caller = "ArchiveTraceService"
  Ci::Build.with_live_trace.find_each(batch_size: 2).with_index do |build, index|
    puts "job: #{build.id} #{build.name} project: #{Project.find_by_id(build.project_id).name} finished: #{build.finished_at}"
    Ci::ArchiveTraceService.new.execute(build, worker_name: caller)
  end
  ```
## Reverting the change after completing the work
- Remove the config entries from `gitlab.rb`, applying with `gitlab-ctl reconfigure`:

  ```ruby
  gitlab_rails['pipeline_schedule_worker_cron'] = ""
  nginx['custom_gitlab_server_config'] = "location = /api/v4/jobs/request {\n deny all;\n return 503;\n }\n"
  ```
- And take down the deploy page: `gitlab-ctl deploy-page down`
- Finally, fix any scheduled pipelines that are now stuck. Read more in the issue; one possible console approach is sketched below.
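  A sketch of one possible fix; it assumes (verify against the linked issue) that asking each stuck schedule to recalculate its next run time is sufficient:

  ```ruby
  # Assumption: schedule_next_run! recalculates next_run_at from the cron
  # expression; confirm this matches the workaround in the linked issue.
  Ci::PipelineSchedule.where(active: true).where("next_run_at < ?", Time.current).find_each do |schedule|
    puts "rescheduling: #{schedule.id} #{schedule.description}"
    schedule.schedule_next_run!
  end
  ```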