Migrate "batch 2" of catchall queues into Kubernetes

Production Change

Change Component Description
Change Objective We've identified that a certain set of Sidekiq queues can run without issue on Kubernetes. Let's move them off of VMs into our Kubernetes infrastructure.
Change Type ConfigurationChange
Services Impacted Service::Sidekiq sidekiq_shard::CatchAll
Change Technician @skarbek, @jarv
Change Criticality C3
Change Type changescheduled
Change Reviewer @jarv
Due Date 2020-09-14 18:00 UTC
Time tracking 65 minutes
Downtime Component n/a

Detailed steps for the change

Overview

  • We'll migrate a select set of queues from our catchnfs fleet into Kubernetes
  • We start by adding the second batch of catchall configuration into Kubernetes
    • Once applied to Production, Kubernetes will immediately start pulling jobs off these queues
  • We then have an MR that effectively removes these queues from the catch* fleet of servers. This is accomplished by:
    • configuring the catchnfs sidekiq to pull work from one specific queue that does not have any work assigned to it
    • removing the batch 2 queues from the catchall fleet's configuration
  • Note that we intermingle catchall and catchnfs:
    • catchnfs is the set of VMs that were utilized for evaluation
    • catchall is the shard where these queues will operate from
    • Queues are being migrated from the catchnfs VM fleet into the catchall Kubernetes sidekiq shard
  • This work ONLY encompasses the second batch of queues that were evaluated
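Concretely, the first step amounts to adding a second catchall queue selection to the Kubernetes sidekiq configuration. A rough sketch of the shape this takes (the key names follow the GitLab Helm chart's `gitlab.sidekiq.pods` structure, but the exact values here are illustrative assumptions, not the real MR):

```yaml
# Illustrative fragment only — the real change lives in our k8s-workloads MRs.
gitlab:
  sidekiq:
    pods:
      - name: catchall
        queues:
          - default
          - rebase
          - web_hook
          # ...remainder of the batch 2 queue list below
```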
List of Queues in Batch 2
chaos:chaos_cpu_spin
chaos:chaos_db_spin
chaos:chaos_kill
chaos:chaos_leak_mem
chaos:chaos_sleep
default
delete_stored_files
external_service_reactive_caching
object_storage:object_storage_background_move
object_storage:object_storage_migrate_uploads
analytics_code_review_metrics
self_monitoring_project_create
self_monitoring_project_delete
cronjob:import_software_licenses
refresh_license_compliance_checks
auto_devops:auto_devops_disable
gcp_cluster:cluster_configure_istio
gcp_cluster:cluster_install_app
gcp_cluster:cluster_patch_app
gcp_cluster:cluster_provision
gcp_cluster:cluster_update_app
gcp_cluster:cluster_upgrade_app
gcp_cluster:cluster_wait_for_app_update
gcp_cluster:cluster_wait_for_ingress_ip_address
gcp_cluster:clusters_applications_activate_service
gcp_cluster:clusters_applications_deactivate_service
gcp_cluster:clusters_applications_uninstall
gcp_cluster:clusters_cleanup_app
gcp_cluster:clusters_cleanup_project_namespace
gcp_cluster:clusters_cleanup_service_account
gcp_cluster:wait_for_cluster_creation
cronjob:ingress_modsecurity_counter_metrics
cronjob:network_policy_metrics
cronjob:pseudonymizer
propagate_service_template
file_hook
irker
project_service
web_hook
error_tracking_issue_link
incident_management:clusters_applications_check_prometheus_health
incident_management:incident_management_pager_duty_process_incident
incident_management:incident_management_process_alert
status_page_publish
github_import_advance_stage
github_importer:github_import_import_diff_note
github_importer:github_import_import_issue
github_importer:github_import_import_note
github_importer:github_import_import_pull_request
github_importer:github_import_refresh_import_jid
github_importer:github_import_stage_finish_import
github_importer:github_import_stage_import_base_data
github_importer:github_import_stage_import_issues_and_diff_notes
github_importer:github_import_stage_import_notes
github_importer:github_import_stage_import_pull_requests
github_importer:github_import_stage_import_repository
jira_importer:jira_import_advance_stage
jira_importer:jira_import_import_issue
jira_importer:jira_import_stage_finish_import
jira_importer:jira_import_stage_import_attachments
jira_importer:jira_import_stage_import_issues
jira_importer:jira_import_stage_import_labels
jira_importer:jira_import_stage_import_notes
jira_importer:jira_import_stage_start_import
project_import_schedule
dependency_proxy:purge_dependency_proxy_cache
epics:epics_update_epics_dates
create_evidence
create_commit_signature
create_note_diff_file
cronjob:admin_email
cronjob:authorized_project_update_periodic_recalculate
cronjob:repository_archive_cache
cronjob:repository_check_dispatch
cronjob:requests_profiles
cronjob:schedule_migrate_external_diffs
cronjob:stuck_merge_jobs
cronjob:trending_projects
cronjob:update_all_mirrors
cronjob:x509_issuer_crl_check
delete_diff_files
delete_merged_branches
detect_repository_languages
hashed_storage:hashed_storage_migrator
hashed_storage:hashed_storage_project_migrate
hashed_storage:hashed_storage_project_rollback
hashed_storage:hashed_storage_rollbacker
invalid_gpg_signature_update
merge_request_mergeability_check
migrate_external_diffs
project_daily_statistics
rebase
remote_mirror_notification
repository_check:repository_check_batch
repository_check:repository_check_clear
repository_check:repository_check_single_repository
repository_cleanup
repository_fork
repository_remove_remote
repository_update_remote_mirror
system_hook_push
update_namespace_statistics:namespaces_root_statistics
update_namespace_statistics:namespaces_schedule_aggregation
update_project_statistics
x509_certificate_revoke
cronjob:gitlab_usage_ping
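Each queue above maps onto a Redis list keyed `queue:<name>` (standard Sidekiq layout), so a quick way to confirm the batch 2 queues are draining on the VMs, or to spot a stuck queue, is to check their lengths directly. A sketch only; the Redis host and auth details are omitted and would come from our production configuration:

```shell
#!/bin/sh
# Sketch: check the depth of a few batch 2 queues in Redis.
# Sidekiq stores each queue as a Redis list under the key "queue:<name>".
queue_key() {
  printf 'queue:%s\n' "$1"
}

for q in default rebase web_hook github_import_advance_stage; do
  # Run manually against the production Redis, e.g.:
  #   redis-cli -h <sidekiq-redis-host> LLEN "$(queue_key "$q")"
  queue_key "$q"
done
```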

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10 minutes

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

Post-Change Steps - steps to be completed after execution of the change

Estimated Time to Complete (mins) - 5 minutes

Rollback

Rollback steps - steps to be taken in the event this change needs to be rolled back

Estimated Time to Complete (mins) - 30 minutes

Monitoring

Key metrics to observe

Key errors to observe

  • We want to identify any errors related to storage, as well as any new errors not previously seen:
  • In Sentry:
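Alongside Sentry, it can help to inspect the new Kubernetes sidekiq pods directly while the change beds in. A minimal sketch, assuming the pods carry `app=sidekiq` and `shard=catchall` labels in a `gitlab` namespace (these names are assumptions, not confirmed values):

```shell
#!/bin/sh
# Sketch: build a label selector for the catchall Sidekiq pods.
# The label keys/values are assumptions about our chart's labeling.
sidekiq_selector() {
  printf 'app=sidekiq,shard=%s' "$1"
}

# Run manually against the production cluster, e.g.:
#   kubectl -n gitlab get pods -l "$(sidekiq_selector catchall)"
#   kubectl -n gitlab logs -l "$(sidekiq_selector catchall)" --tail=100 | grep -i error
sidekiq_selector catchall
echo
```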

Logs

Summary of infrastructure changes

  • Does this change introduce new compute instances? No
  • Does this change re-size any existing compute instances? No
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.
Edited by John Skarbek