Migrate "batch 2" of catchall queues into Kubernetes
Production Change
| Change Component | Description |
|---|---|
| Change Objective | We've identified a set of sidekiq queues that can run without issue on Kubernetes. Let's move them off VMs into our Kubernetes infrastructure. |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, sidekiq_shard::CatchAll |
| Change Technician | @skarbek, @jarv |
| Change Criticality | C3 |
| Change Type | changescheduled |
| Change Reviewer | @jarv |
| Due Date | 2020-09-14 18:00 UTC |
| Time tracking | 65 minutes |
| Downtime Component | n/a |
Detailed steps for the change
Overview
- We'll migrate a select set of queues from our catchnfs fleet into Kubernetes
- We start off by first adding the second batch of catchall configuration into Kubernetes
- Once applied in production, Kubernetes will immediately start pulling jobs off these queues
- We then have an MR that effectively removes all queues from the catch* fleet of servers. This is accomplished by:
  - configuring the catchnfs sidekiq to pull work from one specific queue that does not have any work assigned to it
  - configuring the catchall fleet to no longer include the batch 2 queues
- Note that we intermingle `catchall` and `catchnfs`:
  - `catchnfs` is the VM fleet that was utilized for evaluation
  - `catchall` is the shard where these queues will operate from
  - Queues are being migrated from the `catchnfs` VM fleet into the `catchall` Kubernetes sidekiq shard
- This work ONLY encompasses the second batch of queues that were evaluated
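The two-sided reassignment above can be sketched as a pair of config fragments. This is an illustration only: the real Chef role attributes and Helm chart values differ, and every key and queue name here (other than the batch 2 queues themselves) is an assumption, not the actual configuration.

```yaml
# Hypothetical sketch of both sides of the migration; key names are
# illustrative, not the real chart/role schema.

# Kubernetes side: the catchall shard picks up the batch 2 queues.
sidekiq:
  pods:
    - name: catchall
      queues:
        - default
        - delete_stored_files
        # ... remaining batch 2 queues ...

# VM side: catchnfs is pointed at a single queue that receives no work,
# so its workers idle instead of competing for the migrated jobs.
sidekiq_cluster:
  queue_groups:
    - "null_queue"  # hypothetical empty queue; nothing is ever enqueued here
```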
List of Queues in Batch 2
chaos:chaos_cpu_spin
chaos:chaos_db_spin
chaos:chaos_kill
chaos:chaos_leak_mem
chaos:chaos_sleep
default
delete_stored_files
external_service_reactive_caching
object_storage:object_storage_background_move
object_storage:object_storage_migrate_uploads
analytics_code_review_metrics
self_monitoring_project_create
self_monitoring_project_delete
cronjob:import_software_licenses
refresh_license_compliance_checks
auto_devops:auto_devops_disable
gcp_cluster:cluster_configure_istio
gcp_cluster:cluster_install_app
gcp_cluster:cluster_patch_app
gcp_cluster:cluster_provision
gcp_cluster:cluster_update_app
gcp_cluster:cluster_upgrade_app
gcp_cluster:cluster_wait_for_app_update
gcp_cluster:cluster_wait_for_ingress_ip_address
gcp_cluster:clusters_applications_activate_service
gcp_cluster:clusters_applications_deactivate_service
gcp_cluster:clusters_applications_uninstall
gcp_cluster:clusters_cleanup_app
gcp_cluster:clusters_cleanup_project_namespace
gcp_cluster:clusters_cleanup_service_account
gcp_cluster:wait_for_cluster_creation
cronjob:ingress_modsecurity_counter_metrics
cronjob:network_policy_metrics
cronjob:pseudonymizer
propagate_service_template
file_hook
irker
project_service
web_hook
error_tracking_issue_link
incident_management:clusters_applications_check_prometheus_health
incident_management:incident_management_pager_duty_process_incident
incident_management:incident_management_process_alert
status_page_publish
github_import_advance_stage
github_importer:github_import_import_diff_note
github_importer:github_import_import_issue
github_importer:github_import_import_note
github_importer:github_import_import_pull_request
github_importer:github_import_refresh_import_jid
github_importer:github_import_stage_finish_import
github_importer:github_import_stage_import_base_data
github_importer:github_import_stage_import_issues_and_diff_notes
github_importer:github_import_stage_import_notes
github_importer:github_import_stage_import_pull_requests
github_importer:github_import_stage_import_repository
jira_importer:jira_import_advance_stage
jira_importer:jira_import_import_issue
jira_importer:jira_import_stage_finish_import
jira_importer:jira_import_stage_import_attachments
jira_importer:jira_import_stage_import_issues
jira_importer:jira_import_stage_import_labels
jira_importer:jira_import_stage_import_notes
jira_importer:jira_import_stage_start_import
project_import_schedule
dependency_proxy:purge_dependency_proxy_cache
epics:epics_update_epics_dates
create_evidence
create_commit_signature
create_note_diff_file
cronjob:admin_email
cronjob:authorized_project_update_periodic_recalculate
cronjob:repository_archive_cache
cronjob:repository_check_dispatch
cronjob:requests_profiles
cronjob:schedule_migrate_external_diffs
cronjob:stuck_merge_jobs
cronjob:trending_projects
cronjob:update_all_mirrors
cronjob:x509_issuer_crl_check
delete_diff_files
delete_merged_branches
detect_repository_languages
hashed_storage:hashed_storage_migrator
hashed_storage:hashed_storage_project_migrate
hashed_storage:hashed_storage_project_rollback
hashed_storage:hashed_storage_rollbacker
invalid_gpg_signature_update
merge_request_mergeability_check
migrate_external_diffs
project_daily_statistics
rebase
remote_mirror_notification
repository_check:repository_check_batch
repository_check:repository_check_clear
repository_check:repository_check_single_repository
repository_cleanup
repository_fork
repository_remove_remote
repository_update_remote_mirror
system_hook_push
update_namespace_statistics:namespaces_root_statistics
update_namespace_statistics:namespaces_schedule_aggregation
update_project_statistics
x509_certificate_revoke
cronjob:gitlab_usage_ping
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
- Validate that no storage-related errors are reported in Sentry
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Merge MR to reconfigure catchall into Kubernetes:
  - Ensure the manual deployment job for the gprd stage is executed
- Merge MR to remove the specific queues from VMs:
  - Ensure the publish job for gprd has been executed to apply the role to the chef server
- Execute chef on the catchnfs fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchnfs' 'sudo chef-client' -C 1`
- Execute chef on the catchall fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchall' 'sudo chef-client' -C 1`
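After the chef runs, the Kubernetes side can be spot-checked from the command line. These commands are a sketch: the namespace and deployment names are assumptions based on the `sidekiq-catchall-v1.*` naming noted in the post-change steps, and must be adjusted to the real cluster layout.

```shell
# Assumed namespace; list deployments for the catchall shard.
kubectl -n gitlab get deployments | grep sidekiq-catchall

# Wait for the new catchall deployment to finish rolling out
# (substitute the actual generated deployment name).
kubectl -n gitlab rollout status deployment/sidekiq-catchall-v1
```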
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Validate that the catchall fleet in Kubernetes is pulling "more" work:
  - A new deployment should be created called `sidekiq-catchall-v1.*`
  - The catchall queue should ramp up the amount of work it is pulling, approximately an additional 20-60 RPS:
    - Metric Name: catchall RPS
    - Dashboard: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- Validate that the catchnfs fleet is no longer performing work:
  - The amount of work being pulled by catchnfs should fall to zero, which can be seen here:
    - Metric Name: catchnfs RPS
    - Dashboard: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
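The two RPS checks above correspond to rate queries over the Sidekiq job-completion counters. The sketch below is a hedged approximation: the exact metric and label names behind the dashboard may differ, and `shard` is an assumed label.

```promql
# catchall should ramp up by roughly 20-60 RPS (metric/label names assumed)
sum(rate(sidekiq_jobs_completed_total{environment="gprd", shard="catchall"}[5m]))

# catchnfs should fall to zero
sum(rate(sidekiq_jobs_completed_total{environment="gprd", shard="catchnfs"}[5m]))
```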
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Revert MR to remove the specific queues from catchnfs:
  - Ensure the publish job for gprd has been executed to apply the role to the chef server
- Execute chef on the catchall fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchall' 'sudo chef-client'`
- Execute chef on the catchnfs fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchnfs' 'sudo chef-client'`
- Revert MR to add the additional catchall queues into Kubernetes:
  - Ensure the manual deployment job for the gprd stage is executed
Monitoring
Key metrics to observe
- Metric: Apdex/Error Ratio
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: A drop in Apdex or a rise in the error ratio above normal levels should prompt a rollback
- Metric: Catchnfs RPS
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: This metric should drop to zero, as no work is assigned to the catchnfs fleet. If it does not, first investigate which queues are being pulled from to determine whether a misconfiguration exists; this is not a blocker for moving forward, but it is worth investigating
Key errors to observe
- We want to find whether any errors are related to storage, or any new errors previously not seen:
  - In Sentry
Logs
- Quick link for log viewing: https://log.gprd.gitlab.net/goto/39ed10baa678754f1f1ed8d794d778ec
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.