Migrate "batch 2" of catchall queues into Kubernetes
Production Change
| Change Component | Description |
|---|---|
| Change Objective | We've identified a set of sidekiq queues that can run without issue on Kubernetes. Let's move them off VMs into our Kubernetes infrastructure. |
| Change Type | ConfigurationChange |
| Services Impacted | Service::Sidekiq, sidekiq_shard::CatchAll |
| Change Technician | @skarbek, @jarv |
| Change Criticality | C3 |
| Change Type | changescheduled |
| Change Reviewer | @jarv |
| Due Date | 2020-09-14 18:00 UTC |
| Time tracking | 65 minutes |
| Downtime Component | n/a |
Detailed steps for the change
Overview
- We'll migrate a select set of queues from our catchnfs fleet into Kubernetes
- We start off by first adding the second batch of catchall configuration into Kubernetes
- Once applied in production, Kubernetes will immediately start pulling jobs off these queues
- We then have an MR that effectively removes all queues from the catch* fleet of servers. This is accomplished by:
  - configuring the catchnfs sidekiq to pull work from one specific queue that does not have any work assigned to it
  - configuring the catchall fleet to no longer include the batch 2 queues
- Note that we intermingle `catchall` and `catchnfs`:
  - `catchnfs` is the VM fleet that was utilized for evaluation
  - `catchall` is the shard where these queues will operate from
  - Queues are being migrated from the `catchnfs` VM fleet into the `catchall` Kubernetes sidekiq shard
- This work ONLY encompasses the second batch of queues that were evaluated
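The two-sided reassignment above can be sketched as a pair of config fragments. This is an illustration only: the real Chef role attributes and Helm chart values differ, and every key and queue name here (other than the batch 2 queues themselves) is an assumption, not the actual configuration.

```yaml
# Hypothetical sketch of both sides of the migration; key names are
# illustrative, not the real chart/role schema.

# Kubernetes side: the catchall shard picks up the batch 2 queues.
sidekiq:
  pods:
    - name: catchall
      queues:
        - default
        - delete_stored_files
        # ... remaining batch 2 queues ...

# VM side: catchnfs is pointed at a single queue that receives no work,
# so its workers idle instead of competing for the migrated jobs.
sidekiq_cluster:
  queue_groups:
    - "null_queue"  # hypothetical empty queue; nothing is ever enqueued here
```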
List of Queues in Batch 2
chaos:chaos_cpu_spin
chaos:chaos_db_spin
chaos:chaos_kill
chaos:chaos_leak_mem
chaos:chaos_sleep
default
delete_stored_files
external_service_reactive_caching
object_storage:object_storage_background_move
object_storage:object_storage_migrate_uploads
analytics_code_review_metrics
self_monitoring_project_create
self_monitoring_project_delete
cronjob:import_software_licenses
refresh_license_compliance_checks
auto_devops:auto_devops_disable
gcp_cluster:cluster_configure_istio
gcp_cluster:cluster_install_app
gcp_cluster:cluster_patch_app
gcp_cluster:cluster_provision
gcp_cluster:cluster_update_app
gcp_cluster:cluster_upgrade_app
gcp_cluster:cluster_wait_for_app_update
gcp_cluster:cluster_wait_for_ingress_ip_address
gcp_cluster:clusters_applications_activate_service
gcp_cluster:clusters_applications_deactivate_service
gcp_cluster:clusters_applications_uninstall
gcp_cluster:clusters_cleanup_app
gcp_cluster:clusters_cleanup_project_namespace
gcp_cluster:clusters_cleanup_service_account
gcp_cluster:wait_for_cluster_creation
cronjob:ingress_modsecurity_counter_metrics
cronjob:network_policy_metrics
cronjob:pseudonymizer
propagate_service_template
file_hook
irker
project_service
web_hook
error_tracking_issue_link
incident_management:clusters_applications_check_prometheus_health
incident_management:incident_management_pager_duty_process_incident
incident_management:incident_management_process_alert
status_page_publish
github_import_advance_stage
github_importer:github_import_import_diff_note
github_importer:github_import_import_issue
github_importer:github_import_import_note
github_importer:github_import_import_pull_request
github_importer:github_import_refresh_import_jid
github_importer:github_import_stage_finish_import
github_importer:github_import_stage_import_base_data
github_importer:github_import_stage_import_issues_and_diff_notes
github_importer:github_import_stage_import_notes
github_importer:github_import_stage_import_pull_requests
github_importer:github_import_stage_import_repository
jira_importer:jira_import_advance_stage
jira_importer:jira_import_import_issue
jira_importer:jira_import_stage_finish_import
jira_importer:jira_import_stage_import_attachments
jira_importer:jira_import_stage_import_issues
jira_importer:jira_import_stage_import_labels
jira_importer:jira_import_stage_import_notes
jira_importer:jira_import_stage_start_import
project_import_schedule
dependency_proxy:purge_dependency_proxy_cache
epics:epics_update_epics_dates
create_evidence
create_commit_signature
create_note_diff_file
cronjob:admin_email
cronjob:authorized_project_update_periodic_recalculate
cronjob:repository_archive_cache
cronjob:repository_check_dispatch
cronjob:requests_profiles
cronjob:schedule_migrate_external_diffs
cronjob:stuck_merge_jobs
cronjob:trending_projects
cronjob:update_all_mirrors
cronjob:x509_issuer_crl_check
delete_diff_files
delete_merged_branches
detect_repository_languages
hashed_storage:hashed_storage_migrator
hashed_storage:hashed_storage_project_migrate
hashed_storage:hashed_storage_project_rollback
hashed_storage:hashed_storage_rollbacker
invalid_gpg_signature_update
merge_request_mergeability_check
migrate_external_diffs
project_daily_statistics
rebase
remote_mirror_notification
repository_check:repository_check_batch
repository_check:repository_check_clear
repository_check:repository_check_single_repository
repository_cleanup
repository_fork
repository_remove_remote
repository_update_remote_mirror
system_hook_push
update_namespace_statistics:namespaces_root_statistics
update_namespace_statistics:namespaces_schedule_aggregation
update_project_statistics
x509_certificate_revoke
cronjob:gitlab_usage_ping
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 10 minutes
- Validate that no storage-related errors are reported in Sentry
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 minutes
- Merge MR to reconfigure catchall into Kubernetes:
  - Ensure the manual deployment job for the gprd stage is executed
- Merge MR to remove the specific queues from VMs:
  - Ensure the publish job for gprd has been executed to apply the role to the chef server
- Execute chef on the catchnfs fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchnfs' 'sudo chef-client' -C 1`
- Execute chef on the catchall fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchall' 'sudo chef-client' -C 1`
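After the chef runs, the Kubernetes side can be spot-checked from the command line. These commands are a sketch: the namespace and deployment names are assumptions based on the `sidekiq-catchall-v1.*` naming noted in the post-change steps, and must be adjusted to the real cluster layout.

```shell
# Assumed namespace; list deployments for the catchall shard.
kubectl -n gitlab get deployments | grep sidekiq-catchall

# Wait for the new catchall deployment to finish rolling out
# (substitute the actual generated deployment name).
kubectl -n gitlab rollout status deployment/sidekiq-catchall-v1
```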
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- Validate that the catchall fleet in Kubernetes is pulling "more" work:
  - A new deployment should be created called `sidekiq-catchall-v1.*`
  - The catchall queue should ramp up the amount of work it is pulling, approximately an additional 20-60 RPS:
    - Metric Name: catchall RPS
    - Dashboard: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
- Validate that the catchnfs fleet is no longer performing work:
  - The amount of work being pulled by catchnfs should fall to zero, which can be seen here:
    - Metric Name: catchnfs RPS
    - Dashboard: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
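The two RPS checks above correspond to rate queries over the Sidekiq job-completion counters. The sketch below is a hedged approximation: the exact metric and label names behind the dashboard may differ, and `shard` is an assumed label.

```promql
# catchall should ramp up by roughly 20-60 RPS (metric/label names assumed)
sum(rate(sidekiq_jobs_completed_total{environment="gprd", shard="catchall"}[5m]))

# catchnfs should fall to zero
sum(rate(sidekiq_jobs_completed_total{environment="gprd", shard="catchnfs"}[5m]))
```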
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 minutes
- Revert MR to remove the specific queues from catchnfs:
  - Ensure the publish job for gprd has been executed to apply the role to the chef server
- Execute chef on the catchall fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchall' 'sudo chef-client'`
- Execute chef on the catchnfs fleet: `knife ssh 'roles:gprd-base-be-sidekiq-catchnfs' 'sudo chef-client'`
- Revert MR to add the additional catchall queues into Kubernetes:
  - Ensure the manual deployment job for the gprd stage is executed
Monitoring
Key metrics to observe
- Metric: Apdex/Error Ratio
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: A drop in Apdex or a rise in the error ratio above normal levels should prompt a rollback
- Metric: Catchnfs RPS
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-3h&to=now
- What changes to this metric should prompt a rollback: This metric should drop to zero, as no work is assigned to the catchnfs fleet. If it does not, first investigate which queues are being pulled from to determine whether a misconfiguration exists; this is not a blocker for moving forward, but it is worth investigating
Key errors to observe
- We want to find whether any errors are related to storage, or any new errors previously not seen:
  - In Sentry
Logs
- Quick link for log viewing: https://log.gprd.gitlab.net/goto/39ed10baa678754f1f1ed8d794d778ec
Summary of infrastructure changes
- Does this change introduce new compute instances? No
- Does this change re-size any existing compute instances? No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.