Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Staging
Production Change

## Change Summary

Set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable in Staging to enable the new Rails connection handling.

This enables the new Rails connection handling (gitlab-org/gitlab!63816 (merged)) in preparation for multiple database handling. It should have no impact on Rails applications that use a single database connection. However, because this changes a key Rails method the application uses to establish the database connection on startup, we are playing it safe by rolling it out progressively - see gitlab-org/gitlab!63816 (merged) for details.
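For context, the env var gates the Rails 6.1 `legacy_connection_handling` setting. The snippet below is a minimal sketch of how such a toggle is typically wired, assuming the env var simply flips that setting; it is not the literal diff from gitlab-org/gitlab!63816.

```ruby
# config/application.rb - minimal sketch, not the actual MR diff.
# Assumes the env var simply flips the Rails 6.1 setting:
#   true  -> ActiveRecord::ConnectionAdapters::LegacyPoolManager (old behaviour)
#   false -> ActiveRecord::ConnectionAdapters::PoolManager       (new behaviour)
config.active_record.legacy_connection_handling =
  ENV['ENABLE_RAILS_61_CONNECTION_HANDLING'] != 'true'
```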
## Change Details

- Services Impacted - Service::Web, Service::API, Service::Sidekiq
- Change Technician - @tkuah
- Change Reviewer - @hphilipps
- Time tracking - unknown
- Downtime Component - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

1. Merge gitlab-org/gitlab!63816 (merged)
2. Create MRs for VMs and K8s to set the ENABLE_RAILS_61_CONNECTION_HANDLING env var
3. Set label change::in-progress on this issue
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!943 (merged)
2. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/136
3. Roll out the config changes. (Not sure if anything needs to be done here, or if it happens automatically after the above MRs are merged.)
4. Either HUP each Rails process, or wait for the next auto-deploy. (Or does the config rollout HUP the Rails processes? See the console check after this list for verifying a host's environment.)
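A quick way to confirm that a host has picked up the new environment (relevant to step 4) is a console check like the one below. A freshly started console sees the host's current environment immediately; long-running Puma/Sidekiq workers only see it after a restart, which is why the HUP/auto-deploy step exists. This is a sketch, not a documented step:

```ruby
# In a fresh Rails console on the target host:
ENV['ENABLE_RAILS_61_CONNECTION_HANDLING']
# => "true" once the config rollout has landed on this host.
# Long-running Rails processes still need a HUP or redeploy to pick it up.
```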
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. The main check is that the application can still connect to the database and run queries such as `Project.first`. We can verify this by hitting any page on https://staging.gitlab.com, or via a Rails console.
2. Log into a VM / host that has the new env var set.
3. In a new Rails console, check which mode we are in with the following (a combined one-shot check follows these examples):

Old connection handling:

```
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::LegacyPoolManager
```

New connection handling:

```
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::PoolManager
```
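As a one-shot variant of the checks above, the following sketch combines the connectivity and mode checks (`Project` is a standard GitLab model; the expected class assumes the env var is set on the host):

```ruby
# Run in a Rails console on a host with the new env var set.
pool_manager_class = ActiveRecord::Base
  .connection_handler
  .send(:owner_to_pool_manager)
  .values.first.class

puts "Pool manager: #{pool_manager_class}"
# Expect ActiveRecord::ConnectionAdapters::PoolManager under the new handling.

puts "DB reachable: #{!Project.first.nil?}"
# Any trivial query proves the app can still connect to and query the database.
```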
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. Revert the above MRs (set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable to false).
2. HUP each Rails process (see the verification sketch below).
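After the revert and HUP, the same console check from the post-change steps should show the legacy pool manager again:

```ruby
# Sketch: confirm the rollback took effect.
ActiveRecord::Base.connection_handler
  .send(:owner_to_pool_manager)
  .values.first.class
# => ActiveRecord::ConnectionAdapters::LegacyPoolManager
```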
## Monitoring

### Key metrics to observe

- Metric: rails_db_connection_pool component saturation (Rails DB Connection Pool Utilization)
  - Location:
    - https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for web
    - https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for api
    - https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for sidekiq
  - What changes to this metric should prompt a rollback: if the saturation drops to 0, something has gone wrong with DB connections from Rails.
- Metric: Web service error ratio
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
- Metric: Sidekiq service error ratio
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
## Summary of infrastructure changes
- [-] Does this change introduce new compute instances? No
- [-] Does this change re-size any existing compute instances? No
- [-] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.