Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Staging
Production Change

## Change Summary

Set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable in Staging to enable the new Rails connection handling.

This enables the new Rails connection handling (gitlab-org/gitlab!63816 (merged)) in preparation for multiple database handling. It should have no impact on Rails applications that use a single database connection. However, because this changes a key Rails method the application uses to establish the database connection on startup, we are playing it safe by rolling it out progressively - see gitlab-org/gitlab!63816 (merged) for details.
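For context, the env var gates the Rails 6.1 `legacy_connection_handling` setting. The snippet below is a minimal sketch of how such a toggle is typically wired, assuming the env var simply flips that setting; it is not the literal diff from gitlab-org/gitlab!63816.

```ruby
# config/application.rb - minimal sketch, not the actual MR diff.
# Assumes the env var simply flips the Rails 6.1 setting:
#   true  -> ActiveRecord::ConnectionAdapters::LegacyPoolManager (old behaviour)
#   false -> ActiveRecord::ConnectionAdapters::PoolManager       (new behaviour)
config.active_record.legacy_connection_handling =
  ENV['ENABLE_RAILS_61_CONNECTION_HANDLING'] != 'true'
```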
## Change Details

- Services Impacted - Service::Web, Service::API, Service::Sidekiq
- Change Technician - @tkuah
- Change Reviewer - @hphilipps
- Time tracking - unknown
- Downtime Component - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

1. Merge gitlab-org/gitlab!63816 (merged)
2. Create MRs for VMs and K8s to set the ENABLE_RAILS_61_CONNECTION_HANDLING env var
3. Set label change::in-progress on this issue
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!943 (merged)
2. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/136
3. Roll out the config changes. (Not sure if anything needs to be done here, or if it happens automatically after the above MRs are merged.)
4. Either HUP each Rails process, or wait for the next auto-deploy. (Or does the config rollout HUP the Rails processes? See the console check after this list for verifying a host's environment.)
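A quick way to confirm that a host has picked up the new environment (relevant to step 4) is a console check like the one below. A freshly started console sees the host's current environment immediately; long-running Puma/Sidekiq workers only see it after a restart, which is why the HUP/auto-deploy step exists. This is a sketch, not a documented step:

```ruby
# In a fresh Rails console on the target host:
ENV['ENABLE_RAILS_61_CONNECTION_HANDLING']
# => "true" once the config rollout has landed on this host.
# Long-running Rails processes still need a HUP or redeploy to pick it up.
```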
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. The main check is that the application can still connect to the database and run queries such as `Project.first`. We can verify this by hitting any page on https://staging.gitlab.com, or via a Rails console.
2. Log into a VM / host that has the new env var set.
3. In a new Rails console, check which mode we are in with the following (a combined one-shot check follows these examples):

Old connection handling:

```
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::LegacyPoolManager
```

New connection handling:

```
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::PoolManager
```
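As a one-shot variant of the checks above, the following sketch combines the connectivity and mode checks (`Project` is a standard GitLab model; the expected class assumes the env var is set on the host):

```ruby
# Run in a Rails console on a host with the new env var set.
pool_manager_class = ActiveRecord::Base
  .connection_handler
  .send(:owner_to_pool_manager)
  .values.first.class

puts "Pool manager: #{pool_manager_class}"
# Expect ActiveRecord::ConnectionAdapters::PoolManager under the new handling.

puts "DB reachable: #{!Project.first.nil?}"
# Any trivial query proves the app can still connect to and query the database.
```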
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

1. Revert the above MRs (set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable to false).
2. HUP each Rails process (see the verification sketch below).
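After the revert and HUP, the same console check from the post-change steps should show the legacy pool manager again:

```ruby
# Sketch: confirm the rollback took effect.
ActiveRecord::Base.connection_handler
  .send(:owner_to_pool_manager)
  .values.first.class
# => ActiveRecord::ConnectionAdapters::LegacyPoolManager
```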
## Monitoring

### Key metrics to observe

- Metric: rails_db_connection_pool component saturation (Rails DB Connection Pool Utilization)
  - Location:
    - https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for web
    - https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for api
    - https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for sidekiq
  - What changes to this metric should prompt a rollback: if the saturation drops to 0, something has gone wrong with DB connections from Rails.
- Metric: Web service error ratio
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
- Metric: Sidekiq service error ratio
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
## Summary of infrastructure changes
- [-] Does this change introduce new compute instances? No
- [-] Does this change re-size any existing compute instances? No
- [-] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? No
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.