Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Staging

Production Change

Change Summary

Set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable in Staging to enable the new Rails connection handling (gitlab-org/gitlab!63816 (merged)), in preparation for multiple-database handling. This should have no impact on Rails applications that use a single database connection, but because it changes a key Rails method the application uses to establish the database connection on startup, we are playing it safe by rolling it out progressively. See gitlab-org/gitlab!63816 (merged) for details.
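As a rough sketch of what the flag controls (the actual wiring lives in gitlab-org/gitlab!63816; the helper name and the exact `"true"` comparison here are assumptions for illustration):

```ruby
# Illustrative sketch only: the env variable gates whether Rails keeps its
# legacy (pre-6.1) connection handling. The real implementation is in
# gitlab-org/gitlab!63816; this helper name and the "true" comparison are
# assumptions.
def legacy_connection_handling?(env = ENV)
  # New handling is enabled only when the variable is explicitly "true".
  env["ENABLE_RAILS_61_CONNECTION_HANDLING"] != "true"
end

# In config/application.rb this would feed the Rails 6.1 setting, e.g.:
#   config.active_record.legacy_connection_handling = legacy_connection_handling?
```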
Change Details
- Services Impacted - ServiceWeb ServiceAPI ServiceSidekiq
- Change Technician - @tkuah
- Change Reviewer - @hphilipps
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

- Merge gitlab-org/gitlab!63816 (merged)
- Create MRs for VMs and K8s to set the ENABLE_RAILS_61_CONNECTION_HANDLING env var
- Set the changein-progress label on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - unknown

- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!943 (merged)
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/136
- Roll out the config changes (it is not yet clear whether anything needs to be done here, or whether this happens automatically once the above MRs are merged)
- Either HUP each Rails process or wait for the next auto-deploy, unless the config rollout already HUPs the Rails processes (to be confirmed)
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - unknown

- The main check is that the application can still connect to the database and run queries such as Project.first. We can verify this by hitting any page on https://staging.gitlab.com, or via a Rails console
- Log into a VM / host that has the new env var set
- In a new Rails console, check which mode we are in with the following:

Old connection handling:

```ruby
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::LegacyPoolManager
```

New connection handling:

```ruby
> ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::PoolManager
```
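The two console outputs above differ only in the pool manager class. A tiny helper (purely illustrative, not part of the change) makes the mapping explicit and could be pasted into the console alongside the check:

```ruby
# Map the pool manager class name from the console check to the
# connection-handling mode. Order matters: "LegacyPoolManager" also ends in
# "PoolManager", so the legacy branch must be tested first.
def connection_handling_mode(class_name)
  case class_name.to_s
  when /LegacyPoolManager\z/ then :old
  when /PoolManager\z/       then :new
  else :unknown
  end
end
```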
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - unknown

- Revert the above MRs (TBC), setting the ENABLE_RAILS_61_CONNECTION_HANDLING env variable to false
- HUP each Rails process
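The HUP step can be sketched as follows, assuming the Rails (Puma/Sidekiq) PIDs are obtained from the host's supervisor or pidfiles; the helper name is hypothetical:

```ruby
# Send SIGHUP to each Rails process so it reloads and picks up the reverted
# env variable. PIDs are passed in explicitly; on the host they would come
# from pidfiles or the process supervisor.
def hup_rails_processes(pids)
  pids.each do |pid|
    begin
      Process.kill("HUP", pid)
    rescue Errno::ESRCH
      warn "process #{pid} is already gone, skipping"
    end
  end
end
```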
Monitoring
Key metrics to observe
- Metric: rails_db_connection_pool component saturation: Rails DB Connection Pool Utilization
- Location:
- https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for web
- https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for api
- https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main for sidekiq
- What changes to this metric should prompt a rollback: If the saturation drops to 0, then that would mean something went wrong with DB connections from Rails.
- Metric: Web service error ratio
- Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- What changes to this metric should prompt a rollback: increased rate of errors
- Metric: Sidekiq service error ratio
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- What changes to this metric should prompt a rollback: increased rate of errors
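The rollback criteria above can be condensed into a small decision helper (a sketch only; the 2x error-ratio threshold is an assumption for illustration, not a documented SLO):

```ruby
# Rollback decision mirroring the monitoring notes: pool saturation dropping
# to zero means Rails lost its DB connections; an error ratio well above
# baseline (2x here, an assumed threshold) also warrants rollback.
def rollback_needed?(pool_saturation:, error_ratio:, baseline_error_ratio:)
  pool_saturation.zero? || error_ratio > baseline_error_ratio * 2
end
```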
Summary of infrastructure changes
- [-] Does this change introduce new compute instances? No
- [-] Does this change re-size any existing compute instances? No
- [-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.