Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production

Production Change

## Change Summary

Set the ENABLE_RAILS_61_CONNECTION_HANDLING env variable in Production to enable the new Rails connection handling.
Enable the new Rails connection handling (gitlab-org/gitlab!63816 (merged)) in preparation for multiple-database support. It should have no impact on Rails applications that use a single database connection, but because it changes a key Rails method the application uses to establish the database connection on startup, we are playing it safe by rolling it out progressively - see gitlab-org/gitlab!63816 (merged) for details.

This was done in staging in #4885 (closed).
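For context, Rails 6.1 gates the new connection handling behind the `config.active_record.legacy_connection_handling` flag, which the MR above ties to the env var. A minimal sketch of the boolean env-var parsing involved - the helper name and accepted values here are assumptions for illustration, not the actual implementation in gitlab-org/gitlab!63816:

```ruby
# Hedged sketch only: the real wiring lives in gitlab-org/gitlab!63816.
# The helper name and the accepted truthy values are assumptions.
def rails_61_connection_handling_enabled?(env = ENV)
  %w[1 true yes].include?(env['ENABLE_RAILS_61_CONNECTION_HANDLING'].to_s.strip.downcase)
end

# In an initializer, this would drive the Rails 6.1 flag, e.g.:
#   config.active_record.legacy_connection_handling =
#     !rails_61_connection_handling_enabled?
```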
## Change Details

- Services Impacted - Service::Web, Service::API, Service::Sidekiq
- Change Technician - @tkuah / @ggillies
- Change Reviewer - @ggillies
- Time tracking - unknown
- Downtime Component - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 minute

- [ ] Merge gitlab-org/gitlab!63816 (merged)
- [ ] Create MRs for VMs and K8s to set the ENABLE_RAILS_61_CONNECTION_HANDLING env var
- [ ] Set label change::in-progress on this issue
- [ ] Consider dropping down to C3 while executing
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120 minutes

- [ ] Disable chef on all web nodes:

  ```shell
  cd chef-repo
  bundle exec knife ssh roles:gprd-base-fe-web 'sudo chef-client-disable "CR 4921"'
  ```

- [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!954 (merged)
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/167
- [ ] Run the following chatops command to re-enable and run chef:

  ```shell
  /chatops run deploycmd chefclient role_gprd_base_fe_web --environment gprd --no-check
  ```
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

- [ ] The main thing to verify is that the application can still connect to the database and run queries such as `Project.first`. We can check this by hitting any page on https://gitlab.com, or via a Rails console.
- [ ] Log into a VM / host that has the new env var set.
- [ ] On that VM / host, open a new Rails console; we can check which mode we are in with the following:

  Old connection handling:

  ```ruby
  ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
  => ActiveRecord::ConnectionAdapters::LegacyPoolManager
  ```

  New connection handling:

  ```ruby
  ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
  => ActiveRecord::ConnectionAdapters::PoolManager
  ```
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 120 minutes

- [ ] Revert gitlab-com/gl-infra/k8s-workloads/gitlab-com!954 (merged)
- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/167
- [ ] Run the following chatops command:

  ```shell
  /chatops run deploycmd chefclient role_gprd_base_fe_web --environment gprd --no-check
  ```
## Monitoring

### Key metrics to observe

- Metric: `rails_db_connection_pool` component saturation: Rails DB Connection Pool Utilization
  - Location:
    - https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for web
    - https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for api
    - https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for sidekiq
  - What changes to this metric should prompt a rollback: if the saturation drops to 0, something has gone wrong with DB connections from Rails.
- Metric: Web service error ratio
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
- Metric: Sidekiq service error ratio
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
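For a manual spot-check of the same signal from a Rails console, `ActiveRecord::ConnectionAdapters::ConnectionPool#stat` returns counts such as `:busy` and `:size`. A rough sketch of the utilization math behind the saturation panel - the zero-utilization rollback trigger comes from the description above; the function names are illustrative:

```ruby
# Illustrative only: compute pool utilization from a ConnectionPool#stat-style
# hash ({ size:, busy:, ... }). In production the equivalent signal comes from
# the dashboards linked above.
def pool_utilization(stat)
  return 0.0 if stat[:size].to_i.zero?
  stat[:busy].to_f / stat[:size]
end

# A sustained drop to 0 across the fleet suggests Rails has stopped checking
# out DB connections entirely - the rollback trigger described above.
def rollback_signal?(stat)
  pool_utilization(stat).zero?
end
```

In a console this could be fed directly with `ActiveRecord::Base.connection_pool.stat`.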
## Summary of infrastructure changes

- [-] Does this change introduce new compute instances? No
- [-] Does this change re-size any existing compute instances? No
- [-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.