# Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production
Production Change
## Change Summary
Set the `ENABLE_RAILS_61_CONNECTION_HANDLING` env variable in Production to enable the new Rails connection handling.
Enable the new Rails connection handling (gitlab-org/gitlab!63816 (merged)) in preparation for multiple-database handling. It should have no impact on Rails applications that use a single database connection, but because it changes a key Rails method that the application uses to establish the database connection on startup, we are playing it safe by rolling it out progressively - see gitlab-org/gitlab!63816 (merged) for details.
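As a rough sketch of the mechanics (the wiring below is assumed for illustration; the actual change is in gitlab-org/gitlab!63816), Rails 6.1 exposes this switch as the `config.active_record.legacy_connection_handling` flag, which an env-var gate like ours might toggle as follows:

```ruby
# config/application.rb - illustrative sketch only, not the actual MR code.
# `legacy_connection_handling` is the real Rails 6.1 config flag; the
# env-var gate shown here is an assumption matching this change's intent.
config.active_record.legacy_connection_handling =
  ENV['ENABLE_RAILS_61_CONNECTION_HANDLING'].nil?
```

With the env var unset, Rails keeps the legacy handling; setting it to any value switches to the new handling, which matches the progressive rollout described above.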
This was done in staging in #4885 (closed).
## Change Details
- Services Impacted - ~Service::Web ~Service::API ~Service::Sidekiq
- Change Technician - @tkuah / @ggillies
- Change Reviewer - @ggillies
- Time tracking - unknown
- Downtime Component - none
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- [ ] Merge gitlab-org/gitlab!63816 (merged)
- [ ] Create MRs for VMs and K8s to set the `ENABLE_RAILS_61_CONNECTION_HANDLING` env var
- [ ] Set label ~change::in-progress on this issue
- [ ] Consider dropping down to ~C3 while executing
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120 minutes
- [ ] Disable chef on all web nodes

  ```shell
  cd chef-repo
  bundle exec knife ssh roles:gprd-base-fe-web 'sudo chef-client-disable "CR 4921"'
  ```

- [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!954 (merged)
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/167
- [ ] Run the following chatops command to re-enable and run chef

  ```shell
  /chatops run deploycmd chefclient role_gprd_base_fe_web --environment gprd --no-check
  ```
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
- [ ] The main thing to verify is that the application can still connect to the database and make queries such as `Project.first`. We can check this by hitting any page on https://gitlab.com, or via a Rails console.
- [ ] Log into a VM / host that has the new env var set.
- [ ] On that VM / host, open a Rails console; we can check which mode we are in with the following (a combined one-shot check follows this list):

  Old connection handling:

  ```ruby
  ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
  => ActiveRecord::ConnectionAdapters::LegacyPoolManager
  ```

  New connection handling:

  ```ruby
  ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
  => ActiveRecord::ConnectionAdapters::PoolManager
  ```
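For a quicker spot-check, the class lookup can be combined with a trivial query in a single console snippet (a sketch; `Project.first` is just the example query from the step above):

```ruby
# Sketch: one-shot console verification. Prints which pool manager is in
# use and confirms the application can still query the database.
manager = ActiveRecord::Base.connection_handler
  .send(:owner_to_pool_manager).values.first.class
puts "Pool manager: #{manager}"             # expect PoolManager after the change
puts "DB reachable: #{!Project.first.nil?}" # any simple query will do
```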
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 120 minutes
- [ ] Revert gitlab-com/gl-infra/k8s-workloads/gitlab-com!954 (merged)
- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/167
- [ ] Run the following chatops command (then verify with the console check below)

  ```shell
  /chatops run deploycmd chefclient role_gprd_base_fe_web --environment gprd --no-check
  ```
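After the rollback completes, the same console check from the post-change steps should report the legacy pool manager again:

```ruby
ActiveRecord::Base.connection_handler.send(:owner_to_pool_manager).values.first.class
=> ActiveRecord::ConnectionAdapters::LegacyPoolManager
```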
## Monitoring

### Key metrics to observe
- Metric: `rails_db_connection_pool` component saturation (Rails DB Connection Pool Utilization)
  - Location:
    - https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for web
    - https://dashboards.gitlab.net/d/api-main/api-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for api
    - https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=391047339&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main for sidekiq
  - What changes to this metric should prompt a rollback: if the saturation drops to 0, something has gone wrong with DB connections from Rails (see the console sketch after this list for a direct spot-check).
- Metric: Web service error ratio
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
- Metric: Sidekiq service error ratio
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: increased rate of errors
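Alongside the dashboards, pool utilization can be spot-checked directly from a Rails console using the standard ActiveRecord connection pool stats API (a sketch; the dashboard metric remains the source of truth):

```ruby
# Sketch: inspect the connection pool from a Rails console.
# ConnectionPool#stat is a standard ActiveRecord API.
stat = ActiveRecord::Base.connection_pool.stat
# e.g. { size: 16, connections: 5, busy: 2, dead: 0, idle: 3, waiting: 0, checkout_timeout: 5 }
puts "Pool utilization: #{stat[:busy]}/#{stat[:size]}"
```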
## Summary of infrastructure changes
- [ ] Does this change introduce new compute instances? No
- [ ] Does this change re-size any existing compute instances? No
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No
Summary of the above
## Changes checklist
- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~change::unscheduled, ~change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.