Enable the Nakayoshi fork for Puma in production
Production Change
Change Summary
This change updates environment variables that control Puma's startup procedure, enabling the Nakayoshi forking mechanism: before forking workers, Puma runs the garbage collector so that worker memory is more copy-on-write friendly. Enabling this improves Puma's memory usage. This capability was introduced and is further discussed in gitlab-org/gitlab#288042 (closed).
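As a quick sanity check after rollout, the flag's presence on a node can be verified directly. A minimal sketch, assuming the variable name contains `NAKAYOSHI` and is written to the omnibus environment directory (both are assumptions; the MRs referenced below are authoritative):

```sh
# Sketch: look for the Nakayoshi flag in the Rails environment directory on
# the front-end fleet. The variable name and the path are assumptions based
# on omnibus conventions; see the MRs below for the actual definitions.
knife ssh 'roles:gprd-base-fe' \
  "sudo grep -Ri 'nakayoshi' /opt/gitlab/etc/gitlab-rails/env/ || echo 'flag not set'"
```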
Change Details
- Services Impacted - ServiceAPI, ServiceWeb
- Change Technician - @skarbek @alipniagov
- Change Criticality - C4
- Change Type - changescheduled
- Change Reviewer - @hphilipps
- Due Date - 2021-02-17
- Time tracking - 15 minutes
- Downtime Component - 0
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
- Get Approval on: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5014
- Get Approval on: gitlab-com/gl-infra/k8s-workloads/gitlab-com!696 (closed)
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Merge and Apply: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5014
- Merge and Apply: gitlab-com/gl-infra/k8s-workloads/gitlab-com!696 (closed)
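Once both MRs are merged, the Chef fleet converges on its regular interval; a manual run expedites this, mirroring the rollback step below. The process-environment check is an illustrative sketch (the grep pattern assumes the variable name contains "NAKAYOSHI"):

```sh
# Expedite Chef convergence on the front-end fleet, three nodes at a time.
knife ssh 'roles:gprd-base-fe' 'sudo chef-client' -C 3

# Spot-check that the running Puma master picked up the new environment,
# assuming the Chef run restarts Puma when its environment changes.
knife ssh 'roles:gprd-base-fe' \
  "sudo cat /proc/\$(pgrep -o -f puma)/environ | tr '\0' '\n' | grep -i nakayoshi || echo 'not set'"
```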
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10
- Revert: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5014
- Revert: gitlab-com/gl-infra/k8s-workloads/gitlab-com!696 (closed)
- Execute a manual chef run to expedite the configuration change: `knife ssh 'roles:gprd-base-fe' 'sudo chef-client' -C 3`
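After the reverts converge, confirming the flag is gone and that Puma has cycled closes the loop. A sketch under the same assumptions as above:

```sh
# Confirm the flag was removed from the Rails environment directory
# (path per omnibus convention; the grep pattern is an assumption).
knife ssh 'roles:gprd-base-fe' \
  "sudo grep -Ri 'nakayoshi' /opt/gitlab/etc/gitlab-rails/env/ || echo 'flag removed'"

# Check the Puma master's start time to verify the service restarted.
knife ssh 'roles:gprd-base-fe' "ps -o pid,lstart,cmd -p \$(pgrep -o -f puma)"
```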
Monitoring
Logging
A previous attempt at this change triggered a GC failure that caused widespread segmentation faults in Puma workers. These segfaults are recorded in our Puma logs; monitor for them here: https://log.gprd.gitlab.net/goto/b71d63d5bb9a436f6a2d5c4fb9cb72cb
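For a node-level check outside Kibana, the segfaults show up in Puma's stderr log. A sketch, assuming the standard omnibus log location:

```sh
# Count Ruby segfault traces ("[BUG] Segmentation fault") in Puma's stderr
# log on each front-end node; a non-zero count on any node warrants rollback.
knife ssh 'roles:gprd-base-fe' \
  "sudo grep -ci 'segmentation fault' /var/log/gitlab/puma/puma_stderr.log || true"
```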
Key metrics to observe
Note that the failure occurs at an unpredictable point in time: it requires a GC run that evicts just the right set of values from memory, so when it happens we expect to see it on one node at a time rather than fleet-wide. Refer to incident #3370 (closed) for additional details.
- Metric: Node CPU
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&from=now-3h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: High CPU usage on an impacted node
- Metric: Puma Error Ratio
  - Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&from=now-3h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  - What changes to this metric should prompt a rollback: A sustained increase in the error ratio
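Both dashboards are driven by Prometheus, so a quick ad-hoc check is also possible from a terminal. An illustrative sketch built on the standard node_exporter metric; the Prometheus host is a placeholder and the label names (taken from the dashboard variables) may differ in practice:

```sh
# Per-instance CPU utilization for gprd/main via the Prometheus HTTP API.
# Replace the host with the environment's actual Prometheus endpoint.
curl -sG 'https://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=1 - avg by (instance) (rate(node_cpu_seconds_total{environment="gprd",stage="main",mode="idle"}[5m]))' \
  | jq '.data.result[] | {instance: .metric.instance, cpu: .value[1]}'
```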
Summary of infrastructure changes
- Does this change introduce new compute instances? no
- Does this change re-size any existing compute instances? no
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? no
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.