Apply new Ruby GC settings to Puma on SaaS

Production Change

Change Summary

As part of gitlab-org/gitlab#289838 (closed), and at a higher level the "GitLab on 2GB" initiative, we are fine-tuning the Ruby GC for several production services to better suit our needs. We have already done this for gitlab-exporter and are now moving up to the riskier services, i.e. our main app. This issue specifically targets Puma (not Sidekiq yet), for which we are resizing the initial Ruby heap to better match our idle memory consumption:

gitlab-org/charts/gitlab!1851 (closed)
gitlab-org/omnibus-gitlab!5019 (closed)

We expect this to have two major benefits:

less memory used due to fewer heap pages being allocated initially
faster application start time (we are verifying this independently in another issue and it is not the goal here)

We also expect the memory benefit to diminish over time, since we know that certain endpoints drive memory use up drastically and it does not come back down. It will nevertheless very likely benefit smaller customers and deployments, i.e. self-managed. For SaaS, I would call it a success if memory remains at worst stable and performance is not affected.
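
To check the heap-page effect described above on a running process, something like the following could be run in a Ruby or Rails console. This is a minimal sketch, not part of the change itself: it assumes the variable being rolled out is Ruby's `RUBY_GC_HEAP_INIT_SLOTS` (this issue does not name it), and `heap_snapshot` and `heap_occupancy` are hypothetical helper names. The counters come from Ruby's built-in `GC.stat`.

```ruby
# Heap-related counters reported by Ruby's built-in GC.stat.
HEAP_KEYS = %i[
  heap_allocated_pages
  heap_available_slots
  heap_live_slots
  heap_free_slots
].freeze

# Snapshot of the heap counters for the current process.
def heap_snapshot
  GC.stat.slice(*HEAP_KEYS)
end

# Fraction of available slots that currently hold live objects.
def heap_occupancy(snapshot)
  snapshot[:heap_live_slots].to_f / snapshot[:heap_available_slots]
end

# Assumption: the variable being rolled out is RUBY_GC_HEAP_INIT_SLOTS;
# the issue text does not name it.
init_slots = ENV.fetch('RUBY_GC_HEAP_INIT_SLOTS', '(unset)')
snapshot = heap_snapshot

puts "RUBY_GC_HEAP_INIT_SLOTS=#{init_slots}"
snapshot.each { |key, value| puts "#{key}=#{value}" }
puts format('occupancy=%.2f', heap_occupancy(snapshot))
```

Comparing `heap_allocated_pages` shortly after boot with and without the setting should show whether fewer pages are allocated initially. Comparing it again after the process has served traffic would show how much of the effect survives.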

Change Details

Services Impacted - Puma
Change Technician - @mkaeppler
Change Criticality - C2
Change Type - changeunscheduled, changescheduled (Unsure -- need advice)
Change Reviewer - @jarv
Due Date - 2021-03-10
Time tracking - unknown
Downtime Component - none

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Set the env variable in staging https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5129 and gitlab-com/gl-infra/k8s-workloads/gitlab-com!713 (merged)
Set the env variable in canary (k8s and VMs) gitlab-com/gl-infra/k8s-workloads/gitlab-com!714 (merged) and https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5130
Set the env variable in one k8s cluster (git and websockets). For git, this would be set for the Puma that services internal API requests from the Git HTTPS and shell services. TBD
Set the env variable on all k8s clusters
Set the env variable on two VM hosts running web and api
Set the env variable everywhere
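
The staged order above can be sketched as a gated loop: each stage is applied only if every earlier stage passed its check. This is an illustrative sketch only. `roll_out` is a hypothetical helper, and the health check passed as a block stands in for whatever manual or automated verification is done between stages.

```ruby
# Rollout stages in the order listed in the change steps.
ROLLOUT_STAGES = [
  'staging',
  'canary (k8s and VMs)',
  'one k8s cluster (git and websockets)',
  'all k8s clusters',
  'two VM hosts running web and api',
  'everywhere'
].freeze

# Walks the stages in order and stops at the first one whose check
# (the block) returns false. Returns the stages that passed.
def roll_out(stages)
  stages.take_while { |stage| yield(stage) }
end

# Example: pretend the check fails once we reach all k8s clusters.
passed = roll_out(ROLLOUT_STAGES) { |stage| stage != 'all k8s clusters' }
puts "Stages that passed: #{passed.join(', ')}"
```

Stopping at the first failing stage keeps the affected fleet as small as possible and leaves only the already-applied stages to roll back.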

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Post-Change Step 1
Post-Change Step 2
Post-Change Step 3

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Staging: Revert and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5129 and gitlab-com/gl-infra/k8s-workloads/gitlab-com!713 (merged)
Canary: Revert and apply gitlab-com/gl-infra/k8s-workloads/gitlab-com!714 (merged) and https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5130
Production single K8s cluster: TBD
Production all clusters: TBD
Production web/api VMs

Monitoring

Key metrics to observe

Metric: Ruby GC and process stats:
- Kubernetes: Thanos
- VMs: Thanos
- What changes to this metric should prompt a rollback: unusual growth or patterns
API overview https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd
Web overview https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
Websockets overview https://dashboards.gitlab.net/d/websockets-main/websockets-overview?orgId=1&from=now-3h&to=now&refresh=10s&var-PROMETHEUS_DS=Global&var-environment=gprd&var-sigma=2
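
"Unusual growth" is a judgment call. One way to make it more concrete is to compare memory samples from before and after the change against a tolerance. The sketch below is an assumption-laden illustration, not an agreed rollback criterion: the 10% tolerance, the use of the median, and the names `median` and `memory_regression?` are all hypothetical, and the samples would have to be exported by hand from the dashboards above.

```ruby
# Median of a non-empty list of numeric samples.
def median(samples)
  sorted = samples.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# True when the post-change median exceeds the baseline median by more
# than the given tolerance (hypothetical default: 10%).
def memory_regression?(baseline, current, tolerance: 0.10)
  median(current) > median(baseline) * (1 + tolerance)
end

# Example with made-up per-process memory samples in MB.
baseline = [100, 110, 105]
stable   = [104, 108, 106]
grown    = [130, 125, 140]

puts memory_regression?(baseline, stable)
puts memory_regression?(baseline, grown)
```

Using the median rather than the mean keeps a single spiking process from triggering a rollback on its own. This matters here because certain endpoints are known to drive memory up sharply.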

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Summary of the above

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
There are currently no active incidents.

Edited Mar 10, 2021 by John Jarvis
