Enable `dynamic_image_resizing_owner` FF via production console for 100 pre-selected users
Production Change
Change Summary
We want to run the Dynamic Image Resizing experiment.
To do that, we need to enable the Feature Flag (`dynamic_image_resizing_owner`) for 100 pre-selected users (a semi-random subset of GL employees).
It will only affect the avatars of this randomly selected subset of GitLab employees, which will be resized in our service before being served. Content images are not affected. We expect no difference in user experience. We want to monitor Workhorse performance, stability, and resource utilization.
Change Details
- Services Impacted - GL Workhorse
- Change Technician - @iroussos
- Change Criticality - C3
- Change Type - changeunscheduled
- Change Reviewer - @alipniagov
- Due Date - 10 Sep, 2020, 12:00 UTC
- Time tracking - 240 minutes max (180 minutes probably)
- Downtime Component - No
Detailed steps for the change
Pre-Change Steps - steps to be completed before the execution of the change
Estimated Time to Complete (mins) - 15
- Start a sync call with the Memory Team members (Aleksei Lipniagov, Nikola Milojevic, Matthias Kappler, and Kamil, if available) and make sure that everyone is on board. The ideal start time is between 9:00 and 13:00 GMT, but this is open to discussion.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30
- Run the commands from https://gitlab.com/gitlab-org/gitlab/-/issues/241533#note_407403888 in the production console to enable the FF
- Check immediate WH health status
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 180
- Observe the WH performance: CPU, memory, latency. Verify that there are no visible changes.
- Check related Prometheus metrics
- Check related logs in Kibana
- Post in #production (Slack): `/chatops run feature set dynamic_image_resizing_requester 5`
- Check WH, Prometheus, Kibana
- Increase the `dynamic_image_resizing_requester` percentage up to 100
- Check WH, Prometheus, Kibana
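The percentage ramp-up in the steps above can be sketched as a small helper that prints one ChatOps command per stage. This is only an illustration: the command syntax and the first (5) and last (100) values come from this issue, while the intermediate stages and the names `RAMP_PERCENTAGES` and `ramp_commands` are assumptions made for the example.

```python
FLAG = "dynamic_image_resizing_requester"

# Hypothetical ramp schedule: only 5 and 100 come from the steps above;
# the intermediate stages are placeholders to be agreed on during the call.
RAMP_PERCENTAGES = [5, 25, 50, 100]


def ramp_commands(flag, percentages):
    """Return one ChatOps command per rollout stage, in increasing order."""
    if list(percentages) != sorted(set(percentages)):
        raise ValueError("percentages must be strictly increasing")
    if not all(0 < p <= 100 for p in percentages):
        raise ValueError("percentages must be in the range 1..100")
    return [f"/chatops run feature set {flag} {p}" for p in percentages]


if __name__ == "__main__":
    for command in ramp_commands(FLAG, RAMP_PERCENTAGES):
        # Post each command in #production, then check WH, Prometheus, and
        # Kibana before moving on to the next stage.
        print(command)
```

Rejecting a non-increasing schedule up front guards against accidentally lowering the percentage in the middle of the experiment.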
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Post in #production (Slack): `/chatops run feature delete dynamic_image_resizing_requester`
- Verify FF is deleted (copy the Slack link here)
- Post in #production (Slack): `/chatops run feature delete dynamic_image_resizing_owner`
- Verify FF is deleted (copy the Slack link here)
Monitoring
Key metrics to observe
- Metric: GL Health
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s
- What changes to this metric should prompt a rollback: Error rate growth, significant response time growth
- Metric: gitlab_workhorse_image_resize_concurrency_limit_exceeds_total
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: Nothing
- Metric: gitlab_workhorse_image_resize_processes
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: Nothing
- Metric: gitlab_workhorse_image_resize_completed_total
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: value remains zero after `dynamic_image_resizing_owner` is enabled for all users
- Metric: `ImageResizer: Success` log record
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: zero records during the first 30 mins of the experiment
- Metric: `ImageResizer:...` log records (various errors)
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: > 1000 error records of any type
- Metric: Workhorse/Full web component Health; WH CPU & Memory usage
- Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1 and https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1
- What changes to this metric should prompt a rollback: Error rate growth, significant response time growth
- Metric: WH request CPU usage
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: Nothing
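The rollback triggers listed above can be written down as a simple check, which may help keep the decision consistent during the observation window. This is a minimal sketch, not tooling that exists for this change: the `Observation` fields are hypothetical and are expected to be filled in by hand from the dashboards, Prometheus, and Kibana. Two interpretations are assumptions: the 30-minute window stated for the success log records is reused for the `gitlab_workhorse_image_resize_completed_total` counter, and "> 1000 error records of any type" is read as a per-type limit.

```python
from dataclasses import dataclass

GRACE_MINUTES = 30         # from the "first 30 mins" success-log criterion
ERROR_RECORD_LIMIT = 1000  # from the "> 1000 error records" criterion


@dataclass
class Observation:
    """Values read manually from the dashboards (field names are hypothetical)."""
    minutes_since_enable: float
    resize_completed_total: int  # gitlab_workhorse_image_resize_completed_total
    success_log_records: int     # "ImageResizer: Success" records in Kibana
    max_error_log_records: int   # largest count for any single error type
    dashboards_degraded: bool    # error rate or response time growth observed


def rollback_reasons(obs):
    """Return the list of triggered rollback criteria (empty means keep going)."""
    reasons = []
    if obs.dashboards_degraded:
        reasons.append("error rate or response time growth on health dashboards")
    if obs.max_error_log_records > ERROR_RECORD_LIMIT:
        reasons.append("more than 1000 ImageResizer error records of one type")
    # Zero-activity checks only apply once the grace period has elapsed.
    if obs.minutes_since_enable >= GRACE_MINUTES:
        if obs.success_log_records == 0:
            reasons.append("no 'ImageResizer: Success' records after 30 minutes")
        if obs.resize_completed_total == 0:
            reasons.append("image_resize_completed_total is still zero")
    return reasons


if __name__ == "__main__":
    sample = Observation(45, 0, 0, 12, False)
    for reason in rollback_reasons(sample):
        print("ROLLBACK:", reason)
```

The metrics marked "Nothing" above are deliberately left out of the check, since they are observed for information only.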
Summary of infrastructure changes
- [-] Does this change introduce new compute instances? - No
- [-] Does this change re-size any existing compute instances? - No
- [-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? - No
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue: https://gitlab.com/gitlab-org/gitlab/-/issues/241533#note_407403888
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.
Edited by Aleksei Lipniagov