Enable `dynamic_image_resizing_owner` FF via production console for 100 pre-selected users

Production Change

Change Summary

We want to run the Dynamic Image Resizing experiment.
To do that, we need to enable the Feature Flag (`dynamic_image_resizing_owner`) for 100 pre-selected users (a semi-random subset of GL employees).

It will only affect the avatars of a randomly selected subset of GitLab employees, which will be resized in our service before being served. Content images are not affected. We expect no difference in User Experience. We want to monitor the Workhorse performance, stability, and resource utilization.

Change Details

Services Impacted - GL Workhorse
Change Technician - @iroussos
Change Criticality - C3
Change Type - changeunscheduled
Change Reviewer - @alipniagov
Due Date - 10 Sep, 2020, 12:00 UTC
Time tracking - 240 minutes max (180 minutes probably)
Downtime Component - No

Detailed steps for the change

Pre-Change Steps - steps to be completed before the execution of the change

Estimated Time to Complete (mins) - 15

Start a sync call with the Memory Team members: Aleksei Lipniagov, Nikola Milojevic, Matthias Kappler, Kamil (if available), and make sure that everyone is on board. The ideal start time is between 9:00 GMT and 13:00 GMT, but this is open to discussion.

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30

Run https://gitlab.com/gitlab-org/gitlab/-/issues/241533#note_407403888 in production console to enable the FF
Check immediate WH health status

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 180

Observe the WH performance: CPU, Memory, latency. Verify that there are no visible changes.
Check related Prometheus metrics
Check related logs in Kibana
Post in #production (Slack): /chatops run feature set dynamic_image_resizing_requester 5
Check WH, Prometheus, Kibana
Increase the dynamic_image_resizing_requester percentage up to 100
Check WH, Prometheus, Kibana
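
The staged increase of `dynamic_image_resizing_requester` could be sketched as below. This is only an illustration: the steps above name 5 and 100 as the percentages, so the intermediate stages (25 and 50) and the helper name `rollout_commands` are assumptions, not part of the agreed plan. The helper only builds the ChatOps command strings to post in #production; it does not talk to Slack.

```python
FLAG = "dynamic_image_resizing_requester"


def rollout_commands(percentages):
    """Return one ChatOps command per rollout stage, in order.

    Percentages must be strictly increasing and within 1..100, so a
    stage can never silently lower the exposure set by the previous one.
    """
    commands = []
    previous = 0
    for pct in percentages:
        if not 0 < pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        if pct <= previous:
            raise ValueError(f"percentages must increase: {previous} -> {pct}")
        commands.append(f"/chatops run feature set {FLAG} {pct}")
        previous = pct
    return commands


if __name__ == "__main__":
    # Hypothetical schedule: only 5 and 100 come from the steps above.
    for command in rollout_commands([5, 25, 50, 100]):
        print(command)
        print("  then: check WH, Prometheus, Kibana before the next stage")
```

Keeping the check of WH, Prometheus, and Kibana between every stage mirrors the order of the steps above.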

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

Post in #production (Slack): /chatops run feature delete dynamic_image_resizing_requester
Verify FF is deleted (copy the Slack link here)
Post in #production (Slack): /chatops run feature delete dynamic_image_resizing_owner
Verify FF is deleted (copy the Slack link here)

Monitoring

Key metrics to observe

Metric: GL Health
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s
- What changes to this metric should prompt a rollback: Error rate growth, significant response time growth
Metric: gitlab_workhorse_image_resize_concurrency_limit_exceeds_total
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: Nothing
Metric: gitlab_workhorse_image_resize_processes
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: Nothing
Metric: gitlab_workhorse_image_resize_completed_total
- Location: https://prometheus-app.gprd.gitlab.net/graph
- What changes to this metric should prompt a rollback: value remains zero after dynamic_image_resizing_owner is enabled for all users
Metric: ImageResizer: Success log record
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: zero records during the first 30 mins of the experiment
Metric: ImageResizer:... log records (various errors)
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: > 1000 error records of any type
Metric: Workhorse/Full web component Health; WH CPU & Memory usage
- Location: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1 and https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1
- What changes to this metric should prompt a rollback: Error rate growth, significant response time growth
Metric: WH request CPU usage
- Location: https://log.gprd.gitlab.net/goto/53283b48d53ba14934a46410d9412bb7
- What changes to this metric should prompt a rollback: Nothing
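
The two numeric rollback criteria in the list above (zero success records during the first 30 minutes, more than 1000 error records) could be checked with a small helper like the one below. This is a sketch under stated assumptions: the `Observation` type and its field names are hypothetical, the counts are expected to be read manually from the Kibana searches linked above, and "> 1000 error records of any type" is interpreted here as a per-type limit. The dashboard-based criteria (error rate growth, response time growth) still need a human judgment and are not covered.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ERROR_RECORD_LIMIT = 1000      # "> 1000 error records of any type"
SUCCESS_WINDOW_MINUTES = 30    # "zero records during the first 30 mins"


@dataclass
class Observation:
    """Counts read from Kibana at one point in time (hypothetical shape)."""
    minutes_elapsed: int
    success_records: int
    error_records_by_type: Dict[str, int] = field(default_factory=dict)


def rollback_reasons(obs: Observation) -> List[str]:
    """Return the numeric rollback criteria that the observation meets."""
    reasons = []
    # No successful resize logged once the first 30 minutes have passed.
    if obs.minutes_elapsed >= SUCCESS_WINDOW_MINUTES and obs.success_records == 0:
        reasons.append("no ImageResizer success records in the first 30 minutes")
    # Assumption: the limit applies to each error type separately.
    for error_type, count in sorted(obs.error_records_by_type.items()):
        if count > ERROR_RECORD_LIMIT:
            reasons.append(
                f"{count} '{error_type}' error records (limit {ERROR_RECORD_LIMIT})"
            )
    return reasons


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = Observation(minutes_elapsed=45, success_records=0,
                         error_records_by_type={"example-error": 1200})
    for reason in rollback_reasons(sample):
        print("rollback:", reason)
```

An empty result means none of the numeric criteria are met; it does not replace watching the dashboards.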

Summary of infrastructure changes

[-] Does this change introduce new compute instances? - No
[-] Does this change re-size any existing compute instances? - No
[-] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? - No

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue: https://gitlab.com/gitlab-org/gitlab/-/issues/241533#note_407403888
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
There are currently no active incidents.

Edited Sep 09, 2020 by Aleksei Lipniagov
