2020-02-02: Redis starves for CPU on redis-cache server during burst of Redis SCAN commands
Summary
The redis-server process on the primary Redis-cache instance (redis-cache-01-db-gprd) intermittently reached CPU saturation several times today. This was triggered by a burst of Redis SCAN commands, which serially iterate a cursor over all keys in the database, linearly searching for matches. The SCAN command returns results incrementally to allow other clients a turn on-CPU, but in our case the search space has roughly 30 million keys, and the series of SCAN commands dominated the CPU time while active.
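To make the cost concrete, here is a small illustrative sketch (the figures come from the summary above; the helper name is hypothetical) of why a full-keyspace SCAN is expensive: the cursor must walk every slot before returning to 0, so the number of serial round trips scales with the total keyspace, not with the number of matching keys.

```ruby
# Rough estimate of SCAN round trips for one full keyspace iteration.
# Each SCAN call inspects roughly COUNT slots and returns a cursor for
# the next call; the server must walk the entire keyspace regardless of
# how few keys actually match the glob.
def scan_round_trips(total_keys, count)
  (total_keys.to_f / count).ceil
end

scan_round_trips(30_000_000, 1000)  # => 30000 serial round trips per delete
```

At ~30,000 back-to-back commands per expiry call, a handful of concurrent callers is enough to keep the single redis-server thread busy, which matches the CPU saturation pattern observed.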
We captured an example of one of these bursts of Redis SCAN commands, which revealed that they originated from a canary web node (web-cny-01-sv-gprd) and all shared the following distinctive pattern:
```
scan
<cursor_id>
MATCH cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues*
COUNT 1000
```
The glob pattern in the MATCH argument led us to where in the Rails code the command originates: the ProjectErrorTrackingSetting class has an expire_issues_cache method that calls the Rails cache's delete_matched method, which uses SCAN to find matching keys and then DEL to remove them.
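Conceptually, delete_matched behaves like the following sketch (a simplified stand-in, not the actual Rails implementation; an in-memory Hash plays the role of Redis, and File.fnmatch stands in for server-side MATCH filtering):

```ruby
# Simplified model of delete_matched: page through the WHOLE keyspace in
# SCAN-sized batches, filter each page against the glob, DEL the matches.
def delete_matched(store, glob, count: 1000)
  deleted = 0
  store.keys.each_slice(count) do |batch|  # one "SCAN ... COUNT 1000" page
    batch.each do |key|
      if File.fnmatch(glob, key)           # MATCH filtering happens per key
        store.delete(key)                  # followed by DEL
        deleted += 1
      end
    end
  end
  deleted
end

store = {
  'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues:a' => 1,
  'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues:b' => 2,
  'cache:gitlab:unrelated_key' => 3,
}
delete_matched(store, 'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues*')
# => 2, after inspecting every key in the store
```

The key point the sketch shows: the work is proportional to the size of the whole keyspace, even when only a couple of keys match.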
Expiring this cache was a very recent change, just added to the Canary stage, which is why the SCAN commands were coming from a Canary web node. That merge request was reverted via a hotpatch to prevent this from recurring. Issue gitlab-org/gitlab#201808 (closed) documents the details of the performance problem and explores more efficient alternatives.
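One commonly used pattern for avoiding keyspace scans (an illustrative sketch only; the linked issue explores the options actually considered, and this is not necessarily the approach chosen) is to track cache keys in a per-namespace set at write time, so that expiry can delete exactly the known keys instead of scanning everything:

```ruby
require 'set'

# Hypothetical key-tracking cache: an in-memory stand-in where a Set per
# namespace plays the role of a Redis SET maintained alongside the cache.
class TrackedCache
  def initialize
    @data  = {}
    @index = Hash.new { |h, k| h[k] = Set.new }  # one "SET" per namespace
  end

  # At write time, record the key in its namespace's set (like SADD).
  def write(namespace, key, value)
    @data[key] = value
    @index[namespace] << key
  end

  # Expiry becomes "read the set, delete those keys" (like SMEMBERS + DEL),
  # touching only the keys that belong to the namespace.
  def expire(namespace)
    keys = @index.delete(namespace) || Set.new
    keys.each { |k| @data.delete(k) }
    keys.size
  end

  def size
    @data.size
  end
end
```

The trade-off is extra bookkeeping on every write, in exchange for expiry cost proportional to the number of keys being expired rather than the total keyspace.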
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-03
- 15:19 - @andrewn called attention to a non-paging alert about high single-CPU usage on a redis-cache node. (See example below.) EOC (@msmiley) begins investigating.
- 15:55 - @andrewn finds that Redis SCAN commands correlate with the CPU usage spikes.
- 16:23 - @msmiley started a rotating packet capture to wait for the next occurrence and find where the unexpected SCAN commands were coming from and what their arguments were.
- 18:59 - Captured the next regression event.
- 20:40 - Found the canary web node was the origin of the SCAN commands, and found they all shared a distinctive MATCH glob expression. Slack discussion led quickly to finding the relevant code and context. Merge request: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/155
- 23:32 - Hotpatch finished deploying to gprd. Pipeline: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/pipelines/112390
Example
The single-CPU saturation sometimes triggered non-paging ~S3 alerts in the #alerts-general Slack channel.
Example:
The redis-cache service (main stage), single_threaded_cpu component has a saturation exceeding SLO and is close to its capacity limit. This means that this resource is running close to capacity and is at risk of exceeding its current capacity limit.
The Redis CPU usage spikes correlated with spikes in the duration of SCAN commands (which are very rare).

