2020-02-02: Redis starves for CPU on redis-cache server during burst of Redis SCAN commands
Summary
The redis-server process on the primary Redis-cache instance (redis-cache-01-db-gprd) intermittently reached CPU saturation several times today. This was triggered by a burst of Redis SCAN commands, which serially iterate a cursor over all keys in the database, linearly searching for matches. The SCAN command returns results incrementally to allow other clients a turn on-CPU, but in our case the search space has roughly 30 million keys, and the series of SCAN commands dominated the CPU time while active.
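To make the cost concrete, here is a small illustrative sketch (the figures come from the summary above; the helper name is hypothetical) of why a full-keyspace SCAN is expensive: the cursor must walk every slot before returning to 0, so the number of serial round trips scales with the total keyspace, not with the number of matching keys.

```ruby
# Rough estimate of SCAN round trips for one full keyspace iteration.
# Each SCAN call inspects roughly COUNT slots and returns a cursor for
# the next call; the server must walk the entire keyspace regardless of
# how few keys actually match the glob.
def scan_round_trips(total_keys, count)
  (total_keys.to_f / count).ceil
end

scan_round_trips(30_000_000, 1000)  # => 30000 serial round trips per delete
```

At ~30,000 back-to-back commands per expiry call, a handful of concurrent callers is enough to keep the single redis-server thread busy, which matches the CPU saturation pattern observed.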
We captured an example of one of these bursts of Redis SCAN commands, which revealed that they originated from a canary web node (web-cny-01-sv-gprd) and all shared the following distinctive pattern:
```
scan
<cursor_id>
MATCH cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues*
COUNT 1000
```
The glob pattern in the MATCH argument led us to where in the Rails code the command originates: the ProjectErrorTrackingSetting class has an expire_issues_cache method that calls the Rails cache's delete_matched method, which uses SCAN to find matching keys and then DEL to remove them.
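Conceptually, delete_matched behaves like the following sketch (a simplified stand-in, not the actual Rails implementation; an in-memory Hash plays the role of Redis, and File.fnmatch stands in for server-side MATCH filtering):

```ruby
# Simplified model of delete_matched: page through the WHOLE keyspace in
# SCAN-sized batches, filter each page against the glob, DEL the matches.
def delete_matched(store, glob, count: 1000)
  deleted = 0
  store.keys.each_slice(count) do |batch|  # one "SCAN ... COUNT 1000" page
    batch.each do |key|
      if File.fnmatch(glob, key)           # MATCH filtering happens per key
        store.delete(key)                  # followed by DEL
        deleted += 1
      end
    end
  end
  deleted
end

store = {
  'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues:a' => 1,
  'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues:b' => 2,
  'cache:gitlab:unrelated_key' => 3,
}
delete_matched(store, 'cache:gitlab:error_tracking_project_error_tracking_setting:14986497:list_issues*')
# => 2, after inspecting every key in the store
```

The key point the sketch shows: the work is proportional to the size of the whole keyspace, even when only a couple of keys match.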
Expiring this cache was a very recent change, just added to the Canary stage, which is why the SCAN commands were coming from a Canary web node. That merge request was reverted via a hotpatch to prevent this from recurring. Issue gitlab-org/gitlab#201808 (closed) documents the details of the performance problem and explores more efficient alternatives.
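One commonly used pattern for avoiding keyspace scans (an illustrative sketch only; the linked issue explores the options actually considered, and this is not necessarily the approach chosen) is to track cache keys in a per-namespace set at write time, so that expiry can delete exactly the known keys instead of scanning everything:

```ruby
require 'set'

# Hypothetical key-tracking cache: an in-memory stand-in where a Set per
# namespace plays the role of a Redis SET maintained alongside the cache.
class TrackedCache
  def initialize
    @data  = {}
    @index = Hash.new { |h, k| h[k] = Set.new }  # one "SET" per namespace
  end

  # At write time, record the key in its namespace's set (like SADD).
  def write(namespace, key, value)
    @data[key] = value
    @index[namespace] << key
  end

  # Expiry becomes "read the set, delete those keys" (like SMEMBERS + DEL),
  # touching only the keys that belong to the namespace.
  def expire(namespace)
    keys = @index.delete(namespace) || Set.new
    keys.each { |k| @data.delete(k) }
    keys.size
  end

  def size
    @data.size
  end
end
```

The trade-off is extra bookkeeping on every write, in exchange for expiry cost proportional to the number of keys being expired rather than the total keyspace.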
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-02-03
- 15:19 - @andrewn called attention to a non-paging alert about high single-CPU usage on a redis-cache node. (See example below.) EOC (@msmiley) begins investigating.
- 15:55 - @andrewn finds that Redis SCAN commands correlate with the CPU usage spikes.
- 16:23 - @msmiley started a rotating packet capture to wait for the next occurrence and find where the unexpected SCAN commands were coming from and what their arguments were.
- 18:59 - Captured the next regression event.
- 20:40 - Found the canary web node was the origin of the SCAN commands, and found they all shared a distinctive MATCH glob expression. Slack discussion led quickly to finding the relevant code and context. Merge request: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/155
- 23:32 - Hotpatch finished deploying to gprd. Pipeline: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/pipelines/112390
Example
The single-CPU saturation sometimes triggered non-paging ~S3 alerts in the #alerts-general Slack channel.
Example:
The redis-cache service (main stage), single_threaded_cpu component has a saturation exceeding SLO and is close to its capacity limit. This means that this resource is running close to capacity and is at risk of exceeding its current capacity limit.
The Redis CPU usage spikes correlated with spikes in the duration of SCAN commands (which are very rare).

