When Redis switches masters, our Sidekiq workers have trouble completing jobs.
Please see this issue, where much of the related discussion has already happened.
John Skarbek changed title from "Incident Redis switchover causing increased rate of sidekiq errors" to "Redis switchover causing increased rate of sidekiq errors"
Latency monitoring was enabled on the current master (redis-01). After a few seconds, latency doctor reported:

Dave, no latency spike was observed during the lifetime of this Redis instance, not in the slightest bit. I honestly think you ought to sit down calmly, take a stress pill, and think things over.
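For reference, a minimal redis-cli sketch of how latency monitoring can be enabled and queried; the 100ms threshold and the local connection are assumptions, so adjust host/port and the threshold to match redis-01:

```shell
# Record events slower than 100ms (0 disables latency monitoring).
redis-cli CONFIG SET latency-monitor-threshold 100

# After letting it collect samples for a bit, ask for the human-readable report
# (this is what produced the "Dave, ..." message above) and the raw per-event data.
redis-cli LATENCY DOCTOR
redis-cli LATENCY LATEST

# Clear the collected samples before the next failover window, if desired.
redis-cli LATENCY RESET
```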
The timestamps are all from today (mostly around 10:00 UTC).
lpush is an O(1) operation in Redis, so it should be really fast, yet we're seeing some (lpush resque:gitlab:queue:system_hook_push) taking up to 170ms.
The queues in question (resque:gitlab:queue:system_hook_push and resque:gitlab:queue:email_receiver) both currently have a length of zero (not that this will make a difference for an O(1) operation); see the redis-cli sketch after these observations.
Staging does not have the same sort of entries. Its slowlog does not have anything in the past week.
Could the lpush operations just be random commands that are in-progress when the machine randomly freezes up?
Another observation: considering the size of these machines, and especially how much bigger they are than their Azure counterparts, I'm quite surprised at how much CPU Redis is consuming: ~60% of one core. It would be worth investigating usage in Azure from before the failover.
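A sketch of the checks behind the observations above, assuming they are run against the current master; the ps line assumes a standard Linux procps toolchain:

```shell
# Ten most recent slowlog entries: each shows an id, a unix timestamp,
# the duration in microseconds, and the command with its arguments.
redis-cli SLOWLOG GET 10

# Current length of the queues that showed up in the slowlog.
redis-cli LLEN resque:gitlab:queue:system_hook_push
redis-cli LLEN resque:gitlab:queue:email_receiver

# Rough CPU/memory check for the redis-server process.
ps -o %cpu,rss,cmd -p "$(pgrep -o redis-server)"
```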
Current Plan
@skarbek to await another failover and run LATENCY DOCTOR and SLOWLOG GET 10 post-failover on the ex-master (redis-01) and see what comes up.
We need to keep in mind that the command that appears in the log may be an "innocent bystander" rather than the cause of the latency, if the latency is external.
It would also be worth investigating whether Redis is spawning a child for BGSAVE persistence (using copy-on-write) at the time of the latency spike; see http://antirez.com/news/84 for why that can cause pauses.
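A hedged way to check for that from redis-cli, using the standard INFO fields (run it on the node that actually shows the spike):

```shell
# Is a BGSAVE child currently running, and did the last one succeed?
redis-cli INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'

# How long did the last fork() take, in microseconds? A large value here points
# at fork/copy-on-write latency of the kind discussed in the antirez post above.
redis-cli INFO stats | grep latest_fork_usec
```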
On GitLab.com, we have two system hooks, but neither of them handle the push event. If we can't figure out how to limit the payload in https://gitlab.com/gitlab-org/gitlab-ce/issues/32369, we may just want to avoid pushing this payload if we have no listeners for this event.
So, in watching things today, node 01 almost failed us. One of the servers indicated it had gone into a subjective down, but it remained master. I figured now would be a good time to collect a bit of data. From the slow log:
@reprazent and @mrjkc will skip the "deploy to production" step of the deploy. It is exactly the same as the current RC, and without full staff around and with the current instability, it's not worth the risk of doing a full deploy just to increment a version number.
@skarbek to apply the patch and monitor Redis and the Sidekiq fleet for OOM errors
The patch will mean that we send fewer 160MB payloads to Sidekiq, but we will still be sending them (a quick way to gauge payload sizes on the queue is sketched after these action items).
Hopefully this will mean fewer OOM errors in Sidekiq and fewer Redis failovers (without making them go away altogether).
If this doesn't reduce the Sidekiq OOM errors, we'll do a rolling upgrade of the realtime sidekiq fleet to scale them up vertically. @jarv will prepare the change for this.
@ahmadsherif @alejandro please could you be on hand during @skarbek's patch deployment, in case he needs assistance.
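As a rough way to confirm the patch is having an effect, the read-only sketch below samples the most recently pushed job on the system hook queue and reports its serialized size; the queue name is the one seen in the slowlog earlier:

```shell
# Queue length, plus the size in bytes of the most recently pushed job payload
# (Sidekiq LPUSHes new jobs onto the head of the list, i.e. index 0).
redis-cli LLEN resque:gitlab:queue:system_hook_push
redis-cli LINDEX resque:gitlab:queue:system_hook_push 0 | wc -c

# Optionally scan the whole keyspace for outsized keys (uses SCAN, so non-blocking).
redis-cli --bigkeys
```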