
Roll out a fixed version of the Sidekiq Reliable fetcher

See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5215

In 11.5 we enabled the Reliable fetcher and found a few problems with it. Once those are fixed, we need to try enabling it again.

/cc @northrup @stanhu @rnienaber

Production readiness review (template)

Summary

Sidekiq is prone to losing jobs because it simply pops a job from the queue (a Redis list) and only pushes it back when it catches an exception. However, there are cases when Sidekiq can't handle the error properly, for example when it crashes or is stopped with kill -9. We don't have a metric showing how serious this issue is, but we have seen many cases where a job was never delivered. It's also not only about GitLab.com: we don't know the conditions GitLab instances run under around the world, so a reliable queueing system is generally highly desirable. In some cases the cost of a loss is low, like a missed email notification, but sometimes it's crucial, for example a missed Geo event.

Architecture

The Sidekiq Basic fetcher picks a job from a queue using Redis BRPOP, which removes the job from the queue entirely. Our Reliable fetcher acts differently: it moves the job from the regular queue to a special working queue, so if a process is stopped non-gracefully we can reschedule the job later.
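
For illustration, a minimal sketch of the difference using the redis-rb client (the queue and working-queue names here are simplified, not the gem's exact layout):

```ruby
require 'redis'

redis = Redis.new

# Basic fetcher: BRPOP pops the job off the queue atomically. If the process
# dies before the job finishes, the payload is simply gone.
_queue, job = redis.brpop('queue:default', timeout: 2)

# Reliable fetcher: RPOPLPUSH atomically moves the job into a per-process
# working queue, so a crashed process leaves the payload recoverable.
working_queue = 'working:queue:default:host-01:1234' # hostname/PID are illustrative
job = redis.rpoplpush('queue:default', working_queue)

# Once the job has been processed successfully, drop it from the working queue.
redis.lrem(working_queue, 1, job) if job
```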

How do we know that a working queue is dead and every job in it should be rescheduled? Every worker has its own working queue, built with the pattern "#{WORKING_QUEUE_PREFIX}:#{queue}:#{hostname}:#{pid}". Every 20 seconds a worker sends a heartbeat to Redis using SET with a 1-minute expiration time. If the heartbeat key has expired, we consider the worker with that hostname and PID dead and reschedule all the jobs from its working queue. Every hour, each worker tries to clean up dead working queues if it can take a lease (a lock key in Redis), so only one worker processes dead queues at a time.
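
A rough sketch of that heartbeat and cleanup cycle; the key names, lease key, and scan logic below are assumptions based on this description, not the gem's actual code:

```ruby
require 'redis'
require 'socket'

HEARTBEAT_INTERVAL   = 20 # seconds between heartbeats
HEARTBEAT_LIFESPAN   = 60 # seconds before a missing heartbeat marks a worker dead
WORKING_QUEUE_PREFIX = 'working'

redis    = Redis.new
identity = "#{Socket.gethostname}:#{Process.pid}"

# Heartbeat thread: refresh this worker's liveness key every 20 seconds with a
# 1-minute TTL, so the key disappears shortly after the worker stops.
Thread.new do
  loop do
    redis.set("sidekiq:heartbeat:#{identity}", 1, ex: HEARTBEAT_LIFESPAN)
    sleep HEARTBEAT_INTERVAL
  end
end

# Hourly cleanup: only the worker that grabs the lease scans for dead queues.
def requeue_dead_working_queues(redis)
  return unless redis.set('sidekiq:reliable_fetch:cleanup_lease', 1, nx: true, ex: 3600)

  redis.scan_each(match: "#{WORKING_QUEUE_PREFIX}:*") do |working_queue|
    # Pattern: working:<queue>:<hostname>:<pid> (assumes queue names without colons)
    _prefix, queue, hostname, pid = working_queue.split(':')
    next unless redis.get("sidekiq:heartbeat:#{hostname}:#{pid}").nil? # expired => dead

    # Push every orphaned job back onto its original queue for re-processing.
    while redis.rpoplpush(working_queue, "queue:#{queue}"); end
  end
end
```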

By default, every worker starts 10 threads to process jobs; in addition, we start one more thread for the heartbeat.

Sidekiq has a built-in way to specify which fetcher will be used. The OSS version (the one we use) ships with the Basic fetcher, which is unsafe as described above.

Screenshot: Screen_Shot_2018-12-11_at_14.40.14

As you can see, the OSS version mostly targets hobbyists.

We created two fetchers in a separate Ruby gem; they can be enabled on GitLab.com via the Flipper feature flag gitlab_sidekiq_reliable_fetcher. If something goes wrong, we can disable it using chatops.
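
A hypothetical initializer for wiring this up; the `Sidekiq.options[:fetch]` hook is how Sidekiq selected a fetch strategy at the time, but the class name below is illustrative rather than the gem's confirmed API:

```ruby
# config/initializers/sidekiq_reliable_fetcher.rb (hypothetical)
Sidekiq.configure_server do |config|
  # When the Flipper flag is off we keep Sidekiq's default BasicFetch,
  # so the rollout can be reverted at any time via chatops.
  if Feature.enabled?(:gitlab_sidekiq_reliable_fetcher)
    Sidekiq.options[:fetch] = Sidekiq::ReliableFetcher # illustrative class name
  end
end
```

In this sketch the strategy is picked at process boot, so toggling the flag takes effect as Sidekiq workers restart.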

Radius of failures

Sidekiq is a fundamental part of our system. If something goes wrong, background jobs won't be processed, which means many services won't work correctly (we have about 185 different queues). There is another possible problem: job duplication. If a job is still being processed but we reschedule it anyway, by moving it from a working queue back to a regular one, the job is effectively duplicated. Unfortunately, most of our jobs are not idempotent, and this can cause serious performance problems with our database: too many jobs may try to update the same row in the same table, which leads to an increased number of locks. In our case, the heartbeat is how we detect a dead queue, so if the heartbeat thread fails to send a request within the 1-minute window, duplicates will be created. However, it should not have the snowball effect we saw with the original version of the Reliable fetcher; we have completely reworked how it operates.

Note about the Semi-Reliable fetcher

In the Reliable fetcher, we use the Redis RPOPLPUSH command to fetch a new job from the queues. It's an atomic call, so the job cannot be lost because of a non-graceful interruption. Unfortunately, it has one significant drawback: it can't listen to many queues at once (as BRPOP can); instead, we need to constantly iterate over every queue to pick up a job. Since we have about 185 queues in the GitLab application, fetching a job becomes expensive. To mitigate this we developed a slightly modified version of the Reliable fetcher: the Semi-Reliable fetcher. It uses the same approach; the only difference is how it fetches the job: it uses a sequence of BRPOP and LPUSH commands to move the job from a regular queue to a working one. This fixes the performance problem but introduces a negligible possibility of losing the job, because the move is not atomic: there is a 1-5 ms window in which an unexpected interruption can lead to a loss.
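
A sketch of the two fetch paths side by side, again with redis-rb and simplified key names (not the gem's exact implementation):

```ruby
require 'redis'
require 'socket'

redis    = Redis.new
identity = "#{Socket.gethostname}:#{Process.pid}"

# Semi-Reliable: BRPOP can block on all queues at once, but the follow-up LPUSH
# into the working queue is a separate command, so a crash between the two
# calls (a window of a few milliseconds) can still lose the job.
queue, job = redis.brpop('queue:default', 'queue:pipeline_processing', timeout: 2)
redis.lpush("working:#{queue}:#{identity}", job) if job

# Fully Reliable: RPOPLPUSH is atomic, but works on a single queue per call,
# so all ~185 queues have to be polled in turn.
job = redis.rpoplpush('queue:default', "working:queue:default:#{identity}")
```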

We developed a special test tool to mimic very harsh conditions: 200 workers are spawned and then stopped every 30 seconds, using TERM and KILL signals. We got the following results:

200 workers, 200,000 jobs, interruptions every 30 seconds (100 KILL and 100 TERM signals).

Reliable:

Remaining unprocessed (lost): 0
Duplicates found: 0

Semi:

Remaining unprocessed (lost): 159
Duplicates found: 0

Basic:

Remaining unprocessed (lost): 9868
Duplicates found: 0

Based on these data and the performance concerns described above, we think the Semi-Reliable fetcher is the optimal trade-off for us.

Operational Risk Assessment

Both fetchers slightly increase the pressure on the Redis server, but this should not be a problem: we have already tried it on GitLab.com and there were no issues with Redis pressure. In any case, by turning the feature flag off we can stop using the Reliable fetcher at any point.

One more possible problem is duplicates, as explained above.

List the external and internal dependencies to the application (ex: redis, postgres, etc) for this feature and how it will be impacted by a failure of that dependency.

There are a couple of concerns I can think of: what if Redis stops responding? What if, due to network problems, some worker can't connect to Redis? In these cases, the Reliable fetcher will retry those jobs and the work will be duplicated. If this becomes a problem for us, I think we have to fix it at the application level; I don't think the fetcher can do much about it. This is one of the trade-offs we have to make. We can also consider increasing HEARTBEAT_LIFESPAN from 1 minute to 15 or so. In that case, abandoned jobs will be rescheduled with some latency, but the probability of duplicates decreases significantly.
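
A back-of-the-envelope illustration of that knob, using the interval and lifespan values from this description (the constant names follow the text; the gem's actual defaults may differ):

```ruby
HEARTBEAT_INTERVAL = 20      # seconds between heartbeat refreshes
HEARTBEAT_LIFESPAN = 15 * 60 # raised from 60 seconds to 15 minutes

# A worker is only treated as dead after its heartbeat key has gone unrefreshed
# for the whole lifespan, i.e. after roughly this many consecutive misses:
misses_before_declared_dead = HEARTBEAT_LIFESPAN / HEARTBEAT_INTERVAL # => 45

# Trade-off: with a 60-second lifespan, 3 missed heartbeats are enough to
# reschedule (and potentially duplicate) in-flight jobs; with 15 minutes,
# duplicates become far less likely, but genuinely orphaned jobs can wait up
# to 15 minutes before being rescheduled.
puts misses_before_declared_dead
```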

List the top three operational risks when this feature goes live. What are a few operational concerns that will not be present at launch, but may be a concern later?

I can list only one - duplicates.

As a thought experiment, think of worst-case failure scenarios for this product feature. How can the blast radius of the failure be isolated?

If there are too many duplicates, it can lead to a highly visible, global failure. The worst case was seen last time, in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5215. Until the feature was disabled and the heavy jobs were drained, the main impact to users looked like the following:

  • Merge requests diffs not being updated
  • Merge request widgets not updating
  • Repo mirrors not working

To recap the root cause:

  1. Workers are occasionally killed due to excessive memory usage, and new ones are started. This could be improved upon at the application layer, but probably not completely eliminated.
  2. A bug in the old fetcher duplicated jobs that were in the middle of being worked on, at the start of every worker thread (read: on each kill + start due to excessive memory usage, and every time a worker was recycled while rolling out the new feature).
  3. Big jobs exacerbate the problem, and we have at least a couple of big jobs.

What have we done to ensure this won't happen again?

Monitoring and Alerts

We use standard Sidekiq logging, plus we add some extra items to the log in the Reliable fetcher. And yes, it will be shipped to our ELK stack.

How is the end-to-end customer experience measured?

Ideally, we would track the number of lost jobs and duplicated jobs. On the other hand, the Reliable fetcher is intended to improve how the system handles edge cases, so it's not as if we can deploy this feature and see immediate results.

Responsibility

Which individuals are the subject matter experts and know the most about this feature?

@vsizov @mkozono @stanhu and partially @northrup

Which team or set of individuals will take responsibility for the reliability of the feature once it is in production?

Geo team

Is someone from the team who built the feature on call for the launch? If not, why not?

Yes, I will assist in the launch process. If something goes wrong, the fix is mostly just disabling the feature via chatops by turning off the gitlab_sidekiq_reliable_fetcher feature flag.

Testing

Describe the load test plan used for this feature. What breaking points were validated?

We built a special testing tool that spawns 200 workers (2,000 worker threads) and creates 200,000 jobs. Every 30 seconds we interrupt the workers in different ways. At the end, we count the number of lost jobs and the number of duplicates. The results are given above.
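
A hypothetical, stripped-down outline of what such a harness does; the queue name, payload shape, and a `TestJob` that records its ID into a `processed` list are assumptions for illustration, not the actual tool:

```ruby
require 'redis'
require 'json'

NUM_WORKERS = 200
NUM_JOBS    = 200_000

redis = Redis.new

# Enqueue plain Sidekiq-style payloads; TestJob#perform is assumed to record
# its own ID with `redis.lpush('processed', jid)` so losses and duplicates
# can be counted afterwards.
NUM_JOBS.times do |i|
  redis.lpush('queue:test', { 'class' => 'TestJob', 'jid' => i.to_s, 'args' => [] }.to_json)
end

workers = Array.new(NUM_WORKERS) { spawn('bundle exec sidekiq -q test') }

# Every 30 seconds, stop half the workers gracefully (TERM) and half hard
# (KILL), then start replacements, until the queue is drained.
until redis.llen('queue:test').zero?
  sleep 30
  workers.each_with_index do |pid, i|
    begin
      Process.kill(i.even? ? 'TERM' : 'KILL', pid)
    rescue Errno::ESRCH
      # worker already exited
    end
  end
  workers = Array.new(NUM_WORKERS) { spawn('bundle exec sidekiq -q test') }
end

processed = redis.lrange('processed', 0, -1)
puts "Remaining unprocessed (lost): #{NUM_JOBS - processed.uniq.size}"
puts "Duplicates found: #{processed.size - processed.uniq.size}"
```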

Give a brief overview of what tests are run automatically in GitLab's CI/CD pipeline for this feature?

We run unit tests (almost 100% coverage) and a lightweight version of the testing tool suite mentioned above (10 workers, 1,000 jobs for both the Reliable and Semi-Reliable fetchers).
