Investigating dangling jobs in redis-sidekiq default queue
After production#17868, we observed a few jobs still being processed by pods in the gitlab-sidekiq-catchall-v2 deployment even after the feature flag had been fully enabled at 0744h UTC on 09 May 2024.
The created_at translates to Tue May 07 2024 05:00:04 GMT+0000, which was before the PDM. Everything checks out.
Part 2: Groups::ResetSeatCalloutsWorker
The error shows uninitialized constant Groups::ResetSeatCalloutsWorker\n\n klass = Object.const_get(job_hash[\"class\"])\n ^^^^^^^^^^\nDid you mean? Groups::ReleasesController, which can be correlated in the non-prod logs.
This error is raised in the Sidekiq processor and the job is sent to the retry sorted set, where the Sidekiq sharding enqueuer picks it up. However, the same missing constant is converted into nil by safe_constantize, and with a nil class we default to the main shard of redis-sidekiq.
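To make the fallback concrete, here is a minimal illustrative sketch, not the actual GitLab enqueuer code; the shard_for helper and the 'main' label are made up for illustration:

```ruby
# Illustrative sketch only: a job whose class no longer exists falls back to
# the main redis-sidekiq shard when re-enqueued from the retry sorted set.
require 'active_support/core_ext/string/inflections'

def shard_for(job_hash)
  klass = job_hash['class'].safe_constantize   # nil when the constant is gone
  return 'main' if klass.nil?                  # default to the main shard (redis-sidekiq)

  klass.get_sidekiq_options['store'] || 'main'
end

shard_for({ 'class' => 'Groups::ResetSeatCalloutsWorker' }) # => "main"
```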
I feel this is expected behaviour, since an uninitialized constant indicates a problem with the Ruby application's code loading.
An option is to have a small k8s deployment which listens to all sharded-out queues.
The feature flag was switched to 100% at 0744h UTC.
Looking at the latest performed job by the catchall shard pods, we see the job was picked up at 0805h UTC.
Example 1
A Namespaces::ScheduleAggregationWorker job was performed at 0805h.
Checking further, we can see that the caller started at 0742h (before the ff toggle) and ended at 0805h (after the ff toggle).
Example 2
A Integrations::GroupMentionWorker job was performed at 0748h.
The caller, NewMergeRequestWorker, started at 0730h and ended at 0748h.
The root cause is memoized feature flag state in the request store.
Every Sidekiq job initialises a request store in the server middleware. With an active request store, memoize: true is used for the Flipper class. This means that within the scope of a request store's lifetime, feature flag values are memoized and not updated. This is not a problem for short workloads, but the above 2 examples lasted ~15-20 minutes.
This results in the router sending the message to the old Redis because of the memoized (and outdated) feature flag state.
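As a minimal sketch of the memoization semantics using the flipper gem directly (GitLab wires this up through Feature and the request store; the flag name here is made up):

```ruby
require 'flipper'
require 'flipper/adapters/memory'
require 'flipper/adapters/memoizable'

memory   = Flipper::Adapters::Memory.new
memoized = Flipper::Adapters::Memoizable.new(memory)
flipper  = Flipper.new(memoized)

memoized.memoize = true                          # roughly what an active request store turns on
flipper.enabled?(:route_to_new_shard)            # => false, now cached for this "request"

Flipper.new(memory).enable(:route_to_new_shard)  # flag toggled elsewhere (e.g. another process)

flipper.enabled?(:route_to_new_shard)            # => still false until memoization is reset
```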
Local repro
Open a console (A) and run a simple loop to print the feature flag status:
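Something along these lines (the flag name is hypothetical and should be replaced with a defined flag; the request store is begun manually to mirror what the Sidekiq server middleware does per job):

```ruby
# Console A: print the flag state for a few minutes within one request store.
RequestStore.begin!

5.times do
  puts "#{Time.now.utc} enabled=#{Feature.enabled?(:sidekiq_route_to_new_shard)}"
  sleep 30
end

RequestStore.end!
RequestStore.clear!

# Console B (while console A loops): Feature.enable(:sidekiq_route_to_new_shard)
# Console A keeps printing the stale value for as long as its request store lives.
```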
However, this can be explained by the recent change in urgency of the worker class in gitlab-org/gitlab!150276 (merged). In the screenshot below, the queue is default and created_at is 21 Apr, since the job has been failing and being retried. This was the 24th retry.
In scheduled_enq.rb, Llm::CompletionWorker.get_sidekiq_options['store'] is nil, which causes the job to be routed to redis-sidekiq. However, the push uses the queue in the job_hash, which was default since that was the queue when the job was created (before the urgency update).
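An illustrative sketch of how the two lookups diverge for such an old retried job; FakeCompletionWorker and its queue name are stand-ins, not the real worker definition:

```ruby
require 'sidekiq'

class FakeCompletionWorker
  include Sidekiq::Worker
  # Queue after the urgency update; note that no 'store' attribute is set.
  sidekiq_options queue: 'some_other_queue'
end

# Job hash created before the update, so its queue is frozen at 'default'.
job_hash = { 'class' => 'FakeCompletionWorker', 'queue' => 'default' }

store = FakeCompletionWorker.get_sidekiq_options['store'] # => nil, so main redis-sidekiq shard
queue = job_hash['queue']                                 # => "default"

# The retry push combines the two: main shard + default queue, where the job
# then dangles because catchall listens on another shard after the migration.
puts({ store: store || 'main', queue: queue }.inspect)
```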
Conclusion: this is a one-off affecting jobs whose worker attributes were updated such that the queue changed from default to another queue. It only affects jobs scheduled before the feature flag was enabled, with a scheduled time after the feature flag enablement.
We can safely ignore these since they are an edge case. To be cautious, we can take the same approach as #3497 (comment 1898992058) and run a deployment of 1-2 replicas just to pick up any stray jobs.
The cron has not been removed from Redis, which could explain why it continues to get enqueued. Because the class does not exist, it gets pushed as a raw job.
The MR was reverted on 4 July and deployed on 6th July (gitlab-org/gitlab!158348 (merged)). This caused the cron job to be correctly routed and the rate to taper off.
When a cron worker class is removed without disabling the cron, the router does not have any information on how to route it and sends it to the default queue. This is harmless to the system since the jobs are not picked up. Prior to sharding, the jobs would have been picked up and then dropped since no matching class exists.
All Llm::Embedding::GitlabDocumentation::* workers in the list above come from commit 84f8c88b8cebf5069eb4906b6170ea760e6b0a57.
Cause 2: Mis-routed during transitory phase
ProductAnalytics::MoveFunnelsWorker. Job created at Thu May 16 2024 15:36:57 GMT+0000. Introduced in gitlab-org/gitlab!150683 (merged). Deployed ~May 16, 2024 at 18:51:07 GMT+0.
"error_message": "uninitialized constant ProductAnalytics::MoveFunnelsWorker\n\n klass = Object.const_get(job_hash[\"class\"])\n ^^^^^^^^^^",
The "meta.caller_id": "Projects::Settings::AnalyticsController#update" which indicates it may be a case where canary tries to enqueue a job class which does not exists on the Sidekiq pods. The Sidekiq server process is unable to process this class and retries would not route the job correctly. Note that the datetime of job creation is earlier than the complete deploy datetime.
Llm::CompletionWorker. Job created at Wed Jun 19 2024 20:41:16 GMT+0000. There was a change in a worker property (gitlab-org/gitlab!155547 (merged)) merged on 19 June. Deployment completed at June 19, 2024 at 21:53:14 GMT+0.
"error_message": "undefined method `member?' for #<Namespaces::UserNamespace id:REDACTED @REDACTED>\nDid you mean? member_roles",
The change moved the worker from the urgent_other queue on the default shard to the default queue of the catchall_a shard.
Likely explanation: the job gets pushed to the correct shard initially, and the pod that picks it up raises the above error. The job retries but gets picked up by a pod that has not been updated. That pod uses the queue in the job hash (default) but the store from the job class's defaults.
Ci::ClickHouse::FinishedPipelinesSyncCronWorker (no errors). Job created at Mon Jul 15 2024 10:36:04 GMT+0000. Introduced in gitlab-org/gitlab!159083 (merged). Deployed ~ July 15, 2024 at 10:58:14 GMT+0 (based on label change in MR).
Explanation: likely the Sidekiq process that retried the job for scheduling was a pod running the older release. Without the right class, sidekiq-cron pushes the job without applying the routing logic.
The logs show that the job started at around that time, ~1036h UTC.
A fix was introduced in gitlab-org/gitlab!160516 (merged) and has been deployed. This addresses the issue experienced by Llm::CompletionWorker when the worker attributes were updated to route the worker to the default queue.