
Namespaces::ProcessSyncEventsWorker: `until_executed` does not allow the worker to re-schedule itself

Noticed in: https://gitlab.com/gitlab-com/Product/-/issues/2852#note_1011169306

Problem

When there's a large backlog of SyncEvents (on namespaces or projects), the worker re-schedules itself so that it runs again and keeps draining the queue.

The change introduced via !83286 (merged) broke this mechanism: the re-scheduled job is "deduplicated" because it is enqueued while the original job is still executing.

This shows that the mechanism is not correct: with `until_executed` deduplication, a worker that tries to re-schedule itself will never succeed.
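
For context, a minimal sketch of the pattern involved (simplified, not the exact GitLab source; `process_batch_of_sync_events` is an illustrative placeholder for the service call the worker actually makes):

module Namespaces
  class ProcessSyncEventsWorker
    include ApplicationWorker

    idempotent!
    deduplicate :until_executed # the deduplication strategy described above

    def perform
      results = process_batch_of_sync_events # illustrative; the real worker delegates to a service

      # When more events remain than fit in one batch, the worker enqueues
      # itself again. With `until_executed`, this enqueue happens while the
      # current job still holds the idempotency key, so the new job is
      # dropped as a duplicate and the backlog is never drained.
      self.class.perform_async if results[:estimated_total_events] > results[:processed_events]
    end
  end
end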

Symptoms

{"severity":"INFO","time":"2022-06-29T20:58:13.627Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"cdfa166743baedeb19a50794","created_at":"2022-06-29T20:58:13.621Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","size_limiter":"validated","enqueued_at":"2022-06-29T20:58:13.626Z","job_size_bytes":2,"pid":23095,"message":"Namespaces::ProcessSyncEventsWorker JID-cdfa166743baedeb19a50794: start","job_status":"start","scheduling_latency_s":0.001591}
{"severity":"INFO","time":"2022-06-29T20:58:25.174Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"2b8d251960c7589bf10533c1","created_at":"2022-06-29T20:58:25.172Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","meta.caller_id":"Namespaces::ProcessSyncEventsWorker","meta.root_caller_id":"Namespaces::ProcessSyncEventsWorker","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","duplicate-of":"cdfa166743baedeb19a50794","job_size_bytes":2,"pid":23095,"job_status":"deduplicated","message":"Namespaces::ProcessSyncEventsWorker JID-2b8d251960c7589bf10533c1: deduplicated: dropped until executed","deduplication.type":"dropped until executed"}
{"severity":"INFO","time":"2022-06-29T20:58:25.176Z","retry":25,"queue":"namespaces_process_sync_events","version":0,"args":[],"class":"Namespaces::ProcessSyncEventsWorker","jid":"cdfa166743baedeb19a50794","created_at":"2022-06-29T20:58:13.621Z","meta.feature_category":"pods","correlation_id":"27538673072803a620ee850db8957ca4","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:namespaces_process_sync_events:99b01e5f7dc2a9b47ce31029252a21c1a7e02ef97b2a00a70d65582147f34ddb","size_limiter":"validated","enqueued_at":"2022-06-29T20:58:13.626Z","job_size_bytes":2,"pid":23095,"message":"Namespaces::ProcessSyncEventsWorker JID-cdfa166743baedeb19a50794: done: 11.549015 sec","job_status":"done","scheduling_latency_s":0.001591,"redis_calls":4,"redis_duration_s":0.0066500000000000005,"redis_read_bytes":253,"redis_write_bytes":1065,"redis_cache_calls":1,"redis_cache_duration_s":0.002337,"redis_cache_read_bytes":203,"redis_cache_write_bytes":52,"redis_queues_calls":1,"redis_queues_duration_s":0.000149,"redis_queues_read_bytes":48,"redis_queues_write_bytes":672,"redis_shared_state_calls":2,"redis_shared_state_duration_s":0.004164,"redis_shared_state_read_bytes":2,"redis_shared_state_write_bytes":341,"db_count":3005,"db_write_count":2001,"db_cached_count":0,"db_replica_count":0,"db_primary_count":3005,"db_main_count":3005,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":6.698,"db_main_duration_s":6.698,"db_main_replica_duration_s":0.0,"cpu_s":5.228127,"mem_objects":1594293,"mem_bytes":61860304,"mem_mallocs":246876,"mem_total_bytes":125632024,"worker_id":"sidekiq_0","rate_limiting_gates":[],"extra.namespaces_process_sync_events_worker.estimated_total_events":515929,"extra.namespaces_process_sync_events_worker.consumable_events":1000,"extra.namespaces_process_sync_events_worker.processed_events":1000,"duration_s":11.549015,"completed_at":"2022-06-29T20:58:25.176Z","load_balancing_strategy":"primary","db_duration_s":6.564283}

Impact

The likelihood of this occurring is very low in the production environment because of the high rate of events. However, it might occur on on-premise instances when a large number of events is enqueued at a single moment. The result is that the queue can never be emptied.
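
To check whether an instance is affected, the backlog size can be inspected from a Rails console (a sketch; this assumes the events the worker processes are backed by the `Namespaces::SyncEvent` and `Projects::SyncEvent` models):

# Number of pending sync events per type; large, non-decreasing counts
# indicate the worker is no longer draining the queue.
Namespaces::SyncEvent.count
Projects::SyncEvent.count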

Workaround

Revert !83286 (merged)

Workaround (manual)

loop { puts Namespaces::ProcessSyncEventsWorker.new.perform }
{:estimated_total_events=>539371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>538371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>537371, :consumable_events=>1000, :processed_events=>1000}
{:estimated_total_events=>536371, :consumable_events=>1000, :processed_events=>1000}
# until there are no more events

And then

loop { puts Projects::ProcessSyncEventsWorker.new.perform }
...

Fix

Figure out how to make `until_executed` re-schedule the worker when it is discovered that another request arrived during execution of the worker.
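
One possible direction, sketched below (the `if_deduplicated: :reschedule_once` option and the strategy behaviour are assumptions for illustration, not existing worker code): have the deduplication strategy record that a duplicate was dropped while the job was running, and enqueue the job once more when the running job completes.

# Hypothetical worker declaration, assuming a new `if_deduplicated` option
# is added to the deduplication DSL:
module Namespaces
  class ProcessSyncEventsWorker
    include ApplicationWorker

    idempotent!
    deduplicate :until_executed, if_deduplicated: :reschedule_once

    def perform
      # ... unchanged ...
    end
  end
end

# Sketch of the intended behaviour:
# 1. Job A acquires the idempotency key and starts executing.
# 2. Job B is enqueued with the same key; it is dropped, but a
#    "duplicate was dropped" flag is stored alongside the key.
# 3. When job A finishes and releases the key, the strategy checks the
#    flag and, if set, enqueues the worker once more so the backlog
#    keeps draining.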
