Flooded with logs on staging sidekiq besteffort due to large queue of background migrations
{"severity":"INFO","time":"2020-02-07T16:13:42.549Z","class":"BackgroundMigrationWorker","args":["ActivatePrometheusServicesForSharedClusterApplications",3012677],"retry":3,"queue":"background_migration","jid":"1d4cae1d873087ef26e27d6a","created_at":"2020-02-07T12:13:34.176Z","meta.caller_id":"BackgroundMigrationWorker","correlation_id":"031886fd3b177cd4fcb7aaefafe933d6","enqueued_at":"2020-02-07T12:15:08.972Z","pid":24286,"message":"BackgroundMigrationWorker JID-1d4cae1d873087ef26e27d6a: start","job_status":"start","scheduling_latency_s":14313.577107}
{"severity":"INFO","time":"2020-02-07T16:13:42.554Z","class":"BackgroundMigrationWorker","args":["ActivatePrometheusServicesForSharedClusterApplications",2311194],"retry":3,"queue":"background_migration","jid":"adc208d7918ba7eb67d674f6","created_at":"2020-02-07T12:13:26.173Z","meta.caller_id":"BackgroundMigrationWorker","correlation_id":"73f276eff7a57d450c36fac7f2026a55","enqueued_at":"2020-02-07T12:15:08.846Z","pid":24286,"message":"BackgroundMigrationWorker JID-adc208d7918ba7eb67d674f6: done: 0.537016 sec","job_status":"done","scheduling_latency_s":14313.171338,"duration":0.537016,"cpu_s":0.004422,"completed_at":"2020-02-07T16:13:42.554Z","db_duration":0,"db_duration_s":0}
We are being flooded with these log messages on staging, at approximately 260 messages per second.
I believe gitlab-org/gitlab!24135 was meant to clear these migrations from the queue.
This is also causing errors when the logs are forwarded to Stackdriver, similar to another issue: https://gitlab.com/gitlab-org/gitlab/issues/202612
2020-02-07 16:16:16 +0000 [warn]: #0 Failed to extract log entry errors from the error details: {
"error": {
"code": 400,
"message": "Request payload size exceeds the limit: 10485760 bytes.",
"status": "INVALID_ARGUMENT"
}
}
. error_class=JSON::ParserError error="NilClass"
2020-02-07 16:16:16 +0000 [warn]: #0 Dropping 10592 log message(s) error="Invalid request" error_code="400"
I believe in this case, which may be different from the other issue, the very high message rate is filling the buffer with multiple chunks and hitting the 10MB request limit. In the configuration we set buffer_chunk_limit 5m, which is a bit higher than the default; it is set higher so that we don't hit the API request limit.
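For reference, the relevant part of the fluentd output configuration looks roughly like this. Only the buffer_chunk_limit 5m value comes from our actual configuration; the match pattern, buffer path, and remaining parameters are illustrative, using the legacy v0.12-style buffer options accepted by fluent-plugin-google-cloud:

```
<match **>
  @type google_cloud
  buffer_type file
  buffer_path /var/log/google-fluentd/buffers   # illustrative path
  # Each chunk must serialize to under Stackdriver's 10 MB per-request cap;
  # 5m is the value from our configuration mentioned above.
  buffer_chunk_limit 5m
  buffer_queue_limit 64                         # illustrative value
  flush_interval 5s                             # illustrative value
</match>
```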
@abrandl @stanhu @mikolaj_wawrzyniak Can we clean up this huge queue?
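For what it's worth, here is a sketch of what the cleanup could look like from a Rails console, using Sidekiq's standard queue API. The class, queue, and migration names are taken from the log lines above; whether it is actually safe to delete these jobs outright (rather than letting the migration run to completion) needs confirming first.

```ruby
require 'sidekiq/api'

queue = Sidekiq::Queue.new('background_migration')
puts "pending jobs: #{queue.size}"

# Delete only the jobs for this particular background migration,
# leaving any other queued migrations untouched.
queue.each do |job|
  next unless job.klass == 'BackgroundMigrationWorker'
  next unless job.args.first == 'ActivatePrometheusServicesForSharedClusterApplications'

  job.delete
end

# The retry and scheduled sets can hold more instances of the same job.
[Sidekiq::RetrySet.new, Sidekiq::ScheduledSet.new].each do |set|
  set.each do |job|
    job.delete if job.args.first == 'ActivatePrometheusServicesForSharedClusterApplications'
  end
end
```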