ProjectExport queuing and delays

Current summary of findings

A DNS and/or GCS problem, as seen from sidekiq in Kubernetes, is leading to poor throughput (exacerbated by HPA autoscaling).

Details

For several days now (since 2020-11-16) the project_export queue in sidekiq has periodically grown very large (over 1,000 jobs), when it would normally never top 100 and is usually much lower:

image

Source

We expect some queuing, since the memory-bound sidekiq workers are an intentionally limited resource (exports are not urgent), but the current level is unusually intense. It is causing problems for project templating (implemented as an export/import, using the project_export queue), as reported in depth at gitlab-org/gitlab#284498 (closed), and we have adjusted rate-limits at #3050 (closed) to attempt to compensate.

However, the number of jobs does not appear dramatically different:

image

Source

and they do not appear to be taking substantially more time than usual:

image

Source

There is a bit of an increase, but each bar there covers 3 hours and we have up to 16 memory-bound sidekiq pods which, with queuing, should have been fully occupied and thus using up to 3600*3*16 = 172800 CPU seconds per bar. We only got close to that once, at 2020-11-18 00:00-03:00 UTC.
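As a rough check of that ceiling, the arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, it assumes each pod contributes at most one CPU-second per wall-clock second (which is what the 3600*3*16 figure implies), and the observed value in the usage line is made up purely to show the calculation, not taken from the graph.

```python
SECONDS_PER_HOUR = 3600


def max_cpu_seconds(hours_per_bar: int, pods: int) -> int:
    """Upper bound on CPU seconds per graph bar.

    Assumes every pod is busy for the whole bar and uses
    one CPU-second per wall-clock second (an assumption).
    """
    return SECONDS_PER_HOUR * hours_per_bar * pods


def utilization(observed_cpu_seconds: float, hours_per_bar: int, pods: int) -> float:
    """Fraction of the theoretical ceiling that was actually used."""
    return observed_cpu_seconds / max_cpu_seconds(hours_per_bar, pods)


# Ceiling for a 3-hour bar with 16 memory-bound pods.
print(max_cpu_seconds(3, 16))  # 172800

# Hypothetical observed value, for illustration only.
print(utilization(86400, 3, 16))  # 0.5
```

Any bar sitting well below the ceiling while the queue is backed up suggests the pods were waiting on something other than CPU for much of that time.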

And finally, with a full queue, I would expect the Shard Utilization graph to be pegged at 100% utilization for those periods, but it is much more lightly used than that.

There is something Not Right here.

Edited by Craig Miskell