Decide what to do with Docker Registry mirroring for our CI fleet

Our auto-scaled Runner managers are configured to use a Docker Registry mirror that can run on the cache servers. However, over the last 2-3 months we've been investigating performance problems with these servers, which ended with us disabling the Registry process. At this point I think we can even say that having the mirror introduces more problems than performance improvements (e.g. as described in #3157 (moved)).

We should decide what to do with Registry mirroring.

Mirroring has been disabled for at least a month (maybe even longer), and we haven't seen any reports that the Pulling docker image... step takes significantly longer than it used to. This suggests that the mirroring may be unnecessary and that we could simplify our configuration by just removing it.

The problem is that without registry mirroring we're pulling the same images over and over again from the remote registries. We don't have metrics for that, but I'd say that the same images in the same versions are pulled more often than new versions of images are introduced in jobs. This of course generates higher network utilization, which means slower pulls (as described above, probably not a problem for us at this moment) and higher costs. Using a mirror placed in the same network as the auto-scaled machines means that we don't incur these additional costs, and image pulls could be a little faster.
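To make the "same images pulled over and over" argument concrete, here is a rough sketch of how the redundant traffic could be estimated if we ever collect pull data. Everything in it is an assumption for illustration: the `redundant_pull_bytes` helper, the image references and the sizes are made up, and it treats each image as one blob (it ignores layers shared between images and any local cache on the machines).

```python
from collections import Counter


def redundant_pull_bytes(pulls, sizes):
    """Estimate how much pull traffic a perfect mirror could keep local.

    pulls: iterable of image references, one entry per pull (hypothetical data).
    sizes: mapping of image reference -> size in bytes (hypothetical data).
    Returns (total_bytes, unique_bytes, redundant_bytes).
    """
    counts = Counter(pulls)
    # Without a mirror, every pull goes to the remote registry.
    total = sum(sizes[ref] * n for ref, n in counts.items())
    # With an ideal mirror, each image version is fetched remotely only once.
    unique = sum(sizes[ref] for ref in counts)
    return total, unique, total - unique


# Made-up example: five pulls of two image versions.
example_pulls = ["ruby:2.6", "ruby:2.6", "ruby:2.6", "node:12", "node:12"]
example_sizes = {"ruby:2.6": 300, "node:12": 350}
total, unique, redundant = redundant_pull_bytes(example_pulls, example_sizes)
print(total, unique, redundant)
```

The gap between the total and the unique bytes is the upper bound on what a mirror could save; real numbers would need actual pull logs from the fleet.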

The conclusion is that we should discuss whether we still want to use Registry mirroring and, if we decide to keep it, how to change the configuration so that it doesn't affect performance the way it does now.

Additional context from product scaling agenda at https://docs.google.com/document/d/1nMJzrDfG7C14WP5v7P226oPFuXkwqIk7bdIT8ai0DNU/edit?ts=5d84fb07&skip_itp2_check=true&pli=1#bookmark=id.acbz08dge98p

Edited Mar 06, 2020 by Jason Yavorsky