Improve docker timeouts
What does this MR do?
Tries to resolve the state of #2408 (closed).
Most of the errors that users see are related to the Docker API taking a long time to process requests. For example, any I/O-expensive operation can make Docker Engine too slow to respond in time. Increasing the timeouts gives Docker Engine more room.
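To illustrate the idea, here is a minimal sketch of raising the per-request timeout on a Docker Engine API call, assuming the Docker Go SDK (`github.com/docker/docker/client`); this is not the runner's actual wiring, and the 300-second value just mirrors the new default discussed below.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// pullWithTimeout pulls an image, bounding how long we wait for the Engine.
// A larger timeout tolerates I/O-heavy work on the daemon side.
func pullWithTimeout(image string, timeout time.Duration) error {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	rc, err := cli.ImagePull(ctx, image, types.ImagePullOptions{})
	if err != nil {
		return fmt.Errorf("pulling %s: %w", image, err)
	}
	defer rc.Close()

	// Drain the progress stream; the pull only completes once it is consumed.
	_, err = io.Copy(io.Discard, rc)
	return err
}

func main() {
	if err := pullWithTimeout("alpine:latest", 300*time.Second); err != nil {
		fmt.Println("pull failed:", err)
	}
}
```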
Does this MR meet the acceptance criteria?
- Documentation created/updated
- Tests
  - Added for this feature/bug
- All builds are passing
- Branch has no merge conflicts with master (if you do - rebase it please)
What are the relevant issue numbers?
cc @tmaczukin
Related to #2408 (closed)
Activity
mentioned in issue #2408 (closed)
@ayufan Should we maybe make this configurable from `config.toml`? I mean, raising the default values is IMO a good change, but in some environments it may still not be enough. With the possibility to configure these values from `config.toml`, every user could adjust the settings to match their environment, no matter how slow it is.

@tjurak Are large downloads affected by this? AFAIK, this is already handled OK.
@tjurak I would not set unreasonably big timeouts, as that prevents failing fast. I would rather have reasonable timeouts, where: 1. we make it clear that something timed out, and for how long, and 2. we retry timed-out operations wherever we can.

We don't yet retry all the operations that we could retry. I plan to fix that after this MR.
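For the retry part, a minimal sketch of the kind of behaviour discussed here (not the runner's actual retry code): retry an operation a few times with a growing delay, but give up immediately on errors that are not timeouts, so real problems still fail fast. The helper name `retryOnTimeout` is illustrative only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryOnTimeout re-runs op up to `attempts` times, backing off between
// tries, but only when the failure was a timeout.
func retryOnTimeout(ctx context.Context, attempts int, op func(context.Context) error) error {
	delay := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		err = op(ctx)
		if err == nil || !errors.Is(err, context.DeadlineExceeded) {
			// Success, or a non-timeout error that retrying will not fix.
			return err
		}
		// Timed out: wait a bit and try again with a longer back-off.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
	}
	return err
}

func main() {
	calls := 0
	err := retryOnTimeout(context.Background(), 3, func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return context.DeadlineExceeded // simulate two timeouts
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```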
Yes, but right now, when I try to download a relatively small image from Docker Hub over a slower internet connection, it takes about 3-5 minutes and I get this error every time:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock

When there is only a small delay and the image is already downloaded, it works every time. But sometimes it is necessary to download a new Docker image, and sometimes that takes a few minutes, so it would be nice to be able to set the timeouts manually if needed.
Developers are unable to "preload" the Docker images they will need onto the runner server, and GitLab handles pulling them nicely, except that the timeout is too short. Maybe some retry strategy would also be useful.
Regarding 'fail-fast': I understand that, but it applies more to small projects with small dependencies. We have large projects with huge dependencies that we have to test all together; fail-fast is not something we care about that much, since we need the tests to run safely, without false positives and false errors, as they usually run during the night.
So, to sum up: I really think you should come up with some reasonable timeouts and retries, but let the end user change these attributes in the runner so everyone can adjust them when needed (documented in the runner docs). There are a lot of teams like us; we know what we're doing, and it will be our decision to live with "longer" fail-fasts...

Right now the timeouts and false errors during builds are "deal breakers" that keep us from using the integrated CI (although I like it much more than Jenkins CI).
Edited by Tomas Jurak

I made this MR to improve error messages: !964 (merged), so we know what kind of failure it is.
A special timeout for pulling could be nice, but I'd prefer to be able to override the default settings when needed. In the changes I can see the new timeout is 300 s, but this can still lead to false errors, as some of our Docker images are not accessible over fast ethernet (for security reasons), so downloading them can take around 15 minutes :-)

I would really love to see where all these errors come from, so having !964 (merged) is going to help.
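As a rough picture of what "override the default settings from `config.toml`" could look like on the Go side, here is a hypothetical sketch: the `docker_api_timeout` key, its parsing, and the fallback logic are assumptions for illustration, not the runner's real configuration schema; only the 300-second default comes from this MR.

```go
package main

import (
	"fmt"
	"time"

	"github.com/BurntSushi/toml"
)

type dockerConfig struct {
	// Seconds to wait for Docker Engine API requests; 0 means "use the default".
	// The key name docker_api_timeout is hypothetical.
	APITimeoutSeconds int `toml:"docker_api_timeout"`
}

// apiTimeout falls back to the raised 300 s default when nothing is configured.
func (c dockerConfig) apiTimeout() time.Duration {
	if c.APITimeoutSeconds <= 0 {
		return 300 * time.Second
	}
	return time.Duration(c.APITimeoutSeconds) * time.Second
}

func main() {
	var cfg dockerConfig
	// A slow environment could override the default in config.toml, e.g.:
	//   docker_api_timeout = 900
	if _, err := toml.Decode("docker_api_timeout = 900", &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.apiTimeout()) // 15m0s
}
```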
assigned to @tmaczukin
- Resolved by Tomasz Maczukin
mentioned in commit 2aac6418
mentioned in issue #3391 (closed)