WIP: Cache Cloned Repo via Docker
UPDATE 1: This MR has been closed and superseded by a new MR to do the caching via object store rather than docker
UPDATE 2: The object-store-based caching was not performant enough, and had its own complexities. So, we are re-opening and re-visiting this approach.
UPDATE 3: Since the Docker-based approach has many complexities, and may still be slow due to the need to download a large image, we are instead investigating the approach of caching the cloned repo directly on the runners.
DESCRIPTION
For each job in the www-gitlab-com CI/CD build, the git repo clone currently takes between a minute and a half and two minutes, because the repo is very large and the clone spends most of that time on network transfer.
If we cache the git clone on a docker image which is built on a regular basis (daily or hourly), and only pull new commits since the last image was built, this time could be greatly reduced.
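As a rough sketch of how a job might consume such an image (the image name, the `/cached-repo` path, and the job name are hypothetical; nothing here has been decided in this MR):

```yaml
# Sketch only: image name, path, and job name are hypothetical.
some-job:
  image: registry.gitlab.com/gitlab-com/www-gitlab-com/repo-cache:latest
  variables:
    GIT_STRATEGY: none                        # skip the runner's own clone/fetch
  script:
    - cd /cached-repo                         # clone baked into the image at build time
    - git fetch origin "$CI_COMMIT_REF_NAME"  # fetch only the commits added since the image was built
    - git checkout "$CI_COMMIT_SHA"
    # ...the job's real work runs against the now up-to-date checkout
```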
Pros
- This improvement would affect all jobs. If we do not make this change, the clone time will be incurred by every build for as long as we use the current approach, and it will remain a significant portion of the minimum total job run time we can achieve.
- Once we have this approach in place, we can also switch to using it for caching the RubyGems (`vendor` dir) dependencies, and start using it for NPM/Yarn (`node_modules`) dependencies. Then, we could avoid using the standard CI caching for these, and save another ~30 seconds or so of cache push/pull time, because much less data would need to be moved/zipped/unzipped. See this MR about caching `node_modules` for some timings around the cache. That MR was abandoned because the extra CI cache push/pull time ate up all the gains, but doing the installs asynchronously as part of the docker image build would avoid that issue.
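As a sketch of the consumer side once the dependencies are baked into the image (job name, image, paths, and commands are hypothetical), the per-job `cache:` blocks could go away entirely:

```yaml
# Sketch only: job name, image, paths, and commands are hypothetical.
build-site:
  image: registry.gitlab.com/gitlab-com/www-gitlab-com/repo-cache:latest
  variables:
    BUNDLE_PATH: vendor/bundle                  # assumes gems were installed here at image build time
  script:
    - bundle check || bundle install --jobs 4   # near no-op when the image is current
    - yarn install --frozen-lockfile            # fast because node_modules is prepopulated
    - bundle exec rake build
  # note: no `cache:` keyword, so no per-job cache push/pull time
```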
Cons
- This approach results in more complex pipelines, with more moving parts to understand and potentially break.
- There will be additional cost associated with the storage of the Container Registry images. It's currently not known whether this will be significant or negligible.
CURRENT CHALLENGES/QUESTIONS
Intermittent failures looking up docker registry host
(see references below)
This appeared to have been fixed by adding `DOCKER_TLS_CERTDIR: ""` to `variables:` (see the config sketch after the updates below), but then it came back:
```
$ docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
error during connect: Post http://docker:2375/v1.40/auth: dial tcp: lookup docker on 169.254.169.254:53: no such host
```
UPDATE 1: I finally seem to have it working (for now) by using `18.09.7-dind`, but I'm not sure which of the other changes I made are actually necessary. I have asked @tmaczukin to review, because he worked on this problem as part of this issue.
UPDATE 2: Nope, it's failing again: https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/419301929
UPDATE 3: Got an explanation for this. It's because the gitlab-org runner pool tag should not be used for Docker-in-Docker jobs. Explanation here: !37948 (comment 282491370)
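For reference, a minimal sketch of the kind of dind job configuration being discussed here (the docker versions and job name are illustrative, not the exact ones used in this MR):

```yaml
# Sketch only: versions and job name are illustrative.
build-repo-cache-image:
  image: docker:19.03
  services:
    - docker:19.03-dind
  variables:
    DOCKER_TLS_CERTDIR: ""            # disable TLS so dind listens on plain tcp://docker:2375
    DOCKER_HOST: tcp://docker:2375    # point the docker CLI at the dind service
  script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
```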
Image download caching
- An initial spike on this approach was not successful, because the built image took three minutes to download from the container registry (the standard image only takes about 2 seconds). This should be avoidable by forcing the image to be cached on the runners.
Image root
- Is it possible to have the repo root be at the root of the image, or does it need to be nested in a subdirectory?
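One possibly relevant knob here (see also the "Specifying clone path" reference below): `GIT_CLONE_PATH` can steer where the runner checks out the repo, so the cached clone would not necessarily need to live at the image root as long as the paths line up. A minimal sketch, assuming the runner has custom build directories enabled:

```yaml
# Sketch only: assumes the runner config enables custom build directories
# (custom_build_dir); GIT_CLONE_PATH must live under CI_BUILDS_DIR.
variables:
  GIT_CLONE_PATH: $CI_BUILDS_DIR/www-gitlab-com
```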
Where and when the image will be built, and naming/versioning
- Ideally, the image would be built on a regular basis by an independent pipeline or job, based on the `master` branch, and used across all pipelines, both `master` and `merge_request`. It would not be built as part of the main `master` or `merge_request` pipelines, because we don't want to slow them down. This should be fine, because we don't expect the dependencies to change frequently.
- Given the approach above, what should the naming/versioning be for the image? Should there only ever be a `latest` version, which is always overwritten?
- What about unexpected breaking changes? If a bad image is built and published, and there is only one version, this will break all jobs until a good one is published. How can we mitigate this risk, or allow a rollback to a "last known good" version?
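A rough sketch of what such a scheduled publish job could look like, keeping a per-pipeline tag alongside `latest` so there is always a "last known good" image to roll back to (job name, image path, and the dind setup from the sketch above are assumptions, not settled decisions):

```yaml
# Sketch only: builds on the dind configuration sketched earlier; names and paths are assumptions.
publish-repo-cache-image:
  only:
    - schedules                       # run from a scheduled pipeline, not the master/merge_request pipelines
  script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/repo-cache:$CI_PIPELINE_ID" .
    - docker tag "$CI_REGISTRY_IMAGE/repo-cache:$CI_PIPELINE_ID" "$CI_REGISTRY_IMAGE/repo-cache:latest"
    # keeping the pipeline-ID tag means a bad `latest` can be rolled back to a previous good tag
    - docker push "$CI_REGISTRY_IMAGE/repo-cache:$CI_PIPELINE_ID"
    - docker push "$CI_REGISTRY_IMAGE/repo-cache:latest"
```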
MAIN TASKS
- Start with just an initial job which publishes the image.
  - Need to turn on the container registry
- Is it possible to have the repo root be at the root of the image, or does it need to be nested in a subdirectory?
- Had to disable TLS for the registry to resolve properly; determine if this is OK to leave disabled.
- Ensure the docker image is cached properly on the runners.
- Set up an independent job/pipeline to publish the image periodically and asynchronously of the normal master/merge_request pipelines.
FOLLOW-ON TASKS
- Add caching of the RubyGems `vendor` dir in the image, and remove it from individual pipeline jobs
- Add caching of the NPM/Yarn `node_modules` in the image, and remove it from individual pipeline jobs
OTHER REFERENCES
Docker images
- Docker images repo: https://gitlab.com/gitlab-org/gitlab-build-images/blob/master/Dockerfile.www-gitlab-com-2.6
Docker-in-docker caching
- https://docs.gitlab.com/ee/ci/docker/using_docker_build.html
- gitlab-org/gitlab-foss#17861 (closed)
- Docker Artifact caching MVC - gitlab-org/gitlab-runner#1107 (closed)
TLS and docker registry host lookup failures
The following links are relevant to preventing lookup failures for the docker host (I fixed this by disabling TLS; I'm not sure that's a viable permanent solution, and it still seems to break intermittently):
- gitlab-org/gitlab-runner#4566 (closed)
- gitlab-org/gitlab-runner#4501 (comment 195033385)
- gitlab-org/gitlab-runner#3984 (closed)
- https://about.gitlab.com/releases/2019/07/31/docker-in-docker-with-docker-19-dot-03/
- https://gitlab.com/help/user/packages/container_registry/index
Specifying clone path
More powerful runners
- Related MR which discusses runners and their specs: #6357 (closed)
Closes #6358 (closed)