WIP: Cache Cloned Repo via Docker

UPDATE 1: This MR has been closed and superseded by a new MR to do the caching via object store rather than docker.

UPDATE 2: The object-store-based caching was not performant enough, and had its own complexities. So, we are re-opening and re-visiting this approach.

UPDATE 3: Since the Docker-based approach has many complexities, and may still be slow due to the need to download a large image, we are instead investigating the approach of caching the cloned repo directly on the runners.

DESCRIPTION

For each job in the www-gitlab-com CI/CD build, the git repo clone currently takes between a minute and a half and two minutes, because the repository is very large and the clone consumes a lot of network transfer time.

If we cache the git clone in a docker image which is rebuilt on a regular basis (daily or hourly), so that each job only pulls the commits added since the image was built, this time could be greatly reduced.
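
For illustration, here is a minimal sketch of what a consuming job could look like, assuming a separate scheduled job has already published an image (here called $CI_REGISTRY_IMAGE/repo-cache:latest) containing a clone of master under /cached-repo. The job name, image name, cache path, and build command are all hypothetical, not settled decisions:

# Sketch only: image name, cache path, and build command are assumptions.
some-build-job:
  image: $CI_REGISTRY_IMAGE/repo-cache:latest
  variables:
    GIT_STRATEGY: none              # skip the runner's own full clone
  script:
    - cd /cached-repo
    # Only the commits added since the image was built need to come over the network.
    - git fetch origin "$CI_COMMIT_REF_NAME"
    - git checkout "$CI_COMMIT_SHA"
    - bundle exec rake build        # hypothetical stand-in for the real job steps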

Pros

  • This improvement would affect all jobs. If we do not make this change, every build will continue to pay the clone cost for as long as we use the current approach, and that cost will remain a significant portion of the minimum total job run time we can achieve.

  • Once we have this approach in place, we can also use it to cache the RubyGems (vendor dir) dependencies, and start using it for the NPM/Yarn (node_modules) dependencies (see the sketch after this list). We could then stop using the standard CI cache for these and save roughly another 30 seconds of cache push/pull time, because each job would be moving/zipping/unzipping much less data. See this MR about caching node_modules for some timings around the cache. That MR was abandoned because the extra CI cache push/pull time ate up all the gains, but doing the installs asynchronously as part of the docker image build would avoid that issue.
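
As a rough sketch of what the asynchronous image build might look like (job name, base image, and inline Dockerfile are all illustrative assumptions, not decisions), the dependency installs could be baked into the same image build:

# Hypothetical publish job; assumes a base image that already provides
# git, Ruby/Bundler, Node, and Yarn.
build-repo-cache-image:
  image: docker:19.03
  services:
    - docker:19.03-dind
  variables:
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    # The Dockerfile is generated inline only to keep this sketch self-contained;
    # a checked-in Dockerfile would work just as well.
    - |
      cat > Dockerfile.repo-cache <<'EOF'
      FROM my-build-base:latest
      WORKDIR /cached-repo
      RUN git clone --branch master https://gitlab.com/gitlab-com/www-gitlab-com.git . \
       && bundle install --path vendor \
       && yarn install --frozen-lockfile
      EOF
    - docker build -f Dockerfile.repo-cache -t "$CI_REGISTRY_IMAGE/repo-cache:latest" .
    - docker push "$CI_REGISTRY_IMAGE/repo-cache:latest"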

Cons

  • This approach results in more complex pipelines, with more moving parts to understand and potentially break.
  • There will be additional cost associated with the storage of the Container Registry images. It's currently not known whether this will be significant or negligible.

CURRENT CHALLENGES/QUESTIONS

Intermittent failures looking up docker registry host

(see references below)

This appeared to have been fixed by adding DOCKER_TLS_CERTDIR: "" to variables:, but then the error came back:

$ docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
error during connect: Post http://docker:2375/v1.40/auth: dial tcp: lookup docker on 169.254.169.254:53: no such host

UPDATE 1: I finally seem to have it working (for now) by using 18.09.7-dind, but I'm not sure which of the other changes I made are actually necessary. I have asked @tmaczukin to review, because he worked on this problem as part of this issue.
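
For reference, a rough sketch of the combination in place at the time of this update (using the hypothetical job name from the sketches above; which of these changes is actually necessary is still an open question):

build-repo-cache-image:
  image: docker:18.09.7
  services:
    - docker:18.09.7-dind      # pinned dind version per this update
  variables:
    DOCKER_TLS_CERTDIR: ""     # disable TLS so the "docker" host resolves on port 2375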

UPDATE 2: Nope, it's failing again: https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/419301929

UPDATE 3: Got an explanation for this. It's because the gitlab-org runner pool tag should not be used for Docker-in-Docker jobs. Explanation here: !37948 (comment 282491370)
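
If that explanation holds, the fix would presumably be to route the Docker-in-Docker jobs to a suitable runner pool via tags; the tag below is only a placeholder, not a confirmed value:

build-repo-cache-image:
  tags:
    - docker    # placeholder: whichever runner tag is appropriate for DinD jobs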

Image download caching

  • An initial spike on this approach was not successful, because the built image took three minutes to download from the container registry (the standard image takes only about 2 seconds). This should be avoidable by forcing the image to be cached on the runners.

Image root

  • Is it possible to have the repo root be at the root of the image, or does it need to be nested in a subdirectory?

Where and when the image will be built, and naming/versioning

  • Ideally, the image would be built on a regular basis by an independent pipeline or job, based off of the master branch, and used across all pipelines, both master and merge_request. It would not be built as part of the main master or merge_request pipelines, because we don't want to slow them down. This should be fine, because we should not expect the dependencies to change frequently.
  • Given the approach above, what should the naming/versioning be for the image? Should there only ever be a latest version, which is always overwritten?
  • What about unexpected breaking changes? If a bad image is built and published, and there is only one version, this will break all jobs until a good one is published. How can we mitigate this risk, or allow a rollback to a "last known good" version? (One possible tagging scheme is sketched after this list.)
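
One possible mitigation, sketched purely as an assumption: tag each published image with the commit SHA (or a date) in addition to latest, so jobs can be pinned back to a known-good tag after a bad publish. The snippet shows only the relevant part of the hypothetical publish job's script:

  script:
    - docker build -f Dockerfile.repo-cache -t "$CI_REGISTRY_IMAGE/repo-cache:latest" .
    # Also tag with the commit SHA so a "last known good" version can be pinned.
    - docker tag "$CI_REGISTRY_IMAGE/repo-cache:latest" "$CI_REGISTRY_IMAGE/repo-cache:$CI_COMMIT_SHORT_SHA"
    - docker push "$CI_REGISTRY_IMAGE/repo-cache:latest"
    - docker push "$CI_REGISTRY_IMAGE/repo-cache:$CI_COMMIT_SHORT_SHA"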

MAIN TASKS

  • Start with just an initial job which publishes the image.
    • Need to turn on container registry
  • Determine whether the repo root can be at the root of the image, or whether it needs to be nested in a subdirectory.
  • Had to disable TLS for the registry host to resolve properly; determine whether it is OK to leave it disabled.
  • Ensure the docker image is cached properly on the runners
  • Set up an independent job/pipeline to publish the image periodically and asynchronously of the normal master/merge_request pipelines (see the scheduling sketch after this list).
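
A minimal sketch of restricting the publish job to scheduled pipelines (again using the hypothetical job name from above), so that it never runs in, or slows down, the normal master/merge_request pipelines:

build-repo-cache-image:
  only:
    - schedules     # run only from a scheduled (e.g. daily or hourly) pipeline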

FOLLOW-ON TASKS

  • Add caching of the RubyGems vendor dir in the image, and remove it from individual pipeline jobs
  • Add caching of the NPM/Yarn node_modules in the image, and remove it from individual pipeline jobs

OTHER REFERENCES

Docker images

Docker-in-docker caching

TLS and docker registry host lookup failures

The following links are relevant to preventing lookup failures for the docker host (I worked around this by disabling TLS; I'm not sure that's a viable permanent solution, and it seems to break again intermittently):

Specifying clone path

More powerful runners


Closes #6358
