[bug] [repro-case] the gitlab.com CI cache incorrectly changes file permissions

Summary

The gitlab.com CI cache incorrectly changes file permissions between run. For example, some files are put in the cache with permission rw-rw-rw- (or r--r--r--), and after being restored for the next job their permission changed to rwxrwxrwx. This breaks the CI when using tools that check file permissions and may fail or behave in unexpected ways if a file unexpectedly becomes executable. (The dune build system is an example of such problem for which my workflow was broken, but it seems that many other Gitlab users have encountered the issue with similar programs.)

More precisely, the bug shows up on files that are the target of a symbolic link also included in the cache. (This sounds like the same underlying issue as #3354 (closed), but I have a precise bug with a simple reproduction case so I preferred to create a new issue.)

Steps to reproduce

I created a minimal reproduction case for this bug as a Gitlab.com repository:

https://gitlab.com/gasche/gitlab-ci-cache-permission-bug

The git repository contains a small text file a.ml, and a symbolic link link.ml -> a.ml. The .gitlab-ci.yml script has two job, one that creates a cache (policy: push) and copies a.ml and link.ml in the cache, another that reads from this cache cache (policy: pull). At the end of the first job, and at the beginning of the second job, stat is called on the cached files; the permissions for cache/a.ml have changed from rw-r--r-- to rwxrwxrwx, this is the bug.

.gitlab-ci.yml file:

stages:
  - create_cache
  - use_cache
  - keep_cache

test_create_cache:
  image: alpine:latest
  stage: create_cache
  cache:
    paths:
      - cache
    policy: push
  script:
    - "# This job populates the cache for the next job"
    - stat a.ml link.ml

    - mkdir -p cache
    - cp -a a.ml link.ml cache/

    - "# show the metadata (including permissions) of cache/"
    - stat cache/ cache/a.ml cache/link.ml

test_use_cache:
  image: alpine:latest
  stage: use_cache
  cache:
    paths:
      - cache
    policy: pull
  script:
    - "# This job restores the cache from the previous job."
    - "# show the metadata of cache/, should be the same as before."
    - stat cache/ cache/a.ml cache/link.ml

Actual behavior

Look at recent pipeline results in the repository -- currently the last pipeline was https://gitlab.com/gasche/gitlab-ci-cache-permission-bug/-/pipelines/246796107. In the log of the create_cache job we see the following output:

  File: cache/a.ml
  Size: 10        	Blocks: 16         IO Block: 4096   regular file
Device: 809h/2057d	Inode: 1571568     Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-26 05:49:53.000000000 +0000
Modify: 2021-01-26 05:49:53.000000000 +0000
Change: 2021-01-26 05:49:54.000000000 +0000
  File: 'cache/link.ml' -> 'a.ml'
  Size: 4         	Blocks: 8          IO Block: 4096   symbolic link
Device: 809h/2057d	Inode: 1571579     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-26 05:49:54.000000000 +0000
Modify: 2021-01-26 05:49:54.000000000 +0000
Change: 2021-01-26 05:49:54.000000000 +0000

In the log of the use_cache job we see the following output:

  File: cache/a.ml
  Size: 10        	Blocks: 16         IO Block: 4096   regular file
Device: 809h/2057d	Inode: 1571543     Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-26 05:50:09.000000000 +0000
Modify: 2021-01-26 05:49:53.000000000 +0000
Change: 2021-01-26 05:50:09.000000000 +0000
  File: 'cache/link.ml' -> 'a.ml'
  Size: 4         	Blocks: 8          IO Block: 4096   symbolic link
Device: 809h/2057d	Inode: 1571544     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-26 05:50:09.000000000 +0000
Modify: 2021-01-26 05:50:09.000000000 +0000
Change: 2021-01-26 05:50:09.000000000 +0000

The permission change for cache/a.ml from rw-rw-rw- to rwxrwxrwx was caused by Gitlab's logic to archive and then restore the cache, this is the bug.

Expected behavior

The permission of files put in the cache should be preserved.

Other reports that are probably the same issue

Many permission-related bugs have been reported against Gitlab and gitlab-runner, but some of them seem unrelated to the present one. (It is possible that the present bug was introduced by a faulty fix for a past issue). Two bug reports seem closely related:

  • #3354 (closed) , "chmod is used instead of lchmod", specifically discusses a permission change for files that are the target of a link, and points at a problematic piece of code, https://gitlab.com/gitlab-org/gitlab-runner/blob/master/helpers/archives/zip_extract.go#L98, which seems responsible for the bug. This bug was reported 2 years ago with reports of several users affected (trying to cache files produced by npm, or yarn, or bazel), some of them declaring the bug as critical for their workflow. It looks like progress on this issue has stalled by lack of a repro-case, so I have that the present one will have a happy ending.

  • gitlab#18711 (comment 257196029) is a comment (a year ago) in an issue about choosing the cache/artifact compression method. The comment reports on a permission bug caused by the move of Gitlab from tar.gz to zip. The whole issue touches unrelated topics, but this comment shows that the user @ejgallego encountered the same bug as I did (we are using the same dune build system and it is @ejgallego who pointed me in the right direction as I was trying to debug the issue; thanks!).

Possible fixes

In discussion of #3354 (closed), @danielmd3000 highlights the following piece of code as the likely cause of this issue:

https://gitlab.com/gitlab-org/gitlab-runner/blob/master/helpers/archives/zip_extract.go#L101

	for _, file := range archive.File {
		// Update file permissions
		if err := os.Chmod(file.Name, file.Mode().Perm()); tracker.actionable(err) {
			logrus.Warningf("%s: %s (suppressing repeats)", file.Name, err)
		}

		// Process zip metadata
		if err := processZipExtra(&file.FileHeader); tracker.actionable(err) {
			logrus.Warningf("%s: %s (suppressing repeats)", file.Name, err)
		}
	}

It sounds plausible that the bug is caused by the Chmod call following the symlink, and changing the permissions of the link target to the permissions of the link itself.

The reporter suggests to use a lchmod call instead (according to this freebsd manpage, lchmod is similar to chmod bug does not follow symlinks). In fact if I run git grep lchmod in the current gitlab-runner directory, I see similar-looking code in the vendored code of the fastzip dependency (no idea whether this is what the cache-extractor uses):

https://gitlab.com/gitlab-org/gitlab-runner/-/blob/master/vendor/github.com/saracen/fastzip/extractor.go#L294-296

(This dependency includes an implementation of lchmod in go, no idea whether it is correct.)

It looks like this issue is caused by the fact that Go's archive/zip provides an API to read a zipfile content, but does not provide an easy way to correctly extract an archive in a directory. Users have to traverse the archive themselves and creates files themselves, and most of them (including apparently Gitlab) get it wrong -- when calling the venerable unzip command instead would work perfectly well. (I suppose that some Go progammers have implemented a correct unzipper for zip archives, and it might be wise to reuse that instead of maintaining your own code. But in the meantime, fixing the gitlab-runner implementation would be very helpful to affected users.)

Finally: browsing the web points to snippets for zip-unarchiving, for examplehttps://golangcode.com/unzip-files-in-go/, several of which containing a specific protection against a so-called ZipSlip vulnerability, and it appears that Gitlab's implementation does not include such a protection.