Cache unstable between stages

We have a master and a dev branch that we push to regularly. Our CI is set up so that `node_modules` is cached between the stages of the same branch, as the docs recommend.

One of our jobs relies on a module installed by `npm ci` in a setup job. Subsequent jobs do not always find the `node_modules` directory, even though the cache is always reported as extracted successfully.

See the following example:

(Screenshot of the pipeline, 2019-03-18.)

  1. First commit fails on master

The setup job logs:

```
Running with gitlab-runner 11.8.0 (4745a6f3)
    on b34672608ba1 8a6fcae6
Using Docker executor with image my.registry.com/my-npm-ci-docker-image ...
Pulling docker image my.registry.com/my-npm-ci-docker-image ...
Using docker image sha256:0a4c2bee112563b0e3e32dd04d627e8a012b913ef1c0baa9c3c9bab6f07812be for my.registry.com/my-npm-ci-docker-image ...
Running on runner-8a6fcae6-project-108-concurrent-1 via 3109e888ae36...
Cloning repository...
Cloning into '/builds/my-project'...
Checking out c6bc390a as master...
Skipping Git submodules setup
Checking cache for master...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
Git git version 2.11.0
Node v10.15.3a
Npm 6.4.1
$ npm ci

> fsevents@1.2.7 install /builds/my-project/node_modules/fsevents
> node install


> husky@1.3.1 install /builds/my-project/node_modules/husky
> node husky install

husky > setting up git hooks
CI detected, skipping Git hooks installation.
added 1326 packages in 18.252s
Git git version 2.11.0
Node v10.15.3a
Npm 6.4.1
Creating cache master...
node_modules/: found 19120 matching files
No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally.
Created cache
Job succeeded
```
The `release` job logs:
```
Running with gitlab-runner 11.8.0 (4745a6f3)
  on b34672608ba1 8a6fcae6
Using Docker executor with image my.registry.com/my-npm-ci-docker-image ...
Pulling docker image my.registry.com/my-npm-ci-docker-image ...
Using docker image sha256:7121107d53228c008d3f70a2f5facd2f1549b0c25a19321b181592c1c6271895 for my.registry.com/my-npm-ci-docker-image ...
Running on runner-8a6fcae6-project-108-concurrent-0 via 3109e888ae36...
Fetching changes...
Removing node_modules/
HEAD is now at 4c93117 chore: write more tests
Checking out c6bc390a as master...
Skipping Git submodules setup
Checking cache for master...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
Git git version 2.18.1
Node v10.15.2a
Npm 6.4.1
Docker Docker version 18.06.1-ce, build d72f525745
$ ./node_modules/semantic-release/bin/semantic-release.js
/bin/sh: eval: line 71: ./node_modules/semantic-release/bin/semantic-release.js: not found
Git git version 2.18.1
Node v10.15.2a
Npm 6.4.1
Docker Docker version 18.06.1-ce, build d72f525745
ERROR: Job failed: exit code 127
```
  2. Second commit (rebase dev on master) succeeds on dev
  3. Third commit succeeds on master. The only difference between the first and third commit is a blank line I added to the README to force GitLab to run a new pipeline, since I can't simply retry the failed job: it relies on a module the setup job installs into the cache.

Here is our CI configuration (`.gitlab-ci.yml`):

```yaml
cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/

stages:
  - setup
  - release

setup:
  stage: setup
  image: my-npm-ci-docker-image
  script:
    - npm ci

# build + test jobs

release:
  stage: release
  image: my-npm-ci-docker-image
  script:
    - ./node_modules/semantic-release/bin/semantic-release.js

# quality job
```
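One workaround we are considering (a sketch, not yet verified on our setup): pass `node_modules` between stages as an artifact instead of relying on the cache alone. Unlike caches, which are best-effort and stored locally per runner, artifacts are uploaded to the GitLab instance and downloaded by jobs in later stages. The `expire_in` value below is an arbitrary choice:

```yaml
setup:
  stage: setup
  image: my-npm-ci-docker-image
  script:
    - npm ci
  artifacts:
    # Artifacts are guaranteed to be available to later stages,
    # unlike the best-effort local cache.
    paths:
      - node_modules/
    expire_in: 1 hour  # arbitrary; tune to pipeline duration

release:
  stage: release
  image: my-npm-ci-docker-image
  dependencies:
    - setup
  script:
    - ./node_modules/semantic-release/bin/semantic-release.js
```

The trade-off is upload/download time for a large `node_modules` tree on every pipeline.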

Versions:

  • GitLab: 11.8.2 (self-hosted via Docker)

  • GitLab Runner: 11.8.0 (self-hosted via Docker)

Note that:

  • Other sources in the repository are not relevant.

  • I know I could use npx as a workaround.

  • The runner is using a custom Docker image that embeds npm, Git, and Node. The lines

     Git git version 2.11.0
     Node v10.15.3a
     Npm 6.4.1

    are written by this custom image. I am not sure why they are printed twice, since they are only printed in the container entry point. GitLab should log them only once; it's as if it were storing the entry point's output in a buffer and re-printing it at the end. Or it's running the container twice and passing an empty script the second time? Anyway, this is probably unrelated, as the job works most of the time.
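As a stopgap (hypothetical, not tested on our setup), the `release` job's script could also guard against a missing cache by reinstalling before invoking semantic-release:

```yaml
release:
  stage: release
  image: my-npm-ci-docker-image
  script:
    # Fall back to a fresh install if the cached node_modules
    # directory was not restored on this runner.
    - test -d node_modules || npm ci
    - ./node_modules/semantic-release/bin/semantic-release.js
```

This masks the underlying cache issue rather than fixing it, but it would keep releases from failing with exit code 127.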

Edited by Charlie Bravo