Papercuts while developing GitLab


A non-exhaustive list of small recurring issues that slow down development, increase cycle time, and prevent us from moving fast.

Context

I've seen a variety of small things in day-to-day work that slow everyone down. Because they are small hurdles, they may go unnoticed. By documenting them here we want to give them visibility, so engineering effort can be put into fixing the root cause (as well as each occurrence).

From Slack:

I’m seeing a large number of “small cuts” during the development phase.

As part of our CI pipelines, we have split a lot of work into different jobs and child pipelines. It sort of resembles the “death star architecture” (https://mrtortoise.github.io/architecture/lean/design/patterns/ddd/2018/03/18/deathstar-architecture.html).

Its main characteristic is that, because we have so many intricate dependencies, any failure at any point becomes a catastrophic failure. In our case, our builds are always failing, each time for a different reason.

I’ve seen all sorts of things: many that simple retry logic would fix by itself, but also many others that are just parts of the puzzle falling apart.

I believe we need to put some engineering into each failure category. As an example:

  1. A lot of builds fail because git fails (transient, but when you have 100 jobs there is a higher chance of it happening to one of them in your pipeline)
  2. Builds failing when downloading a dependency (GitLab is not super reliable, and sometimes we get an error code from the dependency proxy / API / mirrored repository / hosted package that prevents the build from continuing)
  3. Builds failing because some non-essential part of the build is not robust enough (we keep adding random “process as code” that is not very well written and not very well tested)
  4. Builds failing due to the CI worker failing to pull a container image (transient, but when you have 100 jobs it happens often)
  5. The build is successful but fails when trying to upload artifacts or metadata (with no retry)
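
To illustrate why a transient failure that is rare for a single job still hurts at this scale, here is a minimal sketch. It assumes job failures are independent, and the per-job failure rates are made-up illustrative numbers, not measurements from our pipelines. The function name is hypothetical.

```python
def pipeline_failure_probability(per_job_failure_rate: float, job_count: int) -> float:
    """Chance that at least one of `job_count` independent jobs fails."""
    # The pipeline succeeds only if every job succeeds, so take the complement.
    return 1.0 - (1.0 - per_job_failure_rate) ** job_count


if __name__ == "__main__":
    # Assumed, illustrative per-job failure rates (not measured data).
    for rate in (0.001, 0.01):
        chance = pipeline_failure_probability(rate, 100)
        print(f"per-job failure rate {rate:.1%}: pipeline failure chance {chance:.1%}")
```

Under these assumptions, a 1% transient failure rate per job means a 100-job pipeline fails roughly 63% of the time, which is why retries at the job level matter so much.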

I don’t have an easy solution to provide, but I may have insight into two problem categories:

  1. The CI worker should have some retry logic (with some backoff) when downloading the initial image, and that should be the default, not something we need to configure when deploying it
  2. As git doesn’t provide an easy “retry”, we need to implement our own retry logic in the CI worker (for the initial clone)
    • We should consider a default wrapper that does the same for any additional git operation we run as part of the job script or before/after hooks (perhaps we could implement this in glab and always use git through it)
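
The retry-with-backoff idea above could be sketched roughly like this. It is only an illustration, not how the CI worker or glab actually work: `run_with_retry` and all of its parameters are hypothetical names, and retrying on any non-zero exit code is a simplifying assumption (a real wrapper would want to tell transient network errors apart from permanent ones such as bad credentials).

```python
import random
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    command: Sequence[str],
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run `command`, retrying with exponential backoff on a non-zero exit code."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(list(command), check=False)
        # Stop on success, or hand back the last failure once attempts run out.
        if result.returncode == 0 or attempt == attempts:
            return result
        # The base delay doubles each attempt (1s, 2s, 4s, ...) and is capped
        # at max_delay; random jitter is then added on top so that parallel
        # jobs do not all retry at the same moment.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

A job script could then call something like `run_with_retry(["git", "fetch", "origin"])` instead of invoking git directly. The `run` and `sleep` parameters exist only so the behaviour can be tested without touching the network or actually waiting.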
