Deploy token issues during the 11.2.0 RC1 deployment

README FIRST

This issue was created to identify the causes that led to the problems described below. No individual or group will need to take responsibility for them. We are all working together to avoid repeating the same mistakes.

Circumstance description

On 2018-08-02, during the 11.2.0 RC1 deploy, deploy tokens stopped working for all users on GitLab.com. The increased number of errors shown in monitoring was observed only by the person on call, while the deploy was still ongoing.

The production incident is described in https://gitlab.com/gitlab-com/production/issues/385

As part of the production incident investigation, two errors were shown in Sentry:

  • https://gitlab.com/gitlab-org/gitlab-ce/issues/49904
  • https://gitlab.com/gitlab-org/gitlab-ee/issues/7080

Impact

Anyone using deploy tokens without an expiry date set was unable to clone a repository (manually or through CI) or pull a registry image on GitLab.com.

Immediate corrective actions

Post-deployment patches were applied:

  • https://dev.gitlab.org/gitlab/post-deployment-patches/merge_requests/88
  • https://dev.gitlab.org/gitlab/post-deployment-patches/merge_requests/89

Application fixes

Fixes were introduced with:

  • https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20992
  • https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20993

Origin of the problem

It appears that the issue was caused by two unrelated changes.

Use Deploy Tokens to clone LFS repositories

Change introduced in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20729 to resolve https://gitlab.com/gitlab-org/gitlab-ce/issues/46869 .

Before this change, repositories that contain LFS objects could not be cloned using Deploy Tokens.

Users table getting updated frequently

Change introduced in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20597 to resolve https://gitlab.com/gitlab-org/gitlab-ce/issues/43312 .

The user activity worker ran on a schedule and updated all records at the same time. At the time the problem was reported, this meant updating more than 100k rows in the users table just to show last user activity.

Corrective actions going forward

  • GitLab QA to add deploy tokens coverage gitlab-org/gitlab-qa#308 (moved)
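
As a rough illustration of the kind of cases such coverage could exercise, the sketch below models a deploy token with an optional expiry date and lists scenarios that include the "no expiry date" case named under Impact. The `DeployToken` class, `is_active`, and `coverage_cases` are hypothetical names invented for this example; they are not GitLab's model or the GitLab QA API, and treating a token as valid on its expiry day is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Optional, Tuple


@dataclass
class DeployToken:
    """Hypothetical stand-in for a deploy token; not GitLab's actual model."""
    name: str
    expires_at: Optional[date] = None  # None means no expiry date was set
    revoked: bool = False


def is_active(token: DeployToken, today: date) -> bool:
    """Return True if the token may be used to clone or pull an image."""
    if token.revoked:
        return False
    # A token without an expiry date is treated as never expiring.
    if token.expires_at is None:
        return True
    # Assumption: the token is still valid on its expiry day.
    return token.expires_at >= today


def coverage_cases(today: date) -> Iterator[Tuple[str, DeployToken, bool]]:
    """Yield (description, token, expected result) triples for a test run."""
    next_year = date(today.year + 1, 1, 1)
    last_year = date(today.year - 1, 1, 1)
    yield "no expiry date", DeployToken("no-expiry"), True
    yield "expires in the future", DeployToken("future", next_year), True
    yield "already expired", DeployToken("expired", last_year), False
    yield "revoked", DeployToken("revoked", revoked=True), False


if __name__ == "__main__":
    today = date(2018, 8, 2)
    for description, token, expected in coverage_cases(today):
        assert is_active(token, today) == expected, description
```

An end-to-end scenario would run the same cases against a real clone (manually and through CI) and a registry pull rather than an in-memory check.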
Edited Aug 03, 2018 by Marin Jankovski