Customer intermittently gets "Error creating pipeline" after a push and the pipeline fails to start

Support Request for the Gitaly Team

Customer Information

Salesforce Link: https://gitlab.my.salesforce.com/0016100001TzRV4AAN

Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/274459

Architecture Information:

  • Redis Sentinel replication configured.
  • Sidekiq runs on 4 separate VMs. Every node runs the same configuration with 8 queue groups and 3 threads each.
  • Gitaly cluster with 3 Gitaly and 3 Praefect nodes.
  • GitLab version: 14.5.3. Upgrading to 14.9.1 didn't resolve the issue.

Additional Information: https://gitlab.com/gitlab-com/support/fieldnotes/-/issues/133

Support Request

Severity

At the moment the customer has a workaround that is known to the whole team: manually restart the pipeline. I would say that severity is S3.

Problem Description

Some push events on the customer's instance do not result in the creation of a pipeline. This affects many projects and users on the instance. The errors they see in the Sidekiq logs are Reference not found or Commit not found from Git::BranchHooksService. For example:

08.04.2022 10:51:01.722 01G042FD39E11KAPF1Q15C8X5X prod m1-devops-prod-gitlab-sidekiq-2 WARN gitlab-sidekiq devplatform 1588499055012 m1 1588499055-1-2-1879775650 Git::BranchHooksService Error creating pipeline 0 Commit not found 6824b701201f236da61c70598317560cffddc633 0 6824b701201f236da61c70598317560cffddc633 refs/heads/e0dd01ad-8338-47f9-b22a-09b87a71c0e9 [] 12403 svc_devops-infra/gitlab-exporter-probe-repo-prod-6 651 m1 2022-04-08T07:51:05.8211349Z 18 m1 monster
08.04.2022 11:07:48.305 01G043E440ZM0RFV27SXE6M8WP prod m1-devops-prod-gitlab-sidekiq-2 WARN gitlab-sidekiq devplatform 1588633800012 m1 1588633800-1-2-1822989861 Git::BranchHooksService Error creating pipeline 0 Reference not found 07af468e47b9c41e235172406b7c1a3c8f129f96 0 07af468e47b9c41e235172406b7c1a3c8f129f96 refs/heads/213e9dca-15e5-4f63-af44-29d0535b684b [] 12403 svc_devops-infra/gitlab-exporter-probe-repo-prod-6 654 m1 2022-04-08T08:07:49.1031256Z 18 m1 monster
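
To compare how often each of the two reasons occurs, the warnings can be tallied with a small script. This is only a sketch: the helper names are ours (hypothetical, not part of GitLab), and the field layout is inferred from the two sample lines above rather than from a documented log format.

```ruby
# Hypothetical helper, not part of GitLab. The field layout is inferred from
# the two sample log lines above, not from a documented log format.
PIPELINE_ERROR = /
  Git::BranchHooksService\s+Error\ creating\ pipeline\s+\d+\s+
  (?<reason>(?:Commit|Reference)\ not\ found)\s+
  (?<sha>\h{40})\b
  .*?
  (?<ref>refs\/\S+)
/x

# Trimmed copies of the two log lines quoted above.
SAMPLE_LINES = [
  "Git::BranchHooksService Error creating pipeline 0 Commit not found " \
  "6824b701201f236da61c70598317560cffddc633 0 6824b701201f236da61c70598317560cffddc633 " \
  "refs/heads/e0dd01ad-8338-47f9-b22a-09b87a71c0e9 []",
  "Git::BranchHooksService Error creating pipeline 0 Reference not found " \
  "07af468e47b9c41e235172406b7c1a3c8f129f96 0 07af468e47b9c41e235172406b7c1a3c8f129f96 " \
  "refs/heads/213e9dca-15e5-4f63-af44-29d0535b684b []"
].freeze

# Returns { reason:, sha:, ref: } for a matching line, or nil otherwise.
def parse_pipeline_error(line)
  m = PIPELINE_ERROR.match(line)
  return nil unless m

  { reason: m[:reason], sha: m[:sha], ref: m[:ref] }
end

# Counts matching lines per failure reason; non-matching lines are ignored.
def tally_reasons(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    parsed = parse_pipeline_error(line)
    counts[parsed[:reason]] += 1 if parsed
  end
end

tally_reasons(SAMPLE_LINES).each { |reason, count| puts "#{reason}: #{count}" }
```

Run against a full Sidekiq log export, the same tally would show whether one of the two reasons dominates.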

As we understand it, a pipeline should be created after each push event by the PostReceive Sidekiq job via BaseHooksService#create_pipelines, which validates branches/tags and commits in lib/gitlab/ci/pipeline/chain/validate/repository.rb. We suspect that this validation intermittently fails, so the pipeline is not created.

In the logs we analyzed with the customer, the PostReceive job starts, but pipeline creation fails right after a /gitaly.RefService/FindAllBranchNames or /gitaly.CommitService/FindCommit Gitaly call that itself reports no error. Example (correlation_id: 01G183MGB9VG2JFD6PJ022MA7P):

  1. rails-1 - workhorse access (timestamp: 43:56.000)
  2. sidekiq-2 - PostReceive start - by POST /api/:version/internal/post_receive (enqueued at: 43:56.793)
  3. gitaly-2 (grpc_request_repoStorage: gitaly-2) - /gitaly.RepositoryService/HasLocalBranches (43:56.828)
  4. praefect-1 - /gitaly.RepositoryService/HasLocalBranches (43:56.832089331Z)
  5. sidekiq-2 - ProjectCacheWorker start - by PostReceive (enqueued at: 43:56.834)
  6. gitaly-1 (*grpc_request_repoStorage: gitaly-3) - /gitaly.CommitService/ListCommits (43:56.865)
  7. praefect-2 - /gitaly.CommitService/ListCommits (43:56.87554)
  8. gitaly-1 (*grpc_request_repoStorage: gitaly-3) - /gitaly.RepositoryService/RepositorySize (43:56.883)
  9. praefect-1 - /gitaly.RepositoryService/RepositorySize (43:56.888281317Z)
  10. sidekiq-1 - Namespaces::ScheduleAggregationWorker start - by ProjectCacheWorker (enqueued at: 43:56.907)
  11. sidekiq-2 - ProjectCacheWorker done - by PostReceive (timestamp: 43:56.913)
  12. sidekiq-1 - Namespaces::ScheduleAggregationWorker done - by ProjectCacheWorker (timestamp: 43:56.927)
  13. gitaly-1 - (*grpc_request_repoStorage: gitaly-3) /gitaly.CommitService/FindCommit (43:56.950)
  14. praefect-2 - /gitaly.CommitService/FindCommit (43:56.954700910Z)
  15. gitaly-2 - (grpc_request_repoStorage: gitaly-2) /gitaly.RefService/FindAllBranchNames (43:56.986)
  16. praefect-2 - /gitaly.RefService/FindAllBranchNames (43:56.992107054Z)
  17. sidekiq-2 - Git::BranchHooksService - Error creating pipeline - Reference not found (43:57.000)

  18. gitaly-2 - (grpc_request_repoStorage: gitaly-2) /gitaly.DiffService/CommitDelta (43:57.020)
  19. praefect-1 - /gitaly.DiffService/CommitDelta (43:57.033082051Z)
  20. gitaly-2 - (grpc_request_repoStorage: gitaly-2) /gitaly.CommitService/FilterShasWithSignatures (43:57.058)
  21. praefect-2 - /gitaly.CommitService/FilterShasWithSignatures (43:57.059468795Z)
  22. sidekiq-2 - WebHookWorker start - by PostReceive (enqueued at: 43:57.071)
  23. sidekiq-2 - PostReceive done - by POST /api/:version/internal/post_receive (timestamp: 43:57.081)
  24. sidekiq-2 - WebHookWorker done - by PostReceive (timestamp: 43:57.198)

Note: gitaly-3 is the primary node for the svc_devops-infra/gitlab-exporter-probe-repo-prod-6 project that we used for log dive.
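
To make the pattern in the timeline easier to see, the Gitaly-side calls can be filtered by the storage that served them. The sketch below transcribes the grpc_request_repoStorage values and timestamps from the timeline above; the struct and helper names are hypothetical, and reading grpc_request_repoStorage as "the storage that answered the call" is our assumption.

```ruby
# Gitaly calls transcribed from the timeline above (correlation ID
# 01G183MGB9VG2JFD6PJ022MA7P). `storage` is the grpc_request_repoStorage value;
# treating it as the storage that served the call is an assumption.
GitalyCall = Struct.new(:storage, :rpc, :at)

CALLS = [
  GitalyCall.new("gitaly-2", "RepositoryService/HasLocalBranches",     "43:56.828"),
  GitalyCall.new("gitaly-3", "CommitService/ListCommits",              "43:56.865"),
  GitalyCall.new("gitaly-3", "RepositoryService/RepositorySize",       "43:56.883"),
  GitalyCall.new("gitaly-3", "CommitService/FindCommit",               "43:56.950"),
  GitalyCall.new("gitaly-2", "RefService/FindAllBranchNames",          "43:56.986"),
  GitalyCall.new("gitaly-2", "DiffService/CommitDelta",                "43:57.020"),
  GitalyCall.new("gitaly-2", "CommitService/FilterShasWithSignatures", "43:57.058")
].freeze

PRIMARY  = "gitaly-3"   # primary for the probe project, per the note above
ERROR_AT = "43:57.000"  # "Reference not found" logged by sidekiq-2

# Converts "MM:SS.mmm" into milliseconds so timestamps compare numerically.
def to_ms(stamp)
  minutes, seconds = stamp.split(":")
  minutes.to_i * 60_000 + (seconds.to_f * 1000).round
end

# Calls served by a storage other than the primary, strictly before the cutoff.
def non_primary_reads_before(calls, primary, cutoff)
  calls.select { |c| c.storage != primary && to_ms(c.at) < to_ms(cutoff) }
end

suspects = non_primary_reads_before(CALLS, PRIMARY, ERROR_AT)
suspects.each { |c| puts "#{c.at} #{c.storage} #{c.rpc}" }

last_before_error = suspects.max_by { |c| to_ms(c.at) }
puts "last non-primary read before the error: #{last_before_error.rpc}"
```

With the data above, the last non-primary read before the error is RefService/FindAllBranchNames on gitaly-2, 14 ms before the Reference not found entry.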

At the same time, the Gitaly logs contain additional activity with a different correlation ID (01G183MDQ8VQ6GYFKY88V2W9MV) that runs the /gitaly.HookService/PreReceiveHook and /gitaly.HookService/PostReceiveHook Gitaly hooks.

The /gitaly.RefService/FindAllBranchNames call (from 01G183MGB9VG2JFD6PJ022MA7P) that preceded the Reference not found log entry ran on gitaly-2. At that point, gitaly-2 had not yet run /gitaly.HookService/UpdateHook (from 01G183MDQ8VQ6GYFKY88V2W9MV). Could this be the problem?
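
The race we suspect can be illustrated with a toy model. This is not Gitaly or Praefect code and makes no claim about how reads are actually routed; ToyReplica and try_create_pipeline are hypothetical names, and the model only assumes that a read can reach a node that has not yet applied the ref update.

```ruby
# Toy model only: NOT Gitaly or Praefect code. It assumes a read can be served
# by a node that has not applied the pushed ref update yet.
class ToyReplica
  attr_reader :name

  def initialize(name)
    @name = name
    @refs = {}
  end

  # Applies a ref update on this node.
  def update_ref(ref, sha)
    @refs[ref] = sha
  end

  # Returns the SHA the ref points to on this node, or nil if unknown here.
  def find_ref(ref)
    @refs[ref]
  end
end

# Mimics the validation outcome: no ref on the node that served the read
# means no pipeline.
def try_create_pipeline(replica, ref)
  sha = replica.find_ref(ref)
  return "Error creating pipeline: Reference not found" if sha.nil?

  "pipeline created for #{sha}"
end

# Sample values taken from the second log line quoted earlier in this issue.
ref = "refs/heads/213e9dca-15e5-4f63-af44-29d0535b684b"
sha = "07af468e47b9c41e235172406b7c1a3c8f129f96"

primary   = ToyReplica.new("gitaly-3")
secondary = ToyReplica.new("gitaly-2")

primary.update_ref(ref, sha)                 # the push has landed on the primary
puts try_create_pipeline(secondary, ref)     # read hits the node that is behind
secondary.update_ref(ref, sha)               # the update arrives a moment later
puts try_create_pipeline(secondary, ref)     # a retry now finds the ref
```

In this model the first attempt fails and the retry succeeds, which would be consistent with the manual restart workaround, but it does not prove that this is what happens on the customer's cluster.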

Troubleshooting Performed

  • Rugged disabled:

    irb(main):001:0> Feature.enabled?(:rugged_commit_is_ancestor)
    => false
    irb(main):002:0> Feature.enabled?(:rugged_commit_tree_entry)
    => false
    irb(main):003:0> Feature.enabled?(:rugged_list_commits_by_oid)
    => false
    irb(main):004:0> Feature.enabled?(:rugged_tree_entry)
    => false
    irb(main):005:0> Feature.enabled?(:rugged_get_tree_entries)
    => false
    irb(main):006:0> Feature.enabled?(:rugged_find_commit)
    => false
  • After receiving complaints from instance users, the customer set up a project, svc_devops-infra/gitlab-exporter-probe-repo-prod-6, to probe for this issue periodically. The project is regularly cloned, then a branch is created and receives a commit, which is pushed back to origin and merged. This probe sometimes fails because no pipeline is created. To rule out load issues, we went through the failed probes: failures also occur during late evening and night hours, when the overall load on the GitLab service is low.

  • Manually restarting the pipeline or pushing a new commit usually works, and the pipeline executes.

  • Decreasing the repository size doesn't seem to help; however, the customer confirmed that the issue occurs less often on repositories with fewer commits.
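
The load check on the failed probes can be repeated by bucketing the failure log lines by hour of day. A minimal sketch, assuming every line starts with the DD.MM.YYYY HH:MM:SS timestamp seen in the Sidekiq excerpts in this issue; the helper names and the peak-hours range are hypothetical and would need to match the customer's real usage pattern.

```ruby
# Hypothetical helpers. They assume each log line starts with a
# "DD.MM.YYYY HH:MM:SS" timestamp, as in the Sidekiq excerpts in this issue.
LEADING_HOUR = /\A\d{2}\.\d{2}\.\d{4} (\d{2}):\d{2}:\d{2}/

# Counts failures per hour of day; lines without a leading timestamp are skipped.
def failures_by_hour(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    hour = line[LEADING_HOUR, 1]
    counts[hour.to_i] += 1 if hour
  end
end

# Keeps only the hours outside the given peak range.
# The peak range itself is an assumption supplied by the caller.
def off_peak_failures(counts, peak_hours)
  counts.reject { |hour, _| peak_hours.cover?(hour) }
end

# Prefixes of the two log lines quoted in the problem description.
SAMPLE_FAILURES = [
  "08.04.2022 10:51:01.722 01G042FD39E11KAPF1Q15C8X5X prod m1-devops-prod-gitlab-sidekiq-2 WARN",
  "08.04.2022 11:07:48.305 01G043E440ZM0RFV27SXE6M8WP prod m1-devops-prod-gitlab-sidekiq-2 WARN"
].freeze

counts = failures_by_hour(SAMPLE_FAILURES)
counts.sort.each { |hour, n| puts format("%02d:00  %d", hour, n) }
puts "off-peak hours with failures: #{off_peak_failures(counts, 8..19).keys.sort}"
```

The two sample lines only cover daytime hours; the full set of failed probes would be needed to reproduce the evening and night observation.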

What specifically do you need from the Gitaly team

Help pinpointing a potential race condition.

Author Checklist

  • Customer information provided
  • Severity realistically set
  • Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section

/cc @mjwood @andrashorvath

Edited by Kate Grechishkina