Customer intermittently gets "Error creating pipeline" after a push and the pipeline fails to start
Support Request for the Gitaly Team
Customer Information
Salesforce Link: https://gitlab.my.salesforce.com/0016100001TzRV4AAN
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/274459
Architecture Information:
- Redis Sentinel replication configured.
- Sidekiq runs on 4 separate VMs. Every node runs the same configuration: 8 queue groups with 3 threads each (an illustrative configuration sketch follows this list).
- Gitaly cluster with 3 Gitaly and 3 Praefect nodes.
- GitLab Version: 14.5.3; upgrading to 14.9.1 didn't resolve the issue.
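For context, the Sidekiq layout above roughly corresponds to an Omnibus configuration along these lines on each of the 4 VMs. This is an illustrative sketch only: it assumes the standard `sidekiq['queue_groups']` / concurrency settings for running multiple Sidekiq processes, and the customer's actual queue selectors were not shared.

```ruby
# /etc/gitlab/gitlab.rb -- illustrative sketch, not the customer's actual file.

# 8 Sidekiq processes ("queue groups") per node; '*' means "listen on all queues"
# (the customer's real queue selectors are unknown).
sidekiq['queue_groups'] = ['*'] * 8

# 3 worker threads per process, matching "3 threads each" above.
sidekiq['min_concurrency'] = 3
sidekiq['max_concurrency'] = 3
```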
Additional Information: https://gitlab.com/gitlab-com/support/fieldnotes/-/issues/133
Support Request
Severity
At the moment the customer has a workaround that is known to the whole team: manually restart the pipeline. I would say that severity is S3.
Problem Description
Some push events on the customer's instance do not result in the creation of a pipeline. This affects many projects and users on the instance. The errors they see in Sidekiq logs are Reference not found or Commit not found from Git::BranchHooksService. For example:
| 08.04.2022 10:51:01.722 | 01G042FD39E11KAPF1Q15C8X5X | prod | m1-devops-prod-gitlab-sidekiq-2 | WARN | gitlab-sidekiq | devplatform | 1588499055012 | m1 | 1588499055-1-2-1879775650 | Git::BranchHooksService | Error creating pipeline | 0 | Commit not found | 6824b701201f236da61c70598317560cffddc633 | 0 | 6824b701201f236da61c70598317560cffddc633 | refs/heads/e0dd01ad-8338-47f9-b22a-09b87a71c0e9 | [] | 12403 | svc_devops-infra/gitlab-exporter-probe-repo-prod-6 | 651 | m1 | 2022-04-08T07:51:05.8211349Z | 18 | m1 | monster |
| 08.04.2022 11:07:48.305 | 01G043E440ZM0RFV27SXE6M8WP | prod | m1-devops-prod-gitlab-sidekiq-2 | WARN | gitlab-sidekiq | devplatform | 1588633800012 | m1 | 1588633800-1-2-1822989861 | Git::BranchHooksService | Error creating pipeline | 0 | Reference not found | 07af468e47b9c41e235172406b7c1a3c8f129f96 | 0 | 07af468e47b9c41e235172406b7c1a3c8f129f96 | refs/heads/213e9dca-15e5-4f63-af44-29d0535b684b | [] | 12403 | svc_devops-infra/gitlab-exporter-probe-repo-prod-6 | 654 | m1 | 2022-04-08T08:07:49.1031256Z | 18 | m1 | monster |
Pipelines should be created after each push event by the PostReceive Sidekiq job via BaseHooksService#create_pipelines, which performs branch/tag and commit validation in lib/gitlab/ci/pipeline/chain/validate/repository.rb. We suspect that this validation intermittently fails and the pipeline is never created. A simplified sketch of these checks is shown below.
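For reference, here is a simplified, standalone sketch of the checks performed by that validation step. It is not the upstream code: `branch_exists`, `tag_exists`, and `sha` stand in for the real command-object lookups, which resolve through Gitaly (FindAllBranchNames / FindCommit).

```ruby
# Simplified sketch of lib/gitlab/ci/pipeline/chain/validate/repository.rb
# (not the upstream code); the parameters stand in for the command-object
# lookups that go through Gitaly.
def validate_pipeline_source(branch_exists:, tag_exists:, sha:)
  # "Reference not found": the pushed ref cannot be resolved on the Gitaly node
  # that answered the read.
  return 'Reference not found' unless branch_exists || tag_exists

  # "Commit not found": the pushed SHA cannot be resolved.
  return 'Commit not found' if sha.nil?

  :ok
end

# Example matching the failing log entries above:
validate_pipeline_source(branch_exists: false, tag_exists: false, sha: nil)
# => "Reference not found"
```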
Analyzing the logs with the customer, we see that the PostReceive job starts but fails right after a /gitaly.RefService/FindAllBranchNames or /gitaly.CommitService/FindCommit Gitaly call, without logging any errors. Example (correlation_id: 01G183MGB9VG2JFD6PJ022MA7P):
- rails-1 - workhorse access (timestamp: 43:56.000)
- sidekiq-2 - PostReceive start - by POST /api/:version/internal/post_receive (enqueued at: 43:56.793)
- gitaly-2 (grpc_request_repoStorage: gitaly-2) - /gitaly.RepositoryService/HasLocalBranches (43:56.828)
- praefect-1 - /gitaly.RepositoryService/HasLocalBranches (43:56.832089331Z)
- sidekiq-2 - ProjectCacheWorker start - by PostReceive (enqueued at: 43:56.834)
- gitaly-1 (*grpc_request_repoStorage: gitaly-3) - /gitaly.CommitService/ListCommits (43:56.865)
- praefect-2 - /gitaly.CommitService/ListCommits (43:56.87554)
- gitaly-1 (*grpc_request_repoStorage: gitaly-3) - /gitaly.RepositoryService/RepositorySize (43:56.883)
- praefect-1 - /gitaly.RepositoryService/RepositorySize (43:56.888281317Z)
- sidekiq-1 - Namespaces::ScheduleAggregationWorker start - by ProjectCacheWorker (enqueued at: 43:56.907)
- sidekiq-2 - ProjectCacheWorker done - by PostReceive (timestamp: 43:56.913)
- sidekiq-1 - Namespaces::ScheduleAggregationWorker done - by ProjectCacheWorker (timestamp: 43:56.927)
- gitaly-1 (*grpc_request_repoStorage: gitaly-3) - /gitaly.CommitService/FindCommit (43:56.950)
- praefect-2 - /gitaly.CommitService/FindCommit (43:56.954700910Z)
- gitaly-2 (grpc_request_repoStorage: gitaly-2) - /gitaly.RefService/FindAllBranchNames (43:56.986)
- praefect-2 - /gitaly.RefService/FindAllBranchNames (43:56.992107054Z)
- sidekiq-2 - Git::BranchHooksService - Error creating pipeline - Reference not found (43:57.000)
- gitaly-2 (grpc_request_repoStorage: gitaly-2) - /gitaly.DiffService/CommitDelta (43:57.020)
- praefect-1 - /gitaly.DiffService/CommitDelta (43:57.033082051Z)
- gitaly-2 (grpc_request_repoStorage: gitaly-2) - /gitaly.CommitService/FilterShasWithSignatures (43:57.058)
- praefect-2 - /gitaly.CommitService/FilterShasWithSignatures (43:57.059468795Z)
- sidekiq-2 - WebHookWorker start - by PostReceive (enqueued at: 43:57.071)
- sidekiq-2 - PostReceive done - by POST /api/:version/internal/post_receive (timestamp: 43:57.081)
- sidekiq-2 - WebHookWorker done - by PostReceive (timestamp: 43:57.198)
Note: gitaly-3 is the primary node for the svc_devops-infra/gitlab-exporter-probe-repo-prod-6 project that we used for the log dive.
Gitaly logs also contain additional activity around the same time under a different correlation ID (01G183MDQ8VQ6GYFKY88V2W9MV), which runs the /gitaly.HookService/PreReceiveHook and /gitaly.HookService/PostReceiveHook Gitaly hooks.
The /gitaly.RefService/FindAllBranchNames call (from 01G183MGB9VG2JFD6PJ022MA7P) that preceded the Reference not found log ran on gitaly-2, and gitaly-2 had not yet run /gitaly.HookService/UpdateHook (from 01G183MDQ8VQ6GYFKY88V2W9MV) at that point. Could this be a problem?
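One way we could try to confirm whether the new ref is simply not yet visible on the node answering the read would be a Rails console check right after a failing probe. These are hypothetical commands; the branch name and SHA below are copied from the first log example above.

```ruby
# Hypothetical Rails console check after a failing probe; branch and SHA are
# taken from the failing Sidekiq log entry above.
project = Project.find_by_full_path('svc_devops-infra/gitlab-exporter-probe-repo-prod-6')

branch = 'e0dd01ad-8338-47f9-b22a-09b87a71c0e9'
sha    = '6824b701201f236da61c70598317560cffddc633'

# Both lookups ultimately resolve against Gitaly (Rugged is disabled on this
# instance), though branch_exists? may be served from the repository cache.
project.repository.branch_exists?(branch) # false would match "Reference not found"
project.commit(sha)                       # nil would match "Commit not found"
```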
Troubleshooting Performed
- Rugged disabled:

  irb(main):001:0> Feature.enabled?(:rugged_commit_is_ancestor)
  => false
  irb(main):002:0> Feature.enabled?(:rugged_commit_tree_entry)
  => false
  irb(main):003:0> Feature.enabled?(:rugged_list_commits_by_oid)
  => false
  irb(main):004:0> Feature.enabled?(:rugged_tree_entry)
  => false
  irb(main):005:0> Feature.enabled?(:rugged_get_tree_entries)
  => false
  irb(main):006:0> Feature.enabled?(:rugged_find_commit)
  => false

- After receiving complaints from instance users, the customer set up a project, svc_devops-infra/gitlab-exporter-probe-repo-prod-6, to collect periodic probes of this issue. The project is regularly cloned, a branch is created, a commit is added and pushed back to origin, and the branch is merged (a hypothetical sketch of such a probe is shown after this list). These probes sometimes fail and pipelines are not created. To rule out load issues, we went through the failed probes: they also occur during late evening and night hours, when the overall load on the GitLab service is low.
- Manually restarting the pipeline or pushing a new commit usually works, and the pipeline executes.
- Decreasing repository size doesn't seem to help; however, the customer confirmed that the issue occurs less often on repositories with fewer commits.
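A hypothetical sketch of such a probe, for illustration only; the customer's actual script was not shared, so the remote URL, file name, and merge step are assumptions.

```ruby
#!/usr/bin/env ruby
# Hypothetical reconstruction of the probe described above; the real script was
# not shared, so the remote URL, file name, and merge step are assumptions.
require 'securerandom'
require 'time'
require 'tmpdir'

REPO_URL = 'git@gitlab.example.com:svc_devops-infra/gitlab-exporter-probe-repo-prod-6.git'

Dir.mktmpdir do |dir|
  branch = SecureRandom.uuid # matches the UUID-style branch names seen in the logs

  system('git', 'clone', REPO_URL, dir, exception: true)

  Dir.chdir(dir) do
    system('git', 'checkout', '-b', branch, exception: true)
    File.write('probe.txt', Time.now.utc.iso8601)
    system('git', 'add', 'probe.txt', exception: true)
    system('git', 'commit', '-m', "probe #{branch}", exception: true)
    system('git', 'push', 'origin', branch, exception: true)
    # The probe then checks (for example via the pipelines API) whether a
    # pipeline was created for the pushed commit, and merges the branch back.
  end
end
```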
What specifically do you need from the Gitaly team
Help pinpoint the suspected race condition between post-receive hook processing and the Gitaly reads performed during pipeline creation (see the log analysis above).
Author Checklist
- Customer information provided
- Severity realistically set
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section
