Skip to content
GitLab
Next
    • Why GitLab
    • Pricing
    • Contact Sales
    • Explore
  • Why GitLab
  • Pricing
  • Contact Sales
  • Explore
  • Sign in
  • Get free trial
  • GitLab.org GitLab.org
  • GitLab GitLab
  • Issues
  • #326066

"ref not found" after new branch pushed

Issue

Customer in internal ZD link has been plagued with intermittent problems associated with newly-pushed branches. We have confirmed that it is not caused by slow sidekiq jobs. Periodically they end up with Merge Requests that claim "source branch not found" despite all other indications on the system showing that the source branch was successfully pushed and processed. The users have found that if they push another commit on the same branch, or push another branch to the same repo, it will usually resolve the problem. We have identified a rails console script that will find existing MRs with this symptom and (if run in this version) it will recover for those existing examples by clearing the branches cache:

MergeRequest.find_each(start: 660000, finish: 666300, batch_size: 50) do |mr|
  if !mr.source_branch_exists? and mr.open?
    puts "#{mr.project.full_path}, #{mr.iid}, #{mr.source_branch}"

    mr.project.repository.expire_branches_cache
    check_service = MergeRequests::MergeabilityCheckService.new(mr)
    check_service.execute(recheck: true, retry_lease: false)

    puts "#{mr.project.full_path}!#{mr.iid} refreshed"
  end
end

We identified that we can reliably find repositories that have this condition from this error in the Git::BranchHooksService sidekiq job, which is kicked off as part of the PostReceive job when a new branch is received:

{"severity":"WARN","time":"2021-03-24T15:53:16.778Z","class":"Git::BranchHooksService","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","project_id":43212,"project_path":"sre/repo_name","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e","ref":"refs/heads/branch/path/name","variables_attributes":[],"checkout_sha":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e"},"retry":0}

Using the "Reference not found error" in Git::BranchHooksService we were able to identify at least one repository where the users were not using merge requests but the repository otherwise showed the same behavior as one identified by searching from the MR perspective: namely, that you can look at the repository in the GitLab UI, see the name of the "missing" branch in the branches list, but then get a 404 when trying to click through to that branch. They have not had users reporting this issue because users are unlikely to poke at the UI to view a branch they just pushed, unless they're creating a merge request.

We believe we've narrowed down where the problem happens to these few lines in the PostReceive code

    expire_caches(post_received, project.repository)
    enqueue_project_cache_update(post_received, project)
    process_ref_changes(project, user, push_options: push_options, changes: changes)
    update_remote_mirrors(post_received, project)
    after_project_changes_hooks(project, user, changes.refs, changes.repository_data)

Clearly the first thing we do here is expire the caches, then enqueue_project_cache_update is just a job that updates the cached project size, so is irrelevant to this situation. process_ref_changes is the thing that eventually runs the Git::BranchHooksService job where we first see the missing ref behavior reflected in logs.

The customer has 4 front-end GitLab nodes with repository data mounted via NFS. We have exhaustively double-checked possible inconsistencies across nodes due to NFS timing issues, possibly looking at different redis nodes (they have 3 redis nodes, with sentinel; there have been no redis failovers). GitLab version is 13.7.5 confirmed on all nodes, and redis is 5.0.9. We also confirmed that whenever they identified "bad" repositories with the rails query (without the cache expiration step) they saw the same list of repositories across all GitLab nodes, which of course just checks that all nodes are seeing the same redis cache results.

We were then able to identify one push where we saw all the activity associated with that push executed on the same GitLab node, which would seem to rule out any intra-node inconsistency:

First, the push comes in via http, as seen in production_json.log:

{"method":"GET","path":"/sre/repo_name.git/info/refs","format":"*/*","controller":"Repositories::GitHttpController","action":"info_refs","status":200,"time":"2021-03-24T15:53:15.650Z","params":[{"key":"service","value":"git-receive-pack"},{"key":"repository_path","value":"sre/repo_name.git"}],"remote_ip":"10.254.100.6","user_id":3950,"username":"username","ua":"git/2.29.2","correlation_id":"01F1JF18REK4P6W35KS3AM89TF","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"Repositories::GitHttpController#info_refs","meta.feature_category":"source_code_management","redis_calls":2,"redis_duration_s":0.000811,"redis_read_bytes":312,"redis_write_bytes":98,"redis_cache_calls":2,"redis_cache_duration_s":0.000811,"redis_cache_read_bytes":312,"redis_cache_write_bytes":98,"db_count":19,"db_write_count":0,"db_cached_count":1,"cpu_s":0.2,"db_duration_s":0.00981,"view_duration_s":0.0002,"duration_s":0.36388}

We see the associated gitaly activity in the gitaly log:

{"correlation_id":"01F1JF18REK4P6W35KS3AM89TF","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-workhorse","grpc.meta.deadline_type":"none","grpc.method":"InfoRefsReceivePack","grpc.request.fullMethod":"/gitaly.SmartHTTPService/InfoRefsReceivePack","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.SmartHTTPService","grpc.start_time":"2021-03-24T15:53:15Z","grpc.time_ms":100.718,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:15.752Z"}

Next we see the sidekiq PostReceive job, with associated Git::BranchHooksService:

{"severity":"INFO","time":"2021-03-24T15:53:16.668Z","class":"PostReceive","args":["project-43212","user-3950","0000000000000000000000000000000000000000 451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e refs/heads/branch/path/name\n","{}"],"retry":3,"queue":"post_receive","version":0,"jid":"db03919cc4addcec56315078","created_at":"2021-03-24T15:53:16.662Z","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"/api/:version/internal/post_receive","meta.feature_category":"source_code_management","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","enqueued_at":"2021-03-24T15:53:16.666Z","pid":43962,"message":"PostReceive JID-db03919cc4addcec56315078: start","job_status":"start","scheduling_latency_s":0.001787} 

{"severity":"WARN","time":"2021-03-24T15:53:16.778Z","class":"Git::BranchHooksService","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","project_id":43212,"project_path":"sre/repo_name","message":"Error creating pipeline","errors":"Reference not found","pipeline_params":{"before":"0000000000000000000000000000000000000000","after":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e","ref":"refs/heads/branch/path/name","variables_attributes":[],"checkout_sha":"451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e"},"retry":0} 

{"severity":"INFO","time":"2021-03-24T15:53:16.825Z","class":"PostReceive","args":["project-43212","user-3950","0000000000000000000000000000000000000000 451379f2d0c16ec53cfdd5b00bdcbe7cb1476b7e refs/heads/branch/path/name\n","{}"],"retry":3,"queue":"post_receive","version":0,"jid":"db03919cc4addcec56315078","created_at":"2021-03-24T15:53:16.662Z","meta.user":"username","meta.project":"sre/repo_name","meta.root_namespace":"sre","meta.caller_id":"/api/:version/internal/post_receive","meta.feature_category":"source_code_management","correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","enqueued_at":"2021-03-24T15:53:16.666Z","pid":43962,"message":"PostReceive JID-db03919cc4addcec56315078: done: 0.15706 sec","job_status":"done","scheduling_latency_s":0.001787,"gitaly_calls":5,"gitaly_duration_s":0.055844,"rugged_calls":1,"rugged_duration_s":0.003038,"redis_calls":42,"redis_duration_s":0.017637,"redis_read_bytes":813,"redis_write_bytes":11267,"redis_cache_calls":25,"redis_cache_duration_s":0.011773,"redis_cache_read_bytes":511,"redis_cache_write_bytes":2554,"redis_queues_calls":6,"redis_queues_duration_s":0.002228,"redis_queues_read_bytes":234,"redis_queues_write_bytes":8124,"redis_shared_state_calls":11,"redis_shared_state_duration_s":0.003636,"redis_shared_state_read_bytes":68,"redis_shared_state_write_bytes":589,"db_count":33,"db_write_count":6,"db_cached_count":0,"duration_s":0.15706,"cpu_s":0.055241,"completed_at":"2021-03-24T15:53:16.825Z","db_duration_s":0.023526}

and the gitaly activity associated with the PostReceive job, including a FindAllBranchNames call that gives no errors:

{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"CommitsBetween","grpc.request.deadline":"2021-03-24T15:54:16Z","grpc.request.fullMethod":"/gitaly.CommitService/CommitsBetween","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.CommitService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":25.292,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.713Z"}

{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FindAllTagNames","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.RefService/FindAllTagNames","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.RefService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":3.806,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.762Z"}

{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FindAllBranchNames","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.RefService/FindAllBranchNames","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.RefService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":3.777,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.776Z"}

{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"CommitDelta","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.DiffService/CommitDelta","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.DiffService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":16.981,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.797Z"}

{"correlation_id":"01F1JF1A39E48FMB383M8HC4AZ","grpc.code":"OK","grpc.meta.auth_version":"v2","grpc.meta.client_name":"gitlab-sidekiq","grpc.meta.deadline_type":"unknown","grpc.method":"FilterShasWithSignatures","grpc.request.deadline":"2021-03-24T15:53:46Z","grpc.request.fullMethod":"/gitaly.CommitService/FilterShasWithSignatures","grpc.request.glProjectPath":"sre/repo_name","grpc.request.glRepository":"project-43212","grpc.request.repoPath":"@hashed/b8/e0/b8e018a79fd712b5c0864804a42459d4c93e53bc1c8fb67642a32a86c6f117a1.git","grpc.request.repoStorage":"default","grpc.request.topLevelGroup":"@hashed","grpc.service":"gitaly.CommitService","grpc.start_time":"2021-03-24T15:53:16Z","grpc.time_ms":0.569,"level":"info","msg":"finished streaming call with code OK","peer.address":"@","pid":43853,"span.kind":"server","system":"grpc","time":"2021-03-24T15:53:16.814Z"}

We have most recently been attempting to reproduce and/or get more information to see if git for-each-ref is reporting any error without returning a failure status, by doing repeated branch pushes to a repo and calling out git for-each-ref calls in loops. That effort has so far proved unfruitful.

Current state

We have a few different areas of investigation open currently:

NFS/Gitaly

See @zj-gitlab's comments below

Race conditions in the cache

@nick.thomas has identified two possible race conditions in the set cache adapter:

  1. https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/repository_cache_adapter.rb#L67
  2. https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/repository_set_cache.rb#L39

These should be updated to use atomic Redis methods, though whether they're the cause of the issue or not we don't currently know.

Nick's comment: #326066 (comment 539076174)

Merge request for this is workflowin review : !57607 (merged)

Backup plan: work around the issue in the PostReceive worker

!57598 (closed)

This is an attempt to keep forcibly clearing the cache inside the PostReceive worker until it includes the branches we're looking for. This isn't a "solution" but it is something we want to have ready in case we just need to work around the issue for now.

Edited Mar 26, 2021 by Nick Thomas
Assignee
Assign to
Time tracking