repository shows as "replica not yet created" following temporary misconfiguration during 16.0 upgrade
Support Request for the Gitaly Team
The goal is to keep these requests public. However, if customer information is required for the support request, please be sure to mark this issue as confidential.
This request template is part of Gitaly Team's intake process.
Customer Information
Salesforce Link:
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/425386
Installation Size:
Architecture Information: Gitaly cluster
Slack Channel:
Additional Information:
Support Request
Severity
This is a non-production environment, and this is the only reason it's not severity::1.
The production upgrade is now on hold, and the matter is being reviewed on a daily basis by customer management.
This issue manifests as data loss: the state of the Praefect database means the repository cannot be accessed by GitLab.
Problem Description
Gitaly Cluster 16.0 is no longer serving a git repository that we know existed at 15.11 and that, from troubleshooting, we know still exists on disk on the Gitaly servers.
Troubleshooting Performed
- Customer upgraded 15.11.8 to version 16.0.5
- One of their projects ceased to have a git repo. It's a test environment and we don't know how many repos are affected; this one happens to be in regular use.
- When GitLab requests the project, the RPC return code indicates it's not present.
- Project activity shows git activity from prior to the upgrade (from events in the Rails database).
- The requirement to manually add "repositories" to the git data path was temporarily missed from the upgrade instructions. As a result, we see in the logs that Gitaly was looking in git-data/@cluster/repositories and not git-data/repositories/@cluster/repositories:

      "error": "GetRepoPath: not a git repository: \"/data/opt/gitlab/git-data/@cluster/repositories/d6/ed/7931\"",

- This was on the day of the upgrade, and we can see successful RPCs (DeleteRefs, FindCommit) the day before. See comments in the ticket, 2023-07-25.
- After the upgrade, there are no more requests in the Gitaly server logs for this project. We don't have all the Praefect logs owing to log rotation.
- On a call today, we checked the Praefect metadata:

      Repository ID: 7931
      Virtual Storage: "default"
      Relative Path: "@hashed/fb/4a/fb4ae734cf6bcfd5440336eb1e80cdc0b1b035bb4454a832701fc4a10f923c73.git"
      Replica Path: "@cluster/repositories/d6/ed/7931"
      Primary: "gitaly2"
      Generation: 123
      Replicas:
      - Storage: "gitaly1"
        Assigned: true
        Generation: replica not yet created
        Healthy: true
        Valid Primary: false
        Verified At: unverified
      - Storage: "gitaly2"
        Assigned: true
        Generation: replica not yet created
        Healthy: true
        Valid Primary: false
        Verified At: unverified
      - Storage: "gitaly3"
        Assigned: true
        Generation: replica not yet created
        Healthy: true
        Valid Primary: false
        Verified At: unverified

- We then checked on disk in the correct location, /[...]/repositories/@cluster/repositories/, and observed it to have many subdirectories with a range of date stamps pre-dating and post-dating the upgrade.
- On all three Gitaly nodes, the specified path @cluster/repositories/d6/ed/7931 exists and contains a git repository.
- Running git ls-remote . on one of the nodes printed out refs including main and keep-around. The SHA for main matched the most recent update in the events data from the Rails database.
- Conclusion: the repositories exist; the problem is in the Praefect metadata.
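The metadata check above could be repeated across several repositories with a small script. The following is a minimal sketch, assuming the metadata output is laid out one field per line with each replica introduced by "- Storage:"; the function names parse_metadata and replicas_not_created are hypothetical helpers, not part of Praefect. It flags the inconsistency seen here: the repository records a generation, but every replica reports "replica not yet created".

```python
# Sketch only: parses text in the shape of the metadata shown in this issue.
# The one-field-per-line layout is an assumption about the original output.

NOT_CREATED = "replica not yet created"

SAMPLE = """\
Repository ID: 7931
Virtual Storage: "default"
Replica Path: "@cluster/repositories/d6/ed/7931"
Primary: "gitaly2"
Generation: 123
Replicas:
- Storage: "gitaly1"
  Assigned: true
  Generation: replica not yet created
- Storage: "gitaly2"
  Assigned: true
  Generation: replica not yet created
- Storage: "gitaly3"
  Assigned: true
  Generation: replica not yet created
"""


def parse_metadata(text):
    """Return repository fields as a dict, with a 'replicas' list of dicts."""
    repo = {"replicas": []}
    current = repo  # fields go to the repository until the first replica starts
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "Replicas:":
            continue
        if line.startswith("- "):
            # A new replica entry begins; later fields belong to it.
            current = {}
            repo["replicas"].append(current)
            line = line[2:]
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"')
    return repo


def replicas_not_created(repo):
    """List storages whose replica generation is 'replica not yet created'."""
    return [
        r.get("Storage")
        for r in repo["replicas"]
        if r.get("Generation") == NOT_CREATED
    ]


if __name__ == "__main__":
    repo = parse_metadata(SAMPLE)
    missing = replicas_not_created(repo)
    if missing and len(missing) == len(repo["replicas"]):
        print(f"Repository {repo['Repository ID']}: no replica has a generation")
```

A repository where every assigned storage is listed by replicas_not_created, while the directory exists on disk, matches the symptom described in this issue.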
What specifically do you need from the Gitaly team
Please assist us in identifying any functionality that could result in this behaviour, for example: Default enable invalid metadata deletion in Pra... (!5321 - merged)
We're also going to need to scope which projects have been affected and identify a way to reinstate the metadata.
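As a starting point for scoping, the on-disk half of the check could be scripted on each Gitaly node. This is a minimal sketch, not a Gitaly or Praefect tool: it assumes a list of replica paths has already been obtained by some other means, and that a repository directory can be recognised by the presence of HEAD, objects/ and refs/. The function names are hypothetical. It distinguishes the expected location (under repositories/) from the location Gitaly was searching during the misconfiguration.

```python
from pathlib import Path


def looks_like_git_repo(path):
    """Heuristic: a bare repository has a HEAD file plus objects/ and refs/."""
    return (
        (path / "HEAD").is_file()
        and (path / "objects").is_dir()
        and (path / "refs").is_dir()
    )


def locate_replica(git_data_dir, replica_path):
    """Classify where a replica path is found under the git data directory.

    'expected'  : <git_data_dir>/repositories/<replica_path>
    'misplaced' : <git_data_dir>/<replica_path> (the path searched while
                  the storage path was misconfigured)
    'missing'   : neither location holds a repository
    """
    root = Path(git_data_dir)
    if looks_like_git_repo(root / "repositories" / replica_path):
        return "expected"
    if looks_like_git_repo(root / replica_path):
        return "misplaced"
    return "missing"


def scope(git_data_dir, replica_paths):
    """Map each replica path to its classification on this node."""
    return {p: locate_replica(git_data_dir, p) for p in replica_paths}


if __name__ == "__main__":
    # The git data directory below is taken from the log line in this issue;
    # the list of replica paths is illustrative input.
    result = scope(
        "/data/opt/gitlab/git-data",
        ["@cluster/repositories/d6/ed/7931"],
    )
    for replica_path, state in result.items():
        print(replica_path, state)
```

Repositories that are "expected" on disk but whose Praefect metadata shows no created replica would be candidates for the affected set; anything "misplaced" would suggest repositories were created while the wrong path was in effect.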
Author Checklist
- Customer information provided
- Severity realistically set
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section
/cc @mjwood @andrashorvath @jcaigitlab @john.mcdonnell @gerardo