Customer Request - Clarity and Guidance on Backups.
Support Request for the Gitaly Team
Customer has some concerns regarding server-side backups:
- They run server-side backups. A run backed up 99% of their repositories and took 2 days.
- However, the entire backup failed due to missing objects in some of the repositories.
- They are working on this in a different ticket
- This resulted in the backup file not being generated.
- For this one, I believe it’s expected that we don’t generate a backup file for failed backups, right?
Author Checklist
- Reached out to #g_gitaly prior to creating issue (please provide link)
- Fill out customer information section
- Provide a detailed summary under Additional Information
- Severity realistically set
- Provided detailed problem description
- Provided detailed troubleshooting performed
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the "What specifically do you need from the Gitaly team" section
Customer Information
Salesforce Link: https://gitlab.my.salesforce.com/00161000013aRjGAAU
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/502958
Installation Size: 12k active users
Architecture Information: Uses Gitaly cluster
Slack Channel:
Additional Information:
Support Request
Customer is requesting a call with Gitaly developers to discuss server-side backups and their limitations. Some of the questions they have (I did answer some of them):
High Level goals
- Recover from partial to total loss of repositories
- Recover from loss of database and repositories
At a high-level, what is needed to recover from these scenarios?
- If restoration necessitates wiping the database, then what is the solution?
- If there are no database and repo backups, we are starting with a fresh instance. Can repositories be restored into a fresh instance (assuming same versions)?
Meta
- What scenarios are implied by backup and restoration docs? Just repo loss, or total loss?
- What special concerns exist when it comes to backup and restoration for large architectures?
- Concerns over us backing up only repositories?
- excluded
- db: backups done by separate team*
- uploads, artifacts, LFS: already using object storage, so seems redundant
- builds: would you clarify this? I've been under the impression that job trace falls under artifacts.
- pages: not using
- terraform_state: not backed up, though not for any particular reason
- registry: not backed up, because not using registry
- packages: not determined yet
- ci_secure_files: not determined yet
- included
- repositories: to object storage provider
- Concerns related to database.
- As gitlab-backup does not support partial restoration from server-side backups, what to do? The backup tool requires the wiping of an existing database, which is fine in the context of recovering from a total catastrophe ... but what of partial catastrophes?
Answered but Would Like to Know What's Planned
- Does GitLab plan to support pruning of server-side backups of repos?
- Interruptions aren't handled in any special way, as we understand it. Some improved fault tolerance could allow incremental / resumed backups
- Will there be support for backup chunking in the future? Context is that it takes us ~2 days to completely back up 5 TB.
Lingering Questions/Concerns
- Are partial restores from server-side backups supported?
- The docs indicate that "To restore a backup, you must also restore the GitLab secrets". Does this apply to the case where only server-side backups of repos are being done?
- In which version of GitLab will it no longer be necessary to provide a backupID to perform an incremental backup?
- Why are backups so fragile?
- Why isn't it possible to perform an incremental backup against a largely successful full backup? For example: 99% of 5 TB backed up, but gitlab-backup removes the tar file that encodes the backupID when it encounters some corrupted repos.
- Why isn't it possible to restore repositories made during a prior version of GitLab? My understanding is that this is to do with changes in database schema between versions. With that said, why not accommodate this?
As mentioned above, I did manage to answer some of their questions, but they have a couple of follow-up questions:
I meant to refer to a scenario where everything is lost except for repository backups. As well, the assumption is that repos were backed up using the gitlab-backup tool.
They're asking if it's possible to restore a repository-only backup without the database. I believe this would be difficult but sort of possible.
Lastly, they want a clear answer about the viable backup strategy for their architecture.
Severity
Low severity since the customer is mostly testing backup and restore.
Problem Description
- What version is the customer running?
- What is the customer's architecture?
- What is the GitLab architecture?
- Are networking filesystems (like NFS) used?
- What are the filesystems?
- What are the OS and kernel versions?
- How are backup, replication, HA, etc performed?
- Are they using Gitaly Cluster?
- How many Gitaly Clusters does the customer have?
- How many Gitaly nodes per cluster has the customer configured?
- Has the customer, or some tools/script (backup, synchro, replication, HA, etc) they set up, directly interacted with the Git repository?
- using rsync or similar tools?
- using git commands?
- history changing tools (like git filter-repo)?
- Does the customer have any hooks configured?
- If this is a performance issue, what does the Git workflow look like?
- What is the customer's RPS for pushes and pulls? (use fast-stats)
- How many pipelines does the customer run?
- How many users are working on the instance?
- How big are the repositories? Do they have monorepos?
- Provide the output of git-sizer.
Troubleshooting Performed
What specifically do you need from the Gitaly team
Customer is requesting a call with the Gitaly developers to answer some of their questions and concerns. They also have some feedback regarding the backup and restore process, which we can use to improve it.
/cc @mjwood @andrashorvath @jcaigitlab @john.mcdonnell @gerardo