backup.rake will need to be updated so that when a backup is created, the backup ID is fetched from gitaly and saved into the backup tar. This backup ID will be used on restore in order to tell gitaly to restore a specific backup.
@sranasinghe @juan-silva since we already have this issue, let's use it to ask any questions about the integration between Gitaly's new backup arch and the rake task.
To answer the question during our call, it looks like we already are planning on saving the backup ID into the tarball so that it can be used to do the restore.
A couple of questions on my mind:

- Will we just write the backup_id into a file?
- At what point do we write the backup_id? Do we wait for all the repositories to finish backing up, or return the backup_id immediately and allow the backups to happen in the background?
- Will the backup rake task be able to take advantage of backups that are done periodically by the background worker? For example, could we add something to the backup rake task to pass in a backup_id to do the restore with?
Happy to continue the conversation here. Here are some initial thoughts on your questions. But I will also let the @geo-team engineers weigh in.
> Will we just write the backup_id into a file?
If you are storing the rest of the metadata relevant to the backup in the DB, then the ID in a file would suffice, I think. As I mentioned, it would be helpful for a human to be able to browse the object storage and use the ID to find the corresponding repository backup files.
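If we go the file route, the rake-task side could be as small as this Ruby sketch. The `gitaly_backup_id` filename and the helper names are hypothetical, purely for illustration; nothing here is existing GitLab code:

```ruby
require "tmpdir"

# Hypothetical sketch: persist the Gitaly backup ID as a plain one-line file
# in the backup staging directory, so an operator can trivially grep for it
# and match it against objects in the backup bucket.
def write_backup_id(backup_dir, backup_id)
  path = File.join(backup_dir, "gitaly_backup_id")
  File.write(path, "#{backup_id}\n")
  path
end

def read_backup_id(backup_dir)
  File.read(File.join(backup_dir, "gitaly_backup_id")).strip
end

# Demo with a throwaway directory standing in for the backup staging dir.
dir = Dir.mktmpdir
write_backup_id(dir, "1714651200")
puts read_backup_id(dir) # => "1714651200"
```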
> At what point do we write the backup_id? Do we wait for all the repositories to finish backing up, or return the backup_id immediately and allow the backups to happen in the background?
For expediency of generating the backup, I would think it is better to return the ID as soon as one is known. However, in that case it would be good to have a way to relate the backup ID to a log entry somewhere else. That way an operator can find the log in case something went wrong in the async backup process. Furthermore, we may need an alert of sorts if the backup does not succeed.
Having the backup be synchronous would remove the need for additional async alerts and logging. So it might be a better fit for a first iteration if it is easier.
> Will the backup rake task be able to take advantage of backups that are done periodically by the background worker? For example, could we add something to the backup rake task to pass in a backup_id to do the restore with?
Maybe, but the caller of the rake task would have to know the ID and know that its time matches the ad-hoc run time of the backup rake task. Otherwise, the repo and the rest of the backed-up assets would be out of sync, no?
As I've put more thought into the solution beyond the initial MVP it has evolved a bit. I'm planning on putting this all in the blueprint.
For the MVP, the backups will be more similar to how they currently work. That is, we periodically take a backup of every single repository, and these repositories will all end up with a full repository backup associated with a specific backup ID. gitlab-rails can fetch this ID with an RPC call at any time (it will just return the latest complete backup).
The ultimate goal, though, is to use Gitaly's knowledge of git and repo access to be smart about how we take full backups. That may mean that each repository has its own backup schedule. So I think we'll more likely define backup_id as a timestamp. Then, when trying to restore to a timestamp, we find the latest full backup before the given timestamp and use incremental/WAL to restore the repos as close to the timestamp as possible.
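The timestamp-based selection described above can be sketched in a few lines of Ruby. The `FullBackup` struct and `full_backup_for` helper are illustrative assumptions, not actual Gitaly code:

```ruby
# Illustrative sketch: if backup_id is a timestamp, restoring "to time T"
# means picking the newest full backup taken at or before T; WAL entries
# would then be replayed from that point up to T (not shown here).
FullBackup = Struct.new(:taken_at)

def full_backup_for(backups, target_time)
  backups
    .select { |b| b.taken_at <= target_time } # only backups before the target
    .max_by(&:taken_at)                       # pick the most recent of those
end

backups = [
  FullBackup.new(Time.utc(2024, 1, 1)),
  FullBackup.new(Time.utc(2024, 2, 1)),
  FullBackup.new(Time.utc(2024, 3, 1))
]

# Restoring to 2024-02-15 starts from the 2024-02-01 full backup.
puts full_backup_for(backups, Time.utc(2024, 2, 15)).taken_at
```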
Given that:
For backup.rake integration, I expect gitlab-rails to store this somehow. It doesn't matter if it is in the DB or as a file in the backup tar. I think eventually we will want some Gitaly RPCs to list the available backups.
Since there would be no schedule, it basically doesn't matter. You will want to run backup.rake as often as you want DB/file backups.
backup.rake already shells out to gitaly-backup for restores, so I expect that we'll just be adding a direct-to-object-storage option to gitaly-backup. It will pass along the backup ID/timestamp, and the restore RPC will best-effort restore the repos to that specific point.
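A hedged Ruby sketch of what that shell-out could look like. The -server-side flag is mentioned elsewhere in this thread, but the -id flag name and the `restore_repositories` helper are assumptions for illustration only, not the real gitaly-backup CLI:

```ruby
require "open3"

# Hypothetical sketch: backup.rake hands the backup ID/timestamp to
# gitaly-backup, which talks to the restore RPC. The "-id" flag is an
# assumed name, not a documented gitaly-backup option.
def restore_repositories(backup_id, bin: "gitaly-backup")
  cmd = [bin, "restore", "-server-side", "-id", backup_id.to_s]
  stdout, stderr, status = Open3.capture3(*cmd)
  raise "gitaly-backup failed: #{stderr}" unless status.success?
  stdout
end
```

Using an argument array with `Open3.capture3` (rather than a single string) avoids shell interpolation of the backup ID.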
> This way an operator can find the log in case something went wrong in the async backup process. Furthermore, we may need an alert of sorts if the backup does not succeed.
Right. This is one reason why I think it makes sense to get backups out of backup.rake. There's no good way to monitor backup.rake. If we move backups into gitaly, then we already have centralised logging and metrics at our disposal. We can set up alerting off metrics as we would for any other service degradation.
> Having the backup be synchronous would remove the need for additional async alerts and logging. So it might be a better fit for a first iteration if it is easier.
Yeah. There will effectively be an RPC to ask the Gitaly nodes to take backups of specific repos, so this might be easy. Though it hasn't been designed to work through Praefect, so there would be some extra work there.
In light of this approach, I have some additional questions:
- How often do you expect the backups to happen for each repository? Or would this be configurable?
- What would be the retention policy for different time intervals (i.e. daily, weekly, monthly)? Again, configurable?
- If you will rely on the WAL to approximate the repository restore point to the desired timestamp, does this mean you will retain all the WAL metadata in between backups for each repo?
- From this approach, it sounds like you will have the capability to restore a repository to any arbitrary point in time, regardless of the backup schedule that the rake task runs on as configured by customers. Is that a correct assumption? And if so, does it mean that the rake task does not need to call gitaly to initiate a backup at all, but rather just specify a timestamp when running a restore?
> For the MVP, the backups will be more similar to how they currently work. That is, we periodically take a backup of every single repository, and these repositories will all end up with a full repository backup associated with a specific backup ID. gitlab-rails can fetch this ID with an RPC call at any time (it will just return the latest complete backup).
I'm supportive of this approach for the MVP. How do you see the rake task being able to report back on success/failure and on errors so the user is confident they have a complete backup or successful restore?
> How do you see the rake task being able to report back on success/failure and on errors so the user is confident they have a complete backup or successful restore?
I don't expect this transition to be transparent. Alerting will have to be set up. I think it ought to work the same as any other service-level alerting for Gitaly.
> What would be the retention policy for different time intervals (i.e. daily, weekly, monthly)? Again, configurable?
Retention is usually configured via the object-storage provider. It is not part of this proposal, except that I think we ought to change the backup layout to support WORM.
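As an example of provider-side retention, an AWS S3 lifecycle rule could expire old repository backups after a fixed window. The bucket prefix and the 90-day window below are made up for illustration:

```json
{
  "Rules": [
    {
      "ID": "expire-old-repo-backups",
      "Filter": { "Prefix": "gitaly-backups/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
```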
> If you will rely on the WAL to approximate the repository restore point to the desired timestamp, does this mean you will retain all the WAL metadata in between backups for each repo?
Right now the plan would be to stream WAL entries to object storage. I expect that the layout of the backups will need to change to accommodate this. IIUC, there may be one WAL for multiple repository "partitions".
The goal of the MVP is really just to see how the shift to server-side will work, and it likely won't have incremental or WAL backups.
At least for the MVP we won't have a background worker. backup.rake will call gitaly-backup with -server-side, which will trigger gitaly-backup to use the new server-side RPCs BackupRepository/RestoreRepository. There's a new config setting (docs still pending) that specifies the object-storage connection.
This should be a much simpler transition for backup.rake. I'll just use the backup ID that backup.rake already generates; it is already part of the backup metadata, and the backup archive will have no repository directory.
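Reading that existing ID back out of the backup metadata could look like the Ruby sketch below. Note that the `:backup_id` key and the file layout are assumptions for illustration; the real `backup_information.yml` contents may differ:

```ruby
require "yaml"
require "tmpdir"

# Hedged sketch: backup.rake already writes a backup_information.yml into
# the backup archive; a server-side restore could reuse the ID stored there.
# The :backup_id key name is an assumption, not the documented schema.
def backup_id_from_metadata(backup_dir)
  info = YAML.safe_load(
    File.read(File.join(backup_dir, "backup_information.yml")),
    permitted_classes: [Symbol]
  )
  info[:backup_id] || info["backup_id"]
end

# Demo with a throwaway directory and a made-up ID in the usual
# <timestamp>_<date>_<version> shape that backup.rake generates.
dir = Dir.mktmpdir
File.write(File.join(dir, "backup_information.yml"),
           { backup_id: "1714651200_2024_05_02_17.0.0" }.to_yaml)
puts backup_id_from_metadata(dir)
```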
One thing that might be worth mentioning in terms of transition is that because I now think we have to solve gitlab#357044 (closed) as part of MVP, what we store for backups will be slightly different. So you probably couldn't just upload some historic backup to object-storage and expect it to seamlessly work. Exactly how this transition goes still needs to be worked out.
> One thing that might be worth mentioning in terms of transition is that because I now think we have to solve gitlab#357044 (closed) as part of MVP, what we store for backups will be slightly different. So you probably couldn't just upload some historic backup to object-storage and expect it to seamlessly work. Exactly how this transition goes still needs to be worked out.
By historic backup do you mean a backup created using the old tar method?
The reason is what I describe in &10077 (comment 1459412652) - basically, because backup.rake always uses the latest backup, it would be unaffected, but server-side needs to be specific, and in that case you end up with headaches when there is not a complete set of backups. We will obviously need to handle incomplete backup sets for per-repository scheduling, but this work has not started.