Backup machine and how to avoid backups from becoming a single point of failure
I had a chat with @sranasinghe today about backups. We discussed a number of topics, including the items below:
Availability
We should look at strategies to ensure backups keep operating even when there are availability problems.
Backups are commonly implemented either as a manual task that is run on demand, or as a scheduled cron task that runs at specific times.
The problem with the scheduled approach is that, if the machine where the backup was being performed goes bad, you need to have all your custom setup documented and remember to re-establish it in order to have "backup" coverage again.
Another situation where this can go wrong is when an entire site goes down (likely your primary): on top of the work to handle the disaster itself, you have to remember to set up that backup procedure again on the newly promoted secondary.
A better approach would be to have a designated backup instance that also handles your backup logic/policy, and have it work in a similar way to how the Geo Log Cursor works today: more than one instance running at the same time (on more than one machine), with a distributed/central lock deciding which instance will "operate" at any given time.
This approach would make the backup procedure more reliable and "highly available", as well as removing the single point of failure (SPOF).
In order for it to be a "distributed" solution, you need a non-local storage solution: either network-attached storage (like NFS) or object storage.
Either of those can also serve as your "shared lock" mechanism. It can be implemented with something as simple as a "remote lock file" that contains a timestamp, updated from time to time, and a unique identifier for the running backup instance. When the TTL is breached and the lock file has not been updated, another machine tries to take the lock (similar to how the Geo Log Cursor works, except there we rely on Redis for that).
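As a rough illustration, here is a minimal sketch of that lock-file mechanism in Python, assuming the lock lives on a shared filesystem such as NFS. The path, TTL, and field names are made up for the example, and it glosses over the read-modify-write race; a real implementation would need an atomic primitive (for example, conditional writes on object storage).

```python
# Minimal sketch of the "remote lock file" idea; LOCK_PATH, TTL, and the JSON
# field names are illustrative, not an agreed design.
import json
import os
import time
import uuid

LOCK_PATH = "/mnt/shared/backup.lock"  # hypothetical shared location
TTL = 60  # seconds without a heartbeat before the lock is considered stale
INSTANCE_ID = str(uuid.uuid4())  # unique identifier for this backup instance


def read_lock():
    try:
        with open(LOCK_PATH) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def write_heartbeat():
    """Refresh the timestamp so other instances know we are still alive."""
    tmp = LOCK_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"owner": INSTANCE_ID, "timestamp": time.time()}, f)
    os.replace(tmp, LOCK_PATH)  # atomic rename on POSIX filesystems


def try_acquire():
    """Take the lock if it is missing, stale (TTL breached), or already ours."""
    lock = read_lock()
    if lock and lock["owner"] != INSTANCE_ID and time.time() - lock["timestamp"] < TTL:
        return False  # another instance is still heartbeating
    write_heartbeat()
    return True

# The instance holding the lock calls write_heartbeat() periodically while it
# "operates"; standby instances keep polling try_acquire() until the TTL is
# breached and they can take over.
```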
Integrity and Consistency
On backup integrity, we currently have the challenge of including "everything" as of a particular point in time in the same backup archive.
For very large installations, between the moment the backup procedure starts and the moment it finishes, there is a "range" of data that may not be consistent with each other, nor with the point in time the database snapshot was made.
In order to have a point-in-time backup, you need some form of snapshot capability, for example:
- from your filesystem, when it supports it (like Btrfs or ZFS)
- from your virtualization solution (where you want to synchronize the snapshot operation to happen as close to "the same time" as possible on each VM):
  - this can be implemented as a parallel execution, triggered either manually by asking the hypervisor to do so, or by pinging each instance to do so (see the sketch after this list)
  - it can be a feature provided by the virtualization software that allows you to do a "cluster-wide" snapshot
- from a feature provided by your cloud provider (with similar caveats as the virtualization solution)
- by relying on a different solution that allows you to reconcile your data to a certain point in time, for example with the help of WAL logs
- by using immutability and delayed removal of items (for object storage or regular blobs stored on disk)
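To make the "parallel execution" option above concrete, here is a hedged sketch that triggers a snapshot on every VM as close to simultaneously as possible. The host list and the take-snapshot command are hypothetical stand-ins for a real hypervisor or cloud provider API.

```python
# Trigger snapshots on all hosts in parallel so they happen as close to "the
# same time" as possible. The take-snapshot command is a made-up placeholder.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["gitaly-1.example.com", "postgres-1.example.com", "web-1.example.com"]


def snapshot(host: str) -> int:
    # Hypothetical command; replace with whatever triggers a snapshot of the
    # VM backing this host in your environment.
    return subprocess.call(["ssh", host, "sudo", "take-snapshot", "--label", "backup"])


with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
    results = list(pool.map(snapshot, HOSTS))

if any(code != 0 for code in results):
    raise RuntimeError("at least one snapshot failed; the set is not consistent")
```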
To solve the database part, what you can do is perform a `pg_dump` at a certain time, and copy the WAL logs that cover from that point in time until the target point where you want to "freeze" time.
For example:
- at `00:10` you start your `pg_dump`
- at `00:20` it finishes
- other data is being backed up
- at `00:50` all other items finish
- at `00:55` you want to freeze time
- at `00:56` or later you copy the WAL logs that cover from at least `00:09` until `00:55`

With that information you can later restore your database and replay the WAL logs until it reaches `00:55`.
Gitaly is working on a similar WAL log implementation, which would allow us to perform a similar operation:
- at some time after `00:10` and before `00:50`, copy the repositories
- at `00:56` or later, copy the WAL logs that cover from `00:09` to `00:55`

This would allow us to replay the logs so we can get the repositories as they were at `00:55`.
In both cases, a restore is only possible if the WAL logs are compatible with the version being restored. In other words, a solution with WAL logs will not be "portable", nor is it "universal".
WAL logs in PostgreSQL, for example, are not compatible between "major versions". So in order to restore a backup that relies partially on WAL logs, you need to restore it on that particular PostgreSQL version; otherwise you can't recover the data contained in the WAL files for that particular range of time.
A different alternative to relying on `pg_dump` is to use `pg_basebackup`, which copies the binary data where your database is stored, along with the WAL logs necessary to cover from the moment the process starts to the moment it finishes (you need the `-x` option for that). Read more here: https://manpages.ubuntu.com/manpages/trusty/man1/pg_basebackup.1.html
In each case, restoring the data requires a special process: setting the database up again using compatible versions and replaying the WAL logs to bring it to the desired final state.
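For PostgreSQL, that restore process could look roughly like the following sketch (PostgreSQL 12+ layout; paths and the target time are illustrative): restore the base backup into a fresh data directory, then configure the server to replay archived WAL up to the chosen point in time.

```python
# Rough sketch of the targeted-recovery restore step for PostgreSQL 12+.
import pathlib

DATA_DIR = pathlib.Path("/var/lib/postgresql/14/main")  # restored base backup
TARGET_TIME = "2023-01-01 00:55:00"  # the point where we want to "freeze" time

# Tell PostgreSQL where the archived WAL segments live and when to stop.
with open(DATA_DIR / "postgresql.auto.conf", "a") as conf:
    conf.write("restore_command = 'cp /mnt/backups/wal/%f %p'\n")
    conf.write(f"recovery_target_time = '{TARGET_TIME}'\n")

# An empty recovery.signal file puts the server into targeted recovery mode on
# the next startup: it replays WAL up to TARGET_TIME and then pauses (see
# recovery_target_action).
(DATA_DIR / "recovery.signal").touch()
```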
Archiving data with WAL logs has multiple downsides:
- It requires you to provision your services using specific versions, which may be hard to do a couple of months or years later
- Corruption is even harder to recover from, as you have to deal with "proprietary formats" that may not be optimized for error correction
- It can't be used when migrating from on-premise to cloud-managed
- It can't be used as a solution to back up cloud-managed databases (in case you want to have a consistent backup to migrate from cloud to on-premise)
Ideally what you want is to perform the reconciliation step before archiving the data, so you have the data in a portable / universal format.
A SQL-based `pg_dump` is such a format, as it is guaranteed to work with the version it was generated for and with any newer one. (As an example, a `pg_dump` taken on 14.0 will work on 15.0+ by default.)
Similarly, when considering Gitaly's future WAL-log-based approach, you want to reconcile the repository to a certain point in time before archiving, so a future restore doesn't depend on specific WAL log processing tools or specific Gitaly versions. This also helps in case there is any sort of data corruption: instead of dealing with just Git data, you would otherwise also have to consider the WAL log format, which can complicate everything to the point where recovering the data may not be possible.
Backup service / appliance
The last item we discussed was having a dedicated Backup machine and a solution built around it.
By building something as a separate service/appliance, you can look at the entire backup lifecycle and its challenges as a solution, rather than as a simple feature that is part of the product.
When you consider the challenges discussed initially, where a site can go down, your Backup solution should use Geo if it is available.
If you imagine you have a Geo primary and a Geo secondary, each in its own datacenter, and you also want backups to be reliable and not a single point of failure, then you also want your Backup service to be installed in at least two other datacenters (for example, a regional one for each Geo site).
The Backup service should also be able to trigger a backup from any available Geo site. As an example, when backups target a secondary site and that site is unavailable, it should use the primary site instead (or allow the backup instance from a secondary to take over).
This type of approach is only possible if the Backup service does not live on the same machine the other GitLab services run on, and instead connects through SSH to each machine it needs to access data from.
Going from the existing "push" method (where data is pushed by the backup task to somewhere else) to a "pull" one (where the backup service connects to where the data lives and pulls from it) will also improve security and make the backup solution harder to compromise in case one of the production machines is invaded.
It allows us to store backup-specific credentials outside of the main GitLab application, which may also be a necessary architecture for Dedicated.
By making it its own service, we can also store backup-related metadata that we can access, query, and generate statistics from later on.
For example, we can have a list of all available backups in a SQLite database. This can include things like how long each one took, file sizes, versions of each service, when something failed, etc.
That information can initially be made available in the terminal via specific commands, but later on it can turn into its own API/dashboard.
Having that metadata available can also help automate verification procedures (like verifying whether a backup integrity check is still passing, or ensuring you have the correct backup coverage in place for the amount of time you need to be covered).
It can also be used to manually purge versions you don't need to keep anymore, etc.
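As an illustration of what that metadata store could look like, here is a hypothetical single-table SQLite schema; the column names are assumptions for illustration, not an agreed design.

```python
# Hypothetical backup metadata store backed by SQLite (standard library).
import sqlite3

conn = sqlite3.connect("backup_metadata.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS backups (
        id INTEGER PRIMARY KEY,
        started_at TEXT NOT NULL,   -- ISO 8601 timestamps
        finished_at TEXT,           -- NULL while still running
        size_bytes INTEGER,
        gitlab_version TEXT,        -- versions of each service involved
        postgres_version TEXT,
        status TEXT CHECK (status IN ('running', 'succeeded', 'failed')),
        failure_reason TEXT
    )
    """
)
conn.commit()

# Example query backing a terminal command: list backups, newest first,
# showing how long each one took and whether it failed.
for row in conn.execute(
    "SELECT id, started_at, finished_at, status FROM backups ORDER BY started_at DESC"
):
    print(row)
```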
Regarding that metadata, we also discussed a layered, disaster-resistant approach to keeping the data available.
Ideally, when you save a backup in object storage, we should ship an index file alongside it (using the same name as the backup but with an extension like `.index`), containing the same data that is in the local SQLite database. This allows the metadata to be "re-scanned and re-indexed" locally, without having to make the database solution "highly available", etc.
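A minimal sketch of such an index file, assuming JSON as the format and made-up field names mirroring the SQLite row for the same backup:

```python
# Write a small ".index" companion file next to the backup archive so the
# local SQLite database can be rebuilt by re-scanning the bucket.
import json

index = {
    "backup": "2023-01-01T00-55-00Z.tar",
    "size_bytes": 123456789,
    "duration_seconds": 2700,
    "gitlab_version": "15.8.0",
    "status": "succeeded",
}

# Upload this next to the archive, e.g. as 2023-01-01T00-55-00Z.index.
with open("2023-01-01T00-55-00Z.index", "w") as f:
    json.dump(index, f, indent=2)

# Re-indexing then means: list all *.index objects in the bucket, parse each
# JSON document, and insert the rows back into the local SQLite database.
```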