Big issue that requires leadership and hard thinking: we are moving our repos to CephFS and we do not have a backup plan. This has to be covered as soon as possible.
Requirements:
Offline
Different provider
Different credentials
Different format (test CephFS corruption and recover from it)
What is not a backup
Another Ceph cluster to replicate to is great, but not a replacement.
Repushing the commits to another cluster is also not a replacement. We are not only backing up git repos.
At this point, no, not really. The issue is that right now CephFS (because it is a POSIX overlay, and as such tracks the extra metadata needed to present a POSIX file system in a separate process and database from the rest of Ceph) doesn't conform to the rest of the Ceph architecture. For example, Ceph has the concept of replication zones for different data centers, with gateways that replicate the Ceph data between the two. While we could set this up to replicate the Ceph pool data between two data centers, the lack of the MDS data makes the replicated data unusable. The same goes for most documented "Ceph" backup solutions (note the lack of 'FS', which does grow confusing), as they don't currently apply to CephFS.
We're on the bleeding edge of this one, which is scary but also affords us the opportunity to forge what works for us and maybe set a precedent.
Good info in http://disq.us/p/1cafs3l "We use CephFS on our backend and yes it is still very new. Snapshots are not recommended right now, but you also have to think about backup in terms of MDS failing as well. A true backup would be on a different medium. Also note MDS is single threaded so you may hit a bottleneck there. I know they are working on MultiMDS and hopefully the single thread will be a thing of the past."
Ok, so we have a plan for the initial backup. These are the notes from the document with a few comments here and there to explain what the idea is.
Highlights
We have a plan forward, at least an experiment to run
Lowlights
We do not have a plan for incremental backups.
Summary of the plan
Replicate CephFS to another cluster (will be interesting)
@northrup's idea is to snapshot the MDS server on top of the Ceph RADOS cluster to have an offline copy of the data. He will be investigating this approach.
Kind of. What we talked about was using the underlying Ceph RADOS snapshot function and Ceph data center syncing: time a snapshot of the Ceph pools with a CephFS MDS journal dump, then replicate the pools and see what the skew level is (if any) between the MDS journal recovery and the replicated data in the underlying Ceph pools.
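For reference, a rough sketch of the commands involved in that experiment, assuming the stock cephfs-journal-tool and rados CLIs; the pool names are placeholders (check `ceph osd lspools` for the real ones), and whether the journal dump and the pool snapshots line up closely enough is exactly what we want to measure:

```sh
#!/bin/bash
# Sketch: pair a CephFS MDS journal dump with RADOS pool snapshots taken at
# (roughly) the same moment, so the two can be compared after replication.
set -e

# Dump the MDS journal to an offline file.
cephfs-journal-tool journal export "/backup/mds-journal-$(date +%F).bin"

# Snapshot the underlying RADOS pools. Pool names are assumptions.
for pool in cephfs_data cephfs_metadata; do
  rados -p "$pool" mksnap "backup-$(date +%F)"
done
```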
This is what we used to move data from ext4 to XFS on AWS, and then from AWS to Azure. Ask me if things are not obvious; this documentation is not perfect.
Edit: some more explanation of what we did.
The basic idea was to sync the repositories in multiple passes:
T1: rsync everything
T2: rsync all repos with activity since T1
T3: rsync all repos with activity since T2
... repeat until rsync run no longer gets shorter
take downtime, do final rsync run (repos with activity since Tn), switch storage
run cleanup rake script to remove repositories that were removed/renamed since T1
We have a rake task that generates the list of repos for each rsync run. The actual rsync run used GNU parallel; it was really a batch of small rsync jobs (one per repository on the list). We used a directory with text files to keep state during a batch, so that a batch could be resumed.
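For illustration, a minimal sketch of what one batch run could look like; the paths, the repo-list file produced by the rake task, and the 10-way concurrency are assumptions:

```sh
#!/bin/bash
# One batch: run one small rsync per repository on the list, in parallel,
# and record finished repos so an interrupted batch can be resumed.
set -u

LIST=/var/tmp/repo-batch/repos.txt   # one repo path per line, from the rake task
DONE=/var/tmp/repo-batch/done.txt    # repos that finished in this batch
SRC=/var/opt/gitlab/git-data/repositories
DST=/mnt/cephfs/git-data/repositories
export DONE SRC DST

touch "$DONE"

sync_one() {
  local repo="$1"
  # Appends of short lines are effectively atomic, which is good enough
  # for a resumable 'done' marker written from parallel jobs.
  rsync -a --delete "$SRC/$repo/" "$DST/$repo/" && echo "$repo" >> "$DONE"
}
export -f sync_one

# Around 10 parallel small rsync jobs gave good throughput in practice.
parallel -j 10 -a "$LIST" sync_one
```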
The documentation I linked to above explains roughly what the steps are for doing one batch run.
What made this approach successful, IMO, was the fact that:
running ~10 parallel small rsync jobs gave good throughput
we were able to resume interrupted batches with low startup overhead (about 1 minute of sort | uniq -u; see the sketch below)
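Resuming looked roughly like this, reusing $LIST, $DONE, and sync_one from the sketch above, and assuming the done file only contains repos that are also on the batch list:

```sh
# Repos on the batch list with no 'done' marker appear exactly once in the
# combined stream, so `uniq -u` yields the remaining work.
sort "$LIST" "$DONE" | uniq -u > /var/tmp/repo-batch/remaining.txt

# Feed the remainder back through the same parallel rsync loop.
parallel -j 10 -a /var/tmp/repo-batch/remaining.txt sync_one
```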
You really want to use an rsync-esque (but probably parallel) approach. If you want to optimise this, you could create a tool that consumes CephFS recursive statistics (rstats) to rapidly check if a directory has been updated without having to traverse all its children. Large amounts of kudos are available for whoever gets around to writing this tool first :-)
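The tool doesn't exist yet, but a rough sketch of the idea, using the ceph.dir.rctime virtual xattr that CephFS exposes for recursive statistics; the paths and cutoff handling are assumptions, and the rctime semantics should be verified on our Ceph version:

```sh
#!/bin/bash
# Sketch: list top-level repo directories whose subtree changed since CUTOFF,
# using CephFS recursive stats instead of walking every file underneath.
CUTOFF=$(date -d '2017-01-01' +%s)      # assumed cutoff, e.g. the last pass
ROOT=/mnt/cephfs/git-data/repositories  # assumed mount point

for dir in "$ROOT"/*/; do
  # ceph.dir.rctime is the newest ctime anywhere under $dir, as "<secs>.<nsecs>"
  rctime=$(getfattr -n ceph.dir.rctime --only-values "$dir" 2>/dev/null)
  if [ -n "$rctime" ] && [ "${rctime%%.*}" -gt "$CUTOFF" ]; then
    echo "$dir"
  fi
done
```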
When the purpose of your replication is robustness against all types of issues (not just unavailability of the primary site), you should always operate at the file level (and not try to copy anything lower level like raw RADOS pools), because in the event that there was some low-level corruption on the primary system, you don't want to transmit it to the secondary.
As you've noticed, there isn't any geo-replication built into CephFS, and RADOS-level snapshots aren't going to work for you because you won't be able to get moment in time consistency between the data and metadata pools. The RADOS-level snapshots are also incompatible with CephFS snapshots (they use the same underlying mechanism). Note that all the geo-replication features that do exist are actually in the RBD/RGW layers: there isn't any geo-replication in RADOS that you're missing out on.
If you wanted to live dangerously and do the initial sync between two filesystems at a lower overhead than rsync, you could try taking your filesystem offline and using "rados export" to create a serialized stream of your pools. On a suitably fast client node you might find that rados export goes fast enough to saturate your WAN link, but you'd have to test it. However, I wouldn't go down that road unless you'd established that parallel rsync was really not going to be fast enough.
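To make that option concrete, it would look roughly like this; the pool names are placeholders, and the filesystem/MDS would have to be offline first, as noted above:

```sh
# With CephFS taken offline, serialize each pool to a file that can be
# shipped to the remote site and loaded there with `rados import`.
for pool in cephfs_data cephfs_metadata; do
  rados -p "$pool" export "/backup/${pool}.dump"
done
```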
You might also get some helpful experiences from folks on the ceph-users list.
In our team discussion several of us began hitting around the edges of the same design structure, which can best be described as a 'pub/sub' queue and bus for near real-time replication of objects from production to remote (DR site). My example use case would be to use a post-commit hook that publishes a message to the queue containing the name of the repo. From there we have "sherpa" nodes that are responsible for subscribing to the queue and turning 'requests' into individual rsync commands to sync the repo from source to target destination. This would allow us, once caught up, to trickle-stream the data to the target without saturating the hosts in production. It would also prevent a potential thundering herd, as we could configure concurrency on the sherpa side so that during large workloads the work just queues for processing.
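To give that shape, here is a minimal sketch of the two halves, with Redis standing in for the queue; the queue technology, key name, hosts, paths, and concurrency cap are all assumptions, not decisions:

```sh
#!/bin/bash
# Sketch of the 'sherpa' pub/sub idea. Everything below is illustrative only.

QUEUE=repo-sync-queue
SRC_HOST=prod-storage-01             # assumed production storage node
SRC_ROOT=/var/opt/gitlab/git-data/repositories
DST_ROOT=/mnt/dr/git-data/repositories
MAX_JOBS=4                           # bounded concurrency, no thundering herd

# --- Producer (post-receive hook on the repo) --------------------------------
# The hook only publishes the repo's relative path; no copying happens here.
#   relpath=$(realpath --relative-to="$SRC_ROOT" "$GIT_DIR")
#   redis-cli LPUSH "$QUEUE" "$relpath"

# --- Consumer (runs on a sherpa node at the DR site) -------------------------
while true; do
  # BRPOP blocks until a repo path is queued; output is "<queue>\n<value>".
  repo=$(redis-cli BRPOP "$QUEUE" 0 | tail -n 1)
  [ -n "$repo" ] || continue

  # Stay under the concurrency cap, then sync this repo in the background.
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do sleep 1; done
  rsync -a --delete "$SRC_HOST:$SRC_ROOT/$repo/" "$DST_ROOT/$repo/" &
done
```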
@pcarranza doing snapshot+rsync is definitely a sane thing in principle, but snapshots are one of the less tested bits of cephfs at the moment, so if you can come up with a solution that avoids them you'll have a lower probability of running into issues.
> into individual rsync commands to sync the repo from source to target destination
I am not sure that gives you a consistent git repository on the other end. In our other rsync git replication scenarios (the repo storage move feature, or the rsync bulk migration I described above) we have a 'final frozen rsync'. If you don't have that, you may end up losing objects when rsyncing in the middle of a repack. Doing a git push (or pull) would be more reliable (but it would miss out on git-annex data).
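For comparison, the push-based variant of the sherpa step would be something like this (host and paths are placeholders, and it assumes a bare target repo already exists); git sends a self-consistent pack, so a concurrent repack on the source can't leave the target with missing objects:

```sh
# Mirror one bare repository to the DR copy over git instead of rsync.
git -C "/var/opt/gitlab/git-data/repositories/$repo" \
    push --mirror "dr-host:/mnt/dr/git-data/repositories/$repo"
```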
For the initial data transfer I'd like to propose that we look at starting an export job with Azure and shipping them a couple of 8TB drives to load our repo data out on, then coordinate that with an Amazon Snowball device to get our data into AWS. This would for sure be less taxing and a quicker transfer than streaming it across the internet. Once done we could then initiate an rsync to catch the delta change in our data.
@pcarranza Geo does not have a feature to "backfill" the missing inactive repos, so to have 100% coverage we must make sure we do an initial rsync.
The initial rsync is needed only for repositories that are not receiving any push. If we don't want to do the initial rsync, we could craft an API request to simulate the event notification Geo uses to start the git clone / fetch from the primary and make it part of the migration process (after migrating, make the API call).
This will also be a good thing, as under our size of load we will be able to spot whatever inefficiencies show up.
@brodock so, to backfill we should fake a notification event... does it make sense to actually add this as a feature somehow? like a "fetch now" that will force fetching?
The issue with doing an rsync is that we are talking about a lot of data. If we can leverage Geo to do it for us, and in particular to keep it up to date, we would have everything covered.
@pcarranza yes, we will need to fake a notification event. I believe it does make sense to add this as a feature. A simple implementation would be to iterate over all projects and schedule (in the secondary node) an update job. We can do that after the initial live traffic update finishes (which for our size may take a lot of time).
Bear in mind that the initial sync will be a full git clone, with all the CPU overhead related to it; that's why we opted for doing the rsync first.
@brodock please add a simple implementation to iterate over all projects. Customers will need this too. We should bake it into the product and allow people to easily enable/disable it and throttle it. We should not mess around with rsync for this since our customers need it in the product too. /cc @stanhu
@northrup Can we start the boring export solution and ship to AWS ASAP? I think we should do both NFS and CephFS clusters right now. Is there any reason to wait?
@stanhu the exporting of data from Azure hit a snag: the export process just takes data from existing block blobs (of which we have 240 1TB block blobs supporting Ceph) and copies that data to a drive that we would have to provide. Also, the drive would have to be formatted NTFS and the data would be encrypted with a Microsoft BitLocker encryption key.
I am now looking at copying data from Ceph to AWS EFS using the CERN FDT Tool that @sytses found.
To test the backup theory, I have the FDT software running on a node called "ChangeAgent" in Azure ARM that has both CephFS and NFS mounted. I am copying the data to a target node in AWS that has an EFS drive mounted, as a means of getting a copy of our data somewhere else.
The Uploads directory of GitLab has replicated to AWS. The FDT software works by providing it a list of files to replicate to a target; it does not do the filesystem walk itself. In light of that, I have had a job walking the git data store for ~13 hours now, and we're up to 9.5 million files and counting that need to be replicated.
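For the record, the walk plus hand-off looks roughly like this; the find invocation is the real work, while the FDT flags (-c, -d, -fl) and host/paths are from memory of the FDT docs and should be treated as assumptions to verify:

```sh
# Build the file list FDT needs (it will not walk the tree itself),
# then hand it to FDT running on the ChangeAgent node.
find /mnt/cephfs/git-data -type f > /var/tmp/fdt-filelist.txt

# Assumed invocation: push everything on the list to the AWS target node,
# landing under the destination directory on the EFS mount.
java -jar fdt.jar -c aws-efs-target.example.com \
     -d /mnt/efs/git-data \
     -fl /var/tmp/fdt-filelist.txt
```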