Big issue that requires leadership and hard thinking: we are moving our repos to CephFS and we do not have a backup plan. This has to be covered as soon as possible.
Requirements:
Offline
Different provider
Different credentials
Different format (test CephFS corruption and recover from it)
What is not a backup
Another Ceph cluster to replicate to is great, but not a replacement.
Repushing the commits to another cluster is also not a replacement. We are not only backing up git repos.
At this point, no, not really. The issue is that right now CephFS (because it is a POSIX overlay, and as such tracks the extra metadata needed to present a POSIX file system in a separate process and database from the rest of Ceph) doesn't conform to the rest of the Ceph architecture. For example, Ceph has the concept of replication zones for different data centers, with gateways that replicate the Ceph data between the two. While we could set this up to replicate the Ceph pool data between two data centers, the lack of the MDS data makes the replicated data unusable. The same goes for most documented "Ceph" backup solutions (note the lack of 'FS', which does grow confusing), as they don't currently apply to CephFS.
We're on the bleeding edge of this one, which is scary but also affords us the opportunity to forge what works for us and maybe set a precedent.
Good info in http://disq.us/p/1cafs3l "We use CephFS on our backend and yes it is still very new. Snapshots are not recommended right now, but you also have to think about backup in terms of MDS failing as well. A true backup would be on a different medium. Also note MDS is single threaded so you may hit a bottleneck there. I know they are working on MultiMDS and hopefully the single thread will be a thing of the past."
Ok, so we have a plan for the initial backup. These are the notes from the document with a few comments here and there to explain what the idea is.
Highlights
We have a plan forward, at least an experiment to run
Lowlights
We do not have a plan for incremental backups.
Summary of the plan
Replicate CephFS to another cluster (will be interesting)
@northrup's idea is to snapshot the MDS server on top of the Ceph RADOS cluster to have an offline copy of the data. He will be investigating this approach.
Kind of. What we talked about was using the underlying Ceph RADOS snapshot function and Ceph data center syncing: time a snapshot of the Ceph pools with a CephFS MDS journal dump, then replicate the pools and see what the skew level is (if any) between the MDS journal recovery and the replicated data in the underlying Ceph pools.
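For reference, a rough sketch of the commands involved in that experiment, assuming the stock cephfs-journal-tool and rados CLIs; the pool names are placeholders (check `ceph osd lspools` for the real ones), and whether the journal dump and the pool snapshots line up closely enough is exactly what we want to measure:

```sh
#!/bin/bash
# Sketch: pair a CephFS MDS journal dump with RADOS pool snapshots taken at
# (roughly) the same moment, so the two can be compared after replication.
set -e

# Dump the MDS journal to an offline file.
cephfs-journal-tool journal export "/backup/mds-journal-$(date +%F).bin"

# Snapshot the underlying RADOS pools. Pool names are assumptions.
for pool in cephfs_data cephfs_metadata; do
  rados -p "$pool" mksnap "backup-$(date +%F)"
done
```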
This is what we used to move data from ext4 to XFS on AWS, and then from AWS to Azure. Ask me if things are not obvious; this documentation is not perfect.
Edit: some more explanation of what we did.
The basic idea was to sync the repositories in multiple passes:
T1: rsync everything
T2: rsync all repos with activity since T1
T3: rsync all repos with activity since T2
... repeat until rsync run no longer gets shorter
take downtime, do final rsync run (repos with activity since Tn), switch storage
run cleanup rake script to remove repositories that were removed/renamed since T1
We have a rake task that generates the list of repos for each rsync run. The actual rsync run used GNU parallel; it was really a batch of small rsync jobs (one per repository on the list). We used a directory with text files to keep state during a batch, so that a batch could be resumed.
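For illustration, a minimal sketch of what one batch run could look like; the paths, the repo-list file produced by the rake task, and the 10-way concurrency are assumptions:

```sh
#!/bin/bash
# One batch: run one small rsync per repository on the list, in parallel,
# and record finished repos so an interrupted batch can be resumed.
set -u

LIST=/var/tmp/repo-batch/repos.txt   # one repo path per line, from the rake task
DONE=/var/tmp/repo-batch/done.txt    # repos that finished in this batch
SRC=/var/opt/gitlab/git-data/repositories
DST=/mnt/cephfs/git-data/repositories
export DONE SRC DST

touch "$DONE"

sync_one() {
  local repo="$1"
  # Appends of short lines are effectively atomic, which is good enough
  # for a resumable 'done' marker written from parallel jobs.
  rsync -a --delete "$SRC/$repo/" "$DST/$repo/" && echo "$repo" >> "$DONE"
}
export -f sync_one

# Around 10 parallel small rsync jobs gave good throughput in practice.
parallel -j 10 -a "$LIST" sync_one
```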
The documentation I linked to above explains roughly what the steps are for doing one batch run.
What made this approach successful, IMO, was the fact that:
running ~10 parallel small rsync jobs gave good throughput
we were able to resume interrupted batches with low startup overhead (about 1 minute of sort | uniq -u; see the sketch below)
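Resuming looked roughly like this, reusing $LIST, $DONE, and sync_one from the sketch above, and assuming the done file only contains repos that are also on the batch list:

```sh
# Repos on the batch list with no 'done' marker appear exactly once in the
# combined stream, so `uniq -u` yields the remaining work.
sort "$LIST" "$DONE" | uniq -u > /var/tmp/repo-batch/remaining.txt

# Feed the remainder back through the same parallel rsync loop.
parallel -j 10 -a /var/tmp/repo-batch/remaining.txt sync_one
```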
You really want to use an rsync-esque (but probably parallel) approach. If you want to optimise this, you could create a tool that consumes CephFS recursive statistics (rstats) to rapidly check if a directory has been updated without having to traverse all its children. Large amounts of kudos are available for whoever gets around to writing this tool first :-)
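The tool doesn't exist yet, but a rough sketch of the idea, using the ceph.dir.rctime virtual xattr that CephFS exposes for recursive statistics; the paths and cutoff handling are assumptions, and the rctime semantics should be verified on our Ceph version:

```sh
#!/bin/bash
# Sketch: list top-level repo directories whose subtree changed since CUTOFF,
# using CephFS recursive stats instead of walking every file underneath.
CUTOFF=$(date -d '2017-01-01' +%s)      # assumed cutoff, e.g. the last pass
ROOT=/mnt/cephfs/git-data/repositories  # assumed mount point

for dir in "$ROOT"/*/; do
  # ceph.dir.rctime is the newest ctime anywhere under $dir, as "<secs>.<nsecs>"
  rctime=$(getfattr -n ceph.dir.rctime --only-values "$dir" 2>/dev/null)
  if [ -n "$rctime" ] && [ "${rctime%%.*}" -gt "$CUTOFF" ]; then
    echo "$dir"
  fi
done
```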
When the purpose of your replication is robustness against all types of issues (not just unavailability of the primary site), you should always operate at the file level (and not try to copy anything lower level like raw RADOS pools), because in the event that there was some low-level corruption on the primary system, you don't want to transmit it to the secondary.
As you've noticed, there isn't any geo-replication built into CephFS, and RADOS-level snapshots aren't going to work for you because you won't be able to get moment in time consistency between the data and metadata pools. The RADOS-level snapshots are also incompatible with CephFS snapshots (they use the same underlying mechanism). Note that all the geo-replication features that do exist are actually in the RBD/RGW layers: there isn't any geo-replication in RADOS that you're missing out on.
If you wanted to live dangerously and do the initial sync between two filesystems at a lower overhead than rsync, you could try taking your filesystem offline and using "rados export" to create a serialized stream of your pools. On a suitably fast client node you might find that rados export goes fast enough to saturate your WAN link, but you'd have to test it. However, I wouldn't go down that road unless you'd established that parallel rsync was really not going to be fast enough.
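To make that option concrete, it would look roughly like this; the pool names are placeholders, and the filesystem/MDS would have to be offline first, as noted above:

```sh
# With CephFS taken offline, serialize each pool to a file that can be
# shipped to the remote site and loaded there with `rados import`.
for pool in cephfs_data cephfs_metadata; do
  rados -p "$pool" export "/backup/${pool}.dump"
done
```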
You might also get some helpful experiences from folks on the ceph-users list.
In our team discussion several of us began hitting around the edges of the same design structure, which can best be described as a 'pub/sub' queue and bus for near real-time replication of objects from production to remote (DR site). My example use case would be to use a post-commit hook that publishes a message to the queue containing the name of the repo. From there we have "sherpa" nodes that are responsible for subscribing to the queue and turning 'requests' into individual rsync commands to sync the repo from source to target destination. This would allow us, once caught up, to trickle-stream the data to the target without saturating the hosts in production. It would also prevent a potential thundering herd, as we could configure concurrency on the sherpa side so that during large workloads the work just queues for processing.
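To give that shape, here is a minimal sketch of the two halves, with Redis standing in for the queue; the queue technology, key name, hosts, paths, and concurrency cap are all assumptions, not decisions:

```sh
#!/bin/bash
# Sketch of the 'sherpa' pub/sub idea. Everything below is illustrative only.

QUEUE=repo-sync-queue
SRC_HOST=prod-storage-01             # assumed production storage node
SRC_ROOT=/var/opt/gitlab/git-data/repositories
DST_ROOT=/mnt/dr/git-data/repositories
MAX_JOBS=4                           # bounded concurrency, no thundering herd

# --- Producer (post-receive hook on the repo) --------------------------------
# The hook only publishes the repo's relative path; no copying happens here.
#   relpath=$(realpath --relative-to="$SRC_ROOT" "$GIT_DIR")
#   redis-cli LPUSH "$QUEUE" "$relpath"

# --- Consumer (runs on a sherpa node at the DR site) -------------------------
while true; do
  # BRPOP blocks until a repo path is queued; output is "<queue>\n<value>".
  repo=$(redis-cli BRPOP "$QUEUE" 0 | tail -n 1)
  [ -n "$repo" ] || continue

  # Stay under the concurrency cap, then sync this repo in the background.
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do sleep 1; done
  rsync -a --delete "$SRC_HOST:$SRC_ROOT/$repo/" "$DST_ROOT/$repo/" &
done
```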
@pcarranza doing snapshot+rsync is definitely a sane thing in principle, but snapshots are one of the less tested bits of cephfs at the moment, so if you can come up with a solution that avoids them you'll have a lower probability of running into issues.
> into individual rsync commands to sync the repo from source to target destination
I am not sure that gives you a consistent git repository on the other end. In our other rsync git replication scenarios (the repo storage move feature, or the rsync bulk migration I described above) we have a 'final frozen rsync'. If you don't have that, you may end up losing objects when rsyncing in the middle of a repack. Doing a git push (or pull) would be more reliable (but it would miss out on git-annex data).
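For comparison, the push-based variant of the sherpa step would be something like this (host and paths are placeholders, and it assumes a bare target repo already exists); git sends a self-consistent pack, so a concurrent repack on the source can't leave the target with missing objects:

```sh
# Mirror one bare repository to the DR copy over git instead of rsync.
git -C "/var/opt/gitlab/git-data/repositories/$repo" \
    push --mirror "dr-host:/mnt/dr/git-data/repositories/$repo"
```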
For the initial data transfer I'd like to propose that we look at starting an export job with Azure and shipping them a couple of 8TB drives to load our repo data out on, then coordinate that with an Amazon Snowball device to get our data into AWS. This would for sure be less taxing and a quicker transfer than streaming it across the internet. Once done we could then initiate an rsync to catch the delta change in our data.
@pcarranza Geo does not have a feature to "backfill" the missing inactive repos, so to have 100% coverage we must make sure we do an initial rsync.
The initial rsync is needed only for repositories that are not receiving any push. If we don't want to do the initial rsync, we could craft an API request to simulate the event notification Geo uses to start the git clone / fetch from the primary and make it part of the migration process (after migrating, make the API call).
This will also be a good thing, as under our size of load we will be able to spot whatever inefficiencies show up.
@brodock so, to backfill we should fake a notification event... does it make sense to actually add this as a feature somehow? like a "fetch now" that will force fetching?
The issue with doing an rsync is that we are talking about a lot of data. If we can leverage Geo to do it for us, and in particular to keep it up to date, we would have everything covered.
@pcarranza yes, we will need to fake a notification event. I believe it does make sense to add this as a feature. A simple implementation would be to iterate over all projects and schedule (in the secondary node) an update job. We can do that after the initial live traffic update finishes (which for our size may take a lot of time).
Bear in mind that the initial sync will be a full git clone, with all the CPU overhead related to it; that's why we opted for doing the rsync first.
@brodock please add a simple implementation to iterate over all projects. Customers will need this too. We should bake it into the product and allow people to easily enable/disable it and throttle it. We should not mess around with rsync for this since our customers need it in the product too. /cc @stanhu
@northrup Can we start the boring export solution and ship to AWS ASAP? I think we should do both NFS and CephFS clusters right now. Is there any reason to wait?
@stanhu the exporting of data from Azure hit a snag: the export process just takes data from existing block blobs (of which we have 240 1TB block blobs supporting Ceph) and copies that data to a drive that we would have to provide. Also, the drive would have to be formatted NTFS and the data would be encrypted with a Microsoft BitLocker encryption key.
I am now looking at copying data from Ceph to AWS EFS using the CERN FDT Tool that @sytses found.
To test the backup theory, I have the FDT software running on a node called "ChangeAgent" in Azure ARM that has both CephFS and NFS mounted. I am copying the data to a target node in AWS that has an EFS drive mounted, as a means of getting a copy of our data somewhere else.
The Uploads directory of GitLab has replicated to AWS. The FDT software works by providing it a list of files to replicate to a target; it does not do the filesystem walk itself. In light of that, I have had a job walking the git data store for ~13 hours now, and we're up to 9.5 million files and counting that need to be replicated.
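For the record, the walk plus hand-off looks roughly like this; the find invocation is the real work, while the FDT flags (-c, -d, -fl) and host/paths are from memory of the FDT docs and should be treated as assumptions to verify:

```sh
# Build the file list FDT needs (it will not walk the tree itself),
# then hand it to FDT running on the ChangeAgent node.
find /mnt/cephfs/git-data -type f > /var/tmp/fdt-filelist.txt

# Assumed invocation: push everything on the list to the AWS target node,
# landing under the destination directory on the EFS mount.
java -jar fdt.jar -c aws-efs-target.example.com \
     -d /mnt/efs/git-data \
     -fl /var/tmp/fdt-filelist.txt
```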