File Migration Framework (discovery)

We have needed to perform file migrations (moving things around) a few times already. Each time, we had to hand-craft a lot of code and take care of the same concerns over and over.

Here is a non-exhaustive list of things that we usually have to do:

List / summarize the before and after states (resources that need to be migrated, resources that have been migrated, total counts, etc.)
Handle both migration and rollback
Handle partial migration failure (some resources are migrated and then a failure happens)
1. when you are tracking each resource individually and one fails, you only need to restore that single resource
2. when you are tracking a group of resources and a failure happens after some have been migrated, you need to roll those back
3. you should also provide instructions for when the rollback itself fails, and probably stop any further migration, as there may be a bug in the codebase
Filter targets to be migrated / rolled back (so you can do a small amount of them first, verify and perform other operations later)
Log and report progress
Allow execution either in the foreground or in the background (jobs)
Allow batching up migrations (so you can speed it up for bigger installations like ours)
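
The partial-failure case for a group of resources (item 2 and 3 above) could be sketched roughly as follows. This is only an illustration: `migrate_group`, `RollbackFailed`, and the `migrate:` / `rollback:` callables are hypothetical names, not existing code.

```ruby
# Hypothetical sketch: none of these names come from an existing codebase.

# Raised when undoing a migration fails; the caller should stop any further
# migration and follow manual recovery instructions.
class RollbackFailed < StandardError; end

# Migrates every resource in the group. If one migration fails, the resources
# that were already migrated are rolled back in reverse order.
# `migrate` and `rollback` are callables that take a single resource.
def migrate_group(resources, migrate:, rollback:)
  done = []

  resources.each do |resource|
    migrate.call(resource)
    done << resource
  end

  { status: :migrated, migrated: done }
rescue StandardError => e
  done.reverse_each do |resource|
    begin
      rollback.call(resource)
    rescue StandardError => rollback_error
      # The rollback itself failed: surface it loudly instead of continuing.
      raise RollbackFailed,
            "could not roll back #{resource.inspect}: #{rollback_error.message}"
    end
  end

  { status: :rolled_back, error: e.message, rolled_back: done.reverse }
end
```

When each resource is tracked individually, the same idea collapses to calling `rollback` for the single resource that failed.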

There are also a few categories of migration that we usually perform:

Moving files within the same physical machine (on the same disk, or between multiple NFS-attached disks mounted on the filesystem; the code is the same in both cases)
Moving files between multiple machines (like between shards)
Moving files from local machine to object storage
Moving files from object storage back to local machines
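
Each category could become its own strategy object with the same `migrate` / `rollback` interface. Below is a minimal sketch of the first category only (moving files on the local filesystem), using Ruby's standard `FileUtils`. The class name `LocalMoveStrategy` and its method signatures are assumptions made for illustration.

```ruby
require "fileutils"

# Hypothetical strategy: moves a file between two paths that are both visible
# on the local filesystem (same disk or a mounted NFS disk).
class LocalMoveStrategy
  # Moves the file from `source` to `destination`.
  def migrate(source, destination)
    move(source, destination)
  end

  # Undoes a migration by moving the file back to `source`.
  def rollback(source, destination)
    move(destination, source)
  end

  private

  def move(from, to)
    raise ArgumentError, "missing file: #{from}" unless File.exist?(from)

    # Create the target directory tree if it does not exist yet.
    FileUtils.mkdir_p(File.dirname(to))
    FileUtils.mv(from, to)
  end
end
```

Strategies for the other categories (between machines, to and from object storage) would expose the same two methods, so the code that drives a migration would not need to know which one it is using.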

With the help of production I also found out that it's really useful to have monitoring and dedicated logs, so people have visibility into what's going on.

Minimal Viable Solution

The minimal viable solution here would be to create abstractions that represent each component of the migration and each migration strategy, and to have something that defines a migration unit that can glue everything together.

With all those components we can probably generate rake tasks and Sidekiq jobs that call:

SomeFileMigration.migrate
SomeFileMigration.rollback
SomeFileMigration.not_migrated.filter_by
SomeFileMigration.not_migrated.list
SomeFileMigration.not_migrated.summarize
SomeFileMigration.migrated.filter_by
SomeFileMigration.migrated.list
SomeFileMigration.migrated.summarize ...
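
To make the shape of that interface more concrete, here is a rough in-memory sketch. It is not a proposal for the final design. Records are plain hashes with hypothetical `:id`, `:size`, and `:migrated` keys, `MigrationScope` is an invented helper, and the sketch uses an instance rather than class-level methods to stay self-contained.

```ruby
# Hypothetical in-memory sketch of the interface listed above.
class MigrationScope
  def initialize(records)
    @records = records
  end

  # Narrows the scope with a block, e.g. filter_by { |r| r[:id] <= 100 }.
  def filter_by(&block)
    MigrationScope.new(@records.select(&block))
  end

  # Returns the records in this scope.
  def list
    @records.dup
  end

  # Returns total counts for this scope.
  def summarize
    { count: @records.size, bytes: @records.sum { |r| r[:size] } }
  end
end

class SomeFileMigration
  def initialize(records)
    @records = records
  end

  def not_migrated
    MigrationScope.new(@records.reject { |r| r[:migrated] })
  end

  def migrated
    MigrationScope.new(@records.select { |r| r[:migrated] })
  end

  # A real implementation would delegate to a migration strategy here;
  # this sketch only flips the flag on each record.
  def migrate(scope = not_migrated)
    scope.list.each { |r| r[:migrated] = true }
  end

  def rollback(scope = migrated)
    scope.list.each { |r| r[:migrated] = false }
  end
end
```

With this shape, migrating a small subset first and verifying before continuing would look like `migration.migrate(migration.not_migrated.filter_by { |r| r[:id] <= 100 })` followed by `migration.migrated.summarize`.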

The best way to implement this is probably to take one existing migration case and refactor it, extracting the parts that can be generalized.

First Iteration

Further work

It is not easy to think about all the pieces upfront, and it will probably be easier and faster to do this in multiple iterations: as soon as something is generalized enough, commit the changes, merge them, and move on. This makes the work harder to schedule and plan, but we can agree on time-boxed blocks of work, for example: let's have 2 blocks of iteration during this release.

As this initial phase progresses, it will be easier to plan ahead, as we will have a much better understanding of the problems and more solid solutions for what needs to be done.

cc @fzimmer @geo-team

Edited Mar 18, 2021 by Aakriti Gupta