Automatically enable Rugged feature flags if NFS is detected

changed milestone to %12.0

added backstage [DEPRECATED] bugperformance + 1 deleted label

@stanhu @jacobvosmaer-gitlab @zj I'm not sure if this is safe or the best place to enable this?

An alternative approach might be detecting the configuration in Omnibus? /cc @marin @sytses

changed title from Automaticall enable Rugged feature flags if NFS is detected to Automatically enable Rugged feature flags if NFS is detected

I was just going to say, doing this in Omnibus (or even within Rails) might be better because Gitaly servers don't have anyway to toggle feature flags.

@stanhu feel free to edit the description and propose what you think would be a better approach since you'll know the technical considerations better then me.

mentioned in epic &1190 (closed)

added [deprecated] Accepting merge requests label

cc @ebrinkman

Hi all, jumping in while @jramsay is on vacation. I have a few questions I'm hoping to get answered by the team:

How reliably can we detect the criteria for automatic enabling of Rugged? Are we certain this can be foolproof? What reference architectures do we have lying around that we could test the automatic enable on? Do we feel it is comprehensive enough of the architectures our customers are running?
What's the failure mode if we misdetect and the upgrade gets applied against an architecture it shouldn't? Is this easy to recover from?
How can we drive to an answer of where this should live? FF vs. Omnibus vs. Rails?

cc @markpundsack @meks

@ebrinkman great questions.

One more from me: What do we mean by enabling it from "Omnibus" is this when the user installs or upgrades the GitLab Instance? We also need to take into account that NFS can be configured or added after upgrade/installation. Unless Omnibus is continuously detecting NFS usage it would be safer to implement it in Rails.

cc @at.ramya per gitaly performance.

How reliably can we detect the criteria for automatic enabling of Rugged? Are we certain this can be foolproof? What reference architectures do we have lying around that we could test the automatic enable on? Do we feel it is comprehensive enough of the architectures our customers are running?

It depends on how much we're willing to do. For example, I think if we really wanted to be 100% sure that enabling Rugged will work, we'd need to:

First, check that the given repository path is an NFS mount (e.g. using statfs()).
If it is, iterate through all projects in the database.
For each project that exists via Gitaly, generate the local path and use Rugged to verify that the repository exists.
Do some basic verification (e.g. branch and tags match, etc.).
If we verify N projects are there and match, then we have high confidence that this will work.

If we just do step 1 alone, there's a chance that a stale mount on the application node is sending out a false signal.

What's the failure mode if we misdetect and the upgrade gets applied against an architecture it shouldn't? Is this easy to recover from?

In the best case, all repositories would appear empty, and the admin would immediately know they have to disable the flags again to get things working.

In the worst case, we could introduce a split brain, where projects get created locally on the application server and deviate from the actual data store.

How can we drive to an answer of where this should live? FF vs. Omnibus vs. Rails?

We might be able to do it solely on the Rails side since we have all the data there that we need. However, when would we do this check (e.g. every time Unicorn starts?). Omnibus might be a more intuitive place to check (e.g. after every gitlab-ctl reconfigure), but then Omnibus would then need to somehow toggle the feature flags.

I tend to think that given the risk of the split brain, we might want just to put a friendly banner somewhere to tell the admin to check their configuration rather than try to automatically enable something that could cause more problems than it solves.

Maybe we shouldn't be talking about NFS at all. Gitlab-rails "thinks" the repositories are at path X and Gitaly server Y. (These X and Y are already / still in config/gitlab.yml.) If we can establish that gitlab-rails can access the repositories on Gitaly server Y behind Gitaly's back, using path X, then it is safe to use the Rugged patches.

I don't think we currently have a good way to establish this property. But we could make this possible, for instance by having Gitaly put a special metadata file with a UUID in the root of each of its storages. That UUID would then be exposed via a (new?) Gitaly RPC. Gitlab-rails could then look for that metadata file in two ways, both via the Gitaly RPC and via direct disk access. If it finds the same file then it would be safe to use Rugged.

If gitlab-rails boots before the NFS mount is available it will wrongly assume it cannot use Rugged, but at least this is safe: it will fall back to only using Gitaly. So I think it's reasonable to do this check only on gitlab-rails startup.

@jacobvosmaer-gitlab that seems like a great improvement to me, validating in the rails codebase that we can reach said files in question. To me, that seems like it would eliminate a lot of the potential risk here based on a check for a specific type of environment/file mount.

@jacobvosmaer-gitlab This seems promising. How quickly could we test your proposal to understand its validity?

changed the description

I think to validate the idea you need to:

Come up with a format for the metadata file (JSON? with just one field, "gitaly-filesystem-id")
Come up with a way to ask a Gitaly server what the "filesystem ID of a storage" is. That means either creating a new RPC, or adding fields to an existing RPC. I think we might be able to squeeze this into some RPC that returns general information about the Gitaly server.
Write code in gitlab-rails that reads the metadata file directly and compares it with that RPC
Manually create such a metadata file and play around seeing if the detection works as expected

The only thing that that leaves out is writing code in Gitaly that safely creates such a metadata file on startup. You might as well add that the first time around, because you have to make Gitaly changes anyway.

Thanks @jacobvosmaer-gitlab - I'm going to update the description with this comment. It seems like our next step is implement the proposal and then test that it works. This is marked as %12.0, should it stay there or should we aim for something sooner?

changed the description

Automatically enable Rugged feature flags if NFS is detected

Problem to solve

Proposal

Designs

Child items 0

Activity