Performance issues have been identified when using GitLab with NFS in recent versions of GitLab. We have back ported patches which add back Rugged code to improve performance. However, these patches are behind a feature flag because in certain configurations this would be a breaking change.
Come up with Gitaly code that safely creates a metafile on startup.
Come up with a format for the metadata file (JSON? with just one field, "gitaly-filesystem-id")
Come up with a way to ask a Gitaly server what the "filesystem ID of a storage" is. That means either creating a new RPC, or adding fields to an existing RPC. I think we might be able to squeeze this into some RPC that returns general information about the Gitaly server.
Write code in gitlab-rails that reads the metadata file directly and compares it with that RPC
Manually create such a metadata file and play around seeing if the detection works as expected
An alternative approach might be detecting the configuration in Omnibus? /cc @marin@sytses
James Ramsay (ex-GitLab)changed title from Automaticall enable Rugged feature flags if NFS is detected to Automatically enable Rugged feature flags if NFS is detected
changed title from Automaticall enable Rugged feature flags if NFS is detected to Automatically enable Rugged feature flags if NFS is detected
I was just going to say, doing this in Omnibus (or even within Rails) might be better because Gitaly servers don't have anyway to toggle feature flags.
@stanhu feel free to edit the description and propose what you think would be a better approach since you'll know the technical considerations better then me.
Hi all, jumping in while @jramsay is on vacation. I have a few questions I'm hoping to get answered by the team:
How reliably can we detect the criteria for automatic enabling of Rugged? Are we certain this can be foolproof? What reference architectures do we have lying around that we could test the automatic enable on? Do we feel it is comprehensive enough of the architectures our customers are running?
What's the failure mode if we misdetect and the upgrade gets applied against an architecture it shouldn't? Is this easy to recover from?
How can we drive to an answer of where this should live? FF vs. Omnibus vs. Rails?
One more from me:
What do we mean by enabling it from "Omnibus" is this when the user installs or upgrades the GitLab Instance? We also need to take into account that NFS can be configured or added after upgrade/installation. Unless Omnibus is continuously detecting NFS usage it would be safer to implement it in Rails.
How reliably can we detect the criteria for automatic enabling of Rugged? Are we certain this can be foolproof? What reference architectures do we have lying around that we could test the automatic enable on? Do we feel it is comprehensive enough of the architectures our customers are running?
It depends on how much we're willing to do. For example, I think if we really wanted to be 100% sure that enabling Rugged will work, we'd need to:
First, check that the given repository path is an NFS mount (e.g. using statfs()).
If it is, iterate through all projects in the database.
For each project that exists via Gitaly, generate the local path and use Rugged to verify that the repository exists.
Do some basic verification (e.g. branch and tags match, etc.).
If we verify N projects are there and match, then we have high confidence that this will work.
If we just do step 1 alone, there's a chance that a stale mount on the application node is sending out a false signal.
What's the failure mode if we misdetect and the upgrade gets applied against an architecture it shouldn't? Is this easy to recover from?
In the best case, all repositories would appear empty, and the admin would immediately know they have to disable the flags again to get things working.
In the worst case, we could introduce a split brain, where projects get created locally on the application server and deviate from the actual data store.
How can we drive to an answer of where this should live? FF vs. Omnibus vs. Rails?
We might be able to do it solely on the Rails side since we have all the data there that we need. However, when would we do this check (e.g. every time Unicorn starts?). Omnibus might be a more intuitive place to check (e.g. after every gitlab-ctl reconfigure), but then Omnibus would then need to somehow toggle the feature flags.
I tend to think that given the risk of the split brain, we might want just to put a friendly banner somewhere to tell the admin to check their configuration rather than try to automatically enable something that could cause more problems than it solves.
Maybe we shouldn't be talking about NFS at all. Gitlab-rails "thinks" the repositories are at path X and Gitaly server Y. (These X and Y are already / still in config/gitlab.yml.) If we can establish that gitlab-rails can access the repositories on Gitaly server Y behind Gitaly's back, using path X, then it is safe to use the Rugged patches.
I don't think we currently have a good way to establish this property. But we could make this possible, for instance by having Gitaly put a special metadata file with a UUID in the root of each of its storages. That UUID would then be exposed via a (new?) Gitaly RPC. Gitlab-rails could then look for that metadata file in two ways, both via the Gitaly RPC and via direct disk access. If it finds the same file then it would be safe to use Rugged.
If gitlab-rails boots before the NFS mount is available it will wrongly assume it cannot use Rugged, but at least this is safe: it will fall back to only using Gitaly. So I think it's reasonable to do this check only on gitlab-rails startup.
@jacobvosmaer-gitlab that seems like a great improvement to me, validating in the rails codebase that we can reach said files in question. To me, that seems like it would eliminate a lot of the potential risk here based on a check for a specific type of environment/file mount.
Come up with a format for the metadata file (JSON? with just one field, "gitaly-filesystem-id")
Come up with a way to ask a Gitaly server what the "filesystem ID of a storage" is. That means either creating a new RPC, or adding fields to an existing RPC. I think we might be able to squeeze this into some RPC that returns general information about the Gitaly server.
Write code in gitlab-rails that reads the metadata file directly and compares it with that RPC
Manually create such a metadata file and play around seeing if the detection works as expected
The only thing that that leaves out is writing code in Gitaly that safely creates such a metadata file on startup. You might as well add that the first time around, because you have to make Gitaly changes anyway.
Thanks @jacobvosmaer-gitlab - I'm going to update the description with this comment. It seems like our next step is implement the proposal and then test that it works. This is marked as %12.0, should it stay there or should we aim for something sooner?
Yes that's a hard question. I think we need to have an extra check at the same point where we now check the feature flag. The repository that we're about to call the RPC on would be an input to that check, e.g. can_use_rugged?(repository). This check should then do a lookup somewhere to see if repository.storage can be reached with rugged. The result of that check should be cached for the duration of the process.
Yes that's a hard question. I think we need to have an extra check at the same point where we now check the feature flag. The repository that we're about to call the RPC on would be an input to that check, e.g. can_use_rugged?(repository). This check should then do a lookup somewhere to see if repository.storage can be reached with rugged. The result of that check should be cached for the duration of the process.
That makes sense. I think this might be done in an initializer (e.g. config/initializers), and we should also check that the feature flag is NOT set. If it's explicitly disabled or enabled, we should go with that value.
yes I think we can backport this since it's still protected behind feature flags. My thinking is that once we have this change in, we can change the rugged feature flags to be default on.
@stanhu@johncai what do you think about back porting this change to 11.11 so that customers that take their time before upgrading to 12.X benefit from this?
@jramsay I think that makes sense, although I'd like to get some feedback in 12.1 to see whether we created some issues with this change. I think we'll want to include https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30871 as well. That's not in 12.1 at the moment, but I think we should get an exception for it in a patch release.