Currently we dynamically load jemalloc via the LD_PRELOAD environment variable, but this causes seg faults on systems with certain applications running (e.g. anti-virus software like Cylance: #3813 (closed), #4330 (closed)).
I suspect we'll have better success if the Ruby interpreter is actually compiled with the --with-jemalloc configure flag. Note that Redis already does this by default.
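For reference, a minimal sketch of how to tell the two approaches apart from inside the interpreter, assuming a Linux host; the exact RbConfig key that ends up carrying -ljemalloc varies between Ruby versions, so it checks a few:

```ruby
# Run with /opt/gitlab/embedded/bin/ruby on a Linux host.
require 'rbconfig'

# If Ruby was configured with --with-jemalloc, the flag shows up in the
# recorded configure arguments and -ljemalloc in the link flags (the exact
# key differs slightly between Ruby versions, so check a few).
built_in = RbConfig::CONFIG['configure_args'].to_s.include?('jemalloc') ||
           RbConfig::CONFIG['LIBS'].to_s.include?('jemalloc') ||
           RbConfig::CONFIG['MAINLIBS'].to_s.include?('jemalloc')

# /proc/self/maps lists every shared object mapped into this process, so a
# libjemalloc injected via LD_PRELOAD shows up here even when Ruby was not
# linked against it at build time.
preloaded = File.read('/proc/self/maps').include?('jemalloc')

puts "linked at build time: #{built_in}"
puts "mapped into process (e.g. via LD_PRELOAD): #{preloaded}"
```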
Since the issue was unassigned and set for the current milestone, I have opened !3745 (closed) as a start. However, I will still be trying to get my other MRs in first before focusing more on this one (i.e. if someone wants to grab that MR and test it out, feel free to).
@stanhu If we do this in the Ruby interpreter itself, we can get rid of the LD_PRELOAD injection, right? Also, this means jemalloc becomes non-negotiable (for anything that uses Ruby), and gitlab_rails['enable_jemalloc'] becomes useless. I assume that warrants a deprecation process.
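For context, a sketch of the relevant /etc/gitlab/gitlab.rb line (value shown is illustrative); if jemalloc is linked in at build time, flipping it would no longer change anything:

```ruby
# /etc/gitlab/gitlab.rb (illustrative)
# Today this controls whether the LD_PRELOAD injection of libjemalloc happens.
# If the Ruby interpreter is compiled with jemalloc instead, setting this to
# false no longer has any effect, hence the deprecation question.
gitlab_rails['enable_jemalloc'] = false
```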
That makes it tricky - once we do this during the build process, jemalloc becomes mandatory and the setting becomes useless. This goes against our deprecation policy for configuration settings. However, if we follow that policy, we can't use jemalloc at build time until 13.0, which feels like too long a wait.
@twk3 @ibaum Right now, I am leaning towards breaking policy and making jemalloc mandatory. I will give it more thought, but wanted to loop you in, in case you had concerns.
@balasankarc I know some users disabled jemalloc with that setting when Cylance and other anti-virus scanners caused GitLab to seg fault (see the issues linked in this issue description). Cylance said they were going to investigate this issue, but we never heard back. We should ask Cylance for a trial to verify that compiling the Ruby interpreter with jemalloc makes the problem go away.
@dstanley I think you were on the support call before. Maybe it's time to engage with them now?
@balasankarc we need to confirm this resolves the problem. Then I think moving forward with it is fine, as the option to disable jemalloc was really our previous fix. But we need to know that this change removes the need for that option.
@dstanley In #3813 (closed), I saw you communicated with Cylance. Can we revive that conversation? If they were able to reproduce the issue with LD_PRELOAD, we can give them a package from this MR and confirm whether it solves the issue. Or like @stanhu said, we should ask for a trial to test it ourselves.
I looked back through the old case and the call we had with Cylance, and I have email addresses of some people at Cylance, but their processes really did not support us reaching out to them directly. Our mutual customer had to arrange the call. I think it would be more straightforward for us to ask for a trial. Then if there's an issue we could go to them directly with that relationship established.
However, I recommend trying this out on a separate machine (not their production instance). This is because the packages from the above pipelines are not "release packages", which means the version they report will not follow semver. It will be something like 12.5.0+rfbranch.blahblah-ee.0 (we use information like "latest git tag in the repo", "commit sha", "pipeline id" etc. to generate this version info). This can cause issues when they try to upgrade to the next 12.4.x version (if/when we release it) because it will look like a "downgrade".
So, can you ask them whether, if we provide them with a test package, they can try it out on a separate machine with Cylance?
The CylancePROTECT agent doesn't lend itself very well to one-off testing due to the need to communicate with a cloud resource to receive policy instructions. It would be easier to work with someone who already has Cylance configured and GitLab running on a machine, as all the necessary preliminary configuration is already done. Based on information reported to us, build problems have been resolved by creating memory protection exclusions in the Cylance policy.
We are currently writing up a KB article to document the known processes that should be excluded to avoid any conflicts with GitLab. Those processes are:
/opt/gitlab/embedded/bin/ruby
I am happy to test the latest GitLab CE version on Ubuntu with Cylance. My admins put Cylance on the server and now I can't run the Omnibus update anymore due to a segmentation fault.
Just found this issue. We also found that gitlab-exporter was not running with jemalloc in Omnibus (it does in Charts, because there we use a Ruby that was compiled with --with-jemalloc).
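For anyone who wants to double-check which processes actually have jemalloc loaded, a rough sketch (Linux only; the gitlab-exporter process-name pattern is an assumption, and reading another user's /proc/<pid>/maps may require root):

```ruby
# Sketch: does a running gitlab-exporter process have libjemalloc mapped in?
pids = `pgrep -f gitlab-exporter`.split.map(&:to_i).reject(&:zero?)

if pids.empty?
  puts 'no gitlab-exporter process found'
else
  pids.each do |pid|
    maps = File.read("/proc/#{pid}/maps")
    status = maps.include?('jemalloc') ? 'loaded' : 'NOT loaded'
    puts "pid #{pid}: jemalloc #{status}"
  end
end
```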
We're in the process of changing the Omnibus defaults for gitlab-exporter to set LD_PRELOAD in !4922 (merged). Should we tackle this issue instead? I'm a bit concerned about scope creep, with this requiring a settings deprecation and also affecting the main Rails app. cc @ayufan
Maybe something for the next 2-3 milestones @fzimmer @craig-gomes? I feel like this would fit better into ~"group::memory"'s backlog than ~"group::distribution"'s.