The original issue of using unsafe_serialization_hash: true in JsonCache for the License model has resurfaced; the option needs to be removed again, this time with a different approach.
In order to check whether any new changes result in similar E2E failures, at least the e2e:package-and-test pipeline job should be triggered manually (see this comment).
@jtapiab Here's the follow-up issue to give this another shot after restoring the original logic in the License model. I applied the same labels as https://gitlab.com/gitlab-org/gitlab/-/issues/354299. Please update them to something more appropriate. It's probably worth checking with your manager about the priority for this.
We should stop serializing encrypted attributes; it seems counter-productive to encrypt them in the database only to store them in the clear in Rails.cache (Redis).
Are there any concerns about caching the current license? No encrypted attributes are being cached AFAIK, but I'm still unsure about the security gaps we could potentially have if we keep the original implementation (if there are any). Thank you!
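For anyone skimming this thread, here's a toy illustration (not GitLab code; all names are made up) of the concern quoted above: if a model encrypts an attribute at rest but serializes the decrypted value into a shared cache, the plaintext ends up stored in Redis anyway.

```ruby
require 'json'

class FakeLicenseRecord
  def initialize(token)
    @token = token
  end

  # What the database would store (trivially "encrypted" for the sketch).
  def encrypted_token
    @token.reverse
  end

  # A serializable_hash-style dump exposes the decrypted attribute.
  def serializable_hash
    { 'token' => @token }
  end
end

redis = {} # stand-in for Rails.cache (Redis)
record = FakeLicenseRecord.new('hunter2')
redis['license'] = JSON.generate(record.serializable_hash)
# redis['license'] now holds the plaintext token, defeating the at-rest encryption.
```

The point being: the risk only exists for attributes that are encrypted in the first place, which is exactly the question being asked about License here.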
Is it just because it's not on master yet? I tried to gather the context around this issue, but it spans multiple issues and MRs, and I may have gotten lost along the way.
There was a long history behind those MRs. Sorry about that.
Apologies if I'm misunderstanding here, but it seems like there is no unsafe_serialization_hash: true in JsonCache at the moment
That's correct. The unsafe_serialization_hash: true was removed in master. AFAIK, that option was created to avoid breaking Geo. This MR https://gitlab.com/gitlab-org/gitlab/-/issues/354299 removed the Geo method from JsonCache, so we're not serializing sensitive attributes anymore.
The question here is about the current license (latest, non-expired, and valid license), which is currently stored in JsonCache. Is it okay to keep serializing the license? There are no encrypted attributes, tokens, or other attributes being serialized. This is how the license is cached https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/license.rb#L53-L55
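For context, here's a rough sketch of the pattern linked above (an assumed shape for illustration, not the actual code at ee/app/models/license.rb#L53-L55): the current license is fetched through a JSON-backed cache, falling back to a database load only on a cache miss.

```ruby
require 'json'

# Minimal stand-in for Gitlab::JsonCache: values are stored as JSON strings
# in a Redis-like backend and parsed back on read.
class TinyJsonCache
  def initialize
    @backend = {} # stand-in for Redis
  end

  def fetch(key)
    if @backend.key?(key)
      JSON.parse(@backend[key])
    else
      value = yield # e.g. a database load of the latest valid license
      @backend[key] = JSON.generate(value)
      value
    end
  end
end

cache = TinyJsonCache.new
license = cache.fetch('current_license') { { 'plan' => 'ultimate' } }
```

Since only plain attributes go through JSON.generate here, nothing encrypted is ever written to the backend, which matches the observation that no encrypted attributes, tokens, etc. are being serialized for License.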
@jtapiab Thank you for driving the resolution of these issues forward!
I had a few iterations of how to answer the questions here, and eventually came to the following conclusion. Please correct me if I have something wrong.
It appears that License data is not sensitive enough for us to have decided to encrypt it in the database (which I find acceptable, by the way). And since it doesn't use attr_encrypted, it is unaffected by the addition of SensitiveSerializableHash, therefore unsafe_serialization_hash: true can remain removed from JsonCache. This explains why no License-related specs broke when SensitiveSerializableHash was added. There was no effect.
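To make the reasoning above concrete, here's a hypothetical sketch of the idea behind SensitiveSerializableHash: strip attr_encrypted-backed attributes from serializable_hash unless the caller explicitly opts out. The module shape, attribute names, and sensitive-attribute list are illustrative, not GitLab's implementation.

```ruby
# Illustrative filter: drop known-sensitive attributes from a
# serializable_hash-style dump unless unsafe_serialization_hash is passed.
module SensitiveSerialization
  SENSITIVE_ATTRS = %w[token encrypted_token].freeze

  def serializable_hash(unsafe_serialization_hash: false)
    attrs = attributes
    return attrs if unsafe_serialization_hash

    attrs.reject { |key, _| SENSITIVE_ATTRS.include?(key) }
  end
end

class FakeModel
  include SensitiveSerialization

  def attributes
    { 'plan' => 'premium', 'token' => 's3cr3t' }
  end
end
```

Under this sketch, a model with no sensitive attributes passes through the filter unchanged, which lines up with the point that License (no attr_encrypted) was never affected and no License specs broke.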
Now I'm wondering if we needed to remove the License from JsonCache in the first place.
So I agree with your thought here; we didn't have to modify License in the original issue. I apologize for basically not looking into it during issue breakdown, nor realizing this until now.
I appreciate your thoughts here @mkozono. Thank you for taking the time to review the discussion.
This explains why no License-related specs broke when SensitiveSerializableHash was added. There was no effect.
That's correct
So I agree with your thought here; we didn't have to modify License in the original issue. I apologize for basically not looking into it during issue breakdown, nor realizing this until now.
No worries at all. I noticed this too late, too
I suggest that the only thing remaining to do from the original issue is to remove the unsafe_serialization_hash option from SensitiveSerializableHash so that no one reintroduces the issue. I believe it's safe to do this publicly.
Exactly, I think that your insights here are aligned with what I thought too. I will create a follow-up issue to remove the unsafe_serialization_hash option from SensitiveSerializableHash.
I think it's okay to close this issue since there is too much risk in trying to remove the License from JsonCache, and it seems there is no need to do so. I'll wait for AppSec confirmation so we can close this out.
The Support team has been encountering customers experiencing strange licensing behavior that appears related to the code changes from gitlab-org/security/gitlab#735 and gitlab-org/gitlab#354299. Similar to the problems covered in #375886 (closed), customers uploading a new license may see their system behave as if no license was installed, or continue to use the feature set / user counts of the previous license.
This license caching bug made it into the security releases for 15.4.1, 15.3.4, and 15.2.5.
Potential workaround:
It appears a workaround can be achieved with License.reset_current from the Rails console. The rake task gitlab-rake cache:clear:redis, which clears the Redis cache, is insufficient on its own: the caching failure occurs in Rails process memory, therefore a Puma restart is required, although a full GitLab restart is most ideal.
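To explain why clearing Redis alone doesn't help, here's a sketch (an assumption about the mechanism, not GitLab's actual code) of per-process memoization: each Puma worker keeps its own copy of the current license in Ruby memory, so stale data survives a Redis flush until the memo is reset or the process restarts.

```ruby
# Toy model of a memoized License.current backed by a Redis-like cache.
class FakeLicense
  @@redis = { 'license' => 'old' } # stand-in for the Redis-backed JsonCache

  def self.current
    # Process-local memoization: survives Redis flushes.
    @current ||= @@redis['license']
  end

  def self.reset_current
    @current = nil
  end

  def self.clear_redis!
    @@redis.delete('license')
  end
end
```

Flushing the fake Redis leaves FakeLicense.current unchanged; only reset_current (or, in production terms, restarting the process) discards the stale in-memory value.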
@jtapiab Considering the impact this has on customers effectively losing access to features and Support getting an increased number of tickets requiring urgent assistance, can the Priority and Severity of this issue be upgraded?
This security issue is to find a better strategy for handling the caching of License.current; it is not intended to fix the licensing behavior problem, which was fixed by this MR !100112 (merged). That MR reverted the change causing the problem that customers are facing. However, it didn't apply specifically to versions 15.4.1, 15.3.4, and 15.2.5.
We are looking to backport !100112 (merged) to the 15.4 stable branch with this MR: !101434 (merged) (which is currently in review) and backport that bugfix to the older versions 15.3.4 and 15.2.5 too.
A quick update about this. The 15.4 stable branch is patched with the license bugfix. I have created a request to backport to older versions (15.3 and 15.2), which needs to be reviewed by the release managers and product: https://gitlab.com/gitlab-org/release/tasks/-/issues/4496
At this point I've been encountering customers hitting this bug semi-regularly; I've had at least 3 customers this week alone (2022-11-14). I think we should have communicated this a bit louder to users. It's kind of a system-breaking bug that was introduced into multiple releases, and it has generated a fair amount of support tickets, not to mention it was the source of an incident on dotcom.
There is only one uncovered version at this point: 15.2.5, which is outside the maintenance policy. In that case, we will document the workaround of resetting the cache here: https://docs.gitlab.com/ee/update/ and suggest an upgrade to version 15.3.4, which is the next stable version that includes the patch.
Thanks for the update @jtapiab. I'll create an upgrade post for 15.2.5 to let customers know of the impact. Do we have a root cause of the problem? Do customers need to be applying new licenses to trigger this issue?
@ofernandez2 I'm hoping you can help here, even though this issue is now closed. We need to be able to notify all of our customers that they need to upgrade to a release that contains the fix to this problem or else they will continue encountering it and having to restart GitLab. Can you help? Who else do we need to involve? What's the right forum?
This is urgent and important, as we continue to have customers reporting the problem. And today one customer opened an emergency for it.
We need to be able to notify all of our customers that they need to upgrade to a release that contains the fix to this problem or else they will continue encountering it and having to restart GitLab.
@mdunninger does Support have precedent and/or a documented process for doing this? I imagine that this is something we've done in the past.
And, to clarify, we would only want to notify customers that we know are on an impacted release, correct? I don't know if we reliably have that information, so that may be a challenge.
cc: @supadhyaya @justinfarris in case you know whether we've done customer comms around something like this in the past.
The Handbook documents three options for sending notices to customers, depending on the number of customers involved.
It would be best to send notices only to customers on an impacted release, but as you noted, we may not be able to do that. @jcolyer, reasonable options here? If none, we may have to consider sending a notice to all SM customers.
Omar, are the impacted releases still available for download? If so, should we perhaps consider pulling them?
I suspect a blog post (or the like) may be the best option if the issue is across multiple versions. That doesn't mean we can't also notify known affected users as well.
If we rely on targeted comms only, future users that upgrade into affected versions will not have a chance to consume that information.
I believe (thank you, @kevenhughes) that the affected versions are 15.2.5, 15.3.4, 15.4.1 and 15.4.2, with fixes delivered in 15.3.5 and 15.4.3.
And it is precisely out of concern for customers who might upgrade to an affected release in the future that I was asking about pulling those releases.
@ofernandez2 Unfortunately, we don't have a precise way to notify all customers that may be on a specific version (as you said). Many customers don't share data, so we can't know we've informed 100% of them. Although, given they are 15.x versions, we may be a lot closer than ever before with Cloud License data.
@supadhyaya highlighted what's been used in the past. I'd also loop in @spatching @dsakamoto / the CS team too. They might have a more targeted approach we can consider.
If this is time-sensitive, I recommend just doing that. We should also post the patch release in the releases blog.
If not we could look into how much CL coverage we have for SM accounts running those versions and try targeting them.
@justinfarris While I don't know the full context of this issue, I assume it's urgent/time-sensitive and that we want to notify customers based on their self-managed version (?). We can target specific releases with usage data, for which we have ~70% of our ARR. @jdbeaumont @spatching
@dsakamoto We're closer to 50% of self-managed ARR, but point taken. Either way, we do have the version for half of our self-managed customers, so we can at least help and alert them. I would reckon this is a great reason to highlight that this is a core purpose for sharing Operational Metrics, too.