Feature Flag reactive_caching_limit_environment Enable

What

~~Remove the :reactive_cache_limit feature flag.~~

The former FF was replaced by reactive_caching_limit_environment. To clarify:

When we enabled :reactive_cache_limit live, it turned out that the cache limits affected not only the Deploy Boards but also other areas of the code. So we decided to take a step back: instead of enabling the FF for the whole codebase, we'll introduce it separately for each ReactiveCaching usage. ~"group::configure" will do it first for Deploy Boards, and then we'll document a suggestion for how other teams can take the same approach and enable it for their own usage of ReactiveCaching. We think this approach is safer because it's modular and because each team has better domain expertise to implement it in its own section of the code that uses ReactiveCaching.
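
To illustrate the per-usage idea, here is a minimal, self-contained sketch. It is not GitLab's ReactiveCaching code: FlagStore, LimitedCaching, EnvironmentLike and OtherUsage are hypothetical names made up for this example. It only shows how a limit could stay off unless a given class declares both a limit and its own flag.

```ruby
# Hypothetical stand-in for a feature flag lookup (not GitLab's Feature API).
module FlagStore
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.enabled?(name)
    @enabled.fetch(name, false)
  end
end

# Simplified stand-in for a caching concern: no limit unless the class opts in.
module LimitedCaching
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Class-level settings, declared separately by each including class.
    attr_accessor :cache_hard_limit, :cache_limit_flag
  end

  # The limit applies only when the class declares one and its own flag is on.
  def cache_limit_enabled?
    limit = self.class.cache_hard_limit
    flag = self.class.cache_limit_flag
    !limit.nil? && limit > 0 && !flag.nil? && FlagStore.enabled?(flag)
  end
end

class EnvironmentLike
  include LimitedCaching
  self.cache_hard_limit = 10 * 1024 * 1024 # 10 MB, the limit stated for environments
  self.cache_limit_flag = :reactive_caching_limit_environment
end

class OtherUsage
  include LimitedCaching
  # No limit or flag declared, so this usage is unaffected.
end
```

In this sketch, calling FlagStore.enable(:reactive_caching_limit_environment) turns the limit on for EnvironmentLike only; OtherUsage stays unlimited until its owners declare a limit and a flag of their own.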

Therefore, I've opened an MR which will remove :reactive_cache_limit, but will also make the feature disabled by default. So after we remove the flag, caching limits will still not be active.

Here's the aforementioned MR: !34202 (merged)

Updated Plan 2020-03-06

Enable and observe on GitLab.com for a week. If everything goes well, we set the flag as enabled by default and merge that change so it's released in %12.9. If there are no complaints from on-premise customers on %12.9, we can consider deleting the FF.

Updated 2020-10-22

stg	dev	gprd
enabled globally	enabled globally	enabled globally

Owners

Team: ~"group::configure"
Most appropriate slack channel to reach out to: #s_configure
Best individual to reach out to: João Cunha (@Alexand)

Expectations

### What are we expecting to happen?

Exceptions: Environment cached data will have a 10 MB limit.

We expect that these limits won't be exceeded, since we focused on setting limits big enough not to affect the current usage of GitLab.

What might happen if this goes wrong?

Services that use ReactiveCaching and go over their limits won't have their data cached, and the BE will silently send a ReactiveCaching::ExceededReactiveCacheLimit to Sentry. From a user's perspective, whatever they're trying to load will simply not load.
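
The failure mode described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual ReactiveCaching implementation: SizeLimitedCache and ExceededCacheLimit are hypothetical names, the reported_errors array stands in for Sentry, and measuring size via JSON encoding is an assumption of this sketch.

```ruby
require "json"

# Hypothetical error, standing in for ReactiveCaching::ExceededReactiveCacheLimit.
class ExceededCacheLimit < StandardError; end

# Hypothetical in-memory cache that refuses oversized payloads.
class SizeLimitedCache
  attr_reader :reported_errors

  def initialize(hard_limit_bytes)
    @hard_limit_bytes = hard_limit_bytes
    @store = {}
    @reported_errors = [] # stands in for an error tracker such as Sentry
  end

  # Returns true when the value was cached, false when it was rejected.
  def write(key, data)
    check_limit!(data)
    @store[key] = data
    true
  rescue ExceededCacheLimit => e
    @reported_errors << e # reported, but not re-raised to the caller
    false
  end

  # Returns nil for anything that was never cached.
  def read(key)
    @store[key]
  end

  private

  # Approximates the payload size by its JSON encoding (assumption of this sketch).
  def check_limit!(data)
    size = JSON.generate(data).bytesize
    return if size <= @hard_limit_bytes

    raise ExceededCacheLimit, "#{size} bytes exceeds limit of #{@hard_limit_bytes}"
  end
end
```

The point of the sketch is that the caller never sees the error: an oversized write just returns false, and later reads keep returning nil, which is why the user-facing symptom is data that never finishes loading.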

List of commands to try to understand why the limit is being reached

# Fetch the environment directly by ID
prd_env = Environment.find 00000 # ADD THE ENV ID

# Or find the project first, then pick the environment by name
project = Project.find 00000 # ADD THE PROJECT ID
environments = project.environments.with_state(:available)
prd_env = environments.find_by(name: "production") # REPLACE production WITH THE ENV YOU WANT

# Load the reactive cache synchronously for the given environment
prd_env_dep_plat = prd_env.deployment_platform
reactive_cache = prd_env_dep_plat.calculate_reactive_cache_for(prd_env)

# Check whether the cache size really exceeds the limit
data_deep_size = Gitlab::Utils::DeepSize.new(reactive_cache, max_size: Environment.reactive_cache_hard_limit)
data_deep_size.valid? # false when the data does not fit within the limit
data_deep_size.size # measured size, to compare against the limit

# Inspect pods, deployments and ingresses to see which data needs better handling, for example pagination.
reactive_cache[:pods]
reactive_cache[:deployments]
reactive_cache[:ingresses]
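
To see quickly which of those keys dominates the payload, a small helper along these lines might help. This is a sketch: rank_keys_by_size is a hypothetical helper, not part of GitLab, and it approximates size by JSON encoding, which is not how Gitlab::Utils::DeepSize measures data, so the numbers are only useful for comparing keys against each other.

```ruby
require "json"

# Returns [key, approximate_bytes] pairs, largest first.
# Size is approximated by the JSON encoding of each value (assumption of this sketch).
def rank_keys_by_size(cache_hash)
  cache_hash
    .map { |key, value| [key, JSON.generate(value).bytesize] }
    .sort_by { |_key, bytes| -bytes }
end

# Made-up sample data, shaped like the hash inspected above.
sample = {
  pods: Array.new(100) { |i| { name: "pod-#{i}" } },
  deployments: [{ name: "web" }],
  ingresses: []
}

rank_keys_by_size(sample).each do |key, bytes|
  puts "#{key}: #{bytes} bytes"
end
```

In the console session above, the same helper could be called as rank_keys_by_size(reactive_cache), assuming the cached values serialize to JSON.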

What can we monitor to detect problems with this?

The best way is to monitor ReactiveCaching::ExceededReactiveCacheLimit on Sentry, although I believe we won't get this quick feedback for on-premise installations.
The number of Sidekiq jobs in the reactive_caching queue might increase, since the FE will keep trying to load data that never gets cached in cases where the limit is reached.

Beta groups/projects

If applicable, any groups/projects that are happy to have this feature turned on early. For example, some organizations may wish to test big changes they are interested in with a small subset of users ahead of time.

gitlab-org/gitlab project

Roll Out Steps

Edited Nov 17, 2020 by João Alexandre Cunha