Skip to content

Version Check - Handle errors with API response

Zack Cuddy requested to merge version-check-reactive-cache-no-nil-value into master

What is the issue?

This change fixes a bug first found on this thread: !103248 (comment 1174893288)
Shoutout to @rchanila for standing up for this one

ReactiveCache is a mechanism we use to store data on on the instance to save on expensive queries or external api requests. This cache typically has a set lifetime and logic to when to invalidate it early if needed.

The issue was discovered due to an inconsistency with how ReactiveCache manages nil data and what nil means

When ReactiveCache is initialized a nil value is placed in the cache as a placeholder. Asynchronously the cache value is then calculated and filled in when it resolves. Future reads then will happen synchronously and if the cache value is nil the reading method will not yield the value: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/reactive_caching.rb#L55. By not yielding a nil value then lets the synchronous read know that the data is not done being fetched asynchronously.

However, we do not block nil values being actually written to the cache: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/reactive_caching.rb#L98. This causes an issue where if something bad happens we get stuck in a situation where the synchronous reads aren't being resolved even though the asynchronous write is done. Essentially locking us out from invalidating the cache when it is nil and instead being stuck with a bad cached value.

What does this MR do and why?

There is a lot of abstraction and asynchronous logic involved in ReactiveCache. The implementation of Version Check with ReactiveCache is a bit of an edge case as it wraps an external API endpoint rather than an expensive DB query. Additionally version check has an incredibly long cache life.

Rather than refactor ReactiveCache for this edge case we can utilize a non-nil value in the cache to signify bad data. In this MR we chose an error hash { error: 'there is an error' }. From there when a synchronous read happens we know that if the error hash comes back then we hit an unexpected scenario and we should re-kick off the cache.

Version Check

Version Check is a very interesting use case for ReactiveCache as the data only gets invalidated/changed when:

  1. The instance is upgraded to a new GitLab Version
  2. GitLab releases a new version

Because of this the cache is refreshed only every 12 hours or if it never gets read it will completely reset after 7 days. Storing nil values are incredibly damaging as there is no way to get them out programmatically currently. A nil value can get written if the external api fails or the network request fails.

Screenshots or screen recordings

N/A

How to set up and validate locally

  1. Checkout this branch
  2. Apply Patch 1 (dropdown on next line):
Patch 1
diff --git a/lib/version_check.rb b/lib/version_check.rb
index eddcddbeb49..0b5e0eddbc0 100644
--- a/lib/version_check.rb
+++ b/lib/version_check.rb
@@ -61,14 +61,14 @@ def self.from_cache(*)
   end
 
   def id
-    Gitlab::VERSION
+    'codereview2'
   end
 
   def calculate_reactive_cache(*)
     response = Gitlab::HTTP.try_get(self.class.url, headers: self.class.headers)
 
     case response&.code
-    when 200
+    when 999
       Gitlab::Json.parse(response.body)
     else
       { error: 'version check failed', status: response&.code }
  1. Hit the /admin/version_check.json endpoint
  2. Ensure it returns null
  3. Wait a minute or so (this will allow sidekiq to run the async job to fetch the data)
  4. Revert Patch 1 and apply Patch 2 (dropdown on next line)
Patch 2
diff --git a/lib/version_check.rb b/lib/version_check.rb
index eddcddbeb49..3a1270666f9 100644
--- a/lib/version_check.rb
+++ b/lib/version_check.rb
@@ -61,7 +61,7 @@ def self.from_cache(*)
   end
 
   def id
-    Gitlab::VERSION
+    'codereview2'
   end
 
   def calculate_reactive_cache(*)
  1. Hit the /admin/version_check.json endpoint
  2. Ensure it returns null
  3. Wait a minute or so (this will allow sidekiq to run the async job to fetch the data)
  4. Ensure it returns an object of version check data

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Zack Cuddy

Merge request reports