Geo: preserve in-flight CI builds during primary-secondary failover or DR
In DR situations we have two possibilities:

- If you use Geo with geographically distant locations: you can preserve the distinct hostnames, like `us.gitlab.example.com` and `jp.gitlab.example.com`, and just promote one of the secondaries as your new primary temporarily (which you also intend to roll back in the future, since the installations are geographically based).
- If you use Geo with nearby, distinct availability zones: you may want to promote your `backup.gitlab.example.com` as your new primary and permanently switch its hostname to `gitlab.example.com`.
There is also the use case where you want to use Geo to migrate from one infrastructure to another (for example, from the cloud to on-premises, or between cloud providers).
With our GCP migration, because we don't have the read-only mode ready yet, we are considering firewalling the whole primary instance to prevent changes, so the secondary can catch up.
Even with the proposed read-only mode, we still don't have a solution to prevent losing CI builds and artifacts that start before the switch to read-only.
One thing we can do after we switch to read-only mode is to stop sending new builds to CI workers. (We only control our own pool, but since people can still plug in whatever worker they please, this has to be done on the API side.)
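The API-side gate could look roughly like the sketch below. This is illustrative only: the module, method names, and status codes are assumptions, not existing GitLab code; the idea is simply that while the instance is read-only, requests for *new* builds get turned away while everything else proceeds.

```ruby
# Hypothetical sketch of an API-side gate for the build queue during
# read-only mode. None of these names exist in GitLab today.
module BuildQueueGate
  READ_ONLY = :read_only
  ACTIVE    = :active

  # Decide the HTTP status for a runner asking for a new build.
  # While read-only, answer as if no builds are available so well-behaved
  # workers keep polling idle instead of erroring out.
  def self.status_for(mode, requesting_new_build:)
    if mode == READ_ONLY && requesting_new_build
      204 # "no builds available"
    else
      201 # hand out a build as usual
    end
  end
end
```

Answering "no builds" (rather than an error) for new-build requests keeps third-party workers we don't control from failing loudly, while updates for already-running builds are unaffected by this gate.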
To avoid losing in-flight builds, we need to coordinate with the CI worker. We can reply with something like `503 Service Unavailable`, which will make the CI worker buffer the traces and any other updates it would send to GitLab, and keep retrying using an exponential backoff strategy.
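The worker-side behaviour described above could be sketched as follows. All names here are illustrative assumptions, not actual gitlab-runner code: on each 503 the worker keeps the undelivered trace chunk and backs off exponentially; on the first success it flushes everything it buffered.

```ruby
# Illustrative sketch of buffering + exponential backoff on 503.
class TraceBuffer
  BASE_DELAY = 1.0   # seconds before the first retry
  MAX_DELAY  = 60.0  # cap so the backoff doesn't grow unbounded

  def initialize
    @pending = []   # trace chunks we could not deliver yet
    @attempt = 0
  end

  # Delay before the next retry: 1s, 2s, 4s, ... capped at MAX_DELAY.
  def next_delay
    [BASE_DELAY * (2**@attempt), MAX_DELAY].min
  end

  # GitLab answered 503: keep the chunk, return how long to wait.
  def on_service_unavailable(chunk)
    delay = next_delay
    @pending << chunk
    @attempt += 1
    delay
  end

  # GitLab accepts updates again: flush the buffer and reset the backoff.
  def on_success
    flushed  = @pending
    @pending = []
    @attempt = 0
    flushed
  end
end
```

In a real worker you would also add jitter to the delays so a fleet of workers doesn't retry in lockstep, but the core contract is the same: nothing is dropped while GitLab answers 503.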
For situations where we demoted the primary as well, the API should start sending `307 Temporary Redirect`, so the CI worker can send updates to the new hostname without having to be re-configured for the new endpoint.
Because the redirect is "Temporary", if the new endpoint goes dark, the worker should retry the original one again.
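That endpoint-selection logic can be summed up in a few lines. Again, a sketch under assumed names: a `307` updates where traffic goes for now, and losing the redirect target sends the worker back to its configured original.

```ruby
# Illustrative sketch of 307 handling with fallback to the original endpoint.
class EndpointResolver
  def initialize(original)
    @original = original  # the endpoint from the worker's configuration
    @redirect = nil       # where a 307 told us to go, if anywhere
  end

  # The endpoint the worker should currently talk to.
  def current
    @redirect || @original
  end

  # A 307 Temporary Redirect tells us where updates should go for now.
  def on_redirect(location)
    @redirect = location
  end

  # "Temporary" means the original is still authoritative: if the redirect
  # target goes dark, fall back rather than giving up.
  def on_endpoint_unreachable
    @redirect = nil
  end
end
```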
The CI worker may also gain a configuration option to keep retrying (when receiving 503) only for X amount of time, so it doesn't wait forever.
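The proposed bounded-retry knob could be as simple as a deadline check the retry loop consults. `retry_deadline_seconds` is a hypothetical setting, not an existing gitlab-runner option:

```ruby
# Sketch of a configurable retry window: keep retrying on 503 only until
# the deadline passes. The setting name is hypothetical.
class RetryWindow
  def initialize(retry_deadline_seconds, started_at: Time.now)
    @deadline = started_at + retry_deadline_seconds
  end

  # true while the worker should keep retrying, false once time is up
  def retry?(now = Time.now)
    now < @deadline
  end
end
```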
WDYT @stanhu @nick.thomas @dbalexandre @ash.mckenzie @toon @digitalmoksha @ayufan