API requests on GitLab.com should contain a header or other load-balancer-routable indicator of the origin application stage
GitLab.com relies on a canary stage as a QA mechanism for upcoming releases.
Formally, a stage is a partition of the application running a single version of the application, but sharing state (Redis, cache, Postgres, etc.) with other stages in the same environment. Currently our production environment has a `cny` stage and a `main` stage. Each stage can be in a transitional period for up to an hour during deployments, as new versions of the application roll out. With two stages, it's therefore possible for up to four different releases to be running concurrently in a single environment, sharing the same state (e.g. database, Redis, etc.).
There are several ways that requests get routed to the canary stage:
- Cookies: The `gitlab_canary` cookie. This cookie can have three states: `gitlab_canary=true`, `gitlab_canary=false`, and cookie not set.
- Path: The path of the request. Certain projects (including `gitlab-org/*`, `gitlab-com/*` and some others) will be automatically routed to canary.
- Environment Status: when canary is set to drain, no requests are sent to canary.
| | `gitlab_canary` missing | `gitlab_canary=true` | `gitlab_canary=false` |
|---|---|---|---|
| is canary path | | | |
| is not canary path | | | |
This proposal deals with the mismatch when `gitlab_canary` is missing, between a page, which is a canary path (
Edge Cases
This proposal deals with the situation in which the following conditions are met:
- The `gitlab_canary` cookie is missing (or corrupt, i.e. not `true` or `false`, or otherwise not readable in HAProxy, e.g. too long).
- The user is viewing a canary path page (e.g. `/gitlab-org/gitlab/*`).
- The user attempts to perform an operation from the page that makes an AJAX call to an `/api` endpoint that is not prefixed by the same path as the page itself. Examples of these operations include calls to `/api/graphql`, MR approval, pipeline execution, and more.
In #219478 (closed), a breaking authentication change occurred that affected AJAX calls. AJAX requests from canary to canary were fine, but requests from canary to main stage failed with a 401.
As mentioned above, considering that up to four versions of the application could be running concurrently in a single environment, we should do our best to ensure that no change is breaking, or at least supports backwards compatibility for a transition period.
However, as a second layer of protection, we should also endeavour to ensure that requests from a particular version of the JavaScript client get routed through to the same version of the backend application.
At present, this will not occur whenever the three edge case conditions described above are met. These conditions may seem unlikely, but as the incident on Friday showed, they can still lead to considerable disruption.
Proposal
All AJAX calls made from the canary JavaScript application should be routed to the canary stage backend application, on a best-effort basis.
Why best-effort? Because, if we drain the canary environment, we cannot route canary AJAX requests from currently running browsers to the canary stage. In this case, routing the requests to the main stage is our only option.
Implementation
When the JavaScript application boots as the canary release, it should use a middleware to inject a header or query parameter into each AJAX request.
At the HAProxy level, this could be used to route the AJAX request to the canary stage when appropriate.
For example, when running canary, an `X-GitLab-Stage: cny` header could be injected into each AJAX request. HAProxy could then use this header to route the request to the canary stage API.
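The routing decision this implies can be sketched as a small function. This is a model of the logic only, not HAProxy configuration. The `RoutingInput` shape and `chooseStage` name are hypothetical, and the precedence between the cookie, the header, and the path is an assumption: the sketch lets an explicit cookie win, so the header only matters in the edge case where the cookie is missing or unreadable.

```typescript
type Stage = 'cny' | 'main';

// Hypothetical model of what the load balancer knows about a request.
interface RoutingInput {
  canaryDraining: boolean; // environment status: canary set to drain
  canaryCookie?: string;   // value of gitlab_canary, if readable
  stageHeader?: string;    // value of X-GitLab-Stage, if present
  isCanaryPath: boolean;   // e.g. /gitlab-org/gitlab/*
}

function chooseStage(input: RoutingInput): Stage {
  // Best-effort: a draining canary receives no requests at all.
  if (input.canaryDraining) return 'main';

  // Assumption: an explicit, readable cookie takes precedence.
  if (input.canaryCookie === 'true') return 'cny';
  if (input.canaryCookie === 'false') return 'main';

  // Proposed: with no usable cookie, honour the stage the client booted from.
  if (input.stageHeader === 'cny') return 'cny';

  // Otherwise fall back to path-based routing.
  return input.isCanaryPath ? 'cny' : 'main';
}
```

With this ordering, an `/api/graphql` call made from a canary page carries the header and lands on canary even though its own path is not a canary path.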
Axios Example
Some API calls use Axios for AJAX. We could include a middleware in `axios_utils.js`.
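A minimal sketch of such a middleware, written as a request interceptor. The `makeStageInterceptor` name and the `RequestConfig` shape are hypothetical, and how the frontend learns which stage it booted from is left as an assumption. The config type here is a simplified stand-in for Axios' own request config, so the types would need adapting in real code.

```typescript
// Simplified stand-in for the parts of the Axios request config we touch.
interface RequestConfig {
  url?: string;
  headers?: Record<string, string>;
}

const STAGE_HEADER = 'X-GitLab-Stage';

// Returns an interceptor that tags requests only when the client is the
// canary build; main-stage clients send requests unchanged.
function makeStageInterceptor(stage: string | undefined) {
  return (config: RequestConfig): RequestConfig => {
    if (stage !== 'cny') return config;
    return {
      ...config,
      headers: { ...(config.headers || {}), [STAGE_HEADER]: stage },
    };
  };
}

// Assumed wiring, using Axios' request interceptor hook:
//   axios.interceptors.request.use(makeStageInterceptor(currentStage));
// where `currentStage` is a hypothetical value provided at boot time.
```

Returning a new config object rather than mutating the incoming one keeps the interceptor easy to test in isolation.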
CORS
We would need to ensure that the `X-GitLab-Stage` header is CORS compliant.
Why not use a cookie?
An alternative suggestion was to set a cookie when accessing a canary path endpoint, from HAProxy.
The problem with this approach is that the cookie is persistent, so once a user has accessed a canary path once, they will inadvertently "stick" to canary. This could drive much more traffic than expected to canary. Canary is only intended to serve a fraction of all traffic, and this should be kept small.