Incident Review: 2020-04-28: authenticated gitlab-pages unavailable for new sessions
Incident: No change issue was created, as the change was made as part of the response to an incident in which we were seeing an increase in errors on the pages fleet (production#1999 (closed)). The change was intended to mitigate those errors by changing the API endpoint.
Summary
- Service(s) affected : gitlab-pages
- Team attribution : n/a
- Minutes downtime or degradation : 288 minutes (4 hours, 48 minutes)
TLDR: A config change was rolled out to the gitlab-pages fleet, intended to make GitLab API requests go through an internal load balancer (as opposed to internet => cloudflare => external load balancer). Unfortunately, this config change also affected the URLs used for client-side authentication redirects. Those redirects were then sent to an internal address which did not resolve on the client side.
Detailed description
Background
Over the last few weeks, we've been seeing some issues with timeouts in gitlab-pages (production#1999 (closed)).
The investigation is still ongoing, but one of the findings was missing SYN/ACKs from gitlab.com when gitlab-pages is performing API requests. We are currently sending that API traffic through our public load balancer, and through cloudflare. One possible explanation (though unproven at this point) would be dropped SYNs from cloudflare.
In order to simplify the traffic flow and bypass potential failures, a proposal was made to route through an internal load balancer instead of going over the internet.
Existing traffic flow:
gitlab-pages => internet => cloudflare => external google lb => haproxy => web-api nodes
Proposed new traffic flow:
gitlab-pages => internal google lb => haproxy => web-api nodes
This was also discussed in gitlab-org/gitlab#215321 (closed), though the engineers rolling out the config change were not aware of that issue at the time (foreshadowing).
Configuration
In order to route through the internal load balancer, the config needs to be changed. gitlab-pages has a wide range of config settings for URLs:
```
-artifacts-server="": API URL to proxy artifact requests to, e.g.: 'https://gitlab.com/api/v4'
-auth-redirect-uri="": GitLab application redirect URI
-auth-server="": DEPRECATED, use gitlab-server instead. GitLab server, for example https://www.gitlab.com
-gitlab-server="": GitLab server, for example https://www.gitlab.com
```
In fact, the config behaviour is somewhat complex:
- `artifacts-server` is the API URL used for fetching CI job artifacts. It includes the `/api/v4` path prefix; it is unclear whether this is actively used at the moment.
- `auth-redirect-uri` is a URL to redirect to after authenticating a user for pages with access control enabled. It is currently set to `https://projects.gitlab.io`. The authentication itself uses `gitlab-server`.
- `auth-server` is deprecated.
- `gitlab-server` is the URL of a GitLab instance that is used for two things: API calls for domain resolution, and browser redirects for authentication (the latter was unknown at the time of the incident). Furthermore, if this flag is not defined, it will inherit from the deprecated `auth-server`, and then from `artifacts-server` (that fallback is also currently broken, see gitlab-org/gitlab-pages!270 (merged)).
In addition, these flags can be defined either inline as CLI flags or in a config file (via the namsral/flag package), with the command-line options taking precedence. It is quite challenging to understand what the effective configuration actually is.
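To make the fallback chain concrete, here is a minimal sketch in Go of the precedence behaviour described above. It is illustrative only and not the actual gitlab-pages source; the function and parameter names are assumptions.

```go
package main

import "strings"

// effectiveGitLabServer is an illustrative reconstruction of the fallback
// chain described above (not the actual gitlab-pages implementation).
func effectiveGitLabServer(gitlabServer, authServer, artifactsServer string) string {
	if gitlabServer != "" {
		return gitlabServer
	}
	// Fall back to the deprecated auth-server flag.
	if authServer != "" {
		return authServer
	}
	// Finally, fall back to artifacts-server, stripping its /api/v4 prefix.
	// This is the fallback reported broken in gitlab-org/gitlab-pages!270.
	return strings.TrimSuffix(artifactsServer, "/api/v4")
}
```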
Config change rollout
In a first attempt to make this change, `artifacts-server` was changed to an internal load balancer on 2020-04-24 (MR). However, this parameter is not used for the "domain resolution" API call -- which is the one being made frequently by the application.
Thus, in a second attempt, `gitlab-server` was changed to also point to an internal load balancer (MR). The change was first deployed and verified on staging. Then it was deployed to a single production host. No issues were detected during this stage. Then it was deployed to the rest of the web-pages fleet.
The web-pages hosts served traffic. There was no increase in error rates.
Rollout procedure
It's worth noting that the method for rolling out this change was not standardized or documented in any runbooks.
One of the engineers (@jarv) explained ad-hoc that chef changes to this service are unsafe to auto-apply, due to the service having a relatively long init phase after process restarts. Thus, the recommended procedure is to stop chef on those nodes, and then manually run chef, two nodes at a time.
The ad-hoc method for doing this was:
```shell
# Stop chef-client across the fleet so the change is not auto-applied.
knife ssh 'role:gprd-base-fe-web-pages' 'sudo systemctl stop chef-client'

# Apply the change one host at a time, watching readiness until the
# (long) init phase completes before moving on to the next host.
ssh web-pages-01-sv-gprd.c.gitlab-production.internal sudo chef-client
while true; do ssh web-pages-01-sv-gprd.c.gitlab-production.internal curl -s localhost:1080/-/readiness; sleep 30; done
ssh web-pages-02-sv-gprd.c.gitlab-production.internal sudo chef-client
while true; do ssh web-pages-02-sv-gprd.c.gitlab-production.internal curl -s localhost:1080/-/readiness; sleep 30; done
# ...and so on for the remaining web-pages hosts
```
Not only is this error-prone, it also makes rollouts and rollbacks take a while -- since we need to wait for each of the 8 web-pages hosts to initialize the gitlab-pages process in sequence.
Slow init phase
The slow init of the process is due to how the per-project domain configuration is managed.
The pages application (running on `web-pages-[0-8]-sv-gprd`) has an NFS mount from `pages-01-stor-gprd`. This shared mount contains several TB of data and hundreds of thousands of directories.
On process start, `gitlab-pages` scans through all of those directories looking for `config.json` files, which are then loaded into memory. With the added overhead of NFS, this takes a long time, as there is network latency for every `getdents` syscall.
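For illustration, the startup scan is roughly equivalent to the following sketch. This is not the actual gitlab-pages code; `loadDomainConfig` is a hypothetical stand-in for parsing a project's config into the in-memory map.

```go
package main

import (
	"os"
	"path/filepath"
)

// scanPagesConfigs mimics the startup scan: walk the shared mount looking
// for config.json files. Every directory visited costs at least one
// getdents round trip over NFS, which is why init is slow at this scale.
func scanPagesConfigs(root string, loadDomainConfig func(path string)) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && info.Name() == "config.json" {
			loadDomainConfig(path) // hypothetical: parse and cache in memory
		}
		return nil
	})
}
```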
This method of domain configuration is currently being replaced with on-demand API calls to gitlab's API. Once that replacement is complete, process init should be instant. See gitlab-org/gitlab-pages#379 (closed).
That should make the service safe for automated chef changes again.
Escalation
Despite the lack of visible errors on our side, we did get customer reports regarding pages not loading. These reports were escalated to the `#gitlab-pages` channel in Slack. Unfortunately, it took some time before the issue was escalated to SRE. A faster path would have been to ping `@sre-oncall` in `#production`.
The rollback
The engineer who rolled out the original change (@igorwwwwwwwwwwwwwwwwwwww) performed a handover to the SRE on-call (@nnelson).
This was done in a zoom call, mainly sharing the method for incremental chef changes.
The on-call then performed the rollback. This again took over an hour due to the slow init of the process.
Metrics
- Number of domains affected: 5433 (source)
Customer Impact
- Who was impacted by this incident? external customers
- What was the customer experience during the incident? unauthenticated users trying to access a page protected by access control received a redirect to a non-resolving URL
- How many customers were affected? unknown
- If a precise customer impact number is unknown, what is the estimated potential impact? 5433 distinct domains received broken redirects during this time frame. The number of users accessing those domains is unknown.
The nature of the outage makes it very hard to get accurate data. This is also the reason why it was so difficult to detect in the first place.
The gitlab-pages service has an access control feature (docs). This can be enabled on a per-project basis.
Pages protected by access control will make users go through an oauth authorization cycle.
Users who are already authenticated have a session cookie and do not need to go through this cycle. If they arrive for the first time, or the session expires, they will enter the authentication workflow.
This works by redirecting the user's browser to `${gitlab-server}/oauth/authorize?redirect_uri=${auth-redirect-uri}`. Before the config change, `gitlab-server` was set to `https://gitlab.com`. After the config change, it became `https://int.gprd.gitlab.net`.
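A simplified sketch of how these two settings combine into the browser-facing redirect (names and structure are illustrative; the real OAuth authorize request carries additional parameters such as `client_id`):

```go
package main

import (
	"fmt"
	"net/url"
)

// authorizeRedirect shows how the client-visible redirect is derived from
// gitlab-server and auth-redirect-uri. Any internal-only hostname in
// gitlab-server leaks straight into the user's browser.
func authorizeRedirect(gitlabServer, authRedirectURI string) string {
	q := url.Values{}
	q.Set("redirect_uri", authRedirectURI)
	return fmt.Sprintf("%s/oauth/authorize?%s", gitlabServer, q.Encode())
}

func main() {
	// Before the change: https://gitlab.com/oauth/authorize?redirect_uri=...
	fmt.Println(authorizeRedirect("https://gitlab.com", "https://projects.gitlab.io"))
	// After the change, the host is internal-only and does not resolve
	// for clients: https://int.gprd.gitlab.net/oauth/authorize?...
	fmt.Println(authorizeRedirect("https://int.gprd.gitlab.net", "https://projects.gitlab.io"))
}
```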
`int.gprd.gitlab.net` resolves to the IP of the internal load balancer -- it is not accessible from outside the prod network. Thus, when users got that redirect, what they saw was a browser error page: the hostname simply does not resolve outside our network.
However, because this error is purely on the client side, we have no way to track it. There is no log or metric that will tell us about this error.
One possible signal would be a change in traffic patterns, though that would be extremely noisy.
The conditions that were required to get this error:
- Project is using access control
- No existing session via session cookie
This is only a subset of projects, and a subset of users.
We can count the total number of redirects issued, but there is no guarantee that those represent legitimate user traffic.
Incident Response Analysis
- How was the event detected? customer report
- How could detection time be improved? blackbox testing of authorization workflow
- How did we reach the point where we knew how to mitigate the impact? customer report escalated to SRE
- How could time to mitigation be improved? escalation path to SRE directly, faster deployment of gitlab-pages
Blackbox testing
This was suggested by @jarv. If we had some sort of synthetic blackbox test that runs through the authentication flow, issues with access-controlled projects could be detected more easily, as sketched below.
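As a rough starting point, a probe along the following lines, run from outside the production network, would have caught this incident. This is a sketch under those assumptions; the target URL and function names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// probeAuthRedirect requests an access-controlled pages URL without a
// session cookie and verifies that the resulting auth redirect points at
// a host that resolves publicly.
func probeAuthRedirect(pagesURL string) error {
	client := &http.Client{
		// Do not follow redirects; we want to inspect the Location header.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(pagesURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	loc, err := resp.Location()
	if err != nil {
		return fmt.Errorf("expected an auth redirect, got status %d", resp.StatusCode)
	}
	// With gitlab-server misconfigured to an internal hostname, this
	// lookup fails when the probe runs outside the prod network.
	if _, err := net.LookupHost(loc.Hostname()); err != nil {
		return fmt.Errorf("redirect target %q does not resolve: %v", loc.Host, err)
	}
	return nil
}

func main() {
	// Hypothetical access-controlled pages site used as the probe target.
	if err := probeAuthRedirect("https://example-group.gitlab.io/private-project/"); err != nil {
		fmt.Println("probe failed:", err)
	}
}
```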
Escalation path
Production issues (especially when affecting large customers) should be escalated directly to SRE, possibly via `@sre-oncall` in the `#production` channel.
Deployment of gitlab-pages
The slow init phase also slowed down the rollback. By retiring disk-based configuration, we can make the deployment process faster and safer (gitlab-org/gitlab-pages#379 (closed)).
Communication between SRE and Dev
The `gitlab-pages` developers were aware of the issue with this config parameter (gitlab-org/gitlab#215321 (closed)). Working more closely with devs on this investigation and config change rollout could have signalled the issue earlier.
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure. if yes, have you linked the issue which represents the change?)?
Most of this is covered above.
Timeline
All times 2020-04-28 UTC.
- 16:56: rollout of config change begins (MR)
- 17:07: first host gets config change
- 17:54: rollout complete
- 18:08: customer report of outage
- 18:45: report escalated to SRE via Slack
- 19:48: SRE acknowledges the issue
- 19:51: rollback procedure discussed via Zoom
- 20:12: rollback begins (MR)
- 21:55: rollback complete
5 Whys
Skipping this for now, as the analysis above is quite comprehensive and covers several layers of contributing factors.
Lessons Learned
- Browser-facing configuration errors are really hard to detect!
Corrective Actions
- Blackbox test for `gitlab-pages` auth flow
- Document `gitlab-pages` config rollout procedure
- Retire disk-based domain configuration (gitlab-org/gitlab-pages#379 (closed))
- Introduce separate config flag for API URL (gitlab-org/gitlab#215321 (closed))