Incident Review: 2020-04-28: authenticated gitlab-pages unavailable for new sessions
Incident: No change issue was created, as the change was made as part of the response to an incident in which we were seeing an increase in errors on the pages fleet (production#1999 (closed)). The change was intended to mitigate those errors by changing the API endpoint.
Summary
- Service(s) affected : gitlab-pages
- Team attribution : n/a
- Minutes downtime or degradation : 288 minutes (4 hours, 48 minutes)
TLDR: A config change was rolled out to the gitlab-pages fleet, intended to make GitLab API requests go through an internal load balancer (as opposed to internet => cloudflare => external load balancer). Unfortunately, this config change also affected the URLs used for client-side authentication redirects. Those redirects were then sent to an internal address which did not resolve on the client side.
Detailed description
Background
Over the last few weeks, we've been seeing some issues with timeouts in gitlab-pages (production#1999 (closed)).
The investigation is still ongoing, but one of the findings was missing SYN/ACKs from gitlab.com when gitlab-pages is performing API requests. We are currently sending that API traffic through our public load balancer, and through cloudflare. One possible explanation (though unproven at this point) would be dropped SYNs from cloudflare.
In order to simplify the traffic flow and bypass potential failures, a proposal was made to route through an internal load balancer instead of going over the internet.
Existing traffic flow:
gitlab-pages => internet => cloudflare => external google lb => haproxy => web-api nodes
Proposed new traffic flow:
gitlab-pages => internal google lb => haproxy => web-api nodes
This was also discussed in gitlab-org/gitlab#215321 (closed), though the engineers rolling out the config change were not aware of that issue at the time (foreshadowing).
Configuration
In order to route through the internal load balancer, the config needs to be changed. gitlab-pages has a wide range of config settings for URLs:
```
-artifacts-server="": API URL to proxy artifact requests to, e.g.: 'https://gitlab.com/api/v4'
-auth-redirect-uri="": GitLab application redirect URI
-auth-server="": DEPRECATED, use gitlab-server instead. GitLab server, for example https://www.gitlab.com
-gitlab-server="": GitLab server, for example https://www.gitlab.com
```
In fact, the config behaviour is somewhat complex:
- `artifacts-server` is the API URL used for fetching CI job artifacts. It includes the `/api/v4` path prefix; it is unclear whether this is actively used at the moment.
- `auth-redirect-uri` is a URL to redirect to after authenticating a user for pages with access control enabled. It is currently set to `https://projects.gitlab.io`. The authentication itself uses `gitlab-server`.
- `auth-server` is deprecated.
- `gitlab-server` is the URL of a GitLab instance that is used for two things: API calls for domain resolution, and browser redirects for authentication (the latter was unknown at the time of the incident). Furthermore, if this flag is not defined, it will inherit from the deprecated `auth-server`, and then from `artifacts-server` (that fallback is also currently broken, see gitlab-org/gitlab-pages!270 (merged)).
In addition, these flags can be defined either inline as CLI flags or in a config file (via the namsral/flag package), with the command-line options taking precedence. It is quite challenging to understand what the effective configuration actually is.
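To make the fallback chain concrete, here is a minimal sketch in Go of the precedence behaviour described above. It is illustrative only and not the actual gitlab-pages source; the function and parameter names are assumptions.

```go
package main

import "strings"

// effectiveGitLabServer is an illustrative reconstruction of the fallback
// chain described above (not the actual gitlab-pages implementation).
func effectiveGitLabServer(gitlabServer, authServer, artifactsServer string) string {
	if gitlabServer != "" {
		return gitlabServer
	}
	// Fall back to the deprecated auth-server flag.
	if authServer != "" {
		return authServer
	}
	// Finally, fall back to artifacts-server, stripping its /api/v4 prefix.
	// This is the fallback reported broken in gitlab-org/gitlab-pages!270.
	return strings.TrimSuffix(artifactsServer, "/api/v4")
}
```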
Config change rollout
In a first attempt to make this change, `artifacts-server` was changed to an internal load balancer on 2020-04-24 (MR). However, this parameter is not used for the "domain resolution" API call -- which is the one being made frequently by the application.
Thus, in a second attempt, `gitlab-server` was changed to also point to an internal load balancer (MR). The change was first deployed and verified on staging. Then it was deployed to a single production host. No issues were detected during this stage. Then it was deployed to the rest of the web-pages fleet.
The web-pages hosts served traffic. There was no increase in error rates.
Rollout procedure
It's worth noting that the method for rolling out this change was not standardized or documented in any runbooks.
One of the engineers (@jarv) explained ad-hoc that chef changes to this service are unsafe to auto-apply, due to the service having a relatively long init phase after process restarts. Thus, the recommended procedure is to stop chef on those nodes, and then manually run chef, two nodes at a time.
The ad-hoc method for doing this was:
```shell
# Stop chef-client across the fleet so the change is not auto-applied.
knife ssh 'role:gprd-base-fe-web-pages' 'sudo systemctl stop chef-client'

# Apply the change one host at a time, watching readiness until the
# (long) init phase completes before moving on to the next host.
ssh web-pages-01-sv-gprd.c.gitlab-production.internal sudo chef-client
while true; do ssh web-pages-01-sv-gprd.c.gitlab-production.internal curl -s localhost:1080/-/readiness; sleep 30; done
ssh web-pages-02-sv-gprd.c.gitlab-production.internal sudo chef-client
while true; do ssh web-pages-02-sv-gprd.c.gitlab-production.internal curl -s localhost:1080/-/readiness; sleep 30; done
# ...and so on for the remaining web-pages hosts
```
Not only is this error-prone, it also makes rollouts and rollbacks take a while -- since we need to wait for each of the 8 web-pages hosts to initialize the gitlab-pages process in sequence.
Slow init phase
The slow init of the process is due to how the per-project domain configuration is managed.
The pages application (running on `web-pages-[0-8]-sv-gprd`) has an NFS mount from `pages-01-stor-gprd`. This shared mount contains several TB of data and hundreds of thousands of directories.
On process start, `gitlab-pages` scans through all of those directories looking for `config.json` files, which are then loaded into memory. With the added overhead of NFS, this takes a long time, as there is network latency for every `getdents` syscall.
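For illustration, the startup scan is roughly equivalent to the following sketch. This is not the actual gitlab-pages code; `loadDomainConfig` is a hypothetical stand-in for parsing a project's config into the in-memory map.

```go
package main

import (
	"os"
	"path/filepath"
)

// scanPagesConfigs mimics the startup scan: walk the shared mount looking
// for config.json files. Every directory visited costs at least one
// getdents round trip over NFS, which is why init is slow at this scale.
func scanPagesConfigs(root string, loadDomainConfig func(path string)) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && info.Name() == "config.json" {
			loadDomainConfig(path) // hypothetical: parse and cache in memory
		}
		return nil
	})
}
```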
This method of domain configuration is currently being replaced with on-demand API calls to gitlab's API. Once that replacement is complete, process init should be instant. See gitlab-org/gitlab-pages#379 (closed).
That should make the service safe for automated chef changes again.
Escalation
Despite the lack of visible errors on our side, we did get customer reports regarding pages not loading. These reports were escalated to the `#gitlab-pages` channel in Slack. Unfortunately, it took some time before the issue was escalated to SRE. A faster path would have been to ping `@sre-oncall` in `#production`.
The rollback
The engineer who rolled out the original change (@igorwwwwwwwwwwwwwwwwwwww) performed a handover to the SRE on-call (@nnelson).
This was done in a zoom call, mainly sharing the method for incremental chef changes.
The on-call then performed the rollback. This again took over an hour due to the slow init of the process.
Metrics
- Number of domains affected: 5433 (source)
Customer Impact
- Who was impacted by this incident? external customers
- What was the customer experience during the incident? unauthenticated users trying to access a page protected by access control received a redirect to a non-resolving URL
- How many customers were affected? unknown
- If a precise customer impact number is unknown, what is the estimated potential impact? 5433 distinct domains received broken redirects during this time frame. The number of users accessing those domains is unknown.
The nature of the outage makes it very hard to get accurate data. This is also the reason why it was so difficult to detect in the first place.
The gitlab-pages service has an access control feature (docs). This can be enabled on a per-project basis.
Pages protected by access control will make users go through an oauth authorization cycle.
Users who are already authenticated have a session cookie and do not need to go through this cycle. If they arrive for the first time, or the session expires, they will enter the authentication workflow.
This works by redirecting the user's browser to `${gitlab-server}/oauth/authorize?redirect_uri=${auth-redirect-uri}`. Before the config change, `gitlab-server` was set to `https://gitlab.com`. After the config change, it became `https://int.gprd.gitlab.net`.
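A simplified sketch of how these two settings combine into the browser-facing redirect (names and structure are illustrative; the real OAuth authorize request carries additional parameters such as `client_id`):

```go
package main

import (
	"fmt"
	"net/url"
)

// authorizeRedirect shows how the client-visible redirect is derived from
// gitlab-server and auth-redirect-uri. Any internal-only hostname in
// gitlab-server leaks straight into the user's browser.
func authorizeRedirect(gitlabServer, authRedirectURI string) string {
	q := url.Values{}
	q.Set("redirect_uri", authRedirectURI)
	return fmt.Sprintf("%s/oauth/authorize?%s", gitlabServer, q.Encode())
}

func main() {
	// Before the change: https://gitlab.com/oauth/authorize?redirect_uri=...
	fmt.Println(authorizeRedirect("https://gitlab.com", "https://projects.gitlab.io"))
	// After the change, the host is internal-only and does not resolve
	// for clients: https://int.gprd.gitlab.net/oauth/authorize?...
	fmt.Println(authorizeRedirect("https://int.gprd.gitlab.net", "https://projects.gitlab.io"))
}
```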
`int.gprd.gitlab.net` resolves to the IP of the internal load balancer -- it is not accessible from outside the prod network. Thus, when users got that redirect, what they saw was a browser error page: the hostname simply does not resolve outside our network.
However, because this error is purely on the client side, we have no way to track it. There is no log or metric that will tell us about this error.
One possible signal would be a change in traffic patterns, though that would be extremely noisy.
The conditions that were required to get this error:
- Project is using access control
- No existing session via session cookie
This is only a subset of projects, and a subset of users.
We can count the total number of redirects issued, but there is no guarantee that those represent legitimate user traffic.
Incident Response Analysis
- How was the event detected? customer report
- How could detection time be improved? blackbox testing of authorization workflow
- How did we reach the point where we knew how to mitigate the impact? customer report escalated to SRE
- How could time to mitigation be improved? escalation path to SRE directly, faster deployment of gitlab-pages
Blackbox testing
This was suggested by @jarv. If we had some sort of synthetic blackbox test that runs through the authentication flow, issues with access-controlled projects could be detected more easily, as sketched below.
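As a rough starting point, a probe along the following lines, run from outside the production network, would have caught this incident. This is a sketch under those assumptions; the target URL and function names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// probeAuthRedirect requests an access-controlled pages URL without a
// session cookie and verifies that the resulting auth redirect points at
// a host that resolves publicly.
func probeAuthRedirect(pagesURL string) error {
	client := &http.Client{
		// Do not follow redirects; we want to inspect the Location header.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(pagesURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	loc, err := resp.Location()
	if err != nil {
		return fmt.Errorf("expected an auth redirect, got status %d", resp.StatusCode)
	}
	// With gitlab-server misconfigured to an internal hostname, this
	// lookup fails when the probe runs outside the prod network.
	if _, err := net.LookupHost(loc.Hostname()); err != nil {
		return fmt.Errorf("redirect target %q does not resolve: %v", loc.Host, err)
	}
	return nil
}

func main() {
	// Hypothetical access-controlled pages site used as the probe target.
	if err := probeAuthRedirect("https://example-group.gitlab.io/private-project/"); err != nil {
		fmt.Println("probe failed:", err)
	}
}
```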
Escalation path
Production issues (especially when affecting large customers) should be escalated directly to SRE, possibly via `@sre-oncall` in the `#production` channel.
Deployment of gitlab-pages
The slow init phase also slowed down the rollback. By retiring disk-based configuration, we can make the deployment process faster and safer (gitlab-org/gitlab-pages#379 (closed)).
Communication between SRE and Dev
The `gitlab-pages` developers were aware of the issue with this config parameter (gitlab-org/gitlab#215321 (closed)). Working more closely with devs on this investigation and config change rollout could have signalled the issue earlier.
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure. if yes, have you linked the issue which represents the change?)?
Most of this is covered above.
Timeline
All times 2020-04-28 UTC.
- 16:56: rollout of config change begins (MR)
- 17:07: first host gets config change
- 17:54: rollout complete
- 18:08: customer report of outage
- 18:45: report escalated to SRE via Slack
- 19:48: SRE acknowledges the issue
- 19:51: rollback procedure discussed via Zoom
- 20:12: rollback begins (MR)
- 21:55: rollback complete
5 Whys
Skipping this for now, as the analysis above is quite comprehensive and covers several layers of contributing factors.
Lessons Learned
- Browser-facing configuration errors are really hard to detect!
Corrective Actions
- Blackbox test for `gitlab-pages` auth flow
- Document `gitlab-pages` config rollout procedure
- Retire disk-based domain configuration (gitlab-org/gitlab-pages#379 (closed))
- Introduce separate config flag for API URL (gitlab-org/gitlab#215321 (closed))