Post mortem for API endpoint vulnerability
Context
This issue serves as a place to gather the post-mortem and root cause analysis of https://gitlab.com/gitlab-org/gitlab-ce/issues/37946.
In responding to the vulnerability, API endpoints were blocked, which, together with a separate HAProxy issue, led to an outage of GitLab.com.
Timeline
Date: 2017-09-15
- 22:14 UTC - HackerOne report received
- 22:27 UTC - Responded to the H1 report and opened a confidential issue
- 22:35 UTC - API v3 and v4 endpoints blocked, and GitLab.com outage starts. Hot patch worked on and deployed.
- 23:54 UTC - Tweet from GitLab status that GitLab.com is back up
Incident Analysis
- How was the incident detected?
- HackerOne report
- Is there anything that could have been done to improve the time to detection?
- Automated tests with "negative assertions" to ensure that certain attributes aren't exposed, rather than our tests at the time which only asserted that something is exposed. These tests would have caused the original MR that introduced the bug to fail and would have prevented the merge in the first place.
- How was the root cause discovered?
- Investigating and debugging the code behind the affected API endpoint.
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- No.
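The "negative assertion" tests mentioned above can be illustrated with a minimal sketch. This is not GitLab's actual test suite; the names (`serialize_user`, `SAFE_KEYS`) are hypothetical stand-ins showing the difference between asserting what *is* exposed and asserting what *must not* be:

```ruby
require 'json'

# Hypothetical whitelist of attributes an API response may contain.
SAFE_KEYS = %w[id username name].freeze

# Illustrative serializer: a correct one exposes only whitelisted keys.
def serialize_user(user)
  user.select { |k, _| SAFE_KEYS.include?(k) }
end

user = {
  'id' => 1,
  'username' => 'alice',
  'name' => 'Alice',
  'private_token' => 'secret-token'
}

payload = serialize_user(user)

# Positive assertion: the attribute we expect is present.
raise 'missing username' unless payload.key?('username')

# Negative assertion: sensitive attributes must NOT appear anywhere
# in the rendered response. This is the check that would have failed
# on the MR that introduced the bug.
raise 'token leaked!' if JSON.generate(payload).include?('secret-token')
```

A test suite with only the positive assertion passes even when extra attributes leak; the negative assertion is what catches the regression.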
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start with the incident and ask why it happened; once there is an explanation, keep iterating and asking why until five whys are reached.
It's not a hard rule that it has to be exactly 5 times, but it helps to keep questioning to get deeper in finding the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than the specific actors.
5 Whys
- Private user tokens were exposed via an API endpoint
  - Why? - The API endpoint wasn't using a sanitized Entity object
    - Why? - The default `present` options were overridden to an empty Hash (`{}`)
      - Why? - We assumed it would fall back to the `success Entities::User` definition
        - In reality, this `success` method is purely informational and has no effect on how the API actually responds.
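The failure mode above can be sketched in plain Ruby. This is a simplified stand-in for Grape's `present` helper, not Grape itself, and `UserEntity` is an illustrative entity: the point is that without a `:with` entity option, the raw object flows straight through to the response:

```ruby
# Illustrative sanitized entity: exposes only whitelisted attributes.
class UserEntity
  EXPOSED = %i[id username].freeze

  def self.represent(user)
    user.select { |k, _| EXPOSED.include?(k) }
  end
end

# Simplified stand-in for Grape's `present` helper: if the options
# are overridden with an empty Hash, no entity is applied and the
# raw object (tokens and all) is returned as the response body.
def present(object, options = {})
  entity = options[:with]
  entity ? entity.represent(object) : object
end

user = { id: 1, username: 'alice', private_token: 'secret' }

safe   = present(user, with: UserEntity) # sanitized: id and username only
leaked = present(user, {})               # raw object: private_token exposed
```

A `success Entities::User` declaration alongside the endpoint changes none of this; in Grape it only documents the expected response shape.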
What went well
- Identify the things that worked well
- Immediate response, investigation, and mitigation with collaboration between Development, Security, and Production teams.
What can be improved
- Using the root cause analysis, explain what things can be improved.
- See corrective actions
Corrective actions
- We patched our systems on Friday evening after the initial report. Since then, we've ensured the fixes are in our `master` branches for CE and EE, as well as the `10-0-stable` branches from which our packages are built. That will cover people installing GitLab "from source". RC4 packages, which were built from the stable branches this morning and which contain the fix, are now available to everyone installing via Omnibus packages. We recently finished deploying GitLab.com from an RC4 package and have verified the fix is still in place.
- We've also completed our investigation into potential exploitation of this vulnerability and have determined that no malicious activity appears to have taken place. We identified a very small subset of users who potentially had their access tokens exposed and have reset their personal access, RSS, and incoming email tokens as a precaution. Those users will receive the email notification as in the screenshot below.
- (conf) Add tests for leaking tokens to request specs: https://gitlab.com/gitlab-org/gitlab-ce/issues/37948
- (conf) Override User JSON serialization methods to always raise an exception: https://gitlab.com/gitlab-org/gitlab-ce/issues/37947
- Deprecate private tokens and remove them in 10.2: https://gitlab.com/gitlab-org/gitlab-ce/issues/37301
- We had an error in our HAProxy Chef script that was introduced for test coverage on our testing platform (DigitalOcean). When run, this error caused the errant creation of a sub-interface on the production nodes, which led to corruption of their routing tables. So when we applied the v3 and v4 API block and issued the Chef commands to cycle the LB nodes, this error was encountered. @ilyaf has since refactored the errant test setup out of the Chef script.
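The corrective action of overriding the User JSON serialization methods to always raise can be sketched as follows. This is a hedged illustration, not GitLab's actual patch; the `User` class and message here are hypothetical. The idea is to make accidental whole-model serialization fail loudly instead of silently leaking attributes:

```ruby
class User
  attr_reader :id, :username, :private_token

  def initialize(id:, username:, private_token:)
    @id = id
    @username = username
    @private_token = private_token
  end

  # Guard rail: serializing the raw model directly is always a bug,
  # because it would include sensitive attributes like private_token.
  # Force callers to go through a sanitized entity instead.
  def to_json(*)
    raise 'Serialize User through a sanitized entity, not #to_json'
  end
  alias as_json to_json
end

user = User.new(id: 1, username: 'alice', private_token: 'secret')

begin
  user.to_json # any accidental direct serialization...
rescue RuntimeError
  # ...fails loudly in tests and development, instead of leaking tokens
end
```

With a guard like this in place, a regression of the kind behind this incident would surface as an exception in CI rather than as exposed tokens in an API response.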
