Post mortem for API endpoint vulnerability
Context
This issue serves as a place to gather the post-mortem and root cause analysis of https://gitlab.com/gitlab-org/gitlab-ce/issues/37946.
In responding to the vulnerability, API endpoints were blocked, which, together with a separate HAProxy issue, led to an outage of GitLab.com.
Timeline
Date: 2017-09-15
- 22:14 UTC - HackerOne report received
- 22:27 UTC - Responded to the H1 report and opened a confidential issue
- 22:35 UTC - API v3 and v4 endpoints blocked, and GitLab.com outage starts. Hot patch worked on and deployed.
- 23:54 UTC - Tweet from GitLab status that GitLab.com is back up
Incident Analysis
- How was the incident detected?
- HackerOne report
- Is there anything that could have been done to improve the time to detection?
- Automated tests with "negative assertions" to ensure that certain attributes aren't exposed, rather than our tests at the time which only asserted that something is exposed. These tests would have caused the original MR that introduced the bug to fail and would have prevented the merge in the first place.
- How was the root cause discovered?
- Investigating and debugging the code behind the affected API endpoint.
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
- No.
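The "negative assertion" tests mentioned above can be illustrated with a minimal sketch. This is not GitLab's actual test suite; the names (`serialize_user`, `SAFE_KEYS`) are hypothetical stand-ins showing the difference between asserting what *is* exposed and asserting what *must not* be:

```ruby
require 'json'

# Hypothetical whitelist of attributes an API response may contain.
SAFE_KEYS = %w[id username name].freeze

# Illustrative serializer: a correct one exposes only whitelisted keys.
def serialize_user(user)
  user.select { |k, _| SAFE_KEYS.include?(k) }
end

user = {
  'id' => 1,
  'username' => 'alice',
  'name' => 'Alice',
  'private_token' => 'secret-token'
}

payload = serialize_user(user)

# Positive assertion: the attribute we expect is present.
raise 'missing username' unless payload.key?('username')

# Negative assertion: sensitive attributes must NOT appear anywhere
# in the rendered response. This is the check that would have failed
# on the MR that introduced the bug.
raise 'token leaked!' if JSON.generate(payload).include?('secret-token')
```

A test suite with only the positive assertion passes even when extra attributes leak; the negative assertion is what catches the regression.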
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start with the incident and ask why it happened; once there is an explanation, keep iterating and asking why until five whys are reached.
It's not a hard rule that it has to be exactly 5 times, but it helps to keep questioning to get deeper in finding the actual root cause. Additionally, one why may yield more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than the specific actors.
5 Whys
- Private user tokens were exposed via an API endpoint
  - Why? - The API endpoint wasn't using a sanitized Entity object
    - Why? - The default `present` options were overridden to an empty Hash (`{}`)
      - Why? - We assumed it would fall back to the `success Entities::User` definition
        - In reality, this `success` method is purely informational and has no effect on how the API actually responds.
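The failure mode above can be sketched in plain Ruby. This is a simplified stand-in for Grape's `present` helper, not Grape itself, and `UserEntity` is an illustrative entity: the point is that without a `:with` entity option, the raw object flows straight through to the response:

```ruby
# Illustrative sanitized entity: exposes only whitelisted attributes.
class UserEntity
  EXPOSED = %i[id username].freeze

  def self.represent(user)
    user.select { |k, _| EXPOSED.include?(k) }
  end
end

# Simplified stand-in for Grape's `present` helper: if the options
# are overridden with an empty Hash, no entity is applied and the
# raw object (tokens and all) is returned as the response body.
def present(object, options = {})
  entity = options[:with]
  entity ? entity.represent(object) : object
end

user = { id: 1, username: 'alice', private_token: 'secret' }

safe   = present(user, with: UserEntity) # sanitized: id and username only
leaked = present(user, {})               # raw object: private_token exposed
```

A `success Entities::User` declaration alongside the endpoint changes none of this; in Grape it only documents the expected response shape.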
What went well
- Identify the things that worked well
- Immediate response, investigation, and mitigation with collaboration between Development, Security, and Production teams.
What can be improved
- Using the root cause analysis, explain what things can be improved.
- See corrective actions
Corrective actions
- We patched our systems on Friday evening after the initial report. Since then, we've ensured the fixes are in our `master` branches for CE and EE, as well as the `10-0-stable` branches from which our packages are built. That will cover people installing GitLab "from source". RC4 packages, which were built from the stable branches this morning and which contain the fix, are now available to everyone installing via Omnibus packages. We recently finished deploying GitLab.com from an RC4 package and have verified the fix is still in place.
- We've also completed our investigation into potential exploitation of this vulnerability and have determined that no malicious activity appears to have taken place. We identified a very small subset of users who potentially had their access tokens exposed and have reset their personal access, RSS, and incoming email tokens as a precaution. Those users will receive the email notification as in the screenshot below.
- (conf) Add tests for leaking tokens to request specs: https://gitlab.com/gitlab-org/gitlab-ce/issues/37948
- (conf) Override User JSON serialization methods to always raise an exception: https://gitlab.com/gitlab-org/gitlab-ce/issues/37947
- Deprecate private tokens and remove them in 10.2: https://gitlab.com/gitlab-org/gitlab-ce/issues/37301
- We had an error in our HAProxy Chef script that was introduced for test coverage on our testing platform (DigitalOcean). When run, this error caused the errant creation of a sub-interface on the production nodes, which led to corruption of their routing tables. So when we applied the v3 and v4 API block and issued the Chef commands to cycle the LB nodes, this error was encountered. @ilyaf has since refactored the errant test setup out of the Chef script.
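The corrective action of overriding the User JSON serialization methods to always raise can be sketched as follows. This is a hedged illustration, not GitLab's actual patch; the `User` class and message here are hypothetical. The idea is to make accidental whole-model serialization fail loudly instead of silently leaking attributes:

```ruby
class User
  attr_reader :id, :username, :private_token

  def initialize(id:, username:, private_token:)
    @id = id
    @username = username
    @private_token = private_token
  end

  # Guard rail: serializing the raw model directly is always a bug,
  # because it would include sensitive attributes like private_token.
  # Force callers to go through a sanitized entity instead.
  def to_json(*)
    raise 'Serialize User through a sanitized entity, not #to_json'
  end
  alias as_json to_json
end

user = User.new(id: 1, username: 'alice', private_token: 'secret')

begin
  user.to_json # any accidental direct serialization...
rescue RuntimeError
  # ...fails loudly in tests and development, instead of leaking tokens
end
```

With a guard like this in place, a regression of the kind behind this incident would surface as an exception in CI rather than as exposed tokens in an API response.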
