Abuse dashboard and alerts
At a guess, I would say 15-20% of all outages on GitLab.com are the result of user-induced resource constraint problems on our git infrastructure (file servers, Gitaly, etc.). These include:
- GitLab.com git infrastructure being used as a content distribution mechanism simultaneously by a large number of clients
- Badly written or abusive API scripts
- Badly written crawlers
@jacobvosmaer-gitlab pointed out that when this happens, it's usually first spotted by a member of the Gitaly team (or @stanhu, of course).
Additionally, diagnosing the problem once the site is experiencing 502s is reactive, stress-inducing, and bad for our uptime.
I would like to build a set of dashboards and alerts that will identify these issues, ideally before they lead to outages.
This issue will track that effort.
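As a rough illustration of the kind of check such an alert could run, here is a minimal Python sketch that flags any single client consuming a disproportionate share of server time within a window of requests. Everything in it is an assumption made for illustration: the `Request` record, the `flag_heavy_clients` helper, and the `max_share` and `min_total_s` thresholds are hypothetical, not part of Gitaly or our existing monitoring.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Request(NamedTuple):
    """Hypothetical record for one request seen in a time window."""
    client: str        # assumed identifier: user name, token ID, or source IP
    duration_s: float  # assumed server-side time spent on the request


def flag_heavy_clients(
    requests: Iterable[Request],
    max_share: float = 0.2,     # illustrative threshold, not a tuned value
    min_total_s: float = 60.0,  # ignore windows with too little load to matter
) -> List[Tuple[str, float]]:
    """Return (client, share) pairs for clients above max_share of total time."""
    per_client: Counter = Counter()
    for req in requests:
        per_client[req.client] += req.duration_s

    total = sum(per_client.values())
    if total < min_total_s:
        return []

    flagged = [
        (client, spent / total)
        for client, spent in per_client.items()
        if spent / total > max_share
    ]
    # Heaviest consumer first, so the top entry is the one to look at.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Example window: one crawler-like client dominates two ordinary users.
window = (
    [Request("crawler", 3.0)] * 30
    + [Request("alice", 5.0)]
    + [Request("bob", 5.0)]
)
heavy = flag_heavy_clients(window)
```

Ranking by share of total time rather than raw request count is a deliberate choice in this sketch: a few very expensive calls can hurt a file server as much as a flood of cheap ones.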
Note that identifying the problem early (e.g., through this issue) does not lessen the need to fix the issues in our stack that lead to these situations in the first place. Those are being addressed in parallel through issues like:
Recent examples of this type of Gitaly resource outage include: