Treat 50x in blackbox metrics as errors
While working on https://gitlab.com/gitlab-com/infrastructure/issues/2864 we found a way to improve our blackbox metrics: https://performance.gitlab.net/dashboard/db/gitlab-web-status?orgId=1&from=1506563100934&to=1506566861021&panelId=25&fullscreen&edit&tab=general
- we have a very slow MR URL: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/9546
- it is probed by Prometheus: https://dev.gitlab.org/cookbooks/chef-repo/blob/master/roles/prometheus-server.json#L95
- if the app is so slow that it cannot serve this URL within 60 seconds, it returns a 502 error
- the checker assumes it got a valid response and plots 60 seconds on the graph
What should happen instead:
- the checker should treat 50x responses as errors and alert on them.
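A minimal sketch of the classification the checker could apply, assuming it sees a status code (or no response at all) plus the request duration for each probe. `evaluate_probe`, `ProbeResult`, `should_alert` and the "three failed probes in a row" alert policy are hypothetical names and thresholds, not the API of any real exporter; only the 60-second timeout comes from this issue. The point is that a 50x is recorded as a failure with no latency, so it never appears on the latency graph as a 60-second data point. (If the checker is Prometheus's blackbox_exporter, the relevant settings are probably its HTTP module's `valid_status_codes` option and the `probe_success` metric; worth confirming against its docs.)

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Timeout after which the app answers with a 502, as described in this issue.
TIMEOUT_S = 60.0


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one probe of a URL (hypothetical structure)."""
    success: bool
    # Latency is only meaningful for successful probes; None otherwise.
    latency_s: Optional[float]
    reason: str


def evaluate_probe(status_code: Optional[int], duration_s: float,
                   timeout_s: float = TIMEOUT_S) -> ProbeResult:
    """Classify one probe; status_code is None if no response arrived."""
    if status_code is None:
        return ProbeResult(False, None, "no response")
    if 500 <= status_code <= 599:
        # A 50x (e.g. a 502 after the upstream timed out) is an error,
        # not a slow success, so no latency is recorded for it.
        return ProbeResult(False, None, f"server error {status_code}")
    if duration_s >= timeout_s:
        return ProbeResult(False, None, "timed out")
    if not 200 <= status_code <= 299:
        return ProbeResult(False, None, f"unexpected status {status_code}")
    return ProbeResult(True, duration_s, "ok")


def should_alert(results: Sequence[ProbeResult], max_failures: int = 3) -> bool:
    """Hypothetical policy: alert when the last `max_failures` probes all failed."""
    recent = list(results)[-max_failures:]
    return len(recent) == max_failures and not any(r.success for r in recent)
```

For example, a history of one 20-second success followed by three 502s after 60 seconds would make `should_alert` return `True`, while the latency series only contains the 20-second point. Requiring several consecutive failures is one way to avoid paging on a single slow request; the right threshold is a judgment call.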
Also, since this page alone takes 20 seconds to render, can we estimate the load we are putting on ourselves by probing it continuously?
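A rough first answer comes from Little's law: the average number of requests in flight equals the arrival rate times the time each request takes. The sketch below applies it to the probe. The 20-second render time comes from this issue; the probe interval and the number of Prometheus servers probing the URL are hypothetical inputs, since neither is stated here, and the estimate ignores caching and the extra cost when requests run into the 60-second timeout.

```python
def busy_workers(render_time_s: float, interval_s: float, probers: int = 1) -> float:
    """Average number of app workers kept busy by the probe (Little's law: L = lambda * W)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    arrival_rate = probers / interval_s  # probe requests per second
    return arrival_rate * render_time_s


def worker_hours_per_day(render_time_s: float, interval_s: float, probers: int = 1) -> float:
    """Worker time spent serving the probe over one day, in hours."""
    return busy_workers(render_time_s, interval_s, probers) * 24
```

With a hypothetical 15-second probe interval and a single prober, a 20-second page keeps about 1.33 workers busy on average, or roughly 32 worker-hours per day; with a 60-second interval it drops to about 0.33 workers.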