2020-04-21 increased error rates on web-pages-03 and 06
TLDR
For the 2nd time today, 2 nodes in the web-pages tier intermittently failed while trying to make calls to the API tier. This may have been specific to GCP zone us-east1-b, as the 2 nodes in that zone were the only nodes affected by this error rate spike.
Summary
PagerDuty alert: Firing 1 - The web-pages service (main stage) has an error-ratio exceeding SLO
This was the 2nd time today this problem occurred. The previous incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1966 was an even larger spike between 11:03 and 12:03 UTC today.
This alert indicates that over 1% of requests to the web-pages service are resulting in an error outcome. Note that, surprisingly, the structured log does not include the HTTP response code, so filtering in Kibana by json.status does not find these events. Instead, search Kibana for events where json.error exists.
Most of the matching log events have:
- json.err: "retrieval context done"
- json.msg: "could not fetch domain information from a source"
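For anyone repeating the search outside the Kibana UI, the filter described above can be sketched as an Elasticsearch query body. This is only a sketch: the `@timestamp` and `json.hostname` field names are assumptions (this report only confirms `json.error`, `json.err`, `json.msg`, and `json.status`), and the helper name `build_error_query` is hypothetical.

```python
import json


def build_error_query(start_utc, end_utc, host_field="json.hostname"):
    """Build a query body matching events where json.error exists.

    Assumptions (not confirmed by this report): the time field is
    "@timestamp" and the per-host field is "json.hostname".
    """
    return {
        "size": 0,  # only the per-host counts are needed, not the raw hits
        "query": {
            "bool": {
                "filter": [
                    # Matches any document that has a value for json.error.
                    {"exists": {"field": "json.error"}},
                    # Restrict to the incident window.
                    {"range": {"@timestamp": {"gte": start_utc, "lte": end_utc}}},
                ]
            }
        },
        # Count matching events per host (field name is an assumption).
        "aggs": {"errors_by_host": {"terms": {"field": host_field, "size": 20}}},
    }


if __name__ == "__main__":
    body = build_error_query("2020-04-21T16:20:00Z", "2020-04-21T16:50:00Z")
    print(json.dumps(body, indent=2))
```

In the Kibana query bar, the equivalent existence filter is usually written as `json.error:*`.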
@igorwwwwwwwwwwwwwwwwwwww shared some context around what this failing API call aims to do:
gitlab-pages talks to the gitlab api to resolve information about a domain. it caches this information for 1 minute, but overall there are ~30 QPS against the gitlab api (not sure if that is per pages host or overall). the message means that this gitlab api call timed out.
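To illustrate the behaviour described in that note, here is a minimal sketch of a 1-minute cache in front of a domain lookup. It is not the gitlab-pages implementation: `DomainCache`, `fetch`, and `clock` are hypothetical names, and the only details taken from the note are the 1-minute lifetime and the fact that a timed-out API call surfaces as an error.

```python
import time


class DomainCache:
    """Cache domain lookups for a fixed time-to-live (sketch only)."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable(domain) -> info; may raise TimeoutError
        self._ttl = ttl_seconds  # the note above mentions a 1-minute cache
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # domain -> (expires_at, info)

    def lookup(self, domain):
        now = self._clock()
        entry = self._entries.get(domain)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh entry: no API call is made
        # Missing or expired entry: call the API. If the call times out,
        # the exception propagates and nothing is cached, so the next
        # request for this domain tries the API again.
        info = self._fetch(domain)
        self._entries[domain] = (now + self._ttl, info)
        return info
```

Under this model, an API slowdown only affects requests whose cache entry is missing or expired, which would be consistent with intermittent rather than total failures.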
For reference, here's a comparison of the 1st and 2nd incident today -- both involved the same 2 hosts.
The 1st incident (11:00 - 12:20 UTC) had:
- 4000 errors from web-pages-03
- 25000 errors from web-pages-06
- None from all other hosts
https://log.gprd.gitlab.net/goto/394f73766740482469db802c6d2c65af
During the 2nd incident (i.e. this ticket), from 16:20 to 16:50 UTC, we have the same 2 hosts in an inverted proportion:
- 5100 errors from web-pages-03
- 583 errors from web-pages-06
- 4 errors from web-pages-07
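To make the "inverted proportion" concrete, the per-host share of errors in each incident can be computed from the counts listed above. The counts are copied from this report; the helper name `error_shares` is just for illustration.

```python
# Error counts per host, copied from the two incident summaries above.
INCIDENTS = {
    "1st (11:00-12:20 UTC)": {"web-pages-03": 4000, "web-pages-06": 25000},
    "2nd (16:20-16:50 UTC)": {
        "web-pages-03": 5100,
        "web-pages-06": 583,
        "web-pages-07": 4,
    },
}


def error_shares(counts):
    """Return each host's fraction of the total errors in one incident."""
    total = sum(counts.values())
    return {host: n / total for host, n in counts.items()}


if __name__ == "__main__":
    for name, counts in INCIDENTS.items():
        print(name, "total:", sum(counts.values()))
        for host, share in error_shares(counts).items():
            print(f"  {host}: {share:.1%}")
```

By this arithmetic, web-pages-06 accounts for roughly 86% of errors in the 1st incident, while web-pages-03 accounts for roughly 90% in the 2nd.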
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-04-21
- 16:29 - Error rate starts to climb
- 16:36 - Error rate starts to fall
- 16:38 - PagerDuty alert triggers
- 16:43 - PagerDuty alert clears
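The gaps between these timeline entries can be worked out as follows. The times are copied from the list above; the key names and the `minutes_between` helper are illustrative only.

```python
from datetime import datetime

# Timeline entries for 2020-04-21, all times UTC, copied from the list above.
TIMELINE_UTC = {
    "error_rate_climbs": "16:29",
    "error_rate_falls": "16:36",
    "alert_triggers": "16:38",
    "alert_clears": "16:43",
}


def minutes_between(start, end, day="2020-04-21"):
    """Whole minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%Y-%m-%d %H:%M"
    t0 = datetime.strptime(f"{day} {start}", fmt)
    t1 = datetime.strptime(f"{day} {end}", fmt)
    return int((t1 - t0).total_seconds() // 60)


if __name__ == "__main__":
    t = TIMELINE_UTC
    print("onset to alert:", minutes_between(t["error_rate_climbs"], t["alert_triggers"]))
    print("peak to alert:", minutes_between(t["error_rate_falls"], t["alert_triggers"]))
    print("alert duration:", minutes_between(t["alert_triggers"], t["alert_clears"]))
```

By this arithmetic the alert fired 9 minutes after the error rate began to climb and 2 minutes after it had already started to fall.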
Graphs
6-hour timeline of the alerting metric
Kibana timeline of error events from all web-pages nodes during 16:20 to 16:50 UTC timespan
https://log.gprd.gitlab.net/goto/2a55076e450f122a812ec012c5e9d24e
List of nodes by GCP zone
The 2 affected nodes were both in zone us-east1-b, and no other nodes are in that zone.
msmiley@saoirse:~$ gcloud --project='gitlab-production' compute instances list | grep 'web-pages'
web-pages-03-sv-gprd us-east1-b n1-standard-8 10.220.12.2 RUNNING
web-pages-06-sv-gprd us-east1-b n1-standard-8 10.220.12.4 RUNNING
web-pages-01-sv-gprd us-east1-c n1-standard-8 10.220.12.5 RUNNING
web-pages-04-sv-gprd us-east1-c n1-standard-8 10.220.12.3 RUNNING
web-pages-07-sv-gprd us-east1-c n1-standard-8 10.220.12.8 RUNNING
web-pages-02-sv-gprd us-east1-d n1-standard-8 10.220.12.6 RUNNING
web-pages-05-sv-gprd us-east1-d n1-standard-8 10.220.12.7 RUNNING
web-pages-08-sv-gprd us-east1-d n1-standard-8 10.220.12.9 RUNNING
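The zone membership claim can be checked mechanically from the listing above. The sketch below groups the pasted output by zone; it assumes the whitespace-separated column order shown here (name, zone, machine type, internal IP, status), and `nodes_by_zone` is a hypothetical helper name.

```python
from collections import defaultdict

# Listing copied from the gcloud output above.
GCLOUD_OUTPUT = """\
web-pages-03-sv-gprd us-east1-b n1-standard-8 10.220.12.2 RUNNING
web-pages-06-sv-gprd us-east1-b n1-standard-8 10.220.12.4 RUNNING
web-pages-01-sv-gprd us-east1-c n1-standard-8 10.220.12.5 RUNNING
web-pages-04-sv-gprd us-east1-c n1-standard-8 10.220.12.3 RUNNING
web-pages-07-sv-gprd us-east1-c n1-standard-8 10.220.12.8 RUNNING
web-pages-02-sv-gprd us-east1-d n1-standard-8 10.220.12.6 RUNNING
web-pages-05-sv-gprd us-east1-d n1-standard-8 10.220.12.7 RUNNING
web-pages-08-sv-gprd us-east1-d n1-standard-8 10.220.12.9 RUNNING
"""


def nodes_by_zone(listing):
    """Group instance names by zone, assuming name and zone are the first two columns."""
    zones = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, zone = fields[0], fields[1]
        zones[zone].append(name)
    return dict(zones)


if __name__ == "__main__":
    for zone, names in sorted(nodes_by_zone(GCLOUD_OUTPUT).items()):
        print(zone, names)
```

For this listing, us-east1-b contains only web-pages-03 and web-pages-06, matching the 2 hosts that produced nearly all of the errors.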