2020-04-21 increased error rates on web-pages-03 and 06

TLDR

For the 2nd time today, 2 nodes in the web-pages tier intermittently failed while trying to make calls to the API tier. This may have been specific to GCP zone us-east1-b, as the 2 nodes in that zone were the only nodes affected by this error rate spike.

Summary

PagerDuty alert: Firing 1 - The web-pages service (main stage) has an error-ratio exceeding SLO

This was the 2nd time today this problem occurred. The previous incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1966 was an even larger spike between 11:03 and 12:03 UTC today.

This alert indicates that over 1% of requests to the web-pages service resulted in an error outcome. Note that, surprisingly, the structured log does not include the HTTP response code, so filtering in Kibana by json.status does not find these events. Instead, search Kibana for events where json.error exists.
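For example, an existence filter in the Kibana query bar can be written as follows (either form should work, depending on whether the query bar is set to KQL or Lucene syntax):

```
json.error: *            (KQL: field exists with a non-null value)
_exists_:json.error      (equivalent Lucene syntax)
```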

Most of the matching log events have:

  • json.err: "retrieval context done"
  • json.msg: "could not fetch domain information from a source"

@igorwwwwwwwwwwwwwwwwwwww shared some context around what this failing API call aims to do:

gitlab-pages talks to the gitlab api to resolve information about a domain. it caches this information for 1 minute, but overall there are ~30 QPS against the gitlab api (not sure if that is per pages host or overall). the message means that this gitlab api call timed out.
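The caching behavior described above can be sketched roughly like this. This is a minimal illustration of a 1-minute TTL cache in front of a fetch call, not gitlab-pages' actual implementation; all names here are hypothetical:

```python
import time

class DomainCache:
    """Minimal sketch of a TTL cache for domain lookups. Assumes entries
    expire after ttl_seconds, after which the next lookup re-fetches."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # host -> (value, fetched_at)

    def get(self, host, fetch):
        """Return cached info for host; call fetch(host) only when the
        entry is missing or older than the TTL."""
        hit = self._entries.get(host)
        if hit is not None and time.monotonic() - hit[1] < self.ttl:
            return hit[0]
        value = fetch(host)
        self._entries[host] = (value, time.monotonic())
        return value

calls = 0
def fetch(host):
    """Stand-in for the gitlab api call that timed out in this incident."""
    global calls
    calls += 1
    return f"info-for-{host}"

cache = DomainCache(ttl_seconds=60.0)
cache.get("example.gitlab.io", fetch)
cache.get("example.gitlab.io", fetch)  # second lookup served from cache
print(calls)  # 1
```

With this pattern, steady-state QPS against the API scales with the number of distinct domains being served per minute rather than with raw page traffic, which is consistent with the ~30 QPS figure mentioned above.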

For reference, here's a comparison of the 1st and 2nd incidents today -- both involved the same 2 hosts.

The 1st incident (11:00 - 12:20 UTC) had:

  • 4000 errors from web-pages-03
  • 25000 errors from web-pages-06
  • None from all other hosts

https://log.gprd.gitlab.net/goto/394f73766740482469db802c6d2c65af

During the 2nd incident (i.e. this ticket), from 16:20 to 16:50 UTC, the same 2 hosts appear, in an inverted proportion:

  • 5100 errors from web-pages-03
  • 583 errors from web-pages-06
  • 4 errors from web-pages-07

More information will be added as we investigate the issue.

Timeline

All times UTC.

2020-04-21

  • 16:29 - Error rate starts to climb
  • 16:36 - Error rate starts to fall
  • 16:38 - PagerDuty alert triggers
  • 16:43 - PagerDuty alert clears

Graphs

6-hour timeline of the alerting metric

Screenshot_from_2020-04-21_09-46-25

Kibana timeline of error events from all web-pages nodes during the 16:20 to 16:50 UTC timespan

https://log.gprd.gitlab.net/goto/2a55076e450f122a812ec012c5e9d24e

Screenshot_from_2020-04-21_15-38-16

List of nodes by GCP zone

The 2 affected nodes were both in zone us-east1-b, and no other nodes are in that zone.

msmiley@saoirse:~$ gcloud --project='gitlab-production' compute instances list | grep 'web-pages'
web-pages-03-sv-gprd                                 us-east1-b  n1-standard-8                            10.220.12.2                   RUNNING
web-pages-06-sv-gprd                                 us-east1-b  n1-standard-8                            10.220.12.4                   RUNNING
web-pages-01-sv-gprd                                 us-east1-c  n1-standard-8                            10.220.12.5                   RUNNING
web-pages-04-sv-gprd                                 us-east1-c  n1-standard-8                            10.220.12.3                   RUNNING
web-pages-07-sv-gprd                                 us-east1-c  n1-standard-8                            10.220.12.8                   RUNNING
web-pages-02-sv-gprd                                 us-east1-d  n1-standard-8                            10.220.12.6                   RUNNING
web-pages-05-sv-gprd                                 us-east1-d  n1-standard-8                            10.220.12.7                   RUNNING
web-pages-08-sv-gprd                                 us-east1-d  n1-standard-8                            10.220.12.9                   RUNNING
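The zone correlation can be double-checked by tallying the listing above (node-to-zone data copied verbatim from the gcloud output; the script itself is just a convenience):

```python
from collections import Counter

# Zone per node, copied from the gcloud listing above.
zones = {
    "web-pages-01-sv-gprd": "us-east1-c",
    "web-pages-02-sv-gprd": "us-east1-d",
    "web-pages-03-sv-gprd": "us-east1-b",
    "web-pages-04-sv-gprd": "us-east1-c",
    "web-pages-05-sv-gprd": "us-east1-d",
    "web-pages-06-sv-gprd": "us-east1-b",
    "web-pages-07-sv-gprd": "us-east1-c",
    "web-pages-08-sv-gprd": "us-east1-d",
}

# The nodes in us-east1-b are exactly the two that showed the error spike.
affected = sorted(n for n, z in zones.items() if z == "us-east1-b")
print(affected)                 # ['web-pages-03-sv-gprd', 'web-pages-06-sv-gprd']
print(Counter(zones.values()))  # 2 nodes in -b, 3 each in -c and -d
```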