Avatars not loading

Recently noticed that some avatars are having trouble loading; eventually they fall back to the default Gitter logo. This is happening on both production and staging.

https://avatars-01.gitter.im/group/iv/3/57542c89c43b8c601977426d?s=22
https://avatars-02.gitter.im/group/iv/3/57542d2fc43b8c601977a533?s=22
https://avatars-02.gitter.im/group/iv/3/57542c7bc43b8c60197739aa?s=22

https://avatars-staging-01.gitter.im/group/iv/3/57542c89c43b8c601977426d?s=22
https://avatars-staging-02.gitter.im/group/iv/3/57542d2fc43b8c601977a533?s=22
https://avatars-staging-02.gitter.im/group/iv/3/57542c7bc43b8c60197739aa?s=22

Notes

The avatars service is just a basic nginx proxy fronted by CloudFront, https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/blob/6f063d711439678ecefa8db76c69f1b806a73655/ansible/roles/gitter/avatars/templates/nginx-conf.j2

apps servers -> avatars-origin.gitter.im (Route53) -> CloudFront -> avatars.gitter.im, avatars-01.gitter.im, avatars-02.gitter.im, avatars-03.gitter.im, avatars-04.gitter.im, avatars-05.gitter.im

Looking at Monit on apps-01/apps-02 for the avatars/avatars-staging hosts, I see "Connection failed" at port 80:

(screenshots: apps-01, apps-02)

When I go to https://avatars-04.gitter.im/api/private/health_check, I see the following 504 error

504 ERROR

The request could not be satisfied.

CloudFront attempted to establish a connection with the origin, but either the attempt failed or the origin closed the connection. 
If you received this error while trying to use an app or access a website, please contact the provider or website owner for assistance. 
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by following steps in the CloudFront documentation (https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-504-gateway-timeout.html). 

Generated by cloudfront (CloudFront)
Request ID: FlJevG4IJ74pMwSVEEUokUoLOyoh5pznoYdAjoMLzjzWyCgcfzXF7A==

Does this coincide with the webapp fleet being recycled recently, i.e. the webapp AMI and fleet update? https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/merge_requests/120

  • My initial thought is no 🤔
  • But what does this line do in the avatars nginx that loops through the webapp servers? https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/blob/6f063d711439678ecefa8db76c69f1b806a73655/ansible/roles/gitter/avatars/templates/nginx-conf.j2#L6-10
    • /etc/nginx/sites-enabled/gitter-avatars.conf
    upstream gitter-avatars-api-backend {
      server webapp-04.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-04.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-03.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-03.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-01.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-01.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-08.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-08.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-06.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-06.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-07.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-07.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-02.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-02.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-05.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-05.prod.gitter:5023 max_fails=3 fail_timeout=10s;
    }
    
    ...
  • Do we need to clear some cached IPs, like SSH's known_hosts?
    • Maybe; I just restarted nginx, which seemed to fix it. See below.
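For context on the "cached IPs" question: with open-source nginx, hostnames in an upstream block's server directives are resolved once, when the configuration is loaded, and the resulting IPs are kept until the next reload or restart. So if the webapp fleet is recycled and the instances come back with new IPs, the avatars nginx keeps proxying to the old, now-dead addresses, which is consistent with a restart fixing things. A minimal illustration (names taken from the config above):

```nginx
# The hostname here is resolved ONCE, at config load/reload time.
# If webapp-04.prod.gitter later points at a new instance/IP, this
# upstream keeps the stale IP until nginx is reloaded or restarted.
upstream gitter-avatars-api-backend {
  server webapp-04.prod.gitter:5025 max_fails=3 fail_timeout=10s;
}
```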

I just restarted nginx on apps-01 and apps-02 and everything seems to be healthy now.

  • nginx -t to check if the configuration is valid
  • nginx -s reload to reload nginx
(screenshots: apps-01, apps-02)

The health check endpoint now loads again as well, https://avatars-04.gitter.im/api/private/health_check

OK from webapp-05:5025, running 8ac7d5, branch 19-48-0, commit 8ac7d5c7e70535f534afc2e919e320121a2b1fcb

Preventing in the future

Two options:

  1. Make nginx not "cache" DNS (it only resolves upstream hostnames when the config is loaded) so webapp-0x.prod.gitter:5025 always points to the new instance
    • This one seems more reasonable
  2. When a new webapp server is added, restart nginx on apps-0x servers
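A sketch of option 1, untested: open-source nginx only re-resolves a hostname per request when it is referenced through a variable in `proxy_pass`, combined with a `resolver` directive. The resolver address and `valid=` window below are placeholders, not our actual values. Note that the variable form bypasses the `upstream` block, so load balancing across the webapp servers would have to come from DNS (a name returning all webapp IPs) or from NGINX Plus's `resolve` parameter on upstream `server` lines.

```nginx
# Sketch only: force request-time DNS resolution for the backend.
# 10.0.0.2 is a placeholder for the VPC resolver; valid=30s caps how
# long nginx caches the DNS answer.
resolver 10.0.0.2 valid=30s;

server {
  listen 80;

  location / {
    # Referencing the name via a variable makes nginx resolve it at
    # request time (this bypasses the upstream{} block and its
    # load balancing).
    set $avatars_backend webapp-01.prod.gitter;
    proxy_pass http://$avatars_backend:5025;
  }
}
```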

^ This stuff is now tracked by https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6984

Alerting

We should have seen some avatars_health_check_failed alerts in PagerDuty, but they are probably tuned down too much:

  • https://gitter.pagerduty.com/incidents/P80EQ33
  • https://gitter.pagerduty.com/incidents/PQ9DU46
  • https://gitter.pagerduty.com/incidents/PI4IKWJ

✅ Just added monit-prod-error to our Slack integration with #gitter-alerts so we have more visibility, https://gitter.pagerduty.com/services/P16ONUD/integrations

Other reports

  • https://gitter.im/gitterHQ/gitter?at=5d03356f6f0ec85ade0476f0

cc @viktomas

Edited Jun 14, 2019 by Eric Eastwood