# Avatars not loading
I recently noticed that some avatars are having trouble loading; eventually they fall back to the default Gitter logo. This is happening on both production and staging.
https://avatars-01.gitter.im/group/iv/3/57542c89c43b8c601977426d?s=22
https://avatars-02.gitter.im/group/iv/3/57542d2fc43b8c601977a533?s=22
https://avatars-02.gitter.im/group/iv/3/57542c7bc43b8c60197739aa?s=22
https://avatars-staging-01.gitter.im/group/iv/3/57542c89c43b8c601977426d?s=22
https://avatars-staging-02.gitter.im/group/iv/3/57542d2fc43b8c601977a533?s=22
https://avatars-staging-02.gitter.im/group/iv/3/57542c7bc43b8c60197739aa?s=22
## Notes
The avatar service is just a basic nginx proxy behind CloudFront: https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/blob/6f063d711439678ecefa8db76c69f1b806a73655/ansible/roles/gitter/avatars/templates/nginx-conf.j2
apps servers -> `avatars-origin.gitter.im` (Route53) -> CloudFront -> `avatars.gitter.im`, `avatars-01.gitter.im`, `avatars-02.gitter.im`, `avatars-03.gitter.im`, `avatars-04.gitter.im`, `avatars-05.gitter.im`
Looking at Monit on apps-01/apps-02 for the `avatars`/`avatars-staging` hosts, I see `Connection failed` at port 80:
| apps-01 | apps-02 |
|---|---|
| *(screenshot)* | *(screenshot)* |
When I go to https://avatars-04.gitter.im/api/private/health_check, I see the following 504 error
> **504 ERROR**
>
> The request could not be satisfied.
>
> CloudFront attempted to establish a connection with the origin, but either the attempt failed or the origin closed the connection.
>
> If you received this error while trying to use an app or access a website, please contact the provider or website owner for assistance.
>
> If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by following steps in the CloudFront documentation (https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-504-gateway-timeout.html).
>
> Generated by cloudfront (CloudFront)
> Request ID: FlJevG4IJ74pMwSVEEUokUoLOyoh5pznoYdAjoMLzjzWyCgcfzXF7A==
Does this coincide with the webapp AMI and fleet update that recently recycled the webapp fleet? https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/merge_requests/120
- My initial thought is no
- 🤔 But what does this block in the avatars nginx config, which loops through the `webapp` servers, do? https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure/blob/6f063d711439678ecefa8db76c69f1b806a73655/ansible/roles/gitter/avatars/templates/nginx-conf.j2#L6-10 (rendered to `/etc/nginx/sites-enabled/gitter-avatars.conf`)

  ```nginx
  upstream gitter-avatars-api-backend {
      server webapp-04.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-04.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-03.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-03.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-01.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-01.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-08.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-08.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-06.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-06.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-07.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-07.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-02.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-02.prod.gitter:5023 max_fails=3 fail_timeout=10s;
      server webapp-05.prod.gitter:5025 max_fails=3 fail_timeout=10s;
      server webapp-05.prod.gitter:5023 max_fails=3 fail_timeout=10s;
  }
  # ...
  ```

- Do we need to clear some cached IPs, like a `known_hosts` for SSH?
  - Maybe. I just restarted nginx, which seemed to do something; see below.
I just restarted nginx on apps-01 and apps-02 and everything seems to be healthy now.

- `nginx -t` to check that the configuration is valid
- `nginx -s reload` to reload nginx
| apps-01 | apps-02 |
|---|---|
| *(screenshot)* | *(screenshot)* |
The health check endpoint now loads again as well: https://avatars-04.gitter.im/api/private/health_check

```
OK from webapp-05:5025, running 8ac7d5, branch 19-48-0, commit 8ac7d5c7e70535f534afc2e919e320121a2b1fcb
```
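For future incidents, here is a quick sketch for spot-checking the health endpoint on every CDN hostname at once. The hostnames are the ones listed at the top of this issue; nothing else here is Gitter-specific:

```shell
#!/bin/sh
# Print a spot-check command for each avatars CDN hostname.
# Pipe the output to `sh` to actually run the checks.
for i in 01 02 03 04 05; do
  echo "curl -sf -o /dev/null https://avatars-${i}.gitter.im/api/private/health_check && echo avatars-${i} OK"
done
```

Piping the output to `sh` runs each check; `curl -f` makes non-2xx responses (like the 504 above) exit non-zero, so only healthy hosts print `OK`.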
## Preventing this in the future
Two options:
- Make nginx not "cache" DNS, so `webapp-0x.prod.gitter:5025` always points to the new instance
  - This one seems more reasonable
- When a new `webapp` server is added, restart nginx on the `apps-0x` servers
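A sketch of what the first option could look like. This is illustrative, not our deployed config: the resolver IP is an assumption (a placeholder for the VPC DNS server), and it relies on nginx's documented behavior that hostnames in `upstream` blocks and static `proxy_pass` targets are resolved once at config load, while a hostname held in a variable is re-resolved at request time through the `resolver` directive:

```nginx
server {
    listen 80;

    # Placeholder resolver address (assumption: the VPC DNS server).
    # "valid=10s" overrides the record's TTL for nginx's DNS cache.
    resolver 10.0.0.2 valid=10s;

    location / {
        # Because the hostname is in a variable, nginx re-resolves it
        # at request time instead of pinning the IP from startup.
        set $backend "webapp-01.prod.gitter";
        proxy_pass http://$backend:5025;
    }
}
```

The trade-off: putting a variable in `proxy_pass` bypasses the `upstream` block, so we would lose `max_fails`/`fail_timeout` and the round-robin across all webapp servers unless that is handled another way.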
^ This stuff is now tracked by https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6984
## Alerting
We should have seen some `avatars_health_check_failed` alerts in PagerDuty, but they are probably tuned down too much:
- https://gitter.pagerduty.com/incidents/P80EQ33
- https://gitter.pagerduty.com/incidents/PQ9DU46
- https://gitter.pagerduty.com/incidents/PI4IKWJ
We should also hook `monit-prod-error` up to our Slack integration with #gitter-alerts so we have more visibility: https://gitter.pagerduty.com/services/P16ONUD/integrations
## Other reports
cc @viktomas