Gitter WS servers fail to start when there is an incorrect instance in the autoscaling group
Today, while performing a WS server deploy (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7539), the following situation occurred:
- Due to a bug in the user-data script, a newly created instance wasn't able to set its hostname and its Name AWS tag.
- When creating the Ansible inventory for provisioning the next instance, the empty name was treated as a valid name.
- As a result, the following NGINX config was generated:
upstream gitter-websockets-backend {
server localhost:5030 max_fails=3 fail_timeout=10s;
server localhost:5031 max_fails=3 fail_timeout=10s;
server ws-02.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-02.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-05.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-05.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-07.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-07.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-03.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-03.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-01.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-01.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-04.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-04.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-08.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-08.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server ws-06.prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server ws-06.prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
server .prod.gitter:5030 max_fails=3 fail_timeout=10s backup;
server .prod.gitter:5031 max_fails=3 fail_timeout=10s backup;
keepalive 1024;
}
...
- The last entry in the config (the one with the empty host name, .prod.gitter) causes an Ansible task to fail, and because nginx can't start, the whole startup fails:
RUNNING HANDLER [gitter/websockets : reload nginx] *****************************
fatal: [ws-09.prod.gitter]: FAILED! => {"changed": true, "cmd": ["nginx", "-t"], "delta": "0:00:00.018724", "end": "2019-08-19 07:33:14.277307", "msg": "non-zero return code", "rc": 1, "start": "2019-08-19 07:33:14.258583", "stderr": "nginx: [emerg] host not found in upstream \".prod.gitter:5030\" in /etc/nginx/sites-enabled/gitter-websockets.conf:21\nnginx: configuration file /etc/nginx/nginx.conf test failed", "stderr_lines": ["nginx: [emerg] host not found in upstream \".prod.gitter:5030\" in /etc/nginx/sites-enabled/gitter-websockets.conf:21", "nginx: configuration file /etc/nginx/nginx.conf test failed"], "stdout": "", "stdout_lines": []}
Solution
The naive solution is to ignore instances with an empty name when creating the inventory.
A better solution would be to make the nginx startup independent of the other WS instances.
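
A minimal sketch of the naive solution, for illustration only. The actual inventory script is not shown in this issue, so everything here is an assumption: the function names `instance_name` and `inventory_hosts` are hypothetical, the instance records are assumed to carry tags in the `Tags: [{"Key": ..., "Value": ...}]` shape that EC2 returns, and the host name is assumed to be the Name tag plus the `prod.gitter` suffix seen in the config above.

```python
from typing import Iterable, List, Mapping


def instance_name(instance: Mapping) -> str:
    """Return the value of the instance's Name tag, or '' if it is missing."""
    for tag in instance.get("Tags") or []:
        if tag.get("Key") == "Name":
            # Treat whitespace-only values the same as an empty name.
            return (tag.get("Value") or "").strip()
    return ""


def inventory_hosts(instances: Iterable[Mapping], domain: str = "prod.gitter") -> List[str]:
    """Build inventory host names, skipping instances without a usable Name tag.

    Assumption: a host name is "<Name tag>.<domain>". An instance whose
    user-data script failed to set the tag would otherwise produce
    ".prod.gitter", which nginx cannot resolve.
    """
    hosts = []
    for instance in instances:
        name = instance_name(instance)
        if not name:
            # Hypothetical policy: silently skip the broken instance.
            # A real script would probably also log a warning here.
            continue
        hosts.append(f"{name}.{domain}")
    return hosts
```

With this filter, an instance that has an empty or missing Name tag never reaches the generated upstream block, so `nginx -t` no longer fails on `.prod.gitter`. It does not address the second point: nginx would still refuse to start if a correctly named instance could not be resolved.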