Potential bug with mailroom Pod

Summary

When misconfigured, the mailroom process does not run, yet the Pod remains in a state that indicates otherwise. Because of this, I worry that our scripting does not have the appropriate monitoring to report failures to Kubernetes.

Steps to reproduce

Provide a minimal configuration to the mailroom component. For example, purposely set a hostname that is valid but does not respond on the IMAP TCP port.

Observe that the Pod is in a ready state:

NAME                                  READY   STATUS      RESTARTS   AGE
a-mailroom-7c996fd78c-28sd9           1/1     Running     0          92m

Observe that the ruby process responsible for mailroom is not running:

% kubectl exec -it a-mailroom-7c996fd78c-28sd9 /bin/sh
$ ps -efl
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S git            1       0  0  80   0 -  1070 wait   17:33 ?        00:00:00 /bin/sh -c /scripts/process-wrapper
4 S git            7       1  0  80   0 -  4933 wait   17:33 ?        00:00:00 /bin/bash /scripts/process-wrapper
0 S git            9       7  0  80   0 -  1493 hrtime 17:33 ?        00:00:00 tail -f /var/log/gitlab/mail_room.log
4 S git         5336       0  0  80   0 -  1070 wait   19:01 pts/0    00:00:00 /bin/sh
0 R git         5341    5336  0  80   0 -  9596 -      19:01 pts/0    00:00:00 ps -efl

Observe the logs, which show the connection failure:

% k logs a-mailroom-7c996fd78c-28sd9
+ /scripts/set-config /etc /etc
+ exec /bin/sh -c /scripts/process-wrapper
Begin parsing .erb files from /etc
Starting Mailroom
/usr/lib/ruby/2.6.0/net/imap.rb:1136:in `rescue in tcp_socket': Timeout to open TCP connection to 172.16.1.1.xip.io:993 (exceeds 30 seconds) (Net::OpenTimeout)
        from /usr/lib/ruby/2.6.0/net/imap.rb:1131:in `tcp_socket'
        from /usr/lib/ruby/2.6.0/net/imap.rb:1089:in `initialize'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:74:in `new'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:74:in `imap'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:84:in `log_in'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:68:in `setup'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:8:in `initialize'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:57:in `new'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:57:in `connection'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:28:in `run'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/coordinator.rb:19:in `each'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/coordinator.rb:19:in `run'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/cli.rb:52:in `start'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/bin/mail_room:5:in `<top (required)>'
        from /usr/bin/mail_room:23:in `load'
        from /usr/bin/mail_room:23:in `<main>'
/usr/lib/ruby/2.6.0/socket.rb:61:in `connect_internal': Connection timed out - user specified timeout (Errno::ETIMEDOUT)
        from /usr/lib/ruby/2.6.0/socket.rb:137:in `connect'
        from /usr/lib/ruby/2.6.0/socket.rb:641:in `block in tcp'
        from /usr/lib/ruby/2.6.0/socket.rb:227:in `each'
        from /usr/lib/ruby/2.6.0/socket.rb:227:in `foreach'
        from /usr/lib/ruby/2.6.0/socket.rb:631:in `tcp'
        from /usr/lib/ruby/2.6.0/net/imap.rb:1132:in `tcp_socket'
        from /usr/lib/ruby/2.6.0/net/imap.rb:1089:in `initialize'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:74:in `new'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:74:in `imap'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:84:in `log_in'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:68:in `setup'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/connection.rb:8:in `initialize'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:57:in `new'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:57:in `connection'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/mailbox_watcher.rb:28:in `run'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/coordinator.rb:19:in `each'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/coordinator.rb:19:in `run'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/lib/mail_room/cli.rb:52:in `start'
        from /usr/lib/ruby/gems/2.6.0/gems/mail_room-0.9.1/bin/mail_room:5:in `<top (required)>'
        from /usr/bin/mail_room:23:in `load'
        from /usr/bin/mail_room:23:in `<main>'

Configuration used

      incomingEmail:
        enabled: false
        address: ""
        host: "172.16.1.1.xip.io"
        port: 993
        ssl: true
        startTls: false
        user: ""
        password:
          secret: "imap-creds"
          key: password
        mailbox: inbox
        idleTimeout: 60

Current behavior

The Pod does not crash. It stays in the Running state with zero restarts.

Expected behavior

The Pod should crash. For comparison, a Pod with a working mailroom configuration has a running ruby process:

% k exec -it b-mailroom-5f87b4695-98bjs /bin/sh
$ ps -efl
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S git            1       0  0  80   0 -  1070 wait   16:51 ?        00:00:00 /bin/sh -c /scripts/process-wrapper
4 S git            7       1  0  80   0 -  4930 wait   16:51 ?        00:00:00 /bin/bash /scripts/process-wrapper
4 S git            8       7  0  80   0 - 63823 poll_s 16:51 ?        00:00:01 /usr/bin/ruby /usr/bin/mail_room -c /var/opt
0 S git            9       7  0  80   0 -  1493 hrtime 16:51 ?        00:00:00 tail -f /var/log/gitlab/mail_room.log
4 S git         8502       0  0  80   0 -  1070 wait   19:12 pts/0    00:00:00 /bin/sh
0 R git         8507    8502  0  80   0 -  9596 -      19:12 pts/0    00:00:00 ps -efl
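
The difference between the two process listings can also be checked mechanically. The following is a minimal sketch, not the chart's actual probe: the command lines are copied from the listings in this report, while the helper name has_match and both patterns are assumptions made for illustration. It shows that a plain substring match on mail_room succeeds even in the broken Pod (because of the tail command), whereas a pattern anchored on the ruby interpreter only succeeds in the working Pod.

```shell
#!/bin/sh
# Command lines copied from the two ps listings in this report.
broken_pod_cmds='/bin/sh -c /scripts/process-wrapper
/bin/bash /scripts/process-wrapper
tail -f /var/log/gitlab/mail_room.log'

working_pod_cmds='/bin/sh -c /scripts/process-wrapper
/bin/bash /scripts/process-wrapper
/usr/bin/ruby /usr/bin/mail_room -c /var/opt
tail -f /var/log/gitlab/mail_room.log'

# Hypothetical helper: succeeds if any command line matches the
# extended regular expression given as the first argument.
has_match() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Loose: any command line containing mail_room (assumed pattern).
loose='mail_room'
# Strict: first token ends in "ruby", second token is a mail_room script
# (assumed pattern, for illustration only).
strict='^[^ ]*ruby [^ ]*/mail_room( |$)'

# Print which patterns match for one named set of command lines.
report() {
  if has_match "$loose" "$2"; then l=match; else l=no-match; fi
  if has_match "$strict" "$2"; then s=match; else s=no-match; fi
  echo "$1: loose=$l strict=$s"
}

report broken "$broken_pod_cmds"
report working "$working_pod_cmds"
```

With the listings above, the loose pattern matches in both Pods, so it cannot tell them apart. Only the strict pattern distinguishes the working Pod from the broken one.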

Versions

Chart: 9df0f92e27b3106c2fb4c41da8c6357fdc8d02bd
Platform:
- Cloud: GKE
Kubernetes:
- Client: 1.14.7
- Server: 1.14.7
Helm:
- Client: 2.14.2
- Server: n/a

Potential Root Cause

Our wrapper script https://gitlab.com/gitlab-org/build/CNG/blob/0a9e17d9308d0c8056f0cc212097613544b23b4d/gitlab-mailroom/scripts/process-wrapper#L8 performs a tail that never exits, so we never reach our wait command. When the ruby process dies, the wrapper script keeps running because of the tail.

Our health checks are also not valid: https://gitlab.com/gitlab-org/charts/gitlab/blob/master/charts/gitlab/charts/mailroom/templates/deployment.yaml#L86-97. With --full, pgrep matches the pattern against the entire command line, so it matches any process whose command line contains mail_room, including the tail command (tail -f /var/log/gitlab/mail_room.log). As a result, Kubernetes never finds out when the ruby process responsible for Mailroom has died.
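
One possible direction for the wrapper, sketched under assumptions rather than taken from the CNG repository: run the log tail in the background, wait on the main process, and exit with its status so the container terminates when the main process dies. The function name run_wrapped and its arguments are hypothetical. The real script would launch mail_room with its own options.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; run_wrapped is not part of the CNG scripts.
# Usage: run_wrapped LOGFILE COMMAND [ARGS...]
run_wrapped() {
  logfile=$1
  shift

  # Make sure the log file exists before anything reads from it.
  : >>"$logfile"

  # Start the main process in the background, appending output to the log.
  "$@" >>"$logfile" 2>&1 &
  main_pid=$!

  # Stream the log in the background so it cannot block the wait below.
  tail -f "$logfile" &
  tail_pid=$!

  # Block until the main process exits and record its exit status.
  status=0
  wait "$main_pid" || status=$?

  # Stop the log streamer and hand the main process's status to the caller.
  kill "$tail_pid" 2>/dev/null || true
  wait "$tail_pid" 2>/dev/null || true
  return "$status"
}
```

A wrapper script built this way would end with something like `run_wrapped /var/log/gitlab/mail_room.log <mail_room command>` followed by `exit $?`, so that a dead ruby process ends PID 1 and Kubernetes restarts the container.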

Edited Aug 06, 2020 by John Jarvis