Gitaly behind AWS "Classic" ELB (for TLS termination) fails (Service Unavailable) with no error message

Summary

Enabling TLS for a remote Gitaly server causes the readiness check to fail (503 Service Unavailable) when using an Amazon Web Services "Classic" ELB configured with an ACM cert to terminate TLS, with either a TCP gRPC Gitaly backend or a non-verified (e.g. self-signed) TLS gRPC Gitaly backend.

Steps to reproduce

GitLab application server, /etc/gitlab/gitlab.rb:

git_data_dirs({
  'storage1' => { 
    'path' => '/srv/gitlab/storage1/repositories/', 
    'gitaly_address' => 'tls://gitaly.example.org:9999',
  },
})

Gitaly server, /etc/gitlab/gitlab.rb:

gitaly['tls_listen_addr'] = "0.0.0.0:9999"
gitaly['certificate_path'] = "/var/opt/gitlab/gitaly/ssl/gitaly-selfsigned.crt"
gitaly['key_path'] = "/var/opt/gitlab/gitaly/ssl/private/gitaly-selfsigned.key"
gitaly['storage'] = [
  { 'name' => 'storage1', 'path' => '/srv/gitlab/storage1/repositories',},
]

Self-signed certificates, generated with certtool (from gnutls-bin):

#!/bin/sh 
install -o git -g git -d /var/opt/gitlab/gitaly/ssl/
install -o git -g git -d /var/opt/gitlab/gitaly/ssl/private/

tee -a /var/opt/gitlab/gitaly/ssl/gitaly-selfsigned.template <<'EOF'
organization = "Example Org"
unit = "gitlab"
cn = "gitaly"
expiration_days = 365
tls_www_server
encryption_key
EOF

certtool --generate-privkey \
    --outfile /var/opt/gitlab/gitaly/ssl/private/gitaly-selfsigned.key
certtool --generate-self-signed \
    --load-privkey /var/opt/gitlab/gitaly/ssl/private/gitaly-selfsigned.key \
    --outfile /var/opt/gitlab/gitaly/ssl/gitaly-selfsigned.crt \
    --template /var/opt/gitlab/gitaly/ssl/gitaly-selfsigned.template
certtool --certificate-info \
    --infile /var/opt/gitlab/gitaly/ssl/gitaly-selfsigned.crt

On AWS:

  • internal hosted zone "example.org"
  • internal Classic Load Balancer, ssl:9999 -> ssl:9999, with an ACM cert for *.example.org, backed by the Gitaly server

(described by the following Terraform configuration)

resource "aws_route53_record" "gitaly" {
  zone_id = "<internal example.org zone_id>"
  name = "gitaly"
  type = "A"
  alias {
    name = "${aws_elb.gitaly.dns_name}"
    zone_id = "${aws_elb.gitaly.zone_id}"
    evaluate_target_health = false
  }
}

resource "aws_elb" "gitaly" {
  name = "gitaly"
  internal = true
  cross_zone_load_balancing = true
  idle_timeout = 3630  # 1hr + 30sec: ssh sessions, upstream nginx set to 3600
  connection_draining = true
  connection_draining_timeout = 240
  security_groups = [
    # tcp 8075,9999 this elb  ingress from gitlab application server permitted 
    # tcp 8075,9999 this elb  egress to any permitted  (or no outbound restrictions)
    # tcp 8075,9999 gitaly server  ingress from this elb permitted 
  ]
  subnets = [ 
    # vpc subnets
  ]
  listener {
    instance_port = 9999
    instance_protocol = "ssl"
    lb_port = 9999
    lb_protocol = "ssl"
    ssl_certificate_id = "<valid-acm-cert>"  # for example *.example.org for gitaly.example.org
  }
  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 4
    timeout = 5
    interval = 30
    target = "TCP:9999"
  }
}
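
The health check above uses a plain TCP target, so an instance reported as healthy only proves that port 9999 accepts connections, not that a gRPC call succeeds. To confirm what the ELB itself thinks of the backend, something like the following could be used. This is a minimal sketch, assuming the AWS CLI is installed and configured; elb_instance_states is a hypothetical helper name, and "gitaly" is the load balancer name from the Terraform above.

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab): lists each backend instance of a
# Classic ELB with its health state and reason code, one instance per line.
# $1 is the load balancer name, e.g. "gitaly" as in the Terraform above.
elb_instance_states() {
    aws elb describe-instance-health \
        --load-balancer-name "$1" \
        --query 'InstanceStates[].[InstanceId,State,ReasonCode]' \
        --output text
}

# Usage:
#   elb_instance_states gitaly
```

If the instance shows as InService while the readiness check still returns 503, the problem is more likely above the TCP layer than in the security groups or the listener ports.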

What is the current bug behaviour?

Unicorn reports no errors, and Gitaly logs no errors.

(with tcp:8075 -> tcp:8075, GitLab Unicorn reports that the self-signed cert is not trusted)
(with ssl:9999 -> tcp:8075, GitLab logs no error, Gitaly logs no error)
(with ssl:9999 -> ssl:9999, GitLab logs no error, Gitaly logs no error)
(with tcp:8075 -> tcp:8075, works normally, as if there were no ELB, but also lacks transport encryption)

Under all TLS-enabled circumstances, whether TLS is terminated by the self-signed or the ACM certificate, the GitLab readiness check reports 503 Service Unavailable. With the self-signed certificate, the TLS failure is at least logged; with the ACM certificate, no failure reason is logged.

With the help of gitlab:check, I found that the Gitaly storage check fails with "storage1 ... FAIL: 14:Connect Failed". However, from both the GitLab server and the Gitaly server, TCP connections and TLS connections to the Gitaly server as configured succeed:

$ nc -vzw1 gitaly.example.org 8075
Connection to gitaly.example.org 8075 port [tcp/*] succeeded!
$ gnutls-cli gitaly.example.org:9999
Processed 133 CA certificate(s).
Resolving 'gitaly.example.org:9999'...
Connecting to '10.15.165.93:9999'...
- Certificate type: X.509
- Got a certificate list of 4 certificates.
- Certificate[0] info:
 - subject `CN=*.example.org', issuer `CN=Amazon,OU=Server CA 1B,O=Amazon,C=US', serial 0x0ff9193f85d30c6c5e05b6f8b6d1428b, RSA key 2048 bits, signed using RSA-SHA256, activated `2018-08-16 00:00:00 UTC', expires `2019-09-16 12:00:00 UTC', pin-sha256="I/HmJbuHiHBUfM61YRgSt/IVrU055skqkAaN6htDF0I="
	Public Key ID:
		sha1:bce690fde4e41e0fb269beb144e7073577bed79d
		sha256:23f1e625bb878870547cceb5611812b7f215ad4d39e6c92a90068dea1b431742
	Public Key PIN:
		pin-sha256:I/HmJbuHiHBUfM61YRgSt/IVrU055skqkAaN6htDF0I=
	Public key's random art:
		+--[ RSA 2048]----+
		|                 |
		|                 |
		|            o . .|
		|       .   . o o |
		|        S o     .|
		|       + + .    =|
		|      o B * .  E+|
		|       =.% =    .|
		|       oBo= .    |
		+-----------------+

- Certificate[1] info:
 - subject `CN=Amazon,OU=Server CA 1B,O=Amazon,C=US', issuer `CN=Amazon Root CA 1,O=Amazon,C=US', serial 0x067f94578587e8ac77deb253325bbc998b560d, RSA key 2048 bits, signed using RSA-SHA256, activated `2015-10-22 00:00:00 UTC', expires `2025-10-19 00:00:00 UTC', pin-sha256="JSMzqOOrtyOT1kmau6zKhgT676hGgczD5VMdRMyJZFA="
- Certificate[2] info:
 - subject `CN=Amazon Root CA 1,O=Amazon,C=US', issuer `CN=Starfield Services Root Certificate Authority - G2,O=Starfield Technologies\, Inc.,L=Scottsdale,ST=Arizona,C=US', serial 0x067f944a2a27cdf3fac2ae2b01f908eeb9c4c6, RSA key 2048 bits, signed using RSA-SHA256, activated `2015-05-25 12:00:00 UTC', expires `2037-12-31 01:00:00 UTC', pin-sha256="++MBgDH5WGvL9Bcn5Be30cRcL0f5O+NyoXuWtQdX1aI="
- Certificate[3] info:
 - subject `CN=Starfield Services Root Certificate Authority - G2,O=Starfield Technologies\, Inc.,L=Scottsdale,ST=Arizona,C=US', issuer `OU=Starfield Class 2 Certification Authority,O=Starfield Technologies\, Inc.,C=US', serial 0x00a70e4a4c3482b77f, RSA key 2048 bits, signed using RSA-SHA256, activated `2009-09-02 00:00:00 UTC', expires `2034-06-28 17:39:16 UTC', pin-sha256="KwccWaCgrnaw6tsrrSO61FgLacNgG2MMLq8GE6+oP5I="
- Status: The certificate is trusted. 
- Description: (TLS1.2)-(ECDHE-RSA-SECP256R1)-(AES-128-GCM)
- Session ID: A2:57:9E:C6:37:8A:2B:11:35:C5:99:26:65:8C:F5:51:DF:84:52:88:50:6F:3E:A2:45:A7:24:03:77:05:A4:6F
- Ephemeral EC Diffie-Hellman parameters
 - Using curve: SECP256R1
 - Curve size: 256 bits
- Version: TLS1.2
- Key Exchange: ECDHE-RSA
- Server Signature: RSA-SHA512
- Cipher: AES-128-GCM
- MAC: AEAD
- Compression: NULL
- Options: safe renegotiation,
- Handshake was completed

- Simple Client Mode:


# gitaly replies with 0x04
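
The gnutls-cli session above shows that the TLS handshake with the ELB completes, but it does not show whether an application protocol was negotiated. gRPC runs over HTTP/2, and gRPC clients generally expect the TLS peer to select h2 via ALPN, so one thing that may be worth ruling out is whether the listener negotiates ALPN at all. This is an assumption, not something confirmed anywhere in this report. A minimal sketch, assuming OpenSSL 1.0.2 or newer is available; alpn_selected is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab or Gitaly): reads `openssl s_client`
# output on stdin and prints the ALPN protocol the server selected,
# or "none" if no "ALPN protocol:" line is present.
alpn_selected() {
    awk -F': ' '
        /^ALPN protocol:/ { print $2; found = 1 }
        END { if (!found) print "none" }
    '
}

# Usage against the ELB listener from this report:
#   openssl s_client -connect gitaly.example.org:9999 -alpn h2 \
#       </dev/null 2>/dev/null | alpn_selected
```

Running the same check directly against the Gitaly server's TLS port, bypassing the ELB, would show whether the two endpoints differ. A result of "none" from the ELB and "h2" from Gitaly would be consistent with the load balancer being the component that breaks the gRPC connection, though it would not prove it.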

What is the expected correct behavior?

Ideally, the TLS gRPC client should cooperate with TLS-terminating middleware like ELB, and Gitaly behind an ELB should be able to function.

At the very least, what is failing should be logged somewhere. Alternatively, the readiness check could report what failed when accessed from a trusted source (e.g. localhost).
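
For reference, the readiness check can be queried by hand from the application server to see both the status code and whatever detail the body carries. A minimal sketch, assuming curl is installed and that the request comes from an address allowed by the monitoring IP whitelist (localhost by default); readiness_probe is a hypothetical helper name, and the default base URL is an assumption that may need adjusting if nginx redirects to HTTPS:

```shell
#!/bin/sh
# Hypothetical helper: prints the body of GitLab's /-/readiness endpoint,
# followed by a line with the HTTP status code.
# $1 is the base URL of the GitLab instance (defaults to http://localhost).
readiness_probe() {
    curl --silent --show-error --max-time 10 \
        --write-out '\nHTTP status: %{http_code}\n' \
        "${1:-http://localhost}/-/readiness"
}

# Usage:
#   readiness_probe
#   readiness_probe https://gitlab.example.org
```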

Relevant logs and/or screenshots

As stated above, the lack of a log entry is the complaint. I can strace any GitLab component between the application server and the Gitaly server on request. I couldn't identify anything at fault in the network stacks of the Unicorn or Gitaly processes when triggering the readiness check.

Results of GitLab environment info

System information
System:         Ubuntu 18.04
Proxy:          no
Current User:   git
Using RVM:      no
Ruby Version:   2.5.3p105
Gem Version:    2.7.6
Bundler Version:1.16.6
Rake Version:   12.3.2
Redis Version:  3.2.12
Git Version:    2.18.1
Sidekiq Version:5.2.5
Go Version:     unknown

GitLab information
Version:        11.8.0-ee
Revision:       002a28279f5
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     postgresql
DB Version:     9.6.9
URL:            https://
HTTP Clone URL: https:///some-group/some-project.git
SSH Clone URL:  git@:some-group/some-project.git
Elasticsearch:  no
Geo:            no
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers: saml

GitLab Shell
Version:        8.4.4
Repository storage paths:

  • default: /var/opt/gitlab/git-data/repositories
  • storage1: /srv/gitlab/storage1/repositories/repositories

Hooks:          /opt/gitlab/embedded/service/gitlab-shell/hooks
Git:            /opt/gitlab/embedded/bin/git

Results of GitLab application Check

Checking GitLab subtasks ...

Checking GitLab Shell ...

GitLab Shell: ... GitLab Shell version >= 8.4.4 ? ... OK (8.4.4)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Redis available via internal API: OK

Access to /var/opt/gitlab/.ssh/authorized_keys: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Gitaly ...

Gitaly: ... default ... OK
storage1 ... FAIL: 14:Connect Failed   # nc -vzw1 gitaly.example.org 9999, TCP connections do work

Checking Gitaly ... Finished

Checking Sidekiq ...

Sidekiq: ... Running? ... yes
Number of Sidekiq processes ... 1

Checking Sidekiq ... Finished

Checking Incoming Email ...

Incoming Email: ... Checking Reply by email ...

IMAP server credentials are correct? ... yes
Init.d configured correctly? ... skipped
MailRoom running? ... skipped

Checking Reply by email ... Finished

Checking Incoming Email ... Finished

Checking LDAP ...

LDAP: ... LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab App ...

Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
3/1 ... yes
3/3 ... yes
5/4 ... yes
3/6 ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.3.5 ? ... yes (2.5.3)
Git version >= 2.18.0 ? ... yes (2.18.1)
Git user has default SSH configuration? ... yes
Active users: ... 4
Elasticsearch version 5.6 - 6.x? ... skipped (elasticsearch is disabled)

Checking GitLab App ... Finished

Checking GitLab subtasks ... Finished

Possible fixes

I suspect the fault is on the Unicorn side.

Edited Feb 26, 2019 by Dylan Grafmyre