
GitLab server Docker container crashing EC2 instance host

Summary

In the future we hope to run GitLab in Kubernetes, but we're not in a position to do that right now. We're in the process of migrating from on-prem hardware to an AWS EC2 instance, and opted to run GitLab in Docker on the host. This is working now, and is provisioned entirely through user-data -> cloud-init -> docker-compose. The only problem is that about once a day (sometimes a few times a day, sometimes not for a few days) the CPU spikes to 100%, the EC2 instance becomes unresponsive, and AWS terminates the instance (it's in an ASG). The config and logs are on a persistent EBS volume, but nothing is written to the logs during that time.

Steps to reproduce

  1. EC2 instance in an Autoscaling group
  2. User-data like:
aws ec2 attach-volume --volume-id $VOL_ID --instance-id $INST_ID --device /dev/sdh
mkdir -p /mnt/ebs
echo "/dev/nvme1n1 /mnt/ebs  ext4  defaults,nofail  0  2" >> /etc/fstab
mount -a
...
cat << EOF > /root/docker-compose.yml
version: '3'
services:
  gitlab:
    image: 'gitlab/gitlab-ee:latest'
...
docker-compose -f /root/docker-compose.yml up -d

docker-compose.yml:

version: '3'
services:
  gitlab:
    image: 'gitlab/gitlab-ee:latest'
    restart: always
    hostname: 'gitlab.example.com'
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        # General config
        external_url 'https://gitlab.example.com'
...
        # Prometheus monitoring

        prometheus['enable'] = false
        gitlab_monitor['listen_address'] = '0.0.0.0'
        gitlab_monitor['listen_port'] = '9168'
        gitaly['prometheus_listen_addr'] = '0.0.0.0:9236'
        redis_exporter['listen_address'] = '0.0.0.0:9121'
        postgres_exporter['listen_address'] = '0.0.0.0:9187'

    ports:
      - '2222:22'
      - '80:80'
      - '443:443'
      - '9168:9168'
      - '9236:9236'
      - '9121:9121'
      - '9187:9187'

    volumes:
      - '/mnt/ebs/gitlab/config:/etc/gitlab'
      - '/mnt/ebs/gitlab/logs/gitlab:/var/log/gitlab'
      - '/mnt/ebs/gitlab/data:/var/opt/gitlab'

I've removed a lot of detail from the above, both because it contained sensitive data and to leave out the logic around attaching the EBS volume. That part works flawlessly, and the GitLab server always comes back up with its original config and logs.

What is the current bug behavior?

On average, about once a day we see a spike in CPU (recorded both by AWS CloudWatch and by Prometheus via the node_exporter), followed a few minutes later by the EC2 instance becoming unresponsive and failing EC2 status checks. As the instance is in an autoscaling group, AWS terminates the instance and it gets reprovisioned.

What is the expected correct behavior?

We would like GitLab/Docker to crash no more than once a month.

Relevant logs

There's nothing. The logs get cut off right at the start of the CPU spike, and resume only after the fresh instance comes up.
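Since the GitLab logs stop at the start of the spike, one option would be to record host-level load and memory figures on the persistent volume, so that something survives the instance being replaced. The following is only a minimal sketch of that idea, not something we currently run: the `snapshot` and `run_snapshots` names, the `LOG_FILE` path and the 10 second interval are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of the setup described in this report.
# LOG_FILE is an assumed location on the persistent EBS mount.
LOG_FILE="${LOG_FILE:-/mnt/ebs/gitlab/logs/host-snapshot.log}"

# Print one line: UTC timestamp, load averages, available memory.
snapshot() {
  # /proc/loadavg starts with the 1, 5 and 15 minute load averages.
  load=$(cut -d ' ' -f 1-3 /proc/loadavg)
  # MemAvailable is reported in kB (present in kernels 3.14 and later).
  mem_avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
  printf '%s load=%s mem_available_kb=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$load" "$mem_avail"
}

# Append a snapshot every 10 seconds (assumed interval) and flush to disk,
# so the last lines before the host hangs have a chance of being persisted.
run_snapshots() {
  while true; do
    snapshot >> "$LOG_FILE"
    sync
    sleep 10
  done
}
```

Calling `run_snapshots &` from user-data after the volume is mounted would start the loop in the background. Whether the final samples actually reach disk before the host hangs is not guaranteed.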

Details of package version

Debian:

> lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 8.11 (jessie)
Release:	8.11
Codename:	jessie

Docker:

> docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:25:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

GitLab:

> docker image ls --no-trunc
REPOSITORY          TAG                 IMAGE ID                                                                  CREATED             SIZE
gitlab/gitlab-ee    latest              sha256:81cec2cfbac7bdaacc25f2c1e70b1df8ace916d4abc2a638151ee5a355265b71   40 hours ago        1.75GB
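Because the compose file uses the `latest` tag, the image ID above is the only thing that identifies which build was running, and a reprovisioned instance may pull a different one. As a rough sketch, the listing could be reduced to one line per image and appended to a file on the persistent volume at each boot. The `image_ids` function name and the log path in the usage line are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `docker image ls --no-trunc` output on stdin and
# print "repository:tag image-id" for each image row, skipping the header.
image_ids() {
  awk 'NR > 1 && NF >= 3 { print $1 ":" $2, $3 }'
}

# Assumed usage, with a hypothetical log path on the EBS mount:
#   docker image ls --no-trunc | image_ids >> /mnt/ebs/gitlab/image-history.log
```

Comparing the recorded IDs across reprovisions would show whether the crashes line up with a particular image build.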

Environment details

  • Operating System: Debian Jessie 8.11
  • Installation Target:
    • VM: AWS
  • Installation Type:
    • New Installation
  • Is there any other software running on the machine: No
  • Is this a single or multiple node installation? Single
  • Resources
    • CPU: 4 (m5.xlarge)
    • Memory total: 16GB (m5.xlarge)

Configuration details

See docker-compose.yml above.