GitLab server Docker container crashing EC2 instance host
Summary
In the future we hope to run GitLab in Kubernetes, but we're not in a position to do that right now. We're in the process of migrating from on-prem hardware to an AWS EC2 instance, and opted to run GitLab in Docker on the host. This is working now, all driven by user-data -> cloud-init -> docker-compose. The only problem is that about once a day (sometimes a few times a day, sometimes not for a few days) the CPU spikes to 100%, the EC2 instance becomes unresponsive, and the instance is killed by AWS (it's in an ASG). We keep the config and logs on a persistent EBS volume, but there is nothing in the logs covering that period.
Steps to reproduce
- EC2 instance in an Autoscaling group
- User-data like:
aws ec2 attach-volume --volume-id $VOL_ID --instance-id $INST_ID --device /dev/sdh
mkdir -p /mnt/ebs
echo "/dev/nvme1n1 /mnt/ebs ext4 defaults,nofail 0 2" >> /etc/fstab
mount -a
...
cat << EOF > /root/docker-compose.yml
version: '3'
services:
  gitlab:
    image: 'gitlab/gitlab-ee:latest'
...
docker-compose -f /root/docker-compose.yml up -d
docker-compose.yml:
version: '3'
services:
  gitlab:
    image: 'gitlab/gitlab-ee:latest'
    restart: always
    hostname: 'gitlab.example.com'
    environment:
      GITLAB_OMNIBUS_CONFIG: |
        # General config
        external_url 'https://gitlab.example.com'
        ...
        # Prometheus monitoring
        prometheus['enable'] = false
        gitlab_monitor['listen_address'] = '0.0.0.0'
        gitlab_monitor['listen_port'] = '9168'
        gitaly['prometheus_listen_addr'] = '0.0.0.0:9236'
        redis_exporter['listen_address'] = '0.0.0.0:9121'
        postgres_exporter['listen_address'] = '0.0.0.0:9187'
    ports:
      - '2222:22'
      - '80:80'
      - '443:443'
      - '9168:9168'
      - '9236:9236'
      - '9121:9121'
      - '9187:9187'
    volumes:
      - '/mnt/ebs/gitlab/config:/etc/gitlab'
      - '/mnt/ebs/gitlab/logs/gitlab:/var/log/gitlab'
      - '/mnt/ebs/gitlab/data:/var/opt/gitlab'
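For reference, ports 9168/9236/9121/9187 are published so an external Prometheus can scrape the exporters; a plain HTTP check from another host is enough to confirm they respond. This is just a sanity-check sketch, using the placeholder hostname gitlab.example.com from the config above:

```sh
# Sanity-check sketch: each published exporter port should answer with
# Prometheus-format metrics. gitlab.example.com is the placeholder hostname.
for port in 9168 9236 9121 9187; do
  echo "--- port ${port} ---"
  curl -sf --max-time 5 "http://gitlab.example.com:${port}/metrics" | head -n 3
done
```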
I've removed a number of details from the above, partly because they contain sensitive data and partly to cut out some of the logic around attaching the EBS volume. That part works flawlessly, and the GitLab server always comes back up with its original config and logs.
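For context, the attach-and-mount step that was trimmed out looks roughly like the sketch below. This is not our exact script; the wait loop and device name are assumptions for an m5 (Nitro) instance, where an attached EBS volume shows up as an NVMe device such as /dev/nvme1n1 rather than /dev/sdh.

```sh
# Rough sketch only (assumptions: Nitro instance, volume appears as /dev/nvme1n1).
aws ec2 attach-volume --volume-id "$VOL_ID" --instance-id "$INST_ID" --device /dev/sdh

# Wait for the kernel to expose the attached volume before mounting it.
while [ ! -b /dev/nvme1n1 ]; do
  sleep 2
done

mkdir -p /mnt/ebs
mount /dev/nvme1n1 /mnt/ebs
```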
What is the current bug behavior?
On average, about once a day we see a spike in CPU (recorded by AWS CloudWatch and by Prometheus via the node_exporter), followed a few minutes later by the EC2 instance becoming unresponsive and failing its EC2 status checks. Because the instance is in an autoscaling group, AWS terminates it and it gets reprovisioned.
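For what it's worth, the CPU spike can be pulled out of CloudWatch after the fact with something like the sketch below. The instance ID and time window are placeholders, and the 300-second period is CloudWatch's default basic-monitoring granularity:

```sh
# Placeholder instance ID and time window around one of the incidents.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Maximum \
  --period 300 \
  --start-time 2018-10-01T00:00:00Z \
  --end-time 2018-10-01T06:00:00Z
```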
What is the expected correct behavior?
We would like GitLab/Docker to crash no more often than once a month.
Relevant logs
There's nothing: the logs cut off right at the start of the CPU spike and only resume after the fresh instance comes up.
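The only other place something might show up is the instance's console output: if the host is being wedged by something like the kernel OOM killer, those messages would never reach the GitLab logs on the EBS volume, but they can appear on the serial console. A sketch, assuming we can catch the wedged instance before the ASG replaces it (the instance ID is a placeholder):

```sh
# Placeholder instance ID; must be run while the unresponsive instance still exists.
aws ec2 get-console-output \
  --instance-id i-0123456789abcdef0 \
  --output text | grep -iE 'oom|out of memory|panic'
```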
Details of package version
Debian:
> lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.11 (jessie)
Release: 8.11
Codename: jessie
Docker:
> docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:25:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false
Gitlab:
> docker image ls --no-trunc
REPOSITORY TAG IMAGE ID CREATED SIZE
gitlab/gitlab-ee latest sha256:81cec2cfbac7bdaacc25f2c1e70b1df8ace916d4abc2a638151ee5a355265b71 40 hours ago 1.75GB
Environment details
- Operating System: Debian Jessie 8.11
- Installation Target: VM (AWS)
- Installation Type: New installation
- Is there any other software running on the machine: No
- Is this a single or multiple node installation? Single
- Resources:
  - CPU: 4 (m5.xlarge)
  - Memory total: 16GB (m5.xlarge)
Configuration details
See docker-compose.yml above.