Occasional 500 errors requiring a server restart; possible Prometheus error

Summary

GitLab 9.5.1-ee keeps going down with a 500 Internal Server Error. The logs suggest that Prometheus is at fault.

Steps to reproduce

Unknown. It happens at random: sometimes while editing issues, sometimes just after a pull, sometimes just after a push. It always happens after some user activity.

What is the current bug behavior?

While using GitLab, it occasionally returns a 500 error. Restarting with gitlab-ctl doesn't fix anything, but rebooting the whole server does. This is very disruptive. It happens when there is a lot of user activity, but I don't think it's a resource issue, because a single user can trigger the problem and we have 8 GB of memory and a dual-core processor. It only started happening a few days ago.

We are running it in a VM on Google. Just before every crash there is a massive spike in CPU usage: all CPUs go to 100%. The rest of the time, usage is pretty low.

I tailed the logs, and it seems like Prometheus is the problem.

Relevant logs and/or screenshots

2017-08-29_09:56:18.79202 time="2017-08-29T09:56:18Z" level=error msg="Error opening memory series storage: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000167]" source="main.go:191" 
2017-08-29_09:56:19.87760 time="2017-08-29T09:56:19Z" level=info msg="Starting prometheus (version=, branch=, revision=)" source="main.go:88" 
2017-08-29_09:56:19.87763 time="2017-08-29T09:56:19Z" level=info msg="Build context (go=go1.8.1, user=, date=)" source="main.go:89" 
2017-08-29_09:56:19.88449 time="2017-08-29T09:56:19Z" level=info msg="Loading configuration file /var/opt/gitlab/prometheus/prometheus.yml" source="main.go:251" 
2017-08-29_09:56:19.88627 time="2017-08-29T09:56:19Z" level=error msg="Could not open the fingerprint-to-metric index for archived series. Please try a 3rd party tool to repair LevelDB in directory \"/var/opt/gitlab/prometheus/data/archived_fingerprint_to_metric\". If unsuccessful or undesired, delete the whole directory and restart Prometheus for crash recovery. You will lose all archived time series." source="persistence.go:213" 
2017-08-29_09:56:19.88639 time="2017-08-29T09:56:19Z" level=error msg="Error opening memory series storage: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000167]" source="main.go:191" 
2017-08-29_09:56:20.96307 time="2017-08-29T09:56:20Z" level=info msg="Starting prometheus (version=, branch=, revision=)" source="main.go:88" 
2017-08-29_09:56:20.96311 time="2017-08-29T09:56:20Z" level=info msg="Build context (go=go1.8.1, user=, date=)" source="main.go:89" 
2017-08-29_09:56:20.96902 time="2017-08-29T09:56:20Z" level=info msg="Loading configuration file /var/opt/gitlab/prometheus/prometheus.yml" source="main.go:251" 
2017-08-29_09:56:20.97053 time="2017-08-29T09:56:20Z" level=error msg="Could not open the fingerprint-to-metric index for archived series. Please try a 3rd party tool to repair LevelDB in directory \"/var/opt/gitlab/prometheus/data/archived_fingerprint_to_metric\". If unsuccessful or undesired, delete the whole directory and restart Prometheus for crash recovery. You will lose all archived time series." source="persistence.go:213" 
2017-08-29_09:56:20.97074 time="2017-08-29T09:56:20Z" level=error msg="Error opening memory series storage: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000167]" source="main.go:191" 
2017-08-29_09:56:22.04998 time="2017-08-29T09:56:22Z" level=info msg="Starting prometheus (version=, branch=, revision=)" source="main.go:88" 
2017-08-29_09:56:22.05005 time="2017-08-29T09:56:22Z" level=info msg="Build context (go=go1.8.1, user=, date=)" source="main.go:89" 
2017-08-29_09:56:22.05710 time="2017-08-29T09:56:22Z" level=info msg="Loading configuration file /var/opt/gitlab/prometheus/prometheus.yml" source="main.go:251" 
2017-08-29_09:56:22.05878 time="2017-08-29T09:56:22Z" level=error msg="Could not open the fingerprint-to-metric index for archived series. Please try a 3rd party tool to repair LevelDB in directory \"/var/opt/gitlab/prometheus/data/archived_fingerprint_to_metric\". If unsuccessful or undesired, delete the whole directory and restart Prometheus for crash recovery. You will lose all archived time series." source="persistence.go:213" 
2017-08-29_09:56:22.05887 time="2017-08-29T09:56:22Z" level=error msg="Error opening memory series storage: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000167]" source="main.go:191" 
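
To make the restart loop in the excerpt above easier to see, here is a minimal Python sketch that counts Prometheus start attempts and groups the error messages. The function name `summarize` and the regular expression are illustrative assumptions, not part of GitLab or Prometheus; the pattern only targets the `level=... msg="..."` fields visible in the lines above.

```python
import re
from collections import Counter

# Matches the level and msg fields as they appear in the log excerpt.
# Escaped quotes (\") inside the message are allowed by the pattern.
LINE_RE = re.compile(r'level=(?P<level>\w+) msg="(?P<msg>(?:[^"\\]|\\.)*)"')


def summarize(lines):
    """Return (start_attempts, error_counts) for an iterable of log lines."""
    starts = 0
    errors = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a line in the expected format
        level, msg = match.group("level"), match.group("msg")
        if level == "info" and msg.startswith("Starting prometheus"):
            starts += 1
        elif level == "error":
            # Keep only the first sentence so identical errors group together.
            errors[msg.split(". ")[0]] += 1
    return starts, errors
```

Passing an open log file to `summarize` (for example the Prometheus service log, wherever it lives on the installation) returns the number of start attempts and a counter of error messages. Several starts within a few seconds, each followed by the same LevelDB error, would be consistent with the supervisor restarting Prometheus in a tight loop.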

Results of GitLab environment info

System information
System: Ubuntu 17.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.3.3p222
Gem Version: 2.6.6
Bundler Version: 1.13.7
Rake Version: 12.0.0
Redis Version: 3.2.5
Git Version: 2.13.5
Sidekiq Version: 5.0.4
Go Version: unknown

GitLab information
Version: 9.5.1-ee
Revision: 32be8c9
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: postgresql
DB Version: 9.6.3
URL: https://git.waxedlab.co.za
HTTP Clone URL: https://git.waxedlab.co.za/some-group/some-project.git
SSH Clone URL: git@git.waxedlab.co.za:some-group/some-project.git
Elasticsearch: no
Geo: no
Using LDAP: no
Using Omniauth: no

GitLab Shell
Version: 5.8.0
Repository storage paths:
  • default: /var/opt/gitlab/git-data/repositories
Hooks: /opt/gitlab/embedded/service/gitlab-shell/hooks
Git: /opt/gitlab/embedded/bin/git

Results of GitLab application Check

Checking GitLab Shell ...

GitLab Shell version >= 5.8.0 ? ... OK (5.8.0)
Repo base directory exists? default... yes
Repo storage directories are symlinks? default... no
Repo paths owned by git:root, or git:git? default... yes
Repo paths access is drwxrws---? default... yes
hooks directories in repos are links: ...
4/1 ... ok
4/2 ... ok
4/3 ... ok
4/4 ... ok
4/5 ... ok
4/6 ... ok
20/7 ... ok
4/8 ... ok
4/9 ... ok
4/10 ... ok
4/11 ... ok
4/12 ... ok
15/14 ... ok
15/15 ... ok
15/16 ... ok
15/17 ... ok
15/18 ... ok
15/19 ... ok
15/20 ... ok
14/21 ... ok
13/22 ... repository is empty
13/23 ... repository is empty
13/24 ... ok
12/25 ... ok
12/26 ... ok
12/27 ... ok
12/28 ... ok
14/29 ... ok
14/30 ... ok
14/31 ... ok
14/32 ... ok
14/33 ... ok
14/34 ... ok
14/35 ... ok
14/36 ... ok
14/37 ... ok
14/38 ... ok
17/39 ... ok
20/40 ... ok
20/41 ... ok
20/42 ... ok
4/43 ... ok
19/45 ... ok
4/46 ... repository is empty
22/47 ... ok
22/48 ... repository is empty
4/49 ... ok
4/50 ... ok
4/51 ... ok
4/52 ... ok
4/53 ... ok
4/54 ... ok
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Check GitLab API access: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK
Send ping to redis server: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Sidekiq ...

Running? ... yes
Number of Sidekiq processes ... 1

Checking Sidekiq ... Finished

Checking Reply by email ...

Reply by email is disabled in config/gitlab.yml

Checking Reply by email ... Finished

Checking LDAP ...

LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab ...

Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
4/1 ... yes
4/2 ... yes
4/3 ... yes
4/4 ... yes
4/5 ... yes
4/6 ... yes
20/7 ... yes
4/8 ... yes
4/9 ... yes
4/10 ... yes
4/11 ... yes
4/12 ... yes
15/14 ... yes
15/15 ... yes
15/16 ... yes
15/17 ... yes
15/18 ... yes
15/19 ... yes
15/20 ... yes
14/21 ... yes
13/22 ... yes
13/23 ... yes
13/24 ... yes
12/25 ... yes
12/26 ... yes
12/27 ... yes
12/28 ... yes
14/29 ... yes
14/30 ... yes
14/31 ... yes
14/32 ... yes
14/33 ... yes
14/34 ... yes
14/35 ... yes
14/36 ... yes
14/37 ... yes
14/38 ... yes
17/39 ... yes
20/40 ... yes
20/41 ... yes
20/42 ... yes
4/43 ... yes
19/45 ... yes
4/46 ... yes
22/47 ... yes
22/48 ... yes
4/49 ... yes
4/50 ... yes
4/51 ... yes
4/52 ... yes
4/53 ... yes
4/54 ... yes
Redis version >= 2.8.0? ... yes
Ruby version >= 2.3.3 ? ... yes (2.3.3)
Git version >= 2.7.3 ? ... yes (2.13.5)
Active users: ... 9
Elasticsearch version 5.1 - 5.3? ... skipped (elasticsearch is disabled)

Checking GitLab ... Finished
