# omnibus-gitlab issues
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues

## Minor changes in sentinel.conf by Sentinel causes restarts in Sentinel
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3718 by Stan Hu (updated 2021-12-30)

On GSTG today, we noticed that Sentinel was restarted. If you look at the reconfigure logs:

```
Staging REDIS_CHECKCMD_ERROR root@redis-01-db-gstg.c.gitlab-staging-1.internal:/var/log/gitlab/reconfigure# vi 1533224357.log
# Logfile created on 2018-08-02 15:39:17 +0000 by logger.rb/56815
[2018-08-02T15:39:17+00:00] INFO: Started chef-zero at chefzero://localhost:1 with repository at /opt/gitlab/embedded
One version per cookbook
[2018-08-02T15:39:17+00:00] INFO: *** Chef 13.6.4 ***
[2018-08-02T15:39:17+00:00] INFO: Platform: x86_64-linux
[2018-08-02T15:39:17+00:00] INFO: Chef-client pid: 1304
[2018-08-02T15:39:17+00:00] INFO: The plugin path /etc/chef/ohai/plugins does not exist. Skipping...
[2018-08-02T15:39:18+00:00] INFO: Setting the run_list to ["recipe[gitlab-ee]"] from CLI options
[2018-08-02T15:39:18+00:00] INFO: Run List is [recipe[gitlab-ee]]
[2018-08-02T15:39:18+00:00] INFO: Run List expands to [gitlab-ee]
[2018-08-02T15:39:18+00:00] INFO: Starting Chef Run for redis-01-db-gstg.c.gitlab-staging-1.internal
[2018-08-02T15:39:18+00:00] INFO: Running start handlers
[2018-08-02T15:39:18+00:00] INFO: Start handlers complete.
[2018-08-02T15:39:19+00:00] INFO: Loading cookbooks [gitlab-ee@0.0.1, package@0.1.0, gitlab@0.0.1, consul@0.0.0, repmgr@0.1.0, runit@0.14.2, postgresql@0.1.0, registry@0.1.0, mattermost@0.1.0, gitaly@0.1.0, letsencrypt@0.1.0, nginx@0.1.0, acme@3.1.0, crond@0.1.0, compat_resource@12.19.0]
[2018-08-02T15:39:25+00:00] WARN: gitlab-rails 'redis_host' will be ignored as sentinel is defined.
[2018-08-02T15:39:25+00:00] WARN: Selected systemd because systemctl shows .mount units
[2018-08-02T15:39:25+00:00] INFO: The plugin path /etc/chef/ohai/plugins does not exist. Skipping...
[2018-08-02T15:39:25+00:00] INFO: template[/var/opt/gitlab/redis/redis.conf] backed up to /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/redis/redis.conf.chef-20180802153925.998785
[2018-08-02T15:39:25+00:00] INFO: template[/var/opt/gitlab/redis/redis.conf] removed backup at /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/redis/redis.conf.chef-20180731180823.608603
[2018-08-02T15:39:26+00:00] INFO: template[/var/opt/gitlab/redis/redis.conf] updated file contents /var/opt/gitlab/redis/redis.conf
[2018-08-02T15:39:26+00:00] INFO: template[/var/opt/gitlab/redis/redis.conf] sending restart action to service[redis] (immediate)
[2018-08-02T15:39:27+00:00] INFO: service[redis] restarted
[2018-08-02T15:39:27+00:00] WARN: only_if block for template[/var/opt/gitlab/sentinel/sentinel.conf] returned "/var/opt/gitlab/sentinel/sentinel.conf", did you mean to run a command? If so use 'only_if "/var/opt/gitlab/sentinel/sentinel.conf"' in your code.
[2018-08-02T15:39:27+00:00] INFO: template[/var/opt/gitlab/sentinel/sentinel.conf] backed up to /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/sentinel/sentinel.conf.chef-20180802153927.407600
[2018-08-02T15:39:27+00:00] INFO: template[/var/opt/gitlab/sentinel/sentinel.conf] removed backup at /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/sentinel/sentinel.conf.chef-20180731214102.769393
[2018-08-02T15:39:27+00:00] INFO: template[/var/opt/gitlab/sentinel/sentinel.conf] updated file contents /var/opt/gitlab/sentinel/sentinel.conf
[2018-08-02T15:39:27+00:00] INFO: template[/var/opt/gitlab/sentinel/sentinel.conf] sending restart action to service[sentinel] (immediate)
[2018-08-02T15:39:27+00:00] INFO: service[sentinel] restarted
[2018-08-02T15:39:27+00:00] INFO: Chef Run complete in 9.277243458 seconds
[2018-08-02T15:39:27+00:00] INFO: Running report handlers
```
Based on this log, you can see the diff:
```
Staging REDIS_CHECKCMD_ERROR root@redis-01-db-gstg.c.gitlab-staging-1.internal:/var/log/gitlab/reconfigure# sudo diff /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/sentinel/sentinel.conf.chef-20180802153927.407600 /var/opt/gitlab/sentinel/sentinel.conf
205,207c205,206
< sentinel config-epoch gstg-redis 86
< sentinel leader-epoch gstg-redis 86
< sentinel known-slave gstg-redis 10.224.7.103 6379
---
> sentinel config-epoch gstg-redis 88
> sentinel leader-epoch gstg-redis 88
209c208
< sentinel known-sentinel gstg-redis 10.224.7.102 26379 c6c70b3130af78431deb724a2f056bebe3eb91f5
---
> sentinel known-slave gstg-redis 10.224.7.103 6379
211c210,211
< sentinel current-epoch 86
---
> sentinel known-sentinel gstg-redis 10.224.7.102 26379 c6c70b3130af78431deb724a2f056bebe3eb91f5
> sentinel current-epoch 88
```
For clarity, here are the files themselves.
## Previous
```
sentinel config-epoch gstg-redis 86
sentinel leader-epoch gstg-redis 86
sentinel known-slave gstg-redis 10.224.7.103 6379
sentinel known-slave gstg-redis 10.224.7.102 6379
sentinel known-sentinel gstg-redis 10.224.7.102 26379 c6c70b3130af78431deb724a2f056bebe3eb91f5
sentinel known-sentinel gstg-redis 10.224.7.103 26379 34e0dd1665774c12b2a883110d399d8bd72027aa
sentinel current-epoch 86
```
## Current
```
sentinel config-epoch gstg-redis 88
sentinel leader-epoch gstg-redis 88
sentinel known-slave gstg-redis 10.224.7.102 6379
sentinel known-slave gstg-redis 10.224.7.103 6379
sentinel known-sentinel gstg-redis 10.224.7.103 26379 34e0dd1665774c12b2a883110d399d8bd72027aa
sentinel known-sentinel gstg-redis 10.224.7.102 26379 c6c70b3130af78431deb724a2f056bebe3eb91f5
sentinel current-epoch 88
```
The changes here are minor: `current-epoch` went from 86 to 88, some lines were reordered, etc. This looks like a fundamental problem with tying `sentinel.conf` changes to the restart step. Do we only need to bootstrap this file once, and let Sentinel manage it afterward?
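One way this could be avoided (a sketch only, not the omnibus implementation; the helper names and directive list below are hypothetical) is to compare the old and new files with Sentinel's runtime-managed directives filtered out, and only notify the restart when something Sentinel does not manage itself has changed:

```ruby
# Sketch: decide whether sentinel.conf changed in a way that warrants a
# restart, ignoring directives that Sentinel itself rewrites at runtime.
# Helper names and the directive list are illustrative, not omnibus code.

SENTINEL_RUNTIME_DIRECTIVES = [
  'config-epoch', 'leader-epoch', 'current-epoch',
  'known-slave', 'known-replica', 'known-sentinel'
].freeze

def sentinel_runtime_line?(line)
  SENTINEL_RUNTIME_DIRECTIVES.any? { |d| line.start_with?("sentinel #{d}") }
end

def meaningful_sentinel_change?(old_conf, new_conf)
  # Normalize both configs: strip whitespace, drop runtime-managed lines,
  # and sort so pure reordering (as in the diff above) is not a "change".
  strip = ->(conf) { conf.lines.map(&:strip).reject { |l| sentinel_runtime_line?(l) }.sort }
  strip.call(old_conf) != strip.call(new_conf)
end
```

Applied to the "Previous" and "Current" files in this issue, `meaningful_sentinel_change?` would return false, so no restart would have been triggered.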
/cc: @andrewn, @brodock

## Document replication slots for PG HA
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3727 by Ian Baum (updated 2021-12-30)

repmgr can automatically manage replication slots by setting `repmgr['user_replication_slots'] = 1`.

They do come with risks, though: the data partition can fill up with WAL logs if a standby is left down for too long.
We should document how to use these in the PG HA doc, as well as what the implications are.
We might even consider turning them on by default.
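For reference, enabling this is a one-line change in `/etc/gitlab/gitlab.rb` (attribute name as quoted in this issue), and the disk-fill risk above can be watched from the primary:

```ruby
# /etc/gitlab/gitlab.rb sketch -- enable repmgr-managed replication slots
# (attribute name as written in this issue)
repmgr['user_replication_slots'] = 1

# Risk mitigation sketch: on the primary, an inactive slot retains WAL, so
# it is worth alerting on retained bytes, e.g. on PostgreSQL 9.6:
#   SELECT slot_name, active,
#          pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS retained_bytes
#   FROM pg_replication_slots;
```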
The DB instances for gitlab.com do use replication slots.

## Update PG HA documentation to advise for more secure defaults
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3842 by Ian Baum (updated 2020-08-14)

https://docs.gitlab.com/ee/administration/high_availability/database.html

We currently document using `trust_auth_cidr_addresses`, but don't go into the full implications of this. If done improperly, a database could be left fairly open.
We should update the documentation to better handle this.
* If we want to keep recommending `trust_auth_cidr_addresses`, we should explain what `Network Address` should and should not be.
* We should also consider other [authentication methods](https://www.postgresql.org/docs/9.6/static/auth-methods.html) as defaults. I think it might be possible to do certificate authentication without too much work.

## Consider the current method for checking for which postgresql server is primary is too dangerous for use
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3865 by John Skarbek (updated 2021-12-30)

### Summary
The current method of checking for a primary in a postgresql cluster is dangerous if there's an event where a server in the cluster might already be down. The command `gitlab-ctl repmgr cluster show` tries to reach out to every server participating in the repmgr configuration. When a server is down, this command takes a long time, and we do not progress through this function properly.
Due to this, consul does not configure `databases.ini`. If this problem is active while postgres fails over to a new primary, consul will be unable to update pgbouncer, and all connections will be incorrect, leading to an outage.
To make matters worse, this function appears to execute every 10 seconds (at least on GitLab.com), while a single check takes roughly 2 minutes, so we end up with a pile of backed-up requests all performing the same check.
### Proposal
Determine a better method to figure out the primary postgresql node.
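A sketch of one such method (hypothetical helper, not existing omnibus code): ask each node directly with a hard per-node timeout instead of shelling out to `repmgr cluster show`, treating a `SELECT pg_is_in_recovery()` result of false as "primary" and skipping unreachable nodes instead of blocking on them:

```ruby
require 'timeout'

# Sketch: find the primary by querying each node with a hard per-node
# timeout. `query` is an injected callable that runs
# "SELECT pg_is_in_recovery()" against one host and returns true/false;
# in real use it would wrap a libpq connection with connect_timeout set.
def find_primary(hosts, per_node_timeout: 2, query:)
  hosts.find do |host|
    begin
      Timeout.timeout(per_node_timeout) { query.call(host) == false }
    rescue Timeout::Error, StandardError
      false # a down/unreachable node is skipped rather than stalling the run
    end
  end
end
```

With this shape, one dead node costs at most `per_node_timeout` seconds instead of the multi-minute stall described above.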
### References
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5357
https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-ctl-commands-ee/lib/repmgr.rb#L225-238

## Setup Grafana in an HA setup
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4387 by Ian Baum (updated 2021-12-30)

We should plan out, document, and make any necessary changes to properly support our Grafana instance in an HA setup.

## Better handling the situation when failed master comes online after failing over to new master
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4577 by Xiaogang Wen (updated 2020-09-21)

### Summary
Currently, after failover, when the old master comes back online, pgbouncer will see there are 2 masters and stop routing. This ensures data consistency but can cause a service interruption.
### Proposal
PGBouncer should keep routing to the new master.
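A sketch of how the tie could be broken (hypothetical logic, not current omnibus behavior): when two nodes both claim to be primary, keep pgbouncer pointed at the most recently promoted one, using a promotion timestamp such as repmgr's event history, rather than refusing to route at all:

```ruby
# Sketch: given candidate primaries annotated with a promotion timestamp
# (e.g. derived from repmgr's events table), route to the newest promotion
# instead of emptying databases.ini. Field names are illustrative.
Node = Struct.new(:name, :promoted_at)

def pick_primary(claimed_primaries)
  return claimed_primaries.first if claimed_primaries.size <= 1
  claimed_primaries.max_by(&:promoted_at) # the newest promotion wins
end
```

The stale old master would then be fenced or re-registered as a standby out of band.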
Milestone: Backlog

## Support of synchronized commit in PG HA
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4581 by Xiaogang Wen (updated 2021-12-30)

### Summary
Support PG synchronized commit in PG HA for a higher level of data consistency between master and standby.
### Proposal
According to the PG documentation at https://www.postgresql.org/docs/10/warm-standby.html#SYNCHRONOUS-REPLICATION, if `synchronous_commit=on` and `synchronous_standby_names` is set, a commit only returns once the transaction is written to disk on both master and standby. If we could automate the setup of `synchronous_standby_names` and handle failover, it would add another level of data consistency to our PG HA solution.
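For reference, the PostgreSQL-side settings the automation would need to manage look like this (a `gitlab.rb` sketch; the attribute names and standby names are illustrative, and omnibus would additionally have to rewrite `synchronous_standby_names` on failover):

```ruby
# /etc/gitlab/gitlab.rb sketch -- settings to be rendered into
# postgresql.conf. Attribute and standby names are illustrative.
postgresql['synchronous_commit'] = 'on'
# PG 10 syntax: commit waits for any 1 of the listed standbys to flush.
postgresql['synchronous_standby_names'] = 'ANY 1 (standby_a, standby_b)'
```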
## Automate creation of peers.json for consul servers
https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4822 by Ian Baum (updated 2021-12-30)

### Summary

To recover a failed consul cluster, one step involves creating a [peers.json](https://learn.hashicorp.com/consul/day-2-operations/outage#manual-recovery-using-peersjson) file with information about the server nodes we expect to see.
Creating this file manually can be cumbersome; it should be possible to automate.
### Proposal
Add a command `gitlab-ctl consul create-peers-json` (I'm willing to bend on the naming) to automatically create the file
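A sketch of what such a command could write, using the raft protocol v3 format from the linked HashiCorp guide (where the node-ID and address inputs come from is an assumption; in omnibus they might be gathered from each server's node-id file and consul configuration):

```ruby
require 'json'

# Sketch: build a raft-protocol-v3 peers.json from the known server nodes.
# Each entry needs the node's raft ID and its server RPC address (port 8300
# by default). The input shape here is assumed, not existing omnibus data.
def peers_json(servers, port: 8300)
  JSON.pretty_generate(
    servers.map do |s|
      {
        'id'        => s.fetch(:node_id),
        'address'   => "#{s.fetch(:ip)}:#{port}",
        'non_voter' => false # voting server members
      }
    end
  )
end
```

Per the linked outage guide, the file would then be written to each stopped server's raft directory before restarting consul, e.g. `File.write('/var/opt/gitlab/consul/data/raft/peers.json', peers_json(servers))`.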
### References
Arising from work in https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/3400