With the addition of Gitaly Cluster, however, Praefect requires a separate database. For efficiency and maintenance reasons the preference will always be to host the databases on the same database cluster. However, this isn't currently possible, as failover is only handled by PgBouncer and Consul for the gitlabhq_production database, and there doesn't appear to be a way to get them to track other databases on the same cluster.
Without this capability we can't use Praefect on the same database cluster, which is a notable inconvenience for customers as well as ourselves. From previous interactions with customers I believe the general consensus will strongly be that they want to use the same cluster where possible.
Specifically, adding the ability for PgBouncer and Consul to track other databases would unlock this, and would allow any other future components to be fully managed in Omnibus.
To provide some context here: this is mostly an issue with the PgBouncer notification script and, to some extent, the way that gitlab-ctl pgb-notify is implemented.
Consul watches PostgreSQL, and when the leader goes down it triggers the notification script with the new leader's hostname. At this point Consul hands off to the handler script, and it is up to that script to decide what to do with the information. Currently it calls gitlab-ctl pgb-notify. This command accepts a database name, but it is not currently passed, so it falls back to the default value.
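For reference, a minimal sketch of the pieces involved on the PgBouncer node, assuming a standard Omnibus PostgreSQL HA layout (values are placeholders):

```ruby
# /etc/gitlab/gitlab.rb on the PgBouncer node (illustrative)
pgbouncer['enable'] = true
consul['enable'] = true

# Consul watches the `postgresql` service; on a leader change it runs the
# failover handler script, which in turn calls `gitlab-ctl pgb-notify` to
# point PgBouncer's databases.ini at the new primary.
consul['watchers'] = %w(postgresql)
```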
So the scaffolding for the fix is there, but it must be revisited with the new requirement in mind. That being said, I know that one of the enterprise customers is actually looking into something similar.
@pursultani We did have this working at one point with the pgbouncer['databases'] attribute. It wasn't an intentional feature at the time, and the workflow wasn't great, so it was never documented. But I believe that currently, if you add entries to it without host information, pgb-notify will track them when the leader changes, alongside the gitlabhq_production entry. It has been a while, so my memory is a little hazy on how it worked.
Subsequent changes may have also broken this, but it could be a good starting point.
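If it does still work, the configuration would look roughly like the following (a sketch only; the praefect_production entry name and the credentials are placeholders):

```ruby
# /etc/gitlab/gitlab.rb on the PgBouncer nodes -- sketch of the undocumented workaround
pgbouncer['databases'] = {
  # Default entry; pgb-notify writes the current leader's host into
  # databases.ini whenever Consul reports a failover.
  gitlabhq_production: {
    user: 'pgbouncer',
    password: 'PGBOUNCER_PASSWORD_HASH'
  },
  # Extra entry with no `host` key: if the old behaviour still holds,
  # pgb-notify should track this one over to the new leader as well.
  praefect_production: {
    user: 'pgbouncer',
    password: 'PGBOUNCER_PASSWORD_HASH'
  }
}
```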
Thanks @ibaum - this is becoming pretty important for a large opportunity. Would you be able to take a quick look and see if it's something that still works? We'll have a larger meeting on this early next week.
@joshlambert I'll probably defer to @pursultani on that. I'm not fully up to speed on the current state of, or plans for, spinning up a Praefect cluster in omnibus-gitlab alongside a DB cluster. I do think if we're automating that, we should include this.
I think we should also do some work to make sure this continues to work. This wasn't really a planned feature, so at a minimum we may want to add some specs to ensure this setup generates the expected configuration. Also, the line that @grantyoung found should be fixed before we recommend using this in a production environment.
@grantyoung This definitely needs some more rigorous testing, but it should work for failovers.
Discussed this with the Distribution team today, the core issue here is that Omnibus does not currently manage the Praefect database by design: gitaly#2476. Praefect assumes a "Bring Your Own DB" model, which means you manage failures, etc.
@zj-gitlab @mjwood - should Omnibus manage this database? Is this something the Gitaly team would prioritize?
@joshlambert at the moment the amount of automated configuration that Omnibus does for the Gitaly Cluster database is almost non-existent. This requires work not only on the Consul/PgBouncer recipes but on the Praefect recipes as well.
> Discussed this with the Distribution team today, the core issue here is that Omnibus does not currently manage the Praefect database by design: gitaly#2476. Praefect assumes a "Bring Your Own DB" model, which means you manage failures, etc.
Right now that's the case, as we'd have no way to have LISTEN/NOTIFY support and HA with Omnibus. It feels like we can finally resolve this with #6039 (closed).
Other than that, we'd love native Omnibus support for Praefect databases!
Reflecting on this today, this may now actually be possible with the addition of PgBouncer support for both of Praefect's connections.
There's still a hardcoded value relating to split brain, but this isn't a blocker per se for the day-to-day functioning of Praefect HA and failover with Omnibus Postgres.
I'm going to assign this to myself to test it out.
OK, so I've built out an environment with Praefect configured to use the same database server as GitLab, going through the same PgBouncer cluster for both pooling and failover (this includes both Praefect connections, main and cache).
As a smoke test, the functionality here looks to be viable. Consul correctly updated PgBouncer to move all database connections for GitLab, Praefect and Praefect Cache over to the new primary.
Test Environment details
10k Reference Architecture with Praefect configured to use the main Omnibus Postgres cluster:
Praefect connects through the main load-balanced PgBouncer cluster for both its main and cache connections
PgBouncer is configured to track all three databases (one of which is a second connection to the Praefect database, but in session mode); see the sketch below
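The Praefect side of this is roughly as follows (a sketch with placeholder hosts and credentials; the exact attribute names should be checked against the gitlab.rb template for the installed version):

```ruby
# /etc/gitlab/gitlab.rb on the Praefect nodes -- illustrative only

# Main (transaction-pooled) connection goes through the load-balanced PgBouncer cluster.
praefect['database_host'] = 'PGBOUNCER_LB_HOST'
praefect['database_port'] = 6432
praefect['database_user'] = 'praefect'
praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD'
praefect['database_dbname'] = 'praefect_production'

# Second ("cache"/direct) connection also goes through PgBouncer, but against a
# separate database alias that PgBouncer pools in session mode (see the caveats below).
praefect['database_direct_host'] = 'PGBOUNCER_LB_HOST'
praefect['database_direct_port'] = 6432
praefect['database_direct_dbname'] = 'praefect_production_direct'
```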
So, initially, it appears Praefect HA support with Omnibus Postgres is now possible. However, there are several caveats:
This setup can only work either fully colocated or completely separated. If a separate Omnibus Postgres instance is desired, it will require separate Postgres, PgBouncer and Consul nodes (as Consul will currently register both under the same name if combined).
The Praefect database and user need to be created. The user needs an MD5-hashed password.
PgBouncer either needs to be configured with TLS, or Praefect needs to be configured not to require it (requiring TLS is Praefect's default, which differs from Rails; see the sketch after this list)
PgBouncer needs to be configured to follow all three databases
A database function, pg_shadow_lookup, needs to be created manually in the Praefect database (with the gitlab-psql user).
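To illustrate the last three caveats, the relevant configuration ends up looking roughly like this (again only a sketch; the praefect_production_direct alias name, the dbname/pool_mode keys and the database_sslmode setting are assumptions to verify against the installed version):

```ruby
# /etc/gitlab/gitlab.rb on the PgBouncer nodes -- sketch only
pgbouncer['databases'] = {
  # ... gitlabhq_production and praefect_production entries as above ...

  # Third entry: a second alias onto the Praefect database, pooled in session
  # mode so Praefect's LISTEN/NOTIFY connection keeps working through PgBouncer.
  praefect_production_direct: {
    dbname: 'praefect_production',
    pool_mode: 'session',
    user: 'pgbouncer',
    password: 'PGBOUNCER_PASSWORD_HASH'
  }
}

# /etc/gitlab/gitlab.rb on the Praefect nodes -- sketch only
# Praefect requires TLS on its database connection by default (unlike Rails),
# so either terminate TLS on PgBouncer or relax the requirement here:
praefect['database_sslmode'] = 'disable'
```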
I'll look to do some more testing, specifically a GPT run, but this form of setup will need to be validated with the relevant stakeholders before recommending it to any customer or updating the Reference Architectures.
@twk3, @timzallmann, @joshlambert, @manuel.kraft, @stanhu - I've tested the above, colocating Praefect's database with GitLab's on the same Omnibus Postgres setup, running through the same PgBouncer cluster. Initial tests look to confirm this works, with failover behaving as expected.
I'm going to do more testing on my end, but I feel such a change will need to be validated by the relevant stakeholders before proceeding. I've pinged you as a first group; feel free to review the above or ping someone else you think would be relevant to review.
I think #6562 (closed) is an important improvement to help further reduce the issues when encountering a split brain, but not a blocker, as even if it were working as intended, there is no guarantee it would come into effect in time.
> If a separate Omnibus Postgres instance is desired it will require separate Postgres, PgBouncer and Consul nodes (as consul will register both under the same name currently if combined).
@grantyoung I think you can re-use the same consul, but you would need to set
On the Praefect DB Patroni nodes and the Praefect PgBouncer nodes. Then it should use that in the key and DNS entry in Consul. But at the moment you can't re-use the PgBouncer unless you completely colocate, as you've pointed out. (We only have it looking at that one database ini file, which is handled by a single failover watcher.)
As a customer, setting up 3 more nodes (+ 3 more per secondary site, as we are using Geo) just for Consul could be really annoying (and expensive). If @twk3's solution works, it would be really appreciated.