Study of a potential data-loss scenario with the current HA implementation
This is a study to demonstrate a potential data-loss scenario with repmgr and pgbouncer automatic configuration using Consul.
I will describe how all the parts are configured and bound together, analyze the potential pitfalls, and then demonstrate my conclusions with a PoC of the current HA solution implemented at GitLab.com.
How pgbouncer configuration is updated using repmgr state through Consul
Consul watchers and checks are used to monitor the state of the master configured with repmgr and to execute actions in case another master is chosen.
Consul is configured as follows for the PostgreSQL hosts:
{
  "service": {
    "name": "postgresql",
    "address": "",
    "port": 5432,
    "checks": [
      {
        "script": "/opt/gitlab/bin/gitlab-ctl repmgr-check-master",
        "interval": "10s"
      }
    ]
  },
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "gitlab/ha/postgresql/failed_masters/",
      "handler": "/opt/gitlab/bin/gitlab-ctl consul watchers handle-failed-master"
    }
  ]
}
For pgbouncer hosts:
{
  "watches": [
    {
      "type": "service",
      "service": "postgresql",
      "handler": "/var/opt/gitlab/consul/scripts/failover_pgbouncer"
    }
  ]
}
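For reference, a Consul service watch invokes its handler with a JSON array on stdin describing every node registered for the service, including the result of each health check. A trimmed sketch of the payload shape the handler sees (node names and addresses here are illustrative, not taken from production):

```json
[
  {
    "Node": { "Node": "pg-node-1", "Address": "10.0.0.1" },
    "Service": { "ID": "postgresql", "Service": "postgresql", "Port": 5432 },
    "Checks": [
      { "CheckID": "serfHealth", "Status": "passing" },
      { "CheckID": "service:postgresql", "Status": "passing" }
    ]
  }
]
```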
The gitlab-ctl repmgr-check-master command, executed by the check every 10 seconds on each host, runs this snippet:
def is_master?
  # Name of this node as registered in repmgr
  hostname = attributes['repmgr']['node_name'] || `hostname -f`.chomp
  # Which node(s) does the repmgr metadata table consider the active master?
  query = "SELECT name FROM repmgr_gitlab_cluster.repl_nodes WHERE type='master' AND active != 'f'"
  host = attributes['gitlab']['postgresql']['unix_socket_directory']
  port = attributes['gitlab']['postgresql']['port']
  master = Repmgr::Base.execute_psql(database: 'gitlab_repmgr', query: query, host: host, port: port, user: 'gitlab-consul')
  # Count how many nodes "repmgr cluster show" reports as master
  show_count = Repmgr::Base.cmd(
    %(gitlab-ctl repmgr cluster show | awk 'BEGIN { count=0 } $2=="master" {count+=1} END { print count }'),
    Etc.getpwuid.name
  ).chomp
  # Fail the check unless both sources agree on exactly one master
  raise MasterError, "#{master} #{show_count}" if master.length != 1 || show_count != '1'
  # The check passes only on the node that is that master
  master.first.eql?(hostname)
end
It checks which node is marked as master in the repmgr nodes table, and that no more than one master appears in the "repmgr cluster show" output.
When the check result changes, the state of the postgresql service registered in Consul also changes, firing the watcher handler script /var/opt/gitlab/consul/scripts/failover_pgbouncer on the pgbouncer nodes. The script receives as input the list of nodes registered for the postgresql service, together with the check results for each node. It filters the nodes that pass both the serfHealth check and the postgresql service's custom check (the gitlab-ctl repmgr-check-master script shown above). These nodes are the masters according to the repmgr tables: if more than one is found, the script kills pgbouncer to avoid a split brain (gitlab-ctl pgb-kill --pg-database gitlabhq_production --user gitlab-consul); if zero are found, it exits; if exactly one is found, it updates the pgbouncer configuration to point to the new master and reloads it (gitlab-ctl pgb-notify --newhost #{newhost} --user pgbouncer --hostuser gitlab-consul).
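The handler's decision logic can be sketched in Ruby as follows. This is a simplified reconstruction, not the shipped script; the payload shape mirrors the Consul service-watch JSON, and the function name and return values are illustrative:

```ruby
# Decide what the pgbouncer watcher handler should do, given the Consul
# service-watch payload (an array of nodes with their check results).
# Returns :kill when more than one node passes all checks (split brain),
# :noop when none does, and [:notify, hostname] for exactly one.
def pgbouncer_action(watch_payload)
  masters = watch_payload.select do |node|
    # A node counts as master only when every check (serfHealth plus the
    # repmgr-check-master service check) reports "passing".
    node['Checks'].all? { |check| check['Status'] == 'passing' }
  end

  case masters.length
  when 0 then :noop                                    # no master visible: keep current config
  when 1 then [:notify, masters.first['Node']['Node']] # point pgbouncer at the new master
  else :kill                                           # two masters visible: stop pgbouncer
  end
end
```

With exactly one passing node the real script would then invoke gitlab-ctl pgb-notify --newhost with that hostname, and with more than one it would invoke gitlab-ctl pgb-kill.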
Guard to remove the old master from repmgr
When repmgrd executes an automatic failover and promotes a new master, the repmgrd_failover_promote event is fired and a Consul KV key gitlab/ha/postgresql/failed_masters/<old master> (where the old master name is extracted from the repmgrd event's detail message) is written. This change is detected by the Consul watcher on the old master, which executes gitlab-ctl consul watchers handle-failed-master; this finally unregisters the old master node from repmgr, making pgbouncer update its configuration and stop pointing to the old master as fast as possible.
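The sequence can be sketched as a small simulation, with an in-memory hash standing in for the Consul KV store; all class and method names here are illustrative, not the shipped code:

```ruby
# Minimal in-memory simulation of the failed-master guard: the promote event
# writes a key under the failed_masters/ prefix, and the keyprefix watcher
# running on each node unregisters that node from repmgr when its own name
# appears under the prefix.
class FailedMasterGuard
  PREFIX = 'gitlab/ha/postgresql/failed_masters/'.freeze

  def initialize
    @kv = {}           # stand-in for the Consul KV store
    @unregistered = [] # nodes that have removed themselves from repmgr
  end

  attr_reader :unregistered

  # Fired by the repmgrd_failover_promote event handler: record the old master.
  def on_failover_promote(old_master)
    @kv[PREFIX + old_master] = Time.now.to_s
  end

  # The keyprefix watcher running on `node`: if this node is listed as a
  # failed master, unregister it from repmgr (here merely recorded).
  def watcher_tick(node)
    @unregistered << node if @kv.key?(PREFIX + node) && !@unregistered.include?(node)
  end
end
```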
Conclusions
We can identify a delay of up to 10 seconds between pgbouncer pointing to the old master and pgbouncer pointing to the new one. If a split brain occurs during this period, pgbouncer will route traffic to the old master while the new master is being promoted.
A more important problem is that there is no real STONITH mechanism to get rid of the old master when an automatic failover happens. Since this is not handled, there is always a chance that a split brain occurs.
The only way to eliminate those race conditions would be to move the automatic failover functionality (currently implemented by repmgr) into a state managed by Consul (or a similar service implementing a distributed consensus algorithm), so that if Consul is not available no failover can occur and no split brain can happen. This is because Consul, and any other service implementing a distributed consensus algorithm, guarantees that an operation is consistent regardless of the state of the hardware.
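The idea can be sketched as a lock-gated promotion: a candidate may promote itself only after acquiring a leader lock in the consensus store. Here the lock is simulated with an in-memory compare-and-set; in Consul this would correspond to a session-based KV acquire. All names are illustrative:

```ruby
# In-memory stand-in for a consensus-backed leader lock (in Consul: a KV
# acquire bound to a session). At most one caller can ever hold it.
class LeaderLock
  def initialize
    @mutex = Mutex.new
    @holder = nil
  end

  # Atomic compare-and-set: succeeds only when nobody holds the lock yet.
  def acquire(node)
    @mutex.synchronize do
      return false unless @holder.nil?
      @holder = node
      true
    end
  end
end

# A failover attempt promotes only after winning the lock, so two concurrent
# candidates can never both become master; and if the store is unreachable,
# no lock can be acquired and no promotion happens at all.
def attempt_failover(lock, node)
  lock.acquire(node) ? :promoted : :stand_down
end
```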
PoC
We generated a PoC to demonstrate our conclusions. The current architecture is quite complex to emulate with a PoC: it is not a 1-to-1 copy of the current production setup and may contain bugs or important differences that could invalidate the results in a strict sense. In any case, it is a good tool to understand what could possibly go wrong.
The PoC is composed of a Maven project that builds a Docker image with custom builds of PostgreSQL 9.6.8, pgbouncer 1.8.1, Consul 0.9.0 and repmgr 3.3 (we also tested with 4.1.0 for completeness), and runs a Docker Compose setup that spawns an infrastructure of 3 Consul nodes (forming the Consul cluster), 4 PostgreSQL nodes (running 1 master and 3 replicas, with repmgrd and a Consul agent) and 1 pgbouncer node (with pgbouncer and a Consul agent).
The project also includes a Java tool that checks master consistency by connecting to the current master as pointed to by the Consul check (as implemented in the script /var/opt/gitlab/consul/scripts/failover_pgbouncer), and that simulates network failures (with iptables).
We implemented the following test cases:
- Normal failover
- Failover with stopped postgres on target
- Failover with stopped repmgrd on target
- Failover with stopped repmgrd on master
- Failover after a normal failover and a clone of the old master: with this we test failback
- Failover with master network failure
- Failover with target master network failure
- Failover with master and target master network failure: this is actually a network partition where the master and the target can still see each other but are isolated from the other hosts
- Failover with master and pgbouncer network failure: this is actually a network partition where the master and pgbouncer can still see each other but are isolated from the other hosts
All the code of the PoC is available at repository: https://gitlab.com/teoincontatto/gitlab-ha-poc
Results
With PG 9.6.8, pgbouncer 1.8.1, consul 0.9.0, repmgr 3.3:
- Normal automatic failover is working
- Automatic failover is NOT working if repmgrd is dead on the master node of a normal automatic failover
- Automatic failover is NOT working if repmgrd is dead on the target node of a normal automatic failover
- Automatic failover is working if postgresql is dead on the target node of a normal automatic failover
- Normal automatic failback is NOT working after clone & follow of the initial master
- Automatic failover is NOT working if the current master node has a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped.
- Automatic failover is working if the target node has a network split (not reachable by any other node): after the split the node follows the new master.
- Automatic failover is NOT working if the current master node and the target node have a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped; the target instance is also still following the old master; while the automatic failover promoted a new master, pgbouncer was pointing to the old master, causing a real split brain.
- Automatic failover is NOT working if the current master node and the pgbouncer node have a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped, and pgbouncer stops serving traffic since two masters are available; while the automatic failover promoted a new master, pgbouncer was pointing to the old master, causing a real split brain.
With PG 9.6.8, pgbouncer 1.8.1, consul 0.9.0, repmgr 4.1.0:
- Normal automatic failover is working
- Automatic failover is NOT working if repmgrd is dead on the master node of a normal automatic failover
- Automatic failover is NOT working if repmgrd is dead on the target node of a normal automatic failover
- Automatic failover is working if postgresql is dead on the target node of a normal automatic failover
- Normal automatic failback is working after clone & follow of the initial master
- Automatic failover is NOT working if the current master node has a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped.
- Automatic failover is working if the target node has a network split (not reachable by any other node): after the split the node follows the new master.
- Automatic failover is NOT working if the current master node and the target node have a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped; the target instance is also still following the old master; while the automatic failover promoted a new master, pgbouncer was pointing to the old master, causing a real split brain.
- Automatic failover is NOT working if the current master node and the pgbouncer node have a network split (not reachable by any other node): after the network split disappears we detect a split brain, since the old master has not been stopped, and pgbouncer stops serving traffic since two masters are available; while the automatic failover promoted a new master, pgbouncer was pointing to the old master, causing a real split brain.