GitLab Geo documentation
Collecting comments from Geo MRs.
What we've done
- PostgreSQL replication worked well
- Redis is sharing session data (primary <-> secondary)
- Authentication should happen on the primary server because it changes a lot of state (mostly related to security measures, like brute-force prevention, password recovery, etc.)
- If this becomes an issue, there is an alternative: store that state in Redis, but it will take some time to implement.
- We have a middleware to prevent potentially writing operations on secondary servers (to make sure they are read-only)
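A minimal sketch of such a read-only middleware, in Rack terms (class name, headers, and response body are illustrative, not the actual GitLab implementation):

```ruby
# Rack-style middleware that rejects writing HTTP verbs when this node
# is a read-only Geo secondary. Names here are illustrative only.
class ReadOnlyMiddleware
  WRITE_METHODS = %w[POST PUT PATCH DELETE].freeze

  def initialize(app, read_only:)
    @app = app
    @read_only = read_only
  end

  def call(env)
    if @read_only && WRITE_METHODS.include?(env['REQUEST_METHOD'])
      # Reject writes on the secondary with 405 Method Not Allowed
      [405, { 'Content-Type' => 'text/plain' }, ["Read-only Geo secondary\n"]]
    else
      @app.call(env)
    end
  end
end
```

Reads pass straight through to the application, so the secondary stays fully usable for browsing.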
How we are going to update repositories on secondary servers:
I had a call with @dzaporozhets today, and we discussed the following approaches:
- Primary should push to all secondaries
- Each Secondary should pull from primary
The conclusion is that if the primary can push to any secondary, then so can any other user. To overcome this, we would have to break our permission system.
It's easier to have secondary servers pull from the primary, and it also makes it easier to deal with connection issues, as the primary doesn't have to keep track of them.
We are already replicating PostgreSQL from the primary to the secondaries,
and we need to replicate Redis to share session data, so Redis is the perfect place to store data and coordinate secondary repository updates. We don't want to do cron-like polling of every single repository, but update only the modified ones. The idea we discussed is to create one queue per instance in Redis, where the primary enqueues repositories that secondaries need to fetch. Each queue should be namespaced with something like
geo/[:secondary_host]. We can use Redis MULTI to update every namespace in a single transaction, to make sure we don't have inconsistencies between secondary instances.
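The enqueue side of this could look roughly like the sketch below. The key layout (geo/<host>) follows the namespacing above; the method name is illustrative, and the client is assumed to be redis-rb compatible (anything responding to #multi and #rpush works):

```ruby
# Sketch: on the primary, push a changed repository's path into every
# secondary's queue inside a single MULTI transaction, so all queues
# are updated atomically and stay consistent with each other.
def enqueue_repository_update(redis, secondary_hosts, repository_path)
  redis.multi do |transaction|
    secondary_hosts.each do |host|
      transaction.rpush("geo/#{host}", repository_path)
    end
  end
end
```

Each secondary then consumes only its own geo/[:secondary_host] list, so a slow or offline node never blocks the others.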
As we don't want to require Procfile updates to run Geo, we will try to use sidekiq-cron to poll the repository updates queue. As a starting point we will poll every 10 seconds, but we will try to bring it down to 5.
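On the secondary side, each scheduled run would essentially drain this node's queue. A minimal sketch of that step, with the actual git fetch left as a block and the queue key matching the geo/[:secondary_host] scheme above (the client only needs to respond to #lpop):

```ruby
# Sketch of one polling cycle on a secondary: pop every pending
# repository path from this node's queue and hand it to the caller,
# which would trigger a `git fetch` from the primary for each one.
def drain_geo_queue(redis, host)
  fetched = []
  while (path = redis.lpop("geo/#{host}"))
    fetched << path
    yield path if block_given? # e.g. fetch the repository from the primary
  end
  fetched
end
```

Note that plain cron expressions have one-minute resolution, so sub-minute polling may need the job to loop internally between runs.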
Database replication uses a master <-> slave topology: the master is the GitLab Geo primary node, where writing operations happen, and the slaves are the Geo secondary nodes, which are read-only.
I followed DigitalOcean's instructions (link here: https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps)
Simplified steps on the master node:
- create a user with the replication role and a password
- list the slaves' IPs in pg_hba.conf with the replication type and md5 authentication
- change some parameters in postgresql.conf to enable WAL, listen on the public IP, etc.
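For PostgreSQL 9.1 (the version used in the commands further down), the master-side changes amount to something like this; the user, password, and IPs are placeholders:

```
-- on the master: create the replication user
CREATE ROLE rep_user WITH REPLICATION LOGIN PASSWORD 'secret';

# pg_hba.conf on the master: let the slave connect for replication (md5 auth)
host    replication    rep_user    203.0.113.10/32    md5

# postgresql.conf on the master
listen_addresses = '*'        # or the master's public IP
wal_level = hot_standby       # WAL detail needed by standbys
max_wal_senders = 3           # concurrent replication connections
wal_keep_segments = 8         # WAL segments kept for lagging slaves
```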
Simplified steps on the slave node:
- list the master's IP in pg_hba.conf with the replication type and md5 authentication
- make the same parameter changes in postgresql.conf
- create recovery.conf with some parameters pointing to the master, etc.
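And the slave side, again with placeholder values:

```
# pg_hba.conf on the slave: mirror entry for the master (md5 auth)
host    replication    rep_user    203.0.113.1/32    md5

# recovery.conf on the slave, placed in the data directory
# (e.g. /var/lib/postgresql/9.1/main/)
standby_mode = 'on'
primary_conninfo = 'host=203.0.113.1 port=5432 user=rep_user password=secret'
trigger_file = '/tmp/postgresql.trigger'   # touch this file to promote
```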
To get it working, db:backup plus db:restore is not enough. This may not be the best way to do it, but it was the most common instruction I found (run after restarting the database):

    psql -c "select pg_start_backup('initial_backup');"
    rsync -cva --inplace --exclude=*pg_xlog* /var/lib/postgresql/9.1/main/ slave_IP_address:/var/lib/postgresql/9.1/main/
    psql -c "select pg_stop_backup();"

(Pause the database for backup, copy the database data files, then resume.) If you just dump and restore, something I couldn't figure out puts you "out of sync" with the WAL logging.
Getting a new secondary Geo node up and running will also require the
repositories folder to be rsynced from the master initially. (If this step is not done, the node will eventually clone and fetch every missing repository as they are updated on the master.) The final step is to regenerate the authorized keys with
rake gitlab:shell:setup (HTTPS clones will work without this extra step).
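Put together, bringing up a secondary's repositories might look like this; the paths and hostname are placeholders for a source install and will differ on Omnibus:

```
# one-time rsync of bare repositories from the master
rsync -av git@primary.example.com:/home/git/repositories/ /home/git/repositories/

# regenerate gitlab-shell's authorized_keys on the secondary
# (HTTPS clones work without this; SSH clones need it)
cd /home/git/gitlab
sudo -u git -H bundle exec rake gitlab:shell:setup RAILS_ENV=production
```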