Walk through instructions for setting up HA with Omnibus GitLab
This issue is to document my experience of setting up Omnibus HA for the first time. The goal is to learn the configuration needed to set up HA, and identify areas of the documentation or process that aren't intuitive for a new administrator of GitLab.
For my first attempt, I'm following the minimum recommendations for HA. I am creating the HA instance on GCP. I'm following the order outlined in the installation instructions section of the Scaling and High Availability page.
- Create a Cloud Load Balancer on GCP (`larissa-ha` in Network services; the front end is configured with a static IP, but I still need to configure the backend to specify a node group)
- Create and configure 3 Consul nodes. Create an instance group and add the three Consul nodes to the group
- Create and configure 3 PostgreSQL nodes with pgbouncer
- Create and configure 3 Redis nodes
- Create and configure 1 Gitaly storage server
- Create and configure 1 NFS storage server
- Create and configure 2 application nodes
- Create and configure 1 monitoring node
- Follow the GitLab documentation to set up GitLab HA on AWS
- I set up my HA nodes on GCP and had to figure out the best way to do this myself. I noticed that we have documentation on how to set up HA on AWS, but don't have it for GCP. It would have been helpful to have similar documentation for GCP.
- The HA documentation mentions that we may eventually have a different storage solution to NFS so we can have a fully HA offering and links to #2472 (closed) as the issue to track. That issue has been closed. What is the status and future plans for this? Do we need to update the documentation to reflect any changes?
- When I got to the step of creating Consul nodes, there were no guidelines on the specifications I need to use for the VM instances in GCP. The Prerequisites section on the Consul HA page just says to download/install GitLab on each node. It would be helpful to have a link or section on the recommended specifications at this point. I did go back and find the specifications used for the reference architecture. This would be a good link to add on the Consul HA page because a reader may not have read that far down the Scaling and High Availability page before jumping to the Consul HA page.
- Setting up identical servers for HA, such as three Consul servers, seems like a good use case for using GCP auto-scaling instance groups. Is this a good approach? If so, it would be great to have platform-specific instructions that provide tips. Could I set up one instance group per identical server grouping shown in the diagrams on the Scaling and HA page? So for example, for the Horizontal scaling architecture, would I need six instance groups? If I'm using an instance group in GCP, and I select autoscaling, do I even need to create multiple nodes for the same service, or will GCP autoscaling handle that for me? If so, we could update the docs to reflect that.
- In the HA architecture examples we provide, having 1 Gitaly server, 1 shared storage server, 1 load balancer, and 1 monitoring node seems like single points of failure. Should we specifically call that out as a risk for this architecture example?
- Does using cloud platform-provided autohealing eliminate the need for any GitLab-provided self-monitoring services?
- It was difficult to find the HA documentation from the Docs site navigation: gitlab-docs#450 (closed)
- For the Hybrid model, it would be good to have some real-world examples of when I might need to move from a Horizontal to a Hybrid scaling model. Something like: "In larger environments, this is a good architecture to consider if you foresee, or already have, contention due to certain workloads. For example, if you often have a large volume of ??? in your environment."
- The installation instructions are included under Basic Scaling but not in any other section. Should they be pulled out and put at the top? Do the steps in those instructions apply for every scaling and HA configuration listed on the page? Or do we need to have a unique set of installation steps for each section?
- It would be useful to have a high-level overview on this page of how we handle scaling and HA. For example, what happens when a node goes down? How can I test that HA/scaling is actually working as expected? What happens when a failed node comes back up? Does GitLab HA use an active/active configuration, or are the additional nodes standby?
- This page is divided into HA architecture examples and Reference architecture examples, but doesn't really link the two. For each HA architecture example, I'd like to see a note about which Reference architectures they are best suited for so that I don't have to cross reference the number of nodes listed under the HA architecture and Reference architecture to determine a match.
- In the Fully Distributed section, the diagram doesn't match the description. This is confusing.
- We say that Gitaly doesn't have HA, but we recommend setting up multiple Gitaly servers. What is the purpose of this if it isn't HA? Is this purely for scaling? Is it using sharding? If so, we should mention that.
- The Installation Instructions say "A link at the end of each section will bring you back to the Scalable Architecture Examples section so you can continue with the next step" but when I click on Step 1. Load Balancers, there is no link at the end of that page that brings me back to the Scalable Architecture Examples page.
- Is it accurate to say that our definition of HA involves: 1) supporting the use of a third-party load balancer to distribute the load of GitLab services across multiple nodes so that services can still run if one service node goes down, and 2) providing a failover and failback solution for nodes that are responsible for storing GitLab data?
- The installation instructions say that the steps need to be completed in order, but the load balancer step only seems to relate to load balancing for the application servers. Why do I need to set up a load balancer before setting up the Consul nodes? This doesn't seem accurate.
Load Balancer for GitLab HA

The Installation Instructions on the Scaling and High Availability page point readers to the Load Balancer page as the first step for setting up HA.
- In the intro section, should we mention some cloud native load balancers in the examples?
- "We hope that if you’re managing HA systems like GitLab you have a load balancer of choice already." sounds a bit condescending and could be removed or reworded to "you are probably using already".
- In this page, we assume that users are using NGINX and provide instructions based on that. Is that a safe assumption? Is it because we package NGINX? Can users disable NGINX and use something else?
- "As part of its High Availability stack, GitLab Premium includes a bundled version of Consul" - doesn't Consul come bundled in all tiers?
- "A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the Consul cluster." - Can we be more specific about which nodes the client agents need to run on? e.g. nodes that are running services x,y,or z
- I think it would be helpful if this page started with an overview of what Consul is and why someone should be using it in their HA configuration. The HA instructions on the Scaling and High Availability page send users to this Consul page as the second step after setting up a load balancer so at this stage a new user may not understand the benefits of Consul. Reading Hashicorp's page, Consul offers a number of benefits and it's not clear exactly how GitLab uses Consul. Which services does it monitor? What about those services does it monitor? What does it detect and what does it do when it detects it?
- When creating the instance group for the Consul nodes in GCP, I was required to enter port numbers for port name mapping. I wasn't sure which port numbers I needed to include here, so I added 80, 22, and 443.
- Do I need to install Postfix on every node in my HA cluster?
- When I got to Step 2 in https://docs.gitlab.com/ee/administration/high_availability/consul.html#configuring-the-consul-nodes, I thought I was supposed to find an existing section in `gitlab.rb` and replace `Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z` with my Consul nodes' IP addresses. It took me a while to figure out that the code snippet in the docs doesn't exist in `gitlab.rb` and I actually need to paste it in there (see the sketch below). It would help to have a section in the Scaling and HA overview page explaining the concept of HA roles in GitLab and prepping users by saying that each node in their HA environment will need to be assigned a role by uncommenting the `roles` line. As a newcomer to both GitLab and HA, this wasn't obvious. It may have also helped this customer: https://gitlab.zendesk.com/agent/tickets/137548. Instead of "Edit `/etc/gitlab/gitlab.rb`", how about something like: "In `/etc/gitlab/gitlab.rb`, find the `roles` line and replace it with the following code. Replace values noted in..."
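  For reference, the pasted section ends up looking roughly like this in `gitlab.rb` (a sketch based on the docs' placeholder addresses, not the exact snippet):

  ```ruby
  # Assign this node the Consul server role (disables the other GitLab services)
  roles ['consul_role']

  consul['configuration'] = {
    server: true,
    # Placeholder addresses as in the docs; replace with your Consul nodes
    retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z)
  }

  # Keep database migrations from running on this node during reconfigure
  gitlab_rails['auto_migrate'] = false
  ```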
- By pasting the code snippet from Step 2 in https://docs.gitlab.com/ee/administration/high_availability/consul.html#configuring-the-consul-nodes into `gitlab.rb`, I end up with duplicate `gitlab_rails['auto_migrate']` lines: one that was already there but is commented out, and the new uncommented line I just pasted in (illustrated below). This seems a bit messy. Perhaps we should leave that line out of the copyable code snippet in the docs and instead have an additional step in the instructions that tells users to change the existing value to `false` and uncomment the line.
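  For illustration, this is roughly the duplication that results (the commented value is my assumption about what ships in `gitlab.rb`):

  ```ruby
  # Pre-existing line that ships commented out in gitlab.rb:
  # gitlab_rails['auto_migrate'] = true

  # Uncommented duplicate pasted in from the docs snippet:
  gitlab_rails['auto_migrate'] = false
  ```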
- There is no explanation of why I'm instructed to disable auto migrations and I was left wondering. It would be great to explain why we ask users to disable this.
- In Step 3 of https://docs.gitlab.com/ee/administration/high_availability/consul.html#configuring-the-consul-nodes, I think the user experience would be better if we just provided the `reconfigure` command instead of sending the user to a different page to get it.
- What is the difference between setting `auto_migrate` to false versus just leaving it commented out? Does it default to true if not specified? It might be useful to explain that in the docs.
- In "Configuring the Consul nodes", instead of just being instructed to "perform the following", I would like to know what each of these settings do and why I am being told to change them. For example, what does
- After following the steps to install and configure Consul on each node, I ran `/opt/gitlab/embedded/bin/consul members`. The output on each node was just the node itself, not the two other nodes in the cluster. I referenced the troubleshooting section and tried running `gitlab-ctl tail consul`. I got the errors `Coordinate update error: No cluster leader` and `failed to sync remote state: No cluster leader`. I decided to try the fix in https://docs.gitlab.com/ee/administration/high_availability/consul.html#consul-server-agents-unable-to-communicate and specify a `bind_addr`. When I went to `gitlab.rb`, I saw that the `bind_addr` line didn't exist, so I added it and used the internal IP address of the node as the value. All of the other lines under `consul['configuration']` use hash rocket syntax, but the example in the docs uses colon syntax for key-value pairs. This made me question whether the colon syntax was going to work (see the comparison below). I later realized that I now had two `consul['configuration']` sections, because I pasted one in from the code sample in the docs and there was already one in the config file that was commented out. Overall, having the code sample in the docs as a code block caused a lot of confusion for me. Because the `bind_addr` value I entered didn't resolve any issues, I went back into `gitlab.rb` and deleted my changes.
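  For anyone with the same doubt about the syntax: in Ruby, the colon style is shorthand for symbol keys, so both forms below build the same hash (the IP is the example value from my node):

  ```ruby
  # Hash rocket syntax, as in the existing lines of gitlab.rb:
  consul['configuration'] = {
    :server    => true,
    :bind_addr => '10.138.0.58'
  }

  # Colon syntax, as in the docs example; equivalent for symbol keys:
  consul['configuration'] = {
    server: true,
    bind_addr: '10.138.0.58'
  }
  ```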
- Going back over the instructions to troubleshoot why I couldn't get each of my Consul nodes to work, I wondered what exactly is meant by "Make sure you collect CONSUL_SERVER_NODES". What is `CONSUL_SERVER_NODES`? This is the first time I've seen it, and I wonder if it is a variable from another file. I'm also wondering if I entered the correct value in the `retry_join` line. I entered the internal IP address of the node that I was on. Should I have entered the internal IP addresses of all three Consul nodes? I went back and changed this line to include all three internal IP addresses (see the sketch below). I made the change on all three nodes and then ran `gitlab-ctl reconfigure`. That didn't resolve any issues.
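  A sketch of the `retry_join` change I tried (the internal IP addresses here are hypothetical placeholders for my three Consul nodes):

  ```ruby
  consul['configuration'] = {
    server: true,
    # Internal IP addresses of all three Consul server nodes (placeholders)
    retry_join: %w(10.138.0.11 10.138.0.12 10.138.0.13)
  }
  ```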
- Next, I tried changing the `retry_join` line to only include the internal IP addresses of the other two nodes, excluding the IP address of the node I was on. That also didn't resolve any issues. The `members` command still only returned the current node, and the `tail` command still returned errors.
- Next, I tried adding the `bind_addr` back into each node's `gitlab.rb` file, using the node's internal IP address as the value (e.g. `bind_addr: '10.138.0.58'`). That still didn't resolve the issues, so I deleted it.
- I used telnet to confirm that my nodes are able to connect to each other.
- I found in this Zendesk ticket that one customer had explicitly enabled Consul so I tried that but it didn't help.
- In the end, I scrapped most of what I had done, started again, and successfully joined the three Consul servers into the Consul cluster.
- Throughout this experience, I found the syntax checker used in `gitlab-ctl reconfigure` to be really helpful.
- Do all of the Consul nodes in a cluster need to be running the same version of GitLab? I installed GitLab on two nodes and then installed on the third node after a new minor version of GitLab had been released. It seemed to work fine but it did make me wonder if we recommend against that.
- At the top of the page we say "A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the Consul cluster." I think it would be helpful to add: "In the steps on this page, you will create and configure the server agents. Client agents will be configured (when?)"
- If we're following the instructions in order, we haven't installed any Consul agents yet, which makes the output in https://docs.gitlab.com/ee/administration/high_availability/consul.html#checking-cluster-membership different to what the user would see.
- Why do we have this section that only applies to Consul agents when we haven't installed any agents yet: https://docs.gitlab.com/ee/administration/high_availability/consul.html#restarting-the-server-cluster? It's not clear to me at this point which nodes the Consul agents will be installed on.
- https://docs.gitlab.com/ee/administration/high_availability/consul.html#outage-recovery Is this a limitation of GitLab or Consul? Is there anything we can do to re-establish quorum when the nodes come back online and elect a leader?
- "Advanced configuration options are supported and can be added if needed." Is there documentation we can link to for details about advanced configuration options?
- "If you are a Community Edition or Starter user, consider using a cloud hosted solution." We should also urge users to upgrade to Premium if they want a fully-supported HA solution and mention the benefits of support.
- The master DB node is evicted from the HA cluster when it fails, and needs to be manually added back in when the node is recovered. Do customers want this to be an automated process (is this a feature request)?
- The docs don't say what happens when the failed master comes back online. Based on #4577 it doesn't sound like a good user experience. Curious how Patroni handles this in .com.
- The Scaling and High Availability page sent me to the Configuring PostgreSQL page as a first step. After reading the first several sections of the page, I determined that I should go to GCP and create some DB nodes first and then return to this page for configuration instructions. I didn't know what specs I needed for my nodes. It would help to include guidelines on this page, or link back to the reference architecture examples. The examples are lower down in the Scaling and High Availability page, so I hadn't seen them yet.
- The first couple of sections on the page jump between Omnibus Postgres and BYO Postgres. We could make that flow a bit better.
- Make it clearer that the page is divided into scaling and HA. In the scaling section, provide a note indicating to skip down to the HA section if HA is what you're actually trying to accomplish.
- The link to Ultimate license information should link to the Self Managed page, not the .com page.
- In PostgreSQL with High Availability, the docs say "This section is relevant for High Availability Architecture environments including Horizontal, Hybrid, and Fully Distributed". However, the recommended configuration being described in the docs is based on the Hybrid architecture. We should say that.
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#network-information, does "IP address of each nodes network interface." mean every node in the HA cluster, regardless of the node's role? Or just the nodes in the DB cluster, and nodes that communicate directly with the DB nodes?
- It's confusing jumping back and forth between all the different pages. Is there some consolidation we could do, such as including the couple of steps from the install docs instead of sending the reader to the install docs? Could we have a single source of truth for that content that we can reuse throughout the docs, so that if the source of truth ever changes, all of the sections that reuse it are updated?
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#postgresql-information the docs say "This is used to prevent replication from using up all of the available database connections." How many database connections are there?
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#installing-omnibus-gitlab, the docs say "Make sure you install the necessary dependencies from step 1, add GitLab package repository from step 2." Step 1 and step 2 where? We could add "in the install steps" with a link to those docs so there's less ambiguity.
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#high-availability-with-gitlab-omnibus-premium-only, I think we could get rid of "This configuration is GA in EE 10.2." now since we are moving away from referring to "EE" and we're many versions ahead now
- In the diagram in https://docs.gitlab.com/ee/administration/high_availability/database.html#architecture, be clearer that the Consul nodes installed on the db nodes are Consul agents
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#architecture it's not really clear to me exactly which operations Consul provides compared to repmgrd. They are both described as monitoring the DB cluster. What exactly is each service monitoring? The roles and relationships of pgbouncer, consul agents, and repmgr could be explained more clearly here, especially for users that are new to setting up HA environments.
- In "A minimum of one pgbouncer service node, but it's recommended to have one per database node", we recommend one pgbouncer node per database node. If that is our recommendation, we should reflect it in the architecture diagram below. We don't recommend that in the architecture models in https://docs.gitlab.com/ee/administration/high_availability/#high-availability-architecture-examples, but we do in the Reference Architectures. We should update the HA architecture models to match the Reference Architecture recommendations.
- In https://docs.gitlab.com/ee/administration/high_availability/database.html#network-information, the docs say "This can be set to 0.0.0.0 to listen on all interfaces." I think this could be worded more clearly. Instead of "This can be", say what "this" is. I can't set the listening interface until I've installed Postgres, so it seems like it would be more logical to include this as a step after installing GitLab/Postgres. I was confused because I thought this was information I was supposed to gather now, even though it doesn't exist yet.
- https://docs.gitlab.com/ee/administration/high_availability/database.html#consul-information left me confused about where the username is stored, how exactly to change it, which node I should be on when I generate the Consul password hash, and whether I need to store that somewhere or GitLab stores it for me. I scanned the rest of the page looking for more guidance, but the Consul username, password, and hash aren't mentioned anywhere else on the page, which left me thinking that there's an important step that I haven't completed and don't know how to complete. This is another section of the docs that I think should come after the PostgreSQL install step, so that I can look at some configuration files as I'm reading through it and figure out whether values are being stored.