Geo installation review
TL;DR
I installed a minimal Geo system using the Google Cloud Platform (GCP). The Geo installation consisted of one primary located in London and a secondary located on the US east coast:
graph LR;
fz-geo-primary-europe-west2-c --> fz-geo-secondary-us-east1-b
I succeeded in installing Geo using Omnibus within one business day; however, the process itself was highly manual, error-prone and time-consuming. This issue documents the problems I encountered and why I believe these are relevant for system administrators or DevOps engineers. This issue also raises questions about technical details and contains some ideas on what I think could improve the situation. I will create related issues where I believe items are actionable and link them back.
I am very keen to receive feedback from the @geo-team and will use this feedback to inform our product vision.
Next steps
- Move actionable concerns into individual tickets
- Discuss installation with Geo team
- Meet customers and investigate their needs
Installation statistics
Task | Time taken | User errors |
---|---|---|
GCP setup | 2h | 3 |
Geo setup | 4-5h | 5 |
Coffee consumed | 1L | None |
Environment and versions used
- 2x n1-standard-2 (2 vCPUs, 7.5 GB memory) on GCP, same VPC
- 40GB SSD persistent disk
- Ubuntu 18.04 LTS GCP image
- GitLab v11.10.4-ee
- EE licence applied (thanks @vsizov!)
Installing Geo on GCP
The following section follows the installation flow from start to finish as outlined in the documentation for installing GitLab using GCP and installing Geo for Omnibus.
Installing GitLab on GCP (not Geo related)
- The disk size and type shown in the screenshots in the Creating the VM section are inconsistent: the first screenshot selects 80GB, the second 40GB.
- The machine selected is larger than the minimum of 2 CPUs and 4GB of RAM. Why were these settings chosen? As a system administrator, I would like to choose an appropriate size for my GitLab instance on GCP so that I don't run into performance issues. Consider linking back to: https://docs.gitlab.com/ee/install/requirements.html
  Issue: https://gitlab.com/gitlab-org/gitlab-ee/issues/11724
- Following the installation steps for Omnibus on Ubuntu was straightforward.
- Setting the `EXTERNAL_URL` parameter threw me off because I had not thought about external DNS names until this point. The documentation only states that "You can use the IP address from the step above, as the hostname." This is ambiguous because it could reference either the internal or the external IP address of the instance.
  Issue: https://gitlab.com/gitlab-org/gitlab-ee/issues/11724
- Is the hostname the same as the `EXTERNAL_URL`? There is no mention of setting up DNS in the GCP installation guide until this point; however, it is mentioned in a later section under "Using a domain name". I ended up using Google Cloud DNS, recommended by @rnienaber, as it was already set up for internal GitLab use.
- After that I could log in and reset my `root` password at https://fz-geo-primary.gogitlab.com
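To illustrate the internal/external ambiguity around `EXTERNAL_URL` raised above, here is a minimal sketch that labels an address as internal (RFC 1918 private range) or external. The helper names are hypothetical. `10.154.0.24` is the primary's address as it appears later in this issue, and `203.0.113.10` is a documentation-only placeholder, not an address from this installation.

```python
import ipaddress

# RFC 1918 private ranges; addresses inside them are only reachable within the VPC.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_internal(address: str) -> bool:
    """Return True if the address falls in an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def describe(address: str) -> str:
    """Label an address as internal or external for a quick sanity check."""
    kind = "internal" if is_internal(address) else "external"
    return f"{address} is an {kind} address"


if __name__ == "__main__":
    # 10.154.0.24: primary node address quoted in this issue.
    # 203.0.113.10: placeholder standing in for an external address.
    for candidate in ("10.154.0.24", "203.0.113.10"):
        print(describe(candidate))
```

A check like this would not decide which address belongs in `EXTERNAL_URL`, but it makes it obvious which kind of address was pasted.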
The Geo panel on the primary
After installing GitLab on my primary, I visited the Geo panel in the Admin area. As I did not have an EE licence, I was greeted with these two banners (notifications?):
- The message `You need a different license to use Geo replication.` was not helpful to me. As a system administrator, I would like to be informed what licence I am on, what licence tier is required to use GitLab Geo and where I can find more information regarding licenses. A simple link to the licensing page could improve the situation. Also, for consistency the message should reference `GitLab Geo`, not `Geo replication`. @vsizov helped me obtain a licence and all was well.
- Hashed storage will be enforced in `12.0` (https://gitlab.com/gitlab-org/gitlab-ee/issues/8690) - do we need to change the banner? I just followed the instructions and it worked.
- The nice green `New Node` button is a lie. It is clearly stated in the sentence that a lot of setup is needed until you can actually use the button, but as a user I still found myself a bit disappointed. Wouldn't it be great to just click the button and be guided through the setup? As it stands, the button only works once all of that setup is done.
Setting up Geo
- I followed the link to the setup instructions. Why does this bring me to the help page? How does it differ from our documentation? Is this so you can access the docs without internet access?
- The help pages are not formatted correctly and provide double Caution and Note indications:
- The Geo panel links directly to the setup instructions; however, the Geo requirements are above those instructions. Unless you scroll up, you will never know which ports to forward. Should we mention the requirements as well?
- Setting up the secondary node was much faster the second time around (no pun intended).
- I would strongly advocate moving the source installation instructions onto a separate page, or removing them completely once we remove support for installation from source. Having them mixed in at several places makes it hard to parse the instructions and follow the flow of the documentation.
- I wish we had a better way of visually distinguishing steps between the secondary and primary in the documentation - it required a significant amount of focus not to mix up steps. Maybe we could change the background color or introduce another visual cue that helps distinguish these?
Geo database replication
- Step 1. Configure the primary server
  - Why are these steps only for GitLab `10.4` and upwards? This is the first time a version requirement is mentioned.
  - Why am I required to create a password MD5 hash and then paste a clear text password into the same file? Storing a plain text password in configuration feels dangerous.
  - What is the difference between `postgresql['sql_user_password']` and `gitlab_rails['db_password']`?
  - The inline comments in `/etc/gitlab/gitlab.rb` state that GitLab database settings, including `gitlab_rails['db_password']`, only need to change if you use an external database. I am just following the Omnibus installation. Is this still needed? The documentation gives no indication of this.
  - Section 5, `Configure PostgreSQL to listen on network interfaces:`, does not read well. I think instructions should be separated between a section for GCP or any cloud provider and a section for regular servers. The documentation jumps between the two configurations.
  - The inline comments for section 5 regarding replacing IP addresses were helpful. That being said, is this the best way to document these, especially because they are different/absent in `/etc/gitlab/gitlab.rb`?
  - Disabling and enabling `gitlab_rails['auto_migrate']` feels clunky - can we make this easier? I also don't understand why this is necessary.
  - Copying data manually between primary and secondary, here the `server.crt`, was the first thing that really surprised me. Every other step so far is about configuring the `primary`, and I did not anticipate having to copy things between `primary` and `secondary` using my clipboard. Is communicating safely between the two GitLab instances a bottleneck?
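On the MD5 question in Step 1: as far as I understand PostgreSQL's md5 password scheme, the server stores the string `md5` followed by the hex MD5 of the password concatenated with the username, while a connecting client still needs the clear text password. That would explain why both values end up in the same file, although it does not make the plain text any less uncomfortable. A minimal sketch of the hash follows. The function names are hypothetical, the username and password are placeholders, and whether the Omnibus setting expects the value with or without the `md5` prefix is not confirmed here.

```python
import hashlib


def pg_md5_digest(password: str, username: str) -> str:
    """Hex MD5 of the password concatenated with the username."""
    return hashlib.md5((password + username).encode("utf-8")).hexdigest()


def pg_md5_verifier(password: str, username: str) -> str:
    """Stored form for an md5-authenticated role: 'md5' followed by the digest."""
    return "md5" + pg_md5_digest(password, username)


if __name__ == "__main__":
    # Placeholder credentials, not the ones used in this installation.
    print(pg_md5_verifier("example-password", "gitlab"))
```

Because the username is part of the input, the same password hashes differently for different database users.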
- Step 2. Configure the secondary server
  - Creating a `server.crt` file using `editor server.crt` is inconsistent with previous steps. So far I was asked to edit files without specifying an editor (the default here). Maybe just delete this command. It defaulted to `nano`; I like vim ;)
  - From `12.0`, should FDW be on by default for a `secondary`?
  - The configuration in part 7 is very repetitive and requires me to fill in the same things again. That being said, the IP addresses are different, and here I made my first manual error by mixing up IPs.
  - Why does step 9 provision FDW? I thought I had already reconfigured GitLab in step 8. The documentation says `This last reconfigure will provision the FDW configuration and enable it.` I don't understand why.
- Step 3. Initiate the replication process
  - What is the difference between a slot name and a domain name? Can't we infer this automatically by default? What does database-friendly mean? I chose `fz-geo-secondary` and was rewarded with:
--------------------------------------------------------------
WARNING: Make sure this script is run from the secondary server
---------------------------------------------------------------
*** You are about to delete your local PostgreSQL database, and replicate the primary database. ***
*** The primary geo node is `10.154.0.24` ***
*** Are you sure you want to continue (replicate/no)? ***
Confirmation: replicate
* Executing GitLab backup task to prevent accidental data loss
* Stopping PostgreSQL and all GitLab services
Enter the password for gitlab_replicator@10.154.0.24:
* Checking for replication slot fz-geo-secondary
* Creating replication slot fz-geo-secondary
ERROR: replication slot name "fz-geo-secondary" contains invalid character
  - It would be good to know that there are invalid characters, and I think this script could just run with default values and infer settings from `/etc/gitlab/gitlab.rb`?
  - I have to enter plain text passwords again?
  - The instruction `If your database is too large to be transferred in 30 minutes, you will need to increase the timeout, e.g., --backup-timeout=3600 if you expect the initial replication to take under an hour.` is not very helpful to me. I may have no idea how long this will take, and it depends on many factors (size of DB, speed of connection, etc.). Is there a way to infer this? Why do we need a timeout?
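The `invalid character` error above comes from PostgreSQL's naming rule for replication slots: only lower-case letters, digits and underscores are allowed, so the hyphens in `fz-geo-secondary` are rejected. A minimal sketch of how a default slot name could be inferred from a hostname is shown below. The helper names are hypothetical and this is not how the existing script behaves.

```python
import re


def is_valid_slot_name(name: str) -> bool:
    """True if the name uses only lower-case letters, digits and underscores."""
    return re.fullmatch(r"[a-z0-9_]+", name) is not None


def slot_name_from_hostname(hostname: str) -> str:
    """Lower-case the hostname and replace every disallowed character with '_'."""
    return re.sub(r"[^a-z0-9_]", "_", hostname.lower())


if __name__ == "__main__":
    chosen = "fz-geo-secondary"
    print(is_valid_slot_name(chosen))       # False: hyphens are not allowed
    print(slot_name_from_hostname(chosen))  # fz_geo_secondary
```

With a default like this, the slot name prompt could be skipped entirely in the common case.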
- I did not configure PGBouncer yet.
- After these steps I was personally ready to use Geo! Until I realized that this was only step 3 of 7.
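On the `--backup-timeout` question from Step 3: one rough way to pick a value is to divide the database size by the sustained throughput between the nodes and pad the result. The sketch below shows only that arithmetic. The function name, the safety factor and the example figures are assumptions, and nothing here measures a real database or link.

```python
import math


def estimate_backup_timeout(db_size_bytes: int,
                            throughput_bytes_per_s: float,
                            safety_factor: float = 2.0) -> int:
    """Seconds to allow for the initial copy: size / throughput, padded by a factor."""
    if db_size_bytes < 0 or throughput_bytes_per_s <= 0 or safety_factor < 1:
        raise ValueError("size must be >= 0, throughput > 0, safety_factor >= 1")
    return math.ceil(db_size_bytes / throughput_bytes_per_s * safety_factor)


if __name__ == "__main__":
    # Placeholder figures: a 20 GiB database over a sustained 10 MiB/s link.
    seconds = estimate_backup_timeout(20 * 1024**3, 10 * 1024**2)
    print(f"--backup-timeout={seconds}")  # --backup-timeout=4096
```

Even a crude estimate like this would be more helpful than asking the user to guess whether replication fits into 30 minutes.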
Configure fast lookup of authorized SSH keys in the database
- The note `but note that the Write to "authorized keys" file checkbox only needs to be unchecked on the primary node since it will be reflected automatically on the secondary if database replication is working.` confused me because I had no idea where this checkbox was. In the Geo Admin panel? Somewhere else? This is actually mentioned below as `Write to "authorized_keys" file in the Admin Area > Settings > Network > Performance optimization of your GitLab installation.` Consider moving it to the first section.
- Otherwise this section was easy to do; however, having to do it twice introduces overhead. Can't we configure this automatically on the secondary? As it stands, this has to be repeated for every new Geo node.
Geo configuration (GitLab Omnibus)
Configuring a new secondary node
- Step 1. Manually replicate secret GitLab values
  - This section jumps between `primary` and `secondary` - it requires a lot of user attention.
  - Manually copying `/etc/gitlab/gitlab-secrets.json` between the two nodes feels clunky. I imagine automating this is technically hard though? There is an open issue, https://gitlab.com/gitlab-org/gitlab-ee/issues/3789, which I'd love to discuss though!
- Step 2. Manually replicate the primary node's SSH host keys
  - More manual transfer! Why can `/etc/gitlab/gitlab-secrets.json` be copied manually while SSH keys are transferred differently? I understand the security implications for SSH, but why is it different for the above?
  - This is the first time `Disaster Recovery` is mentioned in the installation process! This is so cool but almost mentioned in passing! This feels like important information.
  - This section was pretty flaky for my GCP setup because SSH between instances does not work as described. You need to check out `gcloud compute scp`; maybe worth adding? I ended up manually up/downloading the tarball.
  - The number of commands and details here is pretty high. Maybe this could be automated using a script that can be executed from the `secondary`?
  - I am encouraged to manually check that the SHA256 fingerprints of the two sets of keys match, which is good. However, it would be helpful to give some context on why this is needed and/or provide some guidance on how to do it. Clearly, comparing letter by letter is not intended.
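The fingerprint comparison in Step 2 could be done by a small script instead of by eye. The sketch below computes OpenSSH-style SHA256 fingerprints (the SHA-256 digest of the decoded key blob, base64-encoded without padding) for two lists of public key lines and compares them as sets. The function names are hypothetical, and reading the actual `.pub` files from each node is left out.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line: str) -> str:
    """Fingerprint of one public key line ('<type> <base64-blob> [comment]')."""
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded without '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def fingerprints(public_key_lines):
    """Set of fingerprints for all non-empty key lines."""
    return {sha256_fingerprint(line) for line in public_key_lines if line.strip()}


def host_keys_match(primary_lines, secondary_lines) -> bool:
    """True when both nodes expose exactly the same set of host keys."""
    return fingerprints(primary_lines) == fingerprints(secondary_lines)
```

Comparing sets ignores key order and the trailing comment field, so only a genuinely different key causes a mismatch.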
- Step 3. Add the secondary node
  - This step made me happy! I could finally use the UI and add my secondary node! I also really enjoyed the two `gitlab-rake gitlab:geo:check` tasks - they gave me a clear indication that things were going alright. Can we display this in the UI?
- Step 4. Enabling Hashed Storage
  - This surprised me. The first thing I saw when opening the Geo admin panel was a notification that I really should enable hashed storage, which I did. Now this turns up at the end of the documentation again? Consider moving.
- Step 5. was skipped
- Step 6. Enable Git access over HTTP/HTTPS
  - Why do I have to change this manually? Shouldn't this be configured automatically on the primary when it is set to the role `geo_primary_role`? Turns out in my installation this was also the default setting.
- Step 7. Verify proper functioning of the secondary node
  - After all the steps I am rewarded with a great UI - this looks really nice! I enjoy the overview and it also just worked when I added a new repository.

The results of all of these steps are confidence-inspiring! I added a new repository to the `primary` and it just synced it to the `secondary`. The UI reflected the change and I was confident that things were functioning. Well done!
Discussion and thoughts
Overall, the installation process was very manual, error-prone and time-consuming. I got stuck several times trying to install Geo and made a few manual mistakes that led to issues. The whole process took me around one business day. That being said, a proficient system administrator can definitely set Geo up. The end result was the most rewarding: a UI dashboard and system that felt like "it just works". Ultimately, a good result.
In summary, I believe there is significant scope for improving the installation experience - and also simplifying disaster recovery procedures. During setup, a manual step can literally delete your main database if a user mixes up `primary` and `secondary`. Under time pressure, e.g. in a recovery situation, the likelihood of users making mistakes is even higher.
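One way to reduce the risk of wiping the wrong database would be a guard that refuses the destructive step unless the node is configured as a secondary. The sketch below is purely illustrative. It assumes roles are declared in `/etc/gitlab/gitlab.rb` on lines of the form `roles ['geo_secondary_role']`, with `geo_secondary_role` as the assumed counterpart of the `geo_primary_role` mentioned above. The function names are hypothetical and this is not how the existing replication script works.

```python
import re


def configured_roles(gitlab_rb_text: str) -> set:
    """Collect role names from uncommented `roles [...]` lines in gitlab.rb text."""
    roles = set()
    for line in gitlab_rb_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or not stripped.startswith("roles"):
            continue
        code = stripped.split("#", 1)[0]  # ignore trailing comments
        roles.update(re.findall(r"['\"]([a-z_]+)['\"]", code))
    return roles


def safe_to_wipe_local_database(gitlab_rb_text: str) -> bool:
    """Allow the destructive step only on a node configured purely as a secondary."""
    roles = configured_roles(gitlab_rb_text)
    return "geo_secondary_role" in roles and "geo_primary_role" not in roles


if __name__ == "__main__":
    print(safe_to_wipe_local_database("roles ['geo_primary_role']\n"))    # False
    print(safe_to_wipe_local_database("roles ['geo_secondary_role']\n"))  # True
```

A check of this kind would turn the current warning banner into an actual safeguard, which matters most in a rushed recovery situation.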
Here are a few initial high-level thoughts to be discussed:
- Is installation a main concern for existing customers? I (Fabian) need to validate a lot of the assumptions with customers - maybe installation is not a big bottleneck. What about future customers? How many customers are not using Geo because they failed to install it or the effort was too high?
- A problem appears to be that `primary` and `secondary` need to communicate and trust each other - in order to establish this, many manual steps are necessary to copy files between nodes. What could be done to improve communication?
- Some operations use scripts, others are entirely manual. Can we identify parts of the process that can be grouped into scripts? What would be needed to reduce the number of individual steps?
- Can we move to a situation where most parameters are set to default values that make sense? What are the parameters that are truly variable?
- GitLab is considering a cloud-native-first approach - could the setup of Geo be simplified by using cloud-native solutions? (This feels like a thing far away in the future for Geo.)
- The documentation has few visual cues to distinguish `primary` and `secondary` - what are potential solutions?
- Imagine a future in which Geo could be set up in 15 minutes. What are the implications for us?