Identify at which step(s) downtime occurs during an upgrade. This might involve using HAProxy dashboards, real-time server logs, and/or other means to get live feedback (see the polling sketch after this list; end-to-end tests give delayed feedback because of the waits built into those tests).
Fix any blockers to zero-downtime upgrades.
Test the revised zero-downtime upgrade process on the current and previous versions of GitLab (the versions with version-specific instructions available on docs.gitlab.com).
Revise the zero-downtime instructions for the current GitLab version, and update the instructions for the previous GitLab versions tested in the previous step (either correcting the zero-downtime instructions or removing them if zero-downtime is not possible for those versions).
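For step 1, one low-tech way to get live feedback (in addition to the HAProxy dashboards and server logs) is a simple polling loop against the load balancer, which surfaces downtime faster than end-to-end tests. A minimal sketch; the URL and interval are placeholders for this deployment:

```shell
# Poll a public GitLab page through the load balancer every second and
# log any non-200 response with a timestamp, so failures can be matched
# to the upgrade step running at that moment.
GITLAB_URL="https://gitlab.example.com/users/sign_in"   # placeholder URL
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$GITLAB_URL")
  if [ "$code" != "200" ]; then
    echo "$(date -u +%FT%TZ) HTTP $code"
  fi
  sleep 1
done
```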
@fzimmer @nhxnguyen As far as we know, we might consider this a ~bug. From that perspective, it is somewhat severe, since there is currently no known workaround for the "zero-downtime upgrade" feature. I would give this a weight of 3, since testing and troubleshooting multi-server deployments can take a long time. We would open a follow-up issue for any significant bug found.
It is very possible that we are just missing something in our procedure, and we just need to add more to the upgrade documentation.
If it is something we thought should work and it doesn't, I agree that this is a ~bug.
I attached some priority/severity to it so we can pick this up. I'll let @nhxnguyen do the scheduling on this one :)
Pulling this onto the board. I think it would make sense to prioritize this ahead of further upgrade testing to see if we can better understand what's happening.
@nhxnguyen I'll need an engineer's help to monitor dashboards, troubleshoot, and document while I perform the upgrade steps. Also, I estimate this will need a block of 2-3 hours, since we would perform each step on each node sequentially (usually I perform upgrade steps on non-deploy nodes simultaneously) so we can better pinpoint when issues happen.
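For whoever pairs on this: besides the HAProxy dashboard, tailing the service logs on the node being upgraded gives immediate feedback. A small sketch, assuming we mainly care about the proxy and the Rails server:

```shell
# Follow the web-facing logs on a node while its upgrade step runs,
# to correlate errors with the exact step being performed.
sudo gitlab-ctl tail nginx
# in another terminal, the Rails application server:
sudo gitlab-ctl tail puma
```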
@geo-team Would anyone be able to help? @mkozono has been helping out with the most recent upgrade demos but it would be nice to spread out the responsibility some more. Perhaps we can have one of our new team members join in with a more experienced team member.
@alexives @dbalexandre @aakriti.gupta The GCP Geo deployment used for upgrades is currently on version 12.10.12. Can we upgrade to 13.0 to test zero-downtime, as long as there are no pending background migrations? That's my understanding based on these zero-downtime upgrade docs.
Or we can upgrade to 12.10.14.
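If it helps, my understanding is that pending background migrations can be checked on a Rails node with something like the following (taken from the upgrade docs as far as I recall, so treat it as a sketch):

```shell
# Count background migrations that have not finished yet; the docs say
# this should be 0 before starting the upgrade.
sudo gitlab-rails runner -e production 'puts Gitlab::BackgroundMigration.remaining'
```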
I think it should be ok to upgrade to either version, including 13.0 (to tackle both issues at once?), as long as the PG version is 11; otherwise, upgrading PG (because 13.0 requires PG 11) will require downtime.
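A quick way to confirm the bundled PostgreSQL version on the database nodes before deciding; this assumes the default Omnibus paths:

```shell
# Version of the bundled PostgreSQL tooling.
sudo gitlab-psql --version
# Major version of the running data directory.
sudo cat /var/opt/gitlab/postgresql/data/PG_VERSION
```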
@fzimmer My understanding is that step 1 (Identify at which step(s) downtime occurs during an upgrade) was completed, and #228898 (closed) (which is already ~"workflow::ready for development") and #228954 are step 2. So those should be prioritized.
Then we can pick this issue up again, re-test the process, and make any needed revisions to the instructions (steps 3 and 4). @jennielouie @alexives Please correct me if I'm wrong.
When running the first step under "On all other nodes excluding the primary 'deploy node'" I am getting some downtime. This is going from 13.1.10 to the latest nightly, and the downtime occurred while GitLab was being upgraded on all nodes.
Also, it's important to note that I was using automation for this, so I was upgrading all nodes simultaneously. The docs suggest this is ok.
502: Whoops, GitLab is taking too much time to respond.
I'm going to try again going from the earliest 13.9 to latest nightly as well.
Ohhhh. GitLab Bot just pinged me on an old issue, omnibus-gitlab#5047 (closed), which may be the root of this issue: reconfigure restarts the Rails web server after a GitLab package upgrade. Our zero-downtime instructions assume it does not (we tell you to hup puma afterwards).
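For reference, the sequence our zero-downtime instructions assume on a web node looks roughly like this (paraphrased from memory, so a sketch rather than the canonical steps):

```shell
# Prevent the package upgrade from automatically running reconfigure.
sudo touch /etc/gitlab/skip-auto-reconfigure
# Upgrade the package, then apply configuration changes.
sudo apt-get update && sudo apt-get install gitlab-ee
sudo gitlab-ctl reconfigure
# Gracefully reload Puma so in-flight requests are not dropped; per
# omnibus-gitlab#5047, reconfigure may already have restarted it, which
# would explain the downtime.
sudo gitlab-ctl hup puma
```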
I suppose the next question would be how sequential we need to go: reconfigure 1 server at a time, which would mean about 23 reconfigures one after the next? Or would we only need to be sequential within each node type, i.e. restart 1 Redis node at a time and 1 PostgreSQL node at a time, while the different node types are done simultaneously?
The non-Geo HA docs suggest we could reconfigure everything simultaneously except the web nodes, which should be done 1 at a time:
Upgrades on web (Puma/Unicorn) nodes must be done in a rolling manner, one after another, ensuring at least one node is always up to serve traffic. This is required to ensure zero-downtime.
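If that holds, the rolling part could look something like this from a workstation with SSH access (hostnames are placeholders, and the drain/health-check handling is simplified):

```shell
# Reconfigure the web nodes one at a time, waiting for each to respond
# again before moving on; other node types could be done in parallel.
for node in web1.example.com web2.example.com web3.example.com; do
  ssh "$node" 'sudo gitlab-ctl reconfigure'
  # crude health check before touching the next node
  until curl -sf -o /dev/null "https://$node/users/sign_in"; do sleep 5; done
done
```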
Also, the first line in the non-Geo section seems important:
You can only upgrade 1 minor release at a time. So from 13.6 to 13.7, not to 13.8. If you attempt more than one minor release, the upgrade may fail.
I haven't seen this in the Geo section, so my original upgrade from 13.1 to 13.9 wouldn't be valid if this holds true for Geo, which I assume it does. This could also be the cause of the downtime, rather than the reconfigure.
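So each jump on each node would need to target an explicit package version per minor release, something like this (the version string is just an example):

```shell
# Step through minor releases one at a time by pinning the package version.
sudo apt-get install gitlab-ee=13.2.0-ee.0
# ...run the zero-downtime steps, verify, then repeat for 13.3.x, 13.4.x, ...
```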
I don't know for sure but I assume that at least one node needs to be available to serve any requests.
Yes, at least one is required, but I assume the recommendation to upgrade 1 web node at a time is necessarily conservative, since the number of nodes needed depends on current usage, etc.
As Alex noted, we do see downtime when running gitlab-ctl hup puma as well. The non-Geo docs recommend running these commands on one node at a time and removing each node from the load balancer while the command is running.
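For the load balancer part, draining a web node can be done through the HAProxy admin socket; a rough sketch, assuming the socket is enabled and that the backend/server names (gitlab_web/web1) and socket path match this deployment:

```shell
# Take web1 out of rotation, reload Puma gracefully, then re-enable it.
echo "disable server gitlab_web/web1" | sudo socat stdio /run/haproxy/admin.sock
ssh web1.example.com 'sudo gitlab-ctl hup puma'
echo "enable server gitlab_web/web1" | sudo socat stdio /run/haproxy/admin.sock
```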
Closing this issue; I've raised #326346 with some feedback on the documentation. I'm also working on a process for testing that zero-downtime upgrades actually work, to help find this kind of issue in the future.