@grantyoung Was your plan for this issue to update the Reference Architectures to use Gitaly Cluster (it's been renamed from HA, gitlab-org/gitlab!31824) and to run the existing performance tests, or to add new tests as well?
I ask because we have several new tests we also need to perform, some of which will involve generating load and measuring performance: https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/451. But I assume those weren't the tests referred to in this issue, right? If not, I'll probably still need your help when I get time to work on them.
Unfortunately, I haven't had a chance yet to progress further with performance#231
Task for now is to add to the reference architectures and verify they can handle the current load.
Any new straight-up performance tests (ones purely for testing load rather than functional ones) would be a separate task, and I'm more than happy to work with you on them.
Grant Young changed title from "Add & Test Gitaly HA and Praefect to the Reference Architectures" to "Add & Test Gitaly Cluster and Praefect to the Reference Architectures"
Looks like it is indeed ready to go, with decent docs as well. Will start this ASAP as it's effectively generally available. Will go through the docs as part of this process and provide any feedback.
One aspect we need to consider here is performance scaling. When we first set up the reference architectures, the largest (50k) simply couldn't handle the load with one large Gitaly node. It was clear this wasn't realistic, and adding more nodes sorted the issue.
With Cluster it's a failover system, meaning the other shards in the cluster aren't used properly until the primary fails. As such we'll likely need multiple clusters for the larger environments (thinking 25k and 50k, possibly only 50k, which I think was the only one to falter at its scale). Praefect looks to handle multiple clusters fine though, so this isn't a problem beyond the added cost.
At the moment, without having tested anything yet but based on past testing, I suspect only 50k will truly need this (and even then it may only need 2), and the others can be single clusters with vertical scaling. Will update here when I know more.
Praefect will need its own internal load balancer, much like we already do with PgBouncer.
For ease's sake we could just use the same LB we use for PgBouncer, but this feels like it would be very unwise as all repo and SQL calls would be going through the one LB. Will need to retool the builder to have 2 internal LBs.
@zj-gitlab Can you give me a sanity check on the above? I strongly assume that Praefect's load balancer should be exclusive.
@grantyoung On GitLab.com we did create a new LB for Praefect, also because it's relatively easy to set up.
For ease's sake we could just use the same LB we use for PgBouncer, but this feels like it would be very unwise as all repo and SQL calls would be going through the one LB.
Yes, you could. As long as the rules don't mix the traffic, but that goes without saying.
Below are just my notes / ramblings on the docs / install as I go through them.
Disabling all services on the Praefect node
I actually have a much bigger list of disables, inherited from some previous support config. Unsure if they're all actually needed?
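For reference, a rough sketch of the kind of disable list I mean in /etc/gitlab/gitlab.rb (illustrative only; whether every one of these is actually required on a Praefect node is exactly what I'm unsure about):

```ruby
# gitlab.rb on the Praefect node - illustrative sketch, not a verified list
praefect['enable'] = true

# Services inherited from the old support config; possibly not all required
postgresql['enable'] = false
redis['enable'] = false
nginx['enable'] = false
puma['enable'] = false
sidekiq['enable'] = false
gitlab_workhorse['enable'] = false
gitaly['enable'] = false
prometheus['enable'] = false
alertmanager['enable'] = false
grafana['enable'] = false
gitlab_exporter['enable'] = false

# Don't attempt Rails database migrations from this node
gitlab_rails['auto_migrate'] = false
```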
git_data_dirs needing the default entry on GitLab Rails
Totally appreciate this is an overhang of how GitLab works and that we have to live with it, but I did have some confusion around why it was needed and I suspect customers could as well.
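For illustration, the kind of entry in question on the Rails nodes, with a default alongside the Praefect storage (addresses and names are placeholders, and exactly what the default entry should point at is part of the confusion):

```ruby
# gitlab.rb on the GitLab Rails nodes - addresses are placeholders
git_data_dirs({
  # A 'default' entry is still expected by GitLab Rails even though all
  # repositories live on the Praefect-backed virtual storage
  "default" => { "gitaly_address" => "tcp://praefect-internal-lb.example:2305" },
  # The Praefect virtual storage, addressed via the internal load balancer
  "storage-1" => { "gitaly_address" => "tcp://praefect-internal-lb.example:2305" },
})
```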
Prometheus Config
I was wondering if, with Consul, this would be picked up automatically, but it isn't for Praefect. PgBouncer is the same, but it has an issue open. Are there plans to do the same for Praefect? Also, the docs specify manual config for Gitaly as well, but this isn't required when Consul is deployed.
The config is currently under the GitLab section, but should it perhaps be separated out into its own? For most customers deploying Cluster the monitoring node will be separate.
In addition, a very small nitpick: for the Gitaly "service" that's enabled on the GitLab Rails nodes you'll also need to specify the Prometheus listen address (gitaly['prometheus_listen_addr'] = "0.0.0.0:9236") or it'll show as localhost in Prometheus and as an error. A consequence of this on larger installs, however, is a much larger exporter list of essentially unused Gitaly nodes. It may be best to find out if we can disable this (with Consul auto discovery).
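For the manual Prometheus config, something along these lines on the monitoring node should cover Praefect (targets are example addresses, and I'm assuming Praefect's default Prometheus listener of 9652):

```ruby
# gitlab.rb on the monitoring node - targets are example addresses
prometheus['scrape_configs'] = [
  {
    'job_name' => 'praefect',
    'static_configs' => [
      'targets' => ['10.0.0.10:9652', '10.0.0.11:9652', '10.0.0.12:9652'],
    ],
  },
]

# And on each Praefect node, so the exporter isn't bound to localhost only
# praefect['prometheus_listen_addr'] = '0.0.0.0:9652'
```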
Postgres for Praefect should be a cluster also
Already discussed, but worth adding here: the docs make it seem like a single-node Postgres server for Praefect is enough, when it shouldn't be.
Step 11 for GitLab, "Update the Repository storage settings", incorrectly refers to a storage location named "praefect"?
The docs say to set the repository storage to "praefect", but this isn't what it's called if you've been following the docs. It should be "storage-1"?
Example outputs for the two praefect test commands?
The two test commands for Praefect, to check that Postgres and the Gitaly nodes are reachable, worked great and were quite helpful. It may be good to show what a passing output looks like though?
Convenience Script for Praefect DB setup?
Any plans for a convenience script or something similar for the Praefect database setup?
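Even a snippet of the matching connection settings alongside the setup steps could help. Roughly the settings I mean (all values are placeholders):

```ruby
# gitlab.rb on the Praefect nodes - values are placeholders
praefect['database_host'] = 'praefect-postgres.internal'
praefect['database_port'] = 5432
praefect['database_user'] = 'praefect'
praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD'
praefect['database_dbname'] = 'praefect_production'
```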
Step 10 for Praefect mistake?
To ensure that Praefect has updated its Prometheus listen address, restart Gitaly:
I think that’s meant to say restart Praefect?
Sidekiq Config?
We'll need to call out Sidekiq in the docs too: it will need the same git_data_dirs config if it's on separate nodes.
I was able to get Cluster up and running pretty easily thanks to the well-written docs! As such, I'm starting performance testing to validate the environment setup and then tune.
50k is the environment of choice for this, as it has exposed issues with Gitaly in the past that smaller environments didn't.
Test 1 - 50k, 1 Gitaly Cluster
First test was done with the same specs as given in the 50k Reference Architecture with the following additional notes:
Praefect specs are set to the same as pgbouncer (n1-highcpu-2)
Wanted to first test out one Gitaly Cluster for 50k with the same specs as the previous 4 separate nodes. We had multiple Gitaly nodes set for 50k to achieve both performance and availability, so with the latter now built in it's worth resetting and seeing where the sweet spot is. In addition, with gitlab-org/gitaly#2650 added, a single cluster may now be able to serve this architecture (a rough config sketch follows after these notes).
Environment is using the same internal load balancer (with higher specs) for convenience.
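For context, a sketch of what I understand the topology under test to look like, expressed as the Praefect virtual storage config (addresses and tokens are placeholders; the key layout follows my reading of the current docs):

```ruby
# gitlab.rb on the Praefect nodes - a single virtual storage ('storage-1')
# fronting 3 Gitaly nodes; addresses and tokens are placeholders
praefect['virtual_storages'] = {
  'storage-1' => {
    'gitaly-1' => {
      'address' => 'tcp://10.0.0.1:8075',
      'token'   => 'GITALY_TOKEN',
      'primary' => true,
    },
    'gitaly-2' => {
      'address' => 'tcp://10.0.0.2:8075',
      'token'   => 'GITALY_TOKEN',
    },
    'gitaly-3' => {
      'address' => 'tcp://10.0.0.3:8075',
      'token'   => 'GITALY_TOKEN',
    },
  },
}
```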
We have several notable failures, all around Gitaly-related test endpoints. This isn't necessarily a shock as we've "downgraded" here from 4 individual Gitaly nodes to a single cluster.
However, with gitlab-org/gitaly#2650 I was expecting this to be better. Looking again at the MRs for distributed reads, there may be a config flag I need to set, which I'll explore in the next test.
Looking at the metrics, only one Gitaly node appeared to be getting used in this run, and weirdly it was gitaly-3. I would've assumed that, with no distributed reads, the primary gitaly-1 would've been the one getting used. A question for the Gitaly team here.
api_v4_projects_repository_branches is notably bad here. It's a known problem endpoint, but this is still an increase of 7 seconds.
Foolishly I've lost the previous 50k metrics for comparison, but from memory and other environments, Postgres usage looks to be higher in places now that Praefect is in use.
This is also an opportunity to reflect on Postgres's CPU metrics in general. Looking with fresh eyes, and with the addition of Praefect, it may be worth exploring increasing its recommended CPU specs at least, as CPU usage is frequently higher than we'd like across the tests on the larger environments.
Going by the MRs for distributed reads, it appeared I needed to enable a setting for it to work (praefect['distributed_reads_enabled'] = true), but this hasn't had an effect. The MR was only merged 2 days ago, but I would've thought that was enough for it to make it through to nightly.
The above could be fundamental to how the new Reference Architectures look in terms of number of clusters, so I'm keen to get this enabled and tested if possible. I suspect that with it working the above errors would all be fixed.
As for gitaly-3 getting all the read requests, this looks to perhaps be normal. Disabling that node worked as expected and all the requests were then funnelled to gitaly-2.
Still having trouble getting distributed reads to work properly today. Currently discussing with @8bitlife. If the feature doesn't work as expected it looks like the 50k reference architecture will require multiple clusters (and large monorepos may still suffer).
After debugging some more with the Gitaly team, it was discovered that the Gitaly nodes weren't in sync and replication jobs were getting stuck for some reason. As a first step I'm going to try rebuilding the environment to eliminate any potential environmental issues, then continue debugging from there.
Reads in this setup are focused on the secondaries, so we only effectively had 2 nodes being used here. Going to add another and test again to get more performance.
One test, api_v4_projects_repository_compare_commits, weirdly posted a lot of 404s, but I suspect this was due to Postgres being maxed out rather than Gitaly. Unsure if this is down to Praefect alone, but we may need a spec bump for the database nodes here. Edit - Postgres usage is actually way higher than expected, with a doubling of node specs still not being enough. This feels like a bug and I'll investigate more next week.
Ok, yeah, after some more testing today there's definitely something up with how Praefect uses its database when distributed reads are on. Essentially its database use increases explosively.
To test this I separated out Praefect's database today. When distributed reads are on, its usage increases tenfold:
This was with Postgres running on an 8 core machine. When I had it running against the main database on 50k I saw it max out a 64 core machine (50k typically only requires 16 core).
I tried to find out what queries it was running but this proved to be difficult (any help with this is appreciated).
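If it helps, one option would be turning on statement logging on the Praefect database's Postgres via omnibus (assuming the setting passes straight through to postgresql.conf); a sketch:

```ruby
# gitlab.rb on the Praefect database node - log every statement with its
# duration (very noisy; revert once the capture is done)
postgresql['log_min_duration_statement'] = 0
```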
Regardless this is a blocker for eventually recommending Cluster on the reference architectures and I should probably raise a bug. WDYT @8bitlife?
@grantyoung Could you report the RPC/s, split by RPC? On GitLab.com the rate isn't this high, so I think we'll gradually turn this feature on to also understand what does and doesn't work. (RPC/s is about 30 on that cluster)
Sure, I've taken a shot at that below. I wasn't tracking this previously and have built a graph that I think shows what you want, but if not let me know and I can generate another one.
The below graph shows the GRPC rate at the same time as the above graph shows the high CPU usage:
Grant Young changed title from "Add & Test Gitaly Cluster and Praefect to the Reference Architectures" to "Test Gitaly Cluster and Praefect to evaluate adding to Reference Architectures"
After speaking with the team further the distributed reads feature is hopefully going to enter beta with 13.1 and then enter GA around 13.3 (although of course this is subject to change).
I strongly believe for the reference architectures and for our customers at the larger end of the scale that distributed reads are a must.
The reason for this is to deliver both true availability of large monorepos as well as performance. Testing has clearly shown that to be performant at this scale distributed reads across multiple gitaly nodes are crucial.
Without distributed reads a customer would need to deploy several large clusters. This would be undesirable due to the complexity and cost implications, as well as not delivering true availability of large monorepos. Additionally, with distributed reads coming out relatively soon, customers would likely want to switch to that and undergo a further configuration change.
As such, my recommendation for the reference architectures and for customers is to wait for distributed reads to be released (at least in Beta for keen customers).
Yeah I think it should be fine to discuss this with customers verbally for now as the team develops the feature. Once it enters Beta and I can test it further I'm happy to look at writing up something for reference.
@grantyoung I think gitlab-org/gitaly#2697 can be used to report if there are any problems with pressure on database.
There were a couple of updates, you can find the full list in gitlab-org&2013
Ok, thanks @8bitlife. Sounds like there was nothing done specifically to the database and how it's accessed? I'll see if I can do another load test on it all soon.
gitlab-org/gitaly#2944 has just been closed (nice one @8bitlife!) so I'll be looking to do another test with this real soon. I'm off next week so it will be the week beginning the 10th or 17th of August.
Unfortunately we've discovered a new import issue today - gitlab-org/gitlab#235949. This could block this testing until it's fixed, but hopefully I can proceed with a slightly older version instead (trying this now).
Edit - Was hoping to get a test done with a slightly older nightly build to avoid the above but still got hit by it. Will be trying with an even older build (but one that will contain the praefect db fix) starting next week.
Hey team -- from my discussion with @stanhu, the finding here is the same issue as gitlab-org/gitlab#227215, where distributed reads are stalling the whole process.
The good news is that we did test it and found an issue; the bad news is that the benefit of this finding was not realized in time to prevent gitlab-org/gitlab#227215.
I think it would be beneficial to have a mini-retrospective on how we can improve the flow of communication when an imminent performance degradation is about to take place as new features are rolled out.
Hey @meks. In retrospect I should've just raised an issue and I'm certainly kicking myself I didn't.
Since the feature was unreleased at the time and still under active development I assumed just letting the team know was enough. In addition I didn't know it was to be used in production so soon.
Certainly several learnings for me here moving forward, and I'm more than happy to go through a retrospective on how to ensure better communication and actions take place when we (Quality Enablement) find issues for other groups.
At the 10k rate here we can see the Postgres performance impact appears to be substantially reduced, suggesting gitlab-org/gitaly#2944 has indeed been fixed.
Test against 13.3.0-pre 82d6547b2a5 nightly with fix at 1000 RPS
At the 1000 RPS throughput rate, Postgres surprisingly did get maxed out again though. Praefect was also heavily hit and looks to need higher specs. Will start to debug it now.
Test against 13.3.0-pre 82d6547b2a5 nightly with fix at 1000 RPS and higher Postgres / Praefect specs
Test results were posted here initially that were inaccurate due to Gitaly Cluster distributed reads not working correctly. This is to be followed up separately.
After doubling the specs of Postgres and Praefect (to n1-standard-32 and n1-highcpu-4 respectively) the metrics are looking the same as before. Distributed reads again appears to have a very heavy hit on Postgres, even with gitlab-org/gitaly#2944 being fixed.
Test against 13.3.0-pre 82d6547b2a5 nightly with fix at 1000 RPS and even higher Postgres / Praefect specs
Upped both Postgres and Praefect again to see how they would perform, and with n1-standard-64 and n1-highcpu-8 respectively we seem to have found some room to breathe. Although, if this is accurate, the need to quadruple Postgres requirements is pretty much a non-starter, so at this time it's looking like more improvements need to be made.
Full test against 13.3.0-pre 82d6547b2a5 nightly with fix at 1000 RPS and even higher Postgres / Praefect specs
With the full run above, only a few tests cause Postgres to spike heavily on a 64 vCPU box. On a 16 vCPU box, by my count it would be 9 tests, with many others causing Postgres to be almost maxed.
While this isn't like before, where Postgres was 100% maxed no matter what specs it was on, this time it looks to require around 4x the recommended machine size (16 to 64 vCPU) compared to what we have on the environment today. This is with Praefect and GitLab using the same database, although even if we split it out it would likely still be the case that Praefect needs a bigger-specced database box than GitLab itself.
I'm open to potentially recommending we up the Postgres specs for the Reference Architectures when Praefect is involved, as that's a good justifiable reason (I've been mulling this on and off for a while, but there wasn't a big enough reason before). But quadrupling is way too high and I fear customers won't accept it. As such, there's likely a further issue here to be addressed with Praefect and Postgres for heavy environments, and I'll look to raise one tomorrow to progress further.