We've recently worked on several large rollouts of Elasticsearch indexing for customers. These are still taking many hours to do a few thousand projects. We may want to find out for sure what the bottleneck is and whether there is some easy way to speed things up. It's possible we could be going much faster by just increasing sidekiq concurrency, but we don't know for sure. Sidekiq is intentionally throttled for Elasticsearch to prevent overloading the cluster, but our cluster is much bigger now than it used to be and should handle much greater concurrency.
Right now, with these taking ~15 hours to do a few hundred groups, I'm a little hesitant to do larger batches because it holds up queues for updates to these records during this period. We may be able to solve that by separating queues, and there are some issues already open to do that, but ultimately we will still have projects half-indexed for long periods of time, so speeding it up would be preferable if there is an easy way to do so.
Monitoring from last roll-out (2020-04-09 05:22:12 UTC - 2020-04-10 10:56 UTC)
@mwasilewski-gitlab @dgruzd so I'm trying to figure out what we can do next to speed up the initial indexing of many groups. I've got a bunch of graphs here from the last time we did this, which took 30 hours. The first question I have is really whether or not it would have been safe to just run elastic_indexer and elastic_commit_indexer with higher concurrency. Would this cause any issues for the ES cluster? Would it cause issues for Gitaly or Postgres or Redis? Or is it safe to just turn this up and we'll see faster throughput?
From a glance at the Gitaly and Postgres charts I can't see any noticeable impact while this is running, and I haven't heard anything from SREs about issues while it is running. I also can't think of any reason Redis would struggle with an increase in sidekiq workers here.
@DylanGriffith there shouldn't be any issues with the ES cluster since it has a lot of headroom. As for postgresql, we should check our PG connection pool limit and its current utilization.
@dgruzd good reminder about the connection pool. We did recently run into a problem with this and there is much discussion in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9198#note_287817712 . I'm not sure how an increase in concurrency would interact with the fixes added there, but it certainly would not be ideal to trigger that event again.
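As a side note, a quick way to eyeball the application-side pool is from a Rails console (a minimal sketch; this only shows the pool of the process you run it in, so it's a per-process view rather than fleet-wide):

```ruby
# Inspect the application-side ActiveRecord connection pool (Rails 5.1+).
# `waiting` > 0 means threads are blocked waiting for a connection.
ActiveRecord::Base.connection_pool.stat
# => {:size=>10, :connections=>5, :busy=>3, :dead=>0, :idle=>2, :waiting=>0, :checkout_timeout=>5}
```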
The write queue was almost empty throughout the process, so the ES cluster still has headroom. Besides, if we want to prioritize performance during initial indexing we can always add more nodes and the indexing capacity of the cluster will increase (the index consists of 60 shards).
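For watching the write queue between runs, here's a sketch using the elasticsearch-ruby client (the cluster URL env var is an assumption):

```ruby
require 'elasticsearch'

# Print per-node thread pool stats; for the `write` pool, a persistently growing
# `queue` or a non-zero `rejected` count would mean the cluster is falling behind
# on indexing and it's time to add nodes.
client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])
puts client.cat.thread_pool(h: 'node_name,name,active,queue,rejected', v: true)
```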
Postgres
Would it cause issues for Gitaly or Postgres or Redis?
As @dgruzd pointed out, connection pool saturation would be the first indicator of trouble. There are three connection pools to consider: the Postgres connection limit, pgbouncer, and the application-side connection pool.
The db-side connection limit should not be reached because we have pgbouncer in front of Postgres.
In simple terms, pgbouncer is already queueing connections. By adding more connections (running more elasticsearch workers) we won't overload the database itself (pgbouncer is shielding it), but we might exacerbate connection saturation on pgbouncer, which could lead to other sidekiq connections being rejected.
pgbouncer-sidekiq.int.gprd.gitlab.net resolves to a GCP ILB, which has pgbouncer machines as backends.
db-replica.service.consul. is a consul (dynamic) entry which resolves to all read-only postgres machines.
Looking at the transport layer (using ss and tcpdump), there are connections to both. So I guess the relevant questions now are: what is the worker using the pgbouncer connections for, and can we quantify it? (e.g. if there is only one write transaction per job, the load will scale linearly)
I'm not a DB expert or a GitLab application expert, but wasn't this problem caused by Sidekiq having more threads trying to access the DB than there were connections available in the application's connection pool? Shouldn't this be fixed by gitlab-com/gl-infra/scalability#152 (closed)?
Quantifying impact on Gitaly is much harder because the distribution of projects among Gitaly nodes is not truly randomized and HA (praefect) is still not available. This means that if we add a lot of projects for indexing that happen to be on a single Gitaly node, we can overload that node causing all other projects to slow down/become unresponsive.
However, this is not a new risk at all. We also have monitoring and alerting in place. It should also be easy to estimate such risk prior to executing indexing, e.g. get a list of projects for indexing -> get their sizes and Gitaly nodes hosting them -> estimate impact on Gitaly nodes
I think that given the current situation this shouldn't be a stopper. We should do the estimations as a pre-flight step to minimise the risk (rough sketch below), but overall Gitaly is doing OK on CPU.
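To make that pre-flight estimation concrete, something along these lines from a Rails console should be enough (a rough sketch; association and column names are from memory and may differ slightly):

```ruby
# For the batch of projects queued for indexing, sum repository sizes per Gitaly
# storage shard so we can spot shards that would carry a disproportionate load.
project_ids_to_index = [] # hypothetical input: the project IDs in the next batch

Project.where(id: project_ids_to_index)
  .includes(:statistics)
  .group_by(&:repository_storage)
  .transform_values { |projects| projects.sum { |p| p.statistics&.repository_size.to_i } }
```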
Looking at the transport layer (using ss and tcpdump), there are connections to both. So I guess the relevant questions now are: what is the worker using the pgbouncer connections for, and can we quantify it? (e.g. if there is only one write transaction per job, the load will scale linearly)
From memory and looking at the code, the only occasions for writing to Postgres in these workers are:

- creating/updating the project's row in the index_statuses table (a single small write at the end of each job)
@aamarsanaa FYI there is some discussion here about scaling up our Elasticsearch sidekiq indexing infrastructure. This may be of some interest to you. We haven't yet turned this into anything actionable, but I hope to do so soon.
in terms of actionable items, I think it boils down to:
- ES:
  - fine for now
  - let's watch the write queue and scale up the cluster if needed
- Redis:
  - OK for now
  - let's watch CPU and the slowlog; we don't really have a way to scale Redis
- Gitaly:
  - OK for now
  - let's do the estimations before the next round of indexing and watch the relevant dashboard
- Postgres:
  - we need to figure out what the worker is utilizing the pgbouncer (write) connections for and whether we can quantify it
  - what's the state of the db connection pool in the application? Is it safe to add more sidekiq workers/increase concurrency?
  - let's watch the pgbouncer queued connections and figure out how we can tell if connections are being rejected (I don't think we have a chart on any dashboard for that; a sketch of checking this directly is below)
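On the last point, a low-tech way of watching this is querying the pgbouncer admin console directly (a sketch; the host, port and admin user below are assumptions about our setup):

```ruby
require 'pg'

# `SHOW POOLS` on the pgbouncer admin console: `cl_waiting` is the number of
# client connections currently queued and `maxwait` is how long the oldest one
# has been waiting. Sustained non-zero values mean the pool is saturated.
conn = PG.connect(host: 'pgbouncer-sidekiq.int.gprd.gitlab.net', # assumption
                  port: 6432, dbname: 'pgbouncer', user: 'pgbouncer-admin')
conn.exec('SHOW POOLS').each do |row|
  puts row.values_at('database', 'user', 'cl_active', 'cl_waiting', 'maxwait').join("\t")
end
```

Outright rejections (clients disconnected after waiting too long) would show up in the pgbouncer logs rather than here, so that is probably what we would want to chart/alert on.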
@mwasilewski-gitlab thanks for the detailed analysis. I would like to add some points:
"I know that we've got lots of read-only Postgres replicas and that we're bypassing pgbouncer for read-only connections."
Why are we bypassing pgbouncer for read-only connections? We have 3 pgbouncer instances in front of each read-only node, and it is recommended to use them.
"we need to figure out what the worker is utilizing the pgbouncer (write) connections for and whether we can quantify it"
Are we talking about sidekiq connections? Another factor we could consider here is how long connections stay open in an idle state; if we could manage them better, that would be great. And could we use a different database user to better quantify those connections?
Or add a /* SQL comment at the beginning of the SQL statement */ that would make it possible to identify those queries in pg_stat_statements?
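If we go the comment route, one common way to do it in a Rails app is the marginalia gem; a minimal sketch, assuming the gem is available (the component set is just an example):

```ruby
# config/initializers/marginalia.rb (sketch)
require 'marginalia'

# Appends a comment such as
#   /*application:Gitlab,line:app/workers/elastic_commit_indexer_worker.rb:42*/
# to every SQL statement, so statements can be attributed to their call site.
Marginalia::Comment.components = [:application, :line]
```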
"What's the state of the db connection pool in the application? is it safe to add more sidekiq workers/increase concurrency?"
We have space to increase connections on the read-only hosts and some on the primary. Could you help us quantify how much is required? Adding those workers to the pgbouncer pool would, in the worst-case scenario, increase the queue if we saturate it. We should test and evaluate the bottleneck point.
"let's watch the pgbouncer queued connections and figure out how can we tell if connections are being rejected (I don't think we have a chart on any dashboard for that)"
Are we talking here about read-write?
What I see is that we have spikes in production, and maybe we should address the following point with engineering: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9396
I would like to mention that this is happening in the web-api pool, but it generates a spike on the primary database, which is the resource shared by both pools, web-api and sidekiq.
I don't have much context here, but it would definitely be important to understand whether this is an r-o or also an r-w pattern. In any case:

- I agree with @Finotto that pgbouncers need to be used. As we have already discussed, this should be enforced, and at a later time might be forcibly enforced, with bypassing it only approved on a case-by-case basis.
- We could identify the queries in pg_stat_statements if they have a different application_name or user. I understand they come from sidekiq, so this might not be an option. Is there any setting in sidekiq to do this? Otherwise, the comment approach should work (see the sketch after this list).
- There is headroom to add more load on the r-o secondaries.
- An increase on the r-w load would need to be more carefully discussed.
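If the comment approach is used, something like this could then surface the tagged statements in pg_stat_statements (a sketch; note that pg_stat_statements only keeps the text of the first statement seen per normalised query, so comment-based matching is best-effort, and the column is total_exec_time on PG 13+):

```ruby
# Look for statements whose stored text carries the worker's tag (a hypothetical
# tag, based on the file name a marginalia-style comment would embed).
rows = ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT calls, total_time, left(query, 120) AS query
  FROM pg_stat_statements
  WHERE query LIKE '%elastic_commit_indexer%'
  ORDER BY total_time DESC
  LIMIT 20
SQL
rows.each { |r| puts r.values_at('calls', 'total_time', 'query').join(' | ') }
```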
(2) What's the state of the db connection pool in the application? is it safe to add more sidekiq workers/increase concurrency?
It seems we have the following from above:
There is headroom to add more load on the r-o secondaries.
Great. Most of the load is reads. As I mentioned, the writes are to a single column on a single table and only happen once per job, and only in certain jobs. Such writes only happen in the workers/jobs that are doing basically no reads from the database, so I assume this means we won't be holding primary connections for long or putting much pressure on the primary, though I'm no expert. Does that alleviate some concerns?
An increase on the r-w load would need to be more carefully discussed.
As mentioned above, the writes are very small updates, only on some jobs and only on jobs that don't read much (if anything) from Postgres. That being said, we'd still be talking about increasing the existing write load by some multiple if we increase the number of workers.
Is there more careful discussion to be had? There was some mention of adding comments to the SQL to find the queries. I'm not aware of any easy way to do this from the application side, but regardless, by looking at the code I'm confident that these updates are only INSERT/UPDATE on the index_statuses table. In fact, these jobs account for the vast majority of write interactions with the index_statuses table, though there may be some rare writes from web workers that I'm forgetting. Does that narrow it down enough to feel confident it's not a concerning workload?
(3) let's watch the pgbouncer queued connections and figure out how we can tell if connections are being rejected (I don't think we have a chart on any dashboard for that)
What's the next step here? Do we have to build a dashboard for this? Is there an existing issue or should we create one if that will be a blocker here?
@Finotto could you possibly provide some guidance on the above? I think I addressed your point in saying that we know what the write queries are and presumably how to find them, as they are the majority of writes to the index_statuses table.
We have space to increase connections on the read-only hosts and some on the primary. Could you help us quantify how much is required? Adding those workers to the pgbouncer pool would, in the worst-case scenario, increase the queue if we saturate it. We should test and evaluate the bottleneck point.
Are you suggesting we try increasing the workers and observe whether the queue grows? Can you point to some metrics where we can observe this, or propose how/when we can run this experiment and what additional safeguards we would need first?
Also, are there some additional details I can provide about the application code to clarify how the database interactions are occurring? All reads and writes go through normal ActiveRecord models in a normal sidekiq worker, so I'm thinking the behaviour of these workers being doubled would be much the same as any other sidekiq worker, as we aren't doing anything special with the DB; but maybe something is configured differently at the infrastructure level that I'm not aware of. Additionally, the writes happen at the end of processing the job, so if I remember correctly the behaviour of holding primary DB connections would mean that most of the reading was actually happening on read replicas before we get to that write query.
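To make the shape of these jobs concrete, here is a rough sketch (illustrative only, not the actual worker code; the helper and the index_status column names are from memory/hypothetical):

```ruby
# Rough shape of an indexing job: reads go through ordinary ActiveRecord models,
# the heavy lifting is Gitaly reads plus Elasticsearch bulk writes, and the only
# Postgres write is one small update at the very end of the job.
class ElasticCommitIndexerWorker # simplified sketch
  include Sidekiq::Worker

  def perform(project_id)
    project = Project.find(project_id)   # Postgres read
    index_commits_and_blobs(project)     # Gitaly + Elasticsearch work

    # The single small write discussed above: record what has been indexed.
    status = IndexStatus.find_or_initialize_by(project_id: project.id)
    status.update!(last_commit: project.repository.commit&.sha, indexed_at: Time.current)
  end

  private

  # Hypothetical placeholder for the actual Gitaly + Elasticsearch indexing work.
  def index_commits_and_blobs(project)
    # ...
  end
end
```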
@mwasilewski-gitlab action: is pgbouncer used for r-o workloads now? We were under the impression that it was maybe only for r-w workloads. Will look into this, since the comments above suggest it may be used for r-o workloads as well.
- Just confirming that nothing exceptional is happening in our ES sidekiq workers, and since we have headroom we should be good to simply double the processes, even if the sidekiq workers aren't yet using pgbouncer (which seems to be the case: the config currently points directly at the internal load balancer).
- We believe everything will scale linearly because the workers don't use transactions.
- The write pgbouncer is queueing connections, so it is "saturated" at times but not rejecting yet, so we could try increasing concurrency a little and see what happens.
- Since the current elasticsearch sidekiq node is at high CPU we will want to add another node. This is a terraform config update, and rolling back is as simple as stopping one of the sidekiq workers. @mwasilewski-gitlab can do this today and write rollback steps into the next change request gitlab-com/gl-infra/production#2012 (closed): "if pgbouncer is overloaded during this CR we can shut down the 2nd sidekiq node"
This will have to be addressed before we can increase indexing concurrency. There was some discussion on the infra team in the past about increasing the number of connections in pgbouncer; I'll have to catch up on that.
The fact that we're connecting to the patroni machines directly rather than to the pgbouncer machines is relevant because r-o traffic is not bottlenecked by saturation of the pgbouncer tier (whereas r-w traffic is).
We recently changed the way we prioritize sidekiq jobs. We are also in the process of migrating sidekiq to kubernetes (part of the sidekiq workload is already running in k8s). So there are a lot of moving parts that could be contributing to the saturation of the async pool on pgbouncer (which is getting worse). So I don't think that increasing concurrency of elastic jobs at this point is a good idea. This comment summarizes it well: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9966#note_332978721
@mwasilewski-gitlab is there any way to move forward on this now? I note the RCA is still open but I'm not sure if it's still a blocker and if so what are the action items we're waiting on.
I also note we completed the move to Kubernetes a couple days back. Do we need to wait longer to observe that or should we be safe to scale up now?
I also wanted to note that while we will be increasing traffic to the DB, as I mentioned above it is mostly read traffic. The writes are only to a single column of a single table, which is only written by this worker, with no transactions, so I'd expect the increase in writes to be basically negligible compared to the overall write traffic to the DB.
@mwasilewski-gitlab @DylanGriffith, I wonder whether we have tried to increase the concurrency as suggested in #214280 (comment 332665387). If not, can we try it in the next batch rollout? If it overloads pgbouncer, can we just fall back to one sidekiq node?
is there any way to move forward on this now? I note the RCA is still open but I'm not sure if it's still a blocker and if so what are the action items we're waiting on.
If it overloads pgbouncer, can we just fall back to one sidekiq node?
There were lots of problems (paging alerts, incidents) related to the DB this week, and I don't have the full context on the state of things. I saw @albertoramos suggested increasing the connection pool: https://gitlab.slack.com/archives/CB7P5CJS1/p1589376479433500 , but I don't know if that was deployed or what other DB-related work is in flight.
@albertoramos could you please let us know if it's OK to increase concurrency of the Elasticsearch workers (which will generate more write connections and transactions)?
I also note we completed the move to Kubernetes a couple days back. Do we need to wait longer to observe that or should we be safe to scale up now?
I'll be commenting on gitlab-com/gl-infra/scalability#377 (closed), but there's one thing I really need to bring up here: Sidekiq does not use the read-only load-balanced DB capability; all connections go through the 'main' (primary, read-write, postgres master) connection, which on GitLab.com is the sidekiq-specific pgbouncer fleet (pgbouncer-sidekiq.int.gprd.gitlab.net).
I'm not 100% clear on the reason for this choice, just that it is the current situation. It may be an historical decision worth carefully revisiting (perhaps selectively?).
@cmiskell that is indeed relevant and new information to me. These jobs can do quite large reads from the DB as they are loading data and synchronizing it to Elasticsearch. I'm not sure how it compares to other workloads in terms of impact on DB performance. Ideally we'd be able to use read replicas for Sidekiq as well; I am not familiar with the decision not to use them either.
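For context, routing reads to a replica in vanilla Rails 6 looks roughly like this (GitLab has its own load-balancing layer for web/API, so this is only a generic illustration of the capability being discussed, assuming a replica is configured under the reading role):

```ruby
# Reads issued inside this block go to the :reading role (a replica);
# writes elsewhere continue to use the primary (writing) role.
ActiveRecord::Base.connected_to(role: :reading) do
  Project.where('last_activity_at > ?', 1.day.ago).count
end
```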
@DylanGriffith @dgruzd
I am not sure, just giving an opinion. Have we looked at decreasing the heap size? I see some patterns in the heap usage that make me think GC is happening too infrequently. Of course, too small a heap would mean GC would be too frequent; a balance at the ideal size is key.
Another thought: what is our refresh interval set to? Have we tried any other settings?
@JohnMcGuire It's not something I've looked into yet. Are you seeing something in the charts to indicate that? I agree that too large a heap can slow things down, but right now I think our bottleneck is before we're even writing to Elasticsearch, and Elasticsearch is keeping up happily with everything we're throwing at it for now, so I'm not sure we want to focus on that just yet.
We use 60s as default and -1 (disable automatic refreshes) for reindexing on gitlab.com. A few months ago we just used the default value (1s). 60s is our first iteration, so we might want to change that later.
Self-managed
I believe that self-managed customers use the default refresh_interval. We don't export that setting anywhere, it's only accessible through the Elasticsearch API.
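For reference, checking or changing it via the API looks roughly like this with the elasticsearch-ruby client (the index name is a placeholder):

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Read the current setting; if it is absent, the 1s default is in effect.
client.indices.get_settings(index: 'gitlab-production')

# Relax it to 60s (or '-1' to disable automatic refreshes during bulk indexing).
client.indices.put_settings(index: 'gitlab-production',
                            body: { index: { refresh_interval: '60s' } })
```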
Are you seeing something in the charts to indicate that?
@DylanGriffith I wouldn't want to confuse the topic if it's working for us now.
But if I understand it correctly, we have a max heap of 30.9GB and we don't see usage going over 23GB. In several cases we see an increase in CPU usage at the same time as a decrease in heap usage.
This would possibly be GC taking place. If this is correct then we might be underutilizing our heap (possibly too much old generation).
As you scale up, it's something to keep an eye on.
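One way to keep an eye on it (a sketch; the interpretation threshold is just a rule of thumb):

```ruby
require 'elasticsearch'

# Sample JVM heap usage and old-generation GC counts per node. A heap that rarely
# climbs past roughly 60-70% of max, combined with CPU spikes at collection time,
# would support the "heap is oversized / GC too infrequent" theory.
client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])
client.nodes.stats(metric: 'jvm')['nodes'].each do |_id, node|
  jvm = node['jvm']
  puts format('%-30s heap %d%%  old-gen GCs: %d',
              node['name'],
              jvm['mem']['heap_used_percent'],
              jvm['gc']['collectors']['old']['collection_count'])
end
```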
Thanks @JohnMcGuire, those are useful details to understand. Indeed this may well be something we want to look into. Our cluster is certainly over-provisioned today: we use the templates in Elastic Cloud and our cluster happens to be 11TB capacity, but we are only storing 2TB of data in it at the moment. Of course there are many factors to capacity, but by design we've chosen to over-provision because every time we add more groups we are adding more load (i.e. we are constantly increasing usage and plan to do another large increase soon).
That being said it's not clear if our heap size is going to be optimal even when the usage reaches the expected "larger amount" but probably the only way to tell will be to wait until then.
As for impacting this specific issue I'm not sure I can see much evidence of Elasticsearch indexing delays or queueing linked to large GC so I'm not sure this will need to be a focus just yet. That being said it is something that we will need to understand better as we gain maturity in managing Elasticsearch because we've put little focus on this to date.