We've recently worked on several large rollouts of Elasticsearch indexing for customers. These are still taking many hours to do a few thousand projects. We may want to find out for sure what the bottleneck is and whether there is some easy way to speed things up. It's possible we could be going much faster by just increasing sidekiq concurrency, but we don't know for sure. Sidekiq is intentionally throttled for Elasticsearch to prevent overloading the cluster, but our cluster is much bigger now than it used to be and should handle much greater concurrency.
Right now, with these taking ~15 hours to do a few hundred groups, I'm a little hesitant to do larger batches because it holds up queues for updates to these records during this period. We may be able to solve that by separating queues, and there are some issues already open to do that, but ultimately we will still have projects half-indexed for long periods of time, so speeding it up would be preferable if there is an easy way to do so.
Monitoring from last roll-out (2020-04-09 05:22:12 UTC - 2020-04-10 10:56 UTC)
@mwasilewski-gitlab @dgruzd so I'm trying to figure out what we can do next to speed up the initial indexing of many groups. I've got a bunch of graphs here from the last time we did this, which took 30 hours. The first question I have is really whether or not it would have been safe to just run elastic_indexer and elastic_commit_indexer with higher concurrency. Would this cause any issues for the ES cluster? Would it cause issues for Gitaly or Postgres or Redis? Or is it safe to just turn this up and we'll see faster throughput?
From a glance at the Gitaly and Postgres charts I can't see any noticeable impact while this is running, and I haven't heard anything from SREs about issues while it is running. I also can't think of any reason Redis would struggle with an increase in sidekiq workers here.
@DylanGriffith there shouldn't be any issues with the ES cluster since it has a lot of headroom. As for postgresql, we should check our PG connection pool limit and its current utilization.
@dgruzd good reminder about the connection pool. We did recently run into a problem with this and there is much discussion in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9198#note_287817712 . I'm not sure how an increase in concurrency would interact with the fixes added there, but it certainly would not be ideal to trigger that event again.
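As a side note, a quick way to eyeball the application-side pool is from a Rails console (a minimal sketch; this only shows the pool of the process you run it in, so it's a per-process view rather than fleet-wide):

```ruby
# Inspect the application-side ActiveRecord connection pool (Rails 5.1+).
# `waiting` > 0 means threads are blocked waiting for a connection.
ActiveRecord::Base.connection_pool.stat
# => {:size=>10, :connections=>5, :busy=>3, :dead=>0, :idle=>2, :waiting=>0, :checkout_timeout=>5}
```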
The write queue was almost empty throughout the process, so the ES cluster still has headroom. Besides, if we want to prioritize performance during initial indexing we can always add more nodes and the indexing capacity of the cluster will increase (the index consists of 60 shards).
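For watching the write queue between runs, here's a sketch using the elasticsearch-ruby client (the cluster URL env var is an assumption):

```ruby
require 'elasticsearch'

# Print per-node thread pool stats; for the `write` pool, a persistently growing
# `queue` or a non-zero `rejected` count would mean the cluster is falling behind
# on indexing and it's time to add nodes.
client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])
puts client.cat.thread_pool(h: 'node_name,name,active,queue,rejected', v: true)
```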
Postgres
Would it cause issues for Gitaly or Postgres or Redis?
As @dgruzd pointed out, connection pool saturation would be the first indicator of trouble. There are three connection pools to consider: the Postgres connection limit, pgbouncer, and the application-side connection pool.
The db-side connection limit should not be reached because we have pgbouncer in front of Postgres.
In simple terms, pgbouncer is already queueing connections. By adding more connections (running more elasticsearch workers) we won't overload the database itself (pgbouncer is shielding it), but we might exacerbate connection saturation on pgbouncer, which could lead to other sidekiq connections being rejected.
pgbouncer-sidekiq.int.gprd.gitlab.net resolves to a GCP ILB, which has pgbouncer machines as backends.
db-replica.service.consul. is a consul (dynamic) entry which resolves to all read-only postgres machines.
Looking at the transport layer (using ss and tcpdump), there are connections to both. So I guess the relevant questions now are: what is the worker using the pgbouncer connections for, and can we quantify it? (e.g. if there is only one write transaction per job, the load will scale linearly)
I'm not a DB expert or a GitLab application expert, but wasn't this problem caused by Sidekiq having more threads trying to access the DB than there were connections available in the application's connection pool? Shouldn't this be fixed by gitlab-com/gl-infra/scalability#152 (closed)?
Quantifying impact on Gitaly is much harder because the distribution of projects among Gitaly nodes is not truly randomized and HA (praefect) is still not available. This means that if we add a lot of projects for indexing that happen to be on a single Gitaly node, we can overload that node causing all other projects to slow down/become unresponsive.
However, this is not a new risk at all. We also have monitoring and alerting in place. It should also be easy to estimate such risk prior to executing indexing, e.g. get a list of projects for indexing -> get their sizes and Gitaly nodes hosting them -> estimate impact on Gitaly nodes
I think that given the current situation this shouldn't be a stopper. We should do the estimations as a pre-flight step to minimise the risk (rough sketch below), but overall Gitaly is doing OK on CPU.
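To make that pre-flight estimation concrete, something along these lines from a Rails console should be enough (a rough sketch; association and column names are from memory and may differ slightly):

```ruby
# For the batch of projects queued for indexing, sum repository sizes per Gitaly
# storage shard so we can spot shards that would carry a disproportionate load.
project_ids_to_index = [] # hypothetical input: the project IDs in the next batch

Project.where(id: project_ids_to_index)
  .includes(:statistics)
  .group_by(&:repository_storage)
  .transform_values { |projects| projects.sum { |p| p.statistics&.repository_size.to_i } }
```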
Looking at the transport layer (using ss and tcpdump), there are connections to both. So I guess the relevant questions now are: what is the worker using the pgbouncer connections for, and can we quantify it? (e.g. if there is only one write transaction per job, the load will scale linearly)
From memory and looking at the code, the only occasions for writing to Postgres in these workers are:

- creating/updating the project's row in the index_statuses table (a single small write at the end of each job)
@aamarsanaa FYI there is some discussion here about scaling up our Elasticsearch sidekiq indexing infrastructure. This may be of some interest to you. We haven't yet turned this into anything actionable, but I hope to do so soon.
in terms of actionable items, I think it boils down to:
- ES:
  - fine for now
  - let's watch the write queue and scale up the cluster if needed
- Redis:
  - OK for now
  - let's watch CPU and the slowlog; we don't really have a way to scale Redis
- Gitaly:
  - OK for now
  - let's do the estimations before the next round of indexing and watch the relevant dashboard
- Postgres:
  - we need to figure out what the worker is utilizing the pgbouncer (write) connections for and whether we can quantify it
  - what's the state of the db connection pool in the application? Is it safe to add more sidekiq workers/increase concurrency?
  - let's watch the pgbouncer queued connections and figure out how we can tell if connections are being rejected (I don't think we have a chart on any dashboard for that; a sketch of checking this directly is below)
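On the last point, a low-tech way of watching this is querying the pgbouncer admin console directly (a sketch; the host, port and admin user below are assumptions about our setup):

```ruby
require 'pg'

# `SHOW POOLS` on the pgbouncer admin console: `cl_waiting` is the number of
# client connections currently queued and `maxwait` is how long the oldest one
# has been waiting. Sustained non-zero values mean the pool is saturated.
conn = PG.connect(host: 'pgbouncer-sidekiq.int.gprd.gitlab.net', # assumption
                  port: 6432, dbname: 'pgbouncer', user: 'pgbouncer-admin')
conn.exec('SHOW POOLS').each do |row|
  puts row.values_at('database', 'user', 'cl_active', 'cl_waiting', 'maxwait').join("\t")
end
```

Outright rejections (clients disconnected after waiting too long) would show up in the pgbouncer logs rather than here, so that is probably what we would want to chart/alert on.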
@mwasilewski-gitlab thanks for the detailed analysis. I would like to add some points:
"I know that we've got lots of read-only Postgres replicas and that we're bypassing pgbouncer for read-only connections."
Why are we bypassing pgbouncer for read-only connections? We have 3 pgbouncer instances in front of each read-only node, and it is recommended to use them.
"we need to figure out what the worker is utilizing the pgbouncer (write) connections for and whether we can quantify it"
Are we talking about sidekiq connections? Another factor we could consider here is how long connections stay open in an idle state; if we could manage them better, that would be great. And could we use a different database user to better quantify those connections?
Or add a /* SQL comment at the beginning of the SQL statement */ that would make it possible to identify those queries in pg_stat_statements?
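If we go the comment route, one common way to do it in a Rails app is the marginalia gem; a minimal sketch, assuming the gem is available (the component set is just an example):

```ruby
# config/initializers/marginalia.rb (sketch)
require 'marginalia'

# Appends a comment such as
#   /*application:Gitlab,line:app/workers/elastic_commit_indexer_worker.rb:42*/
# to every SQL statement, so statements can be attributed to their call site.
Marginalia::Comment.components = [:application, :line]
```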
"What's the state of the db connection pool in the application? is it safe to add more sidekiq workers/increase concurrency?"
We have space to increase connections on the read-only hosts and some on the primary. Could you help us quantify how much is required? Adding those workers to the pgbouncer pool would, in the worst-case scenario, increase the queue if we saturate it. We should test and evaluate the bottleneck point.
"let's watch the pgbouncer queued connections and figure out how can we tell if connections are being rejected (I don't think we have a chart on any dashboard for that)"
Are we talking here about read-write?
What I see is that we have spikes in production, and maybe we should address the following point with engineering: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9396
I would like to mention that this is happening in the web-api pool, but it generates a spike on the primary database, which is the resource shared by both pools, web-api and sidekiq.
I don't have much context here, but it would definitely be important to understand whether this is an r-o or also an r-w pattern. In any case:

- I agree with @Finotto that pgbouncers need to be used. As we have already discussed, this should be enforced, and at a later time might be forcibly enforced, with bypassing it only approved on a case-by-case basis.
- We could identify the queries in pg_stat_statements if they have a different application_name or user. I understand they come from sidekiq, so this might not be an option. Is there any setting in sidekiq to do this? Otherwise, the comment approach should work (see the sketch after this list).
- There is headroom to add more load on the r-o secondaries.
- An increase on the r-w load would need to be more carefully discussed.
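If the comment approach is used, something like this could then surface the tagged statements in pg_stat_statements (a sketch; note that pg_stat_statements only keeps the text of the first statement seen per normalised query, so comment-based matching is best-effort, and the column is total_exec_time on PG 13+):

```ruby
# Look for statements whose stored text carries the worker's tag (a hypothetical
# tag, based on the file name a marginalia-style comment would embed).
rows = ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT calls, total_time, left(query, 120) AS query
  FROM pg_stat_statements
  WHERE query LIKE '%elastic_commit_indexer%'
  ORDER BY total_time DESC
  LIMIT 20
SQL
rows.each { |r| puts r.values_at('calls', 'total_time', 'query').join(' | ') }
```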
(2) What's the state of the db connection pool in the application? is it safe to add more sidekiq workers/increase concurrency?
It seems we have the following from above:
There is headroom to add more load on the r-o secondaries.
Great. Most of the load is reads. As I mentioned, the writes are to a single column on a single table and only happen once per job, and only in certain jobs. Such writes only happen in the workers/jobs that are doing basically no reads from the database, so I assume this means we won't be holding primary connections for long or putting much pressure on the primary, though I'm no expert. Does that alleviate some concerns?
An increase on the r-w load would need to be more carefully discussed.
As mentioned above, the writes are very small updates, only on some jobs and only on jobs that don't read much (if anything) from Postgres. That being said, we'd still be talking about increasing the existing write load by some multiple if we increase the number of workers.
Is there more careful discussion to be had? There was some mention of adding comments to the SQL to find the queries. I'm not aware of any easy way to do this from the application side, but regardless, by looking at the code I'm confident that these updates are only INSERT/UPDATE on the index_statuses table. In fact, these jobs account for the vast majority of write interactions with the index_statuses table, though there may be some rare writes from web workers that I'm forgetting. Does that narrow it down enough to feel confident it's not a concerning workload?
(3) let's watch the pgbouncer queued connections and figure out how we can tell if connections are being rejected (I don't think we have a chart on any dashboard for that)
What's the next step here? Do we have to build a dashboard for this? Is there an existing issue or should we create one if that will be a blocker here?
@Finotto could you possibly provide some guidance on the above? I think I addressed your point in saying that we know what the write queries are and presumably how to find them, as they are the majority of writes to the index_statuses table.
We have space to increase connections on the read-only hosts and some on the primary. Could you help us quantify how much is required? Adding those workers to the pgbouncer pool would, in the worst-case scenario, increase the queue if we saturate it. We should test and evaluate the bottleneck point.
Are you suggesting we try increasing the workers and observe whether the queue grows? Can you point to some metrics where we can observe this, or propose how/when we can run this experiment and what additional safeguards we would need first?
Also, are there some additional details I can provide about the application code to clarify how the database interactions are occurring? All reads and writes go through normal ActiveRecord models in a normal sidekiq worker, so I'm thinking the behaviour of these workers being doubled would be much the same as any other sidekiq worker, as we aren't doing anything special with the DB; but maybe something is configured differently at the infrastructure level that I'm not aware of. Additionally, the writes happen at the end of processing the job, so if I remember correctly the behaviour of holding primary DB connections would mean that most of the reading was actually happening on read replicas before we get to that write query.
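To make the shape of these jobs concrete, here is a rough sketch (illustrative only, not the actual worker code; the helper and the index_status column names are from memory/hypothetical):

```ruby
# Rough shape of an indexing job: reads go through ordinary ActiveRecord models,
# the heavy lifting is Gitaly reads plus Elasticsearch bulk writes, and the only
# Postgres write is one small update at the very end of the job.
class ElasticCommitIndexerWorker # simplified sketch
  include Sidekiq::Worker

  def perform(project_id)
    project = Project.find(project_id)   # Postgres read
    index_commits_and_blobs(project)     # Gitaly + Elasticsearch work

    # The single small write discussed above: record what has been indexed.
    status = IndexStatus.find_or_initialize_by(project_id: project.id)
    status.update!(last_commit: project.repository.commit&.sha, indexed_at: Time.current)
  end

  private

  # Hypothetical placeholder for the actual Gitaly + Elasticsearch indexing work.
  def index_commits_and_blobs(project)
    # ...
  end
end
```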
@mwasilewski-gitlab action: is pgbouncer used for r-o workloads now? We were under the impression that it was maybe only for r-w workloads. Will look into this, since the comments above suggest it may be used for r-o workloads as well.
- Just confirming that nothing exceptional is happening in our ES sidekiq workers, and since we have headroom we should be good to simply double the processes, even if the sidekiq workers aren't yet using pgbouncer (which seems to be the case: the config currently points directly at the internal load balancer).
- We believe everything will scale linearly because the workers don't use transactions.
- The write pgbouncer is queueing connections, so it is "saturated" at times but not rejecting yet, so we could try increasing concurrency a little and see what happens.
- Since the current elasticsearch sidekiq node is at high CPU we will want to add another node. This is a terraform config update, and rolling back is as simple as stopping one of the sidekiq workers. @mwasilewski-gitlab can do this today and write rollback steps into the next change request gitlab-com/gl-infra/production#2012 (closed): "if pgbouncer is overloaded during this CR we can shut down the 2nd sidekiq node"
This will have to be addressed before we can increase indexing concurrency. There was some discussion on the infra team in the past about increasing the number of connections in pgbouncer; I'll have to catch up on that.
The fact that we're connecting to the patroni machines directly rather than to the pgbouncer machines is relevant because r-o traffic is not bottlenecked by saturation of the pgbouncer tier (whereas r-w traffic is).
We recently changed the way we prioritize sidekiq jobs. We are also in the process of migrating sidekiq to kubernetes (part of the sidekiq workload is already running in k8s). So there are a lot of moving parts that could be contributing to the saturation of the async pool on pgbouncer (which is getting worse). So I don't think that increasing concurrency of elastic jobs at this point is a good idea. This comment summarizes it well: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9966#note_332978721
@mwasilewski-gitlab is there any way to move forward on this now? I note the RCA is still open but I'm not sure if it's still a blocker and if so what are the action items we're waiting on.
I also note we completed the move to Kubernetes a couple days back. Do we need to wait longer to observe that or should we be safe to scale up now?
I also wanted to note that while we will be increasing traffic to the DB, as I mentioned above it is mostly read traffic. The writes are only to a single column of a single table, which is only written by this worker, with no transactions, so I'd expect the increase in writes to be basically negligible compared to the overall write traffic to the DB.
@mwasilewski-gitlab @DylanGriffith, I wonder whether we have tried to increase the concurrency as suggested in #214280 (comment 332665387). If not, can we try it in the next batch rollout? If it overloads pgbouncer, can we just fall back to one sidekiq node?
is there any way to move forward on this now? I note the RCA is still open but I'm not sure if it's still a blocker and if so what are the action items we're waiting on.
If it overloads pgbouncer, can we just fall back to one sidekiq node?
There were lots of problems (paging alerts, incidents) related to the DB this week, and I don't have the full context on the state of things. I saw @albertoramos suggested increasing the connection pool: https://gitlab.slack.com/archives/CB7P5CJS1/p1589376479433500 , but I don't know if that was deployed or what other DB-related work is in flight.
@albertoramos could you please let us know if it's OK to increase concurrency of the Elasticsearch workers (which will generate more write connections and transactions)?
I also note we completed the move to Kubernetes a couple days back. Do we need to wait longer to observe that or should we be safe to scale up now?
I'll be commenting on gitlab-com/gl-infra/scalability#377 (closed), but there's one thing I really need to bring up here: Sidekiq does not use the read-only load-balanced DB capability; all connections go through the 'main' (primary, read-write, postgres master) connection, which on GitLab.com is the sidekiq-specific pgbouncer fleet (pgbouncer-sidekiq.int.gprd.gitlab.net).
I'm not 100% clear on the reason for this choice, just that it is the current situation. It may be an historical decision worth carefully revisiting (perhaps selectively?).
@cmiskell that is indeed relevant and new information to me. These jobs can do quite large reads from the DB as they are loading data and synchronizing it to Elasticsearch. I'm not sure how it compares to other workloads in terms of impact on DB performance. Ideally we'd be able to use read replicas for Sidekiq as well; I am not familiar with the decision not to use them either.
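For context, routing reads to a replica in vanilla Rails 6 looks roughly like this (GitLab has its own load-balancing layer for web/API, so this is only a generic illustration of the capability being discussed, assuming a replica is configured under the reading role):

```ruby
# Reads issued inside this block go to the :reading role (a replica);
# writes elsewhere continue to use the primary (writing) role.
ActiveRecord::Base.connected_to(role: :reading) do
  Project.where('last_activity_at > ?', 1.day.ago).count
end
```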
@DylanGriffith @dgruzd
I am not sure, just giving an opinion. Have we looked at decreasing the heap size? I see some patterns in the heap usage that make me think GC is happening too infrequently. Of course, too small a heap would mean GC would be too frequent; a balance at the ideal size is key.
Another thought: what is our refresh interval set to? Have we tried any other settings?
@JohnMcGuire It's not something I've looked into yet. Are you seeing something in the charts to indicate that? I agree that too large a heap can slow things down, but right now I think our bottleneck is before we're even writing to Elasticsearch, and Elasticsearch is keeping up happily with everything we're throwing at it for now, so I'm not sure we want to focus on that just yet.
We use 60s as default and -1 (disable automatic refreshes) for reindexing on gitlab.com. A few months ago we just used the default value (1s). 60s is our first iteration, so we might want to change that later.
Self-managed
I believe that self-managed customers use the default refresh_interval. We don't export that setting anywhere, it's only accessible through the Elasticsearch API.
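For reference, checking or changing it via the API looks roughly like this with the elasticsearch-ruby client (the index name is a placeholder):

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# Read the current setting; if it is absent, the 1s default is in effect.
client.indices.get_settings(index: 'gitlab-production')

# Relax it to 60s (or '-1' to disable automatic refreshes during bulk indexing).
client.indices.put_settings(index: 'gitlab-production',
                            body: { index: { refresh_interval: '60s' } })
```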
Are you seeing something in the charts to indicate that?
@DylanGriffith I wouldn't want to confuse the topic if it's working for us now.
But if I understand it correctly, we have a max heap of 30.9GB and we don't see usage going over 23GB. In several cases we see an increase in CPU usage at the same time as a decrease in heap usage.
This would possibly be GC taking place. If this is correct then we might be underutilizing our heap (possibly too much old generation).
As you scale up, it's something to keep an eye on.
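One way to keep an eye on it (a sketch; the interpretation threshold is just a rule of thumb):

```ruby
require 'elasticsearch'

# Sample JVM heap usage and old-generation GC counts per node. A heap that rarely
# climbs past roughly 60-70% of max, combined with CPU spikes at collection time,
# would support the "heap is oversized / GC too infrequent" theory.
client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])
client.nodes.stats(metric: 'jvm')['nodes'].each do |_id, node|
  jvm = node['jvm']
  puts format('%-30s heap %d%%  old-gen GCs: %d',
              node['name'],
              jvm['mem']['heap_used_percent'],
              jvm['gc']['collectors']['old']['collection_count'])
end
```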
Thanks @JohnMcGuire, those are useful details to understand. Indeed this may well be something we want to look into. Our cluster is certainly over-provisioned today: we use the templates in Elastic Cloud and our cluster happens to be 11TB capacity, but we are only storing 2TB of data in it at the moment. Of course there are many factors to capacity, but by design we've chosen to over-provision because every time we add more groups we are adding more load (i.e. we are constantly increasing usage and plan to do another large increase soon).
That being said it's not clear if our heap size is going to be optimal even when the usage reaches the expected "larger amount" but probably the only way to tell will be to wait until then.
As for impacting this specific issue I'm not sure I can see much evidence of Elasticsearch indexing delays or queueing linked to large GC so I'm not sure this will need to be a focus just yet. That being said it is something that we will need to understand better as we gain maturity in managing Elasticsearch because we've put little focus on this to date.