review the impact of enabling indexing for all paying users on gitlab.com infrastructure
I moved it here from the original issue (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8110) because it's much easier to share work and discuss using MRs.
@gitlab-com/gl-infra @gl-consultants-ongres (not sure if the handle works: @adescoms @ahachete @emanuel_ongres @fcsantiago @gerardo.herzig @jorsol_ongres @kadaffy @teoincontatto @sergio.ostapowicz ) @zj-gitlab
Could as many people as possible please verify this write-up? Did I get anything wrong? Does anyone have any other thoughts or ideas?
/cc @DylanGriffith
A good summation of the potential impacts. Some thoughts:
- I'm a little concerned about how gitlab-org/gitlab#33681 (closed) ends up being implemented; having a very long list of namespace ids may run into performance issues. I hope we end up with an 'enable for certain plans' option and handle it that way, or at least verify how the code behaves with a list of thousands of ids. If it behaves badly, we could end up with poor performance in unexpected places in Rails.
- Are the indexing jobs small or big, e.g. is there a sidekiq job per commit (lots of quick small jobs), or do the jobs operate on bigger units like projects, branches, or batches of 'things' (fewer but longer-running jobs)? This defines a lot about how redis + sidekiq may be affected, and I'm not clear which one it is.
- It seems to me that ultimately we can control a lot of the potential impact simply by limiting the number of sidekiq-elasticsearch nodes and workers we have running, provided the app doesn't put jobs into the queue too quickly (Redis should be fine with a fairly large backlog, but if we can't process jobs fast enough the queue will obviously fill up and fall over eventually). Is there any backpressure mechanism (existing or planned) in the indexing scheduler that can notice the jobs are backed up and hold off adding more? A rough idea of what I mean is sketched below. Given this is a background processing task ("it will be done when it's done"), there's no harm in it taking longer as long as we're controlling the flow.
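For illustration, a minimal sketch of the kind of backpressure check meant here, assuming a hypothetical `ElasticIndexingScheduler`; the threshold is made up, and only the `Sidekiq::Queue` API and the queue names come from this write-up:

```ruby
# Hedged sketch of a scheduler-side backpressure check (the scheduler class,
# method names and threshold are hypothetical).
require 'sidekiq/api'

class ElasticIndexingScheduler
  MAX_QUEUE_DEPTH = 10_000 # illustrative threshold, not a real setting

  # True when the elastic queues are backed up and we should hold off.
  def self.backed_up?
    %w[elastic_indexer elastic_commit_indexer].sum { |q| Sidekiq::Queue.new(q).size } > MAX_QUEUE_DEPTH
  end

  def self.schedule(pending_updates)
    return if backed_up? # skip this cycle; the updates stay pending for the next run

    pending_updates.each { |args| ElasticIndexerWorker.perform_async(*args) }
  end
end
```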
> I'm a little concerned about how gitlab-org/gitlab#33681 (closed) ends up being implemented
Thanks, I've explicitly called this out (mostly as a reminder to myself) to verify that this scales sensibly: gitlab-org/gitlab#33681 (closed)
> Are the indexing jobs small or big, e.g. is there a sidekiq job per commit (lots of quick small jobs), or do the jobs operate on bigger units like projects, branches, or batches of 'things' (fewer but longer-running jobs)?
There is a mix. When we enable it for the groups initially, we will create a few large jobs that load lots of data from the DB and perform a few bulk operations against the API. After that we switch to incremental indexing, which for now is just going to be a single job per update of any indexed DB model or per push to any branch (roughly the shape sketched below). I do have some longer-term ideas about batching things even for incremental updates in gitlab-org/gitlab#34086 (closed), but we won't get to that before we start this rollout, and we'll use the results of the progressive rollout to inform its value.
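A hedged sketch of that incremental shape; the callbacks and argument lists here are illustrative assumptions rather than the actual wiring, and only the worker and queue names come from this discussion:

```ruby
# Illustrative only: one small sidekiq job per DB model update, and one per
# branch push, rather than batches.

# DB-backed models: enqueue a single `elastic_indexer` job when a record changes.
class MergeRequest < ApplicationRecord
  after_commit on: [:create, :update] do
    # args loosely mirror the shape seen in the sidekiq logs quoted further down:
    # operation, class, id, es_id, options
    ElasticIndexerWorker.perform_async('update', self.class.name, id, "merge_request_#{id}", {})
  end
end

# Repository content: enqueue a single `elastic_commit_indexer` job per push.
def after_push(project)
  ElasticCommitIndexerWorker.perform_async(project.id)
end
```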
> It seems to me that ultimately we can control a lot of the potential impact simply by limiting the number of sidekiq-elasticsearch nodes and workers we have running
I will also do some verification as part of gitlab-org/gitlab#33681 (closed) to ensure that we can disable it for groups (i.e. roll back) safely in case things become overloaded.
@adescoms @emanuel_ongres @gerardo.herzig Please review the MR from Michal.
Gitaly problems right now are around observability:
- There's no dashboard for the indexer in Grafana right now, making me worry about the production readiness even in its current form
- There's AFAIK not a way to correlate a gRPC call to Gitaly to the indexer. gitlab-org/gitaly#2053 (closed)
@mwasilewski-gitlab how did you review the current impact the index jobs have?
> There's no dashboard for the indexer in Grafana right now, making me worry about the production readiness even in its current form

I've created gitlab-org/gitlab#35519 (closed) to track this.
> There's AFAIK not a way to correlate a gRPC call to Gitaly to the indexer. gitlab-org/gitaly#2053 (closed)

I've created gitlab-org/gitlab#35520 (closed) to track this.
- elastic/indexing_paid_users_on_gitlab_com.md 0 → 100644
> As part of preparation for indexing namespaces of paying users on gitlab.com, we need to identify which parts of the infrastructure will be impacted as well as how to scale and monitor them.
>
> The high-level flow is: events in rails (e.g. a user does a git push or creates an MR) -> a sidekiq job of type `elastic_indexer` or `elastic_commit_indexer` is enqueued -> a sidekiq worker picks up a job from the queue and executes it -> the sidekiq job talks to Gitaly and/or Postgres -> the sidekiq job sends the results of indexing the data from Gitaly/DB to an Elastic cluster.
>
> The biggest risk here is that **we cannot scale Gitaly**.
>
> Infrastructure components that will be impacted:
> - Redis
>   - monitoring:
>     - redis-sidekiq overview: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&from=now-6h&to=now
>     - https://dashboards.gitlab.net/d/wccEP9Imk/redis?orgId=1&refresh=1m
>   - scaling:
>     - only the `redis-sidekiq-*` fleet will be impacted, as that's what is used by sidekiq
>     - we don't really have a way of scaling redis clusters horizontally, so we can only resize the machines on which redis is running. We currently use `n1-standard-2`, so there is a lot of room for scaling. It would have to be done machine by machine (replica taken out of the cluster, resized, added back to the cluster, master failed over). The orchestration of the entire process is manual (we do not have any mechanism that would, for example, check the health of machines and add them to the cluster if healthy).
>     - we are currently at 50% CPU saturation on the redis-sidekiq master (we can resize VMs), 2GB out of 8GB memory usage (we can resize VMs), and network bandwidth is at 20MB/s (we get 2Gbps per vCPU core: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8278)
>   - scaling tests performed: no

mentioned in issue gitlab-org/gitlab#35520 (closed)
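To make the quoted flow concrete, here is a minimal, hedged sketch of the worker side: the changed record is read from a Postgres replica and a document is shipped to the Elastic cluster. The class and queue names come from the write-up; the method body, the index name, and `as_indexed_json` are assumptions for illustration, not the actual implementation.

```ruby
# Illustrative sketch, not the real worker: one `elastic_indexer` job reads the
# changed record from Postgres (read-only replica) and sends a document to the
# Elasticsearch cluster. Repository content would go through Gitaly instead.
class ElasticIndexerWorker
  include Sidekiq::Worker
  sidekiq_options queue: :elastic_indexer

  def perform(operation, klass, record_id, es_id, _options = {})
    if operation == 'delete'
      elastic_client.delete(index: 'gitlab-production', id: es_id) # index name is illustrative
      return
    end

    record = klass.constantize.find_by(id: record_id) # Postgres read
    return unless record

    # `as_indexed_json` is an assumption standing in for "serialize for search"
    elastic_client.index(index: 'gitlab-production', id: es_id, body: record.as_indexed_json)
  end

  private

  def elastic_client
    @elastic_client ||= Elasticsearch::Client.new # assumes elasticsearch-ruby is configured
  end
end
```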
- elastic/indexing_paid_users_on_gitlab_com.md 0 → 100644
> - Sidekiq
>   - monitoring:
>     - overview (probably most up to date): https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-6h&to=now
>     - https://dashboards.gitlab.net/d/9GOIu9Siz/sidekiq-stats?orgId=1&refresh=30s&from=now-3h&to=now
>     - https://dashboards.gitlab.net/d/000000124/sidekiq-workers?orgId=1&refresh=5s
>     - logs of indexing jobs in the last 24h: https://log.gitlab.net/goto/4ea00ecca865157241befd3f67dfd5f8
>     - logs of indexing jobs per host: https://log.gitlab.net/goto/d24e5780eb99bdb2a734bd0de1102843
>     - the only unique values in sidekiq jobs logs are `job_id`s and they cannot be easily matched with a project/user

Not 100% sure if this is referring to a different thing, but we can certainly see all the details about the jobs by looking at the logs for `ElasticIndexerWorker` and `ElasticCommitIndexerWorker` in `pubsub-sidekiq-inf-gprd-*` using https://log.gitlab.net/goto/c9291d68aa0e25292e30a3a064a2f794 . You can see:

`{"severity":"INFO","time":"2019-11-04T03:41:23.138Z","class":"ElasticIndexerWorker","args":["update","MergeRequest","...","merge_request_...","{\"changed_fields\"=>[\"merge_status\"]}"],"retry":2,"queue":"elastic_indexer","jid":"...","created_at":"2019-11-04T03:41:23.136146Z","correlation_id":"...", ...}`

The merge request id should make it simple to track down the project id or user. Also the `correlation_id` should link it all up to the original requests from the user.

Making a note here of what I found while working on production#1316 (closed): `json.args` for many jobs will contain some identifiable information, e.g. project_id or, as Dylan pointed out, the merge request id in the case of the indexing jobs. `correlation_id` works in some cases, so it's worth trying it as well, see gitlab-org/gitlab#30157. Also, some sidekiq jobs have correlation_ids that do not match any logs in other log sources; it is to be determined whether this is a separate issue from the one linked above.
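A small, hedged sketch of pulling that identifying information out of such a structured log line; the field names match the example quoted above, while the helper itself is hypothetical:

```ruby
require 'json'

# Hypothetical helper: given one ElasticIndexerWorker log line (JSON, as in the
# example above), return the pieces that let you tie the job back to a
# project/user via the model id and the correlation_id.
def identify_indexing_job(log_line)
  entry = JSON.parse(log_line)
  return unless entry['class'] == 'ElasticIndexerWorker'

  operation, model, record_id, es_id, _options = entry['args']
  {
    correlation_id: entry['correlation_id'],
    operation: operation,
    model: model,         # e.g. "MergeRequest" -> its project and author are in the DB
    record_id: record_id,
    es_id: es_id
  }
end
```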
- elastic/indexing_paid_users_on_gitlab_com.md 0 → 100644
>   - scaling:
>     - a separate fleet of sidekiq workers handling indexing jobs was created; we can easily scale it using terraform. Infra issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8109, change issue: https://gitlab.com/gitlab-com/gl-infra/production/issues/1261
>     - Changing the number of sidekiq workers that are processing indexing jobs is the only means of controlling the number of indexing jobs running at the same time. This means that this is the only mechanism we have for limiting the impact on, for example, Gitaly.
>     - We might decide to scale up if the initial indexing is going too slowly, or if after the initial indexing the sidekiq queue for elastic jobs keeps growing
>   - scaling tests performed: yes
> - pgbouncer and Postgres
>   - monitoring:
>     - overview: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-2d&to=now
>     - pgbouncer overview: https://dashboards.gitlab.net/d/PwlB97Jmk/pgbouncer-overview?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni
>     - pgbouncer on hosts: https://dashboards.gitlab.net/d/000000285/pgbouncer-detail?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni
>     - replication overview: https://dashboards.gitlab.net/d/000000244/postgresql-replication-overview?orgId=1&from=1556204925361&to=1556226525361&var-environment=gprd&var-prometheus=Global&var-prometheus_app=Global
>   - scaling:
>     - As per https://about.gitlab.com/handbook/engineering/infrastructure/production-architecture/#database-architecture we have only one write replica. However, elastic sidekiq jobs are not doing any writes, so they will only talk to read-only replicas. At the moment of writing we have 11 read-only replicas.

Right now the `ElasticCommitIndexerWorker` is writing to the DB: it does a `SELECT` or `INSERT` on the `index_statuses` table and then also an `UPDATE` on the `index_statuses` table. I don't imagine this is cause for concern, but if there are large volumes of git pushes that are now triggering more frequent DB writes where there previously weren't any, it may be problematic.
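A hedged sketch of the access pattern being described; the model and column names are assumptions based on the `index_statuses` table name, not the actual worker code:

```ruby
# Illustrative only: the SELECT-or-INSERT followed by UPDATE pattern described above.
class IndexStatus < ApplicationRecord
  self.table_name = 'index_statuses'
end

def record_indexed_commit(project_id, indexed_sha)
  # SELECT the row, or INSERT it if this project has never been indexed
  status = IndexStatus.find_or_create_by!(project_id: project_id)

  # UPDATE once the repository content has been shipped to Elasticsearch
  status.update!(last_commit: indexed_sha, indexed_at: Time.now.utc)
end
```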
mentioned in issue gitlab-org/gitlab#34546 (closed)
- elastic/indexing_paid_users_on_gitlab_com.md 0 → 100644
>   - scaling tests performed: no

Interesting read: https://slack.engineering/scaling-slacks-job-queue-687222e9d100 (although probably not directly relevant; Kafka was added in front of Redis to ensure write availability during queue build-up, similarly to how we use GCP PubSub in front of ES).
One of the neat features of Redis I'm hoping to use for improving the throughput of writes is sorted sets, as part of gitlab-org/gitlab#34086 (closed). When I did a quick bit of research I couldn't find any similar technology in other queues. Given we have idempotent writes I think we'll have some options here, but the sorted sets make deduping while items are still in the queue quite efficient and could reduce the amount of memory used by the unprocessed items (see the sketch below).

Of course Redis won't ever be able to have the same write availability as Kafka, but switching to the bulk API and using sorted sets will reduce the overall workload of sidekiq. I hope we can keep Redis up to this task as we scale, but if not I'd like to see what other ways there are to accomplish the same batching with a different queue technology.
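A minimal sketch of the sorted-set idea, assuming a hypothetical key layout (`elastic:incremental_updates`); only the sorted-set commands and the dedup-while-queued idea come from the comment above:

```ruby
require 'redis'

QUEUE_KEY = 'elastic:incremental_updates' # hypothetical key name

# Enqueue side: re-adding the same member only refreshes its score, so a record
# updated many times while waiting is stored (and later indexed) once.
def track_update(redis, es_id)
  redis.zadd(QUEUE_KEY, Time.now.to_f, es_id)
end

# Worker side: take a batch of the oldest pending items to index via a single
# bulk request (bulk call omitted; races between ZRANGE and ZREM are ignored
# here for brevity).
def pop_batch(redis, size = 1_000)
  batch = redis.zrange(QUEUE_KEY, 0, size - 1)
  redis.zrem(QUEUE_KEY, batch) unless batch.empty?
  batch
end

redis = Redis.new
track_update(redis, 'merge_request_42')
track_update(redis, 'merge_request_42') # deduplicated: still a single entry
pop_batch(redis)                        # => ["merge_request_42"]
```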
- elastic/indexing_paid_users_on_gitlab_com.md 0 → 100644
>   - scaling tests performed: no

Another interesting read regarding Redis being used for background job processing at scale: https://kirshatrov.com/2019/01/03/state-of-background-jobs/
related MR: gitlab-org/gitlab!18976 (merged)
mentioned in merge request gitlab-org/gitlab!18976 (merged)
mentioned in epic gitlab-org&1736 (closed)
mentioned in issue gitlab-org/gitlab#214280 (closed)
@mwasilewski-gitlab Should we close this MR?