
review the impact of enabling indexing for all paying users on gitlab.com infrastructure

Closed Michal Wasilewski requested to merge michalw/indexing-paid-users-on-gitlab-com into master
7 unresolved threads

I moved it here from the original issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8110 because it's much easier to share work and discuss using MRs.

Edited by Michal Wasilewski

Merge request reports


Activity

  • added 1 commit


  • @gitlab-com/gl-infra @gl-consultants-ongres (not sure if the handle works: @adescoms @ahachete @emanuel_ongres @fcsantiago @gerardo.herzig @jorsol_ongres @kadaffy @teoincontatto @sergio.ostapowicz ) @zj-gitlab

    Could as many people as possible please verify this write-up? Did I get anything wrong? Does anyone have any other thoughts or ideas?

    /cc @DylanGriffith

    • A good summation of the potential impacts. Some thoughts:

      1. I'm a little concerned regarding how gitlab-org/gitlab#33681 (closed) ends up being implemented; a very long list of namespace ids may run into performance issues. I hope we either end up with an 'enable for certain plans' option and handle it that way, or verify how the code behaves with a list of thousands of ids. If it behaves badly, we could end up with poor performance in unexpected places in rails.
      2. Are the indexing jobs small or big, e.g. is there a sidekiq job per commit (lots of quick small jobs), or do the jobs operate on bigger units like projects, branches, or batches of 'things' (fewer but longer-running jobs)? This defines a lot about how redis + sidekiq may be affected, and I'm not clear which one it is.
      3. It seems to me that ultimately we can control a lot of the potential impact simply by limiting the number of sidekiq-elasticsearch nodes + workers we have running, provided the app doesn't put jobs into the queue too quickly (redis should be fine with a fairly large backlog, but if we can't process fast enough it will obviously fill up and fall over eventually). Is there any backpressure mechanism, existing or planned, in the indexing scheduler that can notice the jobs are backed up and hold off adding more? A rough sketch of what such a check could look like follows below. Given this is a background processing task ("it will be done when it's done"), there's no harm in it taking longer, because we're controlling the flow.
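For illustration only (this is not existing GitLab code; the queue names are the ones used on gitlab.com, but the threshold and the `schedule_indexing_batch` hook are hypothetical), such a backpressure check could peek at the elastic queues through Sidekiq's own API before enqueuing more work:

```ruby
# Hypothetical sketch of a backpressure check for the indexing scheduler.
require 'sidekiq/api'

ELASTIC_QUEUES = %w[elastic_indexer elastic_commit_indexer].freeze
BACKLOG_LIMIT  = 10_000 # made-up threshold

# Total number of jobs currently waiting in the elastic queues.
def elastic_backlog
  ELASTIC_QUEUES.sum { |name| Sidekiq::Queue.new(name).size }
end

# Hypothetical scheduler hook: skip this cycle if the workers are behind.
def schedule_indexing_batch
  if elastic_backlog > BACKLOG_LIMIT
    Sidekiq.logger.info("elastic backlog #{elastic_backlog} over limit, holding off")
    return
  end

  # ... enqueue the next batch of indexing jobs here ...
end
```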
    • @cmiskell

      I'm a little concerned regarding how gitlab-org/gitlab#33681 (closed) ends up being implemented

      Thanks, I've explicitly called this out (mostly as a reminder to myself) to verify that this scales sensibly: gitlab-org/gitlab#33681 (closed)

      Are the indexing jobs small or big, e.g. is there a sidekiq job per commit (lots of quick small jobs), or do the jobs operate on bigger units like projects, branches, or batches of 'things' (fewer but longer-running jobs)?

      There is a mix. When we initially enable it for the groups we will create a few large jobs that load lots of data from the DB and perform a few bulk operations against the API. After that we switch to incremental indexing, which for now is going to be single jobs per update of any indexed DB model or per push to any branch. I do have some longer-term ideas about batching things even for incremental updates in gitlab-org/gitlab#34086 (closed), but we won't get to that before we start this rollout, and we'll use the results of the progressive rollout to inform its value.

      It seems to me that ultimately we can control a lot of the potential impact by simply limiting the number of sidekiq-elasticsearch nodes + workers we have running

      I will also do some verification as part of gitlab-org/gitlab#33681 (closed) to ensure that we can disable for groups (i.e. roll back) safely in case things become overloaded.

  • @adescoms @emanuel_ongres @gerardo.herzig Please review the MR from Michal.

1 As part of preparation for indexing namespaces of paying users on gitlab.com, we need to identify which parts of the infrastructure will be impacted as well as how to scale and monitor them.
2
3 The high level flow is: events in rails (e.g. user does a git push or creates an MR) -> sidekiq job of type `elastic_indexer` or `elastic_commit_indexer` is enqueued -> sidekiq worker picks up a job from the queue and executes it -> the sidekiq job talks to Gitaly AND/OR Postgres -> the sidekiq job sends the results from indexing data from Gitaly/DB to an Elastic cluster
4
5 The biggest risk here is that **we cannot scale Gitaly**.
6
7 Infrastructure components that will be impacted:
8 - Redis
9 - monitoring:
10 - redis-sidekiq overview: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&from=now-6h&to=now
11 - https://dashboards.gitlab.net/d/wccEP9Imk/redis?orgId=1&refresh=1m
12 - scaling:
13 - only `redis-sidekiq-*` fleet will be impacted as that's what is used by sidekiq
14 - we don't really have a way of scaling redis clusters horizontally, so we can only resize the machines on which redis is running. We currently use `n1-standard-2` so there is a lot of space for scaling. It would have to be done individually (replica taken out of the cluster, resized, added to the cluster, master failed over). The orchestration of the entire process is manual (we do not have any mechanisms that would for example check the health of machines and add them to the cluster if healthy).
15 - we are currently at 50% CPU saturation on the redis-sidekiq master (we can resize VMs), 2GB out of 8GB memory usage (we can resize VMs), and network bandwidth is at 20 MB/s (we get 2 Gbps per vCPU core: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8278 )
16 - scaling tests performed: no
  • Keep in mind, redis is single threaded. Upping machine size only helps with storing more data in redis. If we hit redis too hard (we've done this in the past) we start to exhaust CPU on the primary and it will fail over.
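As a rough illustration of watching that single thread (a sketch only; the connection URL is an assumption and this is not part of our existing tooling), the cumulative CPU counters from `INFO` can be sampled with the `redis` gem:

```ruby
# Sketch: sample Redis CPU counters to estimate how busy the single
# main thread is. Connection details are assumed.
require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_SIDEKIQ_URL', 'redis://localhost:6379'))

before = redis.info('cpu')
sleep 10
after = redis.info('cpu')

# used_cpu_sys/used_cpu_user are cumulative seconds of CPU time; the delta
# over the sample window approximates utilisation of the Redis thread.
busy_seconds = (after['used_cpu_sys'].to_f - before['used_cpu_sys'].to_f) +
               (after['used_cpu_user'].to_f - before['used_cpu_user'].to_f)

puts format('redis main thread ~%.0f%% busy over the last 10s', busy_seconds / 10 * 100)
```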

  • 9 - monitoring:
    10 - redis-sidekiq overview: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&from=now-6h&to=now
    11 - https://dashboards.gitlab.net/d/wccEP9Imk/redis?orgId=1&refresh=1m
    12 - scaling:
    13 - only `redis-sidekiq-*` fleet will be impacted as that's what is used by sidekiq
    14 - we don't really have a way of scaling redis clusters horizontally, so we can only resize the machines on which redis is running. We currently use `n1-standard-2` so there is a lot of space for scaling. It would have to be done individually (replica taken out of the cluster, resized, added to the cluster, master failed over). The orchestration of the entire process is manual (we do not have any mechanisms that would for example check the health of machines and add them to the cluster if healthy).
    15 - we are currently at 50% CPU saturation on the redis-sidekiq master (we can resize VMs), 2GB out of 8GB memory usage (we can resize VMs), and network bandwidth is at 20 MB/s (we get 2 Gbps per vCPU core: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8278 )
    16 - scaling tests performed: no
    17 - Sidekiq
    18 - monitoring:
    19 - overview (probably most up to date): https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-6h&to=now
    20 - https://dashboards.gitlab.net/d/9GOIu9Siz/sidekiq-stats?orgId=1&refresh=30s&from=now-3h&to=now
    21 - https://dashboards.gitlab.net/d/000000124/sidekiq-workers?orgId=1&refresh=5s
    22 - logs of indexing jobs in the last 24h: https://log.gitlab.net/goto/4ea00ecca865157241befd3f67dfd5f8
    23 - logs of indexing jobs per host: https://log.gitlab.net/goto/d24e5780eb99bdb2a734bd0de1102843
    24 - the only unique values in sidekiq job logs are `job_id`s and they cannot be easily matched with a project/user
    • Not 100% sure if this is referring to a different thing, but we can certainly see all the details about the jobs by looking at the logs for ElasticIndexerWorker and ElasticCommitIndexerWorker in pubsub-sidekiq-inf-gprd-* using https://log.gitlab.net/goto/c9291d68aa0e25292e30a3a064a2f794 . You can see:

      {"severity":"INFO","time":"2019-11-04T03:41:23.138Z","class":"ElasticIndexerWorker","args":["update","MergeRequest","...","merge_request_...","{\"changed_fields\"=>[\"merge_status\"]}"],"retry":2,"queue":"elastic_indexer","jid":"...","created_at":"2019-11-04T03:41:23.136146Z","correlation_id":"...","

      The merge request id should make it simple to track down the project id or user. Also the correlation_id should link it all up to the original requests from the user.

      Edited by Dylan Griffith
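For example, from a Rails console the logged args can be mapped back to a project and user roughly like this (a sketch; the argument positions come from the log line above and the id is made up):

```ruby
# Sketch: resolve a logged ElasticIndexerWorker job back to a project/user.
# Logged args look like: ["update", "MergeRequest", <id>, "merge_request_<id>", {...}]
args = ["update", "MergeRequest", 1234567, "merge_request_1234567"] # example values

mr = MergeRequest.find(args[2])
mr.project_id # => the project the job belongs to
mr.author_id  # => the user who created the merge request
```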
    • Making a note here of what I found while working on production#1316 (closed): json.args for many jobs will contain some identifiable information, e.g. project_id or, as Dylan pointed out, in the case of the indexing jobs the merge request id.

      correlation_id works in some cases, so it's worth trying it as well, see: gitlab-org/gitlab#30157

      Also, some sidekiq jobs have correlation_ids that do not match any logs in other log sources; it remains to be determined whether this is a separate issue from the one linked above.

    • perhaps logs for sidekiq jobs that do not have any other logs matching the correlation_id are from jobs started by cron?

  • 22 - logs of indexing jobs in the last 24h: https://log.gitlab.net/goto/4ea00ecca865157241befd3f67dfd5f8
    23 - logs of indexing jobs per host: https://log.gitlab.net/goto/d24e5780eb99bdb2a734bd0de1102843
    24 - the only unique values in sidekiq job logs are `job_id`s and they cannot be easily matched with a project/user
    25 - scaling:
    26 - a separate fleet of sidekiq workers handling indexing jobs was created, we can easily scale it using terraform, infra issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8109, change issue: https://gitlab.com/gitlab-com/gl-infra/production/issues/1261
    27 - Changing the number of sidekiq workers that are processing indexing jobs is the only means of controlling the number of indexing jobs running at the same time. This means it is also the only mechanism we have for limiting the impact on, for example, Gitaly.
    28 - We might decide to scale up if the initial indexing is going too slowly or if, after the initial indexing, the sidekiq queue for elastic jobs keeps growing
    29 - scaling tests performed: yes
    30 - pgbouncer and Postgres
    31 - monitoring:
    32 - overview: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-2d&to=now
    33 - pgbouncer overview: https://dashboards.gitlab.net/d/PwlB97Jmk/pgbouncer-overview?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni
    34 - pgbouncer on hosts: https://dashboards.gitlab.net/d/000000285/pgbouncer-detail?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni
    35 - replication overview: https://dashboards.gitlab.net/d/000000244/postgresql-replication-overview?orgId=1&from=1556204925361&to=1556226525361&var-environment=gprd&var-prometheus=Global&var-prometheus_app=Global
    36 - scaling:
    37 - As per https://about.gitlab.com/handbook/engineering/infrastructure/production-architecture/#database-architecture we have only one writable primary. However, elastic sidekiq jobs are not doing any writes, so they will only talk to read-only replicas. At the time of writing we have 11 read-only replicas.
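Since these jobs only read, replica lag is the main thing to keep an eye on next to the replication overview dashboard. A hand-rolled check could look like this (a sketch only; the host and credentials are placeholders, not real connection details):

```ruby
# Sketch: measure how far a read-only replica lags behind the primary
# using standard PostgreSQL functions. Connection details are placeholders.
require 'pg'

replica = PG.connect(host: 'replica.example.internal',
                     dbname: 'gitlabhq_production',
                     user: 'gitlab-monitoring')

row = replica.exec(<<~SQL).first
  SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
SQL

puts "replica lag: #{row['lag_seconds'].to_f.round(1)}s"
```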
  • 1 As part of preparation for indexing namespaces of paying users on gitlab.com, we need to identify which parts of the infrastructure will be impacted as well as how to scale and monitor them.
    2
    3 The high level flow is: events in rails (e.g. user does a git push or creates an MR) -> sidekiq job of type `elastic_indexer` or `elastic_commit_indexer` is enqueued -> sidekiq worker picks up a job from the queue and executes it -> the sidekiq job talks to Gitaly AND/OR Postgres -> the sidekiq job sends the results from indexing data from Gitaly/DB to an Elastic cluster
    4
    5 The biggest risk here is that **we cannot scale Gitaly**.
    6
    7 Infrastructure components that will be impacted:
    8 - Redis
    9 - monitoring:
    10 - redis-sidekiq overview: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&from=now-6h&to=now
    11 - https://dashboards.gitlab.net/d/wccEP9Imk/redis?orgId=1&refresh=1m
    12 - scaling:
    13 - only `redis-sidekiq-*` fleet will be impacted as that's what is used by sidekiq
    14 - we don't really have a way of scaling redis clusters horizontally, so we can only resize the machines on which redis is running. We currently use `n1-standard-2` so there is a lot of space for scaling. It would have to be done individually (replica taken out of the cluster, resized, added to the cluster, master failed over). The orchestration of the entire process is manual (we do not have any mechanisms that would for example check the health of machines and add them to the cluster if healthy).
    15 - we are currently at 50% CPU saturation on the redis-sidekiq master (we can resize VMs), 2GB out of 8GB memory usage (we can resize VMs), and network bandwidth is at 20 MB/s (we get 2 Gbps per vCPU core: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8278 )
    16 - scaling tests performed: no
    • interesting read: https://slack.engineering/scaling-slacks-job-queue-687222e9d100 (although probably not directly relevant, Kafka was added in front of Redis to ensure write availability during queue build up, similarly to how we use GCP PubSub in front of ES)

    • One of the neat features of Redis I'm hoping to use for improving write throughput is sorted sets, as part of gitlab-org/gitlab#34086 (closed). When I did a quick bit of research I couldn't find any similar technology in other queues. Given we have idempotent writes I think we'll have some options here, but sorted sets make deduping items while they are still in the queue quite efficient and could reduce the amount of memory used by unprocessed items.

      Of course Redis won't ever be able to match the write availability of Kafka, but switching to the bulk API and using sorted sets will reduce the overall workload of sidekiq. I hope we can keep Redis up to this task as we scale, but if not I'd like to look at other ways to accomplish the same batching with a different queue technology.

      Edited by Dylan Griffith
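A minimal sketch of the sorted-set idea described above (the key name and the `Class:id` member format are made up for illustration, not what gitlab-org/gitlab#34086 (closed) will necessarily use): ZADD with the current timestamp as the score dedupes repeated updates to the same record while it waits in the queue, and a worker can then take a batch in one go.

```ruby
# Sketch of dedup-while-queued using a Redis sorted set. Key name and
# member format are hypothetical.
require 'redis'

QUEUE_KEY = 'elastic:incremental_updates' # made-up key

# Producer: a repeated update to the same record just refreshes the score
# of the existing member, so the set stays deduplicated.
def enqueue_update(redis, record)
  redis.zadd(QUEUE_KEY, Time.now.to_f, "#{record.class.name}:#{record.id}")
end

# Consumer: take up to `limit` of the oldest members as one bulk batch.
# (Not atomic as written; a real version would wrap this in MULTI or Lua.)
def pop_batch(redis, limit: 1000)
  members = redis.zrange(QUEUE_KEY, 0, limit - 1)
  redis.zrem(QUEUE_KEY, members) unless members.empty?
  members
end

redis = Redis.new
batch = pop_batch(redis)
# => e.g. ["MergeRequest:1234567", "Project:42", ...] ready for one bulk API call
```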
  • Michal Wasilewski marked as a Work In Progress

  • Michal Wasilewski unmarked as a Work In Progress


  • Another concern is that currently we are unable to "pause" the indexer. Whenever the integration is stopped the index needs to be reindexed from scratch. Stopping the sidekiq workers doesn't really solve the problem because the job queue is held in Redis.
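For reference, the backlog can be inspected (and, destructively, dropped) through Sidekiq's standard API; a sketch of what "stopping" currently means in practice:

```ruby
# Sketch: the elastic queues live in Redis regardless of whether any sidekiq
# workers are running, so shutting workers down does not pause anything --
# and clearing the queues loses the jobs for good.
require 'sidekiq/api'

%w[elastic_indexer elastic_commit_indexer].each do |name|
  queue = Sidekiq::Queue.new(name)
  puts "#{name}: #{queue.size} jobs waiting"
  # queue.clear # irreversible: anything dropped here has to be reindexed
end
```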

  • Yes, certainly; the MR is 8 months old, a lot of it is outdated at this point, and we've gone through similar "review the impact" exercises multiple times since then.
