
Draft: Add cron job for scheduling seat refreshes

Vijay Hawoldar requested to merge vij-add-refresh-worker-cron into master

What does this MR do and why?

Adds a cron job to perform GitLab subscription seat refreshes via a limited capacity worker (added previously).

For more info/context, see below.

How to set up and validate locally

  1. Enable the feature flag:
      Feature.enable :limited_capacity_seat_refresh_worker
  2. Ensure you have at least one Namespace with a GitlabSubscription (i.e. purchase a plan for a group)
  3. Check how many you have that require a seat attribute refresh
       # IS NULL is required here: in SQL, `= NULL` never matches any row
       requiring_refresh_count = GitlabSubscription.where('last_seat_refresh_at IS NULL OR last_seat_refresh_at <= ?', 1.day.ago).count
    
       => 1 # or however many you have
  4. Optional: if the previous command returned 0, update a subscription so that it qualifies:
      GitlabSubscription.last.update(last_seat_refresh_at: 2.months.ago)
  5. Enqueue the limited capacity job:
       GitlabSubscriptions::ScheduleRefreshSeatsWorker.new.perform
  6. Confirm the subscriptions were updated
       requiring_refresh_count = GitlabSubscription.where('last_seat_refresh_at IS NULL OR last_seat_refresh_at <= ?', 1.day.ago).count
    
       => 0 # if it was successful

Background

Every subscription for GitLab.com is represented by a GitlabSubscription.

The GitlabSubscription contains 3 key pieces of information:

  1. max_seats_used - the maximum number of billable seats the Namespace has used
  2. seats_in_use - the current number of billable seats the Namespace is using
  3. seats_owed - the number of seats the customer needs to pay for

To keep these attributes up to date, an existing worker runs every day at midnight UTC. It:

  • iterates over every single GitlabSubscription
  • refreshes the seat attributes for each one
  • updates the DB records via a manual SQL UPDATE to be more performant (one UPDATE query for each batch of subscriptions)

The Problem

  1. The worker that runs each day has historically been prone to errors caused by timeouts (gitlab-com/gl-infra/scalability#1116 (closed))
  2. The existing job is very long running, so it is at risk of being interrupted (e.g. by a pod or process restart), which leaves namespaces without updated seat attributes. Its run time will only increase as the number of subscriptions on GitLab.com grows
  3. The manual SQL means we bypass any callbacks defined in the model
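
To illustrate point 3, here is a toy sketch in plain Ruby. It is not ActiveRecord and not the code from this MR: ToyRecord, bulk_update and after_update are hypothetical names. It only shows that a write path which sets attributes directly never reaches the hooks that a per-record update runs.

```ruby
# Toy model (NOT ActiveRecord): all names here are hypothetical.
class ToyRecord
  attr_reader :attributes, :callback_log

  def initialize(attributes)
    @attributes = attributes
    @callback_log = []
  end

  # Analogous to a per-record update that goes through lifecycle hooks.
  def update(changes)
    @attributes.merge!(changes)
    after_update(changes)
    true
  end

  # Analogous to a hand-written SQL UPDATE over a batch:
  # the values change, but no hook is invoked.
  def self.bulk_update(records, changes)
    records.each { |record| record.attributes.merge!(changes) }
    records.size
  end

  private

  # Stand-in for any callback defined on the model.
  def after_update(changes)
    @callback_log << "after_update: #{changes.keys.join(', ')}"
  end
end

via_update = ToyRecord.new(seats_in_use: 3)
via_bulk   = ToyRecord.new(seats_in_use: 3)

via_update.update(seats_in_use: 5)
ToyRecord.bulk_update([via_bulk], seats_in_use: 5)
```

Both records end up with the same attribute values, but only via_update has an entry in its callback_log.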

The Solution

The solution is to replace the single job with limited capacity jobs (see: Sidekiq limited capacity worker).

Doing so will allow us to:

  1. Run one quick job per GitlabSubscription
  2. Loop over all GitlabSubscription records without fear of interruption
  3. Use “normal” update methods and avoid bypassing the regular lifecycle hooks/callbacks

🎉

Recalculating the seat attributes is important for billing and usage statistics, so the plan is to add the new limited capacity worker behind a feature flag (rollout issue) so that we can have both running at the same time initially.

Once we have confirmed the new job is working as expected, we can remove the old job and the feature flag.

How will it work?

The limited capacity setup will essentially do the following:

  1. A cron job will schedule the seat attribute refresh every 6 hours
  2. The refresh worker will:
    1. Look for the next GitlabSubscription that has not been refreshed in the last 24 hours
    2. Immediately update the last refreshed timestamp (last_seat_refresh_at) so that it doesn’t get picked up by a parallel job
    3. Refresh the seats for that subscription
  3. The scheduler will queue a new job if there is remaining work and the maximum number of running jobs hasn’t already been queued
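
The flow above can be sketched as a small in-memory simulation in plain Ruby. This is an illustration under assumptions, not the worker added in these MRs: Subscription, RefreshSimulation and MAX_RUNNING_JOBS (including its value of 2) are hypothetical, and the jobs run sequentially rather than in parallel Sidekiq processes.

```ruby
# Hypothetical in-memory stand-in for a GitlabSubscription row.
Subscription = Struct.new(:id, :last_seat_refresh_at, :refreshed)

class RefreshSimulation
  MAX_RUNNING_JOBS = 2            # assumed capacity limit, not taken from the MR
  REFRESH_INTERVAL = 24 * 60 * 60 # 24 hours, in seconds

  attr_reader :jobs_run

  def initialize(subscriptions, now:)
    @subscriptions = subscriptions
    @now = now
    @jobs_run = 0
  end

  # Step 2.1: the next subscription that was never refreshed,
  # or was last refreshed 24 hours ago or earlier.
  def next_stale
    @subscriptions.find do |s|
      s.last_seat_refresh_at.nil? || s.last_seat_refresh_at <= @now - REFRESH_INTERVAL
    end
  end

  def remaining_work?
    !next_stale.nil?
  end

  # One refresh job: claim the subscription first, then refresh it.
  def perform_work
    subscription = next_stale
    return unless subscription

    subscription.last_seat_refresh_at = @now # step 2.2: claim via timestamp
    subscription.refreshed = true            # step 2.3: placeholder for the seat refresh
    @jobs_run += 1
  end

  # Steps 1 and 3: the cron-triggered scheduler keeps queueing jobs,
  # at most MAX_RUNNING_JOBS per pass, while work remains.
  def run
    MAX_RUNNING_JOBS.times { perform_work } while remaining_work?
    @jobs_run
  end
end

now = Time.utc(2024, 1, 1)
subs = [
  Subscription.new(1, nil, false),                    # never refreshed
  Subscription.new(2, now - (2 * 24 * 60 * 60), false), # refreshed 2 days ago
  Subscription.new(3, now - (60 * 60), false)           # refreshed 1 hour ago
]

sim = RefreshSimulation.new(subs, now: now)
sim.run
```

In this run, subscriptions 1 and 2 are refreshed and subscription 3 is skipped because it was refreshed within the last 24 hours. Because the timestamp is updated before the refresh itself, a second job that looks for stale work no longer sees a subscription that has already been claimed.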

The MRs

Replacing the existing job involves adding 2 workers and a database change. To make it easier to review, the work has been split into the following MRs:

Title                               | Link              | Stage
Add the required DB column          | !103937 (merged)  | in review
Add the new LimitedCapacity worker  | !104099 (merged)  | blocked
Add the scheduler worker            | !104705 (closed)  | 👈🏽 you are here

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Vijay Hawoldar
