CI Minutes Cost Factors: Test on Production

changed milestone to %13.0

added [Deprecated] Category:Runner, devops::verify, group::cloud connector, and workflow::in dev labels

changed the description

marked the checklist item Finalize the test on Staging: After !30164 (merged) is deployed to Staging, retest the flags as described in #214646 (closed) as completed

assigned to @ayufan

@ayufan Kamil, could you briefly review this?
It's quite similar to how we tested on staging, though.

marked the checklist item !30164 (merged) should be deployed to production as completed

@ayufan, could you please highlight which action points Quality needs to take to help monitor and facilitate testing? Does this apply to the new projects in namespace N?

cc @jo_shih since this is related to CI and minutes accounting.

@meks
After our conversation, my understanding is that @ayufan suggested using https://gitlab.com/gitlab-org as our test namespace?

Initially, I assumed we would create a small namespace with public, private, and internal projects in it, and check that after enabling the feature flag for that group, all three kinds of projects start counting towards the quota.

@alipniagov

I believe we should do a dry run on our small group first, but we need to be careful with enabling ci_minutes_enforce_quota_for_public_projects, as this is a system-wide setting and will affect all projects. I hope we can then leave ci_minutes_enforce_quota_for_public_projects enabled for everyone, as it should effectively be a no-op.

Then, I'd like us to enable accounting for our group and start monitoring from there.

Ideally, I would like us to enable both feature flags for everyone for 1h, and set the final cost factors, to ensure that every aspect works at scale. The ideal moment would be the end of the month, as quotas are reset on the first day of a new month, but we clearly won't have that. Maybe we simply set the cost factor on shared runners to 0.01, which makes us consume 1s of quota for each 100s run. This should be enough for us to simulate this.
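The cost-factor arithmetic above can be sketched as follows. This is a minimal illustration, assuming quota usage is the build duration multiplied by the runner's public or private cost factor; the method and keyword names are hypothetical, not GitLab's API.

```ruby
# Hypothetical sketch: seconds charged against a namespace's quota for one
# build, assuming usage = duration * the runner's cost factor for the
# project's visibility (public vs. private, as configured per runner).
def charged_seconds(duration_seconds, public_project:, public_factor:, private_factor:)
  factor = public_project ? public_factor : private_factor
  duration_seconds * factor
end

# Public cost factor of 0.01: a 100-second public job uses ~1 second of quota.
puts charged_seconds(100, public_project: true, public_factor: 0.01, private_factor: 1.0)

# Public cost factor of 0 (public=0/private=1): public jobs use no quota at all.
puts charged_seconds(100, public_project: true, public_factor: 0.0, private_factor: 1.0)
```

A small non-zero factor exercises the accounting path at full volume while charging very little, which is what makes it usable as a simulation.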

Alternatively, we could write an SQL query that iterates over projects marked as public and "deducts" the minutes they spent during that period.

Let's figure this out next week, after doing the small-scale test.

Thanks. Using https://gitlab.com/gitlab-org makes sense as long as it's easy to track. Quality can help coordinate the monitoring of the CI runs with the SETs on the devops::verify side, since they should know how the runner works and can monitor the test runs in our MR pipelines. After reading the plan, I think we may need some SREs involved too? Quality does not have direct access to the runners, only from a user perspective.

Depending on the previous step, set public_projects_minutes_cost_factor to 1.0 in the settings of the appropriate (or all) shared runners. This would require admin/SRE access.

@jo_shih, is this something Dan or Tiffany could help the team monitor and validate when this is tested experimentally on production?

@meks - Per a Slack thread with @kwiebers, it looks like Engineering Productivity will assist with monitoring the pipelines.

@kwiebers - Should I pull in Tiffany or Zeff to lend some extra eyes?

@ayufan

Just to sum up your suggestion:

1) enable ci_minutes_track_for_public_projects (FF1) for our small group
2) check that it accounts minutes (could be done via the UI)
3) update all relevant runners with 0.01 as the public cost factor – will need SRE help
4) we may want to double-check that we accumulate public project minutes for our small group only (I suggested using another small group, with only public projects, for clarity)

up to this point, we don't affect any "global" data

Then, for 1h, you suggest that we:
5) enable ci_minutes_track_for_public_projects (FF1) for gitlab-org
6) enable ci_minutes_enforce_quota_for_public_projects (FF2) for all

Good point from @ayufan that ci_minutes_enforce_quota_for_public_projects (FF2) will affect everyone, even if we enable it only briefly: with this flag enabled, every group that has hit the limit by running its private projects will not be able to run CI for its public projects. This will be visible to some of our customers.

We monitor gitlab-org accumulation and performance
We don't expect to hit the limit on our group, so to check FF2 we should either find a group that is at the limit or prepare one: runners should not pick up jobs for that group's public projects.
We disable FF1 for gitlab-org and evaluate the results
Will FF2 stay enabled? @ayufan, I still feel that we should be able to disable it if needed, since it is not a full "no-op".

@kencjohnston @ayufan
Please correct me if I am wrong:
Unless other announcements have been made, enabling ci_minutes_enforce_quota_for_public_projects (FF2) for everyone (the only way we can enable it) would go against what has been announced so far.
From our pricing page:

For users signing up after March 18, 2020, the minutes limit applies to all projects.  
For users who signed up prior to that, the minutes limit only applies to private projects.  
Public projects include projects set to “Internal” as they are visible to everyone on GitLab.com.

I would like to get more clarity on this, as it could also block the testing described above.
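To make the quoted pricing rule concrete, here is a hypothetical illustration. It is not GitLab's implementation: per this thread, date-based accumulation is not being implemented or tested. It assumes "after March 18, 2020" means strictly after that date.

```ruby
require "date"

# Hypothetical illustration of the pricing-page rule quoted above.
# Assumption: "after March 18, 2020" is read as strictly after that date.
PUBLIC_QUOTA_CUTOFF = Date.new(2020, 3, 18)

# visibility is one of :private, :internal, :public
def minutes_limit_applies?(signed_up_on, visibility)
  # Private projects always count towards the limit.
  return true if visibility == :private

  # Internal projects are treated like public ones (visible to everyone on
  # GitLab.com); they only count for users who signed up after the cutoff.
  signed_up_on > PUBLIC_QUOTA_CUTOFF
end

puts minutes_limit_applies?(Date.new(2019, 6, 1), :public)   # => false (existing user)
puts minutes_limit_applies?(Date.new(2020, 4, 1), :internal) # => true (new user)
```

FF2 applies enforcement to every namespace regardless of sign-up date, which is why it conflicts with the second rule.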

Will FF2 stay enabled? @ayufan, I still feel that we should be able to disable it if needed, since it is not a full "no-op".

Yes, we should disable it.

Good point from @ayufan that ci_minutes_enforce_quota_for_public_projects (FF2) will affect everyone, even if we enable it only briefly: with this flag enabled, every group that has hit the limit by running its private projects will not be able to run CI for its public projects. This will be visible to some of our customers.

We can find all namespaces that are over the limit today and check whether they still run public projects. Maybe we find none, in which case we would accept the small risk that no one meets these criteria during this short period (1h).

I would assume we are fine, but we can check how many namespaces would be affected; it is likely a very small number.

Overall, proper messaging would make this easier for us, as we would switch into a feature-rollout mode that adheres to what has been announced.

Anyway, these are the key points I want to validate:

Accumulation works at scale (we need to enable it for our group to check this; that should be enough to measure the impact)
Queueing works at scale (we need to enable it globally to test it effectively; a 1h window in the early EU morning should be enough, while still making sure we do not break anyone)

@alipniagov @ayufan, following along: if it makes sense to move some of the testing to after the announcement, as part of the rollout plan, we should do so. However, I think you might have identified a workaround for this specific task.

@kwiebers - Should I pull in Tiffany or Zeff to lend some extra eyes?

@jo_shih, I'd leave that decision up to you. As I understand it, the plan is short-term and focused on confirming that performance holds at a larger scale than has been tested so far.

I'm not sure whether the accumulation of minutes is being validated, or has been validated, against the rules described in #215642 (comment 337361668):

For users signing up after March 18, 2020, the minutes limit applies to all projects.  
For users who signed up prior to that, the minutes limit only applies to private projects.  
Public projects include projects set to “Internal” as they are visible to everyone on GitLab.com.

@kwiebers

I'm not sure whether the accumulation of minutes is being validated, or has been validated, against the rules described

We are going to test the "final" behavior, which is going to be announced soon: all public projects will be accumulated.
Right now, we don't accumulate public projects at all.
We don't implement/test accumulation based on the project creation date.

I've put this on @treagitlab 's radar. Please feel free to pull her in as needed.

I ran this query to check the number of namespaces that are over the limit today, without taking into account whether they run public projects:

SELECT id, used_seconds, limit_seconds + extra_seconds as upper_limit_seconds FROM (
  SELECT
    namespaces.id,
    COALESCE(namespace_statistics.shared_runners_seconds, 0) AS used_seconds,
    namespaces.limit_seconds,
    namespaces.extra_seconds
  FROM (
    SELECT
      id,
      COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) * 60 AS limit_seconds,
      COALESCE(namespaces.extra_shared_runners_minutes_limit, 0) * 60 AS extra_seconds
    FROM namespaces
    WHERE shared_runners_minutes_limit > 0 OR shared_runners_minutes_limit IS NULL
  ) AS namespaces
  INNER JOIN namespace_statistics
  ON namespace_statistics.namespace_id = namespaces.id
) AS namespaces
WHERE
  used_seconds > limit_seconds + extra_seconds
ORDER BY id ASC
;

Results ~~removed~~
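The quota arithmetic in the query above, which the queueing query's quota check also uses, can be sketched in plain Ruby. This is a minimal illustration with hypothetical method names. A NULL limit falls back to the 2,000-minute default, which is why 120,000-second upper limits appear in the results in this thread.

```ruby
# Minimal sketch of the quota arithmetic in the query above. Method names
# are hypothetical.
#   - a NULL (nil) shared_runners_minutes_limit falls back to 2,000 minutes;
#   - a limit of 0 means "unlimited" and is excluded from the over-limit check;
#   - extra_shared_runners_minutes_limit is added on top of the base limit.
DEFAULT_LIMIT_MINUTES = 2000

def upper_limit_seconds(limit_minutes, extra_minutes)
  ((limit_minutes || DEFAULT_LIMIT_MINUTES) + (extra_minutes || 0)) * 60
end

def over_limit?(used_seconds, limit_minutes, extra_minutes)
  return false if limit_minutes&.zero? # 0 = unlimited
  used_seconds > upper_limit_seconds(limit_minutes, extra_minutes)
end

puts upper_limit_seconds(nil, 0)     # => 120000 (default quota)
puts over_limit?(122_446, nil, 0)    # => true
puts over_limit?(110_874 * 60, 0, 0) # => false (unlimited)
```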

Wow, this is great. The number will grow over time, so the earlier we run the test, the better.

Additional query:

SELECT
  namespaces.id,
  namespaces.used_seconds,
  namespaces.limit_seconds + namespaces.extra_seconds as upper_limit_seconds,
  gitlab_subscriptions.trial_ends_on,
  plans.name AS plan_name
FROM (
  SELECT
    namespaces.id,
    COALESCE(namespace_statistics.shared_runners_seconds, 0) AS used_seconds,
    namespaces.limit_seconds,
    namespaces.extra_seconds
  FROM (
    SELECT
      id,
      COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) * 60 AS limit_seconds,
      COALESCE(namespaces.extra_shared_runners_minutes_limit, 0) * 60 AS extra_seconds
    FROM namespaces
    WHERE shared_runners_minutes_limit > 0 OR shared_runners_minutes_limit IS NULL
  ) AS namespaces
  INNER JOIN namespace_statistics
  ON namespace_statistics.namespace_id = namespaces.id
) AS namespaces
INNER JOIN gitlab_subscriptions
ON gitlab_subscriptions.namespace_id = namespaces.id
INNER JOIN plans
ON gitlab_subscriptions.hosted_plan_id = plans.id
WHERE
  used_seconds > limit_seconds + extra_seconds
ORDER BY id ASC
;

Results ~~removed~~

Another try:

SELECT
  namespaces.id,
  namespaces.used_seconds,
  namespaces.limit_seconds + namespaces.extra_seconds as upper_limit_seconds,
  gitlab_subscriptions.trial_ends_on,
  plans.name AS plan_name,
  EXISTS (
    WITH RECURSIVE base_and_descendants AS (
      ( SELECT nested.id FROM namespaces AS nested WHERE nested.parent_id = namespaces.id )
      UNION
      ( SELECT nested.id FROM namespaces AS nested, base_and_descendants WHERE nested.parent_id = base_and_descendants.id )
    )
    SELECT 1 FROM projects
    JOIN base_and_descendants ON projects.namespace_id = base_and_descendants.id
    WHERE projects.visibility_level = 20
  ) AS has_public_projects
FROM (
  SELECT
    namespaces.id,
    COALESCE(namespace_statistics.shared_runners_seconds, 0) AS used_seconds,
    namespaces.limit_seconds,
    namespaces.extra_seconds
  FROM (
    SELECT
      id,
      COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) * 60 AS limit_seconds,
      COALESCE(namespaces.extra_shared_runners_minutes_limit, 0) * 60 AS extra_seconds
    FROM namespaces
    WHERE shared_runners_minutes_limit > 0 OR shared_runners_minutes_limit IS NULL
  ) AS namespaces
  INNER JOIN namespace_statistics
  ON namespace_statistics.namespace_id = namespaces.id
) AS namespaces
INNER JOIN gitlab_subscriptions
ON gitlab_subscriptions.namespace_id = namespaces.id
INNER JOIN plans
ON gitlab_subscriptions.hosted_plan_id = plans.id
WHERE
  used_seconds > limit_seconds + extra_seconds
ORDER BY id ASC
;

Results ~~removed~~
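The `WITH RECURSIVE base_and_descendants` subquery above walks the namespace tree through `parent_id`. The sketch below is a hypothetical in-memory equivalent, with hashes standing in for rows. Note that, as written, the CTE's anchor selects the children of the top-level namespace (`nested.parent_id = namespaces.id`) rather than the namespace itself. As a result, projects that live directly in the top-level namespace do not appear to be checked. The sketch exposes this as an `include_root` option; `include_root: false` mirrors the query as written.

```ruby
require "set"

PUBLIC_VISIBILITY = 20 # projects.visibility_level for public projects

# Collect the ids of the namespaces under root_id by following parent_id,
# like the recursive CTE. include_root: false matches the query as written,
# whose anchor starts from the root's children.
def namespace_tree_ids(namespaces, root_id, include_root: true)
  ids = include_root ? Set[root_id] : Set.new
  frontier = [root_id]
  until frontier.empty?
    frontier = namespaces
      .select { |ns| frontier.include?(ns[:parent_id]) }
      .map { |ns| ns[:id] }
      .reject { |id| ids.include?(id) } # UNION de-duplicates in SQL
    ids.merge(frontier)
  end
  ids
end

def has_public_projects?(namespaces, projects, root_id, include_root: true)
  ids = namespace_tree_ids(namespaces, root_id, include_root: include_root)
  projects.any? { |p| ids.include?(p[:namespace_id]) && p[:visibility_level] == PUBLIC_VISIBILITY }
end

namespaces = [
  { id: 1, parent_id: nil }, # top-level group
  { id: 2, parent_id: 1 },   # subgroup
  { id: 3, parent_id: 2 }    # nested subgroup
]
projects = [{ namespace_id: 1, visibility_level: PUBLIC_VISIBILITY }]

puts has_public_projects?(namespaces, projects, 1)                      # => true
puts has_public_projects?(namespaces, projects, 1, include_root: false) # => false
```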

Cool! Thanks 🙂

Another try, looking into the future. This looks for namespaces that have used more than 50% of their quota and do have PUBLIC projects:

SELECT * FROM (
  SELECT
    namespaces.id,
    namespaces.used_seconds,
    namespaces.limit_seconds + namespaces.extra_seconds as upper_limit_seconds,
    gitlab_subscriptions.trial_ends_on,
    plans.name AS plan_name,
    EXISTS (
      WITH RECURSIVE base_and_descendants AS (
        ( SELECT nested.id FROM namespaces AS nested WHERE nested.parent_id = namespaces.id )
        UNION
        ( SELECT nested.id FROM namespaces AS nested, base_and_descendants WHERE nested.parent_id = base_and_descendants.id )
      )
      SELECT 1 FROM projects
      JOIN base_and_descendants ON projects.namespace_id = base_and_descendants.id
      WHERE projects.visibility_level = 20
    ) AS has_public_projects
  FROM (
    SELECT
      namespaces.id,
      COALESCE(namespace_statistics.shared_runners_seconds, 0) AS used_seconds,
      namespaces.limit_seconds,
      namespaces.extra_seconds
    FROM (
      SELECT
        id,
        COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) * 60 AS limit_seconds,
        COALESCE(namespaces.extra_shared_runners_minutes_limit, 0) * 60 AS extra_seconds
      FROM namespaces
      WHERE shared_runners_minutes_limit > 0 OR shared_runners_minutes_limit IS NULL
    ) AS namespaces
    INNER JOIN namespace_statistics
    ON namespace_statistics.namespace_id = namespaces.id
  ) AS namespaces
  INNER JOIN gitlab_subscriptions
  ON gitlab_subscriptions.namespace_id = namespaces.id
  INNER JOIN plans
  ON gitlab_subscriptions.hosted_plan_id = plans.id
  WHERE
    used_seconds > (limit_seconds + extra_seconds) / 2
) AS namespaces
WHERE has_public_projects IS TRUE
ORDER BY id ASC
;

   id    | used_seconds | upper_limit_seconds | trial_ends_on | plan_name | has_public_projects 
---------+--------------+---------------------+---------------+-----------+---------------------
 2928513 |       147885 |              240000 |               | free      | t
 3221093 |        90830 |              120000 | 2019-07-21    | free      | t
 6298769 |        80917 |              120000 |               | free      | t
(3 rows)
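As a quick sanity check on the 50% filter, the sketch below re-applies it to the three rows returned above and shows how much quota each namespace has left. It uses only the numbers printed in the output; the method name is hypothetical.

```ruby
# The three rows from the psql output above: [id, used_seconds, upper_limit_seconds].
rows = [
  [2_928_513, 147_885, 240_000],
  [3_221_093, 90_830, 120_000],
  [6_298_769, 80_917, 120_000]
]

# Same filter as `used_seconds > (limit_seconds + extra_seconds) / 2`:
# strictly more than half the quota, using integer division as in SQL.
def over_half_quota?(used_seconds, upper_limit_seconds)
  used_seconds > upper_limit_seconds / 2
end

rows.each do |id, used, upper|
  share = (100.0 * used / upper).round(1)
  remaining_minutes = (upper - used) / 60
  puts "#{id}: #{share}% used, ~#{remaining_minutes} min left, over half: #{over_half_quota?(used, upper)}"
end
```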

This is also promising 👍

These are the results from now:

   id    | used_seconds | upper_limit_seconds | trial_ends_on | plan_name | has_public_projects 
---------+--------------+---------------------+---------------+-----------+---------------------
 5166621 |       122446 |              120000 |               | bronze    | t

changed the description

marked the checklist item Infrastructure Track is ready as completed

marked the checklist item Make sure that the date & time work for everyone as completed

changed the description

I'm testing some SQL queries.

`public=0/private=1`

This is the query that is used today by Shared Runners Managers and GitLab Shared Runners Managers:

```sql
explain analyze
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
  SELECT "ci_builds"."project_id", count(*) AS running_builds
  FROM "ci_builds"
  WHERE "ci_builds"."type" = 'Ci::Build'
    AND ("ci_builds"."status" IN ('running'))
    AND "ci_builds"."runner_id" IN (
      SELECT "ci_runners"."id" FROM "ci_runners" WHERE "ci_runners"."runner_type" = 1
    )
  GROUP BY "ci_builds"."project_id"
) AS project_builds ON ci_builds.project_id=project_builds.project_id
WHERE "ci_builds"."type" = 'Ci::Build'
  AND ("ci_builds"."status" IN ('pending'))
  AND "ci_builds"."runner_id" IS NULL
  AND "projects"."shared_runners_enabled" = TRUE
  AND "projects"."pending_delete" = FALSE
  AND (project_features.builds_access_level IS NULL or project_features.builds_access_level > 0)
  AND (projects.visibility_level=20 OR (
    WITH RECURSIVE "base_and_ancestors" AS (
      (SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id))
      UNION
      (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")
    )
    SELECT 1
    FROM "base_and_ancestors" AS "namespaces"
    LEFT JOIN namespace_statistics ON namespace_statistics.namespace_id = namespaces.id
    WHERE "namespaces"."parent_id" IS NULL
      AND (
        COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) = 0
        OR COALESCE(namespace_statistics.shared_runners_seconds, 0) < COALESCE(
          (namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)),
          (2000 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)),
          0
        ) * 60
      )
  )=1)
  AND (
    NOT EXISTS (
      SELECT 1 FROM "taggings"
      WHERE "taggings"."taggable_type" = 'CommitStatus'
        AND "taggings"."context" = 'tags'
        AND (taggable_id = ci_builds.id)
        AND 1=1
    )
  )
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
```

`public=1/private=1`

This is the query that will be used by Shared Runners Managers after the change:

```sql
explain analyze
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
  SELECT "ci_builds"."project_id", count(*) AS running_builds
  FROM "ci_builds"
  WHERE "ci_builds"."type" = 'Ci::Build'
    AND ("ci_builds"."status" IN ('running'))
    AND "ci_builds"."runner_id" IN (
      SELECT "ci_runners"."id" FROM "ci_runners" WHERE "ci_runners"."runner_type" = 1
    )
  GROUP BY "ci_builds"."project_id"
) AS project_builds ON ci_builds.project_id=project_builds.project_id
WHERE "ci_builds"."type" = 'Ci::Build'
  AND ("ci_builds"."status" IN ('pending'))
  AND "ci_builds"."runner_id" IS NULL
  AND "projects"."shared_runners_enabled" = TRUE
  AND "projects"."pending_delete" = FALSE
  AND (project_features.builds_access_level IS NULL or project_features.builds_access_level > 0)
  AND ((
    WITH RECURSIVE "base_and_ancestors" AS (
      (SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id))
      UNION
      (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")
    )
    SELECT 1
    FROM "base_and_ancestors" AS "namespaces"
    LEFT JOIN namespace_statistics ON namespace_statistics.namespace_id = namespaces.id
    WHERE "namespaces"."parent_id" IS NULL
      AND (
        COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) = 0
        OR COALESCE(namespace_statistics.shared_runners_seconds, 0) < COALESCE(
          (namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)),
          (2000 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)),
          0
        ) * 60
      )
  )=1)
  AND (
    NOT EXISTS (
      SELECT 1 FROM "taggings"
      WHERE "taggings"."taggable_type" = 'CommitStatus'
        AND "taggings"."context" = 'tags'
        AND (taggable_id = ci_builds.id)
        AND 1=1
    )
  )
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
```
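The only difference between the two queueing queries is the `projects.visibility_level=20 OR` exemption. The hypothetical sketch below restates that difference as a predicate. `within_quota` stands for the shared recursive namespace_statistics check. The keyword `enforce_quota_for_public` is an assumption: it models what the thread suggests ci_minutes_enforce_quota_for_public_projects (FF2) switches between.

```ruby
PUBLIC_VISIBILITY = 20 # projects.visibility_level for public projects

# Hypothetical sketch of the difference between the two queueing queries
# above. `within_quota` stands for the shared recursive check on the
# top-level namespace's namespace_statistics (unlimited, or used < limit).
# Assumption: the public=1/private=1 variant is what enabling FF2 selects.
def eligible_for_shared_runner?(visibility_level, within_quota, enforce_quota_for_public:)
  if enforce_quota_for_public
    # public=1/private=1: every project must be within its quota.
    within_quota
  else
    # public=0/private=1: public projects bypass the quota check.
    visibility_level == PUBLIC_VISIBILITY || within_quota
  end
end

# A public project in a namespace that used up its quota on private projects:
puts eligible_for_shared_runner?(PUBLIC_VISIBILITY, false, enforce_quota_for_public: false) # => true
puts eligible_for_shared_runner?(PUBLIC_VISIBILITY, false, enforce_quota_for_public: true)  # => false
```

The second call is the case raised earlier in this thread: a group that hit its limit through private projects can no longer run CI for its public projects while FF2 is enabled.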

Staging:

public=0/private=1: https://explain.depesz.com/s/VVAC: plan 3.79ms, exec 0.84ms
public=1/private=1: https://explain.depesz.com/s/lMum: plan 3.42ms, exec 0.81ms

Production:

public=0/private=1: https://explain.depesz.com/s/tq8g: plan 3.32ms, exec 163.86ms
public=1/private=1: https://explain.depesz.com/s/WOrC7: plan 3.42ms, exec 141.34ms

I re-ran public=1/private=1 multiple times and saw variability in execution time, between 140 and 170ms, similar to public=0/private=1.

This means that our change does not appear to make a noticeable difference to the execution plans.

The timings are pretty bad, but we do not expect our change to affect them significantly.

This shows a growing need to optimise the CI queueing mechanism; one option could be introducing the CI/CD Daemon.

Thanks for checking this, Kamil!

As per #216977 (closed), I've set the public and private cost factors to 1.0 for the following runners as of 17:10 UTC:

gitlab-docker-shared-runners-manager-central-01
gitlab-docker-shared-runners-manager-01
gitlab-docker-shared-runners-manager-02
gitlab-docker-shared-runners-manager-03
gitlab-shared-runners-manager-3.gitlab.com
gitlab-shared-runners-manager-4.gitlab.com
gitlab-shared-runners-manager-5.gitlab.com
gitlab-shared-runners-manager-6.gitlab.com
shared-runners-manager-3.gitlab.com
shared-runners-manager-4.gitlab.com
shared-runners-manager-5.gitlab.com
shared-runners-manager-6.gitlab.com

mentioned in issue gitlab-com/www-gitlab-com#7286 (closed)

marked the checklist item Create a group, which has private, public and internal projects with the pipelines as completed

marked the checklist item Enable ci_minutes_track_for_public_projects (FF1) for this group as completed

marked the checklist item Confirm that we started accumulating minutes for public projects after running the pipeline as completed

marked the checklist item Check the accumulation & queueing nodes metrics (we'll monitor them on the gitlab-org test) as completed

marked the checklist item Update shared runners with 1.0 as the public cost factor: #216977 (closed) as completed

marked the checklist item Enable ci_minutes_enforce_quota_for_public_projects (FF2). This could only be enabled globally, unlike the FF1 which is available per group as completed

marked the checklist item Enable ci_minutes_track_for_public_projects (FF1) for gitlab-org as completed

I updated the gitlab-org group to have enough minutes for our tests.

Namespace.find_by_full_path("gitlab-org").update(shared_runners_minutes_limit: 100000)

It seems that on gitlab-org, between enabling the feature flag and disabling it (47 minutes), we consumed 75k minutes:

Namespace.find_by_full_path("gitlab-org").reset.namespace_statistics.shared_runners_minutes
=> 75753

It seems that disabling the feature flag did not take effect immediately. It took a significant amount of time: we were still accounting minutes afterwards, which resulted in some pipelines not being picked up:

Namespace.find_by_full_path("gitlab-org").namespace_statistics.shared_runners_seconds/60
=> 110874

This is why I raised the limit to 1_000_000:

Namespace.find_by_full_path("gitlab-org").update(shared_runners_minutes_limit: 1_000_000)
=> true

Thanks!

I wonder why it's not immediate – some caching?

Yes, we have memory-cache, redis-cache, etc.
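A minimal sketch of why a flag change can lag: if each process caches the flag value for a time-to-live, readers keep seeing the old value until their entry expires. Stacked caches, such as an in-memory cache in front of Redis, can add up. The class and the 60-second TTL are hypothetical; they are not GitLab's feature-flag implementation.

```ruby
# Hypothetical TTL cache in front of a flag store (a Hash here). Time is
# passed in explicitly so the behaviour is easy to follow.
class CachedFlag
  def initialize(store, ttl_seconds)
    @store = store
    @ttl = ttl_seconds
    @cached_value = nil
    @cached_at = nil
  end

  def enabled?(now)
    # Refresh from the store only when there is no entry or it has expired.
    if @cached_at.nil? || now - @cached_at >= @ttl
      @cached_value = @store[:enabled]
      @cached_at = now
    end
    @cached_value
  end
end

store = { enabled: true }
flag = CachedFlag.new(store, 60)
flag.enabled?(0)       # => true, read from the store and cached
store[:enabled] = false
flag.enabled?(30)      # => true, stale: the cached entry has not expired yet
flag.enabled?(60)      # => false, the entry expired and was refreshed
```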

It seems that on gitlab-org, between enabling the feature flag and disabling it (47 minutes), we consumed 75k minutes:

@ayufan, are the 75k minutes supposed to cover only builds run on SRM (I assume yes, because of shared_runners_minutes)? This is higher than I would have expected and makes me doubt the Sisense reports I have been looking at: https://app.periscopedata.com/app/gitlab/564156/Engineering-Productivity---Pipeline?widget=8479582&udv=895971 🤔

@kwiebers @alipniagov

About the growing minutes usage (not the 75k): I got it wrong. The extra usage comes from private projects, such as security pipelines.

I did change the quota from unlimited to limited: first 100k, then 1M minutes. We should change it back to unlimited.

I reverted to the previous setting (a limit of 0, which means unlimited) to make the quota message go away:

Namespace.find_by_full_path("gitlab-org").update(shared_runners_minutes_limit: 0)
=> true
Namespace.find_by_full_path("gitlab-org").shared_runners_minutes_limit
=> 0

👍

Dashboards:

https://log.gprd.gitlab.net/goto/b1cb82f69fa119830bc6e98020c9dbc6: api/jobs/request
https://log.gprd.gitlab.net/goto/458f2e28ecf65bd848a3a997e9085860: BuildFinishedWorker

Thanks!

I looked at the results of our test. Look at the time from 08:00 to around 09:00, the window between enabling the FF and disabling it.

`api/jobs/request`

We don't see a noticeable difference in execution time or number of requests.

Unfortunately, for an unknown reason, we don't have DB duration data for these requests.

`BuildFinishedWorker` for `gitlab-org`

We see an increase in durations during that period. However, this is expected: all builds were now being accounted, so we had to execute additional SQL queries, which increased the DB duration.

Summary

I consider the tests successful. I believe the performance penalty we see from accounting a high-volume group such as gitlab-org is acceptable, especially considering that during that period we consumed 75k minutes on shared runners alone.

marked the checklist item Disable ci_minutes_enforce_quota_for_public_projects globally as completed

marked the checklist item Disable ci_minutes_track_for_public_projects for all projects as completed

marked the checklist item Confirm that we started accumulating minutes for public projects as completed

marked the checklist item Check the performance of the accumulation at scale – are there any visible patterns? as completed

marked the checklist item Monitor accumulation at scale: The UpdateBuildMinutesService timings from Sidekiq nodes as completed

marked the checklist item Monitor Queueing at scale: The RegisterJobService timings from API nodes as completed

changed the description

@kencjohnston @kwiebers @craig-gomes @alipniagov

I'm happy with the above results. I believe they fall within my expectations.

1. Building confidence

I think we could re-run our test now for a longer period, like 4h, doing exactly the same as we did today
Ensure that gitlab-org has a high enough quota assigned; we probably need something like 10M to be on the safe side
Extend our test to gitlab-com and configure quota accordingly
Enable both feature flags as we did it today

2. Before communicated switch-over date:

Configure shared runners (general purpose, windows, gitlab and gitlab docker) to be public=0/private=1 for the time being
Ensure that gitlab-org and gitlab-com have a high enough quota assigned
Enable the system-wide ci_minutes_enforce_quota_for_public_projects feature flag by default
Configure ci_minutes_track_for_public_projects to be enabled for all.
We do not expect any impact at this point

3. After communicated switch-over date:

Disable ci_minutes_track_for_public_projects as we would start accumulating public projects
Change general purpose shared runners to be public=1/private=1
We start enforcing limits on all projects: namespaces that run out of quota will not be able to run CI for either public or private projects
We start a gradual rollout of ci_minutes_track_for_public_projects => probably 1%, 10%, 50%, 100%, one step per day
The gradual rollout will allow us to start accounting more and more projects, but enforce limits for all
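The per-day percentage rollout described above relies on enabled namespaces staying enabled as the percentage grows. One common way to get that property is to hash each actor into a fixed bucket. The sketch below is hypothetical; the bucketing scheme is an assumption, not necessarily how GitLab's feature-flag backend assigns actors.

```ruby
require "zlib"

FLAG = "ci_minutes_track_for_public_projects"

# Hypothetical deterministic bucketing: each namespace always lands in the
# same bucket (0..99) for a given flag, so raising the percentage only adds
# namespaces and never drops one that was already being tracked.
def bucket(flag_name, namespace_id)
  Zlib.crc32("#{flag_name}:#{namespace_id}") % 100
end

def tracked?(flag_name, namespace_id, percentage)
  bucket(flag_name, namespace_id) < percentage
end

[1, 10, 50, 100].each do |percentage|
  count = (1..10_000).count { |id| tracked?(FLAG, id, percentage) }
  puts "#{percentage}%: #{count} of 10000 sample namespaces tracked"
end
```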

This is amazing!

Thanks for outlining the plan, @ayufan.

A couple of questions:

Building confidence: "Extend our test to gitlab-com and configure quota accordingly". Do you mean https://gitlab.com/gitlab-com (another group) or the whole service? I believe the former, but just in case. UPD: ah, it is definitely the group, sorry for the noise 🙂
Building confidence: any suggestions about the timebox? Will EU working hours work for us?

marked the checklist item Check the metrics and evaluate the performance hit, if it is present. Dashboards: #215642 (comment 340752468) as completed

marked the checklist item Plan the optimizations, if needed as completed

changed the description

mentioned in merge request !31858 (merged)

changed milestone to %13.1

I think we could close it in its current iteration.

closed

added Deliverable, auto updated, and devops::systems labels and removed devops::verify label

added devops::data stores label and removed devops::systems label

CI Minutes Cost Factors: Test on Production

Timeline

Key points to validate

Plan

Prerequisites

Runners cost factors update

Warm-up: Testing with a small group (May 8)

Testing with `gitlab-org` (May 12, EU morning)

Cleanup & assessment

After May 12

1. Building confidence

2. Before communicated switch-over date:

3. After communicated switch-over date:
