Estimated date: Tue, 12.05
Estimated start time: around 7:00 GMT. The idea is to run the test in a less busy period (Asia has finished its working day, US/Europe have not started yet)
Estimated length: ~1 hour
Key points to validate
Accumulation works at scale (we need to enable it for gitlab-org to check this, this should be enough to measure the impact)
Queueing works at scale (we need to enable it globally to effectively test it; picking a 1h window in the early EU morning should be enough for us, and we would still be sure that we do not break anyone)
Overall logic correctness
Plan
Prerequisites
Finalize the test on Staging: After !30164 (merged) is deployed to Staging, retest the flags as described in #214646 (closed)
marked the checklist item Finalize the test on Staging: After !30164 (merged) is deployed to Staging, retest the flags as described in #214646 (closed) as completed
@ayufan can you please highlight which action points are needed for Quality to help monitor and facilitate testing? Is this in the new projects part of namespace N?
cc @jo_shih since this is related to CI and minutes accounting.
@meks
After our conversation, I think that @ayufan suggested that we will use https://gitlab.com/gitlab-org as our test namespace?
Initially, I assumed that we just create some small namespace, will have public/private/internal projects in it, and check that after enabling the feature flag for that group we'll start accounting all three kinds of projects into the quota.
I believe we should do a dry run on our small group first, but we need to be careful with enabling ci_minutes_enforce_quota_for_public_projects, as this will be a system-wide setting: it will affect all projects. I hope that we can then leave ci_minutes_enforce_quota_for_public_projects running for everyone, as this should effectively be a no-op.
Then, I hope that we enable accounting for our group and start to monitor from there.
Ideally, I would love us to try to enable both of these feature flags for everyone for a period of 1h, and set the final cost factors to ensure that all aspects work at scale. The ideal moment would be the end of the month, as the quotas are reset on the first day of a new month, but we clearly won't have that. Maybe we simply set the cost factor on shared runners to 0.01, which will make us consume 1s from the quota for each 100s run. This should be enough for us to simulate this.
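To make the 0.01 arithmetic concrete, here is a minimal sketch of how a cost factor scales a job's duration into quota consumption. The function name `billable_seconds` is hypothetical and only illustrates the multiplication described above; it is not taken from the GitLab codebase.

```python
from decimal import Decimal


def billable_seconds(duration_seconds, cost_factor):
    """Return the seconds deducted from the quota for one job.

    Hypothetical helper: assumes consumption is simply
    duration * cost_factor, as described in the comment above.
    Decimal is used so that factors like 0.01 stay exact.
    """
    return Decimal(str(duration_seconds)) * Decimal(str(cost_factor))


# A 100s run with a 0.01 cost factor consumes 1s of quota.
print(billable_seconds(100, "0.01"))
# With a cost factor of 1.0 the full duration is consumed.
print(billable_seconds(100, "1.0"))
```

With a factor this small, an hour-long test should barely move any namespace's accumulated usage, which is the point of using it for the simulation.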
Maybe we could find an SQL query that would iterate and "deduct" the minutes spent in that period for projects marked as public.
Let's figure out next week after doing small-scale test.
Thanks, using https://gitlab.com/gitlab-org makes sense as long as it's easy to track. We can help coordinate the monitoring of the CI runs with the SETs from the devopsverify side of Quality, since they should have some knowledge of how the runner works and can monitor the test runs in our MR pipelines. After reading the plan, I think we may need some SREs involved too? Quality does not have access to the runners offhand, only from a user perspective.
Depending on the previous, in the appropriate/all Shared Runners settings, set public_projects_minutes_cost_factor to 1.0. This would require admin/SRE access.
we enable ci_minutes_track_for_public_projects (FF1) for our small group
check that it accounts minutes (could be done via UI)
update all relevant runners with 0.01 as the public cost factor – will need SRE help
we may want to double-check that we accumulate public projects for our small group only (I suggested using another small group for clarity, one which would only have public projects)
up to this point, we don't affect any "global" data
Then you suggest, for 1h:
5) enable ci_minutes_track_for_public_projects (FF1) for gitlab-org
6) we enable ci_minutes_enforce_quota_for_public_projects (FF2) for all
Good point from @ayufan that ci_minutes_enforce_quota_for_public_projects (FF2) will affect everyone, even if we enable it only briefly: every group which hit the limit by running its private projects will not be able to run CI on its public projects with this flag enabled. It will be visible to some of our customers.
We monitor gitlab-org accumulation and performance
We don't expect to hit the limit on our group. I think we should either find a group which is at the limit or prepare one, to check FF2: runners should not pick up jobs from this group's public projects.
We disable FF1 for gitlab-org and evaluate the results
The FF2 will stay enabled? @ayufan I still feel that we can disable it if needed, so it is not a full "no-op".
@kencjohnston @ayufan
Please correct me if I am wrong:
Unless some other announcements were made, enabling the ci_minutes_enforce_quota_for_public_projects (FF2) for everyone (the only way we could do it) will go against what was announced so far.
In our pricing page:
For users signing up after March 18, 2020, the minutes limit applies to all projects. For users who signed up prior to that, the minutes limit only applies to private projects. Public projects include projects set to “Internal” as they are visible to everyone on GitLab.com.
The FF2 will stay enabled? @ayufan I still feel that we can disable it if needed, so it is not a full "no-op".
Yes, we should disable it.
Good point from @ayufan that ci_minutes_enforce_quota_for_public_projects (FF2) will affect everyone, even if we enable it only briefly: every group which hit the limit by running its private projects will not be able to run CI on its public projects with this flag enabled. It will be visible to some of our customers.
We can find all namespaces that are over the limit today, and check if they still run public projects. Maybe we simply find none, in which case we would accept the small risk of someone meeting these criteria during this short period (1h).
I would assume that we might be fine; otherwise we just check how many will be affected, likely a very small number.
Overall, having proper messaging would make it easier for us, as we would switch into feature rollout mode and adhere to what is announced.
Anyway, these are my key points that I want to validate:
Accumulation works at scale (we need to enable it for our group to check this, this should be enough to measure the impact)
Queueing works at scale (we need to enable it globally to effectively test it; picking a 1h window in the early EU morning should be enough for us, and we would still be sure that we do not break anyone)
@alipniagov @ayufan - Following along - if it makes sense to move some of the testing to after the announcement as part of the roll-out plan, we should do so. However, I think you might have identified a workaround for this specific task.
@kwiebers - Should I pull in Tiffany or Zeff to lend some extra eyes?
@jo_shih - I'd leave it up to you on that decision. The plan as I understand it seems to be short term and focused that performance works at a larger scale than has been tested.
I'm not sure if the accumulation of minutes is validated or has been for the rules described at #215642 (comment 337361668):
For users signing up after March 18, 2020, the minutes limit applies to all projects. For users who signed up prior to that, the minutes limit only applies to private projects. Public projects include projects set to “Internal” as they are visible to everyone on GitLab.com.
I'm not sure if the accumulation of minutes is validated or has been for the rules described
We are going to test the "final" behavior, which is going to be announced soon: all public projects will be accumulated.
Right now, we don't accumulate public projects at all.
We don't implement/test accumulation based on the project creation date.
This is the query that is used today by Shared Runners Managers and GitLab Shared Runners Managers:
```sql
explain analyze
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
  SELECT "ci_builds"."project_id", count(*) AS running_builds
  FROM "ci_builds"
  WHERE "ci_builds"."type" = 'Ci::Build' AND
    ("ci_builds"."status" IN ('running')) AND
    "ci_builds"."runner_id" IN (
      SELECT "ci_runners"."id"
      FROM "ci_runners"
      WHERE "ci_runners"."runner_type" = 1
    )
  GROUP BY "ci_builds"."project_id"
) AS project_builds
  ON ci_builds.project_id=project_builds.project_id
WHERE "ci_builds"."type" = 'Ci::Build' AND
  ("ci_builds"."status" IN ('pending')) AND
  "ci_builds"."runner_id" IS NULL AND
  "projects"."shared_runners_enabled" = TRUE AND
  "projects"."pending_delete" = FALSE AND
  (project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
  (projects.visibility_level=20 OR (
    WITH RECURSIVE "base_and_ancestors" AS (
      (SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id))
      UNION
      (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")
    )
    SELECT 1 FROM "base_and_ancestors" AS "namespaces"
    LEFT JOIN namespace_statistics
      ON namespace_statistics.namespace_id = namespaces.id
    WHERE "namespaces"."parent_id" IS NULL AND (
      COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) = 0 OR
      COALESCE(namespace_statistics.shared_runners_seconds, 0) <
        COALESCE((namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), (2000 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), 0) * 60
    )
  )=1
  ) AND (
    NOT EXISTS (
      SELECT 1 FROM "taggings" WHERE "taggings"."taggable_type" = 'CommitStatus' AND "taggings"."context" = 'tags' AND (taggable_id = ci_builds.id) AND 1=1
    )
  )
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
```
public=1/private=1
This is the query that will be used by Shared Runners Managers after the change:
```sql
explain analyze
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
  SELECT "ci_builds"."project_id", count(*) AS running_builds
  FROM "ci_builds"
  WHERE "ci_builds"."type" = 'Ci::Build' AND
    ("ci_builds"."status" IN ('running')) AND
    "ci_builds"."runner_id" IN (
      SELECT "ci_runners"."id"
      FROM "ci_runners"
      WHERE "ci_runners"."runner_type" = 1
    )
  GROUP BY "ci_builds"."project_id"
) AS project_builds
  ON ci_builds.project_id=project_builds.project_id
WHERE "ci_builds"."type" = 'Ci::Build' AND
  ("ci_builds"."status" IN ('pending')) AND
  "ci_builds"."runner_id" IS NULL AND
  "projects"."shared_runners_enabled" = TRUE AND
  "projects"."pending_delete" = FALSE AND
  (project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
  ((
    WITH RECURSIVE "base_and_ancestors" AS (
      (SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id))
      UNION
      (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")
    )
    SELECT 1 FROM "base_and_ancestors" AS "namespaces"
    LEFT JOIN namespace_statistics
      ON namespace_statistics.namespace_id = namespaces.id
    WHERE "namespaces"."parent_id" IS NULL AND (
      COALESCE(namespaces.shared_runners_minutes_limit, 2000, 0) = 0 OR
      COALESCE(namespace_statistics.shared_runners_seconds, 0) <
        COALESCE((namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), (2000 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), 0) * 60
    )
  )=1
  ) AND (
    NOT EXISTS (
      SELECT 1 FROM "taggings" WHERE "taggings"."taggable_type" = 'CommitStatus' AND "taggings"."context" = 'tags' AND (taggable_id = ci_builds.id) AND 1=1
    )
  )
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
```
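The only difference between the two queries is the `projects.visibility_level=20 OR` shortcut, which is present today and dropped after the change. A rough sketch of the two eligibility rules, written outside SQL for readability, might look like the following. The class and function names are hypothetical; the 2000-minute default, the "limit of 0 means no limit" rule, and the `* 60` conversion are read off the `COALESCE` expressions in the queries above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LIMIT_MINUTES = 2000  # literal fallback used in the queries above
PUBLIC = 20  # projects.visibility_level value tested in the current query


@dataclass
class RootNamespace:
    """Hypothetical stand-in for the root namespace row and its statistics."""
    minutes_limit: Optional[int]   # shared_runners_minutes_limit
    extra_minutes: Optional[int]   # extra_shared_runners_minutes_limit
    used_seconds: Optional[int]    # namespace_statistics.shared_runners_seconds


def has_quota_left(ns: RootNamespace) -> bool:
    # COALESCE(limit, 2000, 0): fall back to the default when no limit is set.
    limit = ns.minutes_limit if ns.minutes_limit is not None else DEFAULT_LIMIT_MINUTES
    if limit == 0:
        return True  # a limit of 0 is treated as "no limit"
    extra = ns.extra_minutes or 0
    used = ns.used_seconds or 0
    return used < (limit + extra) * 60  # limits are in minutes, usage in seconds


def eligible_today(visibility_level: int, ns: RootNamespace) -> bool:
    # Current query: public projects skip the quota check entirely.
    return visibility_level == PUBLIC or has_quota_left(ns)


def eligible_after_change(visibility_level: int, ns: RootNamespace) -> bool:
    # New query: every project goes through the quota check.
    return has_quota_left(ns)
```

For a namespace that has already used its full quota, `eligible_today(20, ns)` is true while `eligible_after_change(20, ns)` is false, which is the customer-visible effect of FF2 discussed earlier in the thread.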
marked the checklist item Update shared runners with 1.0 as the public cost factor: #216977 (closed) as completed
Aleksei Lipniagov marked the checklist item Enable ci_minutes_enforce_quota_for_public_projects (FF2). This could only be enabled globally, unlike the FF1 which is available per group as completed
Aleksei Lipniagov marked the checklist item Enable ci_minutes_track_for_public_projects (FF1) for gitlab-org as completed
It seems that our feature flag removal was not immediate. It took a significant amount of time, as we were still accounting the minutes after it, which resulted in some pipelines not being picked:
I looked at results of our test. Look at time from 08:00 to around 09:00. There's a delay between enabling FF and disabling it.
api/jobs/request
We don't see a noticeable difference in execution time or quantity of requests.
Unfortunately, we don't have the DB duration for these requests, for an unknown reason.
BuildFinishedWorker for gitlab-org
We see an increase in durations for the given period. However, this is expected: all builds were now accounted, so we had to execute additional SQL queries during that period, which increased the DB duration.
Summary
I consider the results of the tests to be successful. I believe we see an acceptable performance penalty, caused by the need to account for a high-volume group such as gitlab-org. Take into account that during that period we consumed 75k minutes on shared runners alone.
Building confidence: "Extend our test to gitlab-com and configure quota accordingly". Do you mean the https://gitlab.com/gitlab-com (another group) or whole service? I believe first, but just in case. UPD: ah, it is definitely the group, sorry for the noise 🙂
Building confidence: any suggestions about the timebox? Will EU working hours work for us?
marked the checklist item Check the metrics and evaluate the performance hit, if it is present. Dashboards: #215642 (comment 340752468) as completed
Aleksei Lipniagov marked the checklist item Plan the optimizations, if needed as completed