retrieve-tests-metadata only from 2-hourly scheduled pipeline
Why?
Because we need to make sure the metadata we're looking for actually exists. Not every pipeline has those artifacts available, even on master.
Background
In !46424 (merged) we're retrieving metadata from:
```shell
test_metadata_job_id=$(scripts/api/get_job_id --project "${project_path}" -q "status=success" -q "ref=master" -q "username=gitlab-bot" -Q "scope=success" --job-name "update-tests-metadata")
```
This is likely to retrieve only from the 2-hourly scheduled pipeline, because it filters on the user @gitlab-bot, but it's not completely future-proof.
Problem
Some master pipelines will have `update-tests-metadata`, but they don't necessarily have all the artifacts we want, because some of the tasks are conditional and don't run in regular master pipelines. For example, `update_tests_mapping` may not run in the same pipeline as `update_tests_metadata`.
The job has this rule:
```yaml
.test-metadata:rules:update-tests-metadata:
  rules:
    - <<: *if-not-ee
      when: never
    - changes:
        - ".gitlab/ci/test-metadata.gitlab-ci.yml"
        - "scripts/rspec_helpers.sh"
    - <<: *if-dot-com-ee-schedule
```
A master pipeline can change `scripts/rspec_helpers.sh` and cause this job to run, but it does not run Crystalball for `update_tests_mapping`.
We do not want to use those artifacts. The alternative, grabbing artifacts from different pipelines, is quite complex and difficult to understand, so it might not be worth it. Moreover, we cannot filter jobs or pipelines by whether artifacts exist.
Potential solution
One way to do it would be to add a new job called `two-hourly-schedule`, which only runs on the 2-hourly scheduled pipeline. It can be a dummy job that does nothing itself and merely marks the pipeline. We can then iterate through pipelines and consider only those where this job is present.
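Sketched as CI configuration, reusing the `if-dot-com-ee-schedule` rule quoted above (the stage and script are placeholders; YAML anchors only resolve within the same file, so the real job would have to live alongside that rule definition):

```yaml
# Dummy marker job: does nothing useful, but its presence in a pipeline
# identifies that pipeline as the 2-hourly scheduled one.
two-hourly-schedule:
  stage: sync                        # illustrative; any early stage works
  rules:
    - <<: *if-dot-com-ee-schedule    # the 2-hourly schedule rule from above
  script:
    - echo "marker job for the 2-hourly scheduled pipeline"
```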
Since we can iterate through pipelines and find which one has this job, it'll be much easier to narrow down the job we're looking for.
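Under that scheme, the lookup could look roughly like this. A sketch only: the endpoints are the public GitLab pipelines API, but the grep-based JSON handling is a crude stand-in for `jq`, and all function and variable names are illustrative:

```shell
# Return success if a jobs JSON payload (compact form) contains a job
# with the given name. Crude substring match; real code should use jq.
pipeline_has_job() {
  local jobs_json="$1" job_name="$2"
  printf '%s' "$jobs_json" | grep -q "\"name\":\"${job_name}\""
}

# Iterate recent successful master pipelines (newest first) and print the
# id of the first one that contains the dummy two-hourly-schedule job.
find_two_hourly_pipeline() {
  local project_id="$1"
  local api="https://gitlab.com/api/v4/projects/${project_id}"
  local ids id
  ids=$(curl -sf "${api}/pipelines?ref=master&status=success&per_page=20" |
    grep -o '"id":[0-9]*' | cut -d: -f2)
  for id in $ids; do
    if pipeline_has_job \
        "$(curl -sf "${api}/pipelines/${id}/jobs?per_page=100")" \
        "two-hourly-schedule"; then
      echo "$id"
      return 0
    fi
  done
  return 1
}
```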
Potential optimization
Since iterating through the pipelines can be slow when there are many pipelines we don't care about (for perspective: 7.88 seconds over 9 pipelines for me), we can consider adding another job to all pipelines that saves the pipeline id or job id of the last `update-tests-metadata` we're looking for.
This way, we narrow the lookup down from O(N) to O(1).
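The write side of that optimization could be sketched as a small helper that every master pipeline's `update-tests-metadata` job runs before uploading its artifacts. All names here are hypothetical, including the artifact file name:

```shell
# Pick which pipeline id to record: our own id if this pipeline is the
# 2-hourly scheduled one, otherwise the id found by the slower search.
resolve_metadata_pipeline_id() {
  local is_two_hourly="$1" own_id="$2" found_id="$3"
  if [ "$is_two_hourly" = "true" ]; then
    echo "$own_id"
  else
    echo "$found_id"
  fi
}

# Save the resolved id into a file the job would upload as an artifact,
# so later pipelines can read the pre-calculated id directly.
save_metadata_pipeline_id() {
  resolve_metadata_pipeline_id "$@" > "tests_metadata_pipeline_id.txt"
}
```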
The following discussion from !46424 (merged) should be addressed:
@godfat-gitlab started a discussion: Here's a wild idea. We add a new job which has a rule for `if-master-schedule-2-hourly`, just like `dont-interrupt-me`, which doesn't do anything, but is an indication for the scheduled 2-hourly pipeline. We can call it `two-hourly-schedule`. And we look for pipelines which have this job. This may be a bit slow as it needs to iterate through the pipelines to find which one has this job, though. It doesn't seem that we can filter with this.
Actually, I just realized we are already iterating, so maybe it won't be a big deal.
--
An even more wild idea... We can make `update-tests-metadata` in all master pipelines try to search for this 2-hourly pipeline, and save the pipeline id in an artifact, so we can always just grab the artifact to find the pre-calculated pipeline id which has `two-hourly-schedule`. If itself is the `two-hourly-schedule` pipeline, just save its own pipeline id into this artifact.
This is trying to offset the calculation from retrieving to uploading, which will avoid blocking tests on waiting for this search, but the overall time should be the same.
Perhaps some food for thought when we need to implement this.