In our benchmark test runs we saw execution times fluctuating between 5 and 19 minutes. One of our highest priorities should be stability and reliability, which is why we should put strong emphasis on fixing this.
In the linked pipeline, out of 18 jobs only two have outlying times of 19 minutes (1, 2) and one of 13 minutes (1). The rest are pretty consistently around 9 minutes.
Yes, however we also had runs where the fastest job was 5 minutes and most were between 6 and 7 minutes. Some runs seem more stable than others, but from my testing I would agree that about 80% now land between 7 and 9 minutes.
I would like to run these benchmarks continuously for a while and collect some more environmental information, such as the AMI, the instance ID, etc., to see whether performance differences correlate with any of it.
Jobs fail about 50% of the time. The failures don't seem to be correlated with the runner manager, but rather with the instance. So far I've seen three ~500 second builds and one ~700 second build; I need to see more jobs before I can correlate (or rule out) time with specific instances.
I'm running the pipeline with 12 builds in parallel every 30 minutes overnight, so I'll collect a bunch more jobs to look at tomorrow.
Here is my script. I'm getting some information from the pipeline JSON, but also some information from the job logs (e.g. the name of the instance).
```bash
#!/bin/bash
set -e
cd XcodeBenchmark
PIPELINE_ID=$(glab -R josephburnett/XcodeBenchmark ci run | awk '{ print $4 }')
#PIPELINE_ID="872194495"
echo "Started pipeline $PIPELINE_ID"
while true
do
  PIPELINE_STATUS=$(glab -R josephburnett/XcodeBenchmark ci get -p $PIPELINE_ID -o json | jq --raw-output '.status')
  if [[ "$PIPELINE_STATUS" == "success" || "$PIPELINE_STATUS" == "failed" ]]
  then
    break
  fi
  echo "Waiting for pipeline $PIPELINE_ID to finish with 'success' or 'failed'"
  sleep 60
done
echo "Results of jobs"
PIPELINE=$(glab -R josephburnett/XcodeBenchmark ci get -p $PIPELINE_ID -o json)
JOB_IDS=$(echo "$PIPELINE" | jq '.jobs[].id')
for ID in $JOB_IDS
do
  STATUS=$(echo "$PIPELINE" | jq ".jobs[] | select(.id==$ID) | .status")
  DURATION=$(echo "$PIPELINE" | jq ".jobs[] | select(.id==$ID) | .duration")
  TRACE=$(glab -R josephburnett/XcodeBenchmark ci trace $ID)
  INSTANCE=$(echo "$TRACE" | awk 'match($0, /Dialing instance (i\-[[:alnum:]]+)/, a) { print a[1] }' | uniq)
  RUNNER=$(echo "$TRACE" | awk 'match($0, /((green|blue)-(1|2).saas-macos-medium-m1.runners-manager.gitlab.com)/, a) { print a[1] }')
  RUNNER_COMMIT=$(echo "$TRACE" | awk 'match($0, /(\([[:alnum:]]{8}\))/, a) { print a[1] }' | uniq)
  JOB_LINE="$PIPELINE_ID $ID $INSTANCE $STATUS $DURATION $RUNNER $RUNNER_COMMIT"
  echo $JOB_LINE
  echo $JOB_LINE >> ~/results
done
```
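With the job lines in `~/results`, a quick aggregation makes slow instances stand out. This is just a sketch, assuming the column layout written by the script above (pipeline, job, instance, status, duration, runner, runner commit):

```bash
# Average duration per instance for successful jobs, slowest last.
# Columns in ~/results: 1=pipeline 2=job 3=instance 4=status 5=duration 6=runner 7=commit
awk '$4 ~ /success/ { sum[$3] += $5; n[$3]++ }
     END { for (i in sum) printf "%s\t%d\t%.0f\n", i, n[i], sum[i] / n[i] }' ~/results |
  sort -t$'\t' -k3 -n
```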
Some of the instances are just slower than others. Jobs that run on the same instance run at roughly the same speed. E.g. i-0b8e504a7530a24e4 had 516 and 523 seconds, i-049101f3715da7725 had 752 and 758, and i-0ad7fa610b569b6fe had 821 and 817.
My previous theory (some instances are just slow) was incorrect. Here is the performance of the successful jobs (94 out of 108). Performance on a given instance does vary quite a bit.
Some of the instances previously ran successful jobs, so they don't start out non-viable. @tmaczukin and @ajwalker have a theory that an instance stops working after running 42 jobs: #79 (comment 1397551311)
I've done some load testing focusing exclusively on CPU (test). The results vary quite a bit and we seem to suffer from a noisy neighbor problem. Whenever two jobs run on a single machine, they are twice as slow.
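For reference, the kind of CPU-only load I mean is roughly the following (a simplified sketch, not the exact test linked above). Run it inside one guest VM alone, then inside both VMs on the same host at the same time, and compare the wall-clock times:

```bash
#!/bin/bash
# Pure-CPU busy loop; the iteration count is arbitrary, just big enough to take
# a noticeable amount of time on an M1 core.
cpu_burn() {
  local i=0
  while [ "$i" -lt 10000000 ]; do i=$((i + 1)); done
}

# With a true noisy neighbor, the 'real' time roughly doubles when both VMs run this.
time cpu_burn
```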
I'm trying to understand how the jobs in @gabrielengel_gl 's pipeline overlapped. Here is a visualization of the 869480618 pipeline:
My comments from slack:
> Most of the instances were working on 2 of your jobs at a time (you were right). But there are some interesting exceptions. A few instances didn't show up until much later, had 2 jobs the whole time, and took forever.
>
> i-0eb9068acc2194889, i-0e0a0c8831979e1dd and i-0de2db4c2516dd868 are interesting because they didn't show up until quite a bit later. I assume the system was autoscaling up. And their jobs ran very slowly. But i-0ea48226a6bf1b397 was also running slowly and it was there from the beginning.
Here is a view of the 869321676 pipeline that @gabrielengel_gl pointed me to. It looks similar, although I removed an outlier job that ran like 3 hours later than all the rest (!?). Maybe it was restarted manually.
It does look like Gabriel's pipelines saturate the capacity because some don't get started until much later. Would be interested to see if the same pattern shows up with fewer jobs overall.
Production versus Staging comparison -- May 24, 2023
I did an analysis comparing our production and staging performance with the xcodebuild benchmark. At the time of this writing, production has only one performance improvement rolled out (gp2 -> gp3 storage). Staging has a new AMI that promotes the operating system from Monterey to Ventura, which includes some improvements in the virtualization framework. To get an idea of the improvements we might expect, I ran several benchmarks in each environment. These are the results.
Test conditions
I ran the xcodebuild benchmark in 12 parallel jobs, 3 times, for each environment. A total of 6 pipelines, alternating between staging and production. The pipelines were triggered from my jburnett/mac-perf branch which includes only redirection of build logs to /dev/null (so I can scrape the results logged at the end) and changes to .gitlab-ci.yml to run a smaller test and point to the relevant environment.
I used the code at josephburnett/pipeline-lifeline to collect data from each run and generate a lifeline graph (.svg) and a histogram of duration (.txt). All data and artifacts are attached as pipelines.tar.gz to this post.
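The duration histogram is nothing fancy; this is a minimal sketch of the same idea (not the actual pipeline-lifeline code), assuming a file with one job duration in seconds per line:

```bash
# durations.txt: one duration (seconds) per line -- hypothetical input file.
# Bucket durations into 100-second bins and print a text histogram.
awk '{ bin = int($1 / 100) * 100; count[bin]++ }
     END { for (b in count) { printf "%4d-%4d | ", b, b + 99
                              for (i = 0; i < count[b]; i++) printf "#"
                              print "" } }' durations.txt | sort -n
```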
Production
Over all 3 runs, this is what job duration looked like in production:
Staging is much more inconsistent than production. This is unexpected because staging includes what should be a performance improvement. To really evaluate that improvement (the promotion to Ventura) we will have to wait until it rolls out to production and run another set of production tests.
Jobs in production were not queued, but jobs in staging were, quite a bit. This is probably because of lower capacity. Theoretically, even if the jobs were queued for a while, they should still run just as fast once they actually start. You can see that staging ran on fewer machines than production (see the included pipeline-lifeline-staging.svg and pipeline-lifeline-production.svg).
Jobs consistently run faster when they are alone on a machine. Take for example Staging Run 1 (pipeline-lifeline-878246414.svg): jobs that start together on a single machine run for about 900-1000 seconds, while jobs that start later run for about 500 seconds. This is unexpected because resources should be capped within a VM. Some ad hoc testing has shown that CPU-oriented work runs much slower than expected when both VMs are loaded down.
Jobs that land on new machines, or on machines that haven't run the xcodebuild benchmark recently, run slower. You can see this again in the same Staging Run 1 (and others): the first jobs are usually slower than the ones that start later.
Conclusions
The new operating system alone doesn't seem to dramatically improve consistency or overall performance. However, there are two more improvements in the pipeline:
- Direct use of the virtualization framework (replacement of Tart)
- Use of local SSD instead of gp3 EBS volumes
These changes may have more dramatic effects.
We should roll out the new image and retest production. Then we should retest again with each successive improvement. Now we have a baseline and some tooling, so it should be pretty easy to do.

pipelines.tar.gz
@ajwalker's initial, one-off test looks promising. Here are results from running the benchmark without Tart, on Ventura with SSDs:
> AMI and nested VMs all Ventura, xcode benchmarks:
> on the host: 123.829 sec
> Inside nested VM: 163.354 sec
> 2 nested running VMs concurrently: 238 sec each
> This was using https://github.com/devMEremenko/XcodeBenchmark, not our fork
This is dramatically faster than our 500-1000 second runs (238 sec each). He's retesting again with Tart, Monterey and EBS just to make sure there aren't any differences. We'll post those results here for comparison.
[1] This image was assigned to our other runner set, but a blue/green deployment had not yet taken place, so this was unaltered production and I don't know exactly why the long-running jobs disappeared. More on that
Job durations were much tighter, mostly around 495-544 seconds. Looking at the lifelines, the 303-399 second runs were always alone on the instance, and none of the 900-1100 second runs showed up.
Conclusions
So execution time is now quite consistent and at a good level of performance. We should still try our subsequent improvements (replacement of Tart and local SSD) to see if they affect overall performance, but the consistency problem is now solved.
As noted above [1], the previous retest was actually run without the Ventura and CPU-stealing changes. That deployment has now taken place (thanks @tmaczukin), so I've run another performance test. Here are the results:
Again, we have some jobs that ran obscenely fast (315-416 sec); these were likely alone on their instance. There are some solitary jobs that ran at average speed (especially in Run 3), but there might have been jobs from another pipeline running on those instances; I just don't have that data. I also notice the third run used more instances overall, indicating the system had (maybe) scaled up.
I also notice an interesting bifurcation in the runtimes at the upper end.
(all three runs together)
This also appears in individual runs to some degree:
The overall pipeline execution remains quite consistent, so I believe this issue should remain closed.
However, I wonder about the bifurcation. I would like to know which image each job ran on. My theory is that even with the blue/green deployment complete, some of the instances are running different images. @tmaczukin do you have evidence for or against this theory?
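The data collected by my script above doesn't include the AMI, but it does capture the runner commit (column 7) and the instance ID (column 3), so a first check is whether the two duration modes split cleanly along either of them. A rough sketch, assuming that column layout:

```bash
# Duration statistics grouped by runner commit; swap $7 for $3 to group by instance.
awk '$4 ~ /success/ {
       k = $7; sum[k] += $5; n[k]++
       if (n[k] == 1 || $5 < min[k]) min[k] = $5
       if ($5 > max[k]) max[k] = $5
     }
     END { for (k in n) printf "%s jobs=%d avg=%.0f min=%.0f max=%.0f\n",
                               k, n[k], sum[k] / n[k], min[k], max[k] }' ~/results
```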