Explore whether the environment can be built and updated to the newest Nightly automatically (this may overlap with the OpenShift Operator work, gitlab-org&4986 (closed)).
Prepare test data and the environment.
Create a schedule to run GPT against it, for example weekly (a sketch of what that could look like follows below).
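A minimal sketch of what that weekly run could look like, assuming it is driven by a GitLab CI pipeline schedule and uses the GPT docker image; the image name, `--environment` flag and the `10k_hybrid.json` environment file are assumptions for illustration, not the final setup:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper script invoked by a weekly GitLab CI pipeline schedule.
# The image name, --environment flag and 10k_hybrid.json file are assumptions.
set -euo pipefail

RESULTS_DIR="${RESULTS_DIR:-$PWD/results}"
ENVIRONMENTS_DIR="${ENVIRONMENTS_DIR:-$PWD/environments}"
mkdir -p "$RESULTS_DIR"

# Run GPT against the target environment; test results land in $RESULTS_DIR.
docker run --rm \
  -e ACCESS_TOKEN="$GPT_ACCESS_TOKEN" \
  -v "$ENVIRONMENTS_DIR":/environments \
  -v "$RESULTS_DIR":/results \
  gitlab/gitlab-performance-tool \
  --environment 10k_hybrid.json
```

The same script could be pointed at the Nightly environment once the automatic build/update step above is in place.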
Yeah, this is certainly worth exploring going forward. For it to be valuable though, we'd need it to have the same conditions as our main test pipelines:
The GitLab version is the same across components, in this case the Charts and Omnibus.
Updates can happen seamlessly in CI.
Monitoring works the same as well; we really need full monitoring to be able to investigate failures correctly. With the hybrid environments it's difficult to get Prometheus (Omnibus) to poll the Charts nodes (see the sketch below for the kind of reachability this requires).
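On the monitoring point, a rough sketch of the reachability check this implies, assuming the Charts nodes expose metrics on their internal IPs (e.g. node-exporter on port 9100); the port and the exact scrape targets are assumptions:

```bash
#!/usr/bin/env bash
# Rough check that the Omnibus box running Prometheus can reach the GKE
# (Charts) nodes it needs to poll. The node-exporter port 9100 is an
# assumption; the real scrape targets may differ.
set -euo pipefail

# Internal IPs of the cluster nodes (run from a host with cluster access).
NODE_IPS=$(kubectl get nodes \
  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')

# Each endpoint needs to be reachable from the Omnibus node for Prometheus
# to scrape it.
for ip in $NODE_IPS; do
  if curl -sf --max-time 5 "http://${ip}:9100/metrics" > /dev/null; then
    echo "${ip}:9100 reachable"
  else
    echo "${ip}:9100 NOT reachable"
  fi
done
```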
While I was testing the 50k hybrid environment, I looked into whether there's a way to save costs with GKE and found that `gcloud container clusters resize --num-nodes 0` can be used to resize a specific node pool to 0. As far as I understand, it cordons the nodes and makes them unschedulable, then eventually drains the pods. In our case we currently need to resize 3 node pools, until gitlab-org/quality/reference-architectures#65 (closed) is closed. Resizing the nodes on 50k took quite a lot of time - about 1 hour or even more. That's probably because, while GKE is draining a node, the controller tries to reschedule the pods but can't, since all the nodes are cordoned. I'm not sure yet how to get around this. Overall though, when I resized the node pools back to their original size, the pods were reinitialised and the environment worked fine.

Another path may be to delete the release and resize the node pools to 0 after each test run, so that next time GET installs the chart from scratch.
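To make that concrete, a sketch of the scale-down / scale-up flow; the cluster, zone, pool names and node counts below are placeholders, not the real 50k configuration:

```bash
#!/usr/bin/env bash
# Sketch: scale the GKE node pools to 0 after a test run and back up before
# the next one. Cluster, zone, pool names and sizes are placeholders.
set -euo pipefail

CLUSTER="gitlab-50k-hybrid"
ZONE="us-east1-c"
POOLS=(webservice sidekiq supporting)   # the 3 pools we currently resize
SIZES=(4 2 2)                           # original node counts per pool

scale_down() {
  for pool in "${POOLS[@]}"; do
    gcloud container clusters resize "$CLUSTER" \
      --zone "$ZONE" --node-pool "$pool" --num-nodes 0 --quiet
  done
}

scale_up() {
  for i in "${!POOLS[@]}"; do
    gcloud container clusters resize "$CLUSTER" \
      --zone "$ZONE" --node-pool "${POOLS[$i]}" --num-nodes "${SIZES[$i]}" --quiet
  done
}

"$@"   # usage: ./resize.sh scale_down | scale_up
```

For the second path, the release could be removed first (e.g. `helm uninstall gitlab`, release name assumed) before scaling down, so that GET installs the chart again on the next run.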