Expand the test framework to support other categories of chaos test, focused perhaps on resource consumption rather than solely networking issues
Invest some time in finding a way to bridge a gap we keep running into: the data we use for tests simply doesn't exercise enough corner cases to surface issues before they reach production.
@grantyoung - from our latest discussion: we will look into how to break down the RA proposal into OKRs. For example, setting up the learning path + FAQs in Q4 would be a great help in getting more team members involved in RA review
Add more verification of different replication types into gitlab-qa
Also, for other ideas, I was looking at possibly adding more varied builds into the Geo pipelines. Currently we test the 1k and 3k reference architectures. Whilst these are good for size, there are more variations, such as file-based object storage and external Postgres/replication. Some discussion with the Geo team took place in https://gitlab.com/gitlab-org/geo-team/geo-ci/-/issues/2
@nwestbury - thanks Nick! I'm in agreement that we should look to add more varied builds into the Geo pipelines. We can call out specifically which ones for an OKR in Q4.
One more thought here was to add support for HA tracking databases in GET. This may have some crossover with some of @grantyoung's plans, so I'll just leave it here as a thought for now.
@ksvoboda (this is just a cut and paste from our 1:1 doc): with regard to decomposing the work, I have some thoughts, but wasn't sure if you wanted them on the issue or in Ally, so I'll write them here and we can move them to wherever they fit best.
The work to allow the Data Stores tests to run in live environments was done last quarter.
Last week I ran some experiments. I have a draft MR that uses the GITLAB_QA_ACCESS_TOKEN explicitly instead of the default user’s credentials. This is because the default user doesn’t have credentials on live environments like Staging, Staging-Ref and Production. gitlab-org/gitlab!93972 (closed)
I then ran the same test, but set the GITLAB_QA_ACCESS_TOKEN variable to the GITLAB_QA_ADMIN_ACCESS_TOKEN value instead, which worked.
I then tried to reproduce my results in CI (ongoing).
This tells me that one of two things is happening, either:
There is code executing that requires admin credentials, and we need to modify the test, or
The GITLAB_QA_ACCESS_TOKEN doesn’t have sufficient credentials to run the test and needs to be modified.
So, now, I’m going to audit the test while running with the GITLAB_QA_ACCESS_TOKEN credentials to verify the test does not require admin credentials somewhere.
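As a quick sanity check before (or alongside) that audit, a small script could show whether the token itself is the limiting factor. This is just a sketch, assuming Python with requests; STAGING_URL is a placeholder, and the `/personal_access_tokens/self` endpoint only exists on reasonably recent GitLab versions:

```python
# Sketch: inspect what GITLAB_QA_ACCESS_TOKEN can actually do before auditing the test.
# STAGING_URL is a placeholder; the env var name matches the one used by gitlab-qa.
import os

import requests

STAGING_URL = "https://staging.gitlab.com"
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_QA_ACCESS_TOKEN"]}

# Does the token authenticate, and does it belong to an admin user?
# (is_admin only appears in the response when the user is an administrator.)
user = requests.get(f"{STAGING_URL}/api/v4/user", headers=headers)
user.raise_for_status()
data = user.json()
print("user:", data.get("username"), "admin:", data.get("is_admin", False))

# Which scopes does the token carry? (endpoint may not exist on older instances)
token_info = requests.get(f"{STAGING_URL}/api/v4/personal_access_tokens/self", headers=headers)
if token_info.ok:
    print("scopes:", token_info.json().get("scopes"))
```

If the user is a non-admin but the scopes include api, the failure points at the first case above (the test itself needs admin rights); if the scopes are missing or the calls fail, it points at the second.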
Once that is done, if I find the test is correctly written, I’ll need to modify the GITLAB_QA_ACCESS_TOKEN in Staging, and likely Staging-Ref. But rotating an access token isn’t something I’ve done frequently and it has repercussions, so I’ll need to ask about how to do that safely so that no other tests are impacted negatively.
That said, I think the procedure is something like:
Log into Staging with the admin login in 1Password
Revoke the old GITLAB_QA_ACCESS_TOKEN
Create a new access token called GITLAB_QA_ACCESS_TOKEN and give it the proper api scope needed to run the tests
Change the 1Password entry to reflect the new token
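If it turns out to be easier to script this than to click through the UI, a rough sketch of the token-creation step might look like the following. This is an assumption-heavy example rather than the documented runbook: it uses the admin-only user personal access tokens API, and GITLAB_QA_USER_ID plus the expiry date are placeholders:

```python
# Sketch: create a replacement token for the gitlab-qa user via the admin API.
# GITLAB_QA_USER_ID and the expiry are placeholders; the admin token comes from 1Password.
import os

import requests

STAGING_URL = "https://staging.gitlab.com"
GITLAB_QA_USER_ID = 123  # placeholder: the id of the user the QA suite runs as
admin_headers = {"PRIVATE-TOKEN": os.environ["GITLAB_QA_ADMIN_ACCESS_TOKEN"]}

resp = requests.post(
    f"{STAGING_URL}/api/v4/users/{GITLAB_QA_USER_ID}/personal_access_tokens",
    headers=admin_headers,
    data={
        "name": "GITLAB_QA_ACCESS_TOKEN",
        "scopes[]": ["api"],         # the scope the E2E tests need
        "expires_at": "2023-12-31",  # placeholder expiry
    },
)
resp.raise_for_status()
print("new token created, update 1Password with:", resp.json()["token"])
```

The old token would still need to be revoked separately (the personal access tokens API has a revoke endpoint for that), and either way the change is only safe once every pipeline that reads the 1Password entry has picked up the new value.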
I agree that expanding the suite of chaos tests is a good candidate for an OKR.
We could expand the initial set of networking issues
We could extend the test framework to support other categories of chaos test, focused perhaps on resource consumption rather than solely networking issues
Yes, feature flags in testing have been a pain point. Note our general investment into feature flags in Gitaly (gitlab-org/gitaly#4459 (closed)); making consistent use of them in tests would nicely match this.
The other large/systemic issue we have is that Rails E2E tests reach down into Gitaly and depend on its internals. If we could add the new tests you're proposing so that they stop at the API, and subsequently retire the old E2E tests, that would reduce long-term toil/maintenance cost and make failures a lot easier to debug. (gitlab-org/gitaly#4438 (closed) touches on this as well.)
@john.mcdonnell - I've added increasing the chaos tests; do we want to also include a goal for adding documentation around how to use/create chaos tests?
Also, for the above discussion of some of the items being worked on within Gitaly, are there any we want to add to this issue's description?
I'd defer any non-functional / 'performance' tests until later, as that will most likely be a more complex task: we'd need to work with our existing performance test framework, which I'm much less familiar with, and it feels like too much to take on in a single quarter
Assist with the RA-related OKR for scalability (no specific plan at the moment, but as discussed we can parse through existing review requests and create a troubleshooting/FAQ section based on them)
@ksvoboda thanks! Regarding the latest comment, I don't think we discussed GET issues. I've split the Distribution work into pieces and created gitlab-org&9057 to track all the efforts.
For the Q4 OKR I'm suggesting we work on some of the areas identified above:
@ksvoboda thanks for today's discussion! Per our further conversation, I'm updating the proposed Q4 OKR to focus on "Analyse and increase test coverage for GitLab upgrades" (related to the recent degradation in FIPS upgrades, gitlab-org/omnibus-gitlab#7275 (closed)). I've updated #1270 (closed) and #1504 (closed) with the identified steps and goals. Please let me know if anything should be adjusted