Expand the test framework to support other categories of chaos test, focused perhaps on resource consumption rather than solely networking issues
Invest some time in finding a way to bridge a gap we keep running into: the data we use for tests simply doesn't exercise enough corner cases to surface issues before they reach production.
@grantyoung - from our latest discussion: we will look into how to break down the RA proposal into OKRs. For example, setting up the learning path + FAQs in Q4 would be a great help in getting more team members involved in RA review
Add more verification of different replication types into gitlab-qa
Also, for other ideas, I was looking at possibly adding more varied builds into the Geo pipelines. Currently we test the 1k and 3k reference architectures. Whilst these are good for size, there are more variations, such as file-based object storage and external Postgres/replication. Some discussion with the Geo team took place in https://gitlab.com/gitlab-org/geo-team/geo-ci/-/issues/2
@nwestbury - thanks Nick! I'm in agreement that we should look to add more varied builds into the Geo pipelines. We can call out specifically which ones for an OKR in Q4.
One more thought here was to add support for HA tracking databases in GET. This may have some crossover with some of @grantyoung's plans, so I'll just leave it here as a thought for now.
@ksvoboda (this is just a cut and paste from our 1:1 doc): with regard to decomposing the work, I have some thoughts, but wasn't sure if you wanted them on the issue or in Ally, so I'll write them here and we can move them to wherever they fit best.
The work to allow the Data Stores tests to run in live environments was done last quarter.
Last week I ran some experiments. I have a draft MR that uses the GITLAB_QA_ACCESS_TOKEN explicitly instead of the default user’s credentials. This is because the default user doesn’t have credentials on live environments like Staging, Staging-Ref and Production. gitlab-org/gitlab!93972 (closed)
I then ran the same test, but set the GITLAB_QA_ACCESS_TOKEN variable to the GITLAB_QA_ADMIN_ACCESS_TOKEN value instead, which worked.
I then tried to reproduce my results in CI (ongoing).
This tells me that one of two things is happening, either:
There is code executing that requires admin credentials, and we need to modify the test, or
The GITLAB_QA_ACCESS_TOKEN doesn’t have sufficient credentials to run the test and needs to be modified.
So, now, I’m going to audit the test while running with the GITLAB_QA_ACCESS_TOKEN credentials to verify the test does not require admin credentials somewhere.
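As a quick sanity check before (or alongside) that audit, a small script could show whether the token itself is the limiting factor. This is just a sketch, assuming Python with requests; STAGING_URL is a placeholder, and the `/personal_access_tokens/self` endpoint only exists on reasonably recent GitLab versions:

```python
# Sketch: inspect what GITLAB_QA_ACCESS_TOKEN can actually do before auditing the test.
# STAGING_URL is a placeholder; the env var name matches the one used by gitlab-qa.
import os

import requests

STAGING_URL = "https://staging.gitlab.com"
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_QA_ACCESS_TOKEN"]}

# Does the token authenticate, and does it belong to an admin user?
# (is_admin only appears in the response when the user is an administrator.)
user = requests.get(f"{STAGING_URL}/api/v4/user", headers=headers)
user.raise_for_status()
data = user.json()
print("user:", data.get("username"), "admin:", data.get("is_admin", False))

# Which scopes does the token carry? (endpoint may not exist on older instances)
token_info = requests.get(f"{STAGING_URL}/api/v4/personal_access_tokens/self", headers=headers)
if token_info.ok:
    print("scopes:", token_info.json().get("scopes"))
```

If the user is a non-admin but the scopes include api, the failure points at the first case above (the test itself needs admin rights); if the scopes are missing or the calls fail, it points at the second.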
Once that is done, if I find the test is correctly written, I’ll need to modify the GITLAB_QA_ACCESS_TOKEN in Staging, and likely Staging-Ref. But rotating an access token isn’t something I’ve done frequently and it has repercussions, so I’ll need to ask about how to do that safely so that no other tests are impacted negatively.
That said, I think the procedure is something like:
Log into Staging with the admin login in 1Password
Revoke the old GITLAB_QA_ACCESS_TOKEN
Create a new access token called GITLAB_QA_ACCESS_TOKEN and give it the proper api scope needed to run the tests
Change the 1Password entry to reflect the new token
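If it turns out to be easier to script this than to click through the UI, a rough sketch of the token-creation step might look like the following. This is an assumption-heavy example rather than the documented runbook: it uses the admin-only user personal access tokens API, and GITLAB_QA_USER_ID plus the expiry date are placeholders:

```python
# Sketch: create a replacement token for the gitlab-qa user via the admin API.
# GITLAB_QA_USER_ID and the expiry are placeholders; the admin token comes from 1Password.
import os

import requests

STAGING_URL = "https://staging.gitlab.com"
GITLAB_QA_USER_ID = 123  # placeholder: the id of the user the QA suite runs as
admin_headers = {"PRIVATE-TOKEN": os.environ["GITLAB_QA_ADMIN_ACCESS_TOKEN"]}

resp = requests.post(
    f"{STAGING_URL}/api/v4/users/{GITLAB_QA_USER_ID}/personal_access_tokens",
    headers=admin_headers,
    data={
        "name": "GITLAB_QA_ACCESS_TOKEN",
        "scopes[]": ["api"],         # the scope the E2E tests need
        "expires_at": "2023-12-31",  # placeholder expiry
    },
)
resp.raise_for_status()
print("new token created, update 1Password with:", resp.json()["token"])
```

The old token would still need to be revoked separately (the personal access tokens API has a revoke endpoint for that), and either way the change is only safe once every pipeline that reads the 1Password entry has picked up the new value.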
I agree that expanding the suite of chaos tests is a good candidate for an OKR.
We could expand the initial set of networking issues
We could extend the test framework to support other categories of chaos test, focused perhaps on resource consumption rather than solely networking issues
Yes, feature flags in testing have been a pain point. Note our general investment into feature flags in Gitaly (gitlab-org/gitaly#4459 (closed)); making consistent use of them in tests would nicely match this.
The other large/systemic issue we have is that Rails E2E tests reach down into Gitaly and depend on its internals. If we could add the new tests you're proposing so that they stop at the API, and subsequently retire the old E2E tests, that would reduce long-term toil/maintenance cost and make failures a lot easier to debug. (gitlab-org/gitaly#4438 (closed) touches on this as well.)
@john.mcdonnell - I've added increasing the chaos tests; do we want to also include a goal for adding documentation around how to use/create chaos tests?
Also, for the above discussion of some of the items being worked on within Gitaly, are there any we want to add to this issue's description?
I'd defer any non-functional / 'performance' tests until later, as that will most likely be a more complex task: we'd need to work with our existing performance test framework, which I'm much less familiar with, and it feels like too much to take on in a single quarter
Assist with the RA-related OKR for scalability (no specific plan at the moment, but as discussed we can parse through existing review requests and create a troubleshooting/FAQ section based on them)
@ksvoboda thanks! Regarding the latest comment, I don't think we discussed GET issues. I've split the Distribution work into pieces and created gitlab-org&9057 to track all the efforts.
For the Q4 OKR I'm suggesting we work on some of the areas identified above:
@ksvoboda thanks for today's discussion! Per our further conversation, I'm updating the proposed Q4 OKR to focus on "Analyse and increase test coverage for GitLab upgrades" (related to the recent degradation in FIPS upgrades, gitlab-org/omnibus-gitlab#7275 (closed)). I've updated #1270 (closed) and #1504 (closed) with the identified steps and goals. Please let me know if anything should be adjusted