I've created this issue for us to provide async updates on a weekly basis. FYI @meks, @niskhakova, @zeffmorgan, @amyphillips, @pguinoiseau, @ibaum. Please feel free to provide your update for the week (actively worked on or delivered). I assume this would make it easier for Mek
to communicate in daily standups.
Agreed that it brings more paperwork, but I'm not sure what the best way to provide updates is. The reason for moving this out of the main epic was to reduce potential noise for everyone. If that's not a concern, I'm fine with using the main epic for updates.
@meks could you please clarify by what date you usually need the latest update for the Mixed env? That way we can post weekly results in line with it.
@meks did you by chance decide where you would like to have updates posted? Would you like me to create a new discussion in the main epic as discussed? If so, I'll just name it Quality Engineering Status Updates or something similar and we can post our updates in there. Also happy to add updates in the meeting doc itself if you prefer.
What would you all prefer? @niskhakova it doesn't appear the epic has been getting much more noise recently. I don't know if that's because discussions have moved off into the linked issues and sub-epics.
Let us know @meks and we'll get updates posted for you for the meeting Monday.
@niskhakova @zeffmorgan noted, let's keep it here. Also, for the SSoT, let's make it inclusive of progress outside of QE as well. There is also the use case from development and other updates from Infra.
If we can have an update before Monday 7:30 AM PST, that would be great.
What's done: Great news here, we were able to configure the Staging Ref environment to upgrade using custom GitLab images from private GitLab repos (pre-release and dev.gitlab.com). Many thanks to @pguinoiseau for all the help here.
@niskhakova to facilitate using this update issue as an SSoT, I'm creating a single discussion for QE-specific activity so our updates can be added as responses and grouped together under QE. Other teams can follow suit. What do you think?
Please see the previous update for my schedule. This past week I was only in on Tuesday, when I worked on technical interviews and debugging. The schedule from 2021-10-12 still stands.
I have discovered some coupling that should be refactored, but I should be able to roll that into my existing work.
We only have one bidirectional test running at the moment as a scheduled job. However, once the above items are complete, we will have a significant amount of coverage added, giving us a confidence level similar to our current SMOKE and RELIABLE tests for the Migration failure mode of mixed deployment.
Timeline: I expect to jump from minimal coverage to Migration failure mode coverage equivalent to that of all of our existing SMOKE/RELIABLE tests by Wednesday, 10/27, if not sooner. All of the above development work will be completed by today, with the only delay being reviews and releasing the updated version of the GitLab QA gem.
Thank you @zeffmorgan, this is quite a lot of technical verbosity to roll up into our standup update. I would appreciate it if this could be simplified for reporting up to stakeholders. I am looking for a definitive date when we can confidently say that mixed-deployment errors are being guarded against/tested before we deploy to production.
I expect to jump from minimal coverage to Migration failure mode coverage equivalent to that of all of our existing SMOKE/RELIABLE tests by Wednesday, 10/27
This seems key. So we will be blocking deployments with mixed-deployment tests by 10/27? Is this also accounting for the below?
GitLab-QA pipeline for Staging Ref - added the Staging Ref QA pipeline and working on adding a new GitLab QA scenario for this environment. It should be done by the end of the week. All details are in the issue description.
Next steps:
Proceed with the issues above and also prepare the list of issues that are left to be done in Q4.
We are still on track for gaining the coverage I mentioned previously by mid-week this week. This is to gate deployments to canary.staging.gitlab.com. This will provide some coverage specifically for the migration failure mode in a mixed deployment environment. I am guessing that the sum of our existing smoke and reliable tests may cover from 20%-40% of our existing schema, which is what we want to account for within this specific failure mode, but that is a very rough estimate. Follow up analysis and additional tests will likely be warranted.
Due to the scope of the mixed deployment environment, we do not yet have coverage estimates or definitive dates to achieve a percent of coverage for the other failure modes. I'm working on a more definitive list of work that needs to be accomplished to address the entirety of this large project to help align everyone on the actual scope of work and will share that this week.
The deployment pipeline changes are quite substantial so I expect them to take several milestones to complete safely. At the moment we're setting up the epic - gitlab-com/gl-infra&608 (closed) and have started work to create a Gitaly node for gstg-canary - gitlab-com/gl-infra/delivery#2087 (closed)
All initial dev work on QA side to introduce coverage for mixed deployment is complete, both for introducing existing tests and for mixed deployment environment (MDE) tests
There is a cascade of dependent MRs that are either merged or in the review process. I am currently shepherding the remaining MRs through review to expedite enabling tests within the new staging-canary environment. Once these are merged, hopefully today, we will be able to add all of the above changes to deployer.
Reviews and analysis from last week helped determine a need for some refactoring to both simplify our existing approach and speed up our ability to reintroduce existing tests to benefit from them within the MDE context (their implementation can help discover some issues)
One caveat we hadn't really discussed before, related to the pipeline deployment changes currently being worked on: regardless of whether we execute MDE tests (or leverage existing smoke/reliable tests) to help with MDE coverage, we can provide no guarantee that we will catch MDE failures. This is because we cannot guarantee the state of the system we are testing until the deployer changes are complete.
An additional caveat is that we could conceivably block deployments to staging-canary based on system state that will never be deployed to production. I think this risk is minimal, but it should be noted.
Refactoring completed to enable blocking deployments. Waiting on a test to be quarantined to reduce pipeline noise before it is enabled. Next steps are to merge maintainer approved pipeline changes, announce pipeline changes, and enable deployer changes. This work is already in place and just needs to be toggled on once the test is quarantined.
Create access tokens for Staging Ref automatically - closed as done. Access tokens will be created automatically if the environment is rebuilt. All Staging Ref credentials are now stored in the GitLab-QA and Engineering 1Password vaults.
!74231 (merged) - to account for Staging Ref in E2E tests. Before this change, tests marked with `only: :staging` also ran against Staging Ref due to a small issue in the E2E context selector regex (a simplified sketch of the matching follows this update).
Will proceed to work on Improve load testing for staging-canary - learn more about the current Staging architecture, try to run the crawler with lower RPS, and monitor gstg health.
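For anyone curious, here is a simplified, hypothetical sketch of the kind of hostname matching involved. It is not the actual gitlab-qa context selector code, and it assumes the `staging-ref.gitlab.com` hostname; the point is that a regex anchored only on `staging` also matches Staging Ref, while matching the full staging host does not.

```ruby
# Illustrative only: not the real E2E context selector implementation.
require 'uri'

module ContextSelectorSketch
  # Too loose: matches staging.gitlab.com AND staging-ref.gitlab.com.
  LOOSE_STAGING = /staging/
  # Tighter: matches the staging host exactly.
  STRICT_STAGING = /\Astaging\.gitlab\.com\z/

  def self.staging?(address, pattern: STRICT_STAGING)
    URI(address).host.match?(pattern)
  end
end

ContextSelectorSketch.staging?('https://staging.gitlab.com')     # => true
ContextSelectorSketch.staging?('https://staging-ref.gitlab.com') # => false
ContextSelectorSketch.staging?('https://staging-ref.gitlab.com',
                               pattern: ContextSelectorSketch::LOOSE_STAGING) # => true (old behaviour)
```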
We are currently burning in the additional test pipelines triggered by deployment for Staging-Canary, to allow infrastructure to collect information on the impact in this environment.
These pipelines are triggered by deployer.
These provide a level of coverage for the schema-related failure mode equivalent to our current level of smoke and reliable end-to-end tests. Once satisfied with the impact on top of the existing test jobs, infrastructure will change these to blocking jobs.
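For context on what "triggered by deployer" means mechanically, here is a rough, hypothetical sketch of kicking off a downstream QA pipeline via the pipeline trigger API. The project ID, token, variable names, and the ops.gitlab.net host are placeholders/assumptions; the real integration lives in the deployer tooling.

```ruby
# Hypothetical sketch: trigger a downstream QA pipeline after a deploy.
# Project ID, token, host, and variable names are placeholders.
require 'net/http'
require 'uri'
require 'json'

def trigger_qa_pipeline(project_id:, trigger_token:, ref: 'master', variables: {})
  uri = URI("https://ops.gitlab.net/api/v4/projects/#{project_id}/trigger/pipeline")

  params = { 'token' => trigger_token, 'ref' => ref }
  # Trigger variables are passed as variables[KEY]=VALUE form fields.
  variables.each { |key, value| params["variables[#{key}]"] = value }

  JSON.parse(Net::HTTP.post_form(uri, params).body)
end

# e.g. after deploying to staging-canary:
# trigger_qa_pipeline(project_id: 123, trigger_token: ENV['QA_TRIGGER_TOKEN'],
#                     variables: { 'QA_ENVIRONMENT' => 'staging-canary' })
```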
Focusing on Improve load testing for staging-canary - specifically planning to enable the crawler once more and track down any 500 errors to understand what the limiting factor is for the current load testing on Staging.
Continue work on Improve load testing for staging-canary - will monitor error rates on Staging further and will work on adding an agent cookie to avoid 429 errors on web calls.
Continuing work on adding API functionality for expanding test coverage
Expanded due to gitlab-com/gl-infra/delivery#2113 (closed). This will enable more coverage for failure modes that involve both Gitaly and Praefect. Further exploration and planning are required to implement this in a way that does not require rewriting specific tests but can leverage our existing suites (see the sketch at the end of this update).
Next:
Re-evaluate failure mode analysis with existing coverage and tooling to map out next actions
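As a rough illustration of the "leverage existing suites instead of rewriting tests" idea referenced above (hypothetical structure, not the actual suite), RSpec metadata lets the same specs be selected for a mixed-deployment run without touching their bodies.

```ruby
# Hypothetical sketch: reuse an existing spec for mixed deployment via metadata tags.
require 'rspec/autorun'

RSpec.describe 'Repository push', :reliable, :mixed_deployment do
  it 'pushes a commit through Gitaly' do
    expect(true).to be(true) # placeholder for the existing test body
  end
end

# A mixed-deployment run can then select existing specs by tag, e.g.:
#   rspec --tag mixed_deployment
# so Gitaly/Praefect failure-mode coverage comes from the same suite rather than new tests.
```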
I will be out on Monday and Tuesday next week, so I'm posting my update today instead of Monday.
What was done:
Improve load testing for staging-canary - load emulation has now been running against Staging for more than a week and the QA pipelines were stable. Analysed request errors in the cmbr API and Web pipelines, updated cmbr to fix 429 and 503 errors, and wrote a plan for the next improvements - #338978 (comment 764287105).
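For reference, a generic sketch of the kind of 429/503 handling involved; this is not the actual cmbr change, just the general retry-with-backoff technique, honouring `Retry-After` where the server provides it.

```ruby
# Generic sketch of retrying 429/503 responses with backoff; not the actual cmbr code.
require 'net/http'
require 'uri'

RETRYABLE_CODES = %w[429 503].freeze

def fetch_with_backoff(url, max_attempts: 5)
  uri = URI(url)

  max_attempts.times do |attempt|
    response = Net::HTTP.get_response(uri)
    return response unless RETRYABLE_CODES.include?(response.code)

    # Honour Retry-After when present, otherwise back off exponentially.
    sleep((response['Retry-After'] || 2**attempt).to_i)
  end

  raise "Gave up fetching #{url} after #{max_attempts} attempts"
end
```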
gitlab-com/gl-infra/delivery#2113 (closed) ended up being a discussion around infra being ready to start on pipeline reordering. As a result, I changed gears last week and started on the following to unblock infra's next steps:
Currently testing these changes out and pairing with another SET with more experience around Praefect/Gitaly to make sure we don't miss anything with the new environment configuration. This is to be completed this week or sooner.
We had to modify our approach to account for the single integration point we have with CustomersDot and the Zuora sandbox, to be able to shift Fulfillment tests into staging-canary as well.
All QA test schedules are set up for staging-canary (manual schedules - see below).
Currently:
All QA test suites are first being tested manually, staggered with the existing jobs currently running against staging to avoid unnecessary saturation that could lead to flakiness. These runs should be completed today, and the additional changes for the CustomersDot and Zuora integration merged.
This will give us a solid burn-in period for infra's expected reordering work in January.
Trigger Staging Ref QA pipeline during deployment process - investigated and resolved QA failures in the full pipeline; specifically, some tests ran that should not have been run against a live environment like Staging Ref. The full QA pipeline is now green. Linking the test session to the Release task will be a bigger issue, as it requires refactoring from the Delivery team; see gitlab-com/gl-infra/delivery#2168 (closed)
Continue work on Improve load testing for staging-canary - sync with the Infra team to understand how the error rate SLIs work and what the next steps are to enable alerts on Staging for SLI breaches now that load testing is running.
Experienced some delays affecting our ability to execute tests due to time off and package-and-qa problems that have since been resolved, but were able to run enough tests to identify some necessary maintenance in gitlab-qa (issue referenced below)
Except for the two identified test jobs that are problematic, testing and reporting have both been successful in the full QA suite of tests on staging-canary
Next:
Complete the investigation into the root cause of failures in a very small subset of specs and implement a fix in the gitlab-qa gem.
Improve load testing for staging-canary - working on adding custom cookie support for the webcrawler to load gstg-cny specifically if needed. Cleaned up next steps for load emulation on Staging. There is now &7320 (closed) to track Quality work for load testing, and the pairing Infra epic gitlab-com/gl-infra&668 to track the Infra team's related work.
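For reference, a minimal sketch of the custom cookie idea, assuming the standard `gitlab_canary=true` routing cookie is what directs staging web requests to gstg-cny; the crawler's actual implementation may differ.

```ruby
# Minimal sketch: send the canary routing cookie so web requests land on gstg-cny.
# Assumes the staging load balancer honours the standard gitlab_canary cookie.
require 'net/http'
require 'uri'

def get_via_canary(url)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)
    request['Cookie'] = 'gitlab_canary=true'
    http.request(request)
  end
end

# get_via_canary('https://staging.gitlab.com/explore')
```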
Pipeline Triage DRI this week, so I may have more limited availability for ongoing work. Started validation of the Runner Helm install on GKE with the new containerd base last week for GKE's rollout of containerd, which also impacted availability.
Improve Staging Ref deployment stability - added retries to test whether that helps with intermittent issues. To reduce noise for the Delivery team, deployment failures won't be sent to announcements while Staging Ref deployment stability is being worked on.
No significant progress to report this past week due to Pipeline Triage, containerd testing project for runner on GKE, and Q1 OKR planning for Runner Reference Architectures.
Two MRs here are in review by infrastructure. This documentation made apparent a possible discrepancy in how our subdomain for staging-canary functions, which infrastructure is investigating before pushing through.
Configure Sentry for Staging Ref - it turned out configuring Sentry shouldn't take long, and the Infra team created a new Sentry project for Staging Ref. I started working on the configuration updates to enable Sentry.
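For illustration, a minimal sketch of the Omnibus-side settings involved, assuming the standard `gitlab_rails` Sentry keys in `gitlab.rb`; the DSNs are placeholders, and since Staging Ref is hybrid the Kubernetes side would need equivalent chart values.

```ruby
# gitlab.rb sketch (placeholder DSNs); the hybrid environment's real config is managed elsewhere.
gitlab_rails['sentry_enabled'] = true
gitlab_rails['sentry_dsn'] = 'https://<backend-key>@sentry.example.com/<project-id>'
gitlab_rails['sentry_clientside_dsn'] = 'https://<frontend-key>@sentry.example.com/<project-id>'
gitlab_rails['sentry_environment'] = 'staging-ref'
```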
Improve Staging Ref deployment stability - switched to using another admin user in post-configure to resolve 403 errors. So far there have been no occurrences of the user-blocked error.
Improve Staging Ref deployment stability - updated the issue with the latest status. The 403 and other errors were resolved; the one that's left is a transient 502 error in the post-configure steps.
Not tracked in an OKR, but maybe helpful to highlight that Staging Ref is now being used more by engineers:
Sorry for the lack of updates from my side; progress was slow as we were evacuating, and unfortunately I still have limited availability. But there were some good updates during this time:
What was done:
Setup Geo site for Staging Ref - the Geo EU site was configured, and deployment to Staging Ref has gone through the primary and secondary sites for over a month now.
Configure an asset proxy for Staging-ref - the Infrastructure team made a lot of progress on this, however the work is now on pause: the asset proxy needs to run in Kubernetes since Staging Ref is hybrid, so we first need to wait for camoproxy for GitLab.com to be migrated into Kubernetes. More details in #356044 (comment 905217050).
Set up GitLab-QA pipeline for the new Staging Ref Geo site - release a new GitLab QA version and hopefully finish all the work needed to configure the Geo QA schedule. Will also update the current gstg Geo QA configs, as they look to be deprecated now.
Some previously unknown technical requirements made it necessary to reach out for additional assistance from other engineers. I have one more pairing session on 6/8 (possibly 6/7) that should enable us to get this into final review and completed this week, before our 6/15 start of the next efforts.
Next steps:
Review planning for next iteration of work to improve test environments to determine if any additional assistance is required
I've had three different pairing sessions that have proved unfruitful and have enlisted the assistance of another engineer for another one today. I believe I have simplified this as much as possible and have one more path out of multiple options to debug. Should have this in review again today.
Next steps:
Review planning for next iteration of work to improve test environments to determine if any additional assistance is required
Discovered a shared example that was failing that we thought was unrelated. Did a deep dive to make sure, and realized there was some monkey patching that ended up passing an unexpected HTTP verb. I was able to address it at the API level and am moving this back into review, hopefully for the last time. I feel confident with this last change, as I doubled my testing efforts to ensure it was working appropriately, and it's being reviewed now.
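To make the failure mode concrete, here is a contrived illustration (not the actual code): reopening a shared request helper can silently change the HTTP verb that a shared example ends up exercising, which is why the fix was made at the API level instead.

```ruby
# Contrived illustration of the failure mode; not the real code.
module ApiClient
  def self.request(verb, path)
    "#{verb.to_s.upcase} #{path}" # stand-in for the real HTTP call
  end
end

# Elsewhere, a "helpful" monkey patch reopens the module and re-routes everything:
module ApiClient
  def self.request(_verb, path)
    "POST #{path}" # every shared example now exercises the wrong verb
  end
end

ApiClient.request(:put, '/api/v4/projects/1') # => "POST /api/v4/projects/1"
```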
No progress made while I was out. I'm just back in the office from having to take some unexpected time for illness and am addressing a corrective action from a production incident that occurred while away. I will post another update this week as I get caught up and wrap up my current open items.
Coming back last week I ended up on pipeline triage again. It was a short turnaround due to the creation of the new 3-person schedule, so my efforts were focused there and on catching up on runner-related incidents, while still recovering from an acute illness. I am back this week at 100% and will finish off these couple of remaining tasks.
It is really disappointing that my update didn't get posted yesterday evening when I told @vincywilson it was ready. It seems the system was struggling a bit in the 10 PM EST hour. I'm working some later shifts due to sleep issues after COVID and didn't see her notification that the thread failed to update until later today (also due to Clockwise turning off notifications, which I've now disabled). This is a second attempt.
With the blocking issue resolved, I can now clean up and push through the last remaining functional change to improve our mixed deployment testing approach. While it currently works without this improvement, this will add additional lower level coverage and enable future API-only testing for phase 2. I was able to unblock this without any in-depth functional refactoring by changing the approach to passing these cookies within the blocked MR, working around the incompletely documented third-party dependencies responsible for the malformed cookies.
I also identified the references that could be improved with this approach, documenting them in this issue; that work will also be complete this week.
I've been able to address the needed changes and am finishing up documentation requests from reviews. They will be in tomorrow when I return (I'm out today for a child's medical appointment out of town), and then I can complete updating the few references.
These improvements wrap up planned staging-canary work. The pipelines and test environment work for staging-canary are working as planned and providing mixed deployment environment coverage.
I've had a significant number of surprises, so I've leaned heavily on my reviewers and they are finishing up. I greatly simplified this approach as well, to make the work more maintainable for future iterations.
These improvements wrap up planned staging-canary work. The pipelines and test environment work for staging-canary are working as planned and providing mixed deployment environment coverage.
Test Data - Snowplow access in Staging Ref - clarify with the team whether we can automate the Snowplow configuration change and get feedback on whether the current setup works as expected.
Delivery are close to completing the refactor of QA triggers on the deployment pipeline. This work will allow gstg-cny and staging-ref tests to be triggered and results reported.
@meks The team has made immense progress on Staging Canary and is close to winding down all remaining efforts. We currently have ~9 issues remaining, of which 2 are required for completing this effort. Once those 2 issues (1, 2) are complete, we are thinking of closing out the Engineering Allocation for Staging Canary, and the rest of the work can continue as part of Quality Engineering's Deploy with Confidence effort. Please let us know your thoughts.