I've created this issue for us to provide async updates on a weekly basis. FYI @meks, @niskhakova, @zeffmorgan, @amyphillips, @pguinoiseau, @ibaum. Please feel free to provide your update for the week (actively worked on or delivered). I assume this would make it easier for Mek
to communicate in daily standups.
Agreed that it brings more paperwork, but I'm not sure what the best way to provide updates is. The reason for moving this out of the main epic was to reduce potential noise for everyone. If that's not a concern, I'm fine with using the main epic for updates.
@meks could you please clarify by what date you usually need the latest update for the Mixed env? That way we can post weekly results in line with it.
@meks did you by chance decide where you would like to have updates posted? Would you like me to create a new discussion in the main epic as discussed? If so, I'll just name it Quality Engineering Status Updates or something similar and we can post our updates in there. Also happy to add updates in the meeting doc itself if you prefer.
What would you all prefer? @niskhakova it doesn't appear the epic has been getting much more noise recently. I don't know if that's because discussions have moved off into the linked issues and sub-epics.
Let us know @meks and we'll get updates posted for you for the meeting Monday.
@niskhakova @zeffmorgan noted, let's keep it here. Also, for the SSoT, let's make it inclusive of progress outside of QE as well. There is also the use case from development and other updates from Infra.
If we can have an update before Monday 7:30 AM PST, that would be great.
What's done: Great news here, we were able to configure the Staging Ref environment to upgrade using custom GitLab images from private GitLab repos (pre-release and dev.gitlab.com). Many thanks to @pguinoiseau for all the help here.
@niskhakova to facilitate using this update issue as an SSoT, I'm creating a single discussion for QE-specific activity so our updates can be added as responses and grouped together under QE. Other teams can follow suit. What do you think?
Please see the previous update for my schedule. This past week I was only in on Tuesday, when I worked on technical interviews and debugging. The schedule from 2021-10-12 still stands.
I have discovered some coupling that should be refactored, but I should be able to roll that into my existing work.
We only have one bidirectional test running at the moment as a scheduled job. However, once the above items are complete, we will have a significant amount of coverage added, giving us a confidence level similar to our current SMOKE and RELIABLE tests for the Migration failure mode of mixed deployment.
Timeline: I expect to jump from minimal coverage to Migration failure mode coverage equivalent to that of all of our existing SMOKE/RELIABLE tests by Wednesday, 10/27, if not sooner. All of the above development work will be completed by today, with the only delay being reviews and releasing the updated version of the GitLab QA gem.
Thank you @zeffmorgan, this is quite a lot of technical verbosity to roll up into our standup update. I would appreciate it if this could be simplified for reporting up to stakeholders. I am looking for a definitive date when we can confidently say that mixed-deployment errors are being guarded against/tested before we deploy to production.
I expect to jump from minimal coverage to Migration failure mode coverage equivalent to that of all of our existing SMOKE/RELIABLE tests by Wednesday, 10/27
This seems key. So we will be blocking deployments with mixed-deployment tests by 10/27? Is this also accounting for the below?
GitLab-QA pipeline for Staging Ref - added the Staging Ref QA pipeline and working on adding a new GitLab QA scenario for this environment. It should be done by the end of the week. All details are in the issue description.
Next steps:
Proceed with the issues above and also prepare the list of issues that are left to be done in Q4.
We are still on track for gaining the coverage I mentioned previously by mid-week this week. This is to gate deployments to canary.staging.gitlab.com. This will provide some coverage specifically for the migration failure mode in a mixed deployment environment. I am guessing that the sum of our existing smoke and reliable tests may cover from 20%-40% of our existing schema, which is what we want to account for within this specific failure mode, but that is a very rough estimate. Follow up analysis and additional tests will likely be warranted.
Due to the scope of the mixed deployment environment, we do not yet have coverage estimates or definitive dates to achieve a percent of coverage for the other failure modes. I'm working on a more definitive list of work that needs to be accomplished to address the entirety of this large project to help align everyone on the actual scope of work and will share that this week.
The deployment pipeline changes are quite substantial so I expect them to take several milestones to complete safely. At the moment we're setting up the epic - gitlab-com/gl-infra&608 (closed) and have started work to create a Gitaly node for gstg-canary - gitlab-com/gl-infra/delivery#2087 (closed)
All initial dev work on QA side to introduce coverage for mixed deployment is complete, both for introducing existing tests and for mixed deployment environment (MDE) tests
There is a cascade of dependent MRs that are either merged or in the review process. I am currently shepherding the remaining MRs through review to expedite enabling tests within the new staging-canary environment. Once these are merged, hopefully today, we will be able to add all of the above changes to deployer.
Reviews and analysis from last week helped determine a need for some refactoring to both simplify our existing approach and speed up our ability to reintroduce existing tests to benefit from them within the MDE context (their implementation can help discover some issues)
One caveat we hadn't really discussed before, related to the pipeline deployment changes currently being worked on: regardless of whether we execute MDE tests (or leverage existing smoke/reliable tests) to help with MDE coverage, we can provide no guarantee that we will catch MDE failures. This is because we cannot guarantee the state of the system we are testing until the deployer changes are complete.
An additional caveat is that we could conceivably block deployments to staging-canary based on system state that will never be deployed to production. I think this risk is minimal, but it should be noted.
Refactoring completed to enable blocking deployments. Waiting on a test to be quarantined to reduce pipeline noise before it is enabled. Next steps are to merge maintainer approved pipeline changes, announce pipeline changes, and enable deployer changes. This work is already in place and just needs to be toggled on once the test is quarantined.
Create access tokens for Staging Ref automatically - closed as done. Access tokens will be created automatically if the environment is rebuilt. All Staging Ref credentials are now stored in the GitLab-QA and Engineering 1Password vaults.
!74231 (merged) - to account for Staging Ref in E2E tests. Before this change, tests marked with `only: :staging` also ran against Staging Ref due to a small issue in the E2E context selector regex (a simplified sketch of the matching follows this update).
Will proceed to work on Improve load testing for staging-canary - learn more about the current Staging architecture, try to run the crawler with lower RPS, and monitor gstg health.
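For anyone curious, here is a simplified, hypothetical sketch of the kind of hostname matching involved. It is not the actual gitlab-qa context selector code, and it assumes the `staging-ref.gitlab.com` hostname; the point is that a regex anchored only on `staging` also matches Staging Ref, while matching the full staging host does not.

```ruby
# Illustrative only: not the real E2E context selector implementation.
require 'uri'

module ContextSelectorSketch
  # Too loose: matches staging.gitlab.com AND staging-ref.gitlab.com.
  LOOSE_STAGING = /staging/
  # Tighter: matches the staging host exactly.
  STRICT_STAGING = /\Astaging\.gitlab\.com\z/

  def self.staging?(address, pattern: STRICT_STAGING)
    URI(address).host.match?(pattern)
  end
end

ContextSelectorSketch.staging?('https://staging.gitlab.com')     # => true
ContextSelectorSketch.staging?('https://staging-ref.gitlab.com') # => false
ContextSelectorSketch.staging?('https://staging-ref.gitlab.com',
                               pattern: ContextSelectorSketch::LOOSE_STAGING) # => true (old behaviour)
```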
We are currently burning in the additional test pipelines triggered by deployment for Staging-Canary, to allow infrastructure to collect information on the impact in this environment.
These pipelines are triggered by deployer.
These provide a level of coverage for the schema-related failure mode equivalent to our current level of smoke and reliable end-to-end tests. Once satisfied with the impact on top of the existing test jobs, infrastructure will change these to blocking jobs.
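For context on what "triggered by deployer" means mechanically, here is a rough, hypothetical sketch of kicking off a downstream QA pipeline via the pipeline trigger API. The project ID, token, variable names, and the ops.gitlab.net host are placeholders/assumptions; the real integration lives in the deployer tooling.

```ruby
# Hypothetical sketch: trigger a downstream QA pipeline after a deploy.
# Project ID, token, host, and variable names are placeholders.
require 'net/http'
require 'uri'
require 'json'

def trigger_qa_pipeline(project_id:, trigger_token:, ref: 'master', variables: {})
  uri = URI("https://ops.gitlab.net/api/v4/projects/#{project_id}/trigger/pipeline")

  params = { 'token' => trigger_token, 'ref' => ref }
  # Trigger variables are passed as variables[KEY]=VALUE form fields.
  variables.each { |key, value| params["variables[#{key}]"] = value }

  JSON.parse(Net::HTTP.post_form(uri, params).body)
end

# e.g. after deploying to staging-canary:
# trigger_qa_pipeline(project_id: 123, trigger_token: ENV['QA_TRIGGER_TOKEN'],
#                     variables: { 'QA_ENVIRONMENT' => 'staging-canary' })
```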
Focusing on Improve load testing for staging-canary - specifically planning to enable the crawler once more and track down any 500 errors to understand what the limiting factor is for the current load testing on Staging.
Continue work on Improve load testing for staging-canary - will monitor error rates on Staging further and will work on adding an agent cookie to avoid 429 errors on web calls.
Continuing work on adding API functionality for expanding test coverage
Expanded due to gitlab-com/gl-infra/delivery#2113 (closed). This will enable more coverage for failure modes that involve both Gitaly and Praefect. Further exploration and planning are required to implement this in a way that does not require rewriting specific tests but can leverage our existing suites (see the sketch at the end of this update).
Next:
Re-evaluate failure mode analysis with existing coverage and tooling to map out next actions
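As a rough illustration of the "leverage existing suites instead of rewriting tests" idea referenced above (hypothetical structure, not the actual suite), RSpec metadata lets the same specs be selected for a mixed-deployment run without touching their bodies.

```ruby
# Hypothetical sketch: reuse an existing spec for mixed deployment via metadata tags.
require 'rspec/autorun'

RSpec.describe 'Repository push', :reliable, :mixed_deployment do
  it 'pushes a commit through Gitaly' do
    expect(true).to be(true) # placeholder for the existing test body
  end
end

# A mixed-deployment run can then select existing specs by tag, e.g.:
#   rspec --tag mixed_deployment
# so Gitaly/Praefect failure-mode coverage comes from the same suite rather than new tests.
```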
I will be out on Monday and Tuesday next week, so I'm posting my update today instead of Monday.
What was done:
Improve load testing for staging-canary - load emulation has now been running against Staging for more than a week and the QA pipelines were stable. Analysed request errors in the cmbr API and Web pipelines, updated cmbr to fix 429 and 503 errors, and wrote a plan for the next improvements - #338978 (comment 764287105).
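For reference, a generic sketch of the kind of 429/503 handling involved; this is not the actual cmbr change, just the general retry-with-backoff technique, honouring `Retry-After` where the server provides it.

```ruby
# Generic sketch of retrying 429/503 responses with backoff; not the actual cmbr code.
require 'net/http'
require 'uri'

RETRYABLE_CODES = %w[429 503].freeze

def fetch_with_backoff(url, max_attempts: 5)
  uri = URI(url)

  max_attempts.times do |attempt|
    response = Net::HTTP.get_response(uri)
    return response unless RETRYABLE_CODES.include?(response.code)

    # Honour Retry-After when present, otherwise back off exponentially.
    sleep((response['Retry-After'] || 2**attempt).to_i)
  end

  raise "Gave up fetching #{url} after #{max_attempts} attempts"
end
```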
gitlab-com/gl-infra/delivery#2113 (closed) ended up being a discussion around infra being ready to start on pipeline reordering. As a result, I changed gears last week and started on the following to unblock infra's next steps:
Currently testing these changes out and pairing with another SET with more experience around Praefect/Gitaly to make sure we don't miss anything with the new environment configuration. This is to be completed this week or sooner.
We had to modify our approach to account for the single integration point we have with CustomersDot and the Zuora sandbox, to be able to shift Fulfillment tests into staging-canary as well.
All QA test schedules are set up for staging-canary (manual schedules - see below).
Currently:
All QA test suites are first being tested manually, staggered with the existing jobs currently running against staging to avoid unnecessary saturation that could lead to flakiness. These runs should be completed today, and the additional changes for the CustomersDot and Zuora integration merged.
This will give us a solid burn-in period for infra's expected reordering work in January.
Trigger Staging Ref QA pipeline during deployment process - investigated and resolved QA failures in the full pipeline; specifically, some tests ran that should not have been run against a live environment like Staging Ref. The full QA pipeline is now green. Linking the test session to the Release task will be a bigger issue, as it requires refactoring from the Delivery team; see gitlab-com/gl-infra/delivery#2168 (closed)
Continue work on Improve load testing for staging-canary - sync with the Infra team to understand how the error rate SLIs work and what the next steps are to enable alerts on Staging for SLI breaches now that load testing is running.
Experienced some delays affecting our ability to execute tests due to time off and package-and-qa problems that have since been resolved, but were able to run enough tests to identify some necessary maintenance in gitlab-qa (issue referenced below)
Except for the two identified test jobs that are problematic, testing and reporting have both been successful in the full QA suite of tests on staging-canary
Next:
Complete the investigation into the root cause of failures in a very small subset of specs and implement a fix in the gitlab-qa gem.
Improve load testing for staging-canary - working on adding custom cookie support for the webcrawler to load gstg-cny specifically if needed. Cleaned up next steps for load emulation on Staging. There is now &7320 (closed) to track Quality work for load testing, and the pairing Infra epic gitlab-com/gl-infra&668 to track the Infra team's related work.
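For reference, a minimal sketch of the custom cookie idea, assuming the standard `gitlab_canary=true` routing cookie is what directs staging web requests to gstg-cny; the crawler's actual implementation may differ.

```ruby
# Minimal sketch: send the canary routing cookie so web requests land on gstg-cny.
# Assumes the staging load balancer honours the standard gitlab_canary cookie.
require 'net/http'
require 'uri'

def get_via_canary(url)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)
    request['Cookie'] = 'gitlab_canary=true'
    http.request(request)
  end
end

# get_via_canary('https://staging.gitlab.com/explore')
```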
Pipeline Triage DRI this week, so I may have more limited availability for ongoing work. Started validation of the Runner Helm install on GKE with the new containerd base last week for GKE's rollout of containerd, which also impacted availability.
Improve Staging Ref deployment stability - added retries to test whether that helps with intermittent issues. To reduce noise for the Delivery team, deployment failures won't be sent to announcements while Staging Ref deployment stability is being worked on.
No significant progress to report this past week due to Pipeline Triage, containerd testing project for runner on GKE, and Q1 OKR planning for Runner Reference Architectures.
Two MRs here are in review by infrastructure. This documentation made apparent a possible discrepancy in how our subdomain for staging-canary functions, which infrastructure is investigating before pushing through.
Configure Sentry for Staging Ref - it turned out configuring Sentry shouldn't take long, and the Infra team created a new Sentry project for Staging Ref. I started working on the configuration updates to enable Sentry.
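For illustration, a minimal sketch of the Omnibus-side settings involved, assuming the standard `gitlab_rails` Sentry keys in `gitlab.rb`; the DSNs are placeholders, and since Staging Ref is hybrid the Kubernetes side would need equivalent chart values.

```ruby
# gitlab.rb sketch (placeholder DSNs); the hybrid environment's real config is managed elsewhere.
gitlab_rails['sentry_enabled'] = true
gitlab_rails['sentry_dsn'] = 'https://<backend-key>@sentry.example.com/<project-id>'
gitlab_rails['sentry_clientside_dsn'] = 'https://<frontend-key>@sentry.example.com/<project-id>'
gitlab_rails['sentry_environment'] = 'staging-ref'
```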
Improve Staging Ref deployment stability - switched to using another admin user in post-configure to resolve 403 errors. So far there have been no occurrences of the user-blocked error.
Improve Staging Ref deployment stability - updated the issue with the latest status. The 403 and other errors were resolved; the one that's left is a transient 502 error in the post-configure steps.
Not tracked in an OKR, but maybe helpful to highlight that Staging Ref is now being used more by engineers:
Sorry for the lack of updates from my side; progress was slow as we were evacuating, and unfortunately I still have limited availability. But there were some good updates during this time:
What was done:
Setup Geo site for Staging Ref - the Geo EU site was configured, and deployment to Staging Ref has gone through the primary and secondary sites for over a month now.
Configure an asset proxy for Staging-ref - the Infrastructure team made a lot of progress on this, however the work is now on pause: the asset proxy needs to run in Kubernetes since Staging Ref is hybrid, so we first need to wait for camoproxy for GitLab.com to be migrated into Kubernetes. More details in #356044 (comment 905217050).
Set up GitLab-QA pipeline for the new Staging Ref Geo site - release a new GitLab QA version and hopefully finish all the work needed to configure the Geo QA schedule. Will also update the current gstg Geo QA configs, as they look to be deprecated now.
Some previously unknown technical requirements made it necessary to reach out for additional assistance from other engineers. I have one more pairing session on 6/8 (possibly 6/7) that should enable us to get this into final review and completed this week, before our 6/15 start of the next efforts.
Next steps:
Review planning for next iteration of work to improve test environments to determine if any additional assistance is required
I've had three different pairing sessions that have proved unfruitful and have enlisted the assistance of another engineer for another one today. I believe I have simplified this as much as possible and have one more path out of multiple options to debug. Should have this in review again today.
Next steps:
Review planning for next iteration of work to improve test environments to determine if any additional assistance is required
Discovered a shared example that was failing that we thought was unrelated. Did a deep dive to make sure, and realized there was some monkey patching that ended up passing an unexpected HTTP verb. I was able to address it at the API level and am moving this back into review, hopefully for the last time. I feel confident with this last change, as I doubled my testing efforts to ensure it was working appropriately, and it's being reviewed now.
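To make the failure mode concrete, here is a contrived illustration (not the actual code): reopening a shared request helper can silently change the HTTP verb that a shared example ends up exercising, which is why the fix was made at the API level instead.

```ruby
# Contrived illustration of the failure mode; not the real code.
module ApiClient
  def self.request(verb, path)
    "#{verb.to_s.upcase} #{path}" # stand-in for the real HTTP call
  end
end

# Elsewhere, a "helpful" monkey patch reopens the module and re-routes everything:
module ApiClient
  def self.request(_verb, path)
    "POST #{path}" # every shared example now exercises the wrong verb
  end
end

ApiClient.request(:put, '/api/v4/projects/1') # => "POST /api/v4/projects/1"
```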
No progress made while I was out. I'm just back in the office from having to take some unexpected time for illness and am addressing a corrective action from a production incident that occurred while away. I will post another update this week as I get caught up and wrap up my current open items.
Coming back last week I ended up on pipeline triage again. It was a short turnaround due to the creation of the new 3-person schedule, so my efforts were focused there and on catching up on runner-related incidents, while still recovering from an acute illness. I am back this week at 100% and will finish off these couple of remaining tasks.
It is really disappointing that my update didn't get posted yesterday evening when I told @vincywilson it was ready. It seems the system was struggling a bit in the 10 PM EST hour. I'm working some later shifts due to sleep issues after COVID and didn't see her notification that the thread failed to update until later today (also due to Clockwise turning off notifications, which I've now disabled). This is a second attempt.
With the blocking issue resolved, I can now clean up and push through the last remaining functional change to improve our mixed deployment testing approach. While it currently works without this improvement, this will add additional lower level coverage and enable future API-only testing for phase 2. I was able to unblock this without any in-depth functional refactoring by changing the approach to passing these cookies within the blocked MR, working around the incompletely documented third-party dependencies responsible for the malformed cookies.
I also identified the references that could be improved with this approach, documenting them in this issue; that work will also be complete this week.
I've been able to address the needed changes and am finishing up documentation requests from reviews. They will be in tomorrow when I return (I'm out today for a child's medical appointment out of town), and then I can complete updating the few references.
These improvements wrap up planned staging-canary work. The pipelines and test environment work for staging-canary are working as planned and providing mixed deployment environment coverage.
I've had a significant number of surprises, so I've leaned heavily on my reviewers and they are finishing up. I greatly simplified this approach as well, to make the work more maintainable for future iterations.
These improvements wrap up planned staging-canary work. The pipelines and test environment work for staging-canary are working as planned and providing mixed deployment environment coverage.
Test Data - Snowplow access in Staging Ref - clarify with the team whether we can automate the Snowplow configuration change and get feedback on whether the current setup works as expected.
Delivery are close to completing the refactor of QA triggers on the deployment pipeline. This work will allow gstg-cny and staging-ref tests to be triggered and results reported.
@meks The team has made immense progress on Staging Canary and is close to winding down all remaining efforts. We currently have ~9 issues remaining, of which 2 are required for completing this effort. Once those 2 issues (1, 2) are complete, we are thinking of closing out the Engineering Allocation for Staging Canary, and the rest of the work can continue as part of Quality Engineering's Deploy with Confidence effort. Please let us know your thoughts.