Goal: Emulate activity on Staging by using a load tool. This should help increase traffic on Staging and allow setting up alerts on increased error rates using SLIs. The Infrastructure team uses Grafana dashboards for SLI monitoring.
Task
Familiarize with the cmbr webcrawler and other similar tools.
Analyse if cmbr caused intermittent 500 errors in GitLab QA and if so, explore if it's possible to reduce its load further until stg-cny is built. - #338978 (comment 752263239)
Quality related work &7320 (closed) to streamline load simulation efforts
Infra team related work: gitlab-com/gl-infra&668 to improve the precision of staging service-level monitoring alerts and set up incident review processes
@meks @vincywilson can you please clarify what the goal of performance testing against Staging Ref is? The implementation will differ depending on the goal. I'm adding more details on this below.
As far as I understand, the goal is to emulate production load based on previous discussions (&6401 (comment 641084667)); on the other hand, there was a point (&6401 (comment 655550297)) that GPT will be used for performance testing before we release to production. So I think it's better to clarify the end goal.
My preference: if we need to run performance tests, it would be better to use them to emulate production load/noise rather than as a gateway for deployment. I've written down my thoughts on each approach and the problems it has.
1. Emulate production load
Thoughts, problems and open questions:
We need to analyze, using Production data, which endpoints should be loaded and at what RPS for Staging Ref.
GPT may not be best suited for this purpose, as it's mostly designed to run a test and report results rather than generate constant load. Perhaps it's better to explore other options (traffic-generator, cmbr, etc.).
Clarify what data should be set up on the environment
We need to explore how to automate the performance setup (seeding test data, triggering test runs) when Staging Ref is rebuilt.
2. Performance testing gateway before production
Thoughts, problems and open questions:
A full GPT run takes about 90 minutes. Looking at the deployment announcement channel, a new gstg-cny deployment is sometimes started less than 60-70 minutes after the previous one. By the current design, Staging Ref will be wired to the Staging Canary deployment, so if we trigger a full GPT run on each deployment, the results will be skewed by an upgrade happening mid-run.
Performance tests will affect GitLab QA pipelines: the high load will cause more flakiness and bad responses from the server. The future plan is to make the Staging Ref QA pipeline blocking as well, and such intermittent issues will add more noise to the deployment process.
Performance results won't be relevant or "clean" because it's a noisy, shared environment: someone can be working on features and generating additional data/load, which will skew test results.
Related to the point above, we already have daily performance pipelines against all reference architectures, which give lab-like results. These have already helped catch performance degradations and Omnibus configuration issues. A new pipeline won't add much value beyond running more frequently than daily, and there are caveats to that as mentioned above.
Thank you @niskhakova for the two approaches and the potential challenges that come with them.
As far as I understand, the goal is to emulate production load based on previous discussions (&6401 (comment 641084667)); on the other hand, there was a point (&6401 (comment 655550297)) that GPT will be used for performance testing before we release to production. So I think it's better to clarify the end goal.
Great callout, Nailia. I will let @meks chime in to make sure that we are all aligned on the requirement.
Emulate production load
Given that the Infra team is already running staging load tests, we might be able to gather information on which endpoints and what RPS would be needed. That said, if GPT is not the right tool for this task, I would suggest exploring other options.
Performance testing gateway before production
The purpose of Staging Ref environments is that they can be spun up and torn down on demand. In that case, we should be able to spin up a Staging Ref 10k and potentially use it only for performance tests while the GitLab QA tests run on another environment or the current Staging environment, provided we have the latest code deployed on both. This should help reduce noise. My apologies if I am mistaken here.
These are my two cents, but I'm looking forward to hearing Mek's vision.
The purpose of Staging Ref environments is that they can be spun up and torn down on demand. In that case, we should be able to spin up a Staging Ref 10k and potentially use it only for performance tests while the GitLab QA tests run on another environment or the current Staging environment, provided we have the latest code deployed on both.
Yes, a one-off performance test could be done; however, we would need to make sure that no upgrade is triggered during that time and no one is using the environment in a way that could skew the results. If there is a need to run performance tests against a specific GitLab release, we may be better off spinning up a clean new environment and running the tests there.
Update the current issue to be a discussion point to finalize our approach on load and performance testing for Staging environments.
Create an epic for load testing on Staging based on Goal#1 and Goal#2 listed below.
Create issues based on the steps below.
Close the current issue as done.
Iterate on Goal#1 first.
Work on Goal#2.
Goal#1. Load testing to emulate production activity
Goal: Emulate activity on Staging and Staging Ref by using a load tool. This should help increase traffic on Staging and allow setting up alerts on increased error rates using SLIs. The Infrastructure team uses Grafana dashboards for SLI monitoring (Web/API).
Thoughts: I think using a webcrawler for this purpose has the nice benefit that it can help catch bugs that may be missed with performance load scenarios. The crawler goes through URLs randomly, so it can highlight unexpected issues, whereas a usual performance tool requires a hardcoded scenario with specific endpoints. Perhaps we can explore using both as complements. Since cmbr has already shown that it's helpful (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10771#note_710535936), we can proceed with it as a quick iteration.
Familiarize with the cmbr webcrawler and other similar tools.
Tweak cmbr to resolve 429 errors, explore supporting multiple users for crawling.
Explore other tools and whether it would be helpful to also create a custom GPT scenario for the most popular endpoints used on Production.
Set up test data on Staging Ref. I think we can use GPT Data Generator as an initial step. We can automate its run when Staging Ref is rebuilt.
Set up the webcrawler for Staging Ref - this is blocked until monitoring for Staging Ref is set up (technically we could add it now, but if no one watches the metrics, it won't make sense).
Goal#2. Explore running performance test against Staging Ref
Goal: Provide performance results for deployments. Set up a CI schedule to trigger GPT runs. Initially it can run overnight, and if needed it can be triggered manually during the rest of the day.
Thoughts: I understand the concern that we need to validate GitLab releases before they reach Production. However, I'm still hesitant to add full performance tests to Staging Ref: we can't guarantee that results will be "clean" since it's a shared environment, and we can't predict when a new deployment is triggered (it can start in the middle of a 90-minute GPT run); more details on that in #338978 (comment 706823559). It may be helpful to look into https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/751 to create a lab-like environment for performance testing in hybrid environments.
Set up test data and automate it so that if the environment is rebuilt, the data is reseeded as well.
As a first iteration explore triggering nightly performance tests against Staging Ref.
Validate test results.
Wire test results report to Release Task for the specific GitLab version.
@niskhakova I am in favor of keeping this as a discussion issue where we finalize the load and perf testing approach. For now, I suggest we create 2 issues for load and perf testing, and as we create more issues, we can think about promoting this to an epic.
@niskhakova thanks for the approach, I am good with Goal#1: Load testing to emulate production activity.
Since staging ref will not be blocking deploys, I suggest we change this issue to run cmbr on the existing staging, not staging-ref, as the first iteration.
existing staging already has monitoring
existing staging is blocking deploys
We want to stay objective: we are increasing load to ensure quality, security, and reliability.
An area to be discussed: should this new load be directed toward staging-canary or staging? This needs to be coordinated with the pipeline re-ordering effort gitlab-com/gl-infra&608 (closed).
I suggest starting small and turning up the load. I support pausing any load testing on staging-ref for this work.
P.S. I updated this issue to capture your detailed tasks but switched to staging-canary/staging instead. Please step on my toes and polish further.
Familiarize with the cmbr webcrawler and other similar tools.
Tweak cmbr to resolve 429 errors, explore supporting multiple users for crawling.
Explore other tools and whether it would be helpful to also create a custom GPT scenario for the most popular endpoints used on Production.
Set up test data on Staging & Staging-Canary. I think we can use GPT Data Generator as an initial step.
We can automate its run when Staging Ref is rebuilt.
Set up webcrawler for Staging & Staging-Canary
Ensure additional monitoring for Staging & Staging-Canary is set up (technically we could add the crawler anyway, but if no one watches the metrics, it won't make sense).
Analyse what would be the best approach to perform load testing on Staging Ref environment.
If these are application errors, let's create issues and have development/product prioritize them. We have support for getting this improved and should not back off on testing.
Fine-tune the crawler further, like a gas stove dial: start with 0.5, then increase to 1 and beyond.
@meks thanks for the feedback and updating this issue. Will share some thoughts below:
Since staging ref will not be blocking deploys, I suggest we change this issue to run cmbr on the existing staging, not staging-ref, as the first iteration.
My understanding is that it doesn't matter in this case whether Staging Ref blocks deployments. When emulating activity we don't look at performance results; this traffic is used to monitor error rates and SLIs. Once Staging Ref has monitoring configured via https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14047, it will fire an alert for the team if the error rate increases, and the on-call engineer will investigate.
Another problem is that it looks like we can't run any type of load test against the current Staging: even at a small 1 RPS, cmbr affected GitLab QA pipelines and deployments and was turned off; please see the discussion in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10771#note_718667622. It still needs to be clarified and verified whether cmbr indeed affected QA or there was another reason; however, since it was disabled there have been no transient 500 errors.
We may need to wait for the cny-stg environment to be rebuilt before running any traffic against Staging, to avoid breaking deployments.
So, to keep it short, we have 2 problems now that are blocking load emulation:
The current Staging can't handle the load, and we need to wait for the new, more performant Stg-Cny before setting up the webcrawler for it; however, Staging does have monitoring set up.
Staging Ref can handle the load, but it doesn't have SLI monitoring set up.
In the context of the current task, Improve load testing for staging-canary, I suggest adding some new steps:
Familiarize with the cmbr webcrawler and other similar tools.
Analyse if cmbr caused intermittent 500 errors in GitLab QA and if so, explore if it's possible to reduce its load further until cny-stg is built.
Tweak cmbr to resolve 429 errors, explore supporting multiple users for crawling.
Explore other tools and whether it would be helpful to also create a custom GPT scenario for the most popular endpoints used on Production.
Set up test data on Staging & Staging-Canary.
Set up webcrawler for Staging & Staging-Canary
Verify that SLI monitoring for Staging & Staging-Canary is set up
Validate that GitLab QA pipelines are not affected
I looked closer at cmbr and the period of time when it was enabled (~2021-09-20 8:45 UTC) and disabled (~2021-09-22 14:20 UTC):
Based on the Grafana graphs (web, api): during the cmbr run, the kube_container_memory, kube_pool_cpu, kube_pool_max_nodes, and kube_go_memory components were saturated to 100% or stopped reporting their metrics - see the image below, there is a distinctive gap:
Looking at the cmbr CI config, it appears the crawler ran at 1 RPS. That's already quite a low throughput to start with; however, it looks like there is a Delay setting in Colly's LimitRule - maybe we can add a 1-second delay with concurrency 1 so that we can check how it goes at roughly 0.5 RPS?
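I'm not sure exactly how cmbr wires this up, but a minimal Colly sketch of that throttling configuration would look roughly like this (the host and start URL are placeholders, not cmbr's actual settings):

```go
package main

import (
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("staging.gitlab.com"), // placeholder host
	)

	// One request at a time with a fixed 1s delay between requests.
	// Together with typical response times this should land around 0.5 RPS.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 1,
		Delay:       1 * time.Second,
	})

	c.Visit("https://staging.gitlab.com/explore")
}
```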
@andrewn could you please review the above and help with these questions below:
Would it make sense to add more resources to Staging to resolve the above saturation issue? Is there another way to fix it temporarily? Temporarily, because hopefully in Q1 all staging traffic will be routed to Staging Canary, which will have more resources (gitlab-com/gl-infra&608 (closed)).
What is the minimum RPS that would be enough to potentially trigger an error rate ratio alert? Would it be helpful to try running at 0.5 RPS using the Delay configuration in LimitRule?
Who monitors the error rate and other SLIs in Staging? Is there documentation on how alerts are triggered and how this monitoring process is set up? I tried searching the handbook and found Incident Management and Monitoring of GitLab.com, but I don't see specific links to projects and alert configurations there.
I'm also encouraging us to start small, even with staging (gstg), and not wait for the full staging-canary build-out. If we are getting 500s and know this can be fixed in the product (e.g. by refactoring queries to be more performant), we should bring up that visibility now and push for product excellence from Quality & Infra together.
@meks thanks for mentioning this! I agree that we need to push for performance improvements on queries, but unfortunately in this case it's not very clear which specific 500s during the cmbr run were caused by it and which had other causes. I think we can try starting with a smaller RPS, closely monitoring staging for any instability, and analysing 5xx errors once they appear.
Related to the above, it would be great to know what architecture and resources the current Staging has, to better understand its limitations. If gstg is under-provisioned, it's probably more of a configuration issue than a GitLab application problem. @andrewn could you please also share where to find this information about the Staging architecture?
Based on the Grafana graphs (web, api): during the cmbr run, the kube_container_memory, kube_pool_cpu, kube_pool_max_nodes, and kube_go_memory components were saturated to 100% or stopped reporting their metrics - see the image below, there is a distinctive gap:
Looking at those graphs, I see the gap, which indicates a lack of data, but I don't see anything hitting 100%. The 60% saturation value for memory is relatively normal.
We should try and understand why there is a gap in the data, although there's a good chance that this is a Thanos/Prometheus problem rather than an issue with the application. The source of those saturation metrics is Kubernetes observability data, so even if the application were misbehaving, we should still receive data (which we did not).
Would it make sense to add more resources to Staging to resolve the above saturation issue? Is there another way to fix it temporarily? Temporarily, because hopefully in Q1 all staging traffic will be routed to Staging Canary, which will have more resources (gitlab-com/gl-infra&608 (closed)).
None of these saturation metrics appear to me to be outside their normal range. If anything, the 60% reported is quite low for memory.
What is the minimum RPS that would be enough to potentially trigger an error rate ratio alert? Would it be helpful to try running at 0.5 RPS using the Delay configuration in LimitRule?
We could do this, but we're already at a rate well below what we need for good data. For Production, we don't alert on services running at below 1rps as the quality of the data is not good enough.
Who monitors the error rate and other SLIs in Staging? Is there documentation on how alerts are triggered and how this monitoring process is set up? I tried searching the handbook and found Incident Management and Monitoring of GitLab.com, but I don't see specific links to projects and alert configurations there.
At present, nobody does. Until we can get more traffic to staging, the quality of the signal is too low. This really gets to the crux of why I want to generate this traffic in that environment: once we improve the volume of traffic, I believe that our metrics data will improve and we'll be able to consider monitoring it.
I'm also encouraging us to start small, even with staging (gstg), and not wait for the full staging-canary build-out. If we are getting 500s and know this can be fixed in the product (e.g. by refactoring queries to be more performant), we should bring up that visibility now and push for product excellence from Quality & Infra together.
From the point of view of the traffic generated by CMBR, we saw a few 500s, but these have been reported on already. The application generally remained relatively responsive and functional. The reason we turned it off was because QA jobs were failing. I think the key to understanding why the CI jobs were failing is to look into the actual failures on that side, specifically in gitlab-org/quality/testcase-sessions#28426 (closed). Was there a common pattern in these failures? If so, this might be the key to finding a quick solution to slightly improve the scalability of the Staging environment, so that we can 1) Run QA and 2) Generate enough traffic to provide a quality SLI signal.
Reading through the Slack thread, my gut feel is that this may be a Praefect or Gitaly issue. Perhaps if we could isolate the problem, we could provision one or two more Gitaly nodes in staging to resolve the problem.
Looking at the Gitaly nodes in staging, it seems like we may need to provision faster disks? The disk_sustained_write_iops saturation metric on some of the Gitaly nodes is pinned at 100%. If this is accurate, it would definitely be a problem.
What's interesting about this is that CMBR shouldn't be performing any operations on Gitaly that would cause a write on disk, so why do we see so much write activity?
What's interesting about this is that CMBR shouldn't be performing any operations on Gitaly that would cause a write on disk, so why do we see so much write activity?
Thinking about this statement a little more, there is in fact at least one operation that CMBR would invoke that would generate writes on a Gitaly node: GetArchive.
This is how many of those calls were made during the period.
@andrewn Many thanks for the additional context and analysis on this! I really appreciate the detailed information and links; it helps me learn where to look when analysing these types of errors.
Some major takeaways for me:
~~There are no SLIs on Staging currently and no one monitors it as the signal is too low~~ We have SLIs and SLOs for staging, but we chose not to act on the signals
The gap in the data in #338978 (comment 731111519) was caused by Prometheus/Thanos and not by the application; however, there is an issue with Gitaly disks on Staging.
A rate below 1 RPS may not be sufficient to produce a good-quality signal.
I will list some action items based on the discussion above:
1. Analyse further 500 errors in the QA report and see if there is a pattern.
2. Clarify if we need to provision faster disks, as Gitaly nodes in Staging are saturated (#338978 (comment 748749746))
3. Turn off downloading archives from cmbr until Gitaly disks issue is resolved (#338978 (comment 748768951))
Regarding step 1 - I tried to search for 500 errors when doing the initial analysis in #338978 (comment 731111519), but unfortunately couldn't fetch the required data as it was quite a while ago, and the Kibana/Sentry search attempts weren't successful.
What do you think about enabling it once again to closely analyse any errors as they appear? I'm also the secondary on-call DRI for QA pipelines this week, so I will keep an eye on how it goes and alert our primary DRIs when the crawler is enabled. If it's OK with you, could you please clarify how you triggered and monitored cmbr the first time? I see pipelines with 55-minute durations in CI - is that how you ran it?
Also, I was not able to make cmbr run against my local GitLab Docker instance. I assume it can't resolve my local URL. I'm not familiar with Go and Colly, so I'm probably missing something here. I will continue to look into it and learn more.
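For reference, here is a minimal sketch of pointing a Colly collector at a local instance, assuming the blocker is the collector's AllowedDomains filter not matching the local host (the URL, port, and domain list are illustrative and may not match cmbr's configuration):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Illustrative values for a local Docker instance; cmbr's real
	// configuration may differ.
	c := colly.NewCollector(
		colly.AllowedDomains("localhost", "127.0.0.1"),
	)

	// HTTP-level failures (connection errors, non-2xx responses) end up here.
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("request failed:", r.Request.URL, err)
	})

	// Domain-filtering problems (e.g. the local host not being listed in
	// AllowedDomains) are returned synchronously by Visit.
	if err := c.Visit("http://localhost:8080/explore"); err != nil {
		fmt.Println("visit error:", err)
	}
}
```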
There are no SLIs on Staging currently and no one monitors it as the signal is too low
This may seem like nitpicking, but to be clear, we have SLIs and SLOs for staging, but we chose not to act on the signals.
I will list some action items based on the discussion above:
Thank you for this comprehensive list. This looks great!
For point 3, I would be happy to send an MR to do this.
Also, I was not able to make cmbr run against my local GitLab Docker instance. I assume it can't resolve my local URL. I'm not familiar with Go and Colly, so I'm probably missing something here. I will continue to look into it and learn more.
Update on action item 1 (Analyse further 500 errors in the QA report and see if there is a pattern) mentioned in #338978 (comment 749637965):
After enabling cmbr yesterday and today (2021-12-02 - 2021-12-03), I didn't see an increased error rate or Staging QA pipeline instability. I'm not sure why that would change compared to the previous time; maybe something was improved. I'm planning to leave it running over the weekend and will analyse the data further next week.
While investigating the latest 500 errors, I found a real bug - #348698 (closed). cc @andrewn for the good news!
The only problem for the SLI is that the signal was small (see the green arrow below), so I'm wondering if we may need to add retries for 500 errors to boost the signal (as mentioned in #338978 (comment 764287105)).
@andrewn would retries for the problematic page help boost the error ratio? Can you please clarify how it's calculated or where I can find this information?
@niskhakova these are encouraging results, thank you for all the work you're doing on this.
would retries for the problematic page help boost the error ratio? Can you please clarify how it's calculated or where I can find this information?
Yes. We should definitely avoid any retries. I've checked cmbr and the Gocolly codebase and can't see any defaults that would lead to us doing retries though.
Can you please clarify how it's calculated or where I can find this information?
The web service SLI is an aggregate of all the component SLIs in the web service. These are displayed further down the dashboard.
For example, looking at this spike:
source, it will be composed of the error ratios from the four components of the web service: loadbalancer, workhorse, puma and imagescaler.
Not all these components have an error ratio, but looking at those that do, we see:
In staging imagescaler is getting almost no traffic. Since the aggregate error ratio uses a weighted average based on traffic, it's clear that the problem is in the workhorse component.
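In other words, my understanding of the traffic-weighted aggregation (an assumption written out from the description above, not copied from the runbooks) is roughly:

```math
\text{error ratio}_{\text{web}} = \frac{\sum_{c} \text{RPS}_c \cdot \text{error ratio}_c}{\sum_{c} \text{RPS}_c}, \qquad c \in \{\text{loadbalancer}, \text{workhorse}, \text{puma}, \text{imagescaler}\}
```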
With that in mind, we can dig further, either by navigating logs or opening the collapsed row in Grafana titled 🔬 workhorse Service Level Indicator Detail (11 panels).
Generally for these kind of searches, the logs are the best place to go.
(Note that since the original data I was searching on was > 7 days old and is no longer in ELK, I've selected another more recent Workhorse incident from staging, but the principle applies.)
Once you've got to the logs, it's possible to do further analysis via the Visualization or Lens modules in ELK to figure out which requests are leading to the incident.
I'd be happy to walk you through this in more detail if you like.
@andrewn many thanks for the deep dive into SLIs and how they work! I scheduled a meeting for us tomorrow to discuss this area further, as it'll probably be quicker to sync on a call before the holidays. I'm preparing questions in our meeting agenda for tomorrow.
Sometimes it's related to the fact that the crawling user is not an admin and can't navigate to some projects and data. I think it would be helpful to use an admin user to resolve this. Additionally, it will help to crawl more endpoints and potentially catch more errors. Lastly, it will be beneficial to have a dedicated user account for this type of activity, to make it easy to filter by its username. I created an access request for a new Staging admin user: https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/12758.
404 errors in Web
Mostly related to the fact that the user is not authenticated. We don't currently have the ability to sign in via the API, but we may explore signing in as a user, though that can be brittle.
Happening due to the Web request limit for non-authorised users on Staging (500 requests per 60 seconds); we need to authorise to get the higher limit (1,000 requests).
Explore adding a retrier on 5xx errors -> this should help increase the error rate if it's a stable error, or confirm that it was an intermittent problem.
Investigate if we can authenticate for web requests
Explore skipping gitlab-qa-sandbox-group groups and projects, as they are removed quickly and muddy the logs with 404 errors (see the sketch after this list).
Update CI config - add check jobs for merge requests in cmbr
Explore supporting multiple users to bypass limit if concurrency is increased
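For the authentication and gitlab-qa-sandbox-group items above, a rough Colly sketch of what this could look like (the host, the URL filter regex, and the session-cookie approach are assumptions on my side, not how cmbr is configured today):

```go
package main

import (
	"regexp"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("staging.gitlab.com"), // placeholder host
		// Skip gitlab-qa-sandbox-group groups/projects: they are removed
		// quickly and mostly produce 404 noise in the logs.
		colly.DisallowedURLFilters(
			regexp.MustCompile(`gitlab-qa-sandbox-group`),
		),
	)

	// One possible way to crawl as an authenticated user (and get the higher
	// request limit): reuse a session cookie from a dedicated crawler account.
	// This is an assumption; signing in through the UI form may be brittle,
	// as noted above.
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("Cookie", "_gitlab_session=<session token>")
	})

	c.Visit("https://staging.gitlab.com/explore")
}
```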
Things to further explore about Staging and its load:
cmbr caught a scenario where a page returned a 500 error - it turned out to be a real bug in the GitLab application; we used the error rate SLI to locate the time and Kibana logs to locate the error (#338978 (comment 780654286))
Enabled cmbr pipeline and verified that GitLab QA pipelines were unaffected by load emulation - #338978 (comment 752263239)
Next we plan to continue our collaboration to further improve Staging load testing:
Quality related work &7320 (closed) to streamline load simulation efforts
Infra team related work: gitlab-com/gl-infra&668 to improve the precision of staging service-level monitoring alerts and set up incident review processes
@vincywilson would it be OK to close this issue based on the work we've done?
Thank you @niskhakova for the detailed summary of what has been achieved. I made some modifications to the description to reflect the work and listed the next steps. I am comfortable with closing out this issue as the first iteration of load testing on staging-canary.