provided to it, Sidekiq then spins up an appropriate worker to process
the job. Various items place work into the queue, including GitLab cron jobs
and user behavior on GitLab. GitLab uses a single Sentinel-backed Redis instance as the storage backend for Sidekiq queues. GitLab uses multiple queues
consumed by various groups of Sidekiq workers deployed on GKE as k8s deployments, each containing worker-specific configuration.
This configuration includes which queues that set of workers will poll
and the concurrency level for those workers.
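For illustration, the polling half of that configuration boils down to a blocking pop across the configured queues. A minimal sketch using the `redis` gem (the queue names, URL, and the absence of a thread pool are simplifications, not the production setup):

```ruby
require "json"
require "redis"

# Sidekiq stores each queue as a Redis list under the key "queue:<name>".
# A worker set configured to poll `default` and `mailers` effectively does:
redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

loop do
  # BRPOP blocks until a job is available on any of the watched queues.
  queue, raw_job = redis.brpop("queue:default", "queue:mailers", timeout: 2)
  next if raw_job.nil?

  job = JSON.parse(raw_job)
  # A real Sidekiq process instantiates job["class"] and calls #perform with
  # job["args"]; the concurrency setting sizes the thread pool doing this.
  puts "dequeued #{job['class']} from #{queue}"
end
```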
Our goal is to horizontally scale Sidekiq through an application-layer router which
...
...
The work can be tracked in [Scalability epic 1218](https://gitlab.com/groups/git
The diagram above represents the target state after the workload migration for catchall to the new `redis-sidekiq-catchall-a`.
In a sharded state, the `catchall` K8s deployment polls from a separate Redis compared to the rest of the Sidekiq K8s deployments.
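Conceptually, the application-layer router picks a Redis connection pool per job before pushing. A minimal sketch of the idea using Sidekiq's public `Sidekiq::Client.via` API, assuming Sidekiq 6.x-style Redis pools (the shard map, env vars, and routing rule below are illustrative, not GitLab's actual implementation):

```ruby
require "connection_pool"
require "redis"
require "sidekiq"

# One connection pool per Sidekiq Redis shard (URLs are illustrative).
SIDEKIQ_SHARDS = {
  "main"       => ConnectionPool.new(size: 5) { Redis.new(url: ENV["REDIS_SIDEKIQ_URL"]) },
  "catchall_a" => ConnectionPool.new(size: 5) { Redis.new(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"]) }
}.freeze

# Illustrative routing rule: `default` and `mailers` go to the new shard.
def shard_for(queue)
  %w[default mailers].include?(queue) ? "catchall_a" : "main"
end

def enqueue(worker_class, queue, *args)
  # Sidekiq::Client.via pushes through the given pool instead of the global one.
  Sidekiq::Client.via(SIDEKIQ_SHARDS[shard_for(queue)]) do
    worker_class.set(queue: queue).perform_async(*args)
  end
end
```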
### Service Catalog
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to the [service catalog entry](https://gitlab.com/gitlab-com/runbooks/-/tree/master/services) for the service. Ensure that the following items are present in the service catalog, or listed here:
  - Link to or provide a high-level summary of this new product feature.
  - Link to the [Architecture Design Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/) for this feature; if there wasn't a design completed for this feature, please explain why.
  - List the feature group that created this feature/service and who the current Engineering Managers, Product Managers and their Directors are.
  - List the individuals who are the subject matter experts and know the most about this feature.
  - List the team or set of individuals who will take responsibility for the reliability of the feature once it is in production.
  - List the member(s) of the team who built the feature who will be on-call for the launch.
  - List the external and internal dependencies of the application (e.g., Redis, Postgres) for this feature and how the service will be impacted by a failure of those dependencies.
The service catalog entry is [redis-sidekiq-catchall-a](https://gitlab.com/gitlab-com/runbooks/-/blob/8421a42daac4555f907d3ad2cd5995e35e9ecf6f/services/service-catalog.yml#L1505), a shard of `redis-sidekiq`.
The summary and design can be found in the [runbook doc](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/7178).
This feature is owned by the Scalability group (PM: Sam Wiskow `@swiskow`, EM: Kennedy Wanyangu `@kwanyangu`). The subject matter experts are `@schin1` and `@fshabir`; both will be available during the launch.
Redis is the only external dependency that Sidekiq relies on. We also depend on the feature flag stores (`redis-cluster-feature-flags` and `patroni`) for the rollout.
Sidekiq will fail when the backing Redis fails. If the underlying stores fail, the feature flag defaults to its definition-file value, which is `false`. However, if `patroni` and `redis-cluster-feature-flags` were to fail, the impact on sharded Sidekiq would be considerably less severe than the impact on the overall availability of GitLab.com.
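A sketch of that fallback behaviour (the flag name and helper are hypothetical; GitLab's real feature-flag plumbing is more involved):

```ruby
# Hypothetical helper deciding whether to route a job to the new shard.
# `Feature.enabled?` reads flag state backed by `patroni` (source of truth)
# and cached in `redis-cluster-feature-flags`.
def route_to_catchall_a?
  Feature.enabled?(:sidekiq_route_to_catchall_a) # flag name is illustrative
rescue StandardError
  # If the underlying stores fail, fall back to the flag's definition-file
  # default of `false`, so jobs keep going to `redis-sidekiq`.
  false
end
```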
### Infrastructure
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Do we use IaC (e.g., Terraform) for all the infrastructure related to this feature? If not, what kind of resources are not covered?
`redis-sidekiq-catchall-a` is provisioned using IaC. The code is housed in [config-mgmt](https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/blob/1ccbe018420248e44f1112cb0e058d1052763582/environments/gprd/main.tf#L1128).
- [ ] Is the service covered by any DDoS protection solution (GCP/AWS load-balancers or Cloudflare usually cover this)?
The sources of jobs for Sidekiq are Sidekiq workers themselves and the Rails web services, which sit behind GCP load balancers and the Cloudflare WAF.
Sidekiq also has its own deduplication logic, which serves as pseudo rate-limiting at the application layer, and the application is protected by Rack middleware that enforces application rate limits.
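The deduplication mentioned above is conceptually an idempotency key per job. A minimal sketch of the idea (the key format and TTL are illustrative, not GitLab's actual middleware):

```ruby
require "redis"

# Before enqueueing, claim a short-lived idempotency key derived from the
# worker class, queue, and arguments. If the key is already held, an
# identical job is already queued and the push can be skipped.
def claim_for_enqueue?(redis, worker_class, queue, args, jid, ttl: 600)
  key = "dedup:#{queue}:#{worker_class}:#{args.hash}"
  # SET with NX returns false when another pending job holds the key.
  redis.set(key, jid, nx: true, ex: ttl)
end
```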
- [ ] Are all cloud infrastructure resources labeled according to the [Infrastructure Labels and Tags](https://about.gitlab.com/handbook/infrastructure-standards/labels-tags/) guidelines?
Yes. The newly provisioned Redis references the existing labels which `redis-sidekiq` already has. No new labels or tags are added.
### Operational Risk
_The items below will be reviewed by the Scalability:Practices team._
- [ ] List the top three operational risks when this feature goes live.
1. Feature flag state corruption through a bug or a mis-toggle, resulting in jobs being directed back to `redis-sidekiq`. However, we are adding an option to "pin" the migration state using an environment variable, which will reduce the risk of accidental toggles or feature-flag bugs.
2. `redis-sidekiq-catchall-a` becoming a single point of failure: if the majority of its VMs were destroyed or down for any reason, Sidekiq jobs for `default` and `mailers` could not be enqueued correctly. However, we already carry this risk today, since all Sidekiq workloads are served from `redis-sidekiq`.
3. An undiscovered application bug which affects job routing. We performed a successful migration on gstg, but its workload is vastly different from gprd's. We are mitigating this by (1) performing the migration on a small subset of workers before the full shard migration, and (2) performing the migration using feature flags, increasing the percent-enabled gradually over 1.5 hours.
- [ ] What are the potential scalability or performance issues that may result with this change?
This change enables horizontal scalability of Sidekiq by sharding the workload across multiple Sentinel-backed Redis instances. There should be no performance issues on Sidekiq as a result of this change.
- [ ] As a thought experiment, think of worst-case failure scenarios for this product feature, how can the blast-radius of the failure be isolated?
The worst-case failure scenario would be incorrect routing logic, causing jobs to be enqueued but not picked up by any Sidekiq workers.
- [ ] For each component and dependency, what is the blast radius of failures? Is there anything in the feature design that will reduce this risk?
The blast radius of a failure is already isolated to 2 of the 10 queues. However, these 2 queues account for ~50% of the load.
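The "pin" mentioned in risk 1 can be thought of as an environment-variable override that short-circuits the feature-flag lookup. A hypothetical sketch (the variable name, values, and flag name are illustrative):

```ruby
# If SIDEKIQ_MIGRATION_PIN (hypothetical name) is set, it wins over the
# feature flag, so a mis-toggle or feature-flag bug cannot silently
# re-route jobs.
def migration_enabled?
  case ENV["SIDEKIQ_MIGRATION_PIN"]
  when "enabled"  then true
  when "disabled" then false
  else Feature.enabled?(:sidekiq_route_to_catchall_a) # illustrative flag name
  end
end
```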
...
...
This can be resolved by using a temporary deployment as outlined in the [troubleshooting guide](TBD) to process the dangling jobs, or by performing a one-time job migration
across instances once the bug has been resolved.
- [ ] Link to notes or testing results for assessing the outcome of failures of individual components.
We had a 2-phase rollout where we migrated a [subset of workers](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779), rolled back, and then [migrated the entire Sidekiq shard](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17841).
The metrics we tracked (detailed in the rollout links) surfaced non-critical bugs, which were resolved.
### Monitoring and Alerting
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog/services) for the service.
The Sidekiq [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/sidekiq.jsonnet) and the redis-sidekiq [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/redis-sidekiq.jsonnet).
- [ ] Link to examples of logs on https://logs.gitlab.net
https://log.gprd.gitlab.net/app/r/s/UHgNr
- [ ] Link to the [Grafana dashboard](https://dashboards.gitlab.net) for this service.
- [ ] Link to the troubleshooting runbooks.
The [Sidekiq survival guide for SREs](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/sidekiq/sidekiq-survival-guide-for-sres.md) is the most useful troubleshooting guide for most Sidekiq issues.
For shard-specific troubleshooting, refer to the [sharding guide](TBD).
- [ ] Link to an example of an alert and a corresponding runbook.
There are no new alerts. As the new Redis is monitored as a shard of the `redis-sidekiq` service, the related alerts will have the `shard=catchall_a` label.
- [ ] Confirm that on-call SREs have access to this service and will be on-call. If this is not the case, please add an explanation here.
Yes, the new Redis and Sidekiq k8s deployments are accessible using ssh for all SREs.
### Performance, Scalability and Capacity Planning
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to any performance validation that was done according to [performance guidelines](https://docs.gitlab.com/ee/development/performance.html).
There were no performance tests done, as there are no significant changes to architecture components. A standard Sidekiq job is performed with the same set of components (Rails -> Redis -> Sidekiq). The only additional computation is an extra hash look-up and the use of a different Redis client for the job push to Redis.
- [ ] Link to any load testing plans and results.
There was no load testing done, as the Sidekiq load is not expected to change (see above).
- [ ] Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
We expect the existing `redis-sidekiq` instance to experience a drop in primary CPU utilization as the load is shared with `redis-sidekiq-catchall-a`.
We do not expect any performance impacts on the Postgres database, as the volume of Sidekiq jobs remains unchanged. This architectural change does not introduce extra load.
- [ ] Explain how this feature uses our [rate limiting](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/rate-limiting) features.
Sidekiq already uses the existing forms of rate limiting; sharding Sidekiq does not change that behaviour.
- [ ] Are there retry and back-off strategies for external dependencies?
There are no changes to any existing strategies already in place.
- [ ] Does the feature account for brief spikes in traffic, at least 2x above the expected rate?
Yes. This feature will allow Sidekiq throughput (enqueue and dequeue) to better handle brief spikes in traffic. Sharding also bulkheads the `default` and `mailers` queues.

### Backup, Restore, DR and Retention
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Are there custom backup/restore requirements?
No. The sharded Sidekiq architecture shares the existing Sidekiq requirements. Sharding does not introduce any new requirements or modify existing ones.
- [ ] Are backups monitored?
We can track them via the `bgsave` rate on [Thanos](https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(rate(redis_commands_total%7Benv%3D%22gstg%22,%20type%3D'redis-sidekiq',%20cmd%3D'bgsave'%7D%5B1m%5D))%20by%20(shard)&g0.tab=0&g0.stacked=0&g0.range_input=12h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D); the linked query is `sum(rate(redis_commands_total{env="gstg", type='redis-sidekiq', cmd='bgsave'}[1m])) by (shard)`. However, we do not actively alert on Redis backups at the moment.
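Beyond the Thanos graph, snapshot health can be spot-checked directly against Redis from a console. A small sketch using the `redis` gem (the env var is illustrative):

```ruby
require "redis"

# RDB snapshot health can be read from the INFO persistence section.
redis = Redis.new(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"]) # illustrative env var
info = redis.info("persistence")

puts info["rdb_last_bgsave_status"]      # "ok" when the last snapshot succeeded
puts info["rdb_last_save_time"]          # unix timestamp of the last save
puts info["rdb_changes_since_last_save"] # writes not yet persisted to disk
```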
- [ ] Was a restore from backup tested?
No. Restoring from backup was not tested, as Redis is not a new component introduced as part of this change.
- [ ] Link to information about growth rate of stored data.
This can be tracked via the memory component's growth [over 16 weeks](https://thanos-query.ops.gitlab.net/graph?g0.expr=gitlab_component_saturation%3Aratio%7Benv%3D%22gprd%22%2C%20type%3D%27redis-sidekiq%27%2C%20component%3D%22memory%22%2C%20shard%3D%22default%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=16w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D); the linked query plots `gitlab_component_saturation:ratio{env="gprd", type='redis-sidekiq', component="memory", shard="default"}`.
- [ ] Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
No special requirements are needed.
- [ ] How does data age? Can data over a certain age be deleted?
The data in `redis-sidekiq-catchall-a` is fairly transient: it consists of newly enqueued jobs, scheduled jobs (which will be removed in the near future), and retry/dead jobs, which can
persist for a longer period of time. However, in the context of GitLab.com, we do not act on dead jobs using the `/admin/sidekiq` page.
The data which remains static is not critical to operations. Such data includes metrics, Sidekiq process metadata, and cron metadata, which are re-populated on a Sidekiq deployment restart or a fresh deployment.
### Deployment
_The items below will be reviewed by the Delivery team._
- [ ] How are the artifacts being built for this feature (e.g., using the [CNG](https://gitlab.com/gitlab-org/build/CNG/) or another image building pipeline)?
The artifact is built as part of gitlab-rails using the CNG pipeline.
- [ ] Will a [change management issue](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/) be used for rollout? If so, link to it here.
The gstg change issue is at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779. The gprd change issue is at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17868.
- [ ] Can the new product feature be safely rolled back once it is live, can it be disabled using a feature flag?
Yes. There is a feature flag to control the routing behaviour between `redis-sidekiq` and `redis-sidekiq-catchall-a`. This was tested as part of the gstg rollout for [worker rollout](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779) and [entire shard migration](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17841).
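As a sketch of what that rollback looks like operationally (the flag name is illustrative; in practice the toggles are issued via ChatOps):

```ruby
# From a Rails console: disabling the (illustrative) routing flag immediately
# sends new jobs back to `redis-sidekiq`.
Feature.disable(:sidekiq_route_to_catchall_a)

# The gradual rollout used percentage gating, conceptually:
Feature.enable_percentage_of_time(:sidekiq_route_to_catchall_a, 25)
```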
- [ ] Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
As the feature release will be done through a feature flag, a deployment is not required for rollout or rollback. For change rollback (disabling the feature flag), we can rely on
...
...
performs as expected.
In general, deployment depends on GitLab.com, as it uses `k8s-workloads/gitlab-com`; GitLab CI/CD applies the Helm changes to deploy newer revisions. The migration rollout itself
is performed entirely using feature flags, which depends on ChatOps and GitLab Rails to perform the relevant updates.
### Security Considerations
_The items below will be reviewed by the Infrasec team._
- [ ] Link or list information for new resources of the following type:
- AWS Accounts/GCP Projects: N.A.
- New Subnets: `projects/gitlab-production/regions/us-east1/subnetworks/redis-sidekiq-catchall-a-gprd`
- VPC/Network Peering: N.A.
- DNS names: static IPs for `redis-sidekiq-catchall-a-0{1/2/3}-db-{gstg/gprd}.c.gitlab-{staging-1/production}.internal`
- Entry-points exposed to the internet (Public IPs, Load-Balancers, Buckets, etc...): N.A.
- Other (anything relevant that might be worth mentioning): 3x GCE VMs
The `terraform apply` output can be viewed at https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/8061#note_196640.
- [ ] Were the [GitLab security development guidelines](https://docs.gitlab.com/ee/development/secure_coding_guidelines.html) followed for this feature?
Yes. However, note that the application-level Sidekiq router does not deal with user information, filenames, links, authorization/authentication, credentials, etc.
- [ ] Was an [Application Security Review](https://handbook.gitlab.com/handbook/security/security-engineering/application-security/appsec-reviews/) requested, if appropriate? Link it here.
No. No new application components (gems or dependencies) are introduced.
- [ ] Do we have an automatic procedure to update the infrastructure (OS, container images, packages, etc.)? For example, using unattended upgrades or [renovate bot](https://github.com/renovatebot/renovate) to keep dependencies up-to-date?
No new components are created, hence we leverage all existing automatic upgrade procedures, such as renovate bot for gems.
- [ ] For IaC (e.g., Terraform), are there any secure static code analysis tools like [kics](https://github.com/Checkmarx/kics) or [checkov](https://github.com/bridgecrewio/checkov) in use? If not and new IaC is being introduced, please explain why.
Yes, we leverage checkov in config-mgmt.
- [ ] If we're creating new containers (e.g., a Dockerfile with an image build pipeline), are we using `kics` or `checkov` to scan Dockerfiles, or [GitLab's container scanner](https://docs.gitlab.com/ee/user/application_security/container_scanning/#configuration) for vulnerabilities?
N.A. We are not creating new containers.
### Identity and Access Management
_The items below will be reviewed by the Infrasec team._
- [ ] Are we adding any new forms of Authentication (New service-accounts, users/password for storage, OIDC, etc...)?
- [ ] Was effort put in to ensure that the new service follows the [least privilege principle](https://en.wikipedia.org/wiki/Principle_of_least_privilege), so that permissions are reduced as much as possible?
Yes. We leverage ACLs (introduced in Redis 6.0) on the new Redis instance. More information can be found in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3182#a-redis-vault, where we define the various Redis users and scope their allowed operations to the workload.
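As an illustration of the pattern — the actual users, rules, and secrets differ and live in Vault:

```ruby
require "redis"

# Server side, an ACL rule scopes a user to the operations its workload
# needs, e.g. in redis.conf (rule and names are made up):
#   user rails on >s3cret ~queue:* ~dedup:* +@list +@string +@connection
#
# Client side, the application authenticates as that scoped user:
redis = Redis.new(
  url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"], # illustrative env var
  username: "rails",
  password: ENV["REDIS_RAILS_PASSWORD"]     # fetched from Vault in production
)
```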
- [ ] Do firewalls follow the least privilege principle (w/ network policies in Kubernetes or firewalls on cloud provider)?
Yes.
- [ ] Is the service covered by a [WAF (Web Application Firewall)](https://cheatsheetseries.owasp.org/cheatsheets/Secure_Cloud_Architecture_Cheat_Sheet.html#web-application-firewall) in [Cloudflare](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/cloudflare#how-we-use-page-rules-and-waf-rules-to-counter-abuse-and-attacks)?
Yes.
### Logging, Audit and Data Access
_The items below will be reviewed by the Infrasec team._
- [ ] Did we make an effort to redact customer data from logs?
This change does not introduce new data. All existing logic to redact data is retained.
- [ ] What kind of data is stored on each system (secrets, customer data, audit, etc...)?
In `redis-sidekiq-catchall-a`, job metadata is stored in a Redis list for a transitory period until it is dequeued by a worker. The job metadata usually consists of ids of database records and other metadata related to the job.
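For reference, a Sidekiq job payload is a small JSON document. A representative example of what sits in the Redis list (all values invented):

```ruby
require "json"

# Shape of a typical Sidekiq job as stored in "queue:<name>" (values invented):
job = {
  "class"       => "ProjectCacheWorker",
  "queue"       => "default",
  "args"        => [42],                  # usually database record ids
  "jid"         => "b4a577edbccf1d805744efa9",
  "retry"       => true,
  "created_at"  => 1_700_000_000.0,
  "enqueued_at" => 1_700_000_000.0
}
puts JSON.generate(job)
```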
- [ ] How is data rated according to our [data classification standard](https://about.gitlab.com/handbook/engineering/security/data-classification-standard.html) (customer data is RED)?
RED. Customer-related information such as organization/project/user ids are stored in the Redis instance for short periods of time.
- [ ] Do we have audit logs for when data is accessed? If you are unsure or if using the central logging and a new pubsub topic was created, create an issue in the [Security Logging Project](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/security-logging/security-logging/-/issues/new?issuable_template=add-remove-change-log-source) using the `add-remove-change-log-source` template.
TBD
- [ ] Ensure appropriate logs are being kept for compliance and requirements for retention are met.
No new logs are introduced. All existing Sidekiq logs are found in the `pubsub-sidekiq-inf-gprd` view on Kibana.
- [ ] If the data classification = RED for the new environment, please create a [Security Compliance Intake issue](https://gitlab.com/gitlab-com/gl-security/security-assurance/security-compliance-commercial-and-dedicated/security-compliance-intake/-/issues/new?issue[title]=System%20Intake:%20%5BSystem%20Name%20FY2%23%20Q%23%5D&issuable_template=intakeform). Note this is not necessary if the service is deployed in existing Production infrastructure.
The service is deployed in existing Production Infrastructure. Sidekiq is a fairly mature component. This change adds another Redis to horizontally scale the workload and does not introduce new information to Redis or logs.
### Security
_The items below will be reviewed by the InfraSec team._
- [ ] Put yourself in an attacker's shoes and list some examples of "What could possibly go wrong?". Are you OK going into Beta knowing that?
Sidekiq is not directly accessible by attackers, since the GitLab Rails application is the only source of job enqueues (ignoring console access via Teleport, which requires infrastructure approval). There are some ways things could go wrong, but they are not unique to a sharded Sidekiq:
1. DDoS attack / excessive load. We have various layers of rate limits in place to prevent a single attacker from overloading `redis-sidekiq`. The SRE on-call can drop jobs if required.
2. Infrastructure failure. Since Redis is critical to Sidekiq, an attacker may target the Redis instances which Sidekiq uses. However, we have Redis ACLs in place and these secrets are stored in Vault, only accessible using SSO by authorized GitLab team member accounts.
I am OK going into Beta knowing that.
- [ ] Link to any outstanding security-related epics & issues for this feature. Are you OK going into Beta with those still on the TODO list?
N.A. There are no security-related epics or issues for this feature.