GitLab.com / GitLab Infrastructure Team / production / Issues / #3413
Issue created Jan 25, 2021 by John Jarvis (@jarv, Owner). 14 of 18 checklist items completed.

Enable the ActionCable feature in production

Production Change

Change Summary

This change enables the first real-time feature for everyone on GitLab.com. The feature is already enabled for the gitlab-com/www-gitlab-com and gitlab-org/gitlab projects.

There are two feature flags needed to enable this feature:

  • real_time_issue_sidebar: Makes users viewing an issue establish a websocket connection. These connections are currently handled by our websocket k8s pods. No traffic goes through the websocket connections yet.
  • broadcast_issue_updates: Causes issue updates to broadcast a message to the websocket connections. When a client receives this message, it also makes an API call, so we may also want to monitor the API nodes for any large increase in traffic.
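
Because each broadcast prompts every connected client to make an API call, the extra API load grows roughly with the issue update rate multiplied by the number of clients viewing the updated issue. The following is a back-of-the-envelope sketch only. The function name `extra_api_rps`, its parameters, and the example numbers are hypothetical placeholders, not measurements from GitLab.com, and it assumes exactly one API call per client per broadcast.

```python
def extra_api_rps(updates_per_sec, viewers_per_issue,
                  connected_fraction, broadcast_fraction):
    """Estimate additional API requests per second caused by broadcasts.

    Assumptions (all hypothetical, for illustration only):
      - connected_fraction: share of viewers holding a websocket connection
        (driven by real_time_issue_sidebar)
      - broadcast_fraction: share of issue updates that are broadcast
        (driven by broadcast_issue_updates)
      - each connected viewer makes one API call per broadcast received
    """
    for name, value in (("connected_fraction", connected_fraction),
                        ("broadcast_fraction", broadcast_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    broadcasts_per_sec = updates_per_sec * broadcast_fraction
    listeners_per_broadcast = viewers_per_issue * connected_fraction
    return broadcasts_per_sec * listeners_per_broadcast


if __name__ == "__main__":
    # Placeholder inputs: 10 issue updates/s, 4 viewers per updated issue.
    for fraction in (0.05, 0.25, 0.5, 0.75, 1.0):
        estimate = extra_api_rps(10, 4, fraction, fraction)
        print(f"both flags at {fraction:.0%}: ~{estimate:.2f} extra API requests/s")
```

With both flags at the same fraction, the estimate grows with the square of that fraction under these assumptions, so the later rollout steps add the most API traffic.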

The two feature flags will be turned on gradually by percentage: 5%, 25%, 50%, 75%, and 100%. Traffic will be routed to a new dedicated service for websockets with its own workhorse/puma pods and node pool.

References:

  • Previous issue to turn on ActionCable: delivery#1210 (closed)
  • Incident from when we previously enabled it and saw memory issues on Workhorse: #2940 (closed)

Change Details

  1. Services Impacted - Websockets / ActionCable
  2. Change Technician - @jarv
  3. Change Criticality - C3
  4. Change Type - changeunscheduled
  5. Change Reviewer - @skarbek @bjk-gitlab
  6. Due Date - 2021-01-26
  7. Time tracking - 2h
  8. Downtime Component - none

Detailed steps for the change

  • Set ACTION_CABLE_IN_APP=true in production: gitlab-com/gl-infra/k8s-workloads/gitlab-com!650 (merged). Note: this doesn't enable the feature until we turn on the feature flags.

  • Enable the feature flags for 5% of requests

    /chatops run feature set real_time_issue_sidebar 5
    /chatops run feature set broadcast_issue_updates 5
  • Enable the feature flags for 25% of requests

    /chatops run feature set real_time_issue_sidebar 25
    /chatops run feature set broadcast_issue_updates 25
  • Enable the feature flags for 50% of requests

    /chatops run feature set real_time_issue_sidebar 50
    /chatops run feature set broadcast_issue_updates 50
  • Enable the feature flags for 75% of requests

    /chatops run feature set real_time_issue_sidebar 75
    /chatops run feature set broadcast_issue_updates 75
  • Enable the feature flags for 100% of requests

    /chatops run feature set real_time_issue_sidebar true
    /chatops run feature set broadcast_issue_updates true
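
The staged rollout above could be scripted so that both flags always move in lockstep. The following is a minimal sketch, not the team's tooling: it only prints the ChatOps commands listed above, in order, and does not send anything to chat. The names `rollout_commands`, `FLAGS`, and `STEPS` are hypothetical. The final step uses `true` rather than `100`, matching the steps above.

```python
# Flags and percentages are taken from the steps in this issue.
FLAGS = ["real_time_issue_sidebar", "broadcast_issue_updates"]
STEPS = [5, 25, 50, 75, 100]


def rollout_commands(flags=FLAGS, steps=STEPS):
    """Yield (percentage, commands) pairs, one pair per rollout step."""
    for pct in steps:
        # The last step sets the flag to "true" instead of a percentage.
        value = "true" if pct == 100 else str(pct)
        yield pct, [f"/chatops run feature set {flag} {value}" for flag in flags]


if __name__ == "__main__":
    for pct, commands in rollout_commands():
        print(f"# {pct}% of requests")
        for command in commands:
            print(command)
```

Printing the commands rather than executing them keeps a person in the loop at every step, so the metrics under Monitoring can be checked before moving to the next percentage.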

Rollback

Rollback steps - steps to take if this change needs to be rolled back

  • Disable the feature flags
    /chatops run feature set real_time_issue_sidebar false
    /chatops run feature set broadcast_issue_updates false

Monitoring

Key metrics to observe

  • ActionCable Connections & Memory use
  • API Dashboard https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gprd
  • Websockets logs
    • Workhorse https://log.gprd.gitlab.net/goto/aa18e25f55328fd54da2766f5b0d2ed3
    • Rails https://log.gprd.gitlab.net/goto/00d252cb1d35967b96efd72aa628c894
  • Workhorse profiling https://console.cloud.google.com/profiler/workhorse-websockets;type=HEAP_ALLOC/alloc_space?project=gitlab-production&authuser=1
  • Sentry for websockets https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+type%3Awebsockets&statsPeriod=24h
  • Google Console for websockets workload https://console.cloud.google.com/kubernetes/deployment/us-east1-b/gprd-us-east1-b/gitlab/gitlab-webservice-websockets/overview?authuser=1&project=gitlab-production&pageState=(%22savedViews%22:(%22i%22:%22253d89ca854e49578861a965702e1761%22,%22c%22:%5B%5D,%22n%22:%5B%5D),%22deployment_overview_active_revisions_table%22:(%22r%22:50))

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited Jan 26, 2021 by John Jarvis