Discuss: Streamlining processes and knowledge transfer in Self-Managed Excellence projects
Problem
Workload for Self-Managed Excellence (SME) projects - GET, GPT, RA - has grown as SME usage has increased internally and across customers. Work on self-service in RA greatly helped organize the process. The same principles can be explored for the rest of the SME projects, along with steps to spread issues and requests across QE team members.
Challenges:
- New feature requests from internal or external customers or review requests for RA or GPT results
- Unexpected issues raised by customers that require debugging - it’s difficult to plan work because of unknown issues that internal or external customers create
- No dedicated team for SME projects; SETs have counterpart group responsibilities on top of SME projects
- Not all breaking configuration changes are known ahead of time; a dedicated process is needed to identify expectations for patch releases
- Limited capacity for working on major improvements in SME projects (example - Pulumi work in GET or GPT v3 release)
- Limited knowledge sharing and no identified path to get familiar with SME projects
- Participation in working groups related to GitLab component changes (examples - Decomposed, ClickHouse) requires knowledge of RA to identify problematic areas
- More knowledge about GitLab components and maintenance would help the whole QE department catch maintenance issues early in the chain during Quad planning (example - the increase in minimum DB size for the Security Scanning feature was learnt about late in the chain)
Metrics
https://app.periscopedata.com/app/gitlab/1101145/Self-Managed-Excellence-metrics
Issues created:
- GET: 12-20 issues per month
- RA: 6-12 issues per month
- GPT: 4-8 issues per month
Total: 22-40 issues per month across SME projects for team members to triage/investigate/review/fix. This data doesn’t represent all internal communication work (Slack, customer issues, support debugging).
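As a quick sanity check, the total range can be reproduced by summing the per-project ranges listed above. This is a minimal sketch; the dictionary and function names are illustrative and are not taken from the SME dashboard.

```python
# Monthly issue ranges (low, high) per project, copied from the list above.
monthly_issue_ranges = {
    "GET": (12, 20),
    "RA": (6, 12),
    "GPT": (4, 8),
}


def total_range(ranges):
    """Sum the low and high ends of every per-project range."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


low, high = total_range(monthly_issue_ranges)
print(f"Total: {low}-{high} issues per month")
```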
Scope of work
The list below is not an exhaustive list of responsibilities in SME projects.
GET:
- Feature work - general issues, internal customer issues, external customer issues
- Triage/investigate customer issues
- Review upcoming breaking changes in Omnibus and Charts
- Review merge requests - internal and external contributions
- Communication - questions and customer requests both internally and externally
- Release and roadmap planning
RA:
- RA review requests from customers
- RA questions
- RA maintenance work with regards to documentation
- RA maintenance work with regards to testing new components or changes to existing components
- Customer escalations / calls
GPT:
- Daily review of performance pipelines - both GPT and GBPT
- Maintenance of performance environments
- Investigate performance degradation
- Customer requests to review GPT results
- Customer issues with test data or GPT setup
- Discussions on performance testing process initiatives (for example, whether GPT can run against production, adding new endpoints, and so on)
- Release and roadmap planning
- Validating performance improvements done by Dev teams
- GitLab version performance testing pipeline and test images maintenance
- Test data maintenance
Goal
Establish processes to increase the number of SME maintainers and to share knowledge across the team until there is a dedicated SME product group.
Proposals
- Introduce a trainee program in QE Enablement and, in the future, expand it to Quality if possible
  - Reasoning: Increase knowledge sharing across the team
  - Process:
    - As part of the onboarding process, the SET creates a maintainer trainee issue for GET, RA, or GPT depending on team member preference.
    - Create trainee issues for all team members in QE Enablement
- Identify expected time frames for customer issue triage - Discuss options for SLAs for GET/GPT (gitlab-org/quality/quality-engineering/team-tasks#1803 - closed)
  - Reasoning: A defined SRE process will give maintainers a better structure for handling incoming questions and issues. It allows a more balanced allocation of resources and lets maintainers focus on their ongoing tasks without frequent interruptions.
  - Process:
    - Decide on SLO target
    - Explore adding an automated comment on externally raised issues stating that the issue will be triaged within X days and that the maintainers should be pinged if it is urgent.
    - Document response targets and the investigation process in the project: which labels to use and guidelines for responses
- Create labels to track customer support requests - Discussion: Decide on labels to organize issues... (gitlab-org/quality/quality-engineering/team-tasks#1795 - closed)
  - Reasoning: Track support and question requests across all SME projects
  - Process:
    - Decide on labels
    - Document process
    - Update SME dashboard to pull the data from labels
- Introduce Help Desk to GPT, RA and GET channels https://gitlab.com/groups/gitlab-com/it/end-user-services/-/wikis/IT-Help-Slack-Issue-Creator/How-To-Use
  - Reasoning: There is a high volume of questions in SME projects and no way to track this work in metrics.
  - Process:
    - Introduce Help Desk
    - Apply SME labels
    - Add Help Desk metrics to the SME dashboard to track all work accurately
- Explore on-call/DRI rotation for SME projects - to answer questions in SME channels and triage/investigate customer raised issues
  - Reasoning: Minimize disruption and context switching for team members, similar to the weekly Distribution DRI or Database on-call rotation
  - Triage process might include:
    - qa-performance test results review/degradation investigation, development team requests to test improvements or FF affecting performance
    - Triage customer raised issues in GET and GPT
    - Triage requests raised in RA
    - Answer or redirect questions in SME Slack channels
- Explore GET release planning rotation across maintainers and leading patch releases to spread the knowledge
  - Reasoning: Increase team knowledge and spread the load
  - Process:
    - DRI rotation for releases across maintainers
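
To make the on-call/DRI and release rotation proposals more concrete, here is a minimal sketch of a weekly round-robin schedule. The roster names, start date, and the `weekly_rotation` helper are hypothetical and for illustration only. The actual cadence and membership are still to be decided in the discussion.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(members, start, weeks):
    """Return (week_start, member) pairs, cycling through members in order."""
    schedule = []
    for week_index, member in enumerate(islice(cycle(members), weeks)):
        schedule.append((start + timedelta(weeks=week_index), member))
    return schedule


# Hypothetical roster and start date (a Monday), for illustration only.
maintainers = ["maintainer-a", "maintainer-b", "maintainer-c"]
schedule = weekly_rotation(maintainers, date(2024, 1, 1), weeks=5)

for week_start, member in schedule:
    print(week_start.isoformat(), member)
```

A plain round-robin keeps the load even and predictable. Swaps for time off or counterpart group commitments would need to be handled on top of it.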
Action items
- Please review the above proposals and add your thoughts on each area, using one thread per proposal
- Please feel free to add more proposals
- Create an epic to track issues that were decided on
Sources:
- gitlab-org/quality/quality-engineering/team-tasks#1613 (closed)
- gitlab-org/quality/quality-engineering/team-tasks#1565 (closed)
- https://app.periscopedata.com/app/gitlab/1101145/Self-Managed-Excellence-metrics
- &40 (closed)
- https://about.gitlab.com/handbook/engineering/development/enablement/systems/distribution/workflow.html#distribution-dri
- https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/#triage-rotation