Global Search Stability, Performance, and Scalability.
What are you trying to do? Articulate your objectives using absolutely no jargon.
Enable the GitLab Operations and Infrastructure teams to manage Global Search with less than one headcount.
How is it done today, and what are the limits of current practice?
Today, management of the infrastructure is shared between Operations and the Global Search group, while the product is maturing to a state that requires less management.
What's new in your approach and why do you think it will be successful?
- The Global Search Elastic based for core/free/
- Need a disaster recovery plan
- Stability, so that indexing does not need to be paused manually if the index stops for any reason
- Auto-scaling to accommodate additional usage peaks or additional storage needs
- A more efficient storage-to-usage ratio, since storage is charged for in ES Cloud
- The GitLab Admin console should have more flexibility and capability to remotely modify the ES cluster when needed
- Performance and growth should be planned
- Search abuse prevention should require less manual intervention
- Dashboard improvements for monitoring
- Performance Testing framework
Who cares? If you're successful, what difference will it make?
- Reduce the overall cost to manage Search in SaaS
- Make adding features to Global Search more efficient
- Serve information that currently comes from PG more cost-effectively
- Reduce the need for Global Search engineering to manage the infrastructure (shift left)
- Improve stability and performance for larger self-managed customers
What are the risks and the payoffs?
High value, high availability, and higher-quality iterations as performance concerns decrease with this automation.
How much will it cost?
Infrastructure and headcount time. (This is included in the current priority, and no additional funding is being requested.)
How long will it take?
- 1 quarter as the top priority.
- 3 quarters of ongoing work.
What are the midterm and final "exams" to check for success?
- Error budget is consistently green, based on manual changes and managing abuse. This was the case up until the changes to the error budget reporting.
- The data layer in ES is updated to stay efficient as usage and content change and increase.
  - Sharding changes to the Notes index.
  - Making each content type its own index.
  - More efficient sharding strategy.
- DR: runbook in the event of a failure.
- DR: restore from the confirmed good snapshot.
- Robust indexing pipeline (queue-based, with a write-ahead log (WAL)).
- High availability: multiple zones and regions for fallback and duplication.
- Training and documentation for the Operations team.
- The Global Search team spends less than 10% of its time in the milestone on performance and infrastructure needs and changes, as the Operations team is properly enabled to take ownership.
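To make the queue-based indexing pipeline with a write-ahead log from the list above more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `WalQueue` class, its method names, and the JSON-lines log format are hypothetical and are not taken from the GitLab indexer. A production pipeline would more likely use a durable external queue than a local file.

```python
import json
import os
import tempfile
from collections import deque


class WalQueue:
    """Hypothetical queue that logs every entry to disk before indexing it."""

    def __init__(self, path):
        self.path = path
        self.pending = deque()
        self._replay()

    def _replay(self):
        # Rebuild pending work from the log: enqueued ids minus acknowledged ids.
        if not os.path.exists(self.path):
            return
        entries = {}
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "enqueue":
                    entries[record["id"]] = record["doc"]
                elif record["op"] == "ack":
                    entries.pop(record["id"], None)
        self.pending.extend(entries.items())

    def _append(self, record):
        # Force the record to disk so that it survives a process crash.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def enqueue(self, doc_id, doc):
        # Write to the log first, then expose the entry to the indexer.
        self._append({"op": "enqueue", "id": doc_id, "doc": doc})
        self.pending.append((doc_id, doc))

    def drain(self, index_fn):
        """Index pending documents in order; stop at the first failure."""
        indexed = 0
        while self.pending:
            doc_id, doc = self.pending[0]
            try:
                index_fn(doc_id, doc)
            except Exception:
                break  # Leave the entry queued; nothing is lost or skipped.
            self._append({"op": "ack", "id": doc_id})
            self.pending.popleft()
            indexed += 1
        return indexed


if __name__ == "__main__":
    wal_path = os.path.join(tempfile.mkdtemp(), "indexing.wal")
    queue = WalQueue(wal_path)
    queue.enqueue(1, {"title": "first note"})
    # Stand-in for a call to the search cluster (hypothetical).
    queue.drain(lambda doc_id, doc: print("indexed", doc_id))
```

In this sketch, a failed indexing call leaves the entry both in memory and in the log, so a restarted process picks up where the previous one stopped without anyone pausing or resuming indexing by hand. It does not deduplicate repeated ids or compact the log, both of which a real pipeline would need.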
How would this fit into the GitLab DR plan?
Steps to transition back to Ops as primary.
Edited by Changzheng Liu