Netflix Ops for Everyone Else: NOFEE
Problem to solve
Operations is hard. Building world-class infrastructure that can withstand whole-region outages is incredibly hard, yet everyone deserves this kind of reliability. Let's make it easy to get Netflix-level operations just by using default best practices built into GitLab.
Target audience
Operations manager, CTO
Further details
Riffing on https://github.com/GIFEE/GIFEE.
This isn't a hard research problem. It's a large integration problem, and it takes an investment that most companies aren't prepared to make.
One primary goal would be to survive region outages. This implies things like database replication. We might have to be prescriptive about which services you can rely on, e.g. Postgres and Redis only.
GitLab already has great CI and CD, which is a foundation for this, but there's more to it than that. For example, Netflix is known for a robust microservices architecture in which the failure of an individual service degrades gracefully: if the recommendation engine fails, the UI just silently drops that component.
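To make the graceful-degradation idea concrete, here is a minimal sketch in Python. It is not Netflix's or GitLab's implementation; `render_page`, the component names, and the `required` set are all hypothetical names chosen for illustration. The assumption is that each UI component is fetched by a callable that may raise when its backing service is down.

```python
from typing import AbstractSet, Callable, Dict


def render_page(
    components: Dict[str, Callable[[], str]],
    required: AbstractSet[str] = frozenset(),
) -> Dict[str, str]:
    """Fetch each component, dropping optional ones whose service fails.

    `components` maps a (hypothetical) component name to a callable that
    returns its rendered content. Names listed in `required` are treated
    as essential: their failure is re-raised and fails the whole page.
    """
    rendered: Dict[str, str] = {}
    for name, fetch in components.items():
        try:
            rendered[name] = fetch()
        except Exception:
            if name in required:
                # An essential component failed, so the page cannot degrade.
                raise
            # Optional component: leave it out of the page silently.
    return rendered
```

With a failing `recommendations` callable and a working `catalog` one, the returned page would contain only `catalog`. A real system would likely also add timeouts and record the failure for alerting instead of swallowing it entirely.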
Make a list of features. Also look at Spinnaker and link to their white papers:
- Canary
- Chaos Monkey
- Environment
- Incremental deployments
- Feature flags
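
Incremental deployments and feature flags from the list above often rest on the same mechanism: deterministically assigning each user to a bucket and enabling the change for a growing percentage of buckets. The sketch below shows one common way to do that with a hash. It is an assumption for illustration, not how GitLab or Spinnaker implement it; `rollout_bucket`, `is_enabled`, and the flag and user identifiers are hypothetical.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the same inputs always
    # produce the same bucket, so a user's experience stays consistent.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, percentage: int) -> bool:
    """Return True if the flag is on for this user at the given rollout %."""
    return rollout_bucket(flag, user_id) < percentage
```

Because the bucket is fixed per user, raising `percentage` from 10 to 50 only adds users; nobody who already had the feature loses it. Including the flag name in the hash keeps different flags from always targeting the same users first.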