FY21-Q3 Infra KR: Dogfooding - Improve runbooks experience using Jupyter notebooks => 100%
DRI: Distinguished Engineer, Infrastructure
Initial problem description
Evaluate whether Jupyter notebooks, using the existing GitLab integration, could be used to improve our runbooks.
Conclusion
Full details of the investigation are documented in the Gap Analysis Google Doc.
After discussing the gap analysis with the Product Manager responsible for the Jupyter notebook integration, the conclusion was that improving the integration as it currently stands is not a product priority. The feature will eventually be deprecated and replaced with a different approach.
Gap analysis opening impressions
Copying verbatim from the Google Doc:
Following the instructions in https://docs.gitlab.com/ee/user/project/clusters/runbooks/, the installation process feels quite rough and I realise that this is obviously quite different from what I had been expecting.
- Integration: not integrated into the application, only deployed through the application. No permalink URLs, no project context within the runbooks, no CI variables exported to the jupyter kernel, et cetera.
- Customization: very few customization possibilities. Only option appears to be hostname.
- Boot time: notebook startup time is too slow to realistically be useful in an active incident situation.
- Setup time: due to lack of customization, setup of any required pip/conda dependencies (for example prometheus client, elasticsearch client, pandas, etc) needs to be done at runtime. This is not only too slow for incidents, but adds additional risk in that our runbooks depend on pypi.org being available in order to execute runbooks.
- Authentication: It’s unclear how the authentication works. This is not covered in the documentation as far as I can tell. While the notebook uses GitLab OAuth as an authentication provider, it does not appear to be limited to particular groups - it seems that any GitLab.com user can gain access to the runbook and execute it.
- Logs: It’s unclear whether audit logs are kept at all, and if so, where they are stored.
- Lack of product transparency. As part of the install, I was asked to set up an ingress. No information is provided about what the ingress is (nginx?). While this is fine for demos, in my opinion, it’s unlikely that anyone will use such an opaque setup for any serious application. Likewise, the details of the Jupyter helm chart as it’s deployed are only available in the GitLab source code. There is no change control around what gets deployed through the GitLab Applications UI.
- Lack of secrets management: it’s absolutely critical that this environment is not used as an attack vector against GitLab.com. For this reason, we need a way of ensuring that any access via this environment is protected, and that any secrets used in the environment are protected. An additional security issue: the deployed site uses self-signed certificates.
- Markdown: GitLab is driven by markdown, and markdown complements GitLab by working very well with git source control. The integration offered does not support markdown as a first-class citizen; instead the notebooks are stored as a JSON blob, checked into git. This makes change control, conflict resolution and change annotation through git all very difficult. Scaling this up to allow changes from across the infrastructure department, as we require for our runbooks, would not work in this format.
- Reliability: the kernel is frequently evicted from kubernetes while running tasks. It seems to be configured with memory settings that are too low. Since we cannot customize the configuration, this is a showstopper.
- The Jupyter kernel is unaware of the GitLab project context: the integration is set up on a GitLab project, but when a kernel is launched from the project, the kernel is unaware of the project. I expected the project to be cloned into the kernel and the files in the project available. Manually cloning the project inside the kernel adds more setup complexity (and risk) before the operator can proceed.
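The markdown point above can be seen in miniature with nothing but the standard library: even a trivial two-cell runbook, stored as an `.ipynb` file, becomes many lines of JSON scaffolding in git. This is a hand-built sketch following the public nbformat v4 schema (fields abbreviated); the cell contents are illustrative.

```python
import json

# A minimal .ipynb notebook as it would be committed to git: the whole
# document is a JSON blob, so even a one-word edit to a runbook step
# churns quoted, escaped lines in the diff.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Check queue depth\n", "Run the cell below.\n"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,  # execution state is committed too,
            "outputs": [],         # adding diff noise on every run
            "source": ["print('queue depth')\n"],
        },
    ],
}

serialized = json.dumps(notebook, indent=1)
# Three lines of actual content become a multiple of that in JSON
# scaffolding, which is what makes git-based review, conflict
# resolution and blame annotation painful at department scale.
print(len(serialized.splitlines()))
```

A markdown-first format (one line of text per line of file) would keep diffs proportional to the actual change.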
Analysis of Runbook types (Runbooks for Mitigation vs Runbooks for Diagnosis)
GitLab.com is moving away from causal alerts, towards symptom-based alerts. Symptom alerts focus on user-impacting symptoms over specific causes. To compare with a medical analogy, a causal alert is equivalent to a specific diagnosis - for example, “flu”. A symptom-based alert is equivalent to a medical symptom - for example, “high fever”. Causal alerts are easy to solve with “For cause X, execute Y” statements, but the number of possible causes makes it difficult to provide reliable coverage of an application, particularly when it's constantly changing.
Symptom-based alerts require an extra step of diagnosis before a cause can be established, but they are much more reliable because they monitor user experience (request errors and latency). Why symptom-based alerts are better than causal alerts is outside the scope of this document (see the Monitoring Distributed Systems chapter in Google’s SRE Book for more details), but broadly they are more accurate and provide better coverage with fewer false alerts, at the cost of requiring diagnosis.
This is important because it adds more emphasis on the diagnosis phase of an incident, and less on the execution of work-arounds.
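The contrast above can be sketched in a few lines. This is an illustrative toy, not our actual alerting rules: the metric names and thresholds are invented for the example.

```python
# Illustrative sketch: a causal alert fires on one specific cause; a
# symptom-based alert fires on degraded user experience (here, the
# error ratio of served requests), whatever the underlying cause.

def causal_alert(pg_replication_lag_s: float) -> bool:
    # "For cause X, execute Y": covers exactly one failure mode.
    return pg_replication_lag_s > 60

def symptom_alert(total_requests: int, failed_requests: int,
                  error_budget: float = 0.001) -> bool:
    # Fires whenever users see too many errors, regardless of cause;
    # diagnosis happens after the page, not inside the alert rule.
    if total_requests == 0:
        return False
    return failed_requests / total_requests > error_budget

# A 0.25% error ratio exceeds the illustrative 0.1% budget, so the
# symptom alert pages even if replication lag (the causal rule) is fine.
print(symptom_alert(100_000, 250))
```

The symptom rule catches novel failure modes the causal rule was never written for, which is exactly why it shifts effort into the diagnosis phase.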
To paraphrase the Anna Karenina Principle: “All functioning distributed systems are alike; each malfunctioning distributed system is malfunctioning in its own way”. Having pre-canned execution scripts in our runbooks makes the assumption that we are experiencing the same causes of our incidents over and over. If this were the case, better to prevent the incident from occurring with corrective actions than having a Jupyter notebook that (once we manage to start it) allows an operator to manually apply a fix through Rubix.
Even if it were the case that pre-canned scripts were the best course of action, running them in Jupyter notebooks is far from the best place to do this. The security, logging, reliability and performance issues of this environment are far less mature than what we get from a simple SSH console host. The boring solution is in this case, much, much better.
As our symptom-based alert coverage expands, the biggest impact to our availability is the time it takes to diagnose the alert, not the time it takes to mitigate once diagnosis is complete.
The feature as it currently stands, based on Jupyter and Rubix, focuses on reducing mitigation time, but assumes that a small set of mitigation runbooks will provide coverage of a complex system. In my opinion, it would be better to focus, as Datadog have done in their runbooks product, on the diagnosis phase, by providing tooling to query and integrate data from observability sources such as ELK, Prometheus, BigQuery, Jaeger and other tools.
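As a hedged sketch of what diagnosis-phase tooling could look like, the helper below builds a Prometheus `/api/v1/query_range` request (a real Prometheus HTTP API endpoint) for an incident lookback window. The hostname, metric names and PromQL expression are illustrative assumptions; fetching and rendering the result into the runbook is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def prometheus_range_query_url(base_url, promql, lookback,
                               step="60s", now=None):
    """Build a Prometheus /api/v1/query_range URL for incident diagnosis.

    Pure URL construction only -- executing the query and tabulating
    the response is out of scope for this sketch.
    """
    end = now or datetime.now(timezone.utc)
    start = end - lookback
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"

# Example: the last hour of 5xx error ratio (metric names illustrative).
url = prometheus_range_query_url(
    "https://prometheus.example.com",
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))',
    lookback=timedelta(hours=1),
)
print(url.split("?", 1)[0])
```

A runbook built around helpers like this queries observability data at the moment of the incident, rather than assuming the cause in advance.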
Completion Criteria
- Create a set of feedback items for the dev team to assist with future enhancements.
- Discuss alignment and action steps with responsible Product areas. => Recording https://youtu.be/axiL3uuNhks
- Work through implementation of this functionality for a clearly scoped area. Provide additional feedback throughout the experience. gitlab-com/gl-infra/scalability#544 (closed)
Retrospection
Good
- Used the opportunity to build new, greatly expanded Capacity Planning and Forecasting tool, Tamland: https://gitlab.com/gitlab-com/gl-infra/tamland.
- Sample report: https://gitlab-com.gitlab.io/gl-infra/tamland/saturation.html.
- This report is produced weekly and is used to help with scheduling and prioritization of upcoming work.
- Associated Slack channel where reports are being discussed: https://gitlab.slack.com/archives/C01AHAD2H8W
- Produced a gap analysis of missing features: https://docs.google.com/document/d/1gZZWVMiaiRI6m-xdh_NAMxvt84BspJakXtY7CBC-Z_Q/edit
- Good interaction with, and feedback provided to, the Product Management team following our evaluation: https://youtu.be/axiL3uuNhks
- Used the experience to provide more feedback into minimum criteria for something to be considered ready for dogfooding on GitLab.com: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11511#note_436600393
- While our first experience of using GitLab Managed Apps on GitLab.com demonstrated that this approach to managed apps is unlikely to work for GitLab.com (and probably for many enterprise clients), a second iteration, which shows more promise, is underway: https://docs.gitlab.com/12.10/ee/user/clusters/applications.html#install-using-gitlab-cicd-alpha
Bad
- This product feature was not ready for dogfooding, and this was clear within a few hours, or maybe days, of the start of our evaluation.
- More time should have been spent upfront, before committing to this OKR goal, on deciding whether this was an appropriate strategic goal for the quarter.
Try
- We should ensure that the product feature meets the criteria for an MVC before an OKR goal of using the feature on GitLab.com is set.
- From the handbook: "While an MVC may not have the robust functionality of a fully developed feature, it should still address fundamental user needs through a bug-free and highly usable experience. The minimal viable change should not be a broken feature."
- Minimum MVC criteria would include items such as those listed in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11511#note_436600393
- As infrastructure, we should work with other stakeholders to ensure that we are included in user research for defining what the MVC for a feature looks like. As dogfood recipients, the Infrastructure team will be users, and as such should engage with the product team on the user research.