Jupyter Notebooks Dogfooding Walkthrough

{{References:

Documentation: https://docs.gitlab.com/ee/user/project/clusters/runbooks/

My installation notes: https://docs.google.com/document/d/1qHLaDOTtZXGEyYEqy16pOIhLjMPuQY-lWDHVs4r6SO8/edit

Background: Expectations

My previous experience with using runbooks in an operational environment was using Datadog’s Notebook feature.

https://docs.datadoghq.com/notebooks/

My initial expectation was something along these lines, although I did not expect it to be as polished.

Initial Impressions

Following the instructions in https://docs.gitlab.com/ee/user/project/clusters/runbooks/, the installation process feels quite rough, and I realise that this feature is quite different from what I had been expecting. The main problems I found:

Integration: not integrated into the application, only deployed through the application. No permalink URLs, no project context within the runbooks, no CI variables exported to the Jupyter kernel, et cetera.
Customization: very few customization possibilities. The only option appears to be the hostname.
Boot time: notebook startup time is too slow to realistically be useful in an active incident situation.
Setup time: due to the lack of customization, setup of any required pip/conda dependencies (for example prometheus client, elasticsearch client, pandas, etc) needs to be done at runtime. This is not only too slow for incidents, but adds additional risk in that our runbooks depend on pypi.org being available in order to execute.
Authentication: it’s unclear how the authentication works. This is not covered in the documentation as far as I can tell. While the notebook uses GitLab OAuth as an authentication provider, it does not appear to be limited to particular groups; it seems that any GitLab.com user can gain access to the runbook and execute it.
Logs: where are audit logs kept? Are they stored?
Lack of product transparency: as part of the install, I was asked to set up an ingress. No information is provided about what the ingress is (nginx?). While this is fine for demos, in my opinion, it’s unlikely that anyone will use such an opaque setup for any serious application. Likewise, the details of the Jupyter helm chart as it’s deployed are only available in the GitLab source code. There is no change control around what gets deployed through the GitLab Applications UI.
Lack of secrets management: it’s absolutely critical that this environment is not used as an attack vector against GitLab.com. For this reason, we need a way of ensuring that any access via this environment is protected, and any secrets used in the environment are protected.
Additional security issues: self-signed certificates on the deployed site.
Markdown: GitLab is driven by markdown, and markdown complements GitLab by working very well with git source control. The integration offered does not support markdown as a first-class citizen; instead the notebooks are stored in a JSON blob, checked into git. This makes change control, conflict resolution and change annotation through git all very difficult. Scaling this up to allow changes from across the infrastructure dept, as we require for our runbooks, would not work in this format.
Reliability: the kernel is frequently evicted from kubernetes while running tasks. It seems to be configured with memory settings that are too low. Since we cannot customize the configuration, this is a showstopper.
Project context: the Jupyter kernel is unaware of the GitLab project context. The integration is set up on a GitLab project, but when a kernel is launched from the project, the kernel is unaware of the project. I expected the project to be cloned into the kernel and the files in the project to be available. Manually cloning the project inside the kernel adds more setup complexity (and risk) before the operator can proceed.

Taking a step back a little, part 1

Obviously many of these problems could be fixed, with some effort, but before we discuss the priority of gaps in this product, it's worth considering this approach more abstractly.

Runbooks for Mitigation vs Runbooks for Diagnosis

GitLab.com is moving away from causal alerts, towards symptom-based alerts. Symptom-based alerts focus on user-impacting symptoms over simple causes. To compare with a medical analogy, a causal alert is equivalent to a specific diagnosis, for example, “flu”. A symptom-based alert is equivalent to a medical symptom, for example, “high fever”. Causal alerts are easy to solve with “For cause X, execute Y” statements, but the number of possible causes makes it difficult to provide reliable coverage of an application, particularly when it's constantly changing.

Symptom-based alerts require an extra step of diagnosis before a cause can be established, but are much more reliable to detect because they monitor user experience (requests that error, and latency). Why symptom-based alerts are better than causal alerts is outside the scope of this document (see the Monitoring Distributed Systems chapter in Google’s SRE Book for more details), but broadly they are more accurate, provide better coverage and produce fewer false alerts, at the cost of requiring diagnosis.
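To make the distinction concrete, here is a minimal sketch of what a symptom-based check looks at: only the user-facing signals (error ratio and latency), with no knowledge of any cause. The WindowStats shape, the function name and the thresholds are all assumptions made for illustration; they are not GitLab.com's actual alerting rules.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request statistics for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p95_latency_seconds: float


def symptom_alerts(stats, max_error_ratio=0.01, max_p95_seconds=1.0):
    """Return the user-facing symptoms that breach their thresholds.

    The default thresholds are illustrative, not real SLO values.
    """
    alerts = []
    if stats.total_requests > 0:
        error_ratio = stats.failed_requests / stats.total_requests
        if error_ratio > max_error_ratio:
            alerts.append(
                f"error ratio {error_ratio:.2%} exceeds {max_error_ratio:.2%}"
            )
    if stats.p95_latency_seconds > max_p95_seconds:
        alerts.append(
            f"p95 latency {stats.p95_latency_seconds:.2f}s "
            f"exceeds {max_p95_seconds:.2f}s"
        )
    return alerts
```

Note that the check says nothing about why the error ratio is high; that is exactly the diagnosis step that the operator still has to perform.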

This is important because it adds more emphasis on the diagnosis phase of an incident, and less on the execution of work-arounds.

To paraphrase the Anna Karenina Principle: “All functioning distributed systems are alike; each malfunctioning distributed system is malfunctioning in its own way”. Having pre-canned execution scripts in our runbooks makes the assumption that we are experiencing the same causes of our incidents over and over. If this were the case, it would be better to prevent the incident from occurring with corrective actions than to have a Jupyter notebook that (once we manage to start it) allows an operator to manually apply a fix through Rubix.

Even if it were the case that pre-canned scripts were the best course of action, Jupyter notebooks are far from the best place to run them. The security, logging, reliability and performance of this environment are far less mature than what we get from a simple SSH console host. The boring solution is, in this case, much, much better.

As our symptom-based alert coverage expands, the biggest impact to our availability is the time it takes to diagnose the alert, not the time it takes to mitigate once diagnosis is complete.

The feature as it currently stands, based on Jupyter and Rubix, focuses on reducing the mitigation time, but assumes that a small set of mitigation runbooks will provide coverage of a complex system. In my opinion, it would be better to focus, as Datadog have done in their runbooks product, on the diagnosis phase, by providing tooling to query and integrate data from observability sources such as ELK, Prometheus, BigQuery, Jaeger and other tools.

Taking a step back a little, part 2 - Integrated Applications

A second issue that this work has brought to light for me is around GitLab’s current approach to integrating third-party applications through vendored helm-charts, baked into the product.

Having one-click deployments for half-a-dozen applications, built directly into GitLab is good for demos, but probably not suitable for anything other than the most basic real-world production deployment.

Having one or two configurable options in the GitLab UI will never suffice as a way of configuring integrated applications. Even if we added more options to the UI, there would always be others missing.

With this in mind, I propose a change to the way we integrate third-party applications into GitLab.

Approach at present:

Browse applications tab
Click install for an application (eg “Prometheus”)
Integrated helm deployer deploys a “fixed” helm chart to the kubernetes cluster.

Upsides: Simplicity

Downsides:
Lack of transparency into what is being deployed
Lack of change control, reviews, standard devops tooling practice.
Lack of configurability
Lack of future extensibility

Extensible proposed approach: Auto-Devops for deployments

Much like we offer Auto-Devops for applications, we would extend Auto-Devops to helm charts (and possibly other infrastructure-as-code tools such as terraform, in future).

Browse applications tab (same as present)
Click install for an application (eg “Prometheus”)
This opens a MR to the current project repository for the helm chart (in this case Prometheus)
The user is presented with the change and can edit it and customise it
The change is reviewed and merged
Autodevops for deployment CI job is executed on master, detecting a chart change has been pushed to the project, and deploys the change to the production environment.
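The final CI step in the flow above (detect that a chart has changed, then deploy it) could be sketched roughly as follows. This is only an illustration of the idea: the charts/<name>/ repository layout, the function names and the default namespace are assumptions, not part of the proposal or of how GitLab deploys charts today. The generated command uses the standard `helm upgrade --install` form.

```python
from pathlib import PurePosixPath


def changed_charts(changed_files, charts_dir="charts"):
    """Return the sorted names of charts touched by a set of changed paths.

    Assumes a hypothetical layout of charts/<chart-name>/... in the project.
    """
    names = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only paths inside charts/<chart-name>/ count as a chart change.
        if len(parts) >= 3 and parts[0] == charts_dir:
            names.add(parts[1])
    return sorted(names)


def helm_commands(chart_names, namespace="gitlab-managed-apps",
                  charts_dir="charts"):
    """Build one `helm upgrade --install` argument list per changed chart.

    The namespace default is an assumption made for this example.
    """
    return [
        [
            "helm", "upgrade", "--install", name, f"{charts_dir}/{name}",
            "--namespace", namespace,
            "--values", f"{charts_dir}/{name}/values.yaml",
        ]
        for name in chart_names
    ]
```

A CI job would feed this the list of files changed by the merge (for example from `git diff --name-only`) and run each resulting command. Because the chart and its values live in the repository, every change to what is deployed passes through a merge request.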

Upsides:
Transparency into what is being deployed into the application
Relies on existing change control and review processes (merge requests), standard devops practice.
Ability to customise the configuration in any way necessary, either initially, or later on
Ability to extend the deployment in future

Downsides: Two step process instead of 1

What would constitute an MVP?

As discussed above, I would prefer a focus on diagnosis rather than on mitigation. Some ideas:

Customizable runtime, container image, baked in conda/pip packages etc. This would be much easier and more flexible if we adopted the autodevops for deployments/helm approach discussed above, instead of providing UI for each configurable option.
Jupyter runbook starts with master branch of associated repository checked out
Using MyST or Jupytext to make markdown a first-class data format, for better change control, merge requests, etc
Built in PromQL queries in the Jupyter notebooks, using Cell Magic functions, with built in visualization and data tables, and/or…
Built in ELK queries in the Jupyter notebooks, using Cell Magic functions, with built in visualization and data tables. Example runbook: “Find the top 5 requests by IP over the last 10 minutes using this query”... executing the cell returns a table to the user.
Secret management (CI variables?)
Authentication and authorization policy (who can log in, execute cells, access secrets, etc)
Logging (who ran what, when)

Not required for MVP, but possibly worth thinking about:

GitLab-Pages-like automatic domain registration. Instead of using the current hack of 196.7.0.138.nip.io hostnames and self-signed certificates for https, use automatic DNS. Eg: gitlab-com.gitlab-runbooks.io/runbooks with automated Let’s Encrypt/ACME TLS certificate registration.
Read-only access to anonymous users
Read+execute access, but no modify of the runbook

Walkthrough of Proof of Concept: Tamland

After considering the constraints of the current Jupyter notebook integration, I realised that using this feature for critical active incident situations would be out of the question, but producing a working prototype would still help my understanding of the problem domain, within the following constraints:

No trusted access or secrets at runtime
No time-critical rendering
Favour markdown over JSON
Leverage python for its analytical and visualization features (ie, use runbooks for what they’re best suited for!)

Currently, we use Prometheus queries for capacity planning. This information is presented on the capacity planning dashboard in Grafana https://dashboards.gitlab.net/d/general-capacity-planning/. Unfortunately, Prometheus (by design) lacks the analytical capabilities needed for the predictive modelling that capacity planning requires. Python, with Jupyter notebooks, is a perfect environment for performing this type of analysis.

Using the historical saturation and utilization data that we collect in Prometheus (see https://www.youtube.com/watch?v=swnj6KTRg08 for a talk on how we do this), we perform powerful regression analysis in Python using Facebook’s Prophet library, which uses Stan, a platform for statistical modeling and high-performance statistical computation.
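To give a feel for the kind of question this analysis answers, the sketch below uses a deliberately simple stand-in: an ordinary least-squares straight line, not Prophet. Prophet fits far richer models (seasonality, changepoints, uncertainty intervals); this example only shows the basic idea of extrapolating a saturation trend to a threshold. The function names and the 0.9 threshold are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, saturation, threshold=0.9):
    """Days after the last sample at which the fitted line hits threshold.

    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or falling (it never gets there).
    """
    slope, intercept = fit_line(days, saturation)
    last_day = days[-1]
    if slope * last_day + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_day
```

For example, a component sitting at 50% saturation and growing by one percentage point per day would be reported as roughly 36 days away from a 90% threshold when measured from day 4.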

The proof of concept is called Tamland, named after the weather forecaster in a popular comedy movie. Tamland uses up to 365 days’ worth of data, which it loads from Thanos, to generate a forecast of RPS (requests-per-second) and utilisation trends for each component of the GitLab architecture.
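Loading a year of history from Thanos might look something like the sketch below. It relies on Thanos Query serving the Prometheus-compatible HTTP API (`/api/v1/query_range`); the base URL, the metric name and the helper names are placeholders invented for this example, not Tamland's actual code. The network call itself is left out so that the example stays self-contained.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def range_query_url(base_url, promql, end, days=365, step_seconds=3600):
    """Build a /api/v1/query_range URL covering `days` up to `end`."""
    start = end - timedelta(days=days)
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step_seconds,
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Flatten a query_range response into (labels, timestamp, value) rows."""
    if payload.get("status") != "success":
        raise ValueError(
            f"query failed: {payload.get('error', 'unknown error')}"
        )
    rows = []
    for series in payload["data"]["result"]:
        # Each sample is a [unix_timestamp, "value-as-string"] pair.
        for timestamp, value in series["values"]:
            rows.append((series["metric"], float(timestamp), float(value)))
    return rows


if __name__ == "__main__":
    # Placeholder host and metric name, for illustration only.
    print(range_query_url(
        "https://thanos.example.com",
        'example_component_saturation_ratio{component="cpu"}',
        end=datetime.now(timezone.utc),
    ))
```

An hourly step over 365 days yields 8,760 points per series, which stays under the 11,000-point limit that Prometheus applies to a single range query. A finer step would require splitting the request into several time windows.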

Example screenshots:

An RPS forecast for the workhorse component of the git service, up to 90 days into the future:

A saturation forecast for the Primary CPU component of the Redis Sidekiq host.

Implementation notes:
The project can be found at https://gitlab.com/gitlab-com/gl-infra/tamland
The GitLab Pages static site is at https://gitlab-com.gitlab.io/gl-infra/tamland/intro.html
Jupyter-book is used for generating the notebooks
Notebooks are stored in markdown format, not as JSON blobs
The runbooks are converted into code in GitLab CI
GitLab CI can be used for any secrets management
Forecasts are executed using http://facebook.github.io/prophet/
Analysis can take up to 40 minutes to run, so it makes more sense for this to be scheduled rather than run real-time.
The report is re-generated on a weekly basis on a Thursday morning, and a reminder is sent to #infra_capacity-planning a few hours before the weekly Infrastructure call, giving enough time for any concerns to be raised in that forum in the Capacity Planning segment of the call.
}}