SEG MLOps Update Jan 5th 2022 - Looking Ahead

All Weekly Demos: #16

Recording

Vision

Make GitLab a tool Data Scientists and Machine Learning Engineers love to use.

Mission

Explore and collaborate with different teams to deliver features that improve the user experience for Data Scientists and Machine Learning Engineers, while increasing awareness of these user groups within the company.

Looking Ahead

Time for New Year's resolutions! Today's update is a different one: instead of focusing on what we have done, I am talking about what we want to do in the next few months. We want to tackle a few different things:

User Personas

Issue: #40

Data Scientists and Machine Learning Engineers have not been considered part of GitLab's target audience, as evidenced by the lack of User Personas and User Research for these groups. Creating these is essential to raise awareness within the company that these users do indeed exist.

Rendered Jupyter Diffs

Epic: gitlab-org/gitlab#343024 (closed)

In 14.5 we released Cleaner Jupyter Diffs, a great step forward in user experience for Data Scientists that was really well received by them. But that was just the first step. While easier than before, reviewing notebooks is still difficult: images are shown as base64, which hides what they were meant to display and takes up a lot of space in the diff, making it harder to find what matters.

What if we could review the rendered version of the notebook? What if we could comment on the images and read the markdown and the equations? This is what we are calling Rendered Jupyter Diffs, and it is where I am spending most of my time right now.

What we want to deliver is a diff experience that renders blocks of content, where each block maps to a point in the source code to which comments are attached. This includes:

Ability to toggle between raw and rendered diff (per user feedback)
Comments shared between the raw and the rendered display
Rendering the images
Rendering equations
Rendering Markdown
Rendering Code
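
To make the idea of blocks that map back to the source more concrete, here is a minimal sketch in Python. It is not how GitLab implements the feature: the function name `blocks_with_source_lines` is hypothetical, and it assumes the notebook is stored as indented JSON where every cell object has exactly one line starting with the `"cell_type"` key.

```python
import json


def blocks_with_source_lines(raw_text):
    """Pair each notebook cell with the 1-based line where it appears in the raw file."""
    notebook = json.loads(raw_text)
    # Assumption: in the raw JSON, each cell contributes exactly one line that
    # starts with the "cell_type" key. Cell contents are escaped JSON strings,
    # so they cannot start with an unescaped "cell_type" key themselves.
    anchor_lines = [
        number
        for number, line in enumerate(raw_text.split("\n"), start=1)
        if line.strip().startswith('"cell_type":')
    ]
    cells = notebook["cells"]
    if len(anchor_lines) != len(cells):
        raise ValueError("could not match every cell to a source line")
    return [
        {
            "kind": cell["cell_type"],          # markdown, code, ...
            "source_line": line,                # where a comment would be anchored
            "text": "".join(cell["source"]),    # what a renderer would display
        }
        for cell, line in zip(cells, anchor_lines)
    ]
```

A comment left on a rendered block could then be stored against `source_line`, which is one way the raw and rendered views could share the same comments.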

Glyter: Jupyter + GitLab

Current Epic: &6

Data Scientists use pipelines in a different way. Given how much time and how many resources it usually takes to train a model, pipelines are already important at the CREATE stage in MLOps, before any commit is made. Requiring a commit for every new version of the code considerably increases development time, and using YAML takes them away from their daily tools.

During prototyping, Jupyter is the tool of choice, and with Glyter we aim to enable running Jupyter notebooks on a preconfigured repository, without needing to commit:

glyter run --repo=my_repo my_notebook.ipynb

I believe this can be done without any changes to the current GitLab codebase, relying on the GitLab API + Dynamic Parent-Child pipelines. It's not going to be pretty, but it will work.
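
As a rough illustration of the dynamic child pipeline part, the sketch below generates the YAML for a child pipeline that executes one notebook. This is not Glyter's actual implementation: the function `child_pipeline_yaml`, the job name `run-notebook`, and the default image are all hypothetical, and how the uncommitted notebook reaches the job is left out on purpose. The only external command used is `jupyter nbconvert --to notebook --execute`.

```python
import json
import shlex


def child_pipeline_yaml(notebook_path, image="python:3.11"):
    """Return the text of a child pipeline config that runs a single notebook."""
    # shlex.quote keeps paths with spaces or special characters safe in the shell.
    command = "jupyter nbconvert --to notebook --execute " + shlex.quote(notebook_path)
    lines = [
        "run-notebook:",  # hypothetical job name
        # json.dumps produces a double-quoted string, which is also valid YAML.
        f"  image: {json.dumps(image)}",
        "  script:",
        "    - pip install jupyter",
        f"    - {json.dumps(command)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(child_pipeline_yaml("my_notebook.ipynb"))
```

In a parent-child setup, a parent job would write this text to a file, publish it as an artifact, and a trigger job would include that artifact as the child pipeline configuration.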

Analytics Repository

Epic: &7

When companies grow, so do their Data Science teams. And with this growth, knowledge of past research done within the organization gets lost. This is exacerbated by the fact that most of this knowledge resides in Jupyter Notebooks, forgotten in some lost git repository. The impact is that the same work is repeated every few years.

A solution I have deployed successfully in the past is an Analytics Repository, a wiki for Data Science. It takes the notebooks (markdown or R) and creates a feed for them, adding search and discussion on top, connecting the code to the results. The work I did previously expanded on https://github.com/airbnb/knowledge-repo

This is the most exploratory of the 4 items shared here. We want to create an MVP using GitLab Pages that takes the notebooks within the repository, transforms them, and publishes them. Perhaps even as part of Glyter.
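
To give a feel for the "take the notebooks and create a feed" step, here is a small Python sketch. It is only an assumption about what an MVP could look like: the helper names `notebook_title` and `build_feed` are hypothetical, only `.ipynb` files are handled, and the title is taken from the first markdown heading found in the notebook.

```python
import json
from pathlib import Path


def notebook_title(path):
    """Use the first markdown heading as the title, falling back to the file name."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        # "source" may be a single string or a list of strings; join handles both.
        for line in "".join(cell.get("source", [])).splitlines():
            if line.startswith("#"):
                return line.lstrip("#").strip()
    return Path(path).stem


def build_feed(repo_root):
    """Collect one feed entry per notebook found under repo_root."""
    root = Path(repo_root)
    entries = []
    for path in sorted(root.rglob("*.ipynb")):
        if ".ipynb_checkpoints" in path.parts:
            continue  # skip Jupyter's autosave copies
        entries.append(
            {"title": notebook_title(path), "path": path.relative_to(root).as_posix()}
        )
    return entries
```

A publishing job could turn these entries into an index page and place it, together with the converted notebooks, in the directory that GitLab Pages serves.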

Feedback?

So, what do you think? Which of these items are you excited about? Anything we should be exploring instead?

Edited Jan 05, 2022 by Eduardo Bonet