Implement an SSoT for Datasets: Minimum Viable Product
Problem
- Developers are generally unfamiliar with BigQuery and/or don't have access.
- LangSmith and other tools have been serving needs shared by several teams.
- CEF and LangSmith need a way to share data.
- BigQuery is currently messy. Some controls and documentation are needed to put constraints on the unstructured situation.
- Teams need a self-serve way to add, update, and retrieve datasets.
Solution
Migrate the datasets that are inputs to CEF, and those that are critical inputs for analysis in LangSmith, into a GitLab repository. Since all members of GitLab have access to GitLab repos, this mitigates points 1 and 5. In addition to the migration, provide automated mechanisms to move data from the Git repo to BigQuery and LangSmith and back again. Where automation is not easily possible, provide thorough support documentation until an automated tool can be built, probably in a future iteration.
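To make the repo-to-tool direction concrete, here is a minimal sketch of what a sync step could look like. It assumes a hypothetical layout in which each dataset is a single JSON Lines file under one directory, and it deliberately does not call BigQuery or LangSmith directly: the `uploaders` mapping holds hypothetical callables that a real implementation would back with each tool's client. The function names `load_dataset` and `sync_datasets` are illustrative only and are not part of any existing tooling.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical uploader signature: (dataset_name, records) -> None.
# A real uploader would wrap a BigQuery or LangSmith client.
Uploader = Callable[[str, list], None]


def load_dataset(path: Path) -> list[dict]:
    """Read one JSON Lines file, failing loudly on malformed rows."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            records.append(record)
    return records


def sync_datasets(root: Path, uploaders: dict[str, Uploader]) -> dict[str, int]:
    """Validate every *.jsonl file under root, then hand it to each uploader.

    Returns a mapping of dataset name to the number of records synced.
    """
    counts = {}
    for path in sorted(root.glob("*.jsonl")):
        records = load_dataset(path)  # validate before anything is uploaded
        for upload in uploaders.values():
            upload(path.stem, records)
        counts[path.stem] = len(records)
    return counts
```

Keeping the uploaders pluggable means the same validation runs regardless of the destination, and a destination that cannot be automated yet can simply be left out of the mapping and covered by documentation instead.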
Outstanding Work
Several other great ideas were raised in this issue before it was scoped down. Many of them are excellent initiatives, but to keep the deliverable achievable we are focusing on the scope above and moving those initiatives to future iterations. This also keeps this issue from living forever.