Implement LangSmith to Repository Sync for Datasets
Description
We need a system that syncs datasets from LangSmith to our main repository, so that changes made in LangSmith, where people are actively working, are reflected in our main data storage.
Objectives
- Develop a mechanism to detect changes in LangSmith datasets.
- Create a process to sync these changes to the main repository.
- Ensure the sync process can handle both new datasets and modifications to existing ones.
- Consider versioning for datasets, possibly using tagged commits in the repository.
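One possible way to approach the change-detection objective is to fingerprint each dataset's contents and compare against a manifest recorded at the previous sync. The sketch below is only an illustration under assumptions: the `fingerprint` and `classify_changes` helpers, the manifest, and the `{dataset_name: [example dicts]}` shape are all hypothetical, and how the examples are actually fetched from LangSmith is left out.

```python
import hashlib
import json


def fingerprint(examples):
    """Return a stable SHA-256 hex digest for a list of example dicts."""
    # Serialize each example canonically, then sort the serialized forms
    # so that neither key order nor example order affects the digest.
    canonical = sorted(
        json.dumps(example, sort_keys=True, ensure_ascii=False)
        for example in examples
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def classify_changes(remote, manifest):
    """Compare remote datasets against the last-synced manifest.

    remote:   {dataset_name: [example dicts]} (hypothetical shape of fetched data)
    manifest: {dataset_name: fingerprint} recorded at the previous sync
    """
    changes = {"new": [], "modified": [], "unchanged": []}
    for name in sorted(remote):
        digest = fingerprint(remote[name])
        if name not in manifest:
            changes["new"].append(name)
        elif manifest[name] != digest:
            changes["modified"].append(name)
        else:
            changes["unchanged"].append(name)
    return changes
```

If tagged commits are used for versioning, a shortened fingerprint could plausibly be part of the tag name, but that is a design choice still to be discussed.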
Considerations
- LangSmith doesn't currently have webhooks for dataset changes, so we may need to implement a polling system.
- We need to handle both CSV and JSONL formats, because the LangSmith UI accepts CSV while our upload script uses JSONL.
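To make the format consideration concrete, here is a minimal sketch of converting between CSV and JSONL with the Python standard library. It assumes flat records whose values are plain strings. Nested values or typed columns would need extra handling. The function names are hypothetical and are not taken from our upload script.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert CSV text (with a header row) to JSONL, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in reader)


def jsonl_to_csv(jsonl_text):
    """Convert JSONL text to CSV, using the union of keys as the header."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not rows:
        return ""
    # Collect column names in first-seen order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return out.getvalue()
```

Note that every value read from CSV arrives as a string, so a CSV to JSONL to CSV round trip is stable, while a JSONL file with numbers or nested objects would not round-trip unchanged through this sketch.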
Questions to address
- How frequently should we check for updates in LangSmith?
- How do we handle conflicts if changes are made in both LangSmith and the repository?
- Do we need to implement any approval process before syncing changes to the main repository?
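As input to the conflict question, one common approach is a three-way comparison: keep the fingerprint recorded at the last sync as a base, and compare it with the current LangSmith and repository fingerprints. The sketch below is an assumption about how this could work, not a decision. The `sync_action` name and the returned labels are hypothetical.

```python
def sync_action(base, langsmith, repo):
    """Decide what to do for one dataset given three content fingerprints.

    base:      fingerprint recorded at the last sync (None if never synced)
    langsmith: current fingerprint of the dataset in LangSmith
    repo:      current fingerprint of the dataset in the repository
               (None if the dataset is not in the repository yet)
    """
    if langsmith == repo:
        return "in_sync"
    if repo == base:
        # Only LangSmith changed since the last sync: safe to pull.
        return "pull_from_langsmith"
    if langsmith == base:
        # Only the repository changed since the last sync.
        return "repo_ahead"
    # Both sides changed since the last sync: needs a human decision.
    return "conflict"
```

A `"conflict"` result would be a natural place to hook in an approval step, if we decide one is needed.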
Next steps
- Investigate LangSmith API capabilities for detecting dataset changes.
- Schedule a sync meeting with relevant team members to discuss the best approach.
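While the API investigation is pending, a polling loop could be sketched independently of how changes are actually detected. In the example below, `fetch_state` is a hypothetical callable that returns a version marker per dataset (for example a timestamp or content hash, depending on what the API turns out to expose). The 300-second default interval is a placeholder, since the polling frequency is still an open question.

```python
import time


def poll(fetch_state, on_change, interval_seconds=300, max_polls=None, sleep=time.sleep):
    """Poll for dataset changes and report the names that changed.

    fetch_state: callable returning {dataset_name: version marker} (hypothetical)
    on_change:   callable invoked with a sorted list of changed dataset names
    max_polls:   stop after this many polls; None means run until interrupted
    """
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch_state()
        if previous is not None:
            # New datasets and datasets whose marker differs count as changed.
            # Deleted datasets are not reported by this sketch.
            changed = sorted(
                name for name, marker in current.items()
                if previous.get(name) != marker
            )
            if changed:
                on_change(changed)
        previous = current
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

Passing `sleep` and `max_polls` as parameters keeps the loop easy to test without waiting in real time.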
Related issues
Edited by David O'Regan