Add data seeding for VR
## Problem to solve
In order to evaluate VR quality before deployment, we need to be able to run a VR evaluation in an MR or against a local branch. The Evaluation Runner provides this functionality, but we need to seed the GDK with the required vulnerability data.
However, the VR data set is quite large and the seeding process may take a few hours, so we cannot afford to spin up a fresh GDK and seed data for every evaluation pipeline invocation.
## Proposal
### Approach 1

See https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/138 and https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/140.
We need to build a pre-seeded remote GDK, i.e. a prebuilt Docker image with a seeded GDK. Then, for every commit we want to evaluate, we only update the GDK without touching the seeded data.
- Create a persistent volume in GCE, and attach it to the VM instance running GDK.
- Use the Direct Transfer API for group migration. The data will be persisted on the attached persistent volume.
- Terminating the GDK VM instance will not touch the volume.
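The Direct Transfer migration in the steps above is driven through the Bulk Imports REST API. A minimal sketch of triggering it against the GDK instance, assuming placeholder hostnames, tokens, and group paths (none of these values come from this issue):

```python
# Sketch: start a Direct Transfer (Bulk Imports API) group migration from a
# source GitLab instance into the GDK. All URLs, tokens, and paths below are
# placeholder assumptions for illustration.
import json
import urllib.request


def build_bulk_import_payload(source_url, source_token, group_path, dest_namespace):
    """Build the request body for POST /api/v4/bulk_imports."""
    return {
        "configuration": {"url": source_url, "access_token": source_token},
        "entities": [
            {
                "source_type": "group_entity",
                "source_full_path": group_path,
                "destination_slug": group_path.split("/")[-1],
                "destination_namespace": dest_namespace,
            }
        ],
    }


def start_migration(gdk_url, gdk_token, payload):
    """POST the payload to the destination (GDK) instance."""
    req = urllib.request.Request(
        f"{gdk_url}/api/v4/bulk_imports",
        data=json.dumps(payload).encode(),
        headers={"PRIVATE-TOKEN": gdk_token, "Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

Because the imported data lands on the attached persistent volume, this migration only needs to run when the seed data itself changes, not per evaluation.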
### Approach 2
@tle_gitlab suggested using a DB snapshot. The key advantage of a DB snapshot over a persistent volume is that it won't accumulate state.
- Create a seeded DB snapshot (named after the schema version).
- Create a new job that can run periodically:
  - Seed the VR data using the Bulk Imports API (GitLab production -> test GDK).
  - Create a new DB snapshot.
- The eval job can build a new Docker container from the DB snapshot (we need to verify how long the restore takes).
### Approach 3
After some testing, we found that a plain Postgres DB dump won't work; we need a full GitLab backup dump.
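For a source install like the GDK, a full backup is taken with the standard `gitlab:backup:create` Rake task rather than `pg_dump`. A hedged sketch of invoking it, assuming the job runs from the GDK's `gitlab/` checkout; which components to `SKIP` (e.g. artifacts, registry) is an open choice, not something this issue has settled:

```python
# Sketch: take a full GitLab backup (DB + repositories + uploads, etc.) via
# the standard Rake task, instead of a plain Postgres dump. The SKIP defaults
# below are assumptions to keep the backup small.
import subprocess


def backup_command(skip=("artifacts", "registry")):
    """Build the `gitlab:backup:create` invocation for a source install."""
    cmd = ["bundle", "exec", "rake", "gitlab:backup:create"]
    if skip:
        # Rake treats trailing VAR=value arguments as environment variables.
        cmd.append("SKIP=" + ",".join(skip))
    return cmd


def run_backup(gitlab_dir):
    """Run the backup from the GitLab checkout directory (e.g. <gdk>/gitlab)."""
    subprocess.run(backup_command(), cwd=gitlab_dir, check=True)
```

The resulting backup archive would then be what gets baked into (or restored by) the prebuilt Docker image, in place of the DB-only snapshot from Approach 2.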