Add LLM summaries of flows
What does this MR do and why?
When Flows fail outside the scope of the Duo Agent Platform (e.g. because of a network issue or an invalid Docker image), we don't have top-level visibility into what happened to the flow. In !232583 (merged) we added an error message to the `summary` field of workflows with some details on the error.
In this MR, we create a service that calls an LLM and instructs it to examine the job logs from the failing session and write a short summary of why the failure occurred. This uses the prompt added to the AI Gateway in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!5370 (merged).
This approach is similar to !230577 (merged).
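For readers who want a feel for the flow described above, here is a minimal, self-contained sketch in plain Ruby. It is not the code in this MR: the class names, the `complete` method on the client, the `MAX_LOG_CHARS` limit, and the choice to fall back to the existing generic error message when the LLM call fails are all assumptions made for illustration.

```ruby
# Hypothetical sketch: class, method, and constant names are illustrative
# assumptions, not the names used in this MR.
class SummarizeFailedSessionSketch
  # Assumed limit; only the tail of the job log is sent to the model.
  MAX_LOG_CHARS = 4_000

  # llm_client: any object responding to #complete(prompt) and returning a String.
  def initialize(llm_client:)
    @llm_client = llm_client
  end

  # Returns the LLM summary, or the fallback message when the log is blank,
  # the response is blank, or the client raises.
  def execute(job_log:, fallback:)
    return fallback if job_log.nil? || job_log.strip.empty?

    summary = @llm_client.complete(build_prompt(job_log))
    summary.nil? || summary.strip.empty? ? fallback : summary.strip
  rescue StandardError
    fallback
  end

  private

  # Keeps only the last MAX_LOG_CHARS characters of the log in the prompt.
  def build_prompt(job_log)
    tail = job_log.length > MAX_LOG_CHARS ? job_log[-MAX_LOG_CHARS..] : job_log
    "Summarize in one or two sentences why this session failed.\n\nJob log:\n#{tail}"
  end
end

# Stand-in clients so the sketch runs without network access.
class FakeLlmClient
  def complete(_prompt)
    "The runner could not connect to the Docker daemon."
  end
end

class FailingLlmClient
  def complete(_prompt)
    raise IOError, "gateway unavailable"
  end
end

fallback = "Error during Session: runner_system_failure"
log = "ERROR: Cannot connect to the Docker daemon"

puts SummarizeFailedSessionSketch.new(llm_client: FakeLlmClient.new).execute(job_log: log, fallback: fallback)
puts SummarizeFailedSessionSketch.new(llm_client: FailingLlmClient.new).execute(job_log: log, fallback: fallback)
```

Injecting the client keeps the sketch testable without the AI Gateway; in this sketch a failed LLM call leaves the summary no worse than the generic message from !232583.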
References
How to set up and validate locally
- Make sure you have the latest code in the AI Gateway repo
- Run `Feature.enable :ai_summarize_workflow_sessions` in the Rails console
- Set up Duo Agent Platform
- Set up a failure for your runner. An easy way is to specify an invalid image in `start_workflow_service.rb`, or to stop your Docker daemon, e.g. if you use colima: `colima stop`
- Run a Remote Flow (e.g. issue to MR)
- Confirm the job log has an error and the flow didn't run
- After the failure, check the workflow `summary` field
pry(main)> Ai::DuoWorkflows::Workflow.find(4178).summary
...
=> "The session failed because the GitLab Runner could not connect to the Docker daemon at `unix:///Users/reisner/.colima/default/docker.sock`, indicating that Colima (the local Docker runtime) was not running on the host machine. After three retries, the executor preparation step failed with a system failure, preventing the job from starting."
Before this MR, the summary would contain: "Error during Session: runner_system_failure"
After this MR: an LLM-generated summary of the error that is more informative than the above.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.