Evaluation of Claude 3.7
Execution Plan
- Run CEF with Claude 3.7
- Check that the logs show the correct prompt_version and parameters
- Manual review on a small random sample
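
The log check and the sampling step of the plan above could be scripted roughly as follows. This is a minimal sketch, not the tooling actually used for this evaluation: it assumes each log record is available as a dict with a `prompt_version` key and that judge and human verdicts are plain booleans. The function names (`find_version_mismatches`, `draw_review_sample`, `agreement_rate`) are hypothetical, and the real CEF log schema may differ.

```python
import random
from typing import Iterable, Sequence


def find_version_mismatches(records: Iterable[dict], expected_version: str) -> list[dict]:
    """Return the records whose prompt_version differs from the expected one.

    Assumption: every record is a dict with a "prompt_version" key.
    Records missing the key are reported as mismatches.
    """
    return [r for r in records if r.get("prompt_version") != expected_version]


def draw_review_sample(item_ids: Sequence[str], sample_size: int, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample (without replacement) for manual review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    return rng.sample(list(item_ids), min(sample_size, len(item_ids)))


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of reviewed items where the LLM Judge and the human expert agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

An empty result from `find_version_mismatches(records, "0.0.1-dev")` would indicate that every logged run used the expected prompt version. Seeding the sampler keeps the review set stable if the spreadsheet has to be regenerated.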
Resources
- The dataset used for the manual review is in this spreadsheet.
- Link to the relevant logs
- Prompt used for Claude 3.7 is 0.0.1-dev
Conclusion
- The manual review of a sample subset showed good correlation between the LLM Judge and the human expert
- The LLM Judge shows similar accuracy for Claude 3.5 and Claude 3.7
- This review also uncovered a few pre-existing bugs, which have been reported in Vulnerability Resolution - MR diff patch genera... (&17227)
We are ready to switch to Claude 3.7
Edited by Meir Benayoun