Validate Llama 3 for /explain

This issue tracks adding support and prompting for Llama 3 to the /explain feature.

The following models are prioritized, based on AWS GovCloud Bedrock availability and customer requirements:

Licenses

Validation can be run via the Evaluation Runner. For prompt development guidance and validation, see this documentation for Chat and this documentation for Code Generation. There are also development/validation guidelines here.

Definition of Done

  • Each model has been validated to support the feature on all supported platforms
  • Examine individual inputs and outputs that scored poorly (scores of 1-2); look for and document any patterns of poor feature performance or poor LLM judge calibration. Iterate on the model prompt to eliminate patterns of poor performance.
  • Achieve fewer than 20% poor answers (scores of 1 and 2 are defined as poor answers) with each supported model, in the areas for which supporting validation datasets exist.
  • Quality results, based on LLM judge scores (1-4) and/or cosine similarity, are recorded as distributions in this issue's comments. For LLM judge scores, this means buckets of 1s, 2s, 3s, and 4s. For cosine similarity scores, this means buckets of 0.9 and above, 0.8-0.89, 0.7-0.79, and so on.
  • The traffic light system for self-hosted models has been updated to include scores, and the documentation has been updated to reflect any changes
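
The score bucketing described above can be sketched in a few lines of Python. This is a minimal illustration (the function names and bucket labels are assumptions, not part of the Evaluation Runner): it bins LLM judge scores 1-4 into a distribution, computes the poor-answer rate against the 20% threshold, and groups cosine similarities into 0.1-wide bands with 0.9+ as the top band.

```python
from collections import Counter

def judge_distribution(scores):
    """Bucket LLM judge scores (1-4) into counts per score."""
    dist = Counter(scores)
    return {s: dist.get(s, 0) for s in (1, 2, 3, 4)}

def poor_answer_rate(scores):
    """Fraction of scores that are 1 or 2 (defined as poor answers);
    the Definition of Done requires this to stay below 0.20."""
    return sum(1 for s in scores if s <= 2) / len(scores)

def cosine_distribution(sims):
    """Bucket cosine similarities into bands: 0.9+, 0.8-0.89, 0.7-0.79, ..."""
    buckets = Counter()
    for s in sims:
        if s >= 0.9:
            buckets["0.9+"] += 1
        else:
            lo = int(s * 10) / 10  # lower edge of the 0.1-wide band
            buckets[f"{lo:.1f}-{lo + 0.09:.2f}"] += 1
    return dict(buckets)
```

For example, `judge_distribution([1, 2, 3, 4, 4, 3])` yields `{1: 1, 2: 1, 3: 2, 4: 2}`, and the corresponding poor-answer rate is 2/6, which would fail the 20% bar.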
Edited by Bruno Cardoso