Prompt Engineering Notes

In order to optimize performance for Duo Features on a smaller OS model, we will need to investigate prompting for our curated OS models.

Current base prompting simple = {question}

Current base Claude 2 format

Provider	Model	Description	Link
Mistral	Mistral 7B	system prompts to add guardrails to Mistral	https://docs.mistral.ai/platform/guardrailing/
Mistral		getting started with prompting	https://docs.mistral.ai/guides/prompting-capabilities/
Mistral		RAG	https://docs.mistral.ai/guides/basic-RAG/

This document is intended to capture combined notes on prompt engineering and how to best implement prompt engineering for model customization.

Immediate Possible Approaches

enable few-shot prompting
1. provide input areas for a user to describe their examples
2. provide input areas for the examples
3. auto-format the examples and insert into prompt using <example></example> to wrap them
4. suggest 3-5 examples for each type
enable other XML tags
enable function calling
1. allow Claude to interact with external client-side tools and functions
2. process
  1. user defined a function by providing Claude with a description of the function wrapped in XML tags.
    
    The description should include:
    - The function name
    - A plaintext explanation of what the function does
    - The expected parameters, their types, and descriptions
    - The return values and types
    - Any exceptions that can be raised
  2. The function definitions and user question are both passed to Claude in a single prompt Claude not only needs the tools & their descriptions in order to successfully decide whether to use the tools, but likely also accompanying examples of situations in which such tools ought to be used, depending on the complexity of the use case and tools.
  3. Claude assesses the user's question and decides which function(s) to call and with what arguments
  4. Claude constructs a properly formatted function call
  5. The function call is intercepted via client code with a clear stop_sequence, and the actual function is executed on the client side
  6. The function result is passed back to Claude
  7. Claude uses the function result to formulate its final response to the user
3. Anthropic has a early alpha tool use SDK that abstracts away the XML and lets you define and call functions using Python syntax
enable chain of though (CoT) using templates

Getting Started with Prompt Engineering

How LLMs are Trained, and How This Effects Prompt Engineering

Prompt Engineering Factors

context window = amount of text a language model can look back on and reference when generating new text

latency = factors that can affect latency include model size, hardware capabilities, network conditions, and the complexity of the prompt and the generated response.

temperature = a parameter that controls the randomness of a model's predictions during text generation. Higher temperatures lead to more creative and diverse outputs, allowing for multiple variations in phrasing and, in the case of fiction, variation in answers as well. Lower temperatures result in more conservative and deterministic outputs that stick to the most probable phrasing and answers. Adjusting the temperature enables users to encourage a language model to explore rare, uncommon, or surprising word choices and sequences, rather than only selecting the most likely predictions. Using a non-zero temperature when generating responses allows for some variation in answers while maintaining coherence and relevance.

tokens = the smallest individual units of a language model, and can correspond to words, subwords, characters, or even bytes (in the case of Unicode). For Claude, a token approximately represents 3.5 English characters, though the exact number can vary depending on the language used. Tokens are typically hidden when interacting with language models at the "text" level but become relevant when examining the exact inputs and outputs of a language model. When an LLM is provided with text to evaluate, the text (consisting of a series of characters) is encoded into a series of tokens for the model to process. Larger tokens enable data efficiency during inference and pretraining (and are utilized when possible), while smaller tokens allow a model to handle uncommon or never-before-seen words. The choice of tokenization method can impact the model's performance, vocabulary size, and ability to handle out-of-vocabulary words.

Key Steps for Any Model

Before Hand

define tasks and success criteria
well-defined test cases
consistent grading rubric
establish a performance ceiling

Techniques

Provide clear instructions and context:

Just as when you instruct a human for the first time on a task, the more you explain exactly what you want in a straightforward manner, the better and more accurate the LLM's response will be.
use numbered lists or bullet points > this formatting makes it easier for LLM to follow instructions
be specific about your desired output
golden rule = could a human follow your instructions to provide your desired output?

Use examples in your prompts to illustrate the desired output format or style. Provide examples that are:

relevant (closely resemble the input/output you desire)
diverse (cover different scenarios, edge cases, and potential challenges_
clear
- to help with clarity use formatting tags like <example> to structure the examples and distinguish them from the rest of the prompt
- provide the LLM with context on what kind of examples you are providing.

role prompting -- prime the LLM to inhabit a specific role (like that of an expert) in order to increase performance for your use case

be specific, provide clear and detailed context about the role you want the LLM to play

XML tags to structure prompts and responses for greater clarity

By wrapping key parts of your prompt (such as instructions, examples, or input data) in XML tags (angle-bracket tags like <tag></tag>), you can help the LLM better understand the context and generate more accurate outputs. This technique is especially useful when working with complex prompts or variable inputs.
The tag name can be anything you like, as long as it's wrapped in angle brackets, although we recommend naming your tags something contextually relevant to the content it's wrapped around.
XML tags should always be referred to in pairs and never as just as the first half of a set
There is no canonical best set of XML tag names that LLMs performs particularly well with. For example, <doc> works just as well as <document>. The only time you need very specific XML tag names is in the case of function calling.
You can and should nest XML tags, although more than five layers of nesting may decrease performance depending on the complexity of the use case.

Chain Prompts: Divide complex tasks into smaller, manageable steps for better results

The more tasks you have an LLM handle in a single prompt, the more liable it is to drop something or perform any single task less well. Thus, for complex tasks that require multiple steps or subtasks, break those tasks down into subtasks and chaining prompts to ensure highest quality performance at every step. Prompt chaining involves using the output from one prompt as the input for another prompt. By chaining prompts together, you can guide Claude through a series of smaller, more manageable tasks to ultimately achieve a complex goal.
Use cases include: multi-step tasks, complex instructions, verifying outputs, and parallel processing

Thinking: Encourage step-by-step thinking to improve the quality of the LLM's output, also called chain-of-thought (CoT)

By explicitly instructing the LLM to think step-by-step, you encourage a more methodical and thorough approach to problem-solving. It's important to note that thinking cannot happen without output! Claude must output its thinking in order to actually "think."
The simplest way to encourage thinking step-by-step is to include the phrase "Think step by step" in your prompt. For more complex queries, you can guide the LLM's thinking by specifying the steps it should take.
Separate the thought process from the final response with XML tags. Instruct the LLM to put thought process inside <thinking> tags and the ultimate answer within <answer> tags
Prompting for step-by-step reasoning will increase the length of Claude's outputs, which can impact latency. Consider this tradeoff when deciding whether to use this technique.

Prefill response: Start LLM's response with a few words to guide its output in the desired direction

use Assistant message when making an API call (for Claude) to prefill message (ie {

control output format: Specify the desired output format to ensure consistency and readability

Prompt Elements

task definition, characteristics of a good response, any necessary context, examples of canonical input/output, restraints

Response Elements

preamble

Things to Consider

optimizations like shorter prompts or smaller models to reduce latency and costs as needed

CLAUDE

Anthropic Models

What is prompt engineering?

Prompt engineering is an empirical science that involves iterating and testing prompts to optimize performance. Most of the effort spent in the prompt engineering cycle is not actually in writing prompts. Rather, the majority of prompt engineering time is spent developing a strong set of evaluations, followed by testing and iterating against those evals.

The prompt development lifecycle

We recommend a principled, test-driven-development approach to ensure optimal prompt performance. Let's walk through the key high level process we use when developing prompts for a task, as illustrated in the accompanying diagram.

Define the task and success criteria: The first and most crucial step is to clearly define the specific task you want Claude to perform. This could be anything from entity extraction, question answering, or text summarization to more complex tasks like code generation or creative writing. Once you have a well-defined task, establish the success criteria that will guide your evaluation and optimization process.

Key success criteria to consider include:
- Performance and accuracy: How well does the model need to perform on the task?
- Latency: What is the acceptable response time for the model? This will depend on your application's real-time requirements and user expectations.
- Price: What is your budget for running the model? Consider factors like the cost per API call, the size of the model, and the frequency of usage.
Having clear, measurable success criteria from the outset will help you make informed decisions throughout the adoption process and ensure that you're optimizing for the right goals.
Develop test cases: With your task and success criteria defined, the next step is to create a diverse set of test cases that cover the intended use cases for your application. These should include both typical examples and edge cases to ensure your prompts are robust. Having well-defined test cases upfront will enable you to objectively measure the performance of your prompts against your success criteria.
Engineer the preliminary prompt: Next, craft an initial prompt that outlines the task definition, characteristics of a good response, and any necessary context for Claude. Ideally you should add some examples of canonical inputs and outputs for Claude to follow. This preliminary prompt will serve as the starting point for refinement.
Test prompt against test cases: Feed your test cases into Claude using the preliminary prompt. Carefully evaluate the model's responses against your expected outputs and success criteria. Use a consistent grading rubric, whether it's human evaluation, comparison to an answer key, or even another instance of Claude’s judgement based on a rubric. The key is to have a systematic way to assess performance.
Refine prompt: Based on the results from step 4, iteratively refine your prompt to improve performance on the test cases and better meet your success criteria. This may involve adding clarifications, examples, or constraints to guide Claude's behavior. Be cautious not to overly optimize for a narrow set of inputs, as this can lead to overfitting and poor generalization.
Ship the polished prompt: Once you've arrived at a prompt that performs well across your test cases and meets your success criteria, it's time to deploy it in your application. Monitor the model's performance in the wild and be prepared to make further refinements as needed. Edge cases may crop up that weren't anticipated in your initial test set.

Throughout this process, it's worth starting with the most capable model and unconstrained prompt length to establish a performance ceiling. Once you've achieved the desired output quality, you can then experiment with optimizations like shorter prompts or smaller models to reduce latency and costs as needed.

By following this test-driven methodology and carefully defining your task and success criteria upfront, you'll be well on your way to harnessing the power of Claude for your specific use case. If you invest time in designing robust test cases and prompts, you'll reap the benefits in terms of model performance and maintainability.

GPT/KAPPA

Control over prompting and limited control over retrieval to build functionality/applications outside the standards Duo behavior.

Request Body Parameters

Parameter	Type	Default
message	Message[]	-
persist_answer	boolean	true
use_retrieval	boolean	true
retrieval_query	string	-

Message Type

The messages submitted represent the prompting. Message objects have the following structure:

Parameter	Type	Description
role	string	system, user, assistant (will vary by LLM)
content	string

system: There can be only one system message per prompt. It is used to set the behavior of the assistant at the start of the conversation.

user: the messages represent the input from the user. You should write your instructions as user messages.

assistant: Assistant messages represent responses generated by the AI.

query: There can be only one query message. The query message is used for retrieval if no retrieval_query is given and persisted along the answer if persist_answer is true. It is treated as a user message when sent to GPT-4.

context: Ther can be only one context message. The context message is a placeholder for the retrieval context to be inserted.

Example Request Body { "persist_answer": true, "use_retrieval:": true, "retrieval_query": "What are the most recent blog articles for our database?", "messages": [ { "role": "system", "content": "You are a smart sales person for a database", }, { "role": "user", "content": "You are given a few recent blog articles. Please use them to write an outbound email template targeted at CTOs.", }, { "role": "context", }, { "role": "user", "content": "Sales Template:" } ] }

What happens when I submit a Request? When a user submits a request kapa will perform the following steps:

kapa performs semantic search over your knowledge sources using the content of the query message. If a retrieval_query is specified it is used instead. kapa replaces the context message, with user messages containing the relevant context it found during retrieval. The query message is converted into a user message. All messages are sent to Openai. If persist_answer is true the query message is persisted for analytics along the generated answer. Response Body The following response body is returned for each request.

Parameter Type Description answer Message[] The generated answer thread_id boolean The id of the created thread which the query is part of question_answer_id boolean The id of the created query answer pair messages Message[] The final messages messages sent to Openai after retrieval Previous Conversation API Next Feedback API API Route Request Body Message Type Example Request Body What happens when I submit a Request? Response Body Copyright © 2024 kapa.ai, Inc.

References

Prompt Engineering Session with AI Model Val

Edited May 03, 2024 by Susie Bitters