[Prompt Engineering] Include code snippets in the prompt
GitHub recently published a blog post explaining the basic prompt engineering techniques that power their Copilot solution - https://github.blog/2023-07-17-prompt-engineering-guide-generative-ai-llms/. One of the techniques is to chunk the n open files into snippets of 60 lines each and add those snippets to the prompt as examples inside a comment. Since the total size of the snippets is likely to exceed the maximum prompt size, the authors select m "appropriate" snippets based on the Jaccard similarity metric, which is cheap to compute.
Overall, it looks like we are trying to implement almost the same logic in #174 and #170 (closed). However, our changes involve high complexity (due to language-dependent transformations) and a large number of assumptions (due to differences in language grammars), which makes both the source code and the overall solution error-prone.
I'd propose focusing first on the algorithm GitHub already uses, which has the following benefits:
- almost language-independent (we need to get the context of the cursor only, no need to move functions from suffix, etc) - #237 (closed)
- fast to implement, we already have all required components
- a high probability of model adaptation to the project style based on native non-synthetic examples
- easier rollback than language-dependent approaches
By default, the chunking algorithm is pretty simple: we split the input file into consecutive chunks of 60 lines. Over time, we can make the chunking algorithm more intelligent by considering classes, functions, and other symbols. We already have a great MR from @HongtaoYang that might be reused - !256 (merged).
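As a rough illustration of the default chunking, here is a minimal sketch in Python. The function name `chunk_lines` and the `chunk_size` parameter are hypothetical and not taken from our codebase; the only detail from the description is the 60-line chunk size.

```python
from typing import List


def chunk_lines(source: str, chunk_size: int = 60) -> List[str]:
    """Split source text into consecutive chunks of at most `chunk_size` lines.

    Hypothetical helper: the last chunk may be shorter than `chunk_size`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    lines = source.splitlines()
    return [
        "\n".join(lines[start:start + chunk_size])
        for start in range(0, len(lines), chunk_size)
    ]
```

A symbol-aware chunker (classes, functions) could later replace this function without changing the rest of the pipeline, as long as it still returns a list of text snippets.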
Algorithm
A brief description of the algorithm, adapted to a single input file:
- Extract imports and add them to the prompt - already done
- Get the context of the cursor from the prefix/suffix - #237 (closed)
- Chunk the input file (without imports and the context identified above) into m snippets of 60 lines each.
- Tokenize the input data - tokenization is already implemented
- Calculate the Jaccard similarity based on the tokenized data and select m snippets similar to the context of the cursor. (need to estimate a threshold)
- Add appropriate snippets to the prompt as a comment - we already add file metadata as a comment
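The selection and comment steps could look roughly like the sketch below. Everything here is an assumption for illustration: `tokenize` is a regex stand-in for the tokenizer we already have, the names `jaccard`, `select_snippets`, and `as_comment` are hypothetical, and the `threshold` default is a placeholder because the real value still needs to be estimated.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    # Stand-in tokenizer (assumption): the set of identifier-like words.
    # The tokenization we already have would replace this.
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_snippets(
    context: str,
    snippets: List[str],
    max_snippets: int = 2,
    threshold: float = 0.1,  # placeholder value, still to be estimated
) -> List[str]:
    # Score every snippet against the cursor context, drop those below the
    # threshold, and keep the highest-scoring ones.
    context_tokens = tokenize(context)
    scored = [(jaccard(context_tokens, tokenize(s)), s) for s in snippets]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in kept[:max_snippets]]


def as_comment(snippet: str, prefix: str = "# ") -> str:
    # Prefix every line so the snippet can be embedded in the prompt as a comment.
    # The prefix would depend on the language of the file being completed.
    return "\n".join(prefix + line for line in snippet.splitlines())
```

For example, with the cursor context `def add(a, b): return a + b`, a snippet defining `sub(a, b)` scores higher than an unrelated `print` call, so only the former would be commented and added to the prompt.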
cc @m_gill @wayne @stanhu @bastirehm @bcardoso- @HongtaoYang