Update LLM deployment tutorial
LLM infrastructure moves quickly, so the tutorial needs a refresh every few months.
Updates:
- Shifted from vLLM serving a 1.3B model on g3.small to llama.cpp serving quantized Llama 3.1 8B Q3_K_M on g3.medium (g3.small retired)
- Added explicit model sizing table (g3.medium/g3.large/g3.xl) with approximate VRAM
- Replaced the installation flow: use the centrally provided Miniforge module (drop the manual installer) and add separate conda envs
- Added CUDA-enabled build instructions for llama-cpp-python (CMAKE_ARGS with arch 80) and pin version 0.3.16
- Added Hugging Face auth + GGUF model download steps (QuantFactory repo) and rationale for quantization
- Added guidance on full GPU offload, plus a VRAM / KV cache explanation and mitigation strategies for OOM
- Replaced the vLLM systemd service example with a new unit for llama.cpp; simplified PATH handling using module + conda run
- Updated Open WebUI service unit and clarified environment activation
- Updated Caddy instructions (use sensible-editor, add note on finding hostname)
- Added security/admin notes for first user creation and API connection (OpenAI-compatible endpoint)
- Removed old quantization appendix specific to vLLM; added new 'Scaling up or changing models' section with upgrade paths
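The new llama.cpp service unit with `conda run` could be sketched like this; the unit name, env name, model path, and user are assumptions, not the tutorial's exact values:

```ini
# /etc/systemd/system/llama-cpp.service -- sketch only; names and paths
# below are assumptions, not the tutorial's exact values.
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
User=llama
# 'conda run' avoids sourcing activation scripts inside the unit.
ExecStart=/opt/miniforge/bin/conda run -n llama python -m llama_cpp.server \
    --model /srv/models/Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf \
    --n_gpu_layers -1 --host 127.0.0.1 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

`--n_gpu_layers -1` requests full GPU offload of all transformer layers.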
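The CUDA-enabled build step above can be sketched roughly as follows. The `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` flag names are assumptions based on recent llama.cpp builds and may need adjusting for your toolchain; arch 80 targets compute capability 8.0 (A100-class GPUs):

```shell
# Build llama-cpp-python from source against CUDA for compute capability 8.0.
# Flag names are assumptions; check the llama.cpp CMake options for your version.
export CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=80"
export FORCE_CMAKE=1
pip install --no-cache-dir llama-cpp-python==0.3.16
```

Pinning 0.3.16 keeps the server API stable between tutorial refreshes.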
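The Hugging Face auth and GGUF download steps could look like the sketch below. The exact repo and file names under QuantFactory are assumptions and should be verified on the Hub before use:

```shell
# Authenticate once with a read-access token, then fetch the Q3_K_M GGUF.
# Repo and file names below are assumptions; verify them on the Hub first.
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download QuantFactory/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct.Q3_K_M.gguf --local-dir ~/models
```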
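The VRAM / KV cache reasoning can be made concrete with a back-of-the-envelope calculation. This sketch uses the published Llama 3.1 8B shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) and fp16 cache entries; the helper function name is ours:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: two tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
per_token = kv_cache_bytes(32, 8, 128, 1)       # 131072 bytes = 128 KiB per token
full_ctx = kv_cache_bytes(32, 8, 128, 8192)     # exactly 1 GiB at 8192 tokens
print(per_token, full_ctx / 2**30)
```

With the quantized weights at roughly 4 GB for Q3_K_M plus about 1 GiB of KV cache at an 8k context, full offload fits comfortably on a mid-size GPU; shrinking the context window is the first OOM mitigation to try.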