Update LLM deployment tutorial

LLM infrastructure moves quickly, so this tutorial needs to be refreshed every few months.

Updates:

  • Shifted from vLLM serving a 1.3B model on g3.small to llama.cpp serving quantized Llama 3.1 8B (Q3_K_M) on g3.medium (the g3.small flavor has been retired)
  • Added explicit model sizing table (g3.medium/g3.large/g3.xl) with approximate VRAM
  • Replaced the installation flow: use the centrally provided Miniforge module instead of the manual installer, and create separate conda envs
  • Added CUDA-enabled build instructions for llama-cpp-python (CMAKE_ARGS with CUDA arch 80) and pinned llama-cpp-python to version 0.3.16
  • Added Hugging Face auth + GGUF model download steps (QuantFactory repo) and rationale for quantization
  • Added guidance on full GPU offload, plus a VRAM / KV cache explanation and mitigation strategies for out-of-memory errors
  • Replaced the vLLM systemd service example with a new service unit for llama.cpp; simplified PATH handling using the environment module plus conda run
  • Updated Open WebUI service unit and clarified environment activation
  • Updated Caddy instructions (use sensible-editor; added a note on finding the hostname)
  • Added security/admin notes on first-user creation and API connection (OpenAI-compatible endpoint)
  • Removed the old vLLM-specific quantization appendix; added a new 'Scaling up or changing models' section with upgrade paths
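For reviewers, the new CUDA-enabled build step looks roughly like the sketch below. The `GGML_CUDA` flag name follows current llama-cpp-python packaging and compute capability 8.0 matches an A100-class GPU; both should be checked against the flavor actually deployed:

```shell
# Hedged sketch: build llama-cpp-python with CUDA kernels for an
# A100-class GPU (compute capability 8.0 -> CMAKE_CUDA_ARCHITECTURES=80),
# pinned to the version named in this MR.
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=80" \
  pip install --no-cache-dir llama-cpp-python==0.3.16
```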
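The Hugging Face auth and GGUF download steps can be sketched as follows; the exact QuantFactory repo and file names are assumptions, so confirm them against the repo's file listing before merging:

```shell
# Hedged sketch: authenticate, then fetch only the Q3_K_M quant.
# Repo and file names below are assumed -- verify on huggingface.co.
huggingface-cli login   # paste a read-scoped access token when prompted
huggingface-cli download QuantFactory/Meta-Llama-3.1-8B-GGUF \
  Meta-Llama-3.1-8B.Q3_K_M.gguf --local-dir ~/models
```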
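The replacement llama.cpp service unit is along these lines; the user name, env name, model path, and port are placeholders for illustration, not the values in the tutorial:

```ini
# Hedged sketch of the llama.cpp systemd unit; all paths and names are
# placeholders. "conda run -n <env>" avoids hand-editing PATH in the unit.
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
User=llama
ExecStart=/usr/bin/conda run -n llama-env python -m llama_cpp.server \
  --model /home/llama/models/Meta-Llama-3.1-8B.Q3_K_M.gguf \
  --n_gpu_layers -1 --host 127.0.0.1 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```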
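The Caddy change amounts to a small reverse-proxy stanza like this sketch; the hostname and upstream port are placeholders (the port depends on how Open WebUI is run):

```
# Hedged Caddyfile sketch; hostname and port are placeholders.
your-instance.example.org {
    reverse_proxy localhost:8080
}
```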
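Since llama.cpp exposes an OpenAI-compatible endpoint, the API connection note can be demonstrated with the standard OpenAI client pointed at the local server. The base URL and dummy API key below are placeholders under the assumption the server listens locally on port 8000:

```python
# Hedged sketch: query the local llama.cpp server via its OpenAI-compatible
# /v1 API. Base URL, port, and api_key value are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local",  # llama.cpp serves one model; the name is largely cosmetic
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```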
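The VRAM / KV cache explanation in the tutorial rests on a back-of-envelope calculation like the one below. The architecture constants (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) are the commonly cited figures for Llama 3.1 8B, stated here as assumptions:

```python
# Back-of-envelope KV-cache sizing for Llama 3.1 8B (assumed architecture:
# 32 layers, 8 KV heads, head_dim 128, fp16 cache = 2 bytes per value).
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2

def kv_cache_bytes(n_ctx: int) -> int:
    # Factor of 2 covers the separate K and V tensors, per layer, per token.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * n_ctx

print(kv_cache_bytes(1))            # 131072 bytes = 128 KiB per cached token
print(kv_cache_bytes(8192) / 2**30) # 1.0 -> ~1 GiB at an 8192-token context
```

This is why an OOM mitigation can be as simple as lowering the context length: cache size scales linearly with `n_ctx`.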

@cmart
