Update LLM deployment tutorial

LLM infrastructure moves quickly, so this tutorial needs to be refreshed every few months.

Updates:

  • Shifted from vLLM serving a 1.3B model on g3.small to llama.cpp serving quantized Llama 3.1 8B (Q3_K_M) on g3.medium (the g3.small flavor has been retired)
  • Added explicit model sizing table (g3.medium/g3.large/g3.xl) with approximate VRAM
  • Replaced the installation flow: use the centrally provided Miniforge module instead of the manual installer, and create separate conda envs
  • Added CUDA-enabled build instructions for llama-cpp-python (CMAKE_ARGS with CUDA arch 80) and pinned llama-cpp-python to version 0.3.16
  • Added Hugging Face auth + GGUF model download steps (QuantFactory repo) and rationale for quantization
  • Added guidance on full GPU offload, plus a VRAM / KV cache explanation and mitigation strategies for out-of-memory errors
  • Replaced the vLLM systemd service example with a new service unit for llama.cpp; simplified PATH handling using the environment module plus conda run
  • Updated Open WebUI service unit and clarified environment activation
  • Updated Caddy instructions (use sensible-editor; added a note on finding the hostname)
  • Added security/admin notes on first-user creation and API connection (OpenAI-compatible endpoint)
  • Removed the old vLLM-specific quantization appendix; added a new 'Scaling up or changing models' section with upgrade paths
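For reviewers, the new CUDA-enabled build step looks roughly like the sketch below. The `GGML_CUDA` flag name follows current llama-cpp-python packaging and compute capability 8.0 matches an A100-class GPU; both should be checked against the flavor actually deployed:

```shell
# Hedged sketch: build llama-cpp-python with CUDA kernels for an
# A100-class GPU (compute capability 8.0 -> CMAKE_CUDA_ARCHITECTURES=80),
# pinned to the version named in this MR.
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=80" \
  pip install --no-cache-dir llama-cpp-python==0.3.16
```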
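The Hugging Face auth and GGUF download steps can be sketched as follows; the exact QuantFactory repo and file names are assumptions, so confirm them against the repo's file listing before merging:

```shell
# Hedged sketch: authenticate, then fetch only the Q3_K_M quant.
# Repo and file names below are assumed -- verify on huggingface.co.
huggingface-cli login   # paste a read-scoped access token when prompted
huggingface-cli download QuantFactory/Meta-Llama-3.1-8B-GGUF \
  Meta-Llama-3.1-8B.Q3_K_M.gguf --local-dir ~/models
```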
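The replacement llama.cpp service unit is along these lines; the user name, env name, model path, and port are placeholders for illustration, not the values in the tutorial:

```ini
# Hedged sketch of the llama.cpp systemd unit; all paths and names are
# placeholders. "conda run -n <env>" avoids hand-editing PATH in the unit.
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target

[Service]
User=llama
ExecStart=/usr/bin/conda run -n llama-env python -m llama_cpp.server \
  --model /home/llama/models/Meta-Llama-3.1-8B.Q3_K_M.gguf \
  --n_gpu_layers -1 --host 127.0.0.1 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```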
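The Caddy change amounts to a small reverse-proxy stanza like this sketch; the hostname and upstream port are placeholders (the port depends on how Open WebUI is run):

```
# Hedged Caddyfile sketch; hostname and port are placeholders.
your-instance.example.org {
    reverse_proxy localhost:8080
}
```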
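Since llama.cpp exposes an OpenAI-compatible endpoint, the API connection note can be demonstrated with the standard OpenAI client pointed at the local server. The base URL and dummy API key below are placeholders under the assumption the server listens locally on port 8000:

```python
# Hedged sketch: query the local llama.cpp server via its OpenAI-compatible
# /v1 API. Base URL, port, and api_key value are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local",  # llama.cpp serves one model; the name is largely cosmetic
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```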
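The VRAM / KV cache explanation in the tutorial rests on a back-of-envelope calculation like the one below. The architecture constants (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) are the commonly cited figures for Llama 3.1 8B, stated here as assumptions:

```python
# Back-of-envelope KV-cache sizing for Llama 3.1 8B (assumed architecture:
# 32 layers, 8 KV heads, head_dim 128, fp16 cache = 2 bytes per value).
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2

def kv_cache_bytes(n_ctx: int) -> int:
    # Factor of 2 covers the separate K and V tensors, per layer, per token.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * n_ctx

print(kv_cache_bytes(1))            # 131072 bytes = 128 KiB per cached token
print(kv_cache_bytes(8192) / 2**30) # 1.0 -> ~1 GiB at an 8192-token context
```

This is why an OOM mitigation can be as simple as lowering the context length: cache size scales linearly with `n_ctx`.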

@cmart
