Model Lifecycle
When RELAY_MODEL_LIFECYCLE_ENABLED=true, Relay manages local model servers automatically.
How It Works
- Lazy loading: a model starts on the first request, not at boot
- Auto-shutdown: models unload after RELAY_MODEL_IDLE_SHUTDOWN_MS of inactivity (default 1 hour)
- Eager switching: when a client requests a different model, the old one is killed before the new one starts (keeps VRAM free)
- Session-aware context: a new session-id header triggers a model restart to clear the KV cache
- Orphan cleanup: kills stale llama-server processes from previous Relay instances on startup
- Circuit breaker: stops retrying models that fail repeatedly
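As a rough illustration of the settings named above, the lifecycle could be enabled from a shell like this. The two variable names come from this section. Reading them from the process environment is an assumption, and `hours_to_ms` is a hypothetical helper that only saves you from typing the millisecond value by hand.

```shell
# Hypothetical helper: convert whole hours to milliseconds,
# the unit implied by the _MS suffix of RELAY_MODEL_IDLE_SHUTDOWN_MS.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Turn on automatic model management.
export RELAY_MODEL_LIFECYCLE_ENABLED=true

# One hour of inactivity, which this section gives as the default.
export RELAY_MODEL_IDLE_SHUTDOWN_MS="$(hours_to_ms 1)"
```

A shorter timeout frees VRAM sooner, at the cost of more cold starts on the next request.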
Port Allocation
Each model gets a dedicated port starting from RELAY_MODEL_PORT_BASE (default 8081). Relay routes requests to the correct port automatically. You can pin a model to a specific port by adding "port": 8085 to its model map entry.
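The sketch below shows one way ports could be handed out from the base port. It assumes unpinned models simply take consecutive ports in order. Relay's actual allocation strategy is not described here and may differ, for example in how it avoids a pinned port. `port_for_index` is a hypothetical name.

```shell
# Base port; 8081 is the documented default for RELAY_MODEL_PORT_BASE.
RELAY_MODEL_PORT_BASE="${RELAY_MODEL_PORT_BASE:-8081}"

# port_for_index N: print the port for the Nth model (0-based).
# Assumption: consecutive ports, no collision handling for pinned ports.
port_for_index() {
  echo $(( $RELAY_MODEL_PORT_BASE + $1 ))
}
```

With the default base, `port_for_index 0` prints 8081 and `port_for_index 2` prints 8083.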
Session-Aware Context Clearing
Send a session-id header with requests:
curl -H "session-id: my-project" http://127.0.0.1:1234/v1/chat/completions ...
When the session ID changes (e.g. you switch from project-a to project-b), Relay restarts the model. This clears the KV cache so conversation state doesn't leak between sessions. If you don't send a session ID, the model keeps its context across requests.
Headers checked (first match wins): session-id, session_id, x-session-affinity, x-client-request-id.
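To make "first match wins" concrete, here is a small sketch of the lookup. It assumes the list above is a priority order and that header names are compared case-insensitively (as HTTP header names are). It is not Relay's actual code, and `pick_session_id` is a hypothetical name.

```shell
# pick_session_id: read "Name: value" header lines on stdin and print the
# value of the first header found, trying names in the documented order.
# Returns 1 if none of the four headers is present.
pick_session_id() {
  headers="$(cat)"
  for name in session-id session_id x-session-affinity x-client-request-id; do
    # Compare header names case-insensitively; stop at the first hit.
    value="$(printf '%s\n' "$headers" |
      awk -F': *' -v n="$name" 'tolower($1) == n { print $2; exit }')"
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return 0
    fi
  done
  return 1
}
```

For example, a request carrying both `x-client-request-id: req-9` and `session-id: project-a` resolves to `project-a`, because `session-id` is tried first.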
Request Serialization
When RELAY_SERIALIZE_REQUESTS=true, Relay processes one request at a time in first-come, first-served (FCFS) order; additional requests wait in a queue. This prevents thrashing when multiple agents hit the same model simultaneously.
Start Scripts
Model start scripts are generated by the setup wizard (relay setup) or the provision command (relay provision). Each script:
- Is a standalone shell script in start-scripts/
- Accepts the LLAMA_PORT env var for dynamic port allocation
- Includes optimal flags from hardware detection (GPU layers, KV cache type, MoE offloading, --jinja)
- Includes mmproj files for vision models and draft models for speculative decoding
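The LLAMA_PORT handling relies on standard shell parameter expansion, the same `${LLAMA_PORT:-8081}` form that the example script uses. The function below isolates that behaviour so it can be tried on its own; `resolve_port` is a hypothetical name and not part of any generated script.

```shell
# resolve_port: print the port a start script would bind to.
# Uses LLAMA_PORT when it is set and non-empty, otherwise falls back to 8081.
resolve_port() {
  echo "${LLAMA_PORT:-8081}"
}
```

Running a script by hand on another port would then look like `LLAMA_PORT=8085 ./start-scripts/<model>.sh`, where the script filename is a placeholder.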
Example generated script:
#!/bin/bash
# relay model: qwen3.6-35b-a3b-ud-q4-k-xl
# context: 262144 arch: qwen35moe
set -e
exec "/home/achu/llama.cpp/build-vulkan/bin/llama-server" \
--model "/home/achu/models/unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \
--host 127.0.0.1 \
--port ${LLAMA_PORT:-8081} \
--ctx-size 262144 \
-ngl 999 \
--parallel 1 \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--jinja \
--n-cpu-moe 25
Docker
Relay runs in Docker with network_mode: host and pid: host. This gives it direct access to localhost ports (for model servers) and the ability to spawn/kill model processes on the host.
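The two settings above are Compose keys. For a plain `docker run`, the equivalent CLI flags are `--network host` and `--pid host`. The helper below only prints such a command rather than running it. `relay_docker_cmd` and the image name in the usage note are hypothetical, and any volumes or environment variables Relay needs are left out.

```shell
# relay_docker_cmd IMAGE: print a docker run command that shares the host's
# network stack and PID namespace, mirroring network_mode: host and pid: host.
relay_docker_cmd() {
  printf 'docker run --network host --pid host %s\n' "$1"
}
```

For instance, `relay_docker_cmd relay:latest` prints `docker run --network host --pid host relay:latest`.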