Skip to content

Configuration

Relay is configured via environment variables, typically in a .env file. The setup wizard writes this for you. Run relay setup to generate it interactively, or relay provision --apply to regenerate from downloaded GGUFs.

Core

VariableDefaultDescription
HOST127.0.0.1Bind address. Set to 0.0.0.0 for LAN/Docker/tunnel access.
PORT1234Bind port
UPSTREAM_BASE_URLhttp://127.0.0.1:8080/v1Default upstream when lifecycle is disabled
DEFAULT_MODEL(empty)Fallback model when client omits model name
REQUEST_TIMEOUT_SECONDS600Upstream request timeout in seconds
MAX_REQUEST_BODY_BYTES1048576Request body size limit
MAX_UPSTREAM_RESPONSE_BYTES16777216Upstream response body limit
API_KEY(empty)If set, require Bearer or x-api-key on all requests. Rotate via TUI Config screen.
RELAY_MODEL_DIR~/modelsDirectory containing GGUF files
RELAY_LLAMA_SERVER_PATHauto-detectedPath to llama-server binary
RELAY_HOST_PROFILEauto-detectedheadless (server, less VRAM reserved) or desktop (workstation, more headroom)

Operating Mode

VariableDefaultDescription
RELAY_MODEgatewaygateway (manage local models) or cloud (proxy to external APIs)

Gateway Mode

Gateway mode manages local llama.cpp models with auto lifecycle. Unknown models can be forwarded to a cloud fallback.

VariableDefaultDescription
RELAY_CLOUD_FALLBACK_URL(empty)Forward unknown models to this URL (e.g. http://relay-cloud:1235/v1)

Cloud Mode

Cloud mode proxies to external APIs. No local GPU or llama.cpp needed.

VariableDefaultDescription
RELAY_CLOUD_MODELS(empty)JSON map of model name → {base_url, auth_env, ctx_size}

Cloud model entry format:

json
{
  "deepseek-chat": {
    "base_url": "https://api.deepseek.com/v1",
    "auth_env": "DEEPSEEK_API_KEY",
    "ctx_size": 131072
  }
}

Environment variables referenced by auth_env must be set in .env (e.g. DEEPSEEK_API_KEY=sk-...).

Model Lifecycle

Enabled when RELAY_MODEL_LIFECYCLE_ENABLED=true. Models start on first request and unload after idle timeout.

VariableDefaultDescription
RELAY_MODEL_LIFECYCLE_ENABLEDfalseEnable auto start/stop/switch of local models
RELAY_MODEL_MAP(empty)JSON: model name → start config
RELAY_SWITCH_POLICYeagerOnly eager supported (kill old, start new)
RELAY_MODEL_PORT_BASE8081Starting port for dynamic model allocation
RELAY_MODEL_IDLE_SHUTDOWN_MS3600000Idle timeout before unloading (1 hour)
RELAY_MODEL_START_TIMEOUT_MS120000Max time to wait for model health check
RELAY_SWITCH_MAX_WARM_MODELS2Max concurrent model processes
RELAY_SERIALIZE_REQUESTSfalseQueue requests one-at-a-time (FCFS, prevents thrash)
RELAY_LIFECYCLE_RING_BUFFER_BYTES65536Process stdout/stderr ring buffer size
RELAY_LIFECYCLE_SHUTDOWN_CONFIRM_TIMEOUT_MS10000Max time to wait for health to go red after shutdown
RELAY_LIFECYCLE_CIRCUIT_BREAKER_THRESHOLD3Consecutive failures before circuit breaker trips
RELAY_LIFECYCLE_CIRCUIT_BREAKER_WINDOW_MS300000Window for counting failures
RELAY_LIFECYCLE_CIRCUIT_BREAKER_COOLDOWN_MS120000Cooldown before retrying a tripped model

RELAY_MODEL_MAP Format

JSON object mapping model names to start configurations. The setup wizard generates this.

Each entry:

  • cmd (required) — shell script to start the model (must accept LLAMA_PORT env var)
  • ctx_size (required) — context window size, exposed via /v1/models
  • timeout_sec (optional) — startup timeout override
  • multimodal (optional) — true if model supports vision
  • port (optional) — fixed port (auto-allocated if unset)
  • thinking_levels (optional) — ["on"] or ["on","off"] for thinking-capable models
  • expert_flag (optional) — MoE expert offload flag (--cpu-moe or --n-cpu-moe N), informational (baked into the start script)
  • health_url (optional) — override health check URL

Example:

json
{
  "qwen3.6-35b-a3b": {
    "cmd": "/relay/start-scripts/start-qwen3.6-35b-a3b.sh",
    "ctx_size": 262144
  },
  "gemma-4-26b": {
    "cmd": "/relay/start-scripts/start-gemma-4-26b.sh",
    "ctx_size": 131072,
    "multimodal": true,
    "thinking_levels": ["on", "off"]
  }
}

Session-Aware Context

Send a session-id header with requests. When the session ID changes, Relay restarts the model to clear its KV cache. This prevents conversation state from leaking between different users or projects sharing a relay instance.

curl -H "session-id: project-alpha" http://127.0.0.1:1234/v1/chat/completions ...

Headers checked (first match wins): session-id, session_id, x-session-affinity, x-client-request-id.

Compatibility

VariableDefaultDescription
RELAY_MODEL_PROFILEgenericModel family profile for sampling defaults
RELAY_UNKNOWN_FIELD_POLICYpass_throughpass_through, strip (with warning), or reject
RELAY_STRICT_COMPATfalseReject non-standard requests
RELAY_WARN_ON_STRIPPED_FIELDStrueLog warnings when fields are stripped
RELAY_REASONING_MODEoffoff, preserve, or auto for reasoning/thinking fields
RELAY_TOOL_MODEautoTool call handling mode
RELAY_THINKING_SUPPORTEDfalseDeclare thinking capability to clients
RELAY_THINKING_LEVELSon,offAvailable thinking levels (comma-separated)
UPSTREAM_VISION_OKfalseDeclare vision/multimodal capability

Observability

VariableDefaultDescription
RELAY_OBSERVABILITY_ENABLEDtrueEnable /relay/* endpoints
RELAY_OBSERVABILITY_CAPTURE_BODYfalseCapture request/response bodies
RELAY_REQUEST_HISTORY_LIMIT100Max recent requests tracked
RELAY_LOG_PROMPTSfalseLog prompt bodies (security risk)
RELAY_EXPOSE_UPSTREAM_ERRORStrueInclude upstream error details in responses
LOG_LEVELinfosilent, error, warn, info, debug

Security

VariableDefaultDescription
CORS_ORIGIN(empty)Allowed CORS origin
RATE_LIMIT_AUTH_MAX20Max requests per window per key
RATE_LIMIT_AUTH_WINDOW_SECONDS60Rate limit window in seconds
RATE_LIMIT_RELAY_POST_MAX50Rate limit for /relay/* POST endpoints
RATE_LIMIT_RELAY_POST_WINDOW_MS60000Window for relay endpoints
RELAY_ALLOWED_HOSTS(empty)Allowed host header values

Rate limiting is per-key (each API token gets its own bucket). Falls back to IP-based limiting when no token is present.

Startup

VariableDefaultDescription
RELAY_PROBE_ON_STARTUPtrueProbe upstream during startup
RELAY_STRICT_STARTUPfalseExit if probe fails
RELAY_PROBE_TIMEOUT_MS3000Startup probe timeout

Sampling Defaults

VariableDefaultDescription
DEFAULT_TEMPERATURE(empty)Default temperature
DEFAULT_TOP_P(empty)Default top_p
DEFAULT_TOP_K(empty)Default top_k
DEFAULT_MIN_P(empty)Default min_p
DEFAULT_PRESENCE_PENALTY(empty)Default presence_penalty
DEFAULT_REPETITION_PENALTY(empty)Default repetition_penalty