Architecture
Relay is a protocol adapter that sits between agent clients and local LLM servers.
Two Operating Modes
| Mode | Description |
|---|---|
| Gateway | Manages local llama.cpp models. Auto lifecycle: start, stop, switch. Forwards unknown models to cloud fallback if configured. |
| Cloud | Proxies to external APIs (OpenAI, Anthropic, DeepSeek, Groq). No local GPU needed. Per-model base URL and auth. |
Request Flow
text
Client (OpenAI/Anthropic)
→ endpoint parser + auth + field policy
→ canonical request (provider-agnostic intermediate form)
→ upstream OpenAI-compatible chat request
→ canonical response / stream events
→ client protocol response (shape-matched to inbound protocol)Canonical Pipeline
Relay normalizes everything through a provider-neutral canonical form:
- Request model:
src/internal/canonical.ts - Protocol converters:
src/internal/openai-chat.ts,src/internal/openai-responses.ts,src/internal/anthropic-messages.ts - Response model:
src/internal/response.ts - Shared sampling logic:
src/internal/sampling.ts
Streaming
- OpenAI chat streams emit repaired
[DONE]terminators - Responses streams emit the full SSE event lifecycle (output_item, content_part, text_delta, function_call_arguments, completed)
- Anthropic streams emit message_start, content_block, message_delta, message_stop events
- Missing terminators are repaired; duplicates are collapsed
Auto Model Lifecycle
When enabled, Relay manages model processes:
- Spawns model servers on dedicated ports (8081+)
- Routes requests to the correct port via
getUpstreamUrl() - Kills old models when switching (eager switching)
- Clears KV cache on session change
- Cleans orphan processes from previous instances on startup
Observability
When RELAY_OBSERVABILITY_ENABLED=true:
GET /— HTML dark-theme dashboard with model table, queue cards, lifecycle stateGET /health— JSON liveness checkGET /relay/status— full lifecycle + queue statusGET /relay/metrics— request counts, latencies, error ratesGET /relay/jobs— job queue stateGET /relay/lifecycle— per-model process detailsGET /relay/stats— request historyGET /relay/capabilities— capability matrix
All sensitive values (API keys, auth headers) are redacted in logs and observability output.
Response Store
The /v1/responses endpoint stores completed responses in an in-memory store with:
- Full chat message history for
previous_response_idconversation continuation - TTL-based expiry (
COMPLETION_TTL_SECONDS) - Memory-bound eviction (
maxStoreEntries,maxStoreBytes)
Key Modules
| Module | Purpose |
|---|---|
src/server.ts | HTTP server, routing, auth, rate limiting, mutex, cloud fallback |
src/config.ts | Environment variable loading and validation |
src/auth.ts | API key validation, per-key rate limiting (KeyRateLimiter) |
src/lifecycle.ts | Model process management (start, stop, switch, probe, idle shutdown) |
src/openai/responses.ts | /v1/responses handler, streaming SSE lifecycle, ResponseStore |
src/openai/chat.ts | /v1/chat/completions handler |
src/internal/openai-responses.ts | Responses → canonical conversion, tool normalization |
src/upstream/llama.ts | Upstream HTTP client with auth header injection |
src/normalize/ | Message, tool, and stream normalization |