# Xantly

> Xantly is an AI infrastructure layer that routes to 10,000+ LLMs across 15 providers with 12ms median overhead, semantic caching (62% hit rate, sub-5ms), and persistent memory — reducing API costs by up to 80% via a single OpenAI-compatible endpoint.

Xantly sits between your application and LLM providers. Change your `base_url` to `api.xantly.com/v1` — zero other code changes required. You get intelligent routing, semantic caching, persistent memory, multi-provider waterfall failover, and full observability via response headers.

## Docs

- [Introduction](https://xantly.com/docs/introduction): What Xantly is, key capabilities, and how it works
- [Quickstart](https://xantly.com/docs/quickstart): Get up and running in under 5 minutes
- [Authentication](https://xantly.com/docs/authentication): API keys, JWT tokens, and security best practices
- [Platform Overview](https://xantly.com/docs/platform-overview): How Xantly fits in your stack — architecture, request lifecycle, differentiators
- [Intelligent Routing](https://xantly.com/docs/intelligent-routing): How the routing engine selects optimal models across providers
- [Caching & Performance](https://xantly.com/docs/caching-and-performance): Multi-layer caching — exact match (sub-5ms), semantic match, cross-conversation dedup
- [Memory & Context](https://xantly.com/docs/memory-and-context): Persistent per-org memory — session detection, knowledge extraction, context assembly
- [Chat Completions API](https://xantly.com/docs/chat-completions): Unified /v1/chat/completions endpoint for any model
- [Responses API](https://xantly.com/docs/responses): Modern OpenAI Responses API endpoint
- [Completions (Legacy)](https://xantly.com/docs/completions): Legacy prompt-based text completions
- [Embeddings](https://xantly.com/docs/embeddings): Vector embeddings for semantic search and RAG
- [Moderations](https://xantly.com/docs/moderations): Content classification with BYOK support
- [Audio](https://xantly.com/docs/audio): Speech-to-text (Whisper) and text-to-speech
- [Voice Models](https://xantly.com/docs/voice-models): 30+ voice models (STT, TTS, Realtime, Audio LLMs, Music)
- [Voice Billing](https://xantly.com/docs/voice-billing): Voice request pricing, quotas, and metering
- [Images](https://xantly.com/docs/images): DALL-E image generation via BYOK
- [Models](https://xantly.com/docs/models): List all available models in the catalog
- [Rate Limits](https://xantly.com/docs/rate-limits): RPM/TPM limits, response headers, backoff patterns
- [Billing & Quotas](https://xantly.com/docs/billing-and-quotas): Token quotas, budgets, credit balance, cost visibility
- [OpenAPI Spec](https://xantly.com/docs/openapi-spec): Download the full OpenAPI 3.1 specification
- [Cost-Optimized Routing](https://xantly.com/docs/cost-optimized-routing): Route to the cheapest model meeting quality thresholds
- [Multi-Agent Orchestration](https://xantly.com/docs/multi-agent-orchestration): AI agent pipelines with handoffs and shared memory
- [Streaming Responses](https://xantly.com/docs/streaming-responses): Real-time token streaming via SSE
- [Benchmark Results](https://xantly.com/docs/benchmark-results): 252 validations, 24/24 industry benchmarks, 10/10 SDK clients
- [Voice Agents](https://xantly.com/docs/voice-agents): Production voice agents — 80-90% cheaper, sub-300ms latency
- [Intelligence Modes](https://xantly.com/docs/intelligence-modes): Control pipeline stages — proxy, cache, or full mode
- [Bring Your Own Key](https://xantly.com/docs/bring-your-own-key): Use your own provider API keys through Xantly
- [Glossary](https://xantly.com/docs/glossary): Definitions of AI infrastructure terms

## Xantly MCP Server

- [Xantly MCP Launch](https://xantly.com/docs/xantly-mcp-launch): `npx @xantly/mcp` ships the first LLM-gateway MCP server — 15 tools exposing routing decisions, cache inspection, memory, cost breakdowns, intelligence-mode control.
  Works with Claude Desktop, Cursor, VS Code, Zed, and ChatGPT Desktop.

## Integrations (use Xantly with existing coding tools)

- [Use Xantly with Claude Code](https://xantly.com/docs/use-with-claude-code): Anthropic's CLI via ANTHROPIC_BASE_URL — chat, edit, sub-agents, skills, and MCP tools all work
- [Use Xantly with GitHub Copilot CLI](https://xantly.com/docs/use-with-copilot-cli): Skip the $10/mo subscription — four env vars, zero GitHub auth (Apr 7, 2026 BYOK release)
- [Use Xantly with OpenCode](https://xantly.com/docs/use-with-opencode): Cleanest BYO setup in the ecosystem — opencode.jsonc + @ai-sdk/openai-compatible
- [Use Xantly with Cline](https://xantly.com/docs/use-with-cline): Largest VS Code agent user base; OpenAI Compatible provider
- [Use Xantly with Continue.dev](https://xantly.com/docs/use-with-continue-dev): Per-role model config — Xantly for chat, Ollama for autocomplete
- [Use Xantly with Roo Code](https://xantly.com/docs/use-with-roo-code): Cline fork with per-mode model assignments
- [Use Xantly with Kilo Code](https://xantly.com/docs/use-with-kilo-code): #1 IDE extension on OpenRouter by token volume
- [Use Xantly with Aider](https://xantly.com/docs/use-with-aider): Terminal pair programmer — one flag, role-split for 60-70% cost savings
- [Migrate from Cursor](https://xantly.com/docs/migrate-from-cursor): Cursor's BYOK is chat-only — escape to Cline + Xantly for full-stack BYO
- [Migrate from Antigravity](https://xantly.com/docs/migrate-from-antigravity): Google cut Antigravity's free tier by 92% — replace with OpenCode + Xantly in 5 minutes

## SDK integrations

- [OpenAI SDK (Python)](https://xantly.com/docs/use-with-openai-sdk-python): Point `OpenAI(base_url=...)` at Xantly — chat, streaming, function calling, async
- [OpenAI SDK (TypeScript)](https://xantly.com/docs/use-with-openai-sdk-typescript): `new OpenAI({ baseURL, apiKey })` for Xantly — chat, streaming iterator, tool use
- [Anthropic SDK (Python)](https://xantly.com/docs/use-with-anthropic-sdk-python): `Anthropic(base_url="https://api.xantly.com")` — messages.create, streaming, tool_use, prompt caching
- [Anthropic SDK (TypeScript)](https://xantly.com/docs/use-with-anthropic-sdk-typescript): Same setup for the `@anthropic-ai/sdk` npm package
- [LangChain (Python)](https://xantly.com/docs/use-with-langchain-python): `ChatOpenAI` / `ChatAnthropic` with Xantly base_url — LCEL chains, agents, RAG
- [LangChain.js](https://xantly.com/docs/use-with-langchain-typescript): Same pattern for LangChain.js
- [LlamaIndex (Python)](https://xantly.com/docs/use-with-llamaindex-python): `OpenAI(api_base=..., api_key=..., model=...)` in llama_index — RAG, agents, query engines
- [Vercel AI SDK](https://xantly.com/docs/use-with-vercel-ai-sdk): `createOpenAICompatible({ baseURL, apiKey, name: 'xantly' })` — generateText, streamText, agents
- [PydanticAI](https://xantly.com/docs/use-with-pydantic-ai): Configure an OpenAI-compatible provider for PydanticAI agents
- [Instructor](https://xantly.com/docs/use-with-instructor): Structured output library with Xantly base URL

## Framework integrations

- [CrewAI](https://xantly.com/docs/use-with-crewai): Multi-agent framework with Xantly-routed LLMs
- [AutoGen](https://xantly.com/docs/use-with-autogen): Microsoft AutoGen — OpenAI-compatible provider
- [LangGraph](https://xantly.com/docs/use-with-langgraph): LangChain's graph-based agent runtime via a ChatOpenAI node
- [DSPy](https://xantly.com/docs/use-with-dspy): Stanford DSPy — `dspy.OpenAI(api_base=..., api_key=..., model=...)`
- [Guidance](https://xantly.com/docs/use-with-guidance): Microsoft Guidance with the Xantly endpoint
- [llamafile](https://xantly.com/docs/use-with-llamafile): Use Xantly alongside local llamafile for hybrid workflows
- [Ollama bridge](https://xantly.com/docs/use-with-ollama-bridge): Xantly for hosted models, Ollama for local — hybrid setup

## IDE / editor integrations

- [Zed](https://xantly.com/docs/use-with-zed): Native Rust IDE assistant panel — `openai_compatible` settings block
- [JetBrains AI Assistant](https://xantly.com/docs/use-with-jetbrains-ai): AI Assistant + Junie agent with Xantly BYOK
- [Windsurf](https://xantly.com/docs/use-with-windsurf): Windsurf BYOK limitations (Anthropic-only, proxy-gated) — migrate to Cline instead
- [Void Editor](https://xantly.com/docs/use-with-void-editor): OSS Cursor alternative (beta) with env-var config
- [OpenHands](https://xantly.com/docs/use-with-openhands): Formerly OpenDevin — autonomous agent with a LiteLLM backbone
- [Aide (codestory)](https://xantly.com/docs/use-with-aide-codestory): Aide editor via the Sidecar layer
- [Cursor (chat-only)](https://xantly.com/docs/use-with-cursor-chat-only): Detailed chat-only BYOK limitations + known base-URL bug
- [Gemini CLI](https://xantly.com/docs/use-with-gemini-cli): Google Gemini CLI with the Xantly endpoint
- [Raycast AI](https://xantly.com/docs/use-with-raycast-ai): Raycast extensions routing through Xantly
- [n8n](https://xantly.com/docs/use-with-n8n): n8n workflow automation with the OpenAI Chat node
- [Zapier](https://xantly.com/docs/use-with-zapier-openai): Zapier OpenAI connector with base URL override

## Endpoints at a glance

- `POST https://api.xantly.com/v1/chat/completions` — OpenAI Chat Completions (full compat: messages, tools, tool_choice, response_format, streaming, function calling)
- `POST https://api.xantly.com/v1/messages` — Anthropic Messages API (full compat: system, tools, tool_use blocks, streaming with message_start/content_block_delta/message_stop events)
- `POST https://api.xantly.com/v1/completions` — Legacy OpenAI completions
- `POST https://api.xantly.com/v1/embeddings` — Embeddings (all OpenAI + provider-native embedding models)
- `POST https://api.xantly.com/v1/audio/transcriptions` — Whisper + Deepgram Nova + Groq Whisper
- `POST https://api.xantly.com/v1/audio/speech` — TTS (OpenAI, ElevenLabs, Deepgram Aura, Groq Orpheus)
- `POST https://api.xantly.com/v1/moderations` — OpenAI moderation with BYOK
- `POST https://api.xantly.com/v1/images/generations` — DALL-E image generation
- `GET https://api.xantly.com/v1/models` — Live model catalog (includes xantly/auto-* routing aliases)
- `POST https://api.xantly.com/v1/responses` — OpenAI Responses API

## Routing model IDs

- `provider/model` (e.g. `anthropic/claude-sonnet-4.6`, `openai/gpt-5.4`, `groq/llama-3.3-70b`) — honored exactly, no re-routing; waterfall fallback available on errors
- `xantly/auto` — BaRP bandit routing across all tiers
- `xantly/auto-quality` — BaRP on the T1 Quality pool
- `xantly/auto-value` — BaRP on the T2 Value pool (balanced default)
- `xantly/auto-speed` — BaRP on the T3 Speed pool
- `xantly/auto-safety` — BaRP on the SafetyCritical pool

## Comparisons

- [Xantly vs OpenRouter](https://xantly.com/compare/xantly-vs-openrouter): Full infrastructure vs model aggregation
- [Xantly vs Helicone](https://xantly.com/compare/xantly-vs-helicone): Active optimization vs observability proxy
- [Xantly vs Portkey](https://xantly.com/compare/xantly-vs-portkey): AI gateway feature comparison
- [Xantly vs LiteLLM](https://xantly.com/compare/xantly-vs-litellm): Managed service vs self-hosted open source

## API

- [OpenAPI Spec](https://xantly.com/docs/openapi-spec): Full API schema download (OpenAPI 3.1)

## Optional

- [Pricing](https://xantly.com/pricing): Free, Pay-as-you-go, Pro, Scale, and Enterprise plans
- [Terms of Service](https://xantly.com/terms)
- [Privacy Policy](https://xantly.com/privacy)
- [Cookie Policy](https://xantly.com/cookies)
- [Acceptable Use Policy](https://xantly.com/aup)
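The base-URL swap and the routing-ID convention described above can be sketched with nothing but the Python standard library. Only the endpoint URL and the model IDs come from this document; the API key is a placeholder, and the request is built but not actually sent — treat this as an illustration, not Xantly's official client code.

```python
import json
import urllib.request

# Per the docs above, pointing at this base URL is the only change an
# OpenAI-compatible client needs.
XANTLY_BASE_URL = "https://api.xantly.com/v1"
API_KEY = "YOUR_XANTLY_KEY"  # placeholder — substitute a real Xantly key

payload = {
    # A routing alias lets Xantly choose the model ("xantly/auto-value" is
    # the balanced default pool). A pinned ID such as
    # "anthropic/claude-sonnet-4.6" is honored exactly, with waterfall
    # fallback only on errors.
    "model": "xantly/auto-value",
    "messages": [{"role": "user", "content": "Say hello."}],
}

request = urllib.request.Request(
    f"{XANTLY_BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# With a real key, send it and read the OpenAI-shaped response:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works through any OpenAI SDK by setting `base_url="https://api.xantly.com/v1"` on the client, as the Quickstart and SDK integration pages above describe.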