# Xantly

> Xantly is an AI infrastructure layer that routes to 10,000+ LLMs across 15 providers with 12ms median overhead, semantic caching (62% hit rate, sub-5ms), and persistent memory — reducing API costs by up to 80% via a single OpenAI-compatible endpoint.

Xantly sits between your application and LLM providers. Change your `base_url` to `api.xantly.com/v1` — zero other code changes required. You get intelligent routing, semantic caching, persistent memory, multi-provider waterfall failover, and full observability via response headers.

## Docs

- [Introduction](https://xantly.com/docs/introduction): What Xantly is, key capabilities, and how it works
- [Quickstart](https://xantly.com/docs/quickstart): Get up and running in under 5 minutes
- [Authentication](https://xantly.com/docs/authentication): API keys, JWT tokens, and security best practices
- [Platform Overview](https://xantly.com/docs/platform-overview): How Xantly fits in your stack — architecture, request lifecycle, differentiators
- [Intelligent Routing](https://xantly.com/docs/intelligent-routing): How the routing engine selects optimal models across providers
- [Caching & Performance](https://xantly.com/docs/caching-and-performance): Multi-layer caching — exact match (sub-5ms), semantic match, cross-conversation dedup
- [Memory & Context](https://xantly.com/docs/memory-and-context): Persistent per-org memory — session detection, knowledge extraction, context assembly
- [Chat Completions API](https://xantly.com/docs/chat-completions): Unified /v1/chat/completions endpoint for any model
- [Responses API](https://xantly.com/docs/responses): Modern OpenAI Responses API endpoint
- [Completions (Legacy)](https://xantly.com/docs/completions): Legacy prompt-based text completions
- [Embeddings](https://xantly.com/docs/embeddings): Vector embeddings for semantic search and RAG
- [Moderations](https://xantly.com/docs/moderations): Content classification with BYOK support
- [Audio](https://xantly.com/docs/audio): Speech-to-text (Whisper) and text-to-speech
- [Voice Models](https://xantly.com/docs/voice-models): 30+ voice models (STT, TTS, Realtime, Audio LLMs, Music)
- [Voice Billing](https://xantly.com/docs/voice-billing): Voice request pricing, quotas, and metering
- [Images](https://xantly.com/docs/images): DALL-E image generation via BYOK
- [Models](https://xantly.com/docs/models): List all available models in the catalog
- [Rate Limits](https://xantly.com/docs/rate-limits): RPM/TPM limits, response headers, backoff patterns
- [Billing & Quotas](https://xantly.com/docs/billing-and-quotas): Token quotas, budgets, credit balance, cost visibility
- [OpenAPI Spec](https://xantly.com/docs/openapi-spec): Download the full OpenAPI 3.1 specification
- [Cost-Optimized Routing](https://xantly.com/docs/cost-optimized-routing): Route to the cheapest model meeting quality thresholds
- [Multi-Agent Orchestration](https://xantly.com/docs/multi-agent-orchestration): AI agent pipelines with handoffs and shared memory
- [Streaming Responses](https://xantly.com/docs/streaming-responses): Real-time token streaming via SSE
- [Benchmark Results](https://xantly.com/docs/benchmark-results): 252 validations, 24/24 industry benchmarks, 10/10 SDK clients
- [Voice Agents](https://xantly.com/docs/voice-agents): Production voice agents — 80-90% cheaper, sub-300ms latency
- [Intelligence Modes](https://xantly.com/docs/intelligence-modes): Control pipeline stages — proxy, cache, or full mode
- [Bring Your Own Key](https://xantly.com/docs/bring-your-own-key): Use your own provider API keys through Xantly
- [Glossary](https://xantly.com/docs/glossary): Definitions of AI infrastructure terms

## Xantly MCP Server

- [Xantly MCP Launch](https://xantly.com/docs/xantly-mcp-launch): `npx @xantly/mcp` ships the first LLM-gateway MCP server — 15 tools exposing routing decisions, cache inspection, memory, cost breakdowns, intelligence-mode control.
  Works with Claude Desktop, Cursor, VS Code, Zed, and ChatGPT Desktop.

## Integrations (use Xantly with existing coding tools)

- [Use Xantly with Claude Code](https://xantly.com/docs/use-with-claude-code): Anthropic's CLI via ANTHROPIC_BASE_URL — chat, edit, sub-agents, skills, and MCP tools all work
- [Use Xantly with GitHub Copilot CLI](https://xantly.com/docs/use-with-copilot-cli): Skip the $10/mo subscription — four env vars, zero GitHub auth (Apr 7, 2026 BYOK release)
- [Use Xantly with OpenCode](https://xantly.com/docs/use-with-opencode): Cleanest BYO setup in the ecosystem — opencode.jsonc + @ai-sdk/openai-compatible
- [Use Xantly with Cline](https://xantly.com/docs/use-with-cline): Largest VS Code agent user base; OpenAI Compatible provider
- [Use Xantly with Continue.dev](https://xantly.com/docs/use-with-continue-dev): Per-role model config — Xantly for chat, Ollama for autocomplete
- [Use Xantly with Roo Code](https://xantly.com/docs/use-with-roo-code): Cline fork with per-mode model assignments
- [Use Xantly with Kilo Code](https://xantly.com/docs/use-with-kilo-code): #1 IDE extension on OpenRouter by token volume
- [Use Xantly with Aider](https://xantly.com/docs/use-with-aider): Terminal pair programmer — one flag, role-split for 60-70% cost savings
- [Migrate from Cursor](https://xantly.com/docs/migrate-from-cursor): Cursor's BYOK is chat-only — escape to Cline + Xantly for full-stack BYO
- [Migrate from Antigravity](https://xantly.com/docs/migrate-from-antigravity): Google cut Antigravity's free tier by 92% — replace with OpenCode + Xantly in 5 minutes

## SDK integrations

- [OpenAI SDK (Python)](https://xantly.com/docs/use-with-openai-sdk-python): Point `OpenAI(base_url=...)` at Xantly — chat, streaming, function calling, async
- [OpenAI SDK (TypeScript)](https://xantly.com/docs/use-with-openai-sdk-typescript): `new OpenAI({ baseURL, apiKey })` for Xantly — chat, streaming iterator, tool use
- [Anthropic SDK (Python)](https://xantly.com/docs/use-with-anthropic-sdk-python): `Anthropic(base_url="https://api.xantly.com")` — messages.create, streaming, tool_use, prompt caching
- [Anthropic SDK (TypeScript)](https://xantly.com/docs/use-with-anthropic-sdk-typescript): Same setup for the `@anthropic-ai/sdk` npm package
- [LangChain (Python)](https://xantly.com/docs/use-with-langchain-python): `ChatOpenAI` / `ChatAnthropic` with Xantly base_url — LCEL chains, agents, RAG
- [LangChain.js](https://xantly.com/docs/use-with-langchain-typescript): Same pattern for LangChain.js
- [LlamaIndex (Python)](https://xantly.com/docs/use-with-llamaindex-python): `OpenAI(api_base=..., api_key=..., model=...)` in llama_index — RAG, agents, query engines
- [Vercel AI SDK](https://xantly.com/docs/use-with-vercel-ai-sdk): `createOpenAICompatible({ baseURL, apiKey, name: 'xantly' })` — generateText, streamText, agents
- [PydanticAI](https://xantly.com/docs/use-with-pydantic-ai): Configure an OpenAI-compatible provider for PydanticAI agents
- [Instructor](https://xantly.com/docs/use-with-instructor): Structured output library with Xantly base URL

## Framework integrations

- [CrewAI](https://xantly.com/docs/use-with-crewai): Multi-agent framework with Xantly-routed LLMs
- [AutoGen](https://xantly.com/docs/use-with-autogen): Microsoft AutoGen — OpenAI-compatible provider
- [LangGraph](https://xantly.com/docs/use-with-langgraph): LangChain's graph-based agent runtime via a ChatOpenAI node
- [DSPy](https://xantly.com/docs/use-with-dspy): Stanford DSPy — `dspy.OpenAI(api_base=..., api_key=..., model=...)`
- [Guidance](https://xantly.com/docs/use-with-guidance): Microsoft Guidance with the Xantly endpoint
- [llamafile](https://xantly.com/docs/use-with-llamafile): Use Xantly alongside local llamafile for hybrid workflows
- [Ollama bridge](https://xantly.com/docs/use-with-ollama-bridge): Xantly for hosted models, Ollama for local — hybrid setup

## IDE / editor integrations

- [Zed](https://xantly.com/docs/use-with-zed): Native Rust IDE assistant panel — `openai_compatible` settings block
- [JetBrains AI Assistant](https://xantly.com/docs/use-with-jetbrains-ai): AI Assistant + Junie agent with Xantly BYOK
- [Windsurf](https://xantly.com/docs/use-with-windsurf): Windsurf BYOK limitations (Anthropic-only, proxy-gated) — migrate to Cline instead
- [Void Editor](https://xantly.com/docs/use-with-void-editor): OSS Cursor alternative (beta) with env-var config
- [OpenHands](https://xantly.com/docs/use-with-openhands): Formerly OpenDevin — autonomous agent with a LiteLLM backbone
- [Aide (codestory)](https://xantly.com/docs/use-with-aide-codestory): Aide editor via the Sidecar layer
- [Cursor (chat-only)](https://xantly.com/docs/use-with-cursor-chat-only): Detailed chat-only BYOK limitations + known base-URL bug
- [Gemini CLI](https://xantly.com/docs/use-with-gemini-cli): Google Gemini CLI with the Xantly endpoint
- [Raycast AI](https://xantly.com/docs/use-with-raycast-ai): Raycast extensions routing through Xantly
- [n8n](https://xantly.com/docs/use-with-n8n): n8n workflow automation with the OpenAI Chat node
- [Zapier](https://xantly.com/docs/use-with-zapier-openai): Zapier OpenAI connector with base URL override

## Endpoints at a glance

- `POST https://api.xantly.com/v1/chat/completions` — OpenAI Chat Completions (full compat: messages, tools, tool_choice, response_format, streaming, function calling)
- `POST https://api.xantly.com/v1/messages` — Anthropic Messages API (full compat: system, tools, tool_use blocks, streaming with message_start/content_block_delta/message_stop events)
- `POST https://api.xantly.com/v1/completions` — Legacy OpenAI completions
- `POST https://api.xantly.com/v1/embeddings` — Embeddings (all OpenAI + provider-native embedding models)
- `POST https://api.xantly.com/v1/audio/transcriptions` — Whisper + Deepgram Nova + Groq Whisper
- `POST https://api.xantly.com/v1/audio/speech` — TTS (OpenAI, ElevenLabs, Deepgram Aura, Groq Orpheus)
- `POST https://api.xantly.com/v1/moderations` — OpenAI moderation with BYOK
- `POST https://api.xantly.com/v1/images/generations` — DALL-E image generation
- `GET https://api.xantly.com/v1/models` — Live model catalog (includes xantly/auto-* routing aliases)
- `POST https://api.xantly.com/v1/responses` — OpenAI Responses API

## Routing model IDs

- `provider/model` (e.g. `anthropic/claude-sonnet-4.6`, `openai/gpt-5.4`, `groq/llama-3.3-70b`) — honored exactly, no re-routing; waterfall fallback available on errors
- `xantly/auto` — BaRP bandit routing across all tiers
- `xantly/auto-quality` — BaRP on the T1 Quality pool
- `xantly/auto-value` — BaRP on the T2 Value pool (balanced default)
- `xantly/auto-speed` — BaRP on the T3 Speed pool
- `xantly/auto-safety` — BaRP on the SafetyCritical pool

## Comparisons

- [Xantly vs OpenRouter](https://xantly.com/compare/xantly-vs-openrouter): Full infrastructure vs model aggregation
- [Xantly vs Helicone](https://xantly.com/compare/xantly-vs-helicone): Active optimization vs observability proxy
- [Xantly vs Portkey](https://xantly.com/compare/xantly-vs-portkey): AI gateway feature comparison
- [Xantly vs LiteLLM](https://xantly.com/compare/xantly-vs-litellm): Managed service vs self-hosted open source

## API

- [OpenAPI Spec](https://xantly.com/docs/openapi-spec): Full API schema download (OpenAPI 3.1)

## Optional

- [Pricing](https://xantly.com/pricing): Free, Pay-as-you-go, Pro, Scale, and Enterprise plans
- [Terms of Service](https://xantly.com/terms)
- [Privacy Policy](https://xantly.com/privacy)
- [Cookie Policy](https://xantly.com/cookies)
- [Acceptable Use Policy](https://xantly.com/aup)
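The base-URL swap and the routing-ID convention described above can be sketched with nothing but the Python standard library. Only the endpoint URL and the model IDs come from this document; the API key is a placeholder, and the request is built but not actually sent — treat this as an illustration, not Xantly's official client code.

```python
import json
import urllib.request

# Per the docs above, pointing at this base URL is the only change an
# OpenAI-compatible client needs.
XANTLY_BASE_URL = "https://api.xantly.com/v1"
API_KEY = "YOUR_XANTLY_KEY"  # placeholder — substitute a real Xantly key

payload = {
    # A routing alias lets Xantly choose the model ("xantly/auto-value" is
    # the balanced default pool). A pinned ID such as
    # "anthropic/claude-sonnet-4.6" is honored exactly, with waterfall
    # fallback only on errors.
    "model": "xantly/auto-value",
    "messages": [{"role": "user", "content": "Say hello."}],
}

request = urllib.request.Request(
    f"{XANTLY_BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# With a real key, send it and read the OpenAI-shaped response:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works through any OpenAI SDK by setting `base_url="https://api.xantly.com/v1"` on the client, as the Quickstart and SDK integration pages above describe.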