
/responses [Beta]

LiteLLM provides a BETA endpoint that follows the spec of OpenAI's /responses API

| Feature | Supported | Notes |
|---|---|---|
| Cost Tracking | ✅ | Works with all supported models |
| Logging | ✅ | Works across all integrations |
| End-user Tracking | ✅ | |
| Streaming | ✅ | |
| Fallbacks | ✅ | Works between supported models |
| Load Balancing | ✅ | Works between supported models |
| Supported LiteLLM Versions | 1.63.8+ | |
| Supported LLM Providers | All LiteLLM supported providers | `openai`, `anthropic`, `bedrock`, `vertex_ai`, `gemini`, `azure`, `azure_ai`, etc. |

Usage​

LiteLLM Python SDK​

Non-streaming​

OpenAI Non-streaming Response
import litellm

# Non-streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)
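
The Responses API is stateful: each response carries an id that a follow-up request can reference via previous_response_id (the same parameter used in the routing-affinity example later on this page). A minimal sketch of a second turn:

Multi-turn with previous_response_id
import litellm

# First turn
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

# Second turn: reference the first response's ID to continue the conversation
follow_up = litellm.responses(
    model="openai/o1-pro",
    input="Now retell the story in one sentence.",
    previous_response_id=response.id
)

print(follow_up)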

Streaming​

OpenAI Streaming Response
import litellm

# Streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)
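
Each streamed item is a typed event rather than a bare text chunk. The sketch below prints only the incremental text, assuming the events mirror OpenAI's Responses streaming event shape (a type string such as response.output_text.delta and a delta payload); the exact event classes LiteLLM emits may differ by version:

Handling Streaming Event Types
import litellm

stream = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in stream:
    # Text-delta events carry the incremental output text
    if getattr(event, "type", "") == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()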

LiteLLM Proxy with OpenAI SDK​

First, add this to your LiteLLM proxy config.yaml:

OpenAI Proxy Configuration
model_list:
  - model_name: openai/o1-pro
    litellm_params:
      model: openai/o1-pro
      api_key: os.environ/OPENAI_API_KEY

Then start your LiteLLM proxy server:

Start LiteLLM Proxy Server
litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000
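
Before wiring up application code, it can help to confirm the proxy loaded your config. A minimal sketch that lists the models the proxy exposes, assuming the proxy serves the standard OpenAI-compatible model-listing route and that your-api-key is a valid proxy key:

Verify the Proxy Configuration
from openai import OpenAI

# Point the OpenAI client at the local LiteLLM proxy (URL and key are placeholders)
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-api-key"
)

# List the models the proxy exposes to confirm the config was loaded
for model in client.models.list():
    print(model.id)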

Non-streaming​

OpenAI Proxy Non-streaming Response
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)
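
The OpenAI SDK exposes the generated text through the output_text convenience accessor on the response object. The sketch below also passes a user value, which is an assumption about how the end-user tracking feature from the table above attributes requests; adjust it to your tracking setup:

Reading Output Text and Tagging an End User
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    user="customer-123"  # assumed hook for LiteLLM's end-user tracking
)

# output_text is the OpenAI SDK's convenience accessor for the generated text
print(response.output_text)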

Streaming​

OpenAI Proxy Streaming Response
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)
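
To reconstruct the full text from a stream, collect the delta events and stop once the completed event arrives. A minimal sketch, assuming the events follow OpenAI's Responses streaming event types:

Assembling Streamed Text
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"             # Your proxy API key
)

stream = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

chunks = []
for event in stream:
    if event.type == "response.output_text.delta":
        chunks.append(event.delta)  # incremental text
    elif event.type == "response.completed":
        break                       # the response is finished

print("".join(chunks))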

Supported Responses API Parameters​

| Provider | Supported Parameters |
|---|---|
| `openai` | All Responses API parameters are supported |
| `azure` | All Responses API parameters are supported |
| `anthropic` | See supported parameters here |
| `bedrock` | See supported parameters here |
| `gemini` | See supported parameters here |
| `vertex_ai` | See supported parameters here |
| `azure_ai` | See supported parameters here |
| All other LLM API providers | See supported parameters here |
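
If you send a Responses API parameter that a particular provider does not support, LiteLLM can drop it instead of raising an error. A minimal sketch using the global drop_params setting; whether it covers every parameter of the responses() endpoint is an assumption to verify for your provider and LiteLLM version:

Dropping Unsupported Parameters
import litellm

# Ask LiteLLM to silently drop parameters the target provider does not support
litellm.drop_params = True

response = litellm.responses(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    truncation="auto"  # example of a parameter a provider may not accept
)

print(response)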

Load Balancing with Routing Affinity​

When using the Responses API with multiple deployments of the same model (e.g., multiple Azure OpenAI endpoints), LiteLLM provides routing affinity for conversations. This ensures that follow-up requests using a previous_response_id are routed to the same deployment that generated the original response.

Example Usage​

Python SDK with Routing Affinity
import asyncio
import litellm

# Set up router with multiple deployments of the same model
router = litellm.Router(
    model_list=[
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-1",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint1.openai.azure.com",
            },
        },
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-2",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint2.openai.azure.com",
            },
        },
    ],
    optional_pre_call_checks=["responses_api_deployment_check"],
)

async def main():
    # Initial request
    response = await router.aresponses(
        model="azure-gpt4-turbo",
        input="Hello, who are you?",
        truncation="auto",
    )

    # Store the response ID
    response_id = response.id

    # Follow-up request - automatically routed to the same deployment
    follow_up = await router.aresponses(
        model="azure-gpt4-turbo",
        input="Tell me more about yourself",
        truncation="auto",
        previous_response_id=response_id,  # Ensures routing to the same deployment
    )
    print(follow_up)

asyncio.run(main())