Responses - RunInfra

POST https://api.runinfra.ai/v1/responses

RunInfra /v1/responses is a chat-completions compatibility adapter for LLM and vision-language deployments. The gateway converts supported input and instructions fields into chat messages, forwards the request through the same serving path as /v1/chat/completions, then wraps the result in a Responses-shaped envelope.

This endpoint does not implement full OpenAI Responses state, include, reasoning, tool, conversation-item, or background-job semantics. Use /v1/chat/completions when your app needs tool calls today.

Minimal request

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("RUNINFRA_BASE_URL", "https://api.runinfra.ai/v1"),
    api_key=os.environ["RUNINFRA_API_KEY"],
)

response = client.responses.create(
    model=os.environ["RUNINFRA_MODEL"],
    input="Write a one-sentence deployment health check.",
)

print(response.output_text)

Streaming

Set stream: true to receive server-sent events. The stream uses OpenAI-shaped Responses event names for text deltas and terminal completion events when the selected deployment supports streaming.

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("RUNINFRA_BASE_URL", "https://api.runinfra.ai/v1"),
    api_key=os.environ["RUNINFRA_API_KEY"],
)

stream = client.responses.create(
    model=os.environ["RUNINFRA_MODEL"],
    input="Count to five.",
    stream=True,
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)

Supported request fields

model

string

required

Model id returned by GET /v1/models, or default for a pipeline-scoped snippet.

input

string | object[]

required

Prompt text, or an array of supported Responses input message objects.

instructions

string

System-level instruction text. The adapter maps this into a system message.

stream

boolean

default:"false"

Return a server-sent event stream instead of one JSON response.

max_output_tokens

integer

Maximum generated output tokens. The adapter maps this to the serving backend’s chat completion token limit.

temperature

number

Sampling temperature for compatible LLM deployments.

top_p

number

Nucleus sampling cutoff for compatible LLM deployments.

response_format

object

Structured output format passed through to the chat-completions serving path. Support depends on the selected deployment.

tools

object[]

OpenAI chat-completions tool definitions accepted as pass-through input for compatible deployments. This adapter does not execute hosted tools or manage a stateful tool loop.

tool_choice

string | object

Tool-selection preference passed through to compatible chat-completions deployments.

Not shipped on this adapter

Stateful response retrieval or deletion.
include, reasoning, hosted tools, conversation items, file search, web search, computer use, and background jobs.
Stateful tool execution or hosted tool orchestration. Use Chat completions for production tool loops.

Retry semantics

The native RunInfra SDK treats non-streaming responses.create() as replay-safe only when you provide an idempotency key. Streaming Responses requests are sent once because a partial stream may already have reached your app.

​Minimal request

​Streaming

​Supported request fields

​Not shipped on this adapter

​Retry semantics

​Next steps

Chat completions

RunInfra SDK

Minimal request

Streaming

Supported request fields

Not shipped on this adapter

Retry semantics

Next steps