This changelog tracks every RunInfra release. Each entry lists what shipped and when. The roadmap section at the bottom covers features currently in development.
May 26, 2026
Engine, Plan, UX
Plan honesty: latency-budget clarification, cheaper-GPU defaults, phase reorder, open-questions card, ETA range
Plan-quality enterprise hardening
A third-party product reviewer audited a real “Optimize Llama 3.1 8B on vLLM for the cheapest GPU” session and found that the runbook was answering the wrong question. The plan contradicted the chat panel, invented a latency budget the user never set, and ruled out cheap GPUs on roofline grounds. This release fixes the intake → plan contract end-to-end.

Latency clarification before plan generation. The planner used to inject a 500ms (startup) or 200ms (growth, enterprise) SLO whenever the goal string contained “latency”. That silent default filtered cheap GPUs out of feasibility. Now, if the user wrote “latency” without a number, the agent calls ask_clarification and the chat overlay asks “Do you have a latency budget, or should we pick the cheapest VRAM-compatible GPU and measure latency in Phase 1?” before producing the runbook.

Cheapest GPU when no SLO is set. Feasibility’s verdict for a null target was previously undefined; now any VRAM-fitting GPU is “comfortable” and the planner ranks by cost subject to fit. Prompts like “find me the cheapest GPU” now actually recommend cheap GPUs.

Phase reorder: serving comes early. The LLM optimization recipe now runs Hardware, Baseline, Serving, Quantization, KV cache, Speculation, Kernels, Deploy. Serving config (batch size, prefix cache, GPU utilization) is the cheapest lever, and every downstream optimizer benchmarks against the tuned serving baseline rather than raw FP16.

Confirm-vs-sweep hardware phase. When feasibility already bound a GPU during intake, the runbook’s hardware phase is now a one-shot “validate the bound choice” check (~10s) instead of a fresh sweep. The full sweep still runs when intake did not pick a GPU or the user explicitly requested the cheapest-GPU search.

LLM inference is its own modality. The intake classifier used to bucket prompts like “optimize Llama 8b” as “Other”, so downstream prompts got generic context.
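As a rough illustration of the latency-clarification and cheapest-GPU rules in this entry, the sketch below flags goals that mention latency without a number and ranks VRAM-fitting GPUs by cost when no SLO is set. It is not the planner's code: the function names, the regex, the Gpu type, and every VRAM and price figure are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: float
    usd_per_hour: float  # hypothetical price, for illustration only


# Matches an explicit budget such as "200ms" or "0.5 s".
# The lookbehind skips digits embedded in names like "A100s" or "p95".
_BUDGET = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*(ms|s)\b", re.IGNORECASE)


def parse_latency_budget_ms(goal: str) -> Optional[float]:
    """Return the stated latency budget in milliseconds, or None if absent."""
    match = _BUDGET.search(goal)
    if match is None:
        return None
    value = float(match.group(1))
    return value if match.group(2).lower() == "ms" else value * 1000.0


def needs_latency_clarification(goal: str) -> bool:
    """True when the goal says "latency" but gives no number to plan against."""
    return "latency" in goal.lower() and parse_latency_budget_ms(goal) is None


def rank_gpus_without_slo(gpus: List[Gpu], required_vram_gb: float) -> List[Gpu]:
    """With no SLO, every GPU that fits is acceptable; rank by cost."""
    fitting = [g for g in gpus if g.vram_gb >= required_vram_gb]
    return sorted(fitting, key=lambda g: g.usd_per_hour)


# Illustrative catalog; the numbers are placeholders, not RunInfra pricing.
CATALOG = [
    Gpu("H100", 80, 3.0),
    Gpu("L4", 24, 0.5),
    Gpu("A100", 80, 2.0),
    Gpu("L40S", 48, 1.0),
]
```

Under these assumptions, a goal such as "optimize for low latency" would trigger the clarification question, while "find me the cheapest GPU" would skip it and go straight to the cost-ranked list.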
Catch keywords (inference, llm, vllm, sglang, …) and known families (llama, qwen, mistral, mixtral, phi, gemma, deepseek, …) now route prompts to the “llm inference” modality.

Reasoning paragraph reads in full. The “Reasoning” paragraph in the plan markdown used to truncate at 420 characters with an ellipsis cliff; now it renders the full LLM output. Tradeoffs / rejected alternatives bullets remain capped, now at 250 characters (up from 150).

Open-questions card. The plan’s openQuestions field is now surfaced inline above the markdown body. Each question renders as a row with a required-chip when applicable and an option-hint list when present. The card is hidden entirely when the plan has no open questions.

Quantization quality gate is a typed field. Phase-quant carries qualityThreshold: 0.95 on the plan graph, and the runbook UI renders “canary >= 0.95” inline next to the phase name instead of “default >= 0.95” in description prose.

Total ETA shows a range. Plans now ship expectedDurationRangeSec per phase (min, typical, max) and a graph-level totalExpectedDurationRangeSec. The runbook header reads “X of 14m, up to 38m total” when the range is present, so the FP8-full-canary-runs-long case is no longer a surprise.

Estimate assumptions plumbed. LatencyEstimate and StackEstimate now carry an assumptions field exposing the request profile (context tokens, output tokens, batch, concurrency) the formula used. The number is the same; the disclosure is new.

Cost rationale spells out the request profile. The plan’s cost line used to read "$0.0029 / request" with no disclosure of what input/output/batch/concurrency was assumed. The rationale string now reads “Derived from the latest feasibility estimate (assumes 2048 input tokens, 256 output tokens, batch 1, concurrency 1). The execution run replaces this with measured cost per request after baseline profiling.” so the number is never quoted profile-free.

Runbook plan streams smoothly word-by-word.
The Plan tab’s runbook viewer used to receive paragraphs from the server with 80-3500ms gaps between SSE chunks, which the client reveal hook saw as long idle windows. The user perceived this as paragraph-by-paragraph chunkiness. The server now ships the full plan markdown in ~300ms regardless of length; the client useSmoothTextReveal hook (already wired into the viewer) drives the perceptual cadence at ~71 wps with steady buffer growth instead of cliff-edged jumps. Long plans now reveal continuously over the natural reading time, with no more stalls between paragraphs.

May 18, 2026
Docs, Site
Docs polish: use cases, research, news, SSE event reference, deployment targets
Documentation expansion
A round of docs additions to match what shipped on the product:

New sections
- Use cases. Six pre-built workflow pages live under /use-cases: voice-agent, ai-assistant, embeddings, rag-search, document-ai, transcription. Each walks through the architecture, the canonical model stack, an example prompt to paste into Pipes, and a code snippet for the OpenAI-compatible API.
- Research index at /research/overview with five published arXiv papers grouped into Compute efficiency and Model architectures. Each entry links to arXiv (PDF + abstract) and the code repo on GitHub.
- News overview at /news/overview pointing at the live newsroom plus RSS / Atom subscription URLs and the structured-data setup AI engines rely on.
- Deployment targets at /deployments/targets explaining the three places a pipeline can ship: managed RunPod (default), self-hosted Modal, and custom GPU.
- SSE event reference at /api-reference/sse-events with every server-sent-event the engine emits during chat, optimization, and runbook streams, including heartbeats and reconnection rules.
- Instant Start now covers the regional cache architecture, eviction rules, multi-GPU shard staging, and the cold-start time breakdown.
- Autoscaling explains how replica count is computed from concurrency and queue depth, with a cost-versus-latency knob table.
- Rate limits documents the leaky-bucket burst behavior and the per-key versus workspace scope.
- OpenAI compatibility lists the unsupported parameters explicitly, the HTTP-status to OpenAI error.type mapping, and the strict_params=true flag.
- Plans now spells out the two independent credit pools, optimization credits and inference credits, that paid plans use.
- Mintlify theme tuned to sharp edges (no rounded corners on callouts or sidebar active items), brand-lime primary collapsed to a single accent across light/dark, and Inter Display as the font family.
- Every page swept for em-dashes, en-dashes, and middle-dot separators, all replaced with commas, hyphens, or rewrites per the house style.
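For readers unfamiliar with the leaky-bucket behavior that the Rate limits page now documents, here is a generic sketch of the technique. It is a textbook leaky-bucket meter, not RunInfra's limiter: the class name, the capacity, and the leak rate are all hypothetical, and the changelog does not state the real values.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LeakyBucket:
    capacity: float       # largest burst the bucket absorbs (hypothetical)
    leak_per_sec: float   # sustained request rate (hypothetical)
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    _level: float = field(default=0.0, init=False)
    _last: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self._last = self.clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Drain the bucket for elapsed time, then admit the request if it fits."""
        now = self.clock()
        drained = (now - self._last) * self.leak_per_sec
        self._level = max(0.0, self._level - drained)
        self._last = now
        if self._level + cost > self.capacity:
            return False  # bucket would overflow: reject
        self._level += cost
        return True
```

A per-key versus workspace scope could be modeled by keeping one bucket per API key and another per workspace, and admitting a request only when both allow it. That layering is an assumption about how such scopes might combine, not a description of the product.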
May 10, 2026
Platform, Reliability, Privacy
Measured-only metrics, realtime reliability, and privacy hardening
Measured numbers, no more guesses
Every number you see in the product is now backed by a real benchmark. If we can’t measure it on real hardware, we don’t show a number at all.

Optimization and feasibility
- Feasibility cards now report fits or doesn’t-fit only. Before, the GPU comparison grid showed estimated latency and dollars-per-request derived from a physics roofline. Those numbers were accurate to within ±25% at best, yet looked authoritative. They’re gone. Real latency lands during the runbook on a real GPU.
- KV cache quality scores now use measured FP16 comparisons in fast and deep modes. Fast-mode used to ship hardcoded heuristics (FP8 = 0.99, INT4 = 0.95). Both modes now run a real per-model inference comparison against FP16 (8 prompts in fast, 20 in deep).
- Optimization rows drop placeholder metrics on failure paths. When a re-profile on the recommended GPU fails, the row no longer ships the orchestrator’s heuristic latency. You see the method, GPU, and quantization. No fake numbers.
- Synthetic Hugging Face configs are gone. Gated models (Llama 3.1, Mixtral, Qwen 2.5) now require HF_TOKEN with license access. Hardcoded architecture fallbacks could drift from the actual model and silently break deployments. They’re removed.
Reliability and stability
- Deployment subscriptions now use per-consumer Supabase channels to avoid callback registration races. Fixes the “cannot add postgres_changes callbacks” error some users hit when opening a pipeline with multiple optimization versions in history.
- SSE drain failures surface to Sentry. Stream-truncation events that used to silently drop now show up with workspace context so we can act on them.
- Chat, deploy, infer, and optimize requests now include workspace trace headers. X-User-Id, X-Workspace-Id, X-Plan-Tier, and X-Request-Id thread end-to-end so a single trace can be linked across RunPipe and the engine.
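To make the trace headers above concrete, the sketch below builds the four headers for an outgoing request. Only the header names come from this changelog; the helper name, the value formats, and the choice of a UUID for the request ID are assumptions.

```python
import uuid
from typing import Dict, Optional


def trace_headers(
    user_id: str,
    workspace_id: str,
    plan_tier: str,
    request_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build the four trace headers; reuse request_id to keep one trace linked."""
    return {
        "X-User-Id": user_id,
        "X-Workspace-Id": workspace_id,
        "X-Plan-Tier": plan_tier,
        # Assumption: a fresh UUID is minted when the caller has no ID yet.
        "X-Request-Id": request_id or str(uuid.uuid4()),
    }
```

Passing the same request_id on every hop is what would let a single trace be followed across services in this model.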
Privacy and observability
- Do Not Track is respected. PostHog initialization aborts when navigator.doNotTrack === "1" regardless of cookie consent.
- Client IPs are no longer sent to analytics. PostHog client init now passes ip: false.
- Signout resets analytics identity. PostHog identity and Sentry scope clear on logout so the next user on the same device gets a clean session.
- URL secret scrubbing. Sentry now redacts ?api_key=, ?token=, ?secret=, and ?password= query params from captured request URLs.
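The URL secret scrubbing above can be illustrated with a small standalone function. This is a sketch of the idea, not the Sentry configuration RunInfra uses: the function name and the REDACTED placeholder are assumptions, and only the four parameter names come from this entry.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names listed in the changelog entry above.
SENSITIVE_PARAMS = {"api_key", "token", "secret", "password"}


def scrub_url(url: str, placeholder: str = "REDACTED") -> str:
    """Replace the values of sensitive query params, leaving the rest intact."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [
        (key, placeholder if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in pairs
    ]
    return urlunsplit(parts._replace(query=urlencode(cleaned)))
```

The parameter name is kept and only the value is masked, so a captured URL still shows that a credential was present without exposing it.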
UX polish
- User chat bubbles now use the same rounded-corner styling as dashboard tool cards. No more lone sharp panel sitting next to rounded surfaces.
- Deployments page now has a loading skeleton instead of flashing blank during slow first loads.
Runtime and endpoint expansion
RunInfra now exposes more of the serving stack directly in the product, so teams can choose the runtime and endpoint shape that matches their model type.

Serving and models
- Runtime-aware deployments. Pipelines can target vLLM, SGLang, TensorRT-LLM, or vLLM Omni when the selected model category supports that runtime.
- Embeddings API. Deployed embedding models can be called through the OpenAI-compatible POST /v1/embeddings endpoint for RAG, semantic search, clustering, and retrieval workflows.
- Voice and audio endpoints. Speech-to-text and text-to-speech deployments expose OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech endpoints.
- Instant Start. FlashBoot is now Instant Start, RunInfra’s weight-caching layer for faster Flex cold starts.
- Exact endpoint playground tests. The Deploy tab playground now targets the selected deployment endpoint, so tests match the endpoint row you are inspecting.
- Workspace-scoped keys. One API key can reach every verified active deployment in a workspace. Pass the target model in the request body or discover available deployments with GET /v1/models.
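The workspace-scoped key flow above can be sketched with the standard library alone. The paths /v1/embeddings and /v1/models come from this entry, and the model and input body fields follow the OpenAI embeddings request shape. The base URL is a placeholder, and the Bearer authorization header is an assumption based on OpenAI-compatible conventions. The functions only build requests and send nothing.

```python
import json
import urllib.request
from typing import List

# Placeholder host: substitute the base URL shown for your workspace.
BASE_URL = "https://api.example.invalid"


def build_embeddings_request(
    api_key: str, model: str, texts: List[str], base_url: str = BASE_URL
) -> urllib.request.Request:
    """POST /v1/embeddings, selecting the deployment via the model field."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def build_models_request(
    api_key: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """GET /v1/models to discover which deployments the key can reach."""
    return urllib.request.Request(
        url=f"{base_url}/v1/models",
        method="GET",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Either request could then be sent with urllib.request.urlopen. Because the key is workspace-scoped, the same key would serve both calls, and only the model value changes between deployments.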
Initial release
RunInfra is now live. Here is everything that shipped in the first release.

Core platform
- RunInfra is live. Build, optimize, and deploy AI inference pipelines through conversation. Describe what you need in plain English, and the agent handles the rest: model selection, GPU configuration, optimization, and deployment.
- Chat-driven pipeline builder. No YAML, no DevOps. The AI agent selects models, configures routing, and optimizes your pipeline from a single chat interface.
- Visual pipeline canvas. Drag-and-drop node composition with Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes for teams that prefer a visual workflow.
- Session persistence. Conversations, optimization results, and pipeline state survive page reloads.
- GPU optimization. Benchmarks models across GPU types (L4, L40S, A100, H100, H200, B200) using real inference. Results show P50/P99 latency, throughput, and cost per request for every experiment.
- Quantization search. Finds and tests pre-optimized model variants (AWQ, GPTQ, FP8) against your baseline and ranks them by your stated constraints.
- Forge kernel optimization. Profiles GPU bottlenecks and applies pre-optimized Triton kernels for additional throughput improvements beyond quantization alone.
- NVIDIA TensorRT-LLM. Compiled inference engine for maximum throughput on NVIDIA GPUs. Available on the Team plan.
- Optimization dashboard. Compare optimization versions side by side with real metrics: latency (P50, P99), throughput, cost per request, and quality score.
- One-click deploy. Push optimized pipelines to production API endpoints with managed GPU hosting, auto-scaling, and monitoring.
- Deployment modes. Flex (scale-to-zero, pay only when processing) or Active (always-on with zero cold start, Team plan). Cold starts under 2 seconds with cached model weights.
- OpenAI-compatible endpoints. Every deployed pipeline works with the OpenAI SDK. Change two lines of code to switch from OpenAI to RunInfra.
- Per-token pricing. Transparent billing based on model size. See estimated costs before you deploy.
- API playground. Test your pipeline with real requests before deploying. See response quality, latency, and token usage in real time.
- Code export. Generate production-ready deployment files: Python scripts, Dockerfiles, Kubernetes manifests, and Docker Compose configurations.
- Usage analytics. Track requests, tokens, cost, and latency across all endpoints with daily charts and per-model breakdowns on the Observe dashboard.
- LLMs. Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, and Cohere models supported out of the box.
- Speech-to-text. Whisper (all sizes) for automatic speech recognition.
- Text-to-speech. XTTS and Bark for speech synthesis.
- Custom models. Upload models from Hugging Face and run them through the full optimization and deployment workflow (Team plan).
- Starter. Free. 3 pipelines, 3 optimization sessions/month, 100 playground requests/day.
- Pro. $49/month. 20 optimization sessions, deployment to live API endpoints, priority email support.
- Team. $249/seat/month. 100 sessions per seat, TensorRT-LLM, Active deployment mode, shared Slack support. Unused sessions roll over.
- Enterprise. Custom pricing. Dedicated customer success manager and custom contract terms.
- Overage sessions. $2.50 each on all paid plans.
- Full documentation published: prompting guide, example conversations, feature docs, and troubleshooting.
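The plan prices listed above imply simple arithmetic for a month that runs past the included optimization sessions. The sketch below covers only the subscription fee plus overage sessions for Pro and Team. It assumes Team sessions are pooled across seats, ignores rollover, and excludes per-token inference charges; none of these assumptions is confirmed by the changelog.

```python
from decimal import Decimal

OVERAGE_USD = Decimal("2.50")  # per extra session, from the list above

# Base price and included sessions as listed; per_seat marks Team pricing.
PLANS = {
    "pro": {"base": Decimal("49"), "included": 20, "per_seat": False},
    "team": {"base": Decimal("249"), "included": 100, "per_seat": True},
}


def monthly_cost(plan: str, sessions_used: int, seats: int = 1) -> Decimal:
    """Subscription fee plus overage for sessions beyond the included count."""
    spec = PLANS[plan]
    multiplier = seats if spec["per_seat"] else 1
    base = spec["base"] * multiplier
    included = spec["included"] * multiplier  # assumption: pooled across seats
    overage_sessions = max(0, sessions_used - included)
    return base + OVERAGE_USD * overage_sessions
```

For example, under these assumptions a Pro workspace that runs 26 sessions pays for 6 overage sessions on top of the $49 base.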
Roadmap
RunInfra currently supports LLMs, embeddings, speech-to-text (Whisper), text-to-speech (XTTS, Bark), and vision-language pipelines where the selected model and runtime support them. The following capabilities are in active development:
- Image generation. Stable Diffusion, FLUX, and other diffusion models with GPU optimization.
- Database integration. Managed vector databases and traditional databases connected directly to inference pipelines.
- End-to-end AI infrastructure. Ingest data, store embeddings, run inference, and serve results from one platform.
Want early access to any of these features? Contact us and tell us what you’re building.