When the device isn't enough.

Onde Cloud is an OpenAI-compatible LLM inference API from Onde Inference. Server-side. Integrated with your existing Onde account and model catalog. Change one baseURL and your existing OpenAI SDK connects to it.

API format: OpenAI-compatible REST
Models: Qwen 2.5 Coder, Qwen 3 (GGUF Q4_K_M)
Auth: Bearer token (app ID:secret)
Streaming: SSE, same as OpenAI
Before
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
After
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://cloud.ondeinference.com/v1",
  apiKey: "your-app-id:your-app-secret",
});

Everything else stays the same. Same SDK. Same response shape. Same streaming. Works with the official openai npm package and the openai Python library.
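If you flip between OpenAI and Onde Cloud during development, the two settings above can live in one place. A minimal sketch, where `ondeClientOptions` is a hypothetical helper name, not part of the openai SDK or any Onde library:

```typescript
// Hypothetical helper: collect the two Onde Cloud settings so every
// call site constructs its client the same way.
function ondeClientOptions(appId: string, appSecret: string) {
  return {
    baseURL: "https://cloud.ondeinference.com/v1",
    // Onde credentials go where the OpenAI key normally goes.
    apiKey: `${appId}:${appSecret}`,
  };
}

// Spread into the official SDK's constructor:
//   const client = new OpenAI(ondeClientOptions(id, secret));
```

Swapping back to OpenAI is then a one-function change rather than a hunt through call sites.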

OpenAI-compatible REST API

Drop in. No rewrite required.

Onde Cloud accepts the same request format as the OpenAI chat completions API. If you ship with Onde on-device, you already know the model catalog and the account system. If you're evaluating Onde Cloud as a Baseten or RunPod alternative, the integration takes one afternoon. You don't learn a new platform. You add one endpoint.

Drop in. Don't rewrite.

The endpoint speaks OpenAI. Your existing client libraries work as-is. Change one URL. Nothing else changes in your code.

Your model. Your call.

Assign a model to your app from the Onde dashboard. Update it any time without touching code or redeploying. Model changes take effect immediately.

Already an Onde user?

Your account is your API key. The same app you registered for on-device inference works here. No new signup. No new billing page.

Hybrid LLM inference architecture

On-device inference and cloud, from one account.

Most requests never need a server. Onde on-device handles them at ~85ms, for free, with full privacy — data never leaves the device. Onde Cloud is for the cases where the on-device model isn't the right call: background jobs, heavy prompts, server-rendered features, or platforms where you can't ship a model bundle.

The same model family. The same Onde account. Swap the endpoint, not the mental model.

                 On-device                            Onde Cloud
API format       Onde SDK (Swift, Dart, Rust)         OpenAI-compatible REST
Where it runs    The user's device                    Onde inference servers
Latency          ~85ms, no network                    Network + model time
Auth             App credentials (baked in at build)  Same credentials, as Bearer token
Model selection  Platform default or SDK config       Assigned in Onde dashboard
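One way a hybrid client can act on this split is a small routing check before each request. A sketch under stated assumptions: `chooseRoute`, `RequestContext`, and the field names are hypothetical, and the 4096-token cutoff is an illustrative threshold, not an Onde limit:

```typescript
type Route = "on-device" | "cloud";

// Hypothetical request context for deciding where to run inference.
interface RequestContext {
  promptTokens: number;     // rough size of the prompt
  isBackgroundJob: boolean; // running server-side, no user device present
  hasModelBundle: boolean;  // this platform shipped an on-device model
}

function chooseRoute(ctx: RequestContext): Route {
  // No local model, or no device at all: the cloud endpoint is the only option.
  if (!ctx.hasModelBundle || ctx.isBackgroundJob) return "cloud";
  // Heavy prompts: send to the server rather than the on-device model.
  if (ctx.promptTokens > 4096) return "cloud";
  // The common case: free, private, ~85ms on-device inference.
  return "on-device";
}
```

Because both routes share one account and model family, the branch changes only the endpoint, not the request shape.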

API authentication

Simple authentication with your app credentials.

Register your app in the Onde dashboard. You get an app ID and a secret. Pass them as your Bearer token. That's the full auth setup — no token refresh, no OAuth flow, no rotating secrets page.

# Verify your credentials and confirm the endpoint is live
curl https://cloud.ondeinference.com/v1/chat/completions \
  -H "Authorization: Bearer your-app-id:your-app-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-3b",
    "messages": [{ "role": "user", "content": "Hello" }],
    "stream": false
  }'
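The curl above sets "stream": false. With "stream": true, the endpoint emits OpenAI-style server-sent events. A sketch of collecting text deltas from a buffered SSE body, assuming the same `data: {...}` / `data: [DONE]` wire format the OpenAI API uses (`parseSseDeltas` is a hypothetical name):

```typescript
// Extract the per-chunk text deltas from an OpenAI-style SSE stream body.
function parseSseDeltas(body: string): string[] {
  const deltas: string[] = [];
  for (const line of body.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines and comments
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break;          // stream terminator
    const chunk = JSON.parse(payload);
    const text = chunk.choices?.[0]?.delta?.content;
    if (typeof text === "string") deltas.push(text); // role-only chunks carry no content
  }
  return deltas;
}
```

In practice you would feed this line by line from a fetch reader rather than a buffered string; the SDKs do that for you.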

Supported GGUF models

Qwen 2.5 Coder, Qwen 3, and more — assign once, change any time.

Onde Cloud serves GGUF-quantized models from the Onde model catalog. Every model runs in Q4_K_M format, balancing quality and memory. Assign a model to your Onde app from the dashboard — no redeployment, no environment variable change, no support ticket.

Model                         Parameters  Size      Best for
Qwen 2.5 Coder 1.5B Instruct  1.5B        ~941 MB   Fast code generation
Qwen 2.5 Coder 3B Instruct    3B          ~1.93 GB  Code and reasoning
Qwen 2.5 Coder 7B Instruct    7B          ~4.4 GB   Complex code tasks
Qwen 3 1.7B                   1.7B        ~1.3 GB   Fast general chat
Qwen 3 4B                     4B          ~2.7 GB   Balanced general use
Qwen 3 8B                     8B          ~5 GB     High-quality reasoning

The same models are available in the Onde Swift SDK, Dart SDK, and Onde CLI. If it runs locally, it runs here.

Common questions

Frequently asked questions.

Is Onde Cloud compatible with the OpenAI Python and JavaScript SDKs?
Yes. Onde Cloud speaks the OpenAI REST API format. Set baseURL to https://cloud.ondeinference.com/v1 and your apiKey to your Onde app credentials. No other changes are needed — chat completions, streaming, and model listing all work with the official openai npm package and the openai Python library.
What language models does Onde Cloud support?
Onde Cloud supports GGUF-quantized models from the Onde model catalog: Qwen 2.5 Coder 1.5B, 3B, and 7B Instruct (Q4_K_M); Qwen 3 1.7B, 4B, and 8B (Q4_K_M); and DeepSeek Coder 6.7B Instruct. You assign a model to your app from the Onde dashboard. The assigned model is served on every API request.
How do I get an API key for Onde Cloud?
Sign in to Onde Inference, register an app, and assign a model. Your app ID and app secret together are your API credentials. Pass them as your Bearer token: Authorization: Bearer your-app-id:your-app-secret. No separate API key management page.
What is the difference between Onde on-device inference and Onde Cloud?
Onde on-device inference runs language models directly on the user's Apple silicon device using the Onde Swift, Dart, or Rust SDK — no network request, ~85ms latency, zero server cost, full privacy. Onde Cloud is a server-side REST API for cases where on-device isn't feasible: background jobs, server-rendered features, or platforms without a native SDK. Both use the same Onde account and model catalog.

Get started

Your Onde app is already most of the way there.

Sign in, open your app, assign a model, copy your credentials. Five minutes to a working LLM inference endpoint. Read The Forward Pass if you want the longer argument for why on-device and cloud should work together.