Documentation

Everything in one place.

Quick-start guides, API reference, model table, and FAQ for every Onde Inference entry point.

01 / Quick start

Pick your language.

Every SDK ships the same OndeChatEngine surface (named ChatEngine in the Rust crate): load a model, send a message, read the result. The runtime is native on every platform, with no server required.

Swift (iOS · macOS · tvOS · visionOS · watchOS)

import Onde

let engine = OndeChatEngine()
try await engine.loadDefaultModel(
    systemPrompt: "You are helpful.",
    sampling: nil
)
let result = try await engine.sendMessage(
    message: "Hello!"
)
print(result.text) // 85ms, on device

Rust (macOS · Linux · Windows)

use onde::inference::{ChatEngine, GgufModelConfig};

let engine = ChatEngine::new();
engine.load_gguf_model(
    GgufModelConfig::platform_default(),
    Some("You are helpful.".into()),
    None,
).await?;

let result = engine.send_message("Hello!").await?;
println!("{}", result.text); // 85ms, on device

Flutter (iOS · Android · macOS)

import 'package:onde_inference/onde_inference.dart';

final engine = OndeChatEngine();
await engine.loadDefaultModel(
    systemPrompt: 'You are helpful.',
);

final result = await engine.sendMessage(message: 'Hello!');
print(result.text); // 85ms, on device

React Native (iOS · Android)

import { OndeChatEngine } from '@ondeinference/react-native';

await OndeChatEngine.loadDefaultModel('You are helpful.');

const result = await OndeChatEngine.sendMessage('Hello!');
console.log(result.text); // 85ms, on device

02 / SDKs

Four first-class entry points.

One engine. All four SDKs share the same GGUF runtime, the same model cache, and the same API surface. Pick the one that matches your stack.

Swift SDK (SPM · XCFramework)

Binary package for all Apple platforms. Add the package's GitHub URL in Xcode and you are done.

Rust crate (crates.io)

The core engine. Use it directly in Rust apps or as the foundation for custom integrations.

Flutter SDK (pub.dev)

Cross-platform Dart bindings. Works on iOS, Android, and macOS from a single import.

React Native (npm)

Expo module wrapping the Rust core. iOS and Android from one JavaScript import.

03 / Cloud API

OpenAI-compatible. No migration cost.

Onde Cloud runs at cloud.ondeinference.com. Auth is a single Bearer token. Any client that already uses the OpenAI API works once you point it at the Onde base URL and supply that token.

Sign in and create an app

Create an account, register an app workspace, and copy your app_id and app_secret from the console.

Make your first request

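Send a standard OpenAI-style chat completions request, passing your app_id and app_secret as the Bearer token. The model field is accepted for compatibility, but the model assigned in your dashboard is the one the endpoint serves.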

curl https://cloud.ondeinference.com/v1/chat/completions \
  -H "Authorization: Bearer <app_id>:<app_secret>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Use the OpenAI SDK

Point the base URL at Onde Cloud. Everything else stays the same.

from openai import OpenAI

client = OpenAI(
    base_url="https://cloud.ondeinference.com/v1",
    api_key="<app_id>:<app_secret>",
)

response = client.chat.completions.create(
    model="qwen2.5-coder-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
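
The FAQ on this page says that adding "stream": true to the request body makes the endpoint return Server-Sent Events in the same format as the OpenAI streaming API. For clients that read the raw stream instead of going through an SDK, a minimal sketch of decoding those events might look like this. It assumes the usual OpenAI chunk shape (text in choices[0].delta.content, stream terminated by data: [DONE]). The function names are illustrative, and the sample lines are hand-written, not captured from Onde Cloud.

```python
import json
from typing import Iterable, Optional


def delta_from_sse_line(line: str) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if it has none."""
    line = line.strip()
    # Only "data:" lines carry payloads; blank lines and comments are skipped.
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    # OpenAI-style streams end with a literal [DONE] sentinel, which is not JSON.
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    # The first chunk usually carries only the role, so content may be absent.
    return (choices[0].get("delta") or {}).get("content")


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate every text delta until the [DONE] sentinel is seen."""
    parts = []
    for line in lines:
        if line.strip() == "data: [DONE]":
            break
        delta = delta_from_sse_line(line)
        if delta:
            parts.append(delta)
    return "".join(parts)


# Hand-written sample in the OpenAI chunk shape (an assumption, not real output).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))
```

With the OpenAI SDK shown above, none of this is needed: passing stream=True to client.chat.completions.create returns an iterator of already-parsed chunks.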

Base URL: https://cloud.ondeinference.com/v1

Auth header: Authorization: Bearer <app_id>:<app_secret>

Health check: GET /health → 200 OK

Models list: GET /v1/models → 200 OK (authenticated)
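
The endpoint summary above can be turned into a small connectivity check. The sketch below uses only the Python standard library and builds the two GET requests without sending them until is_ok is called. It assumes that /health sits at the host root rather than under /v1, and that it needs no credentials because only the models list is marked as authenticated. Both are readings of the summary, not confirmed behavior, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Host taken from this page. Assumption: /health lives at the root, outside /v1.
HOST = "https://cloud.ondeinference.com"


def bearer(app_id: str, app_secret: str) -> str:
    """Format credentials the way the auth header line shows them."""
    return f"Bearer {app_id}:{app_secret}"


def health_request() -> urllib.request.Request:
    """Build GET /health (assumed to need no credentials)."""
    return urllib.request.Request(f"{HOST}/health", method="GET")


def models_request(app_id: str, app_secret: str) -> urllib.request.Request:
    """Build the authenticated GET /v1/models request."""
    return urllib.request.Request(
        f"{HOST}/v1/models",
        headers={"Authorization": bearer(app_id, app_secret)},
        method="GET",
    )


def is_ok(request: urllib.request.Request, timeout: float = 5.0) -> bool:
    """Send the request and report whether the server answered 200."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urlopen raises on 4xx/5xx; treat those as "not OK".
        return False
```

Calling is_ok(health_request()) at startup, followed by is_ok(models_request(app_id, app_secret)), separates "the service is unreachable" from "the credentials are wrong".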

04 / CLI

Install once. Use everywhere.

The onde binary ships on every major package manager. It handles account management, model downloads, local fine-tuning, and GGUF export.

npm

npm install -g @ondeinference/cli

Homebrew

brew tap ondeinference/homebrew-tap && brew install onde

PyPI

pip install onde-cli

cargo

cargo install onde-cli

05 / Models

Supported GGUF models.

All models are Q4_K_M quantized GGUF files sourced from bartowski on HuggingFace. platform_default() selects automatically based on your target OS.

| Model ID | Size | Target |
| --- | --- | --- |
| qwen2.5-coder-1.5b | 941 MB | Mobile · iOS · Android |
| qwen2.5-coder-3b | 1.93 GB | Desktop · macOS · Linux |
| qwen2.5-coder-7b | 4.4 GB | High-memory devices |
| qwen3-1.7b | 1.3 GB | All platforms |
| qwen3-4b | 2.7 GB | All platforms |
| qwen3-8b | 5 GB | High-memory devices |
| qwen3-14b | 8.4 GB | High-memory devices |
| deepseek-coder-6.7b | 3.8 GB | High-memory devices |
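
To make the selection rule concrete, here is a small Python sketch that mirrors what this page documents: the FAQ's description of platform_default() and the sizes in the table above. It is an illustration of the documented behavior, not the SDK's implementation. The names default_model_id and models_within are hypothetical, and platforms the FAQ does not mention (for example watchOS or visionOS) are deliberately rejected rather than guessed.

```python
from typing import List

# File sizes in GB, copied from the model table (941 MB written as 0.941).
MODEL_SIZES_GB = {
    "qwen2.5-coder-1.5b": 0.941,
    "qwen2.5-coder-3b": 1.93,
    "qwen2.5-coder-7b": 4.4,
    "qwen3-1.7b": 1.3,
    "qwen3-4b": 2.7,
    "qwen3-8b": 5.0,
    "qwen3-14b": 8.4,
    "deepseek-coder-6.7b": 3.8,
}

# Platform groups as listed in the FAQ entry on the default model.
MOBILE = {"ios", "tvos", "android"}
DESKTOP = {"macos", "linux", "windows"}


def default_model_id(target_os: str) -> str:
    """Mirror the documented default: 1.5B on mobile, 3B on desktop."""
    key = target_os.lower()
    if key in MOBILE:
        return "qwen2.5-coder-1.5b"
    if key in DESKTOP:
        return "qwen2.5-coder-3b"
    # The page does not say what happens elsewhere, so do not guess.
    raise ValueError(f"no documented default for {target_os!r}")


def models_within(budget_gb: float) -> List[str]:
    """Model IDs whose file size fits the budget, smallest first."""
    fitting = [m for m, size in MODEL_SIZES_GB.items() if size <= budget_gb]
    return sorted(fitting, key=MODEL_SIZES_GB.get)


print(default_model_id("iOS"))
print(models_within(2.0))
```

The size filter only compares file sizes. Runtime memory use is higher than the file size and is not stated on this page, so treat the budget as a download or storage limit.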

06 / FAQ

Common questions.

Which model loads by default?

platform_default() picks Qwen 2.5 Coder 1.5B on iOS, tvOS, and Android, and Qwen 2.5 Coder 3B on macOS, Linux, and Windows. You can override it with any GgufModelConfig constructor.

Where are models cached?

On macOS the HuggingFace hub cache lives at ~/.cache/huggingface. On iOS and tvOS the sandbox requires you to call setupInferenceEnvironment() at launch to seed HF_HOME inside the app container before any OndeChatEngine call.

Does Onde Cloud use a different model than on-device?

No. The same GGUF models run on both. Your dashboard assignment determines which model the cloud endpoint serves. The request model field is accepted for compatibility but dashboard assignment takes precedence.

Can I stream tokens from the cloud API?

Yes. Add "stream": true to your request body. The endpoint returns Server-Sent Events using the same format as the OpenAI streaming API.

Does data leave the device with the on-device SDK?

No. When you use the on-device SDK, inference runs entirely in-process. No prompt, token, or result is transmitted over the network. The only network activity is the initial model download from HuggingFace Hub.

What HuggingFace token do I need?

All current default models (Qwen 2.5 and Qwen 3 family) are public and do not require a token. A token is only needed if you want to download gated models or upload custom fine-tunes.
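
For the caching question, it can help to check where downloaded models ended up on a desktop machine. The sketch below resolves the cache root the way the answer describes: HF_HOME if it is set, otherwise ~/.cache/huggingface. It then lists any GGUF files beneath it. This is an assumption-laden illustration: it ignores other HuggingFace overrides such as XDG_CACHE_HOME, the function names are hypothetical, and it is not part of the Onde SDK.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def hf_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the cache root: HF_HOME if set, else ~/.cache/huggingface."""
    env = os.environ if env is None else env
    override = env.get("HF_HOME")
    if override:
        return Path(override).expanduser()
    return Path("~/.cache/huggingface").expanduser()


def cached_gguf_files(root: Path) -> List[Path]:
    """List every .gguf file found under the cache root, sorted by path."""
    if not root.is_dir():
        # Nothing has been downloaded yet, or the path is wrong.
        return []
    return sorted(root.rglob("*.gguf"))


if __name__ == "__main__":
    root = hf_home()
    print(f"cache root: {root}")
    for path in cached_gguf_files(root):
        print(path)
```

Passing an explicit mapping, as in hf_home({"HF_HOME": "/tmp/onde-hf"}), makes the resolution easy to check without touching the real environment.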

Get started

Start building with Onde.

Create an account, register an app, and ship your first on-device or cloud inference call in minutes. Talk to us when you need scale, custom models, or security review.
