What Onde Is
Onde lets your iOS or macOS app run a language model locally, on the device, without calling any server. This page covers what that means, what it solves, and where it fits.
The problem
Adding AI to a mobile app usually means calling a cloud API. You write a request, wait for a round-trip to a data center, get a response. The cost scales with usage, the latency scales with network conditions, and every message your user types goes to infrastructure you don't control.
That works for a lot of apps. It works less well when:
- Response time matters. Conversational AI starts to feel broken once latency climbs past roughly 300 ms, and cloud round-trips on mobile networks rarely stay below that reliably.
- Cost scales against you. $0.01 per API call is fine at 1,000 users and a real problem at 1,000,000.
- Data cannot leave the device. Healthcare, legal, finance, mental health — any category where "we send your input to a third-party server" is a liability.
- The app needs to work offline. Planes, rural areas, enterprise MDM environments with restricted network egress.
On-device inference sidesteps all of these. The model runs in your app's process, on the user's hardware, with no network required after the initial download.
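The cost point is easy to make concrete. A back-of-the-envelope sketch, using the $0.01-per-call figure above and an illustrative (assumed) 30 calls per user per month:

```rust
fn monthly_api_cost(users: u64, calls_per_user: u64, cost_per_call: f64) -> f64 {
    users as f64 * calls_per_user as f64 * cost_per_call
}

fn main() {
    // $0.01 per call, 30 calls per user per month (illustrative numbers).
    println!("{}", monthly_api_cost(1_000, 30, 0.01)); // → 300
    println!("{}", monthly_api_cost(1_000_000, 30, 0.01)); // → 300000
}
```

At a thousand users that is a rounding error; at a million users it is a six-figure monthly bill. On-device inference turns the same usage into zero marginal cost.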
What Onde does
Onde is a Rust library with a Swift Package for Apple platforms. It handles the gap between "I want to run a model" and "I have a working model loaded and producing output."
That gap is bigger than it sounds. It includes downloading the model file from HuggingFace Hub, managing the on-disk cache so the model isn't re-downloaded every launch, configuring the inference engine for the target platform (Metal on Apple silicon), setting up conversation history, system prompts, and sampling parameters, and exposing a clean async API that won't block the UI thread.
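In outline, the caching step works like any download cache: derive a deterministic on-disk path from the repo and filename, and skip the download when the file is already there. The sketch below is a hypothetical std-only illustration of that idea, not Onde's actual implementation; the cache layout and the repo/file names are assumptions for the example.

```rust
use std::path::{Path, PathBuf};

/// Map a HuggingFace repo + filename to a deterministic cache location.
/// The layout (<root>/<repo with '/' replaced by "--">/<file>) is illustrative.
fn cached_model_path(cache_root: &Path, repo: &str, file: &str) -> PathBuf {
    cache_root.join(repo.replace('/', "--")).join(file)
}

/// The download step runs only when the cached file is absent.
fn needs_download(path: &Path) -> bool {
    !path.exists()
}

fn main() {
    let root = std::env::temp_dir().join("model-cache-demo");
    let path = cached_model_path(
        &root,
        "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
        "qwen2.5-1.5b-instruct-q4_k_m.gguf",
    );
    // First launch: nothing cached yet, so a download would be triggered.
    println!("needs download: {}", needs_download(&path));
}
```

Because the path is a pure function of the repo and filename, every subsequent launch resolves to the same file and loads straight from disk.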
The inference engine underneath is mistral.rs. Onde wraps it.
What it looks like
import Onde
let engine = OndeChatEngine()
// Downloads Qwen 2.5 1.5B on first run (~941 MB), then loads from cache.
try await engine.loadDefaultModel(
    systemPrompt: "You are a helpful assistant.",
    sampling: nil
)
let result = try await engine.sendMessage(message: "What year did the Berlin Wall fall?")
print(result.text)
// → "The Berlin Wall fell in 1989."

Two calls. No API key. No server. No network after the first run.
The models
Onde defaults to the Qwen 2.5 family in GGUF Q4_K_M format — pre-quantized files where the 1.5B variant is 941 MB and the 3B is 1.93 GB.
On iOS, Onde loads the 1.5B. iOS gives apps roughly 2–3 GB of memory depending on the device; the 1.5B leaves headroom for everything else. The 3B caused OOM terminations on iPhone 16e in testing.
On macOS, the 3B default gives noticeably better output and sits comfortably on any Mac with 8 GB unified memory.
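The file sizes follow directly from the quantization. Q4_K_M stores roughly five bits per weight once block scales are included, which a quick check against the sizes above confirms (the parameter counts here are approximate assumptions):

```rust
fn bits_per_weight(file_bytes: f64, params: f64) -> f64 {
    file_bytes * 8.0 / params
}

fn main() {
    // 941 MB file, ~1.54B parameters for the 1.5B variant (approximate).
    println!("{:.1}", bits_per_weight(941e6, 1.54e9)); // → 4.9
    // 1.93 GB file, ~3.09B parameters for the 3B variant (approximate).
    println!("{:.1}", bits_per_weight(1.93e9, 3.09e9)); // → 5.0
}
```

About five bits per weight, versus sixteen for the original half-precision checkpoint, which is why a 3B model fits in under 2 GB.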
You can load any GGUF model from HuggingFace — the defaults are sensible starting points, not restrictions.
What it doesn't do
Onde is not a frontier model. Qwen 2.5 1.5B handles summarization, Q&A, classification, and conversational assistance well. It isn't GPT-4. Complex multi-step reasoning, very long documents, and tasks requiring deep world knowledge will get better results from a larger cloud model.
There's also the initial download. Users need to pull ~941 MB before the feature works. If AI is an occasional feature in your app, a cloud API is probably simpler. Onde makes the most sense when AI is central enough to the product that the download is worth it.
Where it runs
| Platform | Status | Acceleration |
|---|---|---|
| iOS | Production | Metal, Apple Neural Engine |
| macOS | Production | Metal, Apple Neural Engine |
| tvOS | Production | Metal |
| visionOS | In development | Metal |
| Android | In development | CPU |
The Swift Package is at github.com/ondeinference/onde-swift. The Rust crate is on crates.io/crates/onde.