What Onde Is
Onde lets your iOS or macOS app run a language model locally, on the device, without calling any server. This page covers what that means, what it solves, and where it fits.
The problem
Adding AI to a mobile app usually means calling a cloud API. You write a request, wait for a round-trip to a data center, get a response. The cost scales with usage, the latency scales with network conditions, and every message your user types goes to infrastructure you don't control.
That works for a lot of apps. It works less well when:
- Response time matters. Conversational AI starts to feel broken once latency climbs past roughly 300 ms, and cloud round-trips on mobile networks rarely stay below that reliably.
- Cost scales against you. $0.01 per API call is fine at 1,000 users and a real problem at 1,000,000.
- Data cannot leave the device. Healthcare, legal, finance, mental health — any category where "we send your input to a third-party server" is a liability.
- The app needs to work offline. Planes, rural areas, enterprise MDM environments with restricted network egress.
On-device inference sidesteps all of these. The model runs in your app's process, on the user's hardware, with no network required after the initial download.
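The cost point is easy to make concrete. A back-of-the-envelope sketch, using the $0.01-per-call figure above and an illustrative (assumed) 30 calls per user per month:

```rust
fn monthly_api_cost(users: u64, calls_per_user: u64, cost_per_call: f64) -> f64 {
    users as f64 * calls_per_user as f64 * cost_per_call
}

fn main() {
    // $0.01 per call, 30 calls per user per month (illustrative numbers).
    println!("{}", monthly_api_cost(1_000, 30, 0.01)); // → 300
    println!("{}", monthly_api_cost(1_000_000, 30, 0.01)); // → 300000
}
```

At a thousand users that is a rounding error; at a million users it is a six-figure monthly bill. On-device inference turns the same usage into zero marginal cost.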
What Onde does
Onde is a Rust library with a Swift Package for Apple platforms. It handles the gap between "I want to run a model" and "I have a working model loaded and producing output."
That gap is bigger than it sounds. It includes downloading the model file from HuggingFace Hub, managing the on-disk cache so the model isn't re-downloaded every launch, configuring the inference engine for the target platform (Metal on Apple silicon), setting up conversation history, system prompts, and sampling parameters, and exposing a clean async API that won't block the UI thread.
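In outline, the caching step works like any download cache: derive a deterministic on-disk path from the repo and filename, and skip the download when the file is already there. The sketch below is a hypothetical std-only illustration of that idea, not Onde's actual implementation; the cache layout and the repo/file names are assumptions for the example.

```rust
use std::path::{Path, PathBuf};

/// Map a HuggingFace repo + filename to a deterministic cache location.
/// The layout (<root>/<repo with '/' replaced by "--">/<file>) is illustrative.
fn cached_model_path(cache_root: &Path, repo: &str, file: &str) -> PathBuf {
    cache_root.join(repo.replace('/', "--")).join(file)
}

/// The download step runs only when the cached file is absent.
fn needs_download(path: &Path) -> bool {
    !path.exists()
}

fn main() {
    let root = std::env::temp_dir().join("model-cache-demo");
    let path = cached_model_path(
        &root,
        "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
        "qwen2.5-1.5b-instruct-q4_k_m.gguf",
    );
    // First launch: nothing cached yet, so a download would be triggered.
    println!("needs download: {}", needs_download(&path));
}
```

Because the path is a pure function of the repo and filename, every subsequent launch resolves to the same file and loads straight from disk.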
The inference engine underneath is mistral.rs. Onde wraps it.
What it looks like
import Onde
let engine = OndeChatEngine()
// Downloads Qwen 2.5 1.5B on first run (~941 MB), then loads from cache.
try await engine.loadDefaultModel(
    systemPrompt: "You are a helpful assistant.",
    sampling: nil
)
let result = try await engine.sendMessage(message: "What year did the Berlin Wall fall?")
print(result.text)
// → "The Berlin Wall fell in 1989."

Two calls. No API key. No server. No network after the first run.
The models
Onde defaults to the Qwen 2.5 family in GGUF Q4_K_M format — pre-quantized files where the 1.5B variant is 941 MB and the 3B is 1.93 GB.
On iOS, Onde loads the 1.5B. iOS gives apps roughly 2–3 GB of memory depending on the device; the 1.5B leaves headroom for everything else. The 3B caused OOM terminations on iPhone 16e in testing.
On macOS, the 3B default gives noticeably better output and sits comfortably on any Mac with 8 GB unified memory.
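The file sizes follow directly from the quantization. Q4_K_M stores roughly five bits per weight once block scales are included, which a quick check against the sizes above confirms (the parameter counts here are approximate assumptions):

```rust
fn bits_per_weight(file_bytes: f64, params: f64) -> f64 {
    file_bytes * 8.0 / params
}

fn main() {
    // 941 MB file, ~1.54B parameters for the 1.5B variant (approximate).
    println!("{:.1}", bits_per_weight(941e6, 1.54e9)); // → 4.9
    // 1.93 GB file, ~3.09B parameters for the 3B variant (approximate).
    println!("{:.1}", bits_per_weight(1.93e9, 3.09e9)); // → 5.0
}
```

About five bits per weight, versus sixteen for the original half-precision checkpoint, which is why a 3B model fits in under 2 GB.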
You can load any GGUF model from HuggingFace — the defaults are sensible starting points, not restrictions.
What it doesn't do
Onde is not a frontier model. Qwen 2.5 1.5B handles summarization, Q&A, classification, and conversational assistance well. It isn't GPT-4. Complex multi-step reasoning, very long documents, and tasks requiring deep world knowledge will get better results from a larger cloud model.
There's also the initial download. Users need to pull ~941 MB before the feature works. If AI is an occasional feature in your app, a cloud API is probably simpler. Onde makes the most sense when AI is central enough to the product that the download is worth it.
Where it runs
| Platform | Status | Acceleration |
|---|---|---|
| iOS | Production | Metal, Apple Neural Engine |
| macOS | Production | Metal, Apple Neural Engine |
| tvOS | Production | Metal |
| visionOS | In development | Metal |
| Android | In development | CPU |
The Swift Package is at github.com/ondeinference/onde-swift. The Rust crate is on crates.io/crates/onde.