How fast is on-device AI inference on Apple Silicon?

Onde Inference runs at 85ms first-token latency on Apple Silicon devices, compared to 800ms or more for a typical cloud API round-trip. Inference runs entirely on-device with no network request in the critical path.

Does data leave the device when using Onde Inference?

No. When using the on-device SDK, all inference runs in-process on the device. No prompt, token, or result is sent over the network. The only network activity is the initial one-time model download from HuggingFace Hub.

Which programming languages does Onde Inference support?

Onde Inference ships first-class SDKs for Swift (iOS, macOS, tvOS, visionOS, watchOS), Rust (macOS, Linux, Windows), Flutter/Dart (iOS, Android, macOS), and React Native (iOS, Android). All four share the same underlying GGUF runtime and OndeChatEngine API surface.

Onde Cloud is an OpenAI-compatible inference API at cloud.ondeinference.com. It runs the same GGUF models as the on-device SDK. When a device hits its memory or compute limit, apps can route requests to Onde Cloud with a single line of code change. Auth uses a Bearer token scoped per app.

Private AI inference

Instant local inference.

A native on-device runtime built on Apple Silicon. Your data never leaves the device — compliant by architecture.

Start building Talk to an engineer

main.rs

use onde::inference::{ChatEngine, GgufModelConfig};
 
let engine = ChatEngine::new();
engine.load_gguf_model(
    GgufModelConfig::platform_default(),
    Some("You are a helpful assistant.".into()),
    None,
)
    .await?;
 
let result = engine.send_message("Hello!").await?;
println!("{}", result.text);
// completed in 85ms — 100% on device

In production across

App Store
Hugging Face
crates.io
npm
pub.dev
Homebrew

01 / Edge Compute

Zero latency.
Zero cost margin.

Compiled natively in Rust, Swift, or Flutter. Runs directly on Apple Silicon unified memory. 85 ms first-token latency, absolute privacy, and zero server overhead for every local workload.

85msFirst-token latency
$0Server cost on-device
100%Data stays on device

02 / Cloud Fallback

Seamless fallback.
Enterprise state.

When the local model hits its limit, Onde bursts to high-performance cloud compute. Heavy-parameter routing, global state sync, and ironclad privacy compliance — transparent to your users.

DeviceApple Silicon · on-device

Token throughputState syncAES-256

Onde Cloudcloud.ondeinference.com

OpenAI-compatibleDrop-in endpoint for any client already using the OpenAI API.

App-scoped authBearer credentials are scoped per app. No shared secrets.

Global state syncConversation context follows the user across device and cloud.

Security · Compliance

Compliant by
architecture.

Most inference vendors send your users' data to a shared GPU fleet, then ask you to trust the paperwork. Onde runs the model in-process on the device. There is no prompt to intercept, no transcript to subpoena, no third party in the data path.

No data egressInference stays on-devicePrompts, tokens, and results never touch the network. The only call is a one-time model download.

GDPR · HIPAARegulation-ready by defaultNo PII or PHI leaves the user's hardware, so data-residency and processing-agreement burden drops to near zero.

AES-256Encrypted cloud burstWhen a workload bursts to Onde Cloud, transport is encrypted and auth is scoped per app — no shared secrets.

No trainingYour data is never trained onCustomer prompts and completions are never used to train or tune any model, on-device or in the cloud.

Solutions

Built for teams that
can't leak data.

Healthcare

PHI never leaves the device

Run clinical assistants, scribing, and triage on the clinician's own iPad or Mac. No BAA gymnastics for the inference path.

Financial services

Zero data-egress AI

Summarize, classify, and draft against sensitive records without a single token crossing your network boundary.

Consumer apps

Private & offline by default

Ship assistant features that work on a plane, cost nothing per call, and keep user data on the user's phone.

Regulated & public sector

Sovereign by design

Data residency is wherever the device is. Pair on-device defaults with an encrypted cloud burst only when you choose.

SwiftApple platforms · SPM RustNative · crates.io React NativeMobile · npm FlutterCross-platform · pub.dev

Powering Splitfire AB apps in production on the Apple App Store.

Enterprise

Ship it with a team
behind you.

Volume licensing, custom and fine-tuned models, dedicated cloud capacity, security review support, and a direct line to the engineers who build the runtime.

Talk to an engineer See pricing

Dedicated supportSlack-connect channel and named engineering contact.
Custom modelsBring or fine-tune GGUF models for your domain.
Reserved capacityProvisioned cloud throughput with predictable cost.
Security reviewArchitecture docs and questionnaire support for procurement.

The world's intelligence.
On your terms.

Start building Talk to an engineer

Instant local inference.

Zero latency.Zero cost margin.

Seamless fallback.Enterprise state.

Compliant byarchitecture.

Built for teams thatcan't leak data.