Performance

The Latency War

We gave generative AI a free pass on speed for two years because it felt like magic. That honeymoon is over. People just want their apps to be fast again.

I keep coming back to a basic rule of building software that the AI industry seems to have completely forgotten: if tapping a button takes more than 100 milliseconds to produce a visible response, the interface feels broken.

For the last couple of years, we've given AI products a massive free pass on performance. The first time a computer writes a poem or debugs a Python script for you, you don't really care that it took four seconds. You're just staring at the blinking cursor, watching the tokens stream in, thinking about how the world is changing.

But I think that grace period is basically over.

AI is moving from a parlor trick to a utility. And the thing about utilities is that we expect them to respond instantly. We're entering what I'd call the latency war.

The illusion of speed

When you build an AI feature by wiring up a cloud API, you're building on top of a chaotic system.

A user taps a button. The phone wakes up the cellular radio, does a DNS lookup, negotiates TLS, and fires a JSON payload into the void. It hits a load balancer somewhere in Virginia, sits in a queue, gets batched through a massive GPU cluster, and then trickles back to the phone token by token.

You can hide this with clever UI. You can build nice skeleton loaders. You can stream the text so it looks like it's typing. But you can't cheat physics. It's still slow.
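If you've never wired one of these up, the streaming trick itself is only a few lines. Here's a minimal sketch, assuming a placeholder endpoint that streams plain-text chunks line by line (real APIs usually send server-sent events you'd still have to parse):

```swift
import Foundation

// Rough sketch of the "stream it so it feels fast" approach.
// The endpoint and response format are placeholders.
func streamCompletion(prompt: String, onToken: (String) -> Void) async throws {
    var request = URLRequest(url: URL(string: "https://api.example.com/v1/complete")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(["prompt": prompt])

    // Time to first byte here includes radio wake-up, DNS, TLS,
    // the load balancer, and the server-side queue.
    let (bytes, _) = try await URLSession.shared.bytes(for: request)

    // Paint each chunk as it arrives so the UI looks like it's typing.
    for try await line in bytes.lines {
        onToken(line)
    }
}
```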

It feels heavy. It feels like you're talking to someone who takes a deep breath before answering every single question.

The 500ms tax

Latency isn't just an engineering metric. It's a psychological tax.

Think about the stuff that actually makes your phone feel good to use. Autocorrect. Pull-to-refresh. Swiping to go back. These interactions are invisible because they happen synchronously with your thumb. If your keyboard took 500 milliseconds to register a keystroke, you'd throw the phone across the room.

Yet we're trying to build the next generation of software on top of infrastructure that bakes in exactly those 500-millisecond delays.

If AI is supposed to be this ambient, invisible layer that augments how we think, it has to operate at the speed of thought. The second I notice the loading state, the feature stops being an extension of my brain and becomes a clunky tool I have to manage.

Local compute

You can't fix this in the cloud. You have to move the compute to the device.

This used to be impossible, but look at the hardware Apple and Qualcomm are shipping right now. We have dedicated neural accelerators and massive unified memory pools sitting in our pockets.

When you run a quantized 1.5B-parameter model natively through Metal, the whole paradigm breaks open. There's no network handshake. There's no waiting in line. Time-to-first-token is measured in milliseconds, not seconds. The generation happens in the same memory space as your UI rendering loop.

It just feels native.
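Here's roughly what that loop looks like. The `LocalLanguageModel` protocol below is a stand-in for whatever on-device runtime you pick, not any real API; the point is that generation is just an in-process call you can time and render directly:

```swift
import Foundation

// Stand-in for an on-device runtime (Core ML, MLX, llama.cpp, etc.).
// The protocol is illustrative; the point is there's no network in the loop.
protocol LocalLanguageModel {
    /// Streams tokens as they're generated, entirely in-process.
    func generate(prompt: String) -> AsyncStream<String>
}

func summarize(_ text: String, with model: LocalLanguageModel) async -> String {
    let start = Date()
    var sawFirstToken = false
    var output = ""

    for await token in model.generate(prompt: "Summarize in one sentence: \(text)") {
        if !sawFirstToken {
            sawFirstToken = true
            // No DNS, no TLS, no queue: time-to-first-token is just
            // how long the local weights take to produce a token.
            let ms = Int(Date().timeIntervalSince(start) * 1000)
            print("first token after \(ms) ms")
        }
        output += token // appended straight into state the UI can render
    }
    return output
}
```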

I'm not saying you don't need cloud models. If you're doing complex multi-step reasoning or writing a novel, sure, call a frontier model. But for 90% of what we actually do on our phones—summarizing a text, fixing a typo, pulling an address out of an email—you don't need a trillion parameters. You need something small, and you need it right now.

We built the Onde SDK because we wanted to stop waiting on network requests. Getting local inference working shouldn't require managing GPU backends or compiling C++. It should just be a Swift package you drop into Xcode.
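To be clear, the manifest below uses placeholder coordinates — the URL and product name aren't the real ones — but the shape of the integration is the whole pitch. In an app target you'd just use File ▸ Add Package Dependencies in Xcode; in a package manifest it's one entry:

```swift
// swift-tools-version:5.9
// Package.swift — placeholder coordinates, for illustration only.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v17)],
    dependencies: [
        // Hypothetical URL; not the actual repository.
        .package(url: "https://example.com/onde-sdk.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                // Product name assumed for the sketch.
                .product(name: "Onde", package: "onde-sdk")
            ]
        )
    ]
)
```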

The novelty of AI is wearing off. Users just want their apps to be fast again. Stop making them wait.