The Death of the Cloud
We've spent the last decade treating our phones like dumb terminals for AWS. But the hardware caught up. The cloud was just a stopgap.
I've spent most of my career waiting on network requests.
For the last fifteen years, the prevailing wisdom in software was that real compute happens in a windowless data center in Virginia. We called it "the cloud." We treated our phones and laptops like expensive, glass-covered dumb terminals.
We pushed everything over the network because local hardware kind of sucked. We traded immediacy for brute force, and we learned to live with loading spinners, dropped connections, and AWS bills that go up every month.
It was a compromise. But I think that compromise is basically over.
Generative AI is what finally broke the model. Sending every keystroke, voice command, and half-baked thought to a remote server isn't just expensive—it feels terrible to use.
The physics problem
You can't negotiate with the speed of light.
When your app pings an OpenAI endpoint, it serializes a payload, does a TLS handshake, routes through a chaotic mess of cell towers and fiber, sits in a queue, generates a response token by token, and ships it all back.
Best case scenario, you're waiting a few hundred milliseconds. Often, it's a full second or two.
In UI terms, a second is an eternity. It completely ruins the illusion. Talking to an AI with a 500ms delay feels less like interacting with a smart assistant and more like arguing with someone over a bad Zoom connection.
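The round trip above can be sketched as a latency budget. Every number here is an illustrative assumption, not a measurement, but adding up the components makes the point: most of the delay is overhead that local inference simply never pays.

```rust
// Hypothetical per-request budget for a cloud inference call, in milliseconds.
// All figures are illustrative assumptions, not benchmarks.
struct LatencyBudget {
    serialization: f64,  // building and encoding the payload
    tls_handshake: f64,  // cold connection setup
    network_rtt: f64,    // cell towers, fiber, and back
    queue: f64,          // waiting for a server slot
    generation: f64,     // token-by-token decode
}

impl LatencyBudget {
    fn total(&self) -> f64 {
        self.serialization + self.tls_handshake + self.network_rtt + self.queue + self.generation
    }
}

fn main() {
    let cloud = LatencyBudget {
        serialization: 5.0,
        tls_handshake: 80.0,
        network_rtt: 60.0,
        queue: 100.0,
        generation: 400.0,
    };
    // Local inference skips everything except generation itself.
    let local = LatencyBudget {
        serialization: 0.0,
        tls_handshake: 0.0,
        network_rtt: 0.0,
        queue: 0.0,
        generation: 250.0,
    };
    println!("cloud: {} ms, local: {} ms", cloud.total(), local.total());
}
```

Even with generous numbers, the cloud path spends more time on transport and queuing than on the actual model.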
The hardware caught up
Look at the device you're reading this on right now.
If it's an iPhone 15 or an M-series Mac, you're sitting on a pool of unified memory with bandwidth that would have rivaled a supercomputer not that long ago. Apple didn't just make the battery last longer; they quietly shipped server-class compute to the edge.
When you take that hardware and run a heavily quantized model on it—like a 1.5B or 3B parameter model running natively through Metal—the math changes.
The model is already sitting in memory. You tap a button, and inference starts. No network handshake. No waiting in line. It just runs.
It's also free. The user already bought the phone. You aren't paying fractions of a cent per token to a cloud provider every time someone asks a question.
And it works when you're on a subway, or in airplane mode, or when your wifi router decides to restart.
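The cost argument reduces to simple arithmetic. Here's a minimal sketch, with made-up prices and volumes, of how a per-token cloud bill scales with usage while the on-device equivalent has zero marginal cost:

```rust
// Illustrative cost math: cloud token billing vs. on-device inference.
// The prices and volumes below are hypothetical assumptions for the sketch.
fn monthly_cloud_cost(
    requests_per_day: f64,
    tokens_per_request: f64,
    dollars_per_million_tokens: f64,
) -> f64 {
    requests_per_day * 30.0 * tokens_per_request * dollars_per_million_tokens / 1_000_000.0
}

fn main() {
    // 1M requests/day at 500 tokens each, at a hypothetical $0.50 per 1M tokens.
    let bill = monthly_cloud_cost(1_000_000.0, 500.0, 0.50);
    println!("cloud bill: ${}/month; on-device marginal cost: $0", bill);
}
```

The bill grows linearly with every user and every request; the local model's marginal cost stays flat at zero, because the user already paid for the hardware.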
The latency war
We gave AI products a pass for the last two years because the tech was new. It was a neat parlor trick to watch a computer summarize a PDF, so we didn't mind waiting three seconds for it to happen.
I think the novelty phase is over.
People expect software to be fast. If you're building an AI feature today, your real competition isn't whether your model is 2% smarter on a benchmark. Your competition is the user's patience.
You can't build fluid, subconscious interactions over a REST API. You just can't.
The actual pivot
I'm not saying data centers are going away. If you need a massive, frontier model to write a legal brief or do complex multi-step reasoning, you're still going to send that to the cloud.
But what about the other 90% of things we use AI for? Summarizing a chat thread, extracting a date from an email, formatting a list, fixing grammar?
That stuff doesn't need a trillion parameters. A local 1.5B model handles it easily.
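That split can be expressed as a routing decision. Here's a hedged sketch of the idea — the task names and the local/cloud boundary are illustrative, not a real API:

```rust
// A sketch of the 90/10 split: simple tasks stay on device,
// heavy reasoning escalates to the cloud. Task names are illustrative.
#[derive(Debug, PartialEq)]
enum Target {
    Local,
    Cloud,
}

enum Task {
    SummarizeThread,
    ExtractDate,
    FixGrammar,
    FormatList,
    LegalBrief,
    MultiStepReasoning,
}

fn route(task: &Task) -> Target {
    match task {
        // A quantized 1.5B-3B model handles these comfortably.
        Task::SummarizeThread | Task::ExtractDate | Task::FixGrammar | Task::FormatList => {
            Target::Local
        }
        // Frontier-model territory: send it over the network.
        Task::LegalBrief | Task::MultiStepReasoning => Target::Cloud,
    }
}

fn main() {
    assert_eq!(route(&Task::ExtractDate), Target::Local);
    assert_eq!(route(&Task::LegalBrief), Target::Cloud);
}
```

The interesting design question is where to draw the line; the point is that once you have a capable local model, most of the match arms land on `Local`.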
We built the Onde SDK because we got tired of writing the same C++ and Rust boilerplate to get models running locally on iOS and Android. We wanted a bridge between the raw speed of mistral.rs and the Swift code we actually write in our apps.
The cloud was a great hack while we waited for our phones to get fast. They're fast now. Start using them.