Engineering
Fine-Tune a Language Model on Your Mac
Onde CLI now lets you LoRA fine-tune a model, merge the adapter, export to GGUF, and upload to HuggingFace — all from a terminal UI running on Apple silicon. No cloud GPUs. No Python. No notebooks.
There's a weird bottleneck in on-device AI right now.
Running a model locally is a solved problem. Onde has been doing that on iPhones and Macs for a while — download a GGUF file, load it with Metal, get tokens out. It works. It's fast. No server required.
But customizing a model? That still means renting cloud GPUs, setting up a Python environment, wrestling with CUDA drivers, running a Jupyter notebook, converting formats, and hoping everything lines up when you try to load the result back into your app.
It's a pipeline that assumes you have an NVIDIA GPU, a cloud budget, and infinite patience for dependency hell.
We wanted something different. We wanted the whole thing to happen on the same machine where the model runs — the Mac sitting on your desk.
What we built
The latest version of onde-cli has a full fine-tuning pipeline that runs entirely in the terminal. No Python. No cloud. No notebooks. Just a Rust binary, Apple silicon, and your data.
The pipeline has four steps, and each one runs locally:
1. Fine-tune. LoRA training on any Qwen 2.5 or Qwen 3 safetensors model. The training loop runs on Metal via candle, with Apple's Accelerate framework handling the CPU-side linear algebra, and the whole thing fits comfortably in unified memory. A 3-epoch run on Qwen 2.5 0.5B with a small dataset takes a few minutes on an M-series chip.
2. Merge. The LoRA adapter (~2–5 MB) gets merged back into the base model weights. This produces a full safetensors file with your fine-tuned knowledge baked in.
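The merge step is just arithmetic: the fine-tuned weight is W' = W + (alpha / r) · B × A. Here is a minimal stdlib-only sketch of that update on toy-sized matrices — it is not the onde-cli code, and the dimensions are illustrative, not real Qwen shapes:

```rust
// Toy sketch of a LoRA merge: W' = W + (alpha / r) * (B x A).
// w is the frozen base weight (d_out x d_in); A (r x d_in) and B (d_out x r)
// are the small trained matrices from the adapter file.
fn merge_lora(w: &mut [Vec<f32>], a: &[Vec<f32>], b: &[Vec<f32>], alpha: f32) {
    let r = a.len();
    let scale = alpha / r as f32;
    for (i, row) in w.iter_mut().enumerate() {
        for (j, wij) in row.iter_mut().enumerate() {
            // delta_ij = sum over the rank dimension of B[i][k] * A[k][j]
            let mut delta = 0.0f32;
            for k in 0..r {
                delta += b[i][k] * a[k][j];
            }
            *wij += scale * delta;
        }
    }
}

fn main() {
    let mut w = vec![vec![0.0f32; 4]; 4]; // pretend base weight
    let a = vec![vec![0.1f32; 4]; 2];     // rank-2 A
    let b = vec![vec![0.1f32; 2]; 4];     // rank-2 B
    merge_lora(&mut w, &a, &b, 16.0);
    println!("{:.3}", w[0][0]);
}
```

After the merge the adapter is no longer needed at inference time, which is why the output is a single full safetensors file.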
3. Export to GGUF. The merged model gets converted to GGUF format — the same format Onde uses for inference. You can choose Q8_0 quantization (good balance of quality and size) or F16 (full precision, larger file). The GGUF writer handles the full Qwen 2 architecture: all tensor name mappings, tokenizer vocabulary, merge rules, special tokens, and model metadata.
4. Upload to HuggingFace. The exported GGUF file gets streamed to a HuggingFace repository using the LFS protocol. One Enter keypress, and your fine-tuned model is on the Hub.
The result is a GGUF file that works with Onde's GgufModelBuilder, with llama.cpp, with any GGUF-compatible runtime. You fine-tune on your Mac, upload to HuggingFace, and your iOS app can download and run it — no format conversion, no compatibility headaches.
What it looks like
The entire flow happens inside a terminal UI. You navigate to a downloaded model, press f to open the fine-tune screen, and fill in your training data path:
Fine-Tune — Qwen/Qwen2.5-0.5B-Instruct
Model Directory
╭──────────────────────────────────────────────────╮
│ ~/.cache/huggingface/hub/models--Qwen--Qwen2.5.. │
╰──────────────────────────────────────────────────╯
Training Data (JSONL)
╭──────────────────────────────────────────────────╮
│ ~/.onde/finetune/train.jsonl │
╰──────────────────────────────────────────────────╯
LoRA Rank Epochs Learning Rate
╭────────╮ ╭────────╮ ╭────────────╮
│ 8 │ │ 3 │ │ 0.0001 │
╰────────╯ ╰────────╯ ╰────────────╯
Press Enter. Training starts. You see live progress — epoch, step, loss — with a progress bar. When it finishes, press m to merge, then g to export GGUF.
The model detail screen shows all your artifacts: LoRA adapters and exported GGUF files, sorted by date. Select a GGUF file and press Enter to open the upload screen. Edit the repo name, press Enter again, and the file streams to HuggingFace with a progress bar showing bytes uploaded.
The training data
Your dataset is a JSONL file. One JSON object per line, each with a text field containing the full conversation:
{"text": "User: What is Onde?\nAssistant: Onde is an on-device inference SDK for Apple silicon."}
{"text": "User: How do I load a model?\nAssistant: Call engine.loadDefaultModel() with a system prompt."}

That's it. No special formatting tools. No preprocessing scripts. The tokenizer handles the rest.
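If you are generating the file programmatically, each line is just a JSON object with a `text` key. A minimal sketch (hypothetical helper names; the hand-rolled `escape` covers only the characters typical prompts contain — a real tool would use a JSON library):

```rust
// Build one JSONL training line in the "User: ...\nAssistant: ..." shape
// shown above. escape() handles backslashes, quotes, and newlines only.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

fn to_jsonl_line(user: &str, assistant: &str) -> String {
    format!(
        "{{\"text\": \"User: {}\\nAssistant: {}\"}}",
        escape(user),
        escape(assistant)
    )
}

fn main() {
    let line = to_jsonl_line("What is Onde?", "An on-device inference SDK.");
    println!("{line}");
}
```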
Why LoRA, and why these defaults
LoRA fine-tuning doesn't modify the base model weights. It learns a small low-rank update — two tiny matrices per layer — that gets applied on top. For a rank-8 LoRA on Qwen 2.5 0.5B, the adapter file is about 2 MB. The base model is 1 GB. You're training 0.2% of the parameters.
This matters for on-device work because it means:
- Training fits in memory. The base weights are frozen and memory-mapped. Only the LoRA matrices and optimizer state need to be in active memory.
- It's fast. Fewer parameters means fewer gradient computations. A training run finishes in minutes, not hours.
- The base model stays intact. You can train multiple adapters for different tasks against the same base, and merge whichever one you need.
The defaults (rank 8, alpha 16, learning rate 1e-4, 3 epochs) are conservative and work well for instruction tuning on small datasets. If you need more capacity, increase the rank. If the model is overfitting, reduce epochs or lower the learning rate.
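You can sanity-check the "~2 MB adapter, 0.2% of parameters" numbers yourself. The sketch below assumes LoRA targets the four attention projections and uses the published Qwen2.5-0.5B shapes (hidden size 896, 24 layers, 2 KV heads of dim 64) — both are assumptions for illustration, not a dump of onde-cli internals:

```rust
// A rank-r LoRA on a d_out x d_in matrix adds A (r x d_in) and B (d_out x r).
fn lora_params(d_out: usize, d_in: usize, r: usize) -> usize {
    r * (d_in + d_out)
}

fn main() {
    let (hidden, layers, r) = (896, 24, 8);
    let kv_dim = 2 * 64; // 2 KV heads x head_dim 64 (grouped-query attention)
    let per_layer = lora_params(hidden, hidden, r)  // q_proj
        + lora_params(kv_dim, hidden, r)            // k_proj
        + lora_params(kv_dim, hidden, r)            // v_proj
        + lora_params(hidden, hidden, r);           // o_proj
    let total = per_layer * layers;
    println!(
        "{total} trainable params, ~{:.1} MB at f16",
        (total * 2) as f64 / 1e6
    );
}
```

Under those assumptions that comes out to roughly a million trainable parameters — a couple of megabytes at f16, in line with the adapter size quoted above.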
The GGUF export
This was the part that didn't exist in the Rust ecosystem. There are tools for reading GGUF files — llama.cpp, mistral.rs, candle — but writing them from Rust, with the correct metadata for a specific architecture, with proper tokenizer encoding, with quantization? That was a gap.
The GGUF writer in onde-cli produces files that match the format expected by mistral.rs and llama.cpp:
- GGUF v3 binary format with 32-byte aligned tensor data
- Full Qwen 2 architecture metadata (block count, head counts, RoPE frequency, layer norm epsilon)
- HuggingFace tokenizer vocabulary, merge rules, and special token IDs parsed directly from tokenizer.json
- Q8_0 quantization: 34 bytes per block of 32 values (f16 scale + 32 int8 quantized weights)
- Handles both Qwen 2.5 and Qwen 3 tokenizer formats (Qwen 3 encodes merge rules as arrays instead of strings — the kind of thing that takes an hour to debug and one line to fix)
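The Q8_0 math is simple enough to sketch in a few lines: each block of 32 weights shares one scale (the largest absolute value divided by 127), and each weight is rounded to a signed byte. This is a stdlib-only illustration of that scheme, not the writer's actual code — the real file stores the scale as f16, which is where the 34-byte block size comes from:

```rust
// Quantize one block of 32 f32 weights to Q8_0: a shared scale + 32 i8.
fn quantize_q8_0(block: &[f32; 32]) -> (f32, [i8; 32]) {
    let amax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax == 0.0 { 0.0 } else { amax / 127.0 };
    let inv = if scale == 0.0 { 0.0 } else { 1.0 / scale };
    let mut q = [0i8; 32];
    for (qi, &x) in q.iter_mut().zip(block.iter()) {
        // Round to the nearest representable int8 step.
        *qi = (x * inv).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

fn main() {
    let mut block = [0.0f32; 32];
    block[0] = 2.0;
    block[1] = -0.5;
    let (scale, q) = quantize_q8_0(&block);
    println!("scale={scale:.4} q0={} q1={}", q[0], q[1]);
}
```

Dequantization is just `scale * q[i]`, which is why a GGUF reader can reconstruct the weights with one multiply per value.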
The exported file is a real GGUF model. HuggingFace validates it, shows the architecture metadata, and lets you download it. Onde's SDK loads it with GgufModelBuilder like any other model.
Why Rust, not Python
The honest answer is that Onde is already Rust. The inference engine is Rust. The SDK is Rust with Swift bindings. The CLI is Rust. Adding fine-tuning in Python would mean shipping a Python runtime, managing pip dependencies, and maintaining a language boundary that doesn't need to exist.
But there's a more practical reason: candle is genuinely good for this. It has autograd, it has a Metal backend (plus an Accelerate backend for CPU linear algebra), it has safetensors I/O, and it has optimizers. The training loop is about 400 lines of straightforward Rust. No framework magic, no hidden abstractions — just tensors, gradients, and a loss function.
The one thing candle can't do is train on GGUF files directly. Quantized tensors are detached from the autograd graph by design — QTensor::dequantize() sets BackpropOp::none(), which means gradients can't flow through it. This is true across every Rust framework. You need the full-precision safetensors base model for training, which is why the pipeline starts with a safetensors download and ends with a GGUF export.
What's next
This is a v1. It works, it's useful, and it ships. But the list of things we want to improve is long:
- More architectures. The GGUF writer currently handles Qwen 2 only. LLaMA and Mistral are next.
- Batch size > 1. The current training loop processes one sequence at a time. Sequence packing would make better use of the GPU.
- Validation set. Right now there's no eval loop — you can overfit without knowing it. A held-out split with periodic loss reporting would fix that.
- QLoRA. Training on quantized weights would let you fine-tune larger models in the same memory budget. This needs custom autograd kernels that don't exist in candle yet.
- LR scheduling. Cosine decay with warmup is standard for a reason. The constant learning rate works but isn't optimal.
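The schedule on that wish list is plain math: linear warmup to the peak learning rate, then cosine decay to zero. A sketch with illustrative step counts:

```rust
// Cosine decay with linear warmup: ramp to `peak` over `warmup` steps,
// then decay along half a cosine until `total` steps.
fn lr_at(step: usize, warmup: usize, total: usize, peak: f64) -> f64 {
    if step < warmup {
        peak * step as f64 / warmup as f64
    } else {
        let progress = (step - warmup) as f64 / (total - warmup) as f64;
        peak * 0.5 * (1.0 + (std::f64::consts::PI * progress).cos())
    }
}

fn main() {
    let (warmup, total, peak) = (10, 100, 1e-4);
    for step in [0, 5, 10, 55, 100] {
        println!("step {step:3}: lr = {:.2e}", lr_at(step, warmup, total, peak));
    }
}
```

Halfway through the decay the rate sits at half the peak, and it reaches zero at the final step — the shape that makes late-training updates small enough not to undo early progress.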
Try it
Install the CLI:
brew tap ondeinference/tap
brew install onde

Download a base model (safetensors, not GGUF):
onde
# Sign in → Downloads → search "Qwen/Qwen2.5-0.5B-Instruct" → download

Prepare your training data at ~/.onde/finetune/train.jsonl, navigate to the model detail screen, and press f.
The whole pipeline — fine-tune, merge, export, upload — runs on your Mac. The model you get back runs on your users' iPhones. No cloud required at any step.
That's the point.