Gemma 4 on Cactus: The first model you can talk to, show things, and trust to know when it needs help

Gemma 4 is the first on-device model that genuinely works across text, vision, and audio in a single architecture, and does it well enough that you stop thinking about the model and start thinking about what to build with it.

We shipped day-one support in Cactus. Here's what that means in practice and why we think this changes the trajectory of on-device AI.

What's actually new here

Most "multimodal" models bolt on vision or audio as an afterthought. Gemma 4 is different. The vision encoder, audio encoder, and language model were trained together from the start. The audio conformer isn't a Whisper-style bolt-on. It's a 300M-parameter encoder that feeds directly into the transformer's residual stream. The same goes for vision. The model doesn't "transcribe then think." It reasons over the raw modality.

This matters because latency compounds. A pipeline that transcribes audio, then feeds text to an LLM, then generates a response has three serialized steps. Gemma 4 collapses that into one forward pass. On M5-class Apple hardware, that means 0.3 seconds from the end of a 30-second voice clip to the first token of a response. Not 0.3 seconds for the transcription step. 0.3 seconds total.

Performance

Cactus targets ARM across platforms: Apple Silicon Macs, iPhones, iPads, Vision Pro, and Android devices with ARM64 chipsets (Snapdragon, Dimensity, Tensor). On Apple hardware we additionally leverage the Neural Engine for vision and audio encoders, while on Android and Linux the runtime uses NEON, i8mm, and dot-product intrinsics.

Metric (M5 Mac/iPad/Vision Pro)	E2B
4096-token prefill	660 tok/s
1024-token decode	40 tok/s
30s audio end-to-end	0.3s
Image encode (ANE)	0.7s

40 tok/s decode is faster than most people read. You're not waiting for the model. The model is waiting for you.
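To make those throughput figures concrete, here is a small back-of-the-envelope sketch that converts them into wall-clock time. The throughput numbers come from the table above; the words-per-token ratio and the reading speed are rough rules of thumb assumed for illustration, not measurements.

```python
# Throughput figures from the E2B table above.
PREFILL_TOK_PER_S = 660
DECODE_TOK_PER_S = 40

# Assumptions (not from the benchmarks): rough rules of thumb for English text.
WORDS_PER_TOKEN = 0.75
READING_WORDS_PER_MIN = 250


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to process `tokens` at a steady throughput."""
    return tokens / tok_per_s


def decode_words_per_min(tok_per_s: float) -> float:
    """Approximate words per minute produced at a given decode speed."""
    return tok_per_s * WORDS_PER_TOKEN * 60


prefill_s = seconds_for(4096, PREFILL_TOK_PER_S)   # time to ingest a 4096-token prompt
decode_s = seconds_for(1024, DECODE_TOK_PER_S)     # time to emit a 1024-token reply
headroom = decode_words_per_min(DECODE_TOK_PER_S) / READING_WORDS_PER_MIN

print(f"prefill 4096 tokens: {prefill_s:.1f} s")
print(f"decode 1024 tokens:  {decode_s:.1f} s")
print(f"decode speed vs. assumed reading speed: {headroom:.1f}x")
```

Swap in your own reading-speed or tokenization assumptions to see how much headroom remains for your use case.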

The benchmarks are hard to believe

We normally don't lead with benchmarks because they rarely reflect real-world use. But these numbers demand attention.

LLM

Benchmark	E4B	E2B	Gemma 3 27B (no think)
MMLU Pro	69.4%	60.0%	67.6%
AIME 2026 (no tools)	42.5%	37.5%	20.8%
LiveCodeBench v6	52.0%	44.0%	29.1%
Codeforces ELO	940	633	110
GPQA Diamond	58.6%	43.4%	42.4%
Tau2 (avg over 3)	42.2%	24.5%	16.2%
BigBench Extra Hard	33.1%	21.9%	19.3%
MMMLU	76.6%	67.4%	70.7%
MRCR v2 8-needle 128k (avg)	25.4%	19.1%	13.5%

E4B, a 4.5B-effective-parameter model running on your phone, outperforms Gemma 3 27B on every benchmark in this table. E2B, the smaller variant at 2.3B effective parameters, beats it on seven of the nine, trailing only on MMLU Pro and MMMLU. The AIME and LiveCodeBench numbers are particularly striking: these are hard reasoning tasks where you'd expect scale to dominate.
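If you want to check the comparison yourself, the following sketch tallies the rows of the LLM table. The scores are copied from the table above; the helper name `wins` is ours and purely illustrative.

```python
# Scores copied from the LLM table above: (E4B, E2B, Gemma 3 27B).
# Higher is better for every row in this table.
SCORES = {
    "MMLU Pro": (69.4, 60.0, 67.6),
    "AIME 2026 (no tools)": (42.5, 37.5, 20.8),
    "LiveCodeBench v6": (52.0, 44.0, 29.1),
    "Codeforces ELO": (940, 633, 110),
    "GPQA Diamond": (58.6, 43.4, 42.4),
    "Tau2 (avg over 3)": (42.2, 24.5, 16.2),
    "BigBench Extra Hard": (33.1, 21.9, 19.3),
    "MMMLU": (76.6, 67.4, 70.7),
    "MRCR v2 8-needle 128k (avg)": (25.4, 19.1, 13.5),
}


def wins(column: int) -> list[str]:
    """Benchmarks where the given column matches or beats Gemma 3 27B."""
    return [name for name, row in SCORES.items() if row[column] >= row[2]]


for label, column in (("E4B", 0), ("E2B", 1)):
    won = wins(column)
    lost = [name for name in SCORES if name not in won]
    print(f"{label}: {len(won)}/{len(SCORES)} rows at or above Gemma 3 27B; behind on {lost}")
```

Running it prints the per-model tally and lists the rows where the 27B model stays ahead.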

Vision

Benchmark	E4B	E2B	Gemma 3 27B (no think)
MMMU Pro	52.6%	44.2%	49.7%
OmniDocBench 1.5 (edit dist, lower=better)	0.181	0.290	0.365
MATH-Vision	59.5%	52.4%	46.0%
MedXPertQA MM	28.7%	23.5%	-

E4B beats Gemma 3 27B on every vision benchmark where we have a score for both. On document understanding (OmniDocBench), its edit distance is half that of the 27B model (0.181 vs. 0.365). This isn't a toy. It's genuinely useful for reading receipts, parsing handwritten notes, and understanding diagrams.

Audio

Benchmark	E4B	E2B
CoVoST	35.54	33.47
FLEURS (lower=better)	0.08	0.09

Model Architecture

Property	E2B	E4B
Total Parameters	2.3B effective (5.1B w/ embeddings)	4.5B effective (8B w/ embeddings)
Layers	35	42
Sliding Window	512 tokens	512 tokens
Context Length	128K tokens	128K tokens
Vocabulary Size	262K	262K
Supported Modalities	Text, Image, Audio	Text, Image, Audio
Vision Encoder	~150M params	~150M params
Audio Encoder	~300M params	~300M params

The architecture uses per-layer embeddings with AltUp, a technique that keeps most of the vocabulary knowledge in a shared embedding table while giving each layer a small specialized projection. This is how a 262K-token vocabulary fits into a model that runs on a phone. Sliding-window attention at 512 tokens, with global attention every 5 layers, gives you 128K context without paying the full-attention memory cost at every layer.
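To see why mixing sliding-window and global layers helps at long context, here is a rough sketch that counts how many key/value positions each layer has to keep around at a full 128K context. It assumes that exactly one layer in five is global and that "128K" means 131,072 tokens; the actual layer layout and cache implementation may differ, so treat the output as an order-of-magnitude illustration.

```python
def cached_positions(num_layers: int, context: int, window: int = 512,
                     global_every: int = 5) -> int:
    """Total key/value positions held across all layers at a given context.

    Assumption: every `global_every`-th layer attends globally and the rest
    use a sliding window. The exact layer layout is not specified here.
    """
    total = 0
    for layer in range(num_layers):
        is_global = (layer + 1) % global_every == 0
        total += context if is_global else min(window, context)
    return total


CONTEXT = 128 * 1024  # "128K" read as 131,072 tokens (assumption)

for name, layers in (("E2B", 35), ("E4B", 42)):
    hybrid = cached_positions(layers, CONTEXT)
    all_global = layers * CONTEXT
    print(f"{name}: {hybrid:,} vs {all_global:,} cached positions "
          f"({hybrid / all_global:.1%} of an all-global design)")
```

Under these assumptions the sliding-window layers contribute almost nothing at long context; nearly all of the cache belongs to the handful of global layers.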

What this unlocks

Voice control that actually works

We've all used voice assistants that feel like speaking into a form field. The voice gets transcribed, the text gets processed, and you get a response that could just as easily have been typed. Gemma 4 doesn't work that way. Because it reasons directly over the audio signal, it picks up on tone, hesitation, emphasis. It knows the difference between "delete that" spoken confidently and "delete that?" spoken as a question.

On headsets running Cactus (Vision Pro, Quest, or any AR/VR platform with ARM compute), this means spatial voice control that responds to how you speak, not just what you say. Ask it to "move that over there" while looking at a 3D object and gesturing, and the model has the audio understanding to disambiguate "that" from context. The 0.3s latency means the interaction feels physical. You speak, things happen.

Voice agents that run locally

If you're building a voice agent (a customer service bot, an in-car assistant, a medical triage system), you currently have two bad options. You can run everything in the cloud and deal with latency, cost, and privacy issues. Or you can cobble together a local pipeline of ASR + LLM + TTS and deal with error propagation and integration pain.

Gemma 4 on Cactus gives you a third option. The model handles voice understanding natively. You pipe audio in, you get structured responses out. Tool calling works. The model can decide to call functions, hit APIs, or route to a cloud model when the task exceeds its capability. And because it runs locally, there's no per-request cost and no audio leaving the device.
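As an illustration of what handling "structured responses" can look like on the application side, here is a minimal dispatcher sketch. It is not the Cactus API: the JSON call shape (`{"tool": ..., "arguments": ...}`), the registry, and the `set_timer` tool are all hypothetical stand-ins for whatever format your SDK and prompt actually produce.

```python
import json
from typing import Any, Callable

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the model can request it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def set_timer(minutes: int) -> str:
    return f"timer set for {minutes} min"


def dispatch(model_output: str) -> Any:
    """Run the tool named in a JSON tool call, or return plain text unchanged.

    Assumed call shape: {"tool": "<name>", "arguments": {...}}.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary text reply, not a tool call
    if not isinstance(call, dict) or "tool" not in call:
        return model_output
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


print(dispatch('{"tool": "set_timer", "arguments": {"minutes": 5}}'))
print(dispatch("Sure, what would you like to know?"))
```

Rejecting unknown tool names explicitly, rather than ignoring them, makes it easier to spot when the model asks for a capability your app never exposed.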

For healthcare, legal, and finance applications where audio data is sensitive, this isn't a nice-to-have. It's a compliance requirement.

Hybrid inference: knowing when to ask for help

This is the part we're most excited about. Gemma 4 is good, but it's not omniscient. A model with roughly 2B effective parameters isn't going to write a production database migration or debug a complex distributed system. What it can do is recognize when a task is beyond its capability and route it to a frontier cloud model.

Cactus implements this as cloud handoff. The on-device model evaluates the complexity of the request, and if it determines it can't handle it confidently, it signals for handoff. The request goes to a cloud model (Claude, GPT-4, Gemini, your choice), the response comes back, and the user sees a seamless interaction. The on-device model handles the 80% of requests that are straightforward. The cloud handles the 20% that need heavy lifting.
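The control flow of a handoff like this can be sketched in a few lines. This is an illustration of the routing idea, not Cactus's implementation: `LocalResult`, the `needs_handoff` flag, and the stub models are hypothetical names standing in for whatever signal and clients your app uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LocalResult:
    text: str
    needs_handoff: bool  # hypothetical flag standing in for the handoff signal


def answer(prompt: str,
           local: Callable[[str], LocalResult],
           cloud: Callable[[str], str]) -> tuple[str, str]:
    """Return (reply, source), trying the on-device model first."""
    result = local(prompt)
    if result.needs_handoff:
        return cloud(prompt), "cloud"
    return result.text, "device"


# Stub models for demonstration only; a real app would call its SDKs here.
def fake_local(prompt: str) -> LocalResult:
    hard = "migration" in prompt.lower()
    return LocalResult(text="" if hard else f"local: {prompt}", needs_handoff=hard)


def fake_cloud(prompt: str) -> str:
    return f"cloud: {prompt}"


for prompt in ("what time is it", "write a database migration"):
    reply, source = answer(prompt, fake_local, fake_cloud)
    print(f"[{source}] {reply}")
```

Passing the two models in as callables keeps the routing logic testable without a device or a network connection, and returning the source alongside the reply lets the app log how often requests leave the device.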

This means you can build apps that feel like they have frontier-model intelligence while keeping costs at a fraction of full cloud inference. And the user's casual conversations, voice notes, and image queries never leave their device.

Building on any platform

Cactus runs Gemma 4 on macOS, iOS, Android, and Linux.

Try it

brew install cactus-compute/cactus/cactus
cactus run google/gemma-4-E2B-it

The weights download automatically. INT4 quantized, optimized for ARM. Both E2B and E4B are available.
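For a rough sense of download and memory size, you can estimate weight storage from the parameter counts in the architecture table. This sketch assumes every parameter (embeddings included) is stored at the stated bit width and ignores quantization scales and any tensors kept at higher precision, so real files will differ somewhat.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Total parameter counts (with embeddings) from the architecture table.
for name, params in (("E2B", 5.1), ("E4B", 8.0)):
    print(f"{name}: ~{weight_gib(params, 4):.1f} GiB at 4 bits, "
          f"~{weight_gib(params, 16):.1f} GiB at 16 bits")
```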

Build with it

Cactus supports React Native, Flutter, Swift, Kotlin, Python, Rust, and C++. Pick the SDK that fits your stack:

Platform	SDK	Install / Packaging
React Native	cactus-react-native	`npm install cactus-react-native`
Flutter	cactus-flutter	Dart FFI bindings
Swift (iOS/macOS)	cactus-apple	XCFramework with NPU support
Kotlin (Android)	cactus-android	JNI + Kotlin Multiplatform
Python	cactus-ai	`pip install cactus-ai`
Rust	cactus-sys	`cargo add cactus-sys`
C++	cactus.h	Single header, link `libcactus.a`

Full quickstart with code examples for every SDK: cactus.dev/docs/quickstart

Pre-quantized weights for Gemma 4 and 30+ other models: huggingface.co/cactus-compute

If you're building something with on-device multimodal AI (voice agents, VR interfaces, local-first apps), we want to hear about it. Open an issue on GitHub, or just start building.