Cactus

Overview

On-device and Hybrid cross-platform AI framework

What's New in v1.7

  • Cactus Hybrid Cloud: Automatic cloud handoff based on model confidence
  • Cactus Hybrid Transcription: Realtime Speech-to-Text with NPU acceleration and cloud correction
  • Voice Activity Detection: Silero VAD model for detecting speech in audio streams
  • Multi-Precision Downloads: Support for multiple precision options when downloading models from HuggingFace
  • Homebrew Installation: Install Cactus CLI with a single brew install cactus-compute/cactus/cactus command on macOS

Cactus is a hybrid inference engine for smartphones and edge devices. It delivers industry-leading on-device performance and automatically optimizes your AI workloads by routing requests between:

  • On-device: Smaller models running on the Cactus engine with NPU acceleration
  • Cloud: Frontier models for state-of-the-art performance

Cactus Hybrid measures a model's confidence in its response in real time and routes requests accordingly.
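The routing idea can be sketched in a few lines. Cactus's actual confidence metric is not documented here, so this assumes a simple proxy: the mean token log-probability of the local model's draft response. The names (`route`, `CONFIDENCE_THRESHOLD`) are illustrative, not the Cactus API.

```python
# Hedged sketch of confidence-based routing; the threshold and the
# log-probability proxy are assumptions, not Cactus internals.

CONFIDENCE_THRESHOLD = -1.0  # tune per model and task


def mean_logprob(token_logprobs):
    """Average log-probability across the generated tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def route(token_logprobs):
    """Return 'on-device' if the local model looks confident enough,
    otherwise 'cloud' for a frontier-model retry."""
    if mean_logprob(token_logprobs) >= CONFIDENCE_THRESHOLD:
        return "on-device"
    return "cloud"
```

A high mean log-probability (close to 0) keeps the request local; a low one hands it off.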

Architecture

Cactus consists of three layers that work together to deliver efficient on-device AI:

Cactus Engine

Energy-efficient inference engine

OpenAI-compatible APIs for C/C++, Swift, Kotlin, Flutter. Supports tool calling, auto RAG, NPU acceleration, INT4 quantization, and hybrid cloud handoff for complex tasks.
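"OpenAI-compatible" implies the familiar chat-completions request shape. The model id below is a placeholder, not a documented Cactus value; this only shows the payload structure such an API accepts.

```python
import json

# Hedged sketch of an OpenAI-style chat-completions payload.
# "lfm2-1.2b" is a placeholder model id, not a confirmed Cactus name.
request = {
    "model": "lfm2-1.2b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this note."},
    ],
    "temperature": 0.7,
}
payload = json.dumps(request)  # body you would POST to the endpoint
```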

Cactus Graph

Zero-copy computation graph

PyTorch-like API for implementing custom models. Highly optimized for RAM efficiency and lossless weight quantization.

Cactus Kernels

Low-level ARM SIMD kernels

Optimized for Apple, Snapdragon, Google, Exynos, and MediaTek processors. Custom attention kernels with KV-Cache quantization and chunked prefill.
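To make "KV-Cache quantization" concrete, here is the arithmetic it rests on, sketched in plain Python. The real kernels are ARM SIMD and operate on tensors; this only illustrates symmetric INT8 quantization and the bounded error it introduces.

```python
# Hedged sketch: symmetric INT8 quantize/dequantize as applied to a
# KV-cache slice. Not Cactus's kernel code, just the underlying math.


def quantize_int8(xs):
    """Map floats to int8 codes with a single per-tensor scale."""
    scale = max(abs(v) for v in xs) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats; error is at most scale / 2."""
    return [v * scale for v in q]


kv = [0.9, -1.27, 0.31, 0.05]  # toy K/V values
q, s = quantize_int8(kv)
kv_hat = dequantize_int8(q, s)
```

Storing `q` instead of `kv` cuts the cache from 32 to 8 bits per value, which is where the RAM savings in the tables below come from.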

Performance Benchmarks

All figures below are for INT8-quantized models:

Flagship Models

| Device | LFM2.5-1.2B (1k-Prefill/100-Decode) | LFM2.5-VL-1.6B (256px-Latency & Decode) | Whisper-Small (30s-audio-Latency & Decode) |
|---|---|---|---|
| Mac M4 Pro | 582/77 tps (76MB RAM) | 0.2s & 76tps (87MB RAM) | 0.1s & 119tps (73MB RAM) |
| iPad/Mac M4 | 379/46 tps (30MB RAM) | 0.2s & 46tps (53MB RAM) | 0.2s & 100tps (122MB RAM) |
| iPad/Mac M2 | 315/42 tps (181MB RAM) | 0.3s & 42tps (426MB RAM) | 0.3s & 86tps (160MB RAM) |
| iPhone 17 Pro | 300/33 tps (108MB RAM) | 0.3s & 33tps (156MB RAM) | 0.3s & 114tps (177MB RAM) |
| Galaxy S25 Ultra | 226/36 tps (1.2GB RAM) | 2.6s & 33tps (2GB RAM) | 2.3s & 90tps (363MB RAM) |

Mid-range Models

| Device | LFM2-350m (1k-Prefill/100-Decode) | LFM2-VL-450m (256px-Latency & Decode) | Moonshine-Base (30s-audio-Latency & Decode) |
|---|---|---|---|
| iPad/Mac M2 | 998/101 tps (334MB RAM) | 0.2s & 109tps (146MB RAM) | 0.3s & 395tps (201MB RAM) |
| Pixel 6a | 218/44 tps (395MB RAM) | 2.5s & 36tps (631MB RAM) | 1.5s & 189tps (111MB RAM) |
| CMF Phone 2 Pro | 146/21 tps (394MB RAM) | 2.4s & 22tps (632MB RAM) | 1.9s & 119tps (112MB RAM) |

How Hybrid Routing Works

Cactus eliminates the choice between expensive cloud and limited local compute.

  • Smart Routing: Cactus dynamically routes requests to the on-device NPU/CPU for simple tasks (like clear audio transcription or standard LLM queries) and scales up to cloud APIs for complex or noisy data.
  • Cloud Fallback: Configure your Cactus API key with cactus auth and choose your fallback model. If the local model cannot handle the task's complexity or the request exceeds its context window, Cactus fails over automatically.
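The fallback path described above reduces to try-local-then-cloud. This sketch makes that control flow explicit; the context-window limit, function names, and exception type are all assumed for illustration and are not Cactus's real internals.

```python
# Hedged sketch of automatic failover. LOCAL_CONTEXT_WINDOW and the
# stub run_local/run_cloud functions are illustrative assumptions.

LOCAL_CONTEXT_WINDOW = 4096  # assumed on-device token limit


class LocalModelError(Exception):
    """Raised when the on-device model cannot serve a request."""


def run_local(prompt_tokens):
    if len(prompt_tokens) > LOCAL_CONTEXT_WINDOW:
        raise LocalModelError("prompt exceeds local context window")
    return "local-response"


def run_cloud(prompt_tokens):
    return "cloud-response"


def generate(prompt_tokens):
    """Try on-device first; fail over to the cloud model on error."""
    try:
        return run_local(prompt_tokens)
    except LocalModelError:
        return run_cloud(prompt_tokens)
```

The caller sees one `generate` call either way; where it ran is an implementation detail.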

FAQ

Is Cactus free?

Cactus will always have a free tier. Hybrid inference, custom models, and additional hardware acceleration are paid features.

What model format does Cactus use?

With the v1 release, Cactus moved from GGUF to a proprietary .cact format, optimized for battery-efficient inference and minimal RAM usage via zero-copy memory mapping.
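Zero-copy memory mapping is a standard OS technique: weights stay in the file and pages are faulted in lazily on access, so resident RAM tracks what the model actually touches. The sketch below demonstrates the mechanism with Python's mmap; the file layout is invented and has nothing to do with the actual .cact format.

```python
import mmap
import os
import struct
import tempfile

# Hedged illustration of zero-copy weight access via mmap.
# The "3 float32 values" layout here is made up for the demo.


def write_weights(path, values):
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))


def map_weights(path):
    """Map the file read-only; the OS loads pages only when read."""
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(path, [0.5, 1.5, -2.0])
m = map_weights(path)
w0 = struct.unpack_from("f", m, 0)[0]  # read one value, no full-file copy
```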

Which models are supported?

You can find our list of supported models here. You can submit a request for model support or contribute by porting a model yourself!

Get Started

Community