Overview
On-device and Hybrid cross-platform AI framework
What's New in v1.7
- Cactus Hybrid Cloud: Automatic cloud handoff based on model confidence
- Cactus Hybrid Transcription: Real-time speech-to-text with NPU acceleration and cloud correction
- Voice Activity Detection: Silero VAD model for detecting speech in audio streams
- Multi-Precision Downloads: Support for multiple precision options when downloading models from HuggingFace
- Homebrew Installation: Install the Cactus CLI on macOS with a single command: `brew install cactus-compute/cactus/cactus`
Cactus is a hybrid inference engine for smartphones and edge devices. It delivers industry-leading on-device performance and automatically optimizes your AI workloads by routing requests between:
- On-device: Smaller models running on the Cactus engine with NPU acceleration
- Cloud: Frontier models for state-of-the-art performance
Cactus Hybrid measures a model's "confidence" in its responses in real time and routes requests accordingly.
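Conceptually, confidence routing is a threshold check: run the small model first, hand off to the cloud when its confidence in its own output is too low. The sketch below is purely illustrative; none of these names (`route_request`, `CONFIDENCE_THRESHOLD`, the stand-in inference functions) are part of the Cactus API.

```python
# Illustrative sketch of confidence-based hybrid routing.
# Not the Cactus API -- just the idea: try on-device first,
# fall back to the cloud when local confidence is too low.

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff


def run_local(prompt: str) -> tuple[str, float]:
    """Stand-in for on-device inference: returns (response, confidence).

    A real engine might derive confidence from token log-probabilities;
    here we fake it with prompt length so the example is runnable.
    """
    if len(prompt) < 40:
        return "local answer", 0.95
    return "uncertain local answer", 0.45


def run_cloud(prompt: str) -> str:
    """Stand-in for a frontier-model cloud call."""
    return "cloud answer"


def route_request(prompt: str) -> tuple[str, str]:
    """Return (route, response): local result if confident, else cloud."""
    response, confidence = run_local(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return "on-device", response
    return "cloud", run_cloud(prompt)


print(route_request("Hi"))  # short prompt stays on-device
print(route_request("Summarize this long, noisy meeting transcript ..."))
```

In a real system, the confidence signal might come from token log-probabilities or an auxiliary classifier rather than prompt length; the routing decision itself stays this simple.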
Architecture
Cactus consists of three layers that work together to deliver efficient on-device AI:
Cactus Engine
Energy-efficient inference engine
OpenAI-compatible APIs for C/C++, Swift, Kotlin, Flutter. Supports tool calling, auto RAG, NPU acceleration, INT4 quantization, and hybrid cloud handoff for complex tasks.
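"OpenAI-compatible" here refers to the familiar chat-completion request shape. As a rough illustration of that format (this is the standard OpenAI API schema, not Cactus's SDK; the model id and tool definition are made up), a request with tool calling and streaming looks like:

```python
import json

# The standard OpenAI chat-completion request shape that
# OpenAI-compatible engines generally accept. Field names follow
# the OpenAI API; the model id and the tool are hypothetical.
request_body = {
    "model": "lfm2.5-1.2b",  # hypothetical local model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in Lagos?"},
    ],
    "tools": [  # tool calling, as mentioned above
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "stream": True,  # stream tokens as they decode
}

print(json.dumps(request_body, indent=2))
```

Keeping this shape means existing OpenAI client code can target the on-device engine with minimal changes.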
Cactus Graph
Zero-copy computation graph
PyTorch-like API for implementing custom models. Highly optimized for RAM efficiency and lossless weight quantization.
Cactus Kernels
Low-level ARM SIMD kernels
Optimized for Apple, Snapdragon, Google, Exynos, and MediaTek processors. Custom attention kernels with KV-Cache quantization and chunked prefill.
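The INT4/INT8 and KV-cache quantization mentioned above all build on the same basic idea: map floating-point values onto a small integer range plus a scale factor. A minimal symmetric INT8 sketch of the generic technique (not Cactus's kernel code) looks like:

```python
# Generic symmetric INT8 quantization sketch -- the technique,
# not Cactus's actual kernels.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats symmetrically onto [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from quantized values."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), max_err)
```

Storing 8-bit integers plus a scale cuts weight memory to roughly a quarter of FP32, at the cost of a bounded rounding error per value; production kernels apply this per-channel or per-block rather than per-tensor.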
Performance Benchmarks
Performance on INT8-quantized models (tps = tokens per second):
Flagship Models
| Device | LFM2.5-1.2B (1k-Prefill/100-Decode) | LFM2.5-VL-1.6B (256px-Latency & Decode) | Whisper-Small (30s-audio-Latency & Decode) |
|---|---|---|---|
| Mac M4 Pro | 582/77 tps (76MB RAM) | 0.2s & 76tps (87MB RAM) | 0.1s & 119tps (73MB RAM) |
| iPad/Mac M4 | 379/46 tps (30MB RAM) | 0.2s & 46tps (53MB RAM) | 0.2s & 100tps (122MB RAM) |
| iPad/Mac M2 | 315/42 tps (181MB RAM) | 0.3s & 42tps (426MB RAM) | 0.3s & 86tps (160MB RAM) |
| iPhone 17 Pro | 300/33 tps (108MB RAM) | 0.3s & 33tps (156MB RAM) | 0.3s & 114tps (177MB RAM) |
| Galaxy S25 Ultra | 226/36 tps (1.2GB RAM) | 2.6s & 33tps (2GB RAM) | 2.3s & 90tps (363MB RAM) |
Mid-range Models
| Device | LFM2-350m (1k-Prefill/100-Decode) | LFM2-VL-450m (256px-Latency & Decode) | Moonshine-Base (30s-audio-Latency & Decode) |
|---|---|---|---|
| iPad/Mac M2 | 998/101 tps (334MB RAM) | 0.2s & 109tps (146MB RAM) | 0.3s & 395tps (201MB RAM) |
| Pixel 6a | 218/44 tps (395MB RAM) | 2.5s & 36tps (631MB RAM) | 1.5s & 189tps (111MB RAM) |
| CMF Phone 2 Pro | 146/21 tps (394MB RAM) | 2.4s & 22tps (632MB RAM) | 1.9s & 119tps (112MB RAM) |
How Hybrid Routing Works
Cactus eliminates the trade-off between expensive cloud inference and limited local compute.
- Smart Routing: Cactus dynamically routes requests to the on-device NPU/CPU for simple tasks (like clear audio transcription or standard LLM queries) and scales up to cloud APIs for complex or noisy data.
- Cloud Fallback: Configure your Cactus API key with `cactus auth` and choose your fallback model. If the local model cannot handle the task's complexity or context window, Cactus handles the failover automatically.
FAQ
Is Cactus free?
Cactus will always have a free tier. Hybrid inference, custom models, and additional hardware acceleration are paid features.
What model format does Cactus use?
With the v1 release, Cactus moved from GGUF to a proprietary .cact format, optimized specifically for battery-efficient inference and minimal RAM usage via zero-copy memory mapping.
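Zero-copy memory mapping is a general OS technique, demonstrable here with Python's `mmap` (this illustrates the idea, not the .cact loader itself): the file's bytes are paged in by the OS on demand instead of being bulk-copied into process memory.

```python
import mmap
import os
import struct
import tempfile

# Write a tiny fake weight file, then map it read-only. Reading a
# slice pulls pages from the OS page cache on demand; nothing is
# pre-copied into the process heap -- the idea behind zero-copy
# model loading.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights = [0.1, 0.2, 0.3, 0.4]
with open(path, "wb") as f:
    f.write(struct.pack("<4f", *weights))  # four little-endian float32s

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access by byte offset: read only the third weight
    # (offset = index 2 * 4 bytes per float32).
    third = struct.unpack_from("<f", mm, 2 * 4)[0]
    mm.close()

print(round(third, 1))
```

Because the mapping is backed by the file, several processes can share the same physical pages, and RAM usage reflects only the weights actually touched, which is why memory-mapped model files report such small resident footprints.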
Which models are supported?
You can find our list of supported models here. You can submit a request for model support or contribute by porting a model yourself!
Get Started
Quickstart
Install Cactus and run your first model in minutes
Hybrid AI
Learn about automatic cloud handoff and confidence routing
LLM
Text generation, vision, streaming, and model options
Transcription
Audio transcription with streaming and VAD support
Community
- Join our Discord - Get help and connect with other developers
- Visualize Repository - Explore the codebase structure
- GitHub Repository - View source code and contribute