Cactus

Overview

On-device and hybrid cross-platform AI framework

What's New in v1.6

  • Audio Transcription: Transcribe audio with Whisper (Small/Medium) and Moonshine models with Apple NPU support
  • Vision Models: Support for LFM2-VL and LFM2.5-VL with image understanding capabilities
  • Apple NPU Acceleration: Hardware acceleration for Whisper and vision models on Apple devices
  • New Models: Gemma-3-270M, LFM2.5-1.2B-Thinking, Qwen3 variants, and more
  • Enhanced CLI: Improved cactus transcribe command with live microphone support

Cactus is a hybrid inference engine for smartphones and edge devices. It offers industry-leading on-device performance and automatically optimizes your AI workloads by routing requests between:

  • On-device: Smaller models running on the Cactus engine with NPU acceleration
  • Cloud: Frontier models for state-of-the-art performance

Cactus Hybrid measures a model's confidence in its responses in real time and routes requests accordingly.
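The confidence-based routing described above can be sketched as follows. This is an illustrative assumption about the general technique, not Cactus's actual implementation: it scores a local response by its mean per-token probability and routes to the cloud when that score falls below a threshold (the threshold value here is made up).

```python
# Hypothetical sketch of confidence-based routing, NOT the real Cactus API:
# score a local decode by mean token probability, fall back to cloud when low.
import math

CONFIDENCE_THRESHOLD = 0.7  # assumed tunable cutoff, for illustration only

def mean_confidence(token_logprobs):
    """Average per-token probability recovered from log-probabilities."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def route(token_logprobs):
    """Return 'on-device' when the local model is confident, else 'cloud'."""
    if mean_confidence(token_logprobs) >= CONFIDENCE_THRESHOLD:
        return "on-device"
    return "cloud"

print(route([-0.05, -0.1, -0.02]))  # high-probability tokens -> "on-device"
print(route([-1.5, -2.0, -0.9]))    # uncertain tokens -> "cloud"
```

The appeal of this design is that the routing signal comes for free: token log-probabilities are already produced during decoding, so no separate classifier pass is needed.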

Architecture

Cactus consists of three layers that work together to deliver efficient on-device AI:

Cactus Engine

Energy-efficient inference engine

OpenAI-compatible APIs for C/C++, Swift, Kotlin, Flutter. Supports tool calling, auto RAG, NPU acceleration, INT4 quantization, and hybrid cloud handoff for complex tasks.
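Since the engine exposes OpenAI-compatible APIs, request bodies follow the familiar chat-completion schema. The sketch below builds such a payload in Python; the model name is an illustrative placeholder, and how the body is actually submitted (native binding vs. HTTP) depends on the platform SDK.

```python
# Build an OpenAI-style chat-completion request body. The model name
# "lfm2.5-1.2b" is an assumed placeholder, not a confirmed identifier.
import json

def chat_request(model, user_message, tools=None):
    """Return an OpenAI-compatible chat-completion request dict."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    if tools:
        body["tools"] = tools  # tool calling reuses the standard tools schema
    return body

payload = chat_request("lfm2.5-1.2b", "Summarize this note.")
print(json.dumps(payload, indent=2))
```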

Cactus Graph

Zero-copy computation graph

PyTorch-like API for implementing custom models, highly optimized for RAM efficiency and lossless weight quantization.
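To illustrate what a "PyTorch-like" computation graph means in principle (this is a generic sketch, not the Cactus Graph API): operations are recorded lazily as nodes, and nothing executes until the graph is evaluated.

```python
# Minimal lazy computation graph, illustrative only: building an
# expression records nodes; evaluate() walks the graph on demand.
class Node:
    def __init__(self, value=None, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs

    def __add__(self, other):
        return Node(op=lambda a, b: a + b, inputs=(self, other))

    def __mul__(self, other):
        return Node(op=lambda a, b: a * b, inputs=(self, other))

    def evaluate(self):
        """Compute this node; leaves return their stored value directly."""
        if self.op is None:
            return self.value
        return self.op(*(n.evaluate() for n in self.inputs))

x, w, b = Node(3.0), Node(2.0), Node(1.0)
y = x * w + b          # graph is built lazily; nothing has run yet
print(y.evaluate())    # prints 7.0
```

Deferring execution like this is what enables graph-level optimizations such as operating directly on memory-mapped weights instead of copying them.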

Cactus Kernels

Low-level ARM SIMD kernels

Optimized for Apple, Snapdragon, Google, Exynos, and MediaTek processors. Custom attention kernels with KV-Cache quantization and chunked prefill.
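KV-cache quantization, mentioned above, generally works by storing cached keys and values at low precision and dequantizing on read. The sketch below shows per-tensor INT8 quantization as an assumption about the general technique, not the actual kernel implementation.

```python
# Per-tensor INT8 quantization sketch (general technique, not the
# Cactus kernels): store a scale plus int8 values, dequantize on read.
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one shared scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the int8 representation."""
    return [i * scale for i in q]

kv = [0.12, -0.5, 0.33, 0.9]
q, s = quantize_int8(kv)
recovered = dequantize_int8(q, s)
# rounding error is bounded by half a quantization step
assert all(abs(a - b) <= s / 2 for a, b in zip(kv, recovered))
```

For a KV cache this cuts memory per cached token to roughly a quarter of FP32, which directly extends the usable context length on RAM-constrained devices.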

Performance Benchmarks

All benchmarks below use INT8-quantized models:

Flagship Models

Device           | LFM2.5-1.2B (1k prefill / 100 decode) | LFM2.5-VL-1.6B (256 px latency & decode) | Whisper-Small (30 s audio latency & decode)
Mac M4 Pro       | 582/77 tps                            | 0.2 s & 76 tps                           | 0.1 s & 111 tps
iPhone 17 Pro    | 300/33 tps                            | 0.3 s & 33 tps                           | 0.6 s & 114 tps
Galaxy S25 Ultra | 226/36 tps                            | 2.6 s & 33 tps                           | 2.3 s & 90 tps

Mid-range Models

Device   | LFM2-350m (1k prefill / 100 decode) | LFM2-VL-450m (256 px latency & decode) | Moonshine-Base (30 s audio latency & decode)
Pixel 6a | 218/44 tps                          | 2.5 s & 36 tps                         | 1.5 s & 189 tps

How Hybrid Routing Works

Cactus eliminates the choice between expensive cloud and limited local compute.

  • Smart Routing: Cactus dynamically routes requests to the on-device NPU/CPU for simple tasks (like clear audio transcription or standard LLM queries) and scales up to cloud APIs for complex or noisy data.
  • Cloud Fallback: Configure your Cactus API key and choose a fallback model. If the local model cannot handle the task's complexity or context window, Cactus fails over automatically.
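The fallback behavior can be sketched as a simple try-local-then-cloud wrapper. Everything here is a hypothetical illustration: the function names, the context-window limit, and the failure condition are assumptions, not Cactus's real API.

```python
# Hypothetical failover sketch (names and limits assumed): try the
# on-device model first, fall back to the cloud model when the prompt
# exceeds the local context window.
LOCAL_CONTEXT_WINDOW = 4096  # assumed token limit for the local model

def run_local(prompt_tokens):
    """Stand-in for on-device inference; rejects oversized prompts."""
    if len(prompt_tokens) > LOCAL_CONTEXT_WINDOW:
        raise ValueError("prompt exceeds local context window")
    return "local-result"

def run_cloud(prompt_tokens):
    """Stand-in for the configured cloud fallback model."""
    return "cloud-result"

def generate(prompt_tokens):
    try:
        return run_local(prompt_tokens)
    except ValueError:
        return run_cloud(prompt_tokens)  # automatic failover

print(generate(list(range(100))))    # fits locally -> "local-result"
print(generate(list(range(10000))))  # too long -> "cloud-result"
```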

FAQ

Is Cactus free?

Cactus will always have a free tier. Hybrid inference, custom models, and additional hardware acceleration are paid features.

What model format does Cactus use?

With the v1 release, Cactus moved from GGUF to a proprietary .cact format, optimized specifically for battery-efficient inference and minimal RAM usage (via zero-copy memory mapping).
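Zero-copy memory mapping, the idea behind the format's low RAM footprint, means weights are read straight from the file through OS-managed pages instead of being copied into a heap buffer. The sketch below demonstrates the mechanism with Python's mmap; the file layout is entirely made up and has nothing to do with the real .cact format.

```python
# Zero-copy weight loading via memory mapping (mechanism demo only;
# the file layout here is illustrative, not the .cact format).
import mmap
import os
import struct
import tempfile

# Write a tiny stand-in weight file: 4 little-endian float32 values.
weights = [0.5, -1.25, 2.0, 3.5]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4f", *weights))

# Map the file read-only: the OS pages weights in on demand instead of
# copying the whole file into process memory up front.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast("f")  # float32 view over the mapping, no copy
    loaded = list(view)
    view.release()                   # release the buffer before closing
    mm.close()

print(loaded)  # [0.5, -1.25, 2.0, 3.5]
```

Because pages are shared and evictable, several processes mapping the same model file pay for its weights roughly once, which matters on memory-constrained phones.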

Which models are supported?

You can find our list of supported models here. You can submit a request for model support or contribute by porting a model yourself!

Get Started

Community