C++ SDK
A complete guide to using the Cactus SDK in C++ for native development
Cactus C++
Energy-efficient AI inference framework and kernels for smartphones & AI-native hardware.
Budget and mid-range phones account for over 70% of the market, yet today's frameworks optimise for high-end phones with advanced chips.
Cactus is designed bottom-up, with no dependencies, to run well on all mobile devices.
Architecture
Cactus exposes 4 levels of abstraction:
┌─────────────────┐
│  Cactus FFI    │ ←── OpenAI-compatible C API for integration
└─────────────────┘
│
┌─────────────────┐
│ Cactus Engine │ ←── High-level transformer engine
└─────────────────┘
│
┌─────────────────┐
│ Cactus Graph │ ←── Unified zero-copy computation graph
└─────────────────┘
│
┌─────────────────┐
│ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations
└─────────────────┘
Cactus Graph
Cactus Graph is a general numerical computing framework that runs on Cactus Kernels. It is great for implementing custom models and scientific computing; think of it as JAX for phones.
#include "cactus.h"

CactusGraph graph;

// Declare symbolic inputs: a 2x3 FP16 tensor and a 3x4 INT8 tensor.
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

// Compose operations; nothing is computed until execute() is called.
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

// Bind concrete buffers to the declared inputs.
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

// Run the graph and read back the result buffer.
graph.execute();
void* output_data = graph.get_output(result);

// Release all buffers owned by the graph.
graph.hard_reset();
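Because the graph separates structure from data, it can in principle be reused across batches by rebinding its inputs. The sketch below assumes (check the header to confirm) that set_input and execute may be called repeatedly before a final hard_reset:

#include "cactus.h"

// Minimal sketch: reusing one graph across batches. Assumes set_input and
// execute can be called repeatedly, and hard_reset is only needed at the end.
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto y = graph.matmul(a, b, false);

float batch0[6] = {1, 2, 3, 4, 5, 6};
float batch1[6] = {6, 5, 4, 3, 2, 1};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(b, b_data, Precision::INT8);      // fixed operand

float* batches[] = {batch0, batch1};
for (float* batch : batches) {
    graph.set_input(a, batch, Precision::FP16);   // rebind activations only
    graph.execute();
    void* out = graph.get_output(y);              // consume before the next run
    (void)out;
}
graph.hard_reset();                               // release buffers once done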
Cactus Engine
Cactus Engine is a transformer inference engine built on top of Cactus Graph. It is exposed through the minimalist Cactus Foreign Function Interface (FFI).
#include "cactus.h"

// Initialise the model from a local weight folder with a 2048-token context.
const char* model_path = "path/to/weight/folder";
cactus_model_t model = cactus_init(model_path, 2048);

// OpenAI-style chat messages, passed as a JSON string.
const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
])";

// Sampling and decoding options, also JSON.
const char* options = R"({
    "temperature": 0.1,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[1024];
int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
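In practice you will want to check the return code. Here is a hypothetical convenience wrapper; the negative-means-failure convention is an assumption, so consult the header for the actual one:

#include <cstddef>
#include <cstdio>
#include "cactus.h"

// Hypothetical helper around cactus_complete. Assumes (not confirmed by the
// header) that a negative return value signals failure.
bool complete_or_report(cactus_model_t model, const char* messages,
                        const char* options, char* out, std::size_t out_size) {
    int rc = cactus_complete(model, messages, out, out_size,
                             options, nullptr, nullptr, nullptr);
    if (rc < 0) {
        std::fprintf(stderr, "cactus_complete failed with code %d\n", rc);
        return false;
    }
    return true;
}

Called as complete_or_report(model, messages, options, response, sizeof(response)), it keeps the error handling in one place as calls multiply.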
With tool calling support:
const char* tools = R"([
    {
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
])";
int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);
This makes it easy to write Cactus bindings for any language. Header files are self-documenting, but documentation contributions are welcome.
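For orientation, here is the C surface those calls imply. This declaration sketch is reconstructed from the examples above rather than copied from the header, so treat the parameter names and exact types as illustrative:

// Reconstructed from the examples above; illustrative only, the real
// declarations live in cactus.h.
extern "C" {
    typedef void* cactus_model_t;   // opaque model handle (assumed)

    // Load weights from a folder with the given context length.
    cactus_model_t cactus_init(const char* model_path, int context_size);

    // Run a chat completion; messages, options, and tools are JSON strings.
    // The examples pass nullptr for the two trailing parameters.
    int cactus_complete(cactus_model_t model, const char* messages,
                        char* response, size_t response_size,
                        const char* options, const char* tools,
                        void* reserved_a, void* reserved_b);
}

A binding for another language only has to wrap this handful of functions and marshal JSON strings across the boundary.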
Using Cactus in your apps
Cactus SDKs process over 500,000 inference tasks in production every week. Give them a try!
Flutter SDK
Cross-platform mobile development in Flutter
React Native SDK
Cross-platform mobile development in React Native
Demo
Run this code locally
You can run the C++ code directly on any Mac with an Apple chip, since Apple silicon shares the ARM architecture the kernels target.
Performance gains are greatest on mobile devices, but for testing during development, a vanilla M3 running CPU-only can serve Qwen3-600m-INT8 at 60-70 toks/sec:
Pull weights
Generate weights from a HuggingFace model:
python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m-i8/ --precision INT8
Execute
Build and test:
./tests/run.sh  # remember to chmod +x any script the first time