C++ SDK

Complete guide to using Cactus SDK with C++ applications for native development

Cactus C++

Energy-efficient AI inference framework and kernels for smartphones & AI-native hardware.

Budget and mid-range phones account for over 70% of the market, yet today's frameworks optimise for high-end phones with advanced chips.

Cactus is designed from the ground up, with no dependencies, to run on all mobile devices.

Architecture

Cactus exposes 4 levels of abstraction:

┌─────────────────┐
│   Cactus FFI    │ ←── OpenAI compatible C API for integration  
└─────────────────┘

┌─────────────────┐
│  Cactus Engine  │ ←── High-level transformer engine
└─────────────────┘

┌─────────────────┐  
│  Cactus Graph   │ ←── Unified zero-copy computation graph 
└─────────────────┘

┌─────────────────┐
│ Cactus Kernels  │ ←── Low-level ARM-specific SIMD operations
└─────────────────┘

Cactus Graph

Cactus Graph is a general-purpose numerical computing framework that runs on Cactus Kernels. It is great for implementing custom models and for scientific computing: think JAX for phones.

#include "cactus.h"

CactusGraph graph;

// Declare symbolic inputs with shapes and precisions
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

// Compose operations into a graph; nothing executes yet
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

// Bind concrete buffers to the inputs, then run the whole graph
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();

void* output_data = graph.get_output(result);
graph.hard_reset();  // reset graph state once outputs have been consumed
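
get_output returns an untyped pointer into the graph's buffers. Here is a minimal sketch of reading the result back, assuming (the header is the source of truth) that this chain materializes its output as FP32, and reading it before hard_reset() releases graph memory:

#include <cstdio>

// ASSUMPTION: the output buffer holds FP32 values; verify the precision
// and shape against cactus.h before relying on this.
const float* out = static_cast<const float*>(output_data);
for (int i = 0; i < 6; ++i)          // assuming a 3x2 result, 6 elements
    std::printf("%8.3f ", out[i]);
std::printf("\n");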

Cactus Engine

Cactus Engine is a transformer inference engine built on top of Cactus Graph. It is exposed through the minimalist Cactus Foreign Function Interface (FFI).

#include "cactus.h"

const char* model_path = "path/to/weight/folder";
cactus_model_t model = cactus_init(model_path, 2048);  // 2048 = context size

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "temperature": 0.1,
    "top_p": 0.95,
    "top_k": 20,
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[1024];
int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
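
A sketch of consuming the call, assuming (check the header for the actual convention) that a negative return code signals failure and that response is written NUL-terminated on success:

// ASSUMED error convention: negative = failure
if (result < 0) {
    std::fprintf(stderr, "cactus_complete failed: %d\n", result);
} else {
    std::printf("%s\n", response);  // assumed NUL-terminated reply
}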

With tool calling support:

const char* tools = R"([
    {
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name",
                        "required": true
                    }
                },
                "required": ["location"]
            }
        }
    }
])";

int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);

This makes it easy to write Cactus bindings for any language. Header files are self-documenting, but documentation contributions are welcome.
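
As an illustration, here is a hedged sketch of the thinnest possible C++ binding over the FFI. cactus_destroy() is an assumption standing in for whatever teardown function cactus.h actually declares:

#include "cactus.h"
#include <stdexcept>
#include <string>

// Minimal RAII wrapper over the C FFI (a sketch, not the official API)
class CactusModel {
public:
    CactusModel(const std::string& weights_dir, int context_size = 2048)
        : model_(cactus_init(weights_dir.c_str(), context_size)) {
        if (!model_) throw std::runtime_error("cactus_init failed");
    }
    ~CactusModel() { cactus_destroy(model_); }  // ASSUMED teardown function
    CactusModel(const CactusModel&) = delete;
    CactusModel& operator=(const CactusModel&) = delete;

    std::string complete(const char* messages, const char* options) {
        char buf[4096];
        int rc = cactus_complete(model_, messages, buf, sizeof(buf),
                                 options, nullptr, nullptr, nullptr);
        if (rc < 0) throw std::runtime_error("cactus_complete failed");
        return buf;
    }

private:
    cactus_model_t model_;
};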

Using Cactus in your apps

Cactus SDKs process over 500,000 inference tasks in production every week. Give them a try!


Run this code locally

You can run the C++ code directly on any Mac with an Apple chip, since Apple Silicon shares the ARM architecture the Cactus kernels target.

Performance gains are greatest on mobile devices, but for testing during development, a vanilla M3 running CPU-only can serve Qwen3-600m-INT8 at 60-70 toks/sec:

Pull weights

Generate weights from a HuggingFace model:

python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m-i8/ --precision INT8
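
The resulting folder is the weights path you hand to cactus_init, as in the engine example above:

cactus_model_t model = cactus_init("weights/qwen3-600m-i8/", 2048);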

Execute

Build and test:

./tests/run.sh  # remember to chmod +x any script the first time

Next steps