Cactus

C++ Engine

Cactus Graph API, precision types, and native C++ engine internals

Cactus Graph API

Build custom computation graphs with the PyTorch-like Graph API:

#include <cactus.h>

CactusGraph graph;

// Define inputs
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

// Build computation graph
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

// Set input data
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

// Execute
graph.execute();

// Get output
void* output_data = graph.get_output(result);

// Clean up
graph.hard_reset();

Precision Types

  • Precision::FP32 - Full precision floating point
  • Precision::FP16 - Half precision (recommended for mobile)
  • Precision::INT8 - 8-bit quantized (best performance/size ratio)
  • Precision::INT4 - 4-bit quantized (smallest size)

Error Handling

int result = cactus_complete(...);

if (result != 0) {
    // On failure, parse the response JSON; its "error" field
    // contains a specific error message.
}

Common error scenarios:

  • Model not found or corrupted
  • Insufficient memory
  • Invalid input format
  • Context length exceeded

Performance Tips

  1. Use INT8 quantization for best performance/quality balance
  2. Enable NPU on Apple devices for vision and transcription models
  3. Implement cloud handoff for complex queries
  4. Reuse model handles across requests (don't reinitialize)
  5. Pre-allocate buffers for streaming to avoid memory allocation overhead

Next Steps