LLM
Text generation, vision, streaming, and model options with Cactus
Basic Completion
React Native:

```typescript
import { CactusLM } from 'cactus-react-native';

const cactusLM = new CactusLM();
await cactusLM.download();
await cactusLM.init();

const result = await cactusLM.complete({
  messages: [{ role: 'user', content: 'Hello!' }],
  onToken: (token) => console.log(token) // Stream tokens
});
console.log(result.response);
```

Using the Hook:

```tsx
const cactusLM = useCactusLM();

const handleComplete = async () => {
  await cactusLM.complete({
    messages: [{ role: 'user', content: 'Hello!' }]
  });
};

return <Text>{cactusLM.completion}</Text>;
```

Flutter:

```dart
import 'cactus.dart';

final model = Cactus.create('/path/to/model.gguf');
final result = model.complete('What is the capital of France?');
print(result.text);
model.dispose();
```

Kotlin:

```kotlin
import com.cactus.*

val model = Cactus.create("/path/to/model")
val result = model.complete("What is the capital of France?")
println(result.text)
model.close()
```

C++:

```cpp
#include <cactus.h>

cactus_model_t model = cactus_init("path/to/weight/folder", nullptr);

const char* messages = R"([
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is the capital of France?"}
])";

const char* options = R"({
  "max_tokens": 50,
  "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
  model, messages, response, sizeof(response),
  options, nullptr, nullptr, nullptr
);
```

Response Format:
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "The capital of France is Paris.",
  "function_calls": [],
  "confidence": 0.8193,
  "time_to_first_token_ms": 45.23,
  "total_time_ms": 163.67,
  "prefill_tps": 1621.89,
  "decode_tps": 168.42,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
```

Chat Messages
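The Dart and Kotlin snippets in this section use Message.system / Message.user constructors; in React Native the same structure is plain objects. A minimal TypeScript sketch (the helper functions here are illustrative conveniences, not part of the SDK):

```typescript
// Hypothetical helpers that build the plain message objects the
// React Native API expects; the SDK itself just takes object literals.
type Role = 'system' | 'user' | 'assistant';

interface Message {
  role: Role;
  content: string;
}

const system = (content: string): Message => ({ role: 'system', content });
const user = (content: string): Message => ({ role: 'user', content });

const messages: Message[] = [
  system('You are a helpful assistant.'),
  user('What is 2 + 2?'),
];
```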
Flutter:

```dart
final model = Cactus.create(modelPath);
final result = model.completeMessages([
  Message.system('You are a helpful assistant.'),
  Message.user('What is 2 + 2?'),
]);
print(result.text);
model.dispose();
```

Kotlin:

```kotlin
Cactus.create(modelPath).use { model ->
  val result = model.complete(
    messages = listOf(
      Message.system("You are a helpful assistant."),
      Message.user("What is 2 + 2?")
    )
  )
  println(result.text)
}
```

Completion Options
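The option set has the same shape in every language. A TypeScript sketch with common defaults filled in (the default values here are assumptions for illustration, not the SDK's documented defaults):

```typescript
// Sampling options as they appear across the Cactus bindings.
interface CompletionOptions {
  temperature?: number;    // higher values produce more random output
  topP?: number;           // nucleus-sampling probability cutoff
  topK?: number;           // consider only the K most likely tokens
  maxTokens?: number;      // hard cap on generated tokens
  stopSequences?: string[]; // generation halts on any of these
}

// Fill in unset fields with illustrative defaults (assumed, not official).
function withDefaults(opts: CompletionOptions): Required<CompletionOptions> {
  return {
    temperature: opts.temperature ?? 0.7,
    topP: opts.topP ?? 0.9,
    topK: opts.topK ?? 40,
    maxTokens: opts.maxTokens ?? 256,
    stopSequences: opts.stopSequences ?? [],
  };
}
```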
Flutter:

```dart
final options = CompletionOptions(
  temperature: 0.7,
  topP: 0.9,
  topK: 40,
  maxTokens: 256,
  stopSequences: ['\n\n'],
);
final result = model.complete('Write a haiku:', options: options);
```

Kotlin:

```kotlin
val options = CompletionOptions(
  temperature = 0.7f,
  topP = 0.9f,
  topK = 40,
  maxTokens = 256,
  stopSequences = listOf("\n\n")
)
val result = model.complete("Write a haiku:", options)
```

Vision
Vision-capable models can analyze images alongside text.
React Native:

```typescript
const cactusLM = new CactusLM({ model: 'lfm2-vl-450m' });

await cactusLM.complete({
  messages: [
    {
      role: 'user',
      content: "What's in this image?",
      images: ['path/to/image.jpg']
    }
  ]
});
```

In the other bindings, vision is supported through the same cactus_complete API with vision-capable models (LFM2-VL, LFM2.5-VL). Pass image paths in the message content.
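A vision request is an ordinary message plus an images array of file paths. A small sketch of a builder (visionMessage is a hypothetical helper, not an SDK function, and the empty-list check is an illustrative choice):

```typescript
// Shape of a vision message, matching the React Native example above.
interface VisionMessage {
  role: 'user';
  content: string;
  images: string[];
}

// Hypothetical convenience: pair a text prompt with one or more images.
function visionMessage(prompt: string, imagePaths: string[]): VisionMessage {
  if (imagePaths.length === 0) {
    throw new Error('vision messages need at least one image');
  }
  return { role: 'user', content: prompt, images: imagePaths };
}
```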
Streaming
Stream tokens as they are generated for responsive UIs.
React Native:

```typescript
const result = await cactusLM.complete({
  messages: [{ role: 'user', content: 'Tell me a story' }],
  onToken: (token) => console.log(token)
});
```

Using the Hook:
The useCactusLM hook automatically updates cactusLM.completion as tokens stream in — no callback needed.
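If you manage state yourself instead of using the hook, accumulating streamed tokens into the full text is straightforward. A sketch (the hook's internals are assumed to work roughly like this):

```typescript
// Collect streamed tokens into a growing completion string.
function makeAccumulator() {
  let text = '';
  return {
    // Pass this as the onToken callback to complete().
    onToken: (token: string) => { text += token; },
    // Read the text accumulated so far.
    get completion() { return text; },
  };
}
```

Usage: create one accumulator per request, hand acc.onToken to the completion call, and render acc.completion as it grows.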
Flutter:

```dart
final result = model.complete(
  'Tell me a story',
  callback: (token, tokenId) {
    print(token);
  },
);
```

Kotlin:

```kotlin
val result = model.complete(
  messages = listOf(Message.user("Tell me a story")),
  callback = TokenCallback { token, tokenId ->
    print(token)
  }
)
```

C++:

```cpp
void token_callback(const char* token, int token_id, void* user_data) {
  printf("%s", token);
  fflush(stdout);
}

cactus_complete(
  model, messages, response, sizeof(response),
  nullptr, nullptr,
  token_callback, // streaming callback
  nullptr         // user data
);
```

Cloud Handoff
When the model lacks confidence, the response signals a cloud handoff:
```typescript
const result = await cactusLM.complete({
  messages: [{ role: 'user', content: 'Explain quantum entanglement' }]
});

if (result.cloudHandoff) {
  // Use Cactus Cloud for better accuracy
}
```

In the other bindings, the CompletionResult includes needsCloudHandoff and confidence fields. Check these to decide whether to route to a cloud API.

```json
{
  "success": true,
  "cloud_handoff": true,
  "response": null,
  "confidence": 0.42
}
```

Your application should route to a cloud API when cloud_handoff is true.
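A routing decision can combine the handoff flag with your own confidence threshold. A sketch (the 0.5 threshold and the LocalResult shape are illustrative assumptions):

```typescript
// Minimal view of the fields the routing decision needs.
interface LocalResult {
  cloudHandoff: boolean;
  confidence: number;
  response: string | null;
}

// Route to cloud when the runtime flags a handoff, or when confidence
// falls below an app-chosen threshold (0.5 here is arbitrary).
function shouldUseCloud(result: LocalResult, minConfidence = 0.5): boolean {
  return result.cloudHandoff || result.confidence < minConfidence;
}
```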
Model Options
Choose quantization and enable NPU acceleration:
React Native:

```typescript
const cactusLM = new CactusLM({
  model: 'lfm2-vl-450m',
  options: {
    quantization: 'int8', // 'int4' or 'int8'
    pro: true // Enable NPU acceleration
  }
});
```

In the other bindings, precision is set at model conversion time. Use cactus convert with the desired precision, then load the converted model. Supported precision types:

- Precision::FP32 - full precision
- Precision::FP16 - half precision, recommended for mobile
- Precision::INT8 - 8-bit quantized, best performance/size ratio
- Precision::INT4 - 4-bit quantized, smallest size
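Precision choice translates directly into model size: roughly 4, 2, 1, and 0.5 bytes per parameter for FP32, FP16, INT8, and INT4. A back-of-the-envelope sketch (real converted files also carry embeddings and metadata, so treat this as a lower bound):

```typescript
// Approximate bytes per parameter for each precision.
const bytesPerParam: Record<string, number> = {
  fp32: 4,
  fp16: 2,
  int8: 1,
  int4: 0.5,
};

// Estimate weight size in MB for a model with the given parameter count.
function estimateModelMB(params: number, precision: keyof typeof bytesPerParam): number {
  return (params * bytesPerParam[precision]) / (1024 * 1024);
}

// A 450M-parameter model at int8 is roughly 429 MB of weights;
// the same model at int4 is roughly half that.
```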
Performance Tips
- Model Selection - Use smaller models (`qwen3-0.6b`, `lfm2-350m`) for faster inference on mobile
- Quantization - `int4` uses less memory, `int8` is more accurate
- NPU Acceleration - Enable `pro: true` for models that support it (iOS/Android NPU)
- Memory - Always call `destroy()` / `dispose()` / `close()` when done to free resources
- Reuse model handles across requests (don't reinitialize)