Connected personas run inference on a paired contact’s device when you use their shared persona — not on your machine. See Personas for how that differs from resident and portable personas.
Inference backends
webAI supports three inference backends. The system automatically selects the best one based on your hardware, or you can choose manually.
- WebGPU
- MLX
- llama.cpp
WebGPU provides browser-native GPU inference. It works on any platform with a supported browser (Chrome 113+, Edge 113+) and is the default backend in browser mode.
- Runs in the browser — no native app required
- Uses your GPU via the WebGPU API
- Best for: quick setup, cross-platform compatibility
WebGPU support varies by device. On older GPUs, or where WebGPU is unavailable, the desktop app falls back to llama.cpp.
Automatic backend routing
When you load a model, the system profiles your device and selects the best backend automatically:
Device profiling
The system checks your available memory, GPU capabilities, and platform (macOS, browser, etc.).
Backend selection
Based on the profile, it selects WebGPU, MLX, or llama.cpp — prioritizing speed and model compatibility.
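The two steps above can be pictured as a small routing function. This is a minimal sketch only: the `DeviceProfile` fields, the `select_backend` name, and the priority order (MLX before llama.cpp on Apple Silicon) are assumptions made for illustration, not webAI's actual routing logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceProfile:
    """Hypothetical device profile; field names are illustrative."""
    in_browser: bool     # running in browser mode rather than the desktop app
    has_webgpu: bool     # the WebGPU API is available and usable
    apple_silicon: bool  # macOS on Apple Silicon, where MLX can run


def select_backend(profile: DeviceProfile) -> Optional[str]:
    """Pick a backend name, or None if nothing suitable is available."""
    if profile.in_browser:
        # WebGPU is the default in browser mode; a browser has no native fallback.
        return "webgpu" if profile.has_webgpu else None
    if profile.apple_silicon:
        # Assumption: prefer MLX on Apple Silicon in the desktop app.
        return "mlx"
    # Desktop app without MLX: use llama.cpp.
    return "llama.cpp"
```

For example, `select_backend(DeviceProfile(in_browser=False, has_webgpu=False, apple_silicon=True))` returns `"mlx"` under these assumptions.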
Model tiers
Models are organized into tiers based on size and the memory required to run them. The system recommends a tier based on your device’s capabilities.
MLX models (Apple Silicon)
| Tier | Model | Memory required |
|---|---|---|
| 1 | Gemma 3n E2B | 4 GB |
| 2 | Gemma 3n E4B / Qwen3 4B | 8 GB |
| 3 | Gemma 3 12B / Qwen3 8B | 16 GB |
| 4 | Qwen3 14B | 24 GB |
| 5 | Qwen3 32B | 64 GB |
| 6 | Qwen3 235B | 64+ GB |
llama.cpp models (GGUF)
| Tier | Model | Memory required |
|---|---|---|
| 1 | Llama 3.2 3B / Qwen3 0.6B | 4 GB |
| 2 | Gemma 3 4B / Qwen3 4B | 8 GB |
| 3 | Qwen3 8B / Gemma 3 12B | 16 GB |
| 4 | Gemma 3 12B / Qwen3 14B | 24 GB |
| 5 | Gemma 3 27B / Qwen3 32B | 64 GB |
WebGPU models (Browser)
| Tier | Model | GPU memory required |
|---|---|---|
| 1 | Qwen3 0.6B | 4 GB |
| 2 | Qwen3 1.7B | 8 GB |
| 3+ | Qwen3 4B | 16 GB or more |
WebGPU model availability depends on your device’s GPU memory. The system automatically falls back to smaller models if a tier can’t be loaded. For the largest models, use the desktop app with MLX or llama.cpp.
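As a rough illustration of how a tier recommendation could follow from the tables, the sketch below picks the highest tier whose listed memory requirement fits the device. The thresholds are copied from the tables; the `recommend_tier` function and the idea of comparing against total memory are assumptions, and MLX tier 6 is left out because its "64+ GB" requirement is not a precise threshold.

```python
from typing import Optional

# (tier, memory required in GB), taken from the tier tables above.
TIERS = {
    "mlx": [(1, 4), (2, 8), (3, 16), (4, 24), (5, 64)],
    "llama.cpp": [(1, 4), (2, 8), (3, 16), (4, 24), (5, 64)],
    "webgpu": [(1, 4), (2, 8), (3, 16)],
}


def recommend_tier(backend: str, memory_gb: float) -> Optional[int]:
    """Return the highest tier that fits in memory_gb, or None if none fits."""
    best = None
    for tier, required_gb in TIERS[backend]:
        if memory_gb >= required_gb:
            best = tier  # tiers are listed in ascending order
    return best
```

With these numbers, a device with 16 GB on MLX lands in tier 3, while a browser with 6 GB of GPU memory lands in WebGPU tier 1. A real runtime would likely also consider memory already in use.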
LoRA adapters
LoRA (Low-Rank Adaptation) adapters let you customize a model’s behavior without downloading an entirely new model. Adapters are small files that layer on top of a base model to specialize it for a particular domain or style.
Available adapters
| Adapter | Base model | Purpose |
|---|---|---|
| Chatbot LoRA | 0.5B | General conversational improvements |
| PubMedQA LoRA | 0.5B | Medical and biomedical Q&A |
| QwQ Creative LoRA | 0.5B | Creative writing and storytelling |
| Math LoRA | 0.6B | Mathematical reasoning |
| UltraChat SFT | 1.7B | Instruction following |
| SFT LoRA | 1.7B | Supervised fine-tuning |
| DPO LoRA | 1.7B | Alignment and preference optimization |
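To see why adapters are small, recall that LoRA replaces a full weight update with the product of two low-rank matrices, B (d_out × r) and A (r × d_in). The sketch below compares parameter counts for a single layer; the dimensions and rank are illustrative values, not those used by the adapters listed above.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-r LoRA update: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative layer size and rank (assumptions, not webAI's settings).
d_out, d_in, rank = 1024, 1024, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(f"full: {full}, lora: {lora}, ratio: {full / lora:.0f}x")
```

For this example layer the adapter holds 64 times fewer parameters than the full matrix, which is why an adapter file can be much smaller than its base model.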
Idle management
To conserve resources, the AI runtime automatically manages model lifecycle:
- Soft timeout: If no requests are made for a short period, the runtime begins preparing to release resources.
- Hard timeout: After continued inactivity, the model is fully unloaded from memory to free hardware for other apps.
- Instant reload: When you send a new message, the model loads back automatically.
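The lifecycle above can be modeled as a small state machine. This is a sketch under stated assumptions: the `IdleManager` class, the state names, and the timeout values (60 and 300 seconds) are hypothetical, since the actual durations are not documented here.

```python
from enum import Enum


class ModelState(Enum):
    LOADED = "loaded"
    RELEASING = "releasing"  # soft timeout reached, preparing to free resources
    UNLOADED = "unloaded"    # hard timeout reached, model removed from memory


class IdleManager:
    def __init__(self, soft_timeout_s: float = 60.0, hard_timeout_s: float = 300.0):
        # Timeout values are placeholders, not webAI's real settings.
        self.soft = soft_timeout_s
        self.hard = hard_timeout_s
        self.last_request = 0.0
        self.state = ModelState.UNLOADED

    def on_request(self, now: float) -> None:
        """A new message reloads the model if needed and resets the idle clock."""
        self.last_request = now
        self.state = ModelState.LOADED

    def tick(self, now: float) -> ModelState:
        """Update the state based on how long the model has been idle."""
        if self.state is ModelState.UNLOADED:
            return self.state
        idle = now - self.last_request
        if idle >= self.hard:
            self.state = ModelState.UNLOADED
        elif idle >= self.soft:
            self.state = ModelState.RELEASING
        return self.state
```

Passing the current time in explicitly keeps the sketch deterministic; a real runtime would read a monotonic clock and perform the actual load and unload work in these transitions.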
Learn more
Oasis
The AI app where you interact with your local models.
Choosing a model
A practical guide to picking the right model for your hardware.