Running AI Models Locally in Your Browser
Local Mode lets you run AI models entirely in your browser using WebGPU. Your conversations never leave your device - true privacy with zero server dependency.
Prerequisites
- A modern browser with WebGPU support (Chrome 113+, Edge 113+, Opera 99+)
- A GPU with sufficient VRAM for your chosen model
- Enough storage space for model downloads (500MB - 16GB depending on model)
How It Works
Local Mode uses WebLLM to run machine learning models directly in your browser via WebGPU acceleration. Models are downloaded once and cached in your browser’s IndexedDB for instant loading on future visits.
Benefits:
- Complete privacy - conversations never leave your device
- Works offline after initial model download
- No API costs or rate limits
- No account required
Limitations:
- Requires a capable GPU
- Model quality varies (smaller models = less capable)
- No image input support for most models (text only; Phi 3.5 Vision is an experimental exception)
- Initial download can take several minutes
Getting Started
Step 1: Open Local Mode Setup
Click the Settings icon (gear) in the bottom left to open the sidebar, then click Local Mode in the Tool Belt section.
Step 2: Check WebGPU Support
The modal will automatically check if your browser and GPU support WebGPU. You’ll see either:
- Green checkmark: WebGPU supported, showing your GPU name
- Red X: WebGPU not supported - try updating your browser or enabling GPU acceleration
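The check the modal performs can be approximated with the standard WebGPU API. This is a hedged sketch, not the app's actual code; it accepts a navigator-like object so the logic can run outside a browser (in a real page you would pass the global `navigator`).

```javascript
// Sketch of a WebGPU support check, similar to what the setup modal does.
// Takes a navigator-like object so the logic is testable outside a browser.
async function checkWebGPU(nav) {
  if (!nav.gpu) {
    // navigator.gpu is missing in browsers without WebGPU enabled
    return { supported: false, reason: "navigator.gpu is unavailable" };
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    // The API exists but no usable GPU adapter was found
    return { supported: false, reason: "no suitable GPU adapter found" };
  }
  // Newer browsers expose adapter.info with vendor/architecture strings,
  // which is where a GPU name for the green checkmark could come from.
  return { supported: true, adapter };
}
```

In a page you would call `checkWebGPU(navigator)` and render the green checkmark or red X from the result.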
Step 3: Choose a Model
Select a model from the dropdown based on your GPU’s VRAM:
| VRAM Available | Recommended Models |
|---|---|
| 2GB or less | SmolLM2 135M/360M, Qwen 2.5 0.5B |
| 4GB | TinyLlama 1.1B, Qwen3 1.7B, Phi 3 Mini |
| 6GB | Llama 3.2 3B, Qwen 2.5 3B |
| 8GB+ | Llama 3.1 8B, Mistral 7B, Qwen3 8B |
| 12GB+ | CodeLlama 13B, Gemma 3 12B |
| 16GB+ | Gemma 2 27B |
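If you want to pick programmatically, the table above maps directly to a tier check. This helper is purely illustrative (it is not part of the app), and simply encodes the recommendations as written:

```javascript
// Hypothetical helper mirroring the table above: recommend models by VRAM in GB.
function recommendModels(vramGB) {
  if (vramGB >= 16) return ["Gemma 2 27B"];
  if (vramGB >= 12) return ["CodeLlama 13B", "Gemma 3 12B"];
  if (vramGB >= 8) return ["Llama 3.1 8B", "Mistral 7B", "Qwen3 8B"];
  if (vramGB >= 6) return ["Llama 3.2 3B", "Qwen 2.5 3B"];
  if (vramGB >= 4) return ["TinyLlama 1.1B", "Qwen3 1.7B", "Phi 3 Mini"];
  // 2GB or less: smallest models only
  return ["SmolLM2 135M/360M", "Qwen 2.5 0.5B"];
}
```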
Step 4: Download the Model
Click Download & Enable. The progress bar will show:
- Downloading model weights from HuggingFace
- Compiling shaders for your GPU
- Loading into memory
The first download takes 1-10 minutes depending on model size and internet speed. Subsequent loads are much faster because the model is cached locally.
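Under the hood, the three progress phases map onto WebLLM's init progress callback. `CreateMLCEngine` and the report shape (`progress`, `text`) follow the documented @mlc-ai/web-llm API; the formatting helper and `downloadAndEnable` wrapper are illustrative assumptions, not the app's actual code:

```javascript
// Illustrative: format a WebLLM init progress report for a progress bar.
// report.progress is a fraction in [0, 1]; report.text describes the phase
// (downloading weights, compiling shaders, loading into memory).
function formatProgress(report) {
  return `${Math.round(report.progress * 100)}% - ${report.text}`;
}

// Hypothetical wrapper around the download step (browser-only).
async function downloadAndEnable(modelId) {
  const webllm = await import("@mlc-ai/web-llm"); // runs in the browser
  return webllm.CreateMLCEngine(modelId, {
    initProgressCallback: (report) => console.log(formatProgress(report)),
  });
}
```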
Step 5: Start Chatting
Once downloaded, a Cloud/Local toggle appears next to the Model header in the sidebar. Click Local to switch to local mode, then select your model from the dropdown.
The toggle remembers your preference between sessions.
Switching Between Local and Cloud
After downloading at least one local model, you’ll see a toggle switch in the sidebar:
- Cloud (default) - Uses cloud models via API (requires account for paid models)
- Local - Uses your downloaded models (runs entirely in browser)
Click the toggle to switch modes. The appropriate model dropdown appears based on your selection.
When in Local mode:
- Your selected local model is shown in the input placeholder
- Responses are labeled with the model name + “(Local)”
- All processing happens on your device
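Once a local engine is loaded, chatting goes through WebLLM's OpenAI-compatible chat API (`engine.chat.completions.create`, per the @mlc-ai/web-llm docs). This is a minimal sketch; the message-building helper and function names are our own, not the app's:

```javascript
// Illustrative: append the new user turn to the conversation history.
function buildMessages(history, userText) {
  return [...history, { role: "user", content: userText }];
}

// Ask a loaded local engine for a reply. All inference happens on-device;
// nothing in this call touches a server.
async function askLocal(engine, history, userText) {
  const reply = await engine.chat.completions.create({
    messages: buildMessages(history, userText),
  });
  return reply.choices[0].message.content;
}
```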
Available Model Categories
General Purpose
- Llama 3.x - Meta’s open models, great all-rounders
- Qwen 2.5/3 - Strong multilingual support
- Mistral 7B - Efficient and capable
- Gemma 2 - Google’s open models
Coding
- Qwen 2.5 Coder - Optimized for code generation
- CodeLlama - Meta’s code-specialized Llama
Math & Reasoning
- Qwen 2.5 Math - Mathematical problem solving
- WizardMath - Step-by-step math reasoning
- DeepSeek R1 - Chain-of-thought reasoning models
Special Purpose
- Phi 3.5 Vision - Can analyze images (experimental)
- Gorilla OpenFunctions - Function/tool calling
Using Custom Models
Any model from the MLC-AI HuggingFace collection works with Local Mode.
To use a custom model:
- Select “Custom Model ID…” from the dropdown
- Enter the model ID exactly as it appears on HuggingFace, for example:
  Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC
- Click Download & Enable
Model IDs follow the format: ModelName-Size-Variant-Quantization-MLC
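A quick sanity check against that naming convention can catch typos before a multi-gigabyte download starts. This parser is a hypothetical illustration (the app may validate differently); it assumes the quantization token looks like `q4f16_1` and that the ID ends in `-MLC`:

```javascript
// Illustrative validator for IDs like Mixtral-8x7B-Instruct-v0.1-q4f16_1-MLC.
// Splits off the quantization token and the -MLC suffix; everything before
// them is treated as the model name (name, size, and variant combined).
function parseModelId(id) {
  const m = /^(.+)-(q\d+f\d+(?:_\d+)?)-MLC$/.exec(id);
  if (!m) return null; // does not follow the expected format
  return { name: m[1], quantization: m[2] };
}
```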
Troubleshooting
“WebGPU not supported”
- Update your browser to the latest version
- Enable hardware acceleration in browser settings
- Try Chrome or Edge if using another browser
“Insufficient GPU memory”
- Try a smaller model
- Close other GPU-intensive applications
- Check if your GPU meets minimum requirements
Model loads slowly
- First load downloads from HuggingFace (1-10 min)
- Subsequent loads use cached data (much faster)
- Ensure stable internet for initial download
Poor response quality
- Smaller models have limited capabilities
- Try a larger model if your GPU supports it
- Local models work best for straightforward tasks
Privacy Note
When using Local Mode:
- Model weights are downloaded from HuggingFace CDN
- All inference runs locally in your browser
- No conversation data is sent to any server
- Models are cached in browser storage (IndexedDB)
To fully clear local model data, clear your browser’s site data for Obscurify.ai.
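If you prefer to clear the cache from a script instead of the browser UI, the stored databases can be removed via IndexedDB. This is a hedged sketch: `indexedDB.databases()` is available in Chromium-based browsers, the `"webllm"` name filter is an assumption (inspect your browser's storage panel for the actual database names), and the function takes the idb implementation as a parameter so the logic is testable:

```javascript
// Sketch: delete cached model databases. Pass the global indexedDB in a page.
// The "webllm" name filter is illustrative - check your storage panel for
// the real database names before deleting.
async function clearModelCache(idb) {
  const dbs = await idb.databases(); // Chromium-based browsers only
  const deleted = [];
  for (const { name } of dbs) {
    if (name && name.includes("webllm")) {
      idb.deleteDatabase(name);
      deleted.push(name);
    }
  }
  return deleted;
}
```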