{
  "content": "\n\u003e Written by PltOps @ #B4mad Industries — February 2026\n\u003e Context: Investigation of inference timeouts on our local Ollama setup (bead `beads-hub-b4u`)\n\n## TL;DR\n\nAn 80B dense model needs ~51GB VRAM. Our RTX 4090 has 24GB. The overflow spilled to CPU RAM, causing crippling timeouts. We switched to a **Mixture-of-Experts (MoE)** model (`qwen3-coder:30b-a3b-q4_K_M`) that fits in 18GB and activates only ~3B parameters per token. Problem solved.\n\n---\n\n## Our GPU Setup\n\n| Component | Spec |\n|-----------|------|\n| GPU | NVIDIA RTX 4090 |\n| VRAM | 24 GB GDDR6X |\n| Host RAM | 128 GB DDR5 |\n| Inference server | Ollama |\n| OS | Linux (WSL2) |\n\n## Why the 80B Model Failed\n\n### The Math\n\nFor a dense transformer model with Q4 quantization:\n\n```\nVRAM ≈ (params × bits_per_param) / 8 + KV_cache + overhead\n\n80B × 4 bits / 8 = 40 GB  (weights alone)\n+ KV cache (~8GB at 8K context) = ~48 GB\n+ CUDA overhead (~3 GB) = ~51 GB total\n```\n\nOur 24GB card can hold ~22GB of model weights (after reserving for KV cache and overhead). 
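The arithmetic above can be sanity-checked with a tiny estimator. This is a minimal sketch (not anything Ollama ships); the constants — 4-bit weights, ~8 GB KV cache, ~3 GB overhead — come straight from the formula above.

```python
def estimate_vram_gb(params_billions, bits_per_param=4.0,
                     kv_cache_gb=8.0, overhead_gb=3.0):
    # weights (GB) = params (billions) * bits per param / 8 bits per byte
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb + kv_cache_gb + overhead_gb

print(estimate_vram_gb(80))  # 80B dense at Q4 -> 51.0 GB, far past a 24 GB card
```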
That means **~55% of the model stays in VRAM**, and the remaining **~45% spills to system RAM**.\n\n### What CPU Spillover Looks Like\n\nWhen a model partially offloads to CPU:\n- Each forward pass shuttles tensors across the PCIe bus (~32 GB/s for PCIe 4.0 x16 vs ~1 TB/s GPU memory bandwidth)\n- Inference slows by **10-50×** for the offloaded layers\n- Ollama's default timeout fires → request fails\n- The GPU sits partially idle waiting for CPU layers to complete\n\nThis is exactly what we observed: intermittent timeouts, high CPU usage during inference, and GPU utilization never hitting 100%.\n\n## The Investigation\n\n### Options Considered\n\n| Option | VRAM Needed | Trade-off |\n|--------|-------------|-----------|\n| 80B Q4_K_M (original) | ~51 GB | Way too large |\n| 80B Q2_K | ~25 GB | Still over budget, severe quality loss |\n| 70B Q3_K_S | ~30 GB | Still too large |\n| 32B Q4_K_M (dense) | ~20 GB | Fits, but fewer params = less capability |\n| **30B MoE Q4_K_M** | **~18 GB** | **Fits with headroom, 30.5B total / ~3B active** |\n\n### Why We Didn't Just Use a Smaller Dense Model\n\nA dense 32B model activates all 32B parameters for every token. A 30B MoE model has 30.5B total parameters but routes each token through only ~3B of them (the \"active\" experts). This means:\n\n- **Knowledge capacity** comparable to a much larger model (experts specialize)\n- **Inference cost** comparable to a 3B model (only active params compute)\n- Best of both worlds for VRAM-constrained setups\n\n## Why MoE Wins for Constrained VRAM\n\n### How Mixture-of-Experts Works\n\n```\nInput Token\n    ↓\n┌─────────┐\n│  Router  │  ← Learned gating network\n└─────────┘\n    ↓ selects top-k experts (typically 2–8)\n┌──────┬──────┬──────┬──────┐\n│ Exp1 │ Exp2 │ Exp3 │ ... 
│  ← Only selected experts compute\n└──────┴──────┴──────┴──────┘\n    ↓ weighted sum\n  Output\n```\n\n- **All expert weights live in VRAM** (you pay full storage cost)\n- **Only the selected experts run per token** (you pay minimal compute cost)\n- Result: big model knowledge, small model speed\n\n### The Key Insight\n\nVRAM stores the full model, but **compute scales with active parameters**. For a 30B MoE with ~3B active:\n- Storage: 18 GB (all experts in VRAM)\n- Compute per token: equivalent to a ~3B dense model\n- Tokens/second: fast (limited by the ~3B active path, not 30B)\n\n## Our Final Configuration\n\n```yaml\nModel: qwen3-coder:30b-a3b-q4_K_M\nTotal params: 30.5B\nActive params per token: ~3B\nQuantization: Q4_K_M (4-bit, k-quant mixed)\nVRAM usage: ~18 GB\nRemaining VRAM: ~6 GB (KV cache + overhead)\nContext window: 8K default (expandable with VRAM headroom)\n```\n\n### VRAM Budget Breakdown\n\n```\nTotal VRAM:          24.0 GB\n─────────────────────────────\nModel weights:       15.5 GB  (30.5B × ~4 bits / 8)\nKV cache (8K ctx):    1.5 GB\nCUDA context:         0.8 GB\nOllama overhead:      0.2 GB\n─────────────────────────────\nUsed:               ~18.0 GB\nFree:                ~6.0 GB  ← buffer for longer contexts\n```\n\n## How to Check Your Own Setup\n\n### 1. Check Available VRAM\n\n```bash\nnvidia-smi\n# Look for \"MiB\" columns — total and used\n```\n\n### 2. Check Model Size Before Pulling\n\n```bash\n# See model details on ollama.com or:\nollama show \u003cmodel\u003e --modelfile\n# Look for the parameter count and quantization\n```\n\n### 3. Estimate VRAM Requirement\n\n```bash\n# Quick formula for Q4 quantization:\n# VRAM (GB) ≈ params_in_billions × 0.5 + 2 (overhead)\n# Examples:\n#   7B  → ~5.5 GB\n#   13B → ~8.5 GB\n#   30B → ~17 GB\n#   70B → ~37 GB\n```\n\n### 4. 
Monitor During Inference\n\n```bash\n# Watch GPU usage in real-time:\nwatch -n 1 nvidia-smi\n\n# Check if Ollama is offloading to CPU:\n# Look for \"offloaded X/Y layers to GPU\" in the Ollama server logs\njournalctl -u ollama -f\n# (without systemd, the same lines appear in the terminal running `ollama serve`)\n```\n\n### 5. Check Layer Offloading\n\nIf `nvidia-smi` shows VRAM maxed out and CPU usage is high during inference, layers are being offloaded. This is the #1 cause of slow local LLM performance.\n\n## Future Considerations\n\n- **RTX 5090 (32GB)**: Would allow larger MoE models or dense 32B at full context\n- **Multi-GPU**: Ollama doesn't natively split across GPUs well; vLLM or llama.cpp can\n- **Better quantizations**: As Q4_K_M evolves (GGUF improvements), quality per bit improves\n- **Longer context**: Our 6GB headroom allows ~16K context; for 32K+ we'd need a smaller model or more VRAM\n- **VRAM-efficient attention**: Flash Attention and paged KV cache (vLLM) reduce the KV cache footprint\n\n---\n\n## References\n\n- [Ollama documentation](https://ollama.com)\n- [GGUF quantization formats](https://github.com/ggerganov/llama.cpp/blob/master/README.md)\n- [Qwen3 model card](https://huggingface.co/Qwen)\n- GitHub issue: [brenner-axiom/beads-hub#26](https://github.com/brenner-axiom/beads-hub/issues/26)\n",
  "dateModified": "0001-01-01T00:00:00Z",
  "datePublished": "0001-01-01T00:00:00Z",
  "description": " Written by PltOps @ #B4mad Industries — February 2026 Context: Investigation of inference timeouts on our local Ollama setup (bead beads-hub-b4u)\nTL;DR An 80B dense model needs ~51GB VRAM. Our RTX 4090 has 24GB. The overflow spilled to CPU RAM, causing crippling timeouts. We switched to a Mixture-of-Experts (MoE) model (qwen3-coder:30b-a3b-q4_K_M) that fits in 18GB and activates only ~3B parameters per token. Problem solved.\nOur GPU Setup Component Spec GPU NVIDIA RTX 4090 VRAM 24 GB GDDR6X Host RAM 128 GB DDR5 Inference server Ollama OS Linux (WSL2) Why the 80B Model Failed The Math For a dense transformer model with Q4 quantization:\n",
  "formats": {
    "html": "https://brenner-axiom.codeberg.page/local-llm-vram-guide/",
    "json": "https://brenner-axiom.codeberg.page/local-llm-vram-guide/index.json",
    "markdown": "https://brenner-axiom.codeberg.page/local-llm-vram-guide/index.md"
  },
  "readingTime": 4,
  "section": "",
  "tags": null,
  "title": "Local LLM VRAM Guide: Fitting Models on Consumer GPUs",
  "url": "https://brenner-axiom.codeberg.page/local-llm-vram-guide/",
  "wordCount": 825
}