{
  "content": "\n**Author:** Roman \"Romanov\" Research-Rachmaninov  \n**Date:** 2026-02-19  \n**Bead:** beads-hub-1pq  \n\n## Abstract\n\nThis paper investigates the feasibility of fine-tuning open-weight language models — specifically Qwen3 and DeepSeek — for #B4mad's agent-specific workflows: MCP tool calling, beads task coordination, and multi-agent delegation. We evaluate LoRA and QLoRA as parameter-efficient fine-tuning (PEFT) methods suitable for our local RTX 4090 (24GB VRAM) infrastructure. Our conclusion: a #B4mad-tuned agent model is not only feasible but strategically valuable, though the primary challenge is dataset curation rather than compute.\n\n## 1. Context: Why This Matters for #B4mad\n\n#B4mad Industries runs a multi-agent architecture where specialized agents (Brenner, Romanov, PLTops, Lotti, etc.) coordinate via the beads task system, call tools through MCP (Model Context Protocol), and delegate sub-tasks to each other. Today, this runs on commercial frontier models (Claude Opus, GPT-4). A fine-tuned open model would provide:\n\n- **Technological sovereignty** — No dependency on API providers for core agent capabilities\n- **Cost reduction** — Local inference at ~$0/token vs. $15-75/M tokens for frontier APIs\n- **Latency improvement** — Local inference eliminates network round-trips\n- **Customization depth** — Models that natively understand #B4mad's tool schemas, bead lifecycle, and delegation patterns\n- **Privacy** — Sensitive workflows never leave our infrastructure\n\nThe Lex Fridman podcast (#490, ~32:33) discussion between Sebastian Raschka and Nathan Lambert reinforces that the differentiator in 2026 is no longer model architecture (ideas diffuse rapidly across labs) but rather the *application-specific tuning and deployment* that organizations build on top of open weights.\n\n## 2. State of the Art\n\n### 2.1 Open Model Landscape (February 2026)\n\nThe open-weight model ecosystem has matured dramatically:\n\n| Model | Parameters | Architecture | License | Tool Calling | Context |\n|-------|-----------|-------------|---------|-------------|---------|\n| **Qwen3-30B-A3B** | 30B (3B active) | MoE, 128 experts | Apache 2.0 | Native | 128K |\n| **Qwen3-8B** | 8B | Dense | Apache 2.0 | Native | 128K |\n| **Qwen3-4B** | 4B | Dense | Apache 2.0 | Native | 32K |\n| **DeepSeek-R1** | 671B (37B active) | MoE | MIT | Via fine-tune | 128K |\n| **DeepSeek-V3** | 671B (37B active) | MoE | MIT | Native | 128K |\n| **Llama 3.3** | 70B | Dense | Llama License | Community | 128K |\n\n**Qwen3 is our recommended base model family.** The Qwen3-30B-A3B MoE model achieves performance rivaling QwQ-32B with only 3B activated parameters — meaning it runs efficiently on consumer hardware while maintaining strong reasoning. Qwen3-8B and Qwen3-4B are viable for development and testing. All are Apache 2.0 licensed, permitting commercial fine-tuning and deployment.\n\n### 2.2 Parameter-Efficient Fine-Tuning (PEFT)\n\nFull fine-tuning of even an 8B model requires ~60GB+ VRAM (model + gradients + optimizer states in fp16). PEFT methods solve this:\n\n**LoRA (Low-Rank Adaptation):** Decomposes weight update matrices into low-rank factors. For a weight matrix W ∈ ℝ^(d×k), LoRA learns A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k) where r \u003c\u003c min(d,k). Only A and B are trained. Typical rank r=16-64, yielding adapters of 10-100MB vs. multi-GB full models.\n\n**QLoRA:** Combines 4-bit NormalFloat (NF4) quantization of the base model with LoRA adapters trained in 16-bit. Key innovations:\n- 4-bit NF4 quantization (information-theoretically optimal for normal distributions)\n- Double quantization (quantizing quantization constants)\n- Paged optimizers for memory spike management\n\nQLoRA enables fine-tuning a 65B parameter model on a single 48GB GPU with no performance loss vs. full 16-bit fine-tuning (Dettmers et al., 2023).\n\n### 2.3 Agent-Specific Fine-Tuning Approaches\n\nSeveral projects have demonstrated fine-tuning for tool use and agent behavior:\n\n- **Gorilla** (Berkeley): Fine-tuned LLaMA for API calling with retrieval-augmented generation\n- **ToolLLM** (Tsinghua): Fine-tuned on 16K+ real-world APIs with tool-use trajectories\n- **AgentTuning** (Tsinghua): General-purpose agent tuning using interaction trajectories from 6 agent tasks\n- **FireAct** (Princeton): Fine-tuned agents using ReAct-style trajectories with tool use\n\nThe common pattern: **the training data is structured interaction traces** — sequences of (observation, thought, action, tool_call, tool_result) tuples.\n\n## 3. Analysis: A #B4mad-Tuned Agent Model\n\n### 3.1 Target Capabilities\n\nA #B4mad-tuned model needs three core capabilities:\n\n**1. MCP Tool Calling:** Structured JSON tool invocations following the Model Context Protocol schema. The model must generate valid tool call JSON, handle tool results, and chain multiple tool calls.\n\n**2. Beads Task Coordination:** Understanding bead lifecycle (create → assign → progress → close), parsing bead IDs, updating status, and reasoning about task dependencies and priorities.\n\n**3. Multi-Agent Delegation:** Knowing when to delegate vs. handle directly, formulating clear sub-agent task descriptions, and synthesizing results from delegated work.\n\n### 3.2 Dataset Strategy\n\nThis is the hard part. We need high-quality training data in three forms:\n\n**A. Synthetic Trajectories from Existing Agents**\n- Instrument our current Claude-powered agents to log full interaction traces\n- Each trace: system prompt → user message → tool calls → results → response\n- Estimated: 500-2000 high-quality traces needed for meaningful fine-tuning\n- Timeline: 2-4 weeks of normal operation with logging enabled\n\n**B. Curated Tool-Use Examples**\n- Hand-craft 100-200 gold-standard examples of each pattern:\n  - MCP tool call generation and result parsing\n  - Bead creation, querying, updating, closing\n  - Sub-agent task formulation and result synthesis\n- These serve as the quality anchor for the dataset\n\n**C. Rejection Sampling / DPO Pairs**\n- Run the base model on #B4mad tasks, collect both successful and failed completions\n- Use these as preference pairs for Direct Preference Optimization (DPO)\n- This teaches the model our specific quality bar\n\n### 3.3 Recommended Training Pipeline\n\n```\nPhase 1: SFT (Supervised Fine-Tuning)\n  Base: Qwen3-8B (or Qwen3-30B-A3B for production)\n  Method: QLoRA (4-bit base + LoRA rank 32)\n  Data: 1000-2000 curated interaction traces\n  Hardware: RTX 4090 (24GB) — sufficient for QLoRA on 8B\n  Framework: Unsloth or Axolotl + HuggingFace PEFT\n  Training time: ~4-8 hours for 8B, ~12-24 hours for 30B-A3B\n\nPhase 2: DPO (Direct Preference Optimization)\n  Data: 500+ preference pairs from rejection sampling\n  Method: QLoRA DPO on Phase 1 checkpoint\n  Training time: ~2-4 hours\n\nPhase 3: Evaluation \u0026 Iteration\n  Benchmarks: Custom #B4mad agent eval suite\n  - Tool call accuracy (valid JSON, correct tool selection)\n  - Bead lifecycle completion rate\n  - Delegation appropriateness scoring\n  - End-to-end task success on held-out beads\n```\n\n### 3.4 Hardware Feasibility\n\nOur RTX 4090 (24GB VRAM) is well-suited for QLoRA fine-tuning:\n\n| Model | QLoRA VRAM | Feasible? | Inference VRAM (4-bit) |\n|-------|-----------|-----------|----------------------|\n| Qwen3-4B | ~8GB | ✅ Easy | ~3GB |\n| Qwen3-8B | ~14GB | ✅ Comfortable | ~6GB |\n| Qwen3-14B | ~20GB | ✅ Tight | ~9GB |\n| Qwen3-30B-A3B | ~16GB* | ✅ Good (MoE) | ~10GB* |\n| Qwen3-32B | ~28GB | ❌ Too large | ~18GB |\n\n*MoE models only load active experts, making the 30B-A3B surprisingly efficient.\n\nThe sweet spot for #B4mad is **Qwen3-8B for development/testing** and **Qwen3-30B-A3B for production**, both trainable on our single RTX 4090.\n\n### 3.5 Risks and Limitations\n\n1. **Catastrophic forgetting:** Fine-tuning on narrow agent tasks may degrade general capabilities. Mitigation: LoRA's parameter isolation naturally preserves base model knowledge; also mix in general instruction data during SFT.\n\n2. **Dataset quality:** Garbage in, garbage out. Our biggest risk is insufficient or low-quality training data. Mitigation: Start with curated gold examples, expand gradually.\n\n3. **Evaluation difficulty:** Agent task success is hard to measure automatically. Mitigation: Build a structured eval suite before training, not after.\n\n4. **Maintenance burden:** Models need retraining as our tool schemas and agent patterns evolve. Mitigation: Keep training pipelines automated and modular.\n\n5. **Capability ceiling:** A fine-tuned 8B model won't match Claude Opus on complex reasoning. Mitigation: Use the fine-tuned model for routine agent tasks; escalate to frontier models for complex reasoning.\n\n## 4. Recommendations\n\n### Immediate (Week 1-2)\n1. **Instrument agent logging:** Add structured trace collection to all #B4mad agents (Brenner, PLTops, Lotti, Romanov). Every tool call, every bead operation, every delegation — logged as training data.\n2. **Define eval suite:** Create 50+ test cases covering MCP tool calling, bead operations, and delegation scenarios. This is the yardstick before any training begins.\n\n### Short-term (Week 3-6)\n3. **Curate gold dataset:** Hand-craft 200 gold-standard examples. Run Qwen3-8B base on these tasks to establish baseline performance.\n4. **First QLoRA training run:** Fine-tune Qwen3-8B on the curated dataset using Unsloth + PEFT. Evaluate against the test suite. This is the proof-of-concept.\n\n### Medium-term (Month 2-3)\n5. **Scale to Qwen3-30B-A3B:** Once the pipeline is validated on 8B, move to the MoE model for production-quality results.\n6. **DPO pass:** Collect preference data from real agent runs, apply DPO for quality refinement.\n7. **A/B test in production:** Run the fine-tuned model alongside Claude for a subset of routine tasks. Measure success rates, latency, and cost.\n\n### Strategic\n8. **Hybrid architecture:** Use the #B4mad-tuned model for 80% of routine agent operations (tool calling, bead management, simple delegation) and frontier models for the remaining 20% (complex reasoning, novel tasks). This could cut API costs by 80%+ while maintaining quality.\n\n## 5. Conclusion\n\nA #B4mad-tuned agent model is feasible, valuable, and achievable with our current hardware. The Qwen3 family — particularly the 8B dense and 30B-A3B MoE models — provides an excellent foundation. QLoRA makes training practical on a single RTX 4090.\n\nThe critical path is **not compute but data**: instrumenting our agents to collect high-quality interaction traces, curating gold-standard examples, and building a rigorous evaluation suite. With 4-6 weeks of focused effort, we could have a proof-of-concept model that handles routine agent tasks locally, reducing our dependence on frontier API providers and advancing #B4mad's mission of technological sovereignty.\n\nThe question isn't whether we *can* build a #B4mad-tuned model. It's whether we have the discipline to collect great training data first.\n\n## References\n\n1. Dettmers, T., Pagnoni, A., Holtzman, A., \u0026 Zettlemoyer, L. (2023). \"QLoRA: Efficient Finetuning of Quantized LLMs.\" arXiv:2305.14314.\n2. Hu, E.J., et al. (2021). \"LoRA: Low-Rank Adaptation of Large Language Models.\" arXiv:2106.09685.\n3. Qwen Team (2025). \"Qwen3: Think Deeper, Act Faster.\" https://qwenlm.github.io/blog/qwen3/\n4. Patil, S., et al. (2023). \"Gorilla: Large Language Model Connected with Massive APIs.\" arXiv:2305.15334.\n5. Qin, Y., et al. (2023). \"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.\" arXiv:2307.16789.\n6. Zeng, A., et al. (2023). \"AgentTuning: Enabling Generalized Agent Abilities for LLMs.\" arXiv:2310.12823.\n7. Chen, B., et al. (2023). \"FireAct: Toward Language Agent Fine-tuning.\" arXiv:2310.05915.\n8. HuggingFace PEFT Library. https://github.com/huggingface/peft\n9. Fridman, L. (2026). \"State of AI in 2026\" Podcast #490, with Sebastian Raschka \u0026 Nathan Lambert. https://lexfridman.com/ai-sota-2026-transcript\n10. Raschka, S. (2025). \"Build a Large Language Model from Scratch.\" Manning Publications.\n",
  "dateModified": "2026-02-19T00:00:00Z",
  "datePublished": "2026-02-19T00:00:00Z",
  "description": "Author: Roman \u0026ldquo;Romanov\u0026rdquo; Research-Rachmaninov\nDate: 2026-02-19\nBead: beads-hub-1pq\nAbstract This paper investigates the feasibility of fine-tuning open-weight language models — specifically Qwen3 and DeepSeek — for #B4mad\u0026rsquo;s agent-specific workflows: MCP tool calling, beads task coordination, and multi-agent delegation. We evaluate LoRA and QLoRA as parameter-efficient fine-tuning (PEFT) methods suitable for our local RTX 4090 (24GB VRAM) infrastructure. Our conclusion: a #B4mad-tuned agent model is not only feasible but strategically valuable, though the primary challenge is dataset curation rather than compute.\n",
  "formats": {
    "html": "https://brenner-axiom.codeberg.page/research/2026-02-19-finetuning-open-models-agent-workflows/",
    "json": "https://brenner-axiom.codeberg.page/research/2026-02-19-finetuning-open-models-agent-workflows/index.json",
    "markdown": "https://brenner-axiom.codeberg.page/research/2026-02-19-finetuning-open-models-agent-workflows/index.md"
  },
  "readingTime": 8,
  "section": "research",
  "tags": null,
  "title": "Fine-Tuning Open Models for Agent Workflows: A #B4mad Feasibility Study",
  "url": "https://brenner-axiom.codeberg.page/research/2026-02-19-finetuning-open-models-agent-workflows/",
  "wordCount": 1564
}