The Miniloader Model Family: Benchmarks, Capabilities, and Why We Chose These Models
Every model that ships inside Miniloader's Basic Brain goes through a full capability benchmark before it earns a place in the default catalogue. This post covers the April 2026 run: seven models from three families, tested on a single consumer desktop under the Vulkan backend that the majority of Miniloader users will be running day-to-day.
These are not arbitrary picks. Each model was chosen because it represents a genuine frontier in what a local, consumer-grade machine can do without a cloud subscription.
Why These Models
The Miniloader model family is built around three criteria.
Multimodal as standard. Most local model catalogues are still text-only. We treat image understanding as a baseline requirement, not a premium feature. Four of the seven models tested here pass the vision check out of the box.
Reasoning that ships. Chain-of-thought thinking modes are no longer an API-only feature. Six of the seven models in this run have a functioning think mode verified under our harness; only LFM2.5 1.2B Instruct lacks one. That means your local agent can reason through a problem before committing to an answer, the same way cloud frontier models do.
Native tool use. An agent that cannot call functions is just a chatbot. All seven models in this run parse native tool schemas and complete end-to-end tool roundtrips correctly. That is what makes them suitable for Miniloader's agent pipeline rather than just a Q&A terminal.
The Qwen3, Gemma 4, and LFM2 families were specifically selected because they are the most technically current models available as consumer GGUF builds at the time of this test. We are not picking popular models. We are picking the ones that represent where open-weight AI actually is right now.
Test Environment
The benchmark was run on a single consumer desktop with the following configuration.
- GPU: AMD Radeon RX 7900 XTX (24 GB VRAM)
- CPU: Intel Core i7-12700K (16 physical cores, 24 logical)
- RAM: 32 GB
- OS: Windows 11 build 26200
- Backend: Vulkan (llama.cpp b8640, JamePeng 0.3.35)
Vulkan was chosen because it is the backend Miniloader defaults to on AMD hardware, and AMD is the GPU family most underserved by existing local AI tooling. NVIDIA users running CUDA will see higher throughput numbers across the board. These results represent a realistic floor for the majority of non-NVIDIA Miniloader users.
Each model ran five passes of 100-token generation. The reported speed is the average of those passes after trimming the statistical outlier. Load time is the cold-start time from disk. All tests were run sequentially on a freshly loaded instance per model.
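The trimming step can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the function name `trimmed_speed` is hypothetical, the rule for picking the outlier (the pass farthest from the median) is an assumption, and the sample speeds are invented for the example rather than taken from this run.

```python
from statistics import mean, median, stdev

def trimmed_speed(passes_tok_s):
    """Drop the single pass farthest from the median, then average the rest.

    Assumption: "the statistical outlier" means the pass with the largest
    absolute distance from the median of all passes.
    """
    if len(passes_tok_s) < 3:
        raise ValueError("need at least three passes to trim one")
    mid = median(passes_tok_s)
    outlier_idx = max(
        range(len(passes_tok_s)),
        key=lambda i: abs(passes_tok_s[i] - mid),
    )
    kept = [v for i, v in enumerate(passes_tok_s) if i != outlier_idx]
    # Sample standard deviation of the kept passes shows how steady throughput was.
    return mean(kept), stdev(kept)

# Hypothetical per-pass speeds in tok/s, not measured values.
avg, spread = trimmed_speed([178.0, 181.0, 180.0, 140.0, 181.0])
print(f"{avg:.1f} tok/s (stddev {spread:.2f})")
```

Trimming a single pass keeps one slow cold pass from dragging the headline number down, at the cost of hiding that pass from the average. That is one reason to look at the standard deviation alongside the speed.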
Results at a Glance
| Model | Size | Speed (tok/s) | Vision | Thinking | Tool Use |
|---|---|---|---|---|---|
| LFM2.5 1.2B Instruct | 0.68 GB | 354 | No | No | Yes |
| Qwen3 4B | 2.33 GB | 180 | No | Yes | Yes |
| Gemma 4 E4B IT | 4.97 GB | 105 | Yes | Yes | Yes |
| Gemma 4 26B A4B IT | 15.64 GB | 131 | Yes | Yes | Yes |
| Gemma 4 31B IT | 17.40 GB | 34 | Yes | Yes | Yes |
| Qwen3.6 35B A3B | 20.61 GB | 23 | Yes | Yes | Yes |
| LFM2 2.6B Exp | 1.46 GB | see notes | No | Yes | Yes |
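All seven models passed the conversation, native tool parsing, and end-to-end tool roundtrip tests. Speeds are tok/s at Q4_K_M quantisation, averaged over the four passes that remain after the outlier is trimmed from each five-pass run. The LFM2 2.6B Exp speed result was flagged as invalid and is discussed in its breakdown below.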
Model Breakdowns
LFM2.5 1.2B Instruct
354 tok/s / 0.68 GB
The fastest model in the run by a significant margin. At under 700 MB on disk and cold-loading in under 600 ms, this is the model Miniloader routes to for low-stakes work: formatting, short rewrites, quick summarization, and routing decisions. It does not have a thinking mode and does not handle images, but it does pass native tool calling, which means it can be used as a fast dispatch layer in a multi-agent pipeline. If your use case is speed and your prompts are simple, nothing in this catalogue comes close.
Qwen3 4B
180 tok/s / 2.33 GB
The sweet spot in the Qwen3 family for most users. It is fast enough to feel interactive, small enough to leave headroom for other workloads, and it passes every text capability check: native thinking mode, native tool calling, and end-to-end tool roundtrip. It does not pass the image test because it is a text-only architecture, but for pure text agents it is the default recommendation. The thinking mode verification confirmed that the model correctly produces and strips scratchpad reasoning when instructed. Tool roundtrip was clean: both test functions called, results used correctly, final answer accurate.
Gemma 4 E4B IT
105 tok/s / 4.97 GB
Google's smallest Gemma 4 instruction model and the entry point into the multimodal range of the catalogue. It passes the image test, thinking mode, and both tool tests. At just under 5 GB it fits comfortably on any GPU with 6 GB of VRAM, making it the multimodal baseline for budget hardware. The E4B designation denotes roughly 4 billion effective parameters rather than MoE routing, which is consistent with this variant being logged as a dense model in the metadata. Realistically, this is the model most Miniloader users with a mid-range GPU will be running when they need vision plus reasonable speed.
Gemma 4 26B A4B IT
131 tok/s / 15.64 GB
The most interesting result in the run. At 26 billion total parameters with a sparse MoE architecture (128 experts, top 8 active per token), this model punches far above its effective compute weight because only a fraction of its weights are active for any given token. It runs faster than the larger 31B dense model while delivering full vision, thinking, and tool support. The 15.64 GB footprint needs at least 16 GB of VRAM to run fully offloaded, which is comfortable on an RX 7900 XTX (24 GB), an RTX 3090, or similar cards. For users who have the VRAM headroom, this is the recommended daily-driver model: near-frontier capability at interactive speeds.
Gemma 4 31B IT
34 tok/s / 17.40 GB
The dense 31B variant of Gemma 4. It passes every capability test and handles image understanding cleanly, but at 34 tok/s it is noticeably slower than the 26B MoE model above it. This is the expected cost of a dense architecture at this scale. It is still perfectly usable for non-interactive workloads like document analysis, long-form drafting, and batch processing where throughput matters less than quality. Users choosing between the 26B A4B and this model should default to the MoE variant unless they have a specific reason to prefer a dense architecture.
Qwen3.6 35B A3B
23 tok/s / 20.61 GB
The largest and most capable model in this run. Qwen3.6 is a 35-billion parameter sparse MoE model (256 experts, 8 active) from Alibaba, and it is the first Qwen-family model in the catalogue to pass the vision test. At 20.6 GB it requires a 24 GB card to run fully on-GPU, which matches the RX 7900 XTX's capacity and leaves limited headroom. Speed is the trade-off: 23 tok/s is functional but not conversational for most people. This is a model for deliberate, high-quality work: research synthesis, complex reasoning chains, code review, multi-step analysis. Its 256-expert MoE routing makes it the most forward-looking architecture in the run, and we expect it to be the ceiling of the Miniloader local family until 40-50B MoE models become consumer-accessible.
LFM2 2.6B Exp (Experimental)
Throughput flagged / 1.46 GB
The LFM2 2.6B Experimental model is an unusual case. It passed conversation, thinking mode, native tool parsing, and end-to-end tool roundtrip correctly, but the speed benchmark returned inconsistent token counts across passes, triggering a throughput_invalid flag in the harness. Two of five passes produced very short outputs, which inflated variance and disqualified the average. This is consistent with the model being an experimental pre-release. Capability-wise it is complete. The thinking mode verification was particularly clean: the model correctly wrapped scratchpad reasoning inside the LFM2 thinking format and stripped it on request. We are tracking this model and expect a stable release to clear the throughput test cleanly.
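A check of this kind could look roughly like the sketch below. Only the flag name `throughput_invalid` and the 100-token budget come from this post. The helper `short_passes`, the 90 percent fill threshold, and the sample token counts are hypothetical and chosen purely for illustration.

```python
def short_passes(token_counts, budget=100, min_fill=0.9):
    """Return the indices of passes that generated far fewer tokens than the budget.

    Assumption: a pass counts as "short" when it produces less than
    min_fill * budget tokens. The real harness threshold is not documented here.
    """
    return [i for i, n in enumerate(token_counts) if n < budget * min_fill]

def throughput_invalid(token_counts, budget=100, min_fill=0.9):
    """Flag the run when any pass stopped early, since tok/s is then not comparable."""
    return bool(short_passes(token_counts, budget, min_fill))

# Hypothetical token counts for five passes, two of which ended early.
counts = [100, 100, 12, 100, 9]
print(short_passes(counts), throughput_invalid(counts))
```

Flagging the run instead of averaging it avoids publishing a speed figure dominated by passes where fixed per-request overhead outweighs generation time.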
What the Tests Actually Verified
The benchmark runs more than just a speed loop. Each model goes through six distinct checks.
Conversation. A basic identity question confirms the model is loaded correctly and generating coherent text.
Speed. Five independent generation passes at a fixed token budget. The outlier pass is trimmed and the remaining four are averaged. Standard deviation is also captured: low stddev means consistent throughput, high stddev means the model is latency-variable under load.
Image understanding. A vision prompt against a test image. Models that do not support image input are skipped automatically. Passing requires a coherent, accurate description. All four multimodal models in this run passed.
Thinking mode. A math problem that benefits from step-by-step reasoning. The harness checks that the model both enters the thinking scratchpad and produces a correct final answer. It also verifies that the thought block can be cleanly stripped from the output, which is required for agent pipelines where internal reasoning should not be surfaced to the end user.
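As a rough illustration of the stripping step, the sketch below assumes the scratchpad is wrapped in `<think>...</think>` tags. That tag format is an assumption for the example: each model family uses its own thinking format, and the function names `split_thinking` and `thinking_check` are hypothetical rather than part of the Miniloader harness.

```python
import re

# Assumed scratchpad delimiters; real models may use a different format.
THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw):
    """Return (thoughts, visible_answer) from raw model output."""
    thoughts = [t.strip() for t in THINK.findall(raw)]
    visible = THINK.sub("", raw).strip()
    return thoughts, visible

def thinking_check(raw, expected):
    """Pass when the model used the scratchpad, the stripped output is clean,
    and the expected answer appears in the visible text."""
    thoughts, visible = split_thinking(raw)
    return bool(thoughts) and "<think>" not in visible and expected in visible

# Hypothetical model output for a small arithmetic prompt.
raw = "<think>17 * 3 = 51, plus 4 is 55.</think>\nThe answer is 55."
thoughts, answer = split_thinking(raw)
print(answer)
```

Keeping the thoughts as a separate return value lets an agent pipeline log the reasoning for debugging while showing only the visible answer to the end user.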
Native tool parsing. A structured tool call prompt with a defined function schema. The harness verifies that the model emits a correctly formatted tool call matching the schema, without extra text or malformed JSON.
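A parsing check along these lines might look like the following sketch. It assumes the model emits the tool call as a bare JSON object with `name` and `arguments` keys, and that the schema follows a JSON-Schema-like shape. Both are assumptions for illustration, as are the parameter names `a` and `b` for `add_numbers` and the helper name `valid_tool_call`.

```python
import json

# Hypothetical schema; only the function name add_numbers comes from the post.
ADD_NUMBERS_SCHEMA = {
    "name": "add_numbers",
    "parameters": {
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    },
}

PY_TYPES = {"number": (int, float), "string": str, "boolean": bool}

def valid_tool_call(output, schema):
    """Return True when output is exactly one JSON tool call matching the schema."""
    try:
        call = json.loads(output)  # raises on surrounding prose or malformed JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    params = schema["parameters"]
    props = params["properties"]
    if any(key not in args for key in params.get("required", [])):
        return False
    for key, value in args.items():
        if key not in props:
            return False  # argument not declared in the schema
        expected = PY_TYPES.get(props[key]["type"], object)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True

print(valid_tool_call('{"name": "add_numbers", "arguments": {"a": 2, "b": 3}}',
                      ADD_NUMBERS_SCHEMA))
```

Parsing the whole output with `json.loads` is a deliberately strict choice: any chatter before or after the JSON object fails the check, which matches the "without extra text" requirement.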
Tool roundtrip. The full agent loop: tool call emitted, tool results injected back, final answer generated using those results. Both test functions (add_numbers and get_current_time) must be called, the injected values must appear in the final answer, and the answer must be factually correct. All seven models completed this test successfully.
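The pass criteria for the roundtrip can be sketched as a small checker. The function names `add_numbers` and `get_current_time` come from the test described above. The helper `roundtrip_passed`, the shape of `tool_results`, and the sample values are hypothetical. Note that this sketch covers only the first two criteria (both tools called, injected values present); judging factual correctness of the answer is outside its scope.

```python
REQUIRED_TOOLS = ("add_numbers", "get_current_time")

def roundtrip_passed(tool_results, final_answer, required=REQUIRED_TOOLS):
    """Check that every required tool was called and its result was used.

    tool_results maps each tool the model actually called to the value
    that was injected back into the conversation (an assumed format).
    """
    for name in required:
        if name not in tool_results:
            return False  # the model never called this tool
        if str(tool_results[name]) not in final_answer:
            return False  # the injected value is missing from the answer
    return True

# Hypothetical injected results and final answer.
results = {"add_numbers": 42, "get_current_time": "14:05"}
print(roundtrip_passed(results, "The sum is 42 and the time is 14:05."))
```

Injecting values the model could not plausibly guess, such as a fixed fake time, makes the substring check a reasonable signal that the model read the tool results rather than answering from memory.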
On the Hardware
The RX 7900 XTX is a good representative of the upper end of non-NVIDIA consumer hardware. It is not the fastest card for AI workloads (that remains NVIDIA's territory) but its 24 GB of VRAM makes it unusually capable for local model hosting, and its Vulkan support means it runs correctly in Miniloader without driver gymnastics.
Users with an RTX 3090, 4090, or similar NVIDIA card will see 15 to 40 percent higher throughput on most models due to more mature CUDA kernels in llama.cpp. Users with 8 or 12 GB of VRAM will need to rely on partial GPU offloading for the larger models, which reduces speed further.
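For a rough sense of which models need partial offloading on a smaller card, the sketch below compares the on-disk sizes from the results table against a VRAM budget. It is a crude estimate under stated assumptions: the function `gpu_fraction` and the `reserve_gb` parameter are hypothetical, weights are treated as evenly divisible (real offloading happens in whole layers), and KV cache and context memory are ignored unless you account for them through `reserve_gb`.

```python
# On-disk sizes in GB, copied from the results table.
SIZES_GB = {
    "LFM2.5 1.2B Instruct": 0.68,
    "LFM2 2.6B Exp": 1.46,
    "Qwen3 4B": 2.33,
    "Gemma 4 E4B IT": 4.97,
    "Gemma 4 26B A4B IT": 15.64,
    "Gemma 4 31B IT": 17.40,
    "Qwen3.6 35B A3B": 20.61,
}

def gpu_fraction(model_gb, vram_gb, reserve_gb=0.0):
    """Estimate the share of model weights that fits in VRAM (0.0 to 1.0).

    reserve_gb is VRAM set aside for everything that is not weights;
    the default of 0.0 is optimistic and a real run needs more.
    """
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(usable / model_gb, 1.0)

for name, size in SIZES_GB.items():
    frac = gpu_fraction(size, vram_gb=12)
    label = "full offload" if frac == 1.0 else f"about {frac:.0%} on GPU"
    print(f"{name}: {label}")
```

On a 12 GB budget this estimate puts the four smallest models fully on the GPU and the three largest into partial offload, which is where the speed penalty described above applies.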
These numbers are a snapshot of one real build running the Miniloader stack as shipped. No cherry-picked backends, no manual kernel tuning. What you see here is what you get.
The Miniloader Local Model Family
The models in this post are not listed in a compatibility table. They are supported. That means Miniloader ships with hardcoded chat templates, tested tool parsing, and verified thinking mode behaviour for each one. When you load any of these models into Basic Brain, you are getting a known-good configuration rather than a best-guess attempt.
The families in this run reflect our current view of where open-weight AI is heading: smaller MoE models that outperform larger dense ones, vision as a standard feature rather than a premium addon, and reasoning modes that work out of the box without prompt engineering. These are the most capable models available to run locally on consumer hardware today. That is why we chose them.
We will publish updated benchmarks as new model families are released and as the Miniloader runtime matures. The April 2026 run sets the baseline.
