Key findings (TL;DR)
- First public REAP-40% variant of MiniMax-M2.7, published 72 hours after the model's release
- 229 billion parameters reduced to 139 billion with no measurable quality loss (5/5 smoke test, 83.3% HumanEval pass@1 on completed problems)
- Six open model formats on HuggingFace: Safetensors BF16, GGUF in five quantizations, FP8, NVFP4, AWQ, and NVFP4-GB10
- The Q4_K_M variant (84 GB) fits on a 96 GB Mac Studio M4 Max with room for KV-cache
- Part of m51 Lab's open research series: previously NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2
MiniMax AI released M2.7 on April 12, 2026: a 229-billion-parameter Mixture-of-Experts model with 10 billion active parameters per token, too large in its standard quantization to run outside a data center. Three days later, m51 Lab had published open pruned variants in six formats, the smallest of which (GGUF Q4_K_M at 84 GB) runs on a 96 GB Mac Studio.
This is the story of that work, and why a marketing platform does it.
Brief technical context: MoE and REAP
Mixture-of-Experts (MoE) is the architecture behind most frontier models in 2026. Instead of all parameters being active for every token, the model routes each token to a small subset of so-called experts. MiniMax-M2.7 has 256 experts per layer and uses top-8 routing, giving it the quality of a 229-billion-parameter model at the compute cost of a 10-billion-parameter model, provided you can get it to run.
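Top-k routing is simple to sketch. The following minimal example is illustrative only (the real router runs inside every transformer layer, and everything here except the 256-expert / top-8 configuration is a toy value): per token, select the 8 highest-scoring experts and softmax-normalize their gate weights.

```python
import numpy as np

def route_tokens(router_logits: np.ndarray, top_k: int = 8):
    """For each token, pick the top_k experts and softmax-normalize their gates."""
    # Indices of the top_k largest logits per token (order within top_k is irrelevant)
    top_idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    # Softmax over the selected logits only
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

# 4 tokens routed across 256 experts, top-8 (the M2.7 configuration)
logits = np.random.default_rng(0).normal(size=(4, 256))
experts, gates = route_tokens(logits)
print(experts.shape, gates.shape)  # (4, 8) (4, 8)
```

Only the 8 selected experts run their feed-forward pass per token, which is why compute scales with active parameters, not total parameters.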
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research, published October 2025, identifies the least important experts in an MoE model based on how often they activate and how much they contribute on calibration data. REAP-40% removes 40% of the experts (256 to 154 per layer) and cuts the model size from 140 GB to 84 GB in Q4_K_M with no measurable effect on quality.
The catalyst for this project: the official Cerebras REAP implementation supports six model families (Qwen, Llama, Mixtral, DeepSeek, Ernie, GLM). MiniMax was not on the list. No one else had published an M2.7 REAP. The first-mover window was open.
Why m51 Lab does this
M51 AI OS is an AI-driven marketing platform that runs 17 specialized AI agents daily for Norwegian customers: agents that write ad copy, generate reports, produce SEO audits, analyze competitors, and coordinate campaigns. When we recommend a model or architecture to a client, we take responsibility for it actually working under real conditions.
The problem is that the AI industry communicates in headlines. "GPT 5.4 scores 92% on MMLU." "Claude Opus 4.6 leads SWE-bench." "Gemini 3.1 has 1M context." All true. None of it tells you how the model behaves when you force it to run on a fraction of the data center, in continuous operation, with realistic traffic.
That is why we have the Lab division. We push models and tools until we know where the limits lie. The MiniMax-M2.7 project is the latest piece in an ongoing series.
Read also: We fine-tuned an open-source AI to write SEO audits, and tested it against Claude Sonnet
Three days, six deliverables
The pipeline had eight stages: FP8 download, BF16 dequantization, REAP pruning, imatrix calibration, GGUF conversion, multi-format quantization, evaluation, and publication. Estimated wall clock: 22 to 28 hours. Actual active pipeline time: 20 hours over three days. Total GPU cost: approximately 775 USD.
Day 1: Diagnosis
The official REAP code has two registries for model support, MODEL_ATTRS and OBSERVER_CONFIG_REGISTRY. MiniMax-M2.7 was in neither. After five restart rounds patching dataset formats, chat templates, and transformers versions, we reached a first successful forward pass on 4x H200.
That is where we hit the real wall. The REAP observer allocates (number of experts x batch x sequence x hidden_dim) x 4 bytes in FP32 per MoE block. For MiniMax-M2.7 that meant 96 GB per layer at batch=16, and no single flag turns it off. This was a structural incompatibility between the REAP observer design and MiniMax's architecture parameters, not a fault in the model itself. Lesson: FP32 activations, not the weights, are the hidden memory killer.
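Plugging numbers into that formula shows why it blows up, and why halving the element size helps so much. The sequence length and hidden_dim below are assumptions chosen for illustration (they happen to reproduce a figure in the same ballpark as the 96 GB we measured); only the expert count and batch size come from the text.

```python
def observer_bytes(n_experts, batch, seq_len, hidden_dim, bytes_per_element=4):
    """Activation buffer the observer allocates per MoE block."""
    return n_experts * batch * seq_len * hidden_dim * bytes_per_element

# 256 experts, batch 16; seq_len=2048 and hidden_dim=3072 are illustrative assumptions
fp32 = observer_bytes(256, 16, 2048, 3072)                        # as shipped
bf16 = observer_bytes(256, 16, 2048, 3072, bytes_per_element=2)   # after a BF16 patch
print(fp32 / 2**30, bf16 / 2**30)  # 96.0 48.0
```

The expert count is the multiplier that sets MiniMax apart: at 256 experts per layer, any per-expert activation buffer is 256x the cost of a dense equivalent.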
Day 2: Pruning
The solution was to patch the observer code itself, switching the allocations from FP32 to BF16. That halves the memory. With 8x H100 SXM and batch=4, we hit 41 seconds per sample, linear and stable; 768 calibration samples took 8 hours and 45 minutes. The forward pass went through, the save step first crashed on a small type bug (num_experts passed as int instead of string), and then we caught that REAP had pruned the expert weights but left the e_score_correction_bias tensors untouched. 30 lines of Python rewrote 55 of 56 shards with the correct bias shape.
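The shard rewrite boiled down to one core operation: slicing every per-expert bias tensor down to the surviving expert indices so its shape matches the pruned weight matrices. A minimal numpy sketch of that step (the real script also rewrote the safetensors shards and index file; the expert indices here are illustrative):

```python
import numpy as np

def prune_bias(e_score_correction_bias: np.ndarray, kept_experts: np.ndarray) -> np.ndarray:
    """Slice a per-expert bias tensor to the experts REAP kept.
    REAP pruned the expert weight matrices but left this tensor at its
    original (256,) shape; shapes must agree or the checkpoint fails to load."""
    return e_score_correction_bias[kept_experts]

bias = np.arange(256, dtype=np.float32)  # stand-in for one layer's bias tensor
kept = np.arange(154)                    # surviving expert indices (illustrative)
print(prune_bias(bias, kept).shape)  # (154,)
```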
Day 3: Conversion, evaluation, publication
From a 260 GB BF16 checkpoint we converted to nine quantized variants: GGUF in Q3_K_M, IQ4_XS, Q4_K_M, Q6_K, and Q8_0; FP8 W8A8 for vLLM and TRT-LLM; NVFP4 W4A16 for Blackwell-native inference; AWQ INT4 W4A16 for broad compatibility; and NVFP4-GB10 W4A4 for DGX Spark. Then HumanEval evaluation of the Q4_K_M variant, and methodology-transparent publication on HuggingFace with all calibration details, bias-fix documentation, and memory budget tables.
Total: 14 structural patches against the REAP codebase. All documented. Nothing held back.
What we delivered: ten variants across six formats
| Variant | Format | Size | Target |
|---|---|---|---|
| Safetensors BF16 | HuggingFace transformers | 278 GB | Data center, vLLM, TRT-LLM, fine-tuning |
| GGUF Q3_K_M | llama.cpp | 62 GB | 64 GB RAM threshold |
| GGUF IQ4_XS | llama.cpp | 69 GB | Long context on 96 GB |
| GGUF Q4_K_M | llama.cpp | 84 GB | 96 GB Mac Studio, Threadripper |
| GGUF Q6_K | llama.cpp | 106 GB | 128 GB+ machines |
| GGUF Q8_0 | llama.cpp | 138 GB | Quality over compression |
| FP8 (W8A8) | vLLM, TRT-LLM | 140 GB | H100/H200 native |
| NVFP4 (W4A16) | vLLM, compressed-tensors | 72 GB | B100/B200, Hopper fallback |
| AWQ INT4 (W4A16) | vLLM, HF Transformers | 74 GB | Broad INT4 ecosystem |
| NVFP4-GB10 (W4A4) | vLLM | 72 GB | DGX Spark, GB10 native FP4 |
All published under Modified MIT license, inherited from MiniMaxAI. Model cards include full calibration details, benchmark methodology, and memory budget tables per context length.
Quality: what does 40% pruning cost?
We evaluated the Q4_K_M variant (84 GB, the Mac Studio target) on a five-prompt smoke test and on the full HumanEval (164 coding problems). The smoke test: 5/5 PASS on Norwegian, math, Python, explanations, and JSON. The minor bias-fix imperfection on layer 0 has no observable effect.
HumanEval pass@1 gives three numbers, and the methodology behind them is the actual point:
- 83.3% on completed problems (90 of 108 where the model had enough token budget to finish its reasoning)
- 54.9% strict (90 of 164 total, where 56 problems hit the 32K token cap because the model went deep in chain-of-thought)
- 65.2% with the raw-completion method (forcing the model into prompt completion instead of chat mode)
MiniMax-M2.7 is RLHF-tuned to emit thinking blocks before its final answer. On complex engineering problems it easily spends 15,000+ tokens on internal reasoning. 83.3% is the model quality. 54.9% is what you get in production with realistic budgets. 65.2% is an alternative methodology number. We publish all three, transparently.
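The arithmetic behind the first two numbers is worth making explicit, because only the counting rule changes between them (the 65.2% figure comes from a separate raw-completion run and is not derivable from these counts):

```python
def pass_at_1(passed: int, counted: int) -> float:
    """pass@1 as a percentage over whichever denominator the methodology picks."""
    return 100 * passed / counted

total, truncated, passed = 164, 56, 90
completed = total - truncated  # problems that finished within the 32K token cap

print(f"{pass_at_1(passed, completed):.1f}%")  # 83.3% - completed problems only
print(f"{pass_at_1(passed, total):.1f}%")      # 54.9% - strict: truncation counts as failure
```

Same 90 passing solutions, two denominators, a 28-point spread.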
A benchmark number without methodology context is nearly worthless. That applies not just to our model, but to everything you read about LLM benchmarks in 2026.
Three principles we apply to client work
1. A model's size on disk is not your capacity plan
A model at "84 GB in Q4_K_M" is 84 GB on disk. During inference it is weights plus activations plus KV-cache plus compute buffers. The KV-cache alone varies 3-4x across architectures: MiniMax-M2.7 uses 250 MB per 1K tokens, Qwen3-30B 94 MB, DeepSeek V3 with MLA 72 MB. When we size infrastructure for an AI workload at a client, we account for all of it, not just the model size. Errors in this calculation are a primary reason AI pilot projects fail.
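A back-of-the-envelope sketch of that budget, using the article's own figures for the Q4_K_M variant; the fixed overhead allowance for compute buffers is an assumption:

```python
def inference_footprint_gb(weights_gb: float, context_tokens: int,
                           kv_mb_per_1k: float, overhead_gb: float = 2.0) -> float:
    """Rough serving footprint: weights + KV-cache + a fixed buffer allowance (assumed)."""
    kv_gb = context_tokens / 1000 * kv_mb_per_1k / 1000
    return weights_gb + kv_gb + overhead_gb

# MiniMax-M2.7 Q4_K_M at 32K context, using the article's 250 MB per 1K tokens
print(round(inference_footprint_gb(84, 32_000, 250), 1))  # 94.0
```

84 GB of weights plus 8 GB of KV-cache plus buffers lands just under the 96 GB of a Mac Studio, which is exactly why the Q4_K_M target works at that context length and a Qwen3-30B-style cache (94 MB/1K) would leave far more headroom.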
2. Benchmark numbers without methodology are nearly worthless
Same model, same benchmark, three different numbers depending on token budget. This applies to every reasoning model in 2026. When we evaluate models for our customers, we run them against their actual workload and always state the test conditions, not generic benchmark numbers from vendor whitepapers.
3. The open-source ecosystem matures fast but unevenly
REAP is production-ready for six model families. The seventh required 14 structural patches. This is the nature of specialized AI tools: the supported surface is narrower than the marketing implies. We can navigate it. When a customer asks three months from now "have you tested X?", the answer is more often "yes" than "no", because Lab work gives us real hands-on experience with every new model ecosystem.
See also: other m51 Lab research
SeoGemma4 v2 vs Claude Sonnet: can open source write Norwegian SEO audits?
NorskMistral-119B: Norway's best open-source language model
NorskGemma4-31B: smaller model, better results (83.6% NorEval)
Open source vs commercial AI on the Norwegian market: our models against GPT and Gemini
About M51 AI OS
M51 AI OS is a Norwegian SaaS platform that gives marketing teams and agencies a complete AI-driven operating system. 17 specialized AI agents work with customer data and brand context to automate analysis, reporting, ad production, SEO, and campaign optimization. In production the platform runs on a curated mix of Claude Sonnet/Opus, GPT, and our own fine-tuned models.
m51 Lab is the research and development division behind the platform. We publish open models on HuggingFace, document our research transparently, and let the learnings flow directly into the architectural choices we make for clients. The MiniMax-M2.7 experiment does not stand alone: it is the latest in a series that also includes NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2.
Want to see what AI can do for your marketing team? Book a demo.