m51 Lab Research · April 2026 · 10 min

We pruned MiniMax-M2.7: the first public REAP variant, published 72 hours after release

Six open model variants on HuggingFace. 229 billion parameters reduced to 139 billion with no measurable quality loss. How m51 Lab did it, and why it demonstrates the technical depth our clients rely on.

Key findings (TL;DR)

  • First public REAP-40% variant of MiniMax-M2.7, published 72 hours after the model's release
  • 229 billion parameters reduced to 139 billion with no measurable quality loss (5/5 smoke test, 83.3% HumanEval pass@1 on completed problems)
  • Six open model formats on HuggingFace: Safetensors BF16, GGUF in five quantizations, FP8, NVFP4, AWQ, and NVFP4-GB10
  • The Q4_K_M variant (84 GB) fits on a 96 GB Mac Studio M4 Max with room for KV-cache
  • Part of m51 Lab's open research series: previously NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2

MiniMax AI released M2.7 on April 12, 2026: a 229-billion-parameter Mixture-of-Experts model with 10 billion active parameters per token, too large in its standard quantization to run outside a data center. Three days later, m51 Lab had published six open pruned variants, the smallest of which (Q4_K_M at 84 GB) runs on a 96 GB Mac Studio.

This is the story of that work, and why a marketing platform does it.

Brief technical context: MoE and REAP

Mixture-of-Experts (MoE) is the architecture behind most frontier models in 2026. Instead of all parameters being active for every token, the model routes each token to a small subset of so-called experts. MiniMax-M2.7 has 256 experts per layer and uses top-8 routing, giving it the quality of a 229-billion-parameter model at the compute cost of a 10-billion-parameter model, provided you can get it to run.
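The routing step can be sketched in a few lines. This is an illustrative top-k gate in plain Python; the dimensions and the softmax-over-selected-experts detail are generic MoE conventions, not MiniMax's actual implementation:

```python
import math
import random

def route_token(hidden, router_weights, top_k=8):
    """Pick the top_k experts for one token and normalize their gate scores."""
    # One logit per expert: dot product of the token vector with each router row.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    m = max(logits[i] for i in top)
    gates = [math.exp(logits[i] - m) for i in top]   # softmax over the selected experts
    total = sum(gates)
    return top, [g / total for g in gates]

random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(64)]
router = [[random.gauss(0, 1) for _ in range(64)] for _ in range(256)]
experts, gates = route_token(hidden, router)
# Only 8 of 256 experts receive this token; their gate weights sum to 1.
```

Only the eight selected experts run their feed-forward pass, which is why a 229B model can cost roughly as much compute as a 10B one.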

REAP (Router-weighted Expert Activation Pruning) from Cerebras Research, published October 2025, identifies the least important experts in an MoE model based on how often they activate and how much they contribute on calibration data. REAP-40% removes 40% of the experts (256 to 154 per layer) and cuts the model size from 140 GB to 84 GB in Q4_K_M with no measurable effect on quality.
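The scoring idea can be sketched as follows. This is a deliberately simplified illustration of router-weighted saliency, not the exact criterion from the Cerebras paper; the event-triple format and the keep-count helper are assumptions for the sketch:

```python
def expert_saliency(routing_events, num_experts=256):
    """Score each expert by accumulated gate weight times expert output norm.

    routing_events: iterable of (expert_id, gate_weight, output_norm) triples,
    one per routed token, collected on calibration data.
    """
    scores = [0.0] * num_experts
    for expert_id, gate, out_norm in routing_events:
        scores[expert_id] += gate * out_norm
    return scores

def experts_to_keep(scores, keep=154):
    """Keep the highest-scoring experts; prune the rest (REAP-40%: 256 -> 154)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy data: expert 0 is routed often with high gates, expert 255 almost never.
events = ([(0, 0.9, 1.0)] * 50
          + [(255, 0.01, 0.5)] * 2
          + [(i, 0.1, 1.0) for i in range(1, 255)])
kept = experts_to_keep(expert_saliency(events))
# Expert 0 survives; expert 255 is among the pruned.
```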

The catalyst for this project: the official Cerebras REAP implementation supports six model families (Qwen, Llama, Mixtral, DeepSeek, Ernie, Glm). MiniMax was not on the list. No one else had published an M2.7 REAP. The first-mover window was open.

Why m51 Lab does this

M51 AI OS is an AI-driven marketing platform that runs 17 specialized AI agents daily for Norwegian customers: agents that write ad copy, generate reports, produce SEO audits, analyze competitors, and coordinate campaigns. When we recommend a model or architecture to a client, we take responsibility for it actually working under real conditions.

The problem is that the AI industry communicates in headlines. "GPT 5.4 scores 92% on MMLU." "Claude Opus 4.6 leads SWE-bench." "Gemini 3.1 has 1M context." All true. None of it tells you how the model behaves when you force it to run on a fraction of the data center, in continuous operation, with realistic traffic.

That is why we have the Lab division. We push models and tools until we know where the limits lie. The MiniMax-M2.7 project is the latest piece in an ongoing series.

Read also: We fine-tuned an open-source AI to write SEO audits, and tested it against Claude Sonnet

Three days, six deliverables

The pipeline had eight stages: FP8 download, BF16 dequantization, REAP pruning, imatrix calibration, GGUF conversion, multi-format quantization, evaluation, and publication. Estimated wall clock: 22 to 28 hours. Actual active pipeline time: 20 hours over three days. Total GPU cost: approximately 775 USD.

Day 1: Diagnosis

The official REAP code has two registries for model support, MODEL_ATTRS and OBSERVER_CONFIG_REGISTRY. MiniMax-M2.7 was in neither. After five restart rounds patching dataset formats, chat templates, and transformers versions, we reached a working forward pass on 4x H200.
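Adding a new family means extending both registries. The entry shapes below are hypothetical stand-ins to show the pattern; the real REAP registries use their own schemas and field names:

```python
# Hypothetical registry entries -- field names are assumptions, not the REAP schema.
MODEL_ATTRS = {
    "minimax_m2": {                       # key matching config.model_type (assumed)
        "moe_block": "block_sparse_moe",  # attribute path to the MoE block
        "router": "gate",                 # name of the router/gating module
        "num_experts_key": "num_local_experts",
    },
}

OBSERVER_CONFIG_REGISTRY = {
    "minimax_m2": {
        "hook_point": "moe_block",  # where activation statistics are collected
        "dtype": "bfloat16",        # BF16 instead of FP32 halves observer memory
    },
}

def lookup(model_type):
    """Fail loudly for unsupported architectures, as the stock code effectively does."""
    if model_type not in MODEL_ATTRS or model_type not in OBSERVER_CONFIG_REGISTRY:
        raise KeyError(f"unsupported architecture: {model_type}")
    return MODEL_ATTRS[model_type], OBSERVER_CONFIG_REGISTRY[model_type]
```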

That is where we hit the real wall. The REAP observer allocates (number of experts x batch x sequence x hidden_dim) x 4 bytes in FP32 per MoE block. For MiniMax-M2.7 that meant 96 GB per layer at batch=16, and no single flag turns it off. This is a structural incompatibility between the REAP observer design and the MiniMax architecture parameters, not a flaw in the model itself. Lesson: the hidden memory killer is FP32 activations, not the weights.
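The arithmetic behind that wall is easy to reproduce. The sequence length and hidden dimension below are illustrative values chosen to land on the quoted 96 GB per layer, not M2.7's published config:

```python
def observer_bytes(num_experts, batch, seq_len, hidden_dim, bytes_per_elem=4):
    """Activation buffer the observer allocates per MoE block."""
    return num_experts * batch * seq_len * hidden_dim * bytes_per_elem

# Illustrative dimensions (seq_len and hidden_dim are assumptions):
fp32 = observer_bytes(256, 16, 2048, 3072, bytes_per_elem=4)
bf16 = observer_bytes(256, 4, 2048, 3072, bytes_per_elem=2)  # the Day-2 fix: BF16 + batch=4

print(f"FP32, batch=16: {fp32 / 2**30:.0f} GiB")  # 96 GiB per layer
print(f"BF16, batch=4:  {bf16 / 2**30:.0f} GiB")  # 12 GiB: 8x smaller
```

Halving the element size and quartering the batch is what brought the observer from "does not fit" to routine.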

Day 2: Pruning

The solution was to patch the observer code itself and switch the allocations from FP32 to BF16, which halves the memory. With 8x H100 SXM and batch=4, we hit 41 seconds per sample, linear and stable; 768 calibration samples took 8 hours and 45 minutes. The forward pass went through. The save first crashed on a small type bug (num_experts stored as an int instead of a string), and then we caught that REAP had pruned the expert weights but left the e_score_correction_bias tensors untouched. Thirty lines of Python rewrote 55 of 56 shards with the correct bias shape.
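The shard fix can be sketched like this. Plain lists stand in for tensors, and the exact tensor naming is an assumption based on the e_score_correction_bias suffix; the real fix ran over safetensors shards:

```python
def fix_bias_tensors(state_dict, kept_experts):
    """Slice per-expert bias vectors down to the surviving experts.

    state_dict maps tensor names to lists (stand-ins for real tensors).
    """
    fixed = {}
    for name, tensor in state_dict.items():
        if name.endswith("e_score_correction_bias"):
            # Bias still has one entry per original expert; keep only survivors.
            fixed[name] = [tensor[i] for i in kept_experts]
        else:
            fixed[name] = tensor
    return fixed

kept = [0, 2, 3]  # toy example: 4 experts pruned to 3
shard = {
    "layers.0.mlp.e_score_correction_bias": [0.1, 0.2, 0.3, 0.4],
    "layers.0.mlp.gate.weight": [[1.0], [2.0]],
}
out = fix_bias_tensors(shard, kept)
# The bias shrinks from 4 entries to 3; other tensors pass through untouched.
```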

Day 3: Conversion, evaluation, publication

From a 260 GB BF16 checkpoint we produced the six published formats: the Safetensors BF16 itself; GGUF in Q3_K_M, IQ4_XS, Q4_K_M, Q6_K, and Q8_0; FP8 W8A8 for vLLM and TRT-LLM; NVFP4 W4A16 for Blackwell-native inference; AWQ INT4 W4A16 for broad compatibility; and NVFP4-GB10 W4A4 for DGX Spark. We then ran a HumanEval evaluation of the Q4_K_M variant and published everything on HuggingFace with full methodology: calibration details, bias-fix documentation, and memory budget tables.
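A rough sanity check on the sizes: on-disk size is approximately parameter count times average bits per weight. The bpw averages below are approximations we assume for the sketch (real GGUF files mix quant types per tensor, so actual files deviate by a few GB):

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Rough on-disk size: parameters x bits per weight, ignoring metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumed average bpw: BF16 is exactly 16; ~4.85 is a common Q4_K_M average.
for name, bpw in [("BF16", 16.0), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{gguf_size_gb(139, bpw):.0f} GB")
# BF16: ~278 GB
# Q4_K_M: ~84 GB
```

Both estimates line up with the published table, which is a quick way to verify a quantized artifact is not silently corrupt.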

Total: 14 structural patches against the REAP codebase. All documented. Nothing held back.

What we delivered: six model variants

| Variant | Format | Size | Target |
|---|---|---|---|
| Safetensors BF16 | HuggingFace transformers | 278 GB | Data center, vLLM, TRT-LLM, fine-tuning |
| GGUF Q3_K_M | llama.cpp | 62 GB | 64 GB RAM threshold |
| GGUF IQ4_XS | llama.cpp | 69 GB | Long context on 96 GB |
| GGUF Q4_K_M | llama.cpp | 84 GB | 96 GB Mac Studio, Threadripper |
| GGUF Q6_K | llama.cpp | 106 GB | 128 GB+ machines |
| GGUF Q8_0 | llama.cpp | 138 GB | Quality over compression |
| FP8 (W8A8) | vLLM, TRT-LLM | 140 GB | H100/H200 native |
| NVFP4 (W4A16) | vLLM, compressed-tensors | 72 GB | B100/B200, Hopper fallback |
| AWQ INT4 (W4A16) | vLLM, HF Transformers | 74 GB | Broad INT4 ecosystem |
| NVFP4-GB10 (W4A4) | vLLM | 72 GB | DGX Spark, GB10 native FP4 |

All published under Modified MIT license, inherited from MiniMaxAI. Model cards include full calibration details, benchmark methodology, and memory budget tables per context length.

  • Safetensors BF16 (main repo)
  • GGUF (5 quantizations for llama.cpp)
  • FP8 (W8A8, for H100/H200)
  • NVFP4 (W4A16, for Blackwell + Hopper fallback)
  • AWQ (INT4 W4A16)
  • NVFP4-GB10 (W4A4, for DGX Spark / GB10)

Quality: what does 40% pruning cost?

We evaluated the Q4_K_M variant (84 GB, the Mac Studio target) on a five-prompt smoke test and on the full HumanEval (164 coding problems). The smoke test: 5/5 PASS on Norwegian, math, Python, explanations, and JSON. The minor bias-fix imperfection on layer 0 has no observable effect.

HumanEval pass@1 gives three numbers, and the methodology behind them is the actual point:

  • 83.3% on completed problems (90 of 108 where the model had enough token budget to finish its reasoning)
  • 54.9% strict (90 of 164 total, where 56 problems hit the 32K token cap because the model went deep in chain-of-thought)
  • 65.2% with the raw-completion method (forcing the model into prompt completion instead of chat mode)

MiniMax-M2.7 is RLHF-tuned to emit thinking blocks before its final answer. On complex engineering problems it easily spends 15,000+ tokens on internal reasoning. 83.3% is the model quality. 54.9% is what you get in production with realistic budgets. 65.2% is an alternative methodology number. We publish all three, transparently.
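The first two headline numbers come from the same 90 passing solutions, just with different denominators, which is easy to verify:

```python
def pass_at_1(passed, completed, total):
    """The same results yield different headline numbers depending on the denominator."""
    return {
        "completed-only": passed / completed,  # problems that fit the token budget
        "strict":         passed / total,      # truncated runs count as failures
    }

rates = pass_at_1(passed=90, completed=108, total=164)
print(f"{rates['completed-only']:.1%}")  # 83.3%
print(f"{rates['strict']:.1%}")          # 54.9%
```

(The 65.2% figure comes from a separate raw-completion run, so it does not reduce to a denominator change.)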

A benchmark number without methodology context is nearly worthless. That applies not just to our model, but to everything you read about LLM benchmarks in 2026.

Three principles we apply to client work

1. A model size on disk is not your capacity plan

A model at "84 GB in Q4_K_M" is 84 GB on disk. During inference it is weights plus activations plus KV-cache plus compute buffers. The KV-cache alone varies 3-4x across architectures: MiniMax-M2.7 uses 250 MB per 1K tokens, Qwen3-30B 94 MB, DeepSeek V3 with MLA 72 MB. When we size infrastructure for an AI workload at a client, we account for all of it, not just the model size. Errors in this calculation are a primary reason AI pilot projects fail.
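A back-of-the-envelope sizing helper makes the point concrete. The fixed overhead allowance and the example context length are illustrative assumptions, not engine-specific measurements:

```python
def serving_memory_gb(weights_gb, kv_mb_per_1k_tokens, context_tokens,
                      concurrent_requests=1, overhead_gb=4.0):
    """Weights + KV-cache + a fixed allowance for activations and compute buffers.

    overhead_gb is a placeholder; real buffer sizes depend on the inference engine.
    """
    kv_gb = kv_mb_per_1k_tokens / 1024 * (context_tokens / 1000) * concurrent_requests
    return weights_gb + kv_gb + overhead_gb

# MiniMax-M2.7 Q4_K_M on a 96 GB Mac Studio: 84 GB weights, 250 MB KV per 1K tokens.
need = serving_memory_gb(84, 250, context_tokens=24_000)
print(f"{need:.1f} GB")  # 93.9 GB: a 24K context just fits under 96 GB
```

Swap in Qwen3-30B's 94 MB or DeepSeek V3's 72 MB per 1K tokens and the same box suddenly supports a several-times-longer context, which is exactly why the architecture, not just the file size, belongs in the capacity plan.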

2. Benchmark numbers without methodology are nearly worthless

Same model, same benchmark, three different numbers depending on token budget. This applies to every reasoning model in 2026. When we evaluate models for our customers, we run them against their actual workload and always state the test conditions, not generic benchmark numbers from vendor whitepapers.

3. The open-source ecosystem matures fast but unevenly

REAP is production-ready for six model families. The seventh required 14 structural patches. This is the nature of specialized AI tools: the supported surface is narrower than the marketing implies. We can navigate it. When a customer asks three months from now "have you tested X?", the answer is more often "yes" than "no", because Lab work gives us real hands-on experience with every new model ecosystem.

See also: other m51 Lab research

SeoGemma4 v2 vs Claude Sonnet: can open source write Norwegian SEO audits?

NorskMistral-119B: Norway's best open-source language model

NorskGemma4-31B: smaller model, better results (83.6% NorEval)

Open source vs commercial AI on the Norwegian market: our models against GPT and Gemini


About M51 AI OS

M51 AI OS is a Norwegian SaaS platform that gives marketing teams and agencies a complete AI-driven operating system. 17 specialized AI agents work with customer data and brand context to automate analysis, reporting, ad production, SEO, and campaign optimization. In production the platform runs on a curated mix of Claude Sonnet/Opus, GPT, and our own fine-tuned models.

m51 Lab is the research and development division behind the platform. We publish open models on HuggingFace, document our research transparently, and let the learnings flow directly into the architectural choices we make for clients. The MiniMax-M2.7 experiment does not stand alone: it is the latest in a series that also includes NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2.

Want to see what AI can do for your marketing team? Book a demo.
