Key findings (TL;DR)
- First public REAP-40% variant of MiniMax-M2.7, published 72 hours after the model's release
- 229 billion parameters reduced to 139 billion with no measurable quality loss (5/5 smoke test, 83.3% HumanEval pass@1 on completed problems)
- Six open model formats on HuggingFace: Safetensors BF16, GGUF in five quantizations, FP8, NVFP4, AWQ, and NVFP4-GB10
- The Q4_K_M variant (84 GB) fits on a 96 GB Mac Studio M4 Max with room for KV-cache
- Part of m51 Lab's open research series: previously NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2
MiniMax AI released M2.7 on April 12, 2026: a 229-billion-parameter Mixture-of-Experts model with 10 billion active parameters per token, too large in its standard quantization to run outside a data center. Three days later, m51 Lab had published open pruned variants in six formats, the smallest of which (GGUF Q4_K_M at 84 GB) runs on a 96 GB Mac Studio.
This is the story of that work, and why a marketing platform does it.
Brief technical context: MoE and REAP
Mixture-of-Experts (MoE) is the architecture behind most frontier models in 2026. Instead of all parameters being active for every token, the model routes each token to a small subset of so-called experts. MiniMax-M2.7 has 256 experts per layer and uses top-8 routing, giving it the quality of a 229-billion-parameter model at the compute cost of a 10-billion-parameter model, provided you can get it to run.
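Top-k routing is simple to sketch. The following minimal example is illustrative only (the real router runs inside every transformer layer, and everything here except the 256-expert / top-8 configuration is a toy value): per token, select the 8 highest-scoring experts and softmax-normalize their gate weights.

```python
import numpy as np

def route_tokens(router_logits: np.ndarray, top_k: int = 8):
    """For each token, pick the top_k experts and softmax-normalize their gates."""
    # Indices of the top_k largest logits per token (order within top_k is irrelevant)
    top_idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    # Softmax over the selected logits only
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

# 4 tokens routed across 256 experts, top-8 (the M2.7 configuration)
logits = np.random.default_rng(0).normal(size=(4, 256))
experts, gates = route_tokens(logits)
print(experts.shape, gates.shape)  # (4, 8) (4, 8)
```

Only the 8 selected experts run their feed-forward pass per token, which is why compute scales with active parameters, not total parameters.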
REAP (Router-weighted Expert Activation Pruning) from Cerebras Research, published October 2025, identifies the least important experts in an MoE model based on how often they activate and how much they contribute on calibration data. REAP-40% removes 40% of the experts (256 to 154 per layer) and cuts the model size from 140 GB to 84 GB in Q4_K_M with no measurable effect on quality.
The catalyst for this project: the official Cerebras REAP implementation supports six model families (Qwen, Llama, Mixtral, DeepSeek, Ernie, GLM). MiniMax was not on the list. No one else had published an M2.7 REAP. The first-mover window was open.
Why m51 Lab does this
M51 AI OS is an AI-driven marketing platform that runs 17 specialized AI agents daily for Norwegian customers: agents that write ad copy, generate reports, produce SEO audits, analyze competitors, and coordinate campaigns. When we recommend a model or architecture to a client, we take responsibility for it actually working under real conditions.
The problem is that the AI industry communicates in headlines. "GPT 5.4 scores 92% on MMLU." "Claude Opus 4.6 leads SWE-bench." "Gemini 3.1 has 1M context." All true. None of it tells you how the model behaves when you force it to run on a fraction of the data center, in continuous operation, with realistic traffic.
That is why we have the Lab division. We push models and tools until we know where the limits lie. The MiniMax-M2.7 project is the latest piece in an ongoing series.
Read also: We fine-tuned an open-source AI to write SEO audits, and tested it against Claude Sonnet
Three days, six deliverables
The pipeline had eight stages: FP8 download, BF16 dequantization, REAP pruning, imatrix calibration, GGUF conversion, multi-format quantization, evaluation, and publication. Estimated wall clock: 22 to 28 hours. Actual active pipeline time: 20 hours over three days. Total GPU cost: approximately 775 USD.
Day 1: Diagnosis
The official REAP code has two registries for model support, MODEL_ATTRS and OBSERVER_CONFIG_REGISTRY. MiniMax-M2.7 was in neither. After five restart rounds patching dataset formats, chat templates, and transformers versions, we reached a first successful forward pass on 4x H200.
That is where we hit the real wall. The REAP observer allocates (number of experts x batch x sequence x hidden_dim) x 4 bytes in FP32 per MoE block. For MiniMax-M2.7 that meant 96 GB per layer at batch=16, and no single flag turns it off. This was a structural incompatibility between the REAP observer design and MiniMax's architecture parameters, not a fault in the model itself. Lesson: FP32 activations, not the weights, are the hidden memory killer.
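Plugging numbers into that formula shows why it blows up, and why halving the element size helps so much. The sequence length and hidden_dim below are assumptions chosen for illustration (they happen to reproduce a figure in the same ballpark as the 96 GB we measured); only the expert count and batch size come from the text.

```python
def observer_bytes(n_experts, batch, seq_len, hidden_dim, bytes_per_element=4):
    """Activation buffer the observer allocates per MoE block."""
    return n_experts * batch * seq_len * hidden_dim * bytes_per_element

# 256 experts, batch 16; seq_len=2048 and hidden_dim=3072 are illustrative assumptions
fp32 = observer_bytes(256, 16, 2048, 3072)                        # as shipped
bf16 = observer_bytes(256, 16, 2048, 3072, bytes_per_element=2)   # after a BF16 patch
print(fp32 / 2**30, bf16 / 2**30)  # 96.0 48.0
```

The expert count is the multiplier that sets MiniMax apart: at 256 experts per layer, any per-expert activation buffer is 256x the cost of a dense equivalent.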
Day 2: Pruning
The solution was to patch the observer code itself, switching the allocations from FP32 to BF16. That halves the memory. With 8x H100 SXM and batch=4, we hit 41 seconds per sample, linear and stable; 768 calibration samples took 8 hours and 45 minutes. The forward pass went through, the save step first crashed on a small type bug (num_experts passed as int instead of string), and then we caught that REAP had pruned the expert weights but left the e_score_correction_bias tensors untouched. 30 lines of Python rewrote 55 of 56 shards with the correct bias shape.
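The shard rewrite boiled down to one core operation: slicing every per-expert bias tensor down to the surviving expert indices so its shape matches the pruned weight matrices. A minimal numpy sketch of that step (the real script also rewrote the safetensors shards and index file; the expert indices here are illustrative):

```python
import numpy as np

def prune_bias(e_score_correction_bias: np.ndarray, kept_experts: np.ndarray) -> np.ndarray:
    """Slice a per-expert bias tensor to the experts REAP kept.
    REAP pruned the expert weight matrices but left this tensor at its
    original (256,) shape; shapes must agree or the checkpoint fails to load."""
    return e_score_correction_bias[kept_experts]

bias = np.arange(256, dtype=np.float32)  # stand-in for one layer's bias tensor
kept = np.arange(154)                    # surviving expert indices (illustrative)
print(prune_bias(bias, kept).shape)  # (154,)
```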
Day 3: Conversion, evaluation, publication
From a 260 GB BF16 checkpoint we converted to nine quantized variants: GGUF in Q3_K_M, IQ4_XS, Q4_K_M, Q6_K, and Q8_0; FP8 W8A8 for vLLM and TRT-LLM; NVFP4 W4A16 for Blackwell-native inference; AWQ INT4 W4A16 for broad compatibility; and NVFP4-GB10 W4A4 for DGX Spark. Then HumanEval evaluation of the Q4_K_M variant, and methodology-transparent publication on HuggingFace with all calibration details, bias-fix documentation, and memory budget tables.
Total: 14 structural patches against the REAP codebase. All documented. Nothing held back.
What we delivered: ten variants across six formats
| Variant | Format | Size | Target |
|---|---|---|---|
| Safetensors BF16 | HuggingFace transformers | 278 GB | Data center, vLLM, TRT-LLM, fine-tuning |
| GGUF Q3_K_M | llama.cpp | 62 GB | 64 GB RAM threshold |
| GGUF IQ4_XS | llama.cpp | 69 GB | Long context on 96 GB |
| GGUF Q4_K_M | llama.cpp | 84 GB | 96 GB Mac Studio, Threadripper |
| GGUF Q6_K | llama.cpp | 106 GB | 128 GB+ machines |
| GGUF Q8_0 | llama.cpp | 138 GB | Quality over compression |
| FP8 (W8A8) | vLLM, TRT-LLM | 140 GB | H100/H200 native |
| NVFP4 (W4A16) | vLLM, compressed-tensors | 72 GB | B100/B200, Hopper fallback |
| AWQ INT4 (W4A16) | vLLM, HF Transformers | 74 GB | Broad INT4 ecosystem |
| NVFP4-GB10 (W4A4) | vLLM | 72 GB | DGX Spark, GB10 native FP4 |
All published under Modified MIT license, inherited from MiniMaxAI. Model cards include full calibration details, benchmark methodology, and memory budget tables per context length.
Quality: what does 40% pruning cost?
We evaluated the Q4_K_M variant (84 GB, the Mac Studio target) on a five-prompt smoke test and on the full HumanEval (164 coding problems). The smoke test: 5/5 PASS on Norwegian, math, Python, explanations, and JSON. The minor bias-fix imperfection on layer 0 has no observable effect.
HumanEval pass@1 gives three numbers, and the methodology behind them is the actual point:
- 83.3% on completed problems (90 of 108 where the model had enough token budget to finish its reasoning)
- 54.9% strict (90 of 164 total, where 56 problems hit the 32K token cap because the model went deep in chain-of-thought)
- 65.2% with the raw-completion method (forcing the model into prompt completion instead of chat mode)
MiniMax-M2.7 is RLHF-tuned to emit thinking blocks before its final answer. On complex engineering problems it easily spends 15,000+ tokens on internal reasoning. 83.3% is the model quality. 54.9% is what you get in production with realistic budgets. 65.2% is an alternative methodology number. We publish all three, transparently.
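The arithmetic behind the first two numbers is worth making explicit, because only the counting rule changes between them (the 65.2% figure comes from a separate raw-completion run and is not derivable from these counts):

```python
def pass_at_1(passed: int, counted: int) -> float:
    """pass@1 as a percentage over whichever denominator the methodology picks."""
    return 100 * passed / counted

total, truncated, passed = 164, 56, 90
completed = total - truncated  # problems that finished within the 32K token cap

print(f"{pass_at_1(passed, completed):.1f}%")  # 83.3% - completed problems only
print(f"{pass_at_1(passed, total):.1f}%")      # 54.9% - strict: truncation counts as failure
```

Same 90 passing solutions, two denominators, a 28-point spread.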
A benchmark number without methodology context is nearly worthless. That applies not just to our model, but to everything you read about LLM benchmarks in 2026.
Three principles we apply to client work
1. A model's size on disk is not your capacity plan
A model at "84 GB in Q4_K_M" is 84 GB on disk. During inference it is weights plus activations plus KV-cache plus compute buffers. The KV-cache alone varies 3-4x across architectures: MiniMax-M2.7 uses 250 MB per 1K tokens, Qwen3-30B 94 MB, DeepSeek V3 with MLA 72 MB. When we size infrastructure for an AI workload at a client, we account for all of it, not just the model size. Errors in this calculation are a primary reason AI pilot projects fail.
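A back-of-the-envelope sketch of that budget, using the article's own figures for the Q4_K_M variant; the fixed overhead allowance for compute buffers is an assumption:

```python
def inference_footprint_gb(weights_gb: float, context_tokens: int,
                           kv_mb_per_1k: float, overhead_gb: float = 2.0) -> float:
    """Rough serving footprint: weights + KV-cache + a fixed buffer allowance (assumed)."""
    kv_gb = context_tokens / 1000 * kv_mb_per_1k / 1000
    return weights_gb + kv_gb + overhead_gb

# MiniMax-M2.7 Q4_K_M at 32K context, using the article's 250 MB per 1K tokens
print(round(inference_footprint_gb(84, 32_000, 250), 1))  # 94.0
```

84 GB of weights plus 8 GB of KV-cache plus buffers lands just under the 96 GB of a Mac Studio, which is exactly why the Q4_K_M target works at that context length and a Qwen3-30B-style cache (94 MB/1K) would leave far more headroom.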
2. Benchmark numbers without methodology are nearly worthless
Same model, same benchmark, three different numbers depending on token budget. This applies to every reasoning model in 2026. When we evaluate models for our customers, we run them against their actual workload and always state the test conditions, not generic benchmark numbers from vendor whitepapers.
3. The open-source ecosystem matures fast but unevenly
REAP is production-ready for six model families. The seventh required 14 structural patches. This is the nature of specialized AI tools: the supported surface is narrower than the marketing implies. We can navigate it. When a customer asks three months from now "have you tested X?", the answer is more often "yes" than "no", because Lab work gives us real hands-on experience with every new model ecosystem.
See also: other m51 Lab research
SeoGemma4 v2 vs Claude Sonnet: can open source write Norwegian SEO audits?
NorskMistral-119B: Norway's best open-source language model
NorskGemma4-31B: smaller model, better results (83.6% NorEval)
Open source vs commercial AI on the Norwegian market: our models against GPT and Gemini
About M51 AI OS
M51 AI OS is a Norwegian SaaS platform that gives marketing teams and agencies a complete AI-driven operating system. 17 specialized AI agents work with customer data and brand context to automate analysis, reporting, ad production, SEO, and campaign optimization. In production the platform runs on a curated mix of Claude Sonnet/Opus, GPT, and our own fine-tuned models.
m51 Lab is the research and development division behind the platform. We publish open models on HuggingFace, document our research transparently, and let the learnings flow directly into the architectural choices we make for clients. The MiniMax-M2.7 experiment does not stand alone: it is the latest in a series that also includes NorskGemma4-31B, NorskMistral-119B, and SeoGemma4 v2.
Want to see what AI can do for your marketing team? Book a demo.