AI Lab · April 2026 · 10 min

Smaller model, better results: How we built NorskGemma4-31B

83.6% on NorEval, the highest published score for a Norwegian model.

We had barely finished publishing m51Lab-NorskMistral-119B, ranked #1 on NorEval with 76.8%, when Google released Gemma 4: a new generation of models with support for 140+ languages from the start.

Read about NorskMistral-119B: How we built Norway's best open-source AI

We had to test it.

m51Lab-NorskGemma4-31B scores 83.6% on NorEval. That is 6.8 percentage points above our own Mistral model and almost 11pp above the previous best published model. With 4x fewer parameters and a quarter of the training data.


Background: NorEval and the starting point

NorEval is the Norwegian standard for evaluating language models, developed by the University of Oslo and published at ACL 2025. The benchmark covers 24 datasets across 9 categories, ranging from commonsense reasoning and factual knowledge to truthfulness and reading comprehension. It tests both Bokmål and Nynorsk.

Our Mistral model had just set a new record with 76.8% average, up from NorMistral-11B-thinking's 73.1%. With Gemma 4 on the table, the question was: can we do it again, with a much smaller model?

We tested three Gemma 4 variants. Two of them failed.

Three attempts, two dead ends

E4B (4B dense): Too small

The smallest variant simply lacked the capacity. SFT produced -8.7 percentage points. The model lost logic and reasoning in the attempt to learn more Norwegian. We tried twice with different configurations. Catastrophic forgetting both times. A 4B model cannot absorb a new language domain without losing what it already knows.

26B MoE: Partially protected, but not enough

The MoE variant was promising. The expert weights (22.84B of the parameters) remained automatically frozen, as standard LoRA cannot target 3D parameter tensors. Only 0.23% of the model was modified.

It was not enough. SFT produced -4.2 percentage points. Better than E4B, but still negative. The attention layers — the only ones LoRA actually modifies — turned out to be critical for truthfulness. The model lost 10 percentage points on NorTruthfulQA. Changing how a model thinks is more dangerous than changing what it knows.

31B Dense: The breakthrough

Then we tested the dense 31B variant, all 31 billion parameters active at all times. We changed strategy entirely: PiSSA initialization instead of random LoRA, surgical layer selection, and a minimal dataset.

The result: 83.6% average on NorEval, with no damage to existing capabilities. It was the first time SFT actually helped.

The approach: Surgical precision

After five failed attempts across three models, we had a clear picture of what does not work. The successful approach was built on three insights:

PiSSA initialization. Standard LoRA starts training from random directions in weight space. PiSSA (Principal Singular values and Singular vectors Adaptation) uses SVD decomposition to start from the most important directions in the model's existing weights. The difference is dramatic: instead of overwriting knowledge, you refine it.
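A minimal numpy sketch of the idea (not our training code): the weight is split into a frozen residual plus a rank-r adapter built from the top singular directions, so the forward pass is unchanged at initialization and training starts from the most important directions instead of random ones.

```python
import numpy as np

def pissa_init(W, r):
    """Split weight W into a frozen residual and a trainable
    rank-r adapter (A @ B) initialized from the principal
    singular directions of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * np.sqrt(S[:r])            # (out, r), column-scaled
    B = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, in), row-scaled
    W_res = W - A @ B                        # frozen residual weight
    return W_res, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
W_res, A, B = pissa_init(W, r=8)
# The decomposition is exact: the model behaves identically at step 0.
assert np.allclose(W_res + A @ B, W)
```

Because A @ B starts as the top-r part of W itself rather than noise, gradient updates refine existing directions instead of overwriting them.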

Layer-selective training. Gemma 4 31B has 60 layers: 50 sliding attention layers and 10 global attention layers. The global layers handle long-range reasoning and truthfulness. We froze them entirely and trained only the 50 sliding layers, and only q_proj and v_proj. A total of 9.2 million trainable parameters, 0.03% of the model.
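As a rough illustration of the scale involved, here is the standard LoRA parameter-count formula applied to this setup. The projection dimensions below are hypothetical placeholders, since the actual Gemma 4 31B shapes are not given in this article:

```python
def lora_param_count(r, shapes, n_layers):
    """Trainable parameters for rank-r LoRA adapters applied to the
    given (in_features, out_features) projections in each trained layer.
    Each adapter contributes r * (in + out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# Hypothetical shapes for illustration only.
shapes = [(5120, 4096),   # q_proj: hidden -> query projection
          (5120, 1024)]   # v_proj: hidden -> value projection
total = lora_param_count(r=8, shapes=shapes, n_layers=50)
print(f"{total:,}")  # 6,144,000 -- single-digit millions, as in the article
```

With r=8 and only two projections per layer, the count lands in the single-digit millions regardless of the exact dimensions, which is why 99.97% of the model stays untouched.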

Quality over quantity. 3,230 curated examples outperformed 96,804 broad examples. The key was the composition: 67% Bokmål, 31% Nynorsk, 2% English anti-forgetting, and — critically — 0% translation data. In the failed attempts, BM→NN translation made up 44% of the data and overwrote reasoning ability.
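The target composition translates into concrete example counts as follows (a trivial sketch; the rounding is ours, not from the build log):

```python
def compose(total, weights):
    """Split a target dataset size according to composition weights."""
    return {name: round(total * w) for name, w in weights.items()}

mix = compose(3230, {
    "bokmaal": 0.67,
    "nynorsk": 0.31,
    "english_anti_forgetting": 0.02,
    # translation: 0.00 -- deliberately absent
})
print(mix)  # {'bokmaal': 2164, 'nynorsk': 1001, 'english_anti_forgetting': 65}
```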

Model: Google Gemma 4 31B Dense (50 sliding + 10 global layers)
Adapter: PiSSA r=8, alpha=16 (q_proj + v_proj, sliding layers only)
Trainable params: 9.2M (0.03% of 31.3B)
Training data: 3,230 curated examples
Hardware: 2x NVIDIA H100 80 GB
LR: 5e-6, NEFTune alpha 5, weight decay 0.01
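The configuration above could be expressed with the peft and trl libraries roughly as follows. This is a sketch, not our actual training script; in particular, the sliding/global layer index pattern below is a placeholder, since the real pattern depends on the released model config:

```python
from peft import LoraConfig
from trl import SFTConfig

# Hypothetical layout: assume every 6th layer is global attention,
# giving 50 sliding + 10 global layers out of 60.
sliding_layers = [i for i in range(60) if i % 6 != 5]

adapter = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=sliding_layers,   # global layers stay frozen
    init_lora_weights="pissa",            # SVD-based PiSSA initialization
)

training = SFTConfig(
    output_dir="checkpoints",
    learning_rate=5e-6,
    neftune_noise_alpha=5,                # NEFTune embedding noise
    weight_decay=0.01,
)
```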

Results

Task                     NorskGemma4-31B   NorskMistral-119B   NorEval best*
Commonsense BM           85.4%             75.7%               72.2%
Commonsense NN           73.7%             63.2%               52.6%
Open-book QA BM          96.5%             95.7%               87.4%
Open-book QA NN          94.4%             93.3%               88.9%
Truthfulness BM          85.7%             77.9%               74.6%
Truthfulness NN          93.0%             82.5%               73.7%
Norwegian knowledge BM   70.9%             66.5%               63.7%
Norwegian knowledge NN   69.6%             65.1%               71.9%
Average                  83.6%             76.8%               73.1%

*Best published model in the NorEval paper (UiO, ACL 2025)

NorskGemma4 outperforms NorskMistral on all 8 tasks. The largest margins are in truthfulness (+10-11pp) and commonsense reasoning (+10pp). The model is exceptionally strong in Nynorsk. 93.0% on truthfulness is the highest result in the entire benchmark.

Contamination check

We formally verified that the training data does not overlap with the NorEval test set. 6,445 training segments were checked against 18,124 test texts from all 8 tasks, using three methods: exact matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

The result: zero contamination. No exact matches, no substring matches, no suspicious n-gram overlaps. The benchmark results reflect the model's genuine capacity.
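A simplified sketch of the three checks on a toy example (the real pipeline ran 6,445 segments against 18,124 texts):

```python
def char_ngrams(text, n):
    """All character-level n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def contaminated(train_seg, test_texts, n=30):
    """Flag a training segment that exactly matches, is a substring of
    (or contains), or shares a long character n-gram with any test text."""
    grams = char_ngrams(train_seg, n)
    for t in test_texts:
        if train_seg == t:                       # exact match
            return True
        if train_seg in t or t in train_seg:     # substring match
            return True
        if grams & char_ngrams(t, n):            # n-gram overlap
            return True
    return False

test_set = ["Hva er hovedstaden i Norge? Oslo er hovedstaden i Norge."]
assert contaminated(test_set[0], test_set)
assert not contaminated("Helt urelatert treningstekst.", test_set)
```

Long character n-grams (30 and 50 here and in the article) catch near-verbatim reuse that survives minor edits, which exact and substring matching would miss.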

What worked and what did not

What worked

  • PiSSA initialization: SVD-based LoRA init preserved knowledge where random init destroyed it.
  • Layer-selective training: freezing the 10 global attention layers protected truthfulness and long-range reasoning.
  • Minimal, curated dataset: 3,230 examples with the right composition outperformed 96,804 broad examples.
  • Multi-prompt evaluation: testing 5 prompt variants per task prevents a single poorly worded prompt from producing misleading results.
  • NEFTune noise regularization: alpha=5 produced smoother generalization without overfitting.

What did not work

  • SFT on small models (E4B 4B): catastrophic forgetting regardless of configuration. The model did not have the capacity.
  • SFT on MoE attention layers (26B): even with frozen experts, truthfulness collapsed. Attention modifications are too risky.
  • Preference optimization (IPO/DPO): zero effect with synthetic preference data. Real human preferences are likely necessary.
  • Large translation datasets: 44% BM→NN translation overwrote reasoning ability. 0% worked better.

What we learned

1. Modern models already know Norwegian. Google's multilingual training on 140+ languages means Gemma 4 already understands Norwegian well. Our job was to refine, not to teach. That changes the entire approach: minimal intervention with surgical precision, not massive retraining.

2. Less data, better results. 3,230 curated examples outperformed 96,804 broad examples. Composition and quality trump volume. Particularly destructive was translation data, which overwrote the model's reasoning ability.

3. Architecture determines training strategy. Dense models responded to surgical SFT. The MoE models' shared attention layers were too fragile. The 4B model was simply too small. Each architecture requires its own approach. There is no universal recipe.

4. Accessibility is a feature. In Q4_K_M quantization the model is 18 GB, small enough to run on a MacBook Pro with 32 GB RAM or a single gaming GPU. A model no one can run is a model no one uses.
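The file size is easy to sanity-check with back-of-the-envelope arithmetic. The bits-per-weight figures below are typical averages for these quantization formats, not exact values for this model, so the estimates land slightly above the published sizes:

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate quantized model file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed averages: Q4_K_M ~4.8 bits/weight, Q8_0 ~8.5 bits/weight.
print(round(gguf_size_gb(31.3e9, 4.8)))  # 19 -- near the published 18 GB
print(round(gguf_size_gb(31.3e9, 8.5)))  # 33 -- near the published 31 GB
```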

Try the model yourself

m51Lab-NorskGemma4-31B is open source under the Apache 2.0 license. Download and run it locally with Ollama, LM Studio, or llama.cpp.

Full model: m51Lab-NorskGemma4-31B on HuggingFace

GGUF (local use): m51Lab-NorskGemma4-31B-GGUF

Q4_K_M (18 GB) is recommended for most users. Q8_0 (31 GB) for higher quality.

About m51

m51.ai builds AI solutions for Norwegian businesses. NorskGemma4 and NorskMistral demonstrate what a small, focused team can achieve with the right approach and modern tools. The same expertise powers M51 AI OS, the platform that gives marketing teams and agencies access to specialized AI agents for content production, SEO, advertising, and campaign optimization.

Have an AI project you would like to discuss? Get in touch at [email protected].


Technical details, the complete build log, and all training scripts are available in the project's GitHub repository.

Want to see how the models perform on real Norwegian business questions?

Read the market test: Can Norwegian-trained AI models compete with GPT and Gemini?
