AI Lab · April 2026 · 10 min

Smaller model, better results: How we built NorskGemma4-31B

83.6% on NorEval, the highest published score for a Norwegian model.

We had barely finished publishing m51Lab-NorskMistral-119B, ranked #1 on NorEval with 76.8%, when Google released Gemma 4: a new generation of models with support for 140+ languages from the start.

Read about NorskMistral-119B: How we built Norway's best open-source AI

We had to test it.

m51Lab-NorskGemma4-31B scores 83.6% on NorEval. That is 6.8 percentage points above our own Mistral model and almost 11pp above the previous best published model. With 4x fewer parameters and a quarter of the training data.


Background: NorEval and the starting point

NorEval is the Norwegian standard for evaluating language models, developed by the University of Oslo and published at ACL 2025. The benchmark covers 24 datasets across 9 categories, ranging from commonsense reasoning and factual knowledge to truthfulness and reading comprehension. It tests both Bokmål and Nynorsk.

Our Mistral model had just set a new record with 76.8% average, up from NorMistral-11B-thinking's 73.1%. With Gemma 4 on the table, the question was: can we do it again, with a much smaller model?

We tested three Gemma 4 variants. Two of them failed.

Three attempts, two dead ends

E4B (4B dense): Too small

The smallest variant simply lacked the capacity. SFT produced -8.7 percentage points. The model lost logic and reasoning in the attempt to learn more Norwegian. We tried twice with different configurations. Catastrophic forgetting both times. A 4B model cannot absorb a new language domain without losing what it already knows.

26B MoE: Partially protected, but not enough

The MoE variant was promising. The expert weights (22.84B of the parameters) remained automatically frozen, as standard LoRA cannot target 3D parameter tensors. Only 0.23% of the model was modified.

It was not enough. SFT produced -4.2 percentage points. Better than E4B, but still negative. The attention layers — the only ones LoRA actually modifies — turned out to be critical for truthfulness. The model lost 10 percentage points on NorTruthfulQA. Changing how a model thinks is more dangerous than changing what it knows.

31B Dense: The breakthrough

Then we tested the dense 31B variant, all 31 billion parameters active at all times. We changed strategy entirely: PiSSA initialization instead of random LoRA, surgical layer selection, and a minimal dataset.

The result: 83.6% average on NorEval, with no damage to existing capabilities. It was the first time SFT actually helped.

The approach: Surgical precision

After five failed attempts across three models, we had a clear picture of what does not work. The successful approach was built on three insights:

PiSSA initialization. Standard LoRA starts training from random directions in weight space. PiSSA (Principal Singular values and Singular vectors Adaptation) uses SVD decomposition to start from the most important directions in the model's existing weights. The difference is dramatic: instead of overwriting knowledge, you refine it.
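A minimal numpy sketch of the idea (not our training code): the weight is split into a frozen residual plus a rank-r adapter built from the top singular directions, so the forward pass is unchanged at initialization and training starts from the most important directions instead of random ones.

```python
import numpy as np

def pissa_init(W, r):
    """Split weight W into a frozen residual and a trainable
    rank-r adapter (A @ B) initialized from the principal
    singular directions of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * np.sqrt(S[:r])            # (out, r), column-scaled
    B = np.sqrt(S[:r])[:, None] * Vt[:r]     # (r, in), row-scaled
    W_res = W - A @ B                        # frozen residual weight
    return W_res, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
W_res, A, B = pissa_init(W, r=8)
# The decomposition is exact: the model behaves identically at step 0.
assert np.allclose(W_res + A @ B, W)
```

Because A @ B starts as the top-r part of W itself rather than noise, gradient updates refine existing directions instead of overwriting them.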

Layer-selective training. Gemma 4 31B has 60 layers: 50 sliding attention layers and 10 global attention layers. The global layers handle long-range reasoning and truthfulness. We froze them entirely and trained only the 50 sliding layers, and only q_proj and v_proj. A total of 9.2 million trainable parameters, 0.03% of the model.
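As a rough illustration of the scale involved, here is the standard LoRA parameter-count formula applied to this setup. The projection dimensions below are hypothetical placeholders, since the actual Gemma 4 31B shapes are not given in this article:

```python
def lora_param_count(r, shapes, n_layers):
    """Trainable parameters for rank-r LoRA adapters applied to the
    given (in_features, out_features) projections in each trained layer.
    Each adapter contributes r * (in + out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# Hypothetical shapes for illustration only.
shapes = [(5120, 4096),   # q_proj: hidden -> query projection
          (5120, 1024)]   # v_proj: hidden -> value projection
total = lora_param_count(r=8, shapes=shapes, n_layers=50)
print(f"{total:,}")  # 6,144,000 -- single-digit millions, as in the article
```

With r=8 and only two projections per layer, the count lands in the single-digit millions regardless of the exact dimensions, which is why 99.97% of the model stays untouched.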

Quality over quantity. 3,230 curated examples outperformed 96,804 broad examples. The key was the composition: 67% Bokmål, 31% Nynorsk, 2% English anti-forgetting, and — critically — 0% translation data. In the failed attempts, BM→NN translation made up 44% of the data and overwrote reasoning ability.
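The target composition translates into concrete example counts as follows (a trivial sketch; the rounding is ours, not from the build log):

```python
def compose(total, weights):
    """Split a target dataset size according to composition weights."""
    return {name: round(total * w) for name, w in weights.items()}

mix = compose(3230, {
    "bokmaal": 0.67,
    "nynorsk": 0.31,
    "english_anti_forgetting": 0.02,
    # translation: 0.00 -- deliberately absent
})
print(mix)  # {'bokmaal': 2164, 'nynorsk': 1001, 'english_anti_forgetting': 65}
```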

Model: Google Gemma 4 31B Dense (50 sliding + 10 global layers)
Adapter: PiSSA r=8, alpha=16 (q_proj + v_proj, sliding layers only)
Trainable params: 9.2M (0.03% of 31.3B)
Training data: 3,230 curated examples
Hardware: 2x NVIDIA H100 80 GB
LR: 5e-6, NEFTune alpha 5, weight decay 0.01
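The configuration above could be expressed with the peft and trl libraries roughly as follows. This is a sketch, not our actual training script; in particular, the sliding/global layer index pattern below is a placeholder, since the real pattern depends on the released model config:

```python
from peft import LoraConfig
from trl import SFTConfig

# Hypothetical layout: assume every 6th layer is global attention,
# giving 50 sliding + 10 global layers out of 60.
sliding_layers = [i for i in range(60) if i % 6 != 5]

adapter = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=sliding_layers,   # global layers stay frozen
    init_lora_weights="pissa",            # SVD-based PiSSA initialization
)

training = SFTConfig(
    output_dir="checkpoints",
    learning_rate=5e-6,
    neftune_noise_alpha=5,                # NEFTune embedding noise
    weight_decay=0.01,
)
```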

Results

Task                     NorskGemma4-31B   NorskMistral-119B   NorEval best*
Commonsense BM           85.4%             75.7%               72.2%
Commonsense NN           73.7%             63.2%               52.6%
Open-book QA BM          96.5%             95.7%               87.4%
Open-book QA NN          94.4%             93.3%               88.9%
Truthfulness BM          85.7%             77.9%               74.6%
Truthfulness NN          93.0%             82.5%               73.7%
Norwegian knowledge BM   70.9%             66.5%               63.7%
Norwegian knowledge NN   69.6%             65.1%               71.9%
Average                  83.6%             76.8%               73.1%

*Best published model in the NorEval paper (UiO, ACL 2025)

NorskGemma4 outperforms NorskMistral on all 8 tasks. The largest margins are in truthfulness (+10-11pp) and commonsense reasoning (+10pp). The model is exceptionally strong in Nynorsk. 93.0% on truthfulness is the highest result in the entire benchmark.

Contamination check

We formally verified that the training data does not overlap with the NorEval test set. 6,445 training segments were checked against 18,124 test texts from all 8 tasks, using three methods: exact matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

The result: zero contamination. No exact matches, no substring matches, no suspicious n-gram overlaps. The benchmark results reflect the model's genuine capacity.
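A simplified sketch of the three checks on a toy example (the real pipeline ran 6,445 segments against 18,124 texts):

```python
def char_ngrams(text, n):
    """All character-level n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def contaminated(train_seg, test_texts, n=30):
    """Flag a training segment that exactly matches, is a substring of
    (or contains), or shares a long character n-gram with any test text."""
    grams = char_ngrams(train_seg, n)
    for t in test_texts:
        if train_seg == t:                       # exact match
            return True
        if train_seg in t or t in train_seg:     # substring match
            return True
        if grams & char_ngrams(t, n):            # n-gram overlap
            return True
    return False

test_set = ["Hva er hovedstaden i Norge? Oslo er hovedstaden i Norge."]
assert contaminated(test_set[0], test_set)
assert not contaminated("Helt urelatert treningstekst.", test_set)
```

Long character n-grams (30 and 50 here and in the article) catch near-verbatim reuse that survives minor edits, which exact and substring matching would miss.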

What worked and what did not

What worked

  • PiSSA initialization: SVD-based LoRA init preserved knowledge where random init destroyed it.
  • Layer-selective training: freezing the 10 global attention layers protected truthfulness and long-range reasoning.
  • Minimal, curated dataset: 3,230 examples with the right composition outperformed 96,804 broad examples.
  • Multi-prompt evaluation: testing 5 prompt variants per task prevents a single poorly worded prompt from producing misleading results.
  • NEFTune noise regularization: alpha=5 produced smoother generalization without overfitting.

What did not work

  • SFT on small models (E4B 4B): catastrophic forgetting regardless of configuration. The model did not have the capacity.
  • SFT on MoE attention layers (26B): even with frozen experts, truthfulness collapsed. Attention modifications are too risky.
  • Preference optimization (IPO/DPO): zero effect with synthetic preference data. Real human preferences are likely necessary.
  • Large translation datasets: 44% BM→NN translation overwrote reasoning ability. 0% worked better.

What we learned

1. Modern models already know Norwegian. Google's multilingual training on 140+ languages means Gemma 4 already understands Norwegian well. Our job was to refine, not to teach. That changes the entire approach: minimal intervention with surgical precision, not massive retraining.

2. Less data, better results. 3,230 curated examples outperformed 96,804 broad examples. Composition and quality trump volume. Particularly destructive was translation data, which overwrote the model's reasoning ability.

3. Architecture determines training strategy. Dense models responded to surgical SFT. The MoE models' shared attention layers were too fragile. The 4B model was simply too small. Each architecture requires its own approach. There is no universal recipe.

4. Accessibility is a feature. In Q4_K_M quantization the model is 18 GB, small enough to run on a MacBook Pro with 32 GB RAM or a single gaming GPU. A model no one can run is a model no one uses.
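The file size is easy to sanity-check with back-of-the-envelope arithmetic. The bits-per-weight figures below are typical averages for these quantization formats, not exact values for this model, so the estimates land slightly above the published sizes:

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate quantized model file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed averages: Q4_K_M ~4.8 bits/weight, Q8_0 ~8.5 bits/weight.
print(round(gguf_size_gb(31.3e9, 4.8)))  # 19 -- near the published 18 GB
print(round(gguf_size_gb(31.3e9, 8.5)))  # 33 -- near the published 31 GB
```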

Try the model yourself

m51Lab-NorskGemma4-31B is open source under the Apache 2.0 license. Download and run it locally with Ollama, LM Studio, or llama.cpp.

Full model: m51Lab-NorskGemma4-31B on HuggingFace

GGUF (local use): m51Lab-NorskGemma4-31B-GGUF

Q4_K_M (18 GB) is recommended for most users. Q8_0 (31 GB) for higher quality.

About m51

m51.ai builds AI solutions for Norwegian businesses. NorskGemma4 and NorskMistral demonstrate what a small, focused team can achieve with the right approach and modern tools. The same expertise powers M51 AI OS, the platform that gives marketing teams and agencies access to specialized AI agents for content production, SEO, advertising, and campaign optimization.

Have an AI project you would like to discuss? Get in touch at [email protected].


Technical details, the complete build log, and all training scripts are available in the project's GitHub repository.

Want to see how the models perform on real Norwegian business questions?

Read the market test: Can Norwegian-trained AI models compete with GPT and Gemini?
