We built Norway's best open-source language models. m51Lab-NorskMistral-119B tops the NorEval benchmark on 7 of 8 tasks. m51Lab-NorskGemma4-31B scores 83.6%, higher than any published Norwegian model.
But do they know anything about Norwegian companies?
We pitted both models against GPT 5.4, Claude Opus 4.6, and Gemini 3.1 Pro on the same 66 questions about the Norwegian market that we used in part 1 of this series. No internet. No tools. Only what the models had learned during training.
Read part 1: How well does AI know the Norwegian market?
The results surprised us.
Background: Two models, two approaches
In April 2026, m51.ai Lab launched two open-source language models trained specifically for Norwegian:
m51Lab-NorskMistral-119B is based on Mistral Small 4, a Mixture-of-Experts model with 119 billion parameters (6 billion active per token). Fine-tuned with LoRA on 13,375 Norwegian examples distributed across 7 NVIDIA H100 GPUs. Scores 76.8% average on NorEval and beats all published models on 7 of 8 tasks.
Read the full story: How we built Norway's best open-source AI
m51Lab-NorskGemma4-31B is based on Google's Gemma 4 31B-it, a dense model with 31 billion parameters. Fine-tuned with PiSSA and surgical LoRA on only 3,230 carefully curated Norwegian examples, on 2 H100 GPUs. Scores 83.6% on NorEval, the highest published score for a Norwegian model.
Read the full story: Smaller model, better result. NorskGemma4-31B
Both are open source under the Apache 2.0 license.
The question we wanted to answer: Does a good score on academic benchmarks mean the models actually understand the Norwegian business landscape?
How we tested
Same method as in part 1: 66 questions split into two tests.
Test 1: 41 general questions about the Norwegian market: large companies, mid-sized companies, Norwegian products, public holidays, the marketing industry, and three targeted "hallucination traps".
Test 2: 25 questions about real Norwegian companies from an actual customer list. From Tufte Wear to Accountflow. No hints given to the verifier.
All answers were verified by Claude Opus 4.6 with web search and scored from 0 (completely wrong) to 3 (completely correct). All five models received an identical system prompt and temperature 0.
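For reference, the per-question scores roll up into the per-model figures in the tables below as follows. A minimal sketch with toy data (the real answer sheet is not reproduced here):

```python
def summarize(scores: list[int], hallucinated: list[bool]) -> dict:
    """Roll per-question scores (0 = completely wrong .. 3 = completely
    correct) up into the three figures reported per model below."""
    avg = sum(scores) / len(scores)
    return {
        "average": round(avg, 2),            # e.g. 2.11 (out of 3)
        "percent": round(avg / 3 * 100),     # e.g. 70
        "hallucinations": sum(hallucinated), # answers with fabricated facts
    }

# Toy data: four questions, one fabricated answer scored 0.
demo = summarize([3, 2, 3, 0], [False, False, False, True])
# demo == {"average": 2.0, "percent": 67, "hallucinations": 1}
```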
The results: The overall picture
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| GPT 5.4 | 2.11/3 | 70% | 1 |
| Claude Opus 4.6 | 1.91/3 | 64% | 13 |
| Gemini 3.1 Pro | 1.88/3 | 63% | 0 |
| NorskGemma4 31B | 1.50/3 | 50% | 7 |
| Norsk Mistral 119B | 0.79/3 | 26% | 51 |
The overall picture is clear: the commercial models are significantly better on market knowledge. But the numbers hide a more nuanced story.
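The overall averages are question-weighted means of the two sub-tests (41 general + 25 company questions). A quick sanity check in Python, using the per-test averages reported in the sections that follow:

```python
GENERAL_Q, COMPANY_Q = 41, 25  # questions in test 1 and test 2

def overall(avg_general: float, avg_company: float) -> float:
    """Question-weighted overall average across the two sub-tests."""
    total = GENERAL_Q * avg_general + COMPANY_Q * avg_company
    return round(total / (GENERAL_Q + COMPANY_Q), 2)

gpt = overall(2.27, 1.84)     # -> 2.11, matching the table above
gemini = overall(1.78, 2.04)  # -> 1.88
```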
Test 1: General questions. Gemma surprises
On the 41 general questions — things like Equinor, Vipps and 17 May — something interesting happens:
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| GPT 5.4 | 2.27/3 | 76% | 0 |
| Claude Opus 4.6 | 2.07/3 | 69% | 9 |
| NorskGemma4 31B | 1.88/3 | 62% | 5 |
| Gemini 3.1 Pro | 1.78/3 | 59% | 0 |
| Norsk Mistral 119B | 1.12/3 | 37% | 27 |
NorskGemma4 beats Gemini 3.1 Pro. A 31B open-source model fine-tuned on 3,230 examples scores higher on Norwegian market knowledge than Google's flagship model.
Where Gemma excels
| Category | GPT 5.4 | Claude | Gemini | Gemma | Mistral |
|---|---|---|---|---|---|
| Large companies | 2.3 | 2.4 | 1.9 | 2.1 | 1.4 |
| Mid-sized | 2.5 | 2.2 | 2.0 | 2.2 | 1.5 |
| Products & brands | 2.2 | 1.8 | 1.5 | 2.2 | 1.2 |
| Public holidays & culture | 2.5 | 2.3 | 1.7 | 2.3 | 1.7 |
| Marketing | 2.4 | 2.2 | 1.8 | 1.4 | 0.6 |
| Own questions | 1.3 | 0.7 | 1.3 | 1.0 | 0.3 |
Gemma beats Gemini in four of six categories. On Norwegian products and public holidays it matches or beats Claude. That is a remarkable achievement for a model with a fraction of the parameters and training data of the commercial alternatives.
Here is an example — the question "How is 17 May celebrated in Norway?":
NorskGemma4 31B (score 3/3): "17 May, Norway's Constitution Day, is celebrated with a combination of official ceremonies and popular traditions. (...) The children's parade is the heart of the celebration, long processions of schoolchildren marching through the city streets. (...) The bunad, Norway's traditional folk costume, is the preferred attire."
Correct, detailed and nuanced. Full score.
Test 2: Real companies. The open models collapse
Then came the company test. And here the picture breaks down.
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| Gemini 3.1 Pro | 2.04/3 | 68% | 0 |
| GPT 5.4 | 1.84/3 | 61% | 1 |
| Claude Opus 4.6 | 1.64/3 | 55% | 4 |
| NorskGemma4 31B | 0.56/3 | 19% | 2 |
| Norsk Mistral 119B | 0.24/3 | 8% | 24 |
NorskGemma4 drops from 62% to 19%. Norsk Mistral drops from 37% to 8%.
The drop from general to specific knowledge
| Model | General test | Customer test | Change |
|---|---|---|---|
| Gemini 3.1 Pro | 59% | 68% | +9 pp |
| GPT 5.4 | 76% | 61% | -15 pp |
| Claude Opus 4.6 | 69% | 55% | -14 pp |
| NorskGemma4 31B | 62% | 19% | -43 pp |
| Norsk Mistral 119B | 37% | 8% | -29 pp |
The open-source models fall dramatically more than the commercial ones. Gemma loses 43 percentage points. That is a fundamental gap, not a gradual decline.
The explanation is simple: the commercial models are trained on enormous amounts of web data that includes company registries, news articles and industry overviews. Our models were fine-tuned for Norwegian language comprehension, not for company knowledge.
Hallucination: When the model invents reality
Norsk Mistral 119B hallucinated in 51 of 66 answers. 77% of all answers contained fabricated information. On the company test the figure was 24 of 25, a full 96%.
Here are some examples that illustrate the pattern:
Swims → swimwear
Question: What is Swims AS, and what are they known for?
Norsk Mistral (score 0/3): "Swims AS is a Norwegian company known for producing and selling men's swimwear. The company was founded in 2010 and has its headquarters in Oslo."
Completely wrong. Swims makes modern overshoes (galoshes), not swimwear. The name "Swims" apparently triggered an association with "swimwear" during the model's training.
Drytech → building drying
Question: What is Drytech AS?
Norsk Mistral (score 0/3): "Drytech AS is a Norwegian company that specializes in drying technology and the drying of buildings."
Drytech makes freeze-dried hiking food (REAL Turmat). The model interpreted "Dry" + "tech" as drying technology, and constructed an entire company description based on this misinterpretation.
M51 → functionalist architecture
Question: What do you know about M51 in Oslo?
Norsk Mistral (score 0/3): "M51 is one of Oslo's best-known buildings. It is located at Munkedamsveien 51 and is one of the few examples of Norwegian functionalism from the interwar period. The building was designed by the architect Georg Greve in 1936."
Completely fabricated. M51 is a digital marketing agency (m51.ai). The model constructed a detailed architectural history by combining the format "M51" with "Munkedamsveien 51" — a connection that does not exist.
Restaurant Kontrast → wrong chef, wrong address, wrong year
Norsk Mistral (score 1/3): "Restaurant Kontrast was established in 2010 by Eyvind Hellstrøm and is located in Akersgata. The interior was designed by Snøhetta."
Almost everything is wrong. Kontrast opened in 2013, by Mikael Svensson, in the Vulkan area. Eyvind Hellstrøm runs Bagatelle. The model substituted the actual chef with a more famous Norwegian chef — a classic hallucination pattern.
The pattern
The hallucinations are not random. They follow three consistent patterns:
- Name association: The model interprets the company name literally (Swims → swimming, Drytech → drying)
- Celebrity substitution: Unknown individuals are replaced by well-known ones (Mikael Svensson → Eyvind Hellstrøm)
- Plausible construction: The model builds a convincing but entirely fabricated story complete with years, addresses, and details
NorskGemma4 behaves differently. It hallucinates less often (7 vs 51), and when it does not know the answer, it more frequently says "I don't know" — a far safer behavior.
The paradox: NorEval vs. market knowledge
Here is the central finding: academic benchmarks do not measure what businesses actually care about.
| Model | NorEval | Market test (general) | Market test (businesses) |
|---|---|---|---|
| NorskGemma4 31B | 83.6% | 62% | 19% |
| Norsk Mistral 119B | 76.8% | 37% | 8% |
NorskGemma4 scores 83.6% on NorEval, which measures Norwegian grammar, common sense, truthfulness, and general knowledge. That is the highest published score for a Norwegian model. But on questions about real Norwegian businesses, it collapses to 19%.
NorEval measures language comprehension. The market test measures breadth of knowledge. These are two fundamentally different things.
Our models are exceptionally good at understanding Norwegian. They know what brunost is, how May 17th is celebrated, and what Equinor used to be called. But they do not know what Swims makes, who founded Restaurant Kontrast, or what Drytech does — because that information simply was not in the training data.
What does this mean for GEO?
1. Even specialized Norwegian models need web search
If models trained specifically for Norwegian do not know Norwegian businesses, then no model does so reliably without tools. All models depend on web search for business-specific knowledge. GEO visibility is not optional.
2. The hallucination risk is real — and amplified by open-source
Norsk Mistral hallucinated in 77% of its answers. In a world where open-source models are used in an ever-growing number of applications — chatbots, customer support, internal systems — this means businesses can be misrepresented at scale, without the user ever knowing.
3. Language comprehension ≠ market knowledge
A model that scores 83.6% on NorEval can still score 19% on real business questions. Do not let academic benchmarks provide false reassurance. Test the model on what you actually need it for.
4. Gemma shows that open-source has potential
NorskGemma4 beat Gemini on general questions. It is more honest than Mistral (says "don't know" instead of hallucinating). With the right data — for example structured business information via llms.txt — open-source models can become strong alternatives for Norwegian market knowledge.
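llms.txt is an emerging convention for publishing an LLM-friendly site summary as plain markdown at /llms.txt. A hypothetical sketch for a company like Drytech (contents illustrative, drawn from the facts above; the URL is a placeholder):

```text
# Drytech AS

> Norwegian producer of freeze-dried hiking food, sold under the
> REAL Turmat brand. Not a building-drying company.

## Products
- [REAL Turmat](https://example.com/turmat): freeze-dried outdoor meals
```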
5. What you put online is what AI finds
No model — commercial or open-source — had reliable knowledge about Norwegian niche businesses. The difference is that commercial models with web search can find it if it exists. Make sure your business is visible, correctly described, and well structured where AI models search.
Summary
| | Commercial | NorskGemma4 | Norsk Mistral |
|---|---|---|---|
| General Norwegian knowledge | Good (59–76%) | Surprisingly good (62%) | Weak (37%) |
| Business knowledge | Moderate (55–68%) | Weak (19%) | Collapses (8%) |
| Hallucination | Low–moderate | Low | Very high |
| NorEval | Not tested | 83.6% (best) | 76.8% (second best) |
| Honest when uncertain | Varies | Yes, often | No, makes things up |
NorskGemma4 is an impressive model for Norwegian language comprehension, and it outperforms Gemini on general questions about Norway. But for business-specific knowledge, it is — like all models without web search — insufficient.
Norsk Mistral 119B, despite beating all published models on 7 of 8 NorEval tasks, has a serious hallucination problem that makes it unreliable for factual questions about Norwegian businesses.
The conclusion is the same as in part 1, but stronger: regardless of which model your users rely on — commercial or open-source — it is what they find online about you that determines the answer they receive.
GEO is not a technical curiosity. It is a necessity.
It is this insight that drives the development of M51 AI OS — the platform where specialized AI agents make GEO and SEO an integrated part of marketing work. The agents understand the limitations we uncovered in this test, and compensate with context from the customer's own data, real-time web information, and deep knowledge of the Norwegian business landscape.
Methodology
- Models: GPT 5.4 (OpenAI), Claude Opus 4.6 (Anthropic), Gemini 3.1 Pro (Google), m51Lab-NorskMistral-119B (m51.ai Lab), m51Lab-NorskGemma4-31B (m51.ai Lab)
- Open-source models were run via Ollama and llama-server on an NVIDIA H100 80GB GPU via RunPod, using GGUF Q4_K_M quantization
- 66 questions split into 41 general + 25 business-specific
- Verification: Claude Opus 4.6 with web search, scoring 0–3
- Temperature: 0 for all models
- System prompt: Identical for all, own knowledge only, no tools
- NorskGemma4 answered 56 of 66 questions (the remaining customer-test questions were lost to an interrupted run)
- Limitation: The verifier is itself an AI model and may have systematic errors. A sample of answers was checked manually.
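For anyone reproducing the setup, a deterministic query against a locally served open model might look like the sketch below, using Ollama's /api/generate endpoint. The model tag and system prompt are placeholders, not the exact ones used in our test:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, system: str, prompt: str) -> dict:
    """Request body for a single deterministic query:
    temperature 0, no streaming, no tools or web access."""
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},
    }

def ask(body: dict) -> str:
    """POST the query to a locally running Ollama server (not called here)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Placeholder model tag and system prompt.
body = build_request(
    "norskgemma4-31b:q4_k_m",
    "Answer only from your own knowledge. Do not guess.",
    "Hva er Drytech AS?",  # "What is Drytech AS?"
)
```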
Both open-source models are available on HuggingFace under the Apache 2.0 license.
Download NorskMistral-119B from HuggingFace
Download NorskGemma4-31B from HuggingFace
Test conducted April 2026 by m51.ai Lab.
This article is part 2 of the GEO series from m51.ai Lab, where we examine how generative AI affects visibility, marketing, and business in Norway.