We built Norway's best open-source language models. m51Lab-NorskMistral-119B tops the NorEval benchmark on 7 of 8 tasks. m51Lab-NorskGemma4-31B scores 83.6%, higher than any published Norwegian model.
But do they know anything about Norwegian companies?
We pitted both models against GPT 5.4, Claude Opus 4.6, and Gemini 3.1 Pro on the same 66 questions about the Norwegian market that we used in part 1 of this series. No internet. No tools. Only what the models had learned during training.
Read part 1: How well does AI know the Norwegian market?
The results surprised us.
Background: Two models, two approaches
In April 2026, m51.ai Lab launched two open-source language models trained specifically for Norwegian:
m51Lab-NorskMistral-119B is based on Mistral Small 4, a Mixture-of-Experts model with 119 billion parameters (6 billion active per token). Fine-tuned with LoRA on 13,375 Norwegian examples distributed across 7 NVIDIA H100 GPUs. Scores 76.8% average on NorEval and beats all published models on 7 of 8 tasks.
Read the full story: How we built Norway's best open-source AI
m51Lab-NorskGemma4-31B is based on Google's Gemma 4 31B-it, a dense model with 31 billion parameters. Fine-tuned with PiSSA and surgical LoRA on only 3,230 carefully curated Norwegian examples, on 2 H100 GPUs. Scores 83.6% on NorEval, the highest published score for a Norwegian model.
Read the full story: Smaller model, better result. NorskGemma4-31B
Both are open source under the Apache 2.0 license.
The question we wanted to answer: Does a good score on academic benchmarks mean the models actually understand the Norwegian business landscape?
How we tested
Same method as in part 1: 66 questions split into two tests.
Test 1: 41 general questions about the Norwegian market: large companies, mid-sized companies, Norwegian products, public holidays, the marketing industry, and three targeted "hallucination traps".
Test 2: 25 questions about real Norwegian companies from an actual customer list. From Tufte Wear to Accountflow. No hints given to the verifier.
All answers were verified by Claude Opus 4.6 with web search and scored from 0 (completely wrong) to 3 (completely correct). All five models received an identical system prompt and temperature 0.
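For reference, the per-question scores roll up into the per-model figures in the tables below as follows. A minimal sketch with toy data (the real answer sheet is not reproduced here):

```python
def summarize(scores: list[int], hallucinated: list[bool]) -> dict:
    """Roll per-question scores (0 = completely wrong .. 3 = completely
    correct) up into the three figures reported per model below."""
    avg = sum(scores) / len(scores)
    return {
        "average": round(avg, 2),            # e.g. 2.11 (out of 3)
        "percent": round(avg / 3 * 100),     # e.g. 70
        "hallucinations": sum(hallucinated), # answers with fabricated facts
    }

# Toy data: four questions, one fabricated answer scored 0.
demo = summarize([3, 2, 3, 0], [False, False, False, True])
# demo == {"average": 2.0, "percent": 67, "hallucinations": 1}
```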
The results: The overall picture
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| GPT 5.4 | 2.11/3 | 70% | 1 |
| Claude Opus 4.6 | 1.91/3 | 64% | 13 |
| Gemini 3.1 Pro | 1.88/3 | 63% | 0 |
| NorskGemma4 31B | 1.50/3 | 50% | 7 |
| Norsk Mistral 119B | 0.79/3 | 26% | 51 |
The overall picture is clear: the commercial models are significantly better on market knowledge. But the numbers hide a more nuanced story.
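The overall averages are question-weighted means of the two sub-tests (41 general + 25 company questions). A quick sanity check in Python, using the per-test averages reported in the sections that follow:

```python
GENERAL_Q, COMPANY_Q = 41, 25  # questions in test 1 and test 2

def overall(avg_general: float, avg_company: float) -> float:
    """Question-weighted overall average across the two sub-tests."""
    total = GENERAL_Q * avg_general + COMPANY_Q * avg_company
    return round(total / (GENERAL_Q + COMPANY_Q), 2)

gpt = overall(2.27, 1.84)     # -> 2.11, matching the table above
gemini = overall(1.78, 2.04)  # -> 1.88
```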
Test 1: General questions. Gemma surprises
On the 41 general questions — things like Equinor, Vipps and 17 May — something interesting happens:
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| GPT 5.4 | 2.27/3 | 76% | 0 |
| Claude Opus 4.6 | 2.07/3 | 69% | 9 |
| NorskGemma4 31B | 1.88/3 | 62% | 5 |
| Gemini 3.1 Pro | 1.78/3 | 59% | 0 |
| Norsk Mistral 119B | 1.12/3 | 37% | 27 |
NorskGemma4 beats Gemini 3.1 Pro. A 31B open-source model fine-tuned on 3,230 examples scores higher on Norwegian market knowledge than Google's flagship model.
Where Gemma excels
| Category | GPT 5.4 | Claude | Gemini | Gemma | Mistral |
|---|---|---|---|---|---|
| Large companies | 2.3 | 2.4 | 1.9 | 2.1 | 1.4 |
| Mid-sized | 2.5 | 2.2 | 2.0 | 2.2 | 1.5 |
| Products & brands | 2.2 | 1.8 | 1.5 | 2.2 | 1.2 |
| Public holidays & culture | 2.5 | 2.3 | 1.7 | 2.3 | 1.7 |
| Marketing | 2.4 | 2.2 | 1.8 | 1.4 | 0.6 |
| Own questions | 1.3 | 0.7 | 1.3 | 1.0 | 0.3 |
Gemma beats Gemini in four of six categories. On Norwegian products and public holidays it matches or beats Claude. That is a remarkable achievement for a model with a fraction of the parameters and training data of the commercial alternatives.
Here is an example — the question "How is 17 May celebrated in Norway?":
NorskGemma4 31B (score 3/3): "17 May, Norway's Constitution Day, is celebrated with a combination of official ceremonies and popular traditions. (...) The children's parade is the heart of the celebration, long processions of schoolchildren marching through the city streets. (...) The bunad, Norway's traditional folk costume, is the preferred attire."
Correct, detailed and nuanced. Full score.
Test 2: Real companies. The open models collapse
Then came the company test. And here the picture breaks down.
| Model | Average | Percent | Hallucinations |
|---|---|---|---|
| Gemini 3.1 Pro | 2.04/3 | 68% | 0 |
| GPT 5.4 | 1.84/3 | 61% | 1 |
| Claude Opus 4.6 | 1.64/3 | 55% | 4 |
| NorskGemma4 31B | 0.56/3 | 19% | 2 |
| Norsk Mistral 119B | 0.24/3 | 8% | 24 |
NorskGemma4 drops from 62% to 19%. Norsk Mistral drops from 37% to 8%.
The drop from general to specific knowledge
| Model | General test | Customer test | Change |
|---|---|---|---|
| Gemini 3.1 Pro | 59% | 68% | +9 pp |
| GPT 5.4 | 76% | 61% | -15 pp |
| Claude Opus 4.6 | 69% | 55% | -14 pp |
| NorskGemma4 31B | 62% | 19% | -43 pp |
| Norsk Mistral 119B | 37% | 8% | -29 pp |
The open-source models fall dramatically more than the commercial ones. Gemma loses 43 percentage points. That is a fundamental gap, not a gradual decline.
The explanation is simple: the commercial models are trained on enormous amounts of web data that includes company registries, news articles and industry overviews. Our models were fine-tuned for Norwegian language comprehension, not for company knowledge.
Hallucination: When the model invents reality
Norsk Mistral 119B hallucinated in 51 of 66 answers. 77% of all answers contained fabricated information. On the company test the figure was 24 of 25, a full 96%.
Here are some examples that illustrate the pattern:
Swims → swimwear
Question: What is Swims AS, and what are they known for?
Norsk Mistral (score 0/3): "Swims AS is a Norwegian company known for producing and selling men's swimwear. The company was founded in 2010 and has its headquarters in Oslo."
Completely wrong. Swims makes modern overshoes (galoshes), not swimwear. The name "Swims" apparently triggered an association with "swimwear" during the model's training.
Drytech → building drying
Question: What is Drytech AS?
Norsk Mistral (score 0/3): "Drytech AS is a Norwegian company that specializes in drying technology and the drying of buildings."
Drytech makes freeze-dried hiking food (REAL Turmat). The model interpreted "Dry" + "tech" as drying technology, and constructed an entire company description based on this misinterpretation.
M51 → functionalist architecture
Question: What do you know about M51 in Oslo?
Norsk Mistral (score 0/3): "M51 is one of Oslo's best-known buildings. It is located at Munkedamsveien 51 and is one of the few examples of Norwegian functionalism from the interwar period. The building was designed by the architect Georg Greve in 1936."
Completely fabricated. M51 is a digital marketing agency (m51.ai). The model constructed a detailed architectural history by combining the format "M51" with "Munkedamsveien 51" — a connection that does not exist.
Restaurant Kontrast → wrong chef, wrong address, wrong year
Norsk Mistral (score 1/3): "Restaurant Kontrast was established in 2010 by Eyvind Hellstrøm and is located in Akersgata. The interior was designed by Snøhetta."
Almost everything is wrong. Kontrast opened in 2013, by Mikael Svensson, in the Vulkan area. Eyvind Hellstrøm runs Bagatelle. The model substituted the actual chef with a more famous Norwegian chef — a classic hallucination pattern.
The pattern
The hallucinations are not random. They follow three consistent patterns:
- Name association: The model interprets the company name literally (Swims → swimming, Drytech → drying)
- Celebrity substitution: Unknown individuals are replaced by well-known ones (Mikael Svensson → Eyvind Hellstrøm)
- Plausible construction: The model builds a convincing but entirely fabricated story complete with years, addresses, and details
NorskGemma4 behaves differently. It hallucinates less often (7 vs 51), and when it does not know the answer, it more frequently says "I don't know" — a far safer behavior.
The paradox: NorEval vs. market knowledge
Here is the central finding: academic benchmarks do not measure what businesses actually care about.
| Model | NorEval | Market test (general) | Market test (businesses) |
|---|---|---|---|
| NorskGemma4 31B | 83.6% | 62% | 19% |
| Norsk Mistral 119B | 76.8% | 37% | 8% |
NorskGemma4 scores 83.6% on NorEval, which measures Norwegian grammar, common sense, truthfulness, and general knowledge. That is the highest published score for a Norwegian model. But on questions about real Norwegian businesses, it collapses to 19%.
NorEval measures language comprehension. The market test measures breadth of knowledge. These are two fundamentally different things.
Our models are exceptionally good at understanding Norwegian. They know what brunost is, how May 17th is celebrated, and what Equinor used to be called. But they do not know what Swims makes, who founded Restaurant Kontrast, or what Drytech does — because that information simply was not in the training data.
What does this mean for GEO?
1. Even specialized Norwegian models need web search
If models trained specifically for Norwegian do not know Norwegian businesses, then no model does so reliably without tools. All models depend on web search for business-specific knowledge. GEO visibility is not optional.
2. The hallucination risk is real — and amplified by open-source
Norsk Mistral hallucinated in 77% of its answers. In a world where open-source models are used in an ever-growing number of applications — chatbots, customer support, internal systems — this means businesses can be misrepresented at scale, without the user ever knowing.
3. Language comprehension ≠ market knowledge
A model that scores 83.6% on NorEval can still score 19% on real business questions. Do not let academic benchmarks provide false reassurance. Test the model on what you actually need it for.
4. Gemma shows that open-source has potential
NorskGemma4 beat Gemini on general questions. It is more honest than Mistral (says "don't know" instead of hallucinating). With the right data — for example structured business information via llms.txt — open-source models can become strong alternatives for Norwegian market knowledge.
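llms.txt is an emerging convention for publishing an LLM-friendly site summary as plain markdown at /llms.txt. A hypothetical sketch for a company like Drytech (contents illustrative, drawn from the facts above; the URL is a placeholder):

```text
# Drytech AS

> Norwegian producer of freeze-dried hiking food, sold under the
> REAL Turmat brand. Not a building-drying company.

## Products
- [REAL Turmat](https://example.com/turmat): freeze-dried outdoor meals
```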
5. What you put online is what AI finds
No model — commercial or open-source — had reliable knowledge about Norwegian niche businesses. The difference is that commercial models with web search can find it if it exists. Make sure your business is visible, correctly described, and well structured where AI models search.
Summary
| | Commercial | NorskGemma4 | Norsk Mistral |
|---|---|---|---|
| General Norwegian knowledge | Good (59–76%) | Surprisingly good (62%) | Weak (37%) |
| Business knowledge | Moderate (55–68%) | Weak (19%) | Collapses (8%) |
| Hallucination | Low–moderate | Low | Very high |
| NorEval | Not tested | 83.6% (best) | 76.8% (second best) |
| Honest when uncertain | Varies | Yes, often | No, makes things up |
NorskGemma4 is an impressive model for Norwegian language comprehension, and it outperforms Gemini on general questions about Norway. But for business-specific knowledge, it is — like all models without web search — insufficient.
Norsk Mistral 119B, despite beating all published models on 7 of 8 NorEval tasks, has a serious hallucination problem that makes it unreliable for factual questions about Norwegian businesses.
The conclusion is the same as in part 1, but stronger: regardless of which model your users rely on — commercial or open-source — it is what they find online about you that determines the answer they receive.
GEO is not a technical curiosity. It is a necessity.
It is this insight that drives the development of M51 AI OS — the platform where specialized AI agents make GEO and SEO an integrated part of marketing work. The agents understand the limitations we uncovered in this test, and compensate with context from the customer's own data, real-time web information, and deep knowledge of the Norwegian business landscape.
Methodology
- Models: GPT 5.4 (OpenAI), Claude Opus 4.6 (Anthropic), Gemini 3.1 Pro (Google), m51Lab-NorskMistral-119B (m51.ai Lab), m51Lab-NorskGemma4-31B (m51.ai Lab)
- Open-source models were run via Ollama and llama-server on an NVIDIA H100 80GB GPU via RunPod, using GGUF Q4_K_M quantization
- 66 questions split into 41 general + 25 business-specific
- Verification: Claude Opus 4.6 with web search, scoring 0–3
- Temperature: 0 for all models
- System prompt: Identical for all, own knowledge only, no tools
- NorskGemma4 answered 56 of 66 questions (the remaining customer-test questions were lost to an interrupted run)
- Limitation: The verifier is itself an AI model and may have systematic errors. A sample of answers was checked manually.
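For anyone reproducing the setup, a deterministic query against a locally served open model might look like the sketch below, using Ollama's /api/generate endpoint. The model tag and system prompt are placeholders, not the exact ones used in our test:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, system: str, prompt: str) -> dict:
    """Request body for a single deterministic query:
    temperature 0, no streaming, no tools or web access."""
    return {
        "model": model,
        "system": system,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},
    }

def ask(body: dict) -> str:
    """POST the query to a locally running Ollama server (not called here)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Placeholder model tag and system prompt.
body = build_request(
    "norskgemma4-31b:q4_k_m",
    "Answer only from your own knowledge. Do not guess.",
    "Hva er Drytech AS?",  # "What is Drytech AS?"
)
```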
Both open-source models are available on HuggingFace under the Apache 2.0 license.
Download NorskMistral-119B from HuggingFace
Download NorskGemma4-31B from HuggingFace
Test conducted April 2026 by m51.ai Lab.
This article is part 2 of the GEO series from m51.ai Lab, where we examine how generative AI affects visibility, marketing, and business in Norway.