m51 Lab Research · April 2026 · 20 min

We fine-tuned an open-source AI to write SEO audits - and tested it against Claude Sonnet

805 training examples. 1 epoch. From 13% to 50% win rate. But we still chose to keep Claude Opus in production.

Key Findings (TL;DR)

  • A 31B open-source model (fine-tuned with 805 examples) achieves 40-60% win rate against Claude Sonnet 4.6 on Norwegian SEO audits
  • The open-source model wins on structure and actionability; Sonnet wins on strategic depth and financial quantification
  • Who "wins" depends on the end user: analysts prefer Sonnet, marketing managers prefer v2
  • 805 examples and 1 epoch of training is enough for 100% format compliance on 10-section SEO audits
  • Architecture understanding (data-first vs. tool-use) mattered more than data volume or model size

A 31B open-source model fine-tuned with just 805 examples (~2.5 million tokens) and 1 epoch of training achieves 100% format compliance on 10-section SEO audits and 40-60% win rate against Claude Sonnet 4.6.

Why fine-tune an open-source model for SEO audits?

M51 AI OS is an AI-powered marketing platform that generates professional SEO audits for Norwegian businesses. The platform collects data from Google Search Console, PageSpeed Insights, Moz, GA4, and internal systems, and uses an LLM to synthesize this into structured reports rendered as PDF.

Learn more about the M51 AI OS platform

Until April 2026, the workflow used Claude Opus and Sonnet 4.6 as its primary models. The motivation for exploring an open-source replacement was threefold:

  • Cost: Each audit uses ~$0.50-2.00 in API calls
  • Dependency: No fallback during Anthropic downtime
  • Control: Limited ability for format specialization

First attempt: v1

SeoGemma4 v1 was trained from m51Lab-NorskGemma4-31B (a Norwegian-optimized Gemma 4 with 83.6% NorEval). With 2,590 training examples, v1 achieved an SEO Quality score of 4.08/5, but only 13.3% win rate against Claude Sonnet in pairwise comparison.

The main reason: v1 produced narrative prose without the structured format (Impact/Effort scoring, tables, action plans) that characterizes professional SEO audits.

Read the full story: Smaller model, better results. NorskGemma4-31B

Challenges with v1

  • Machine crash during data generation at 1,475 of 3,060 examples - required crash recovery with 14 parallel agents
  • PyTorch 2.4 compatibility error - required upgrade to 2.8 with specific CUDA 12.8 image
  • Gemma4ForConditionalGeneration doesn't return loss from labels - required custom training_step with manual CrossEntropyLoss
  • Flash Attention incompatible with Gemma 4 (head_dim > 256) - must use eager attention
  • NorskGemma4 base lost native thinking behavior - the model never reasoned before answering
  • packing=True crashes Gemma 4 - a non-negotiable constraint

What changed between v1 and v2?

Base model pivot

The most important decision in v2 was to drop NorskGemma4 as the base and go straight to google/gemma-4-31B-it. Three reasons:

  • Thinking preservation: NorskGemma4's fine-tuning had washed out Gemma 4's native thinking tokens. v2 needed this capability.
  • Native function calling: Gemma 4 has official function calling support that was potentially degraded in NorskGemma4.
  • Norwegian is already good enough: NorEval tests confirmed that base Gemma 4 scores second highest of all tested models on Norwegian.

Architecture discovery: Data-first workflow

A deep analysis of the M51 AI OS workflow revealed that the original v2 plan was fundamentally wrong. We assumed a multi-turn tool-use architecture, but found that the workflow is data-first:

  • 20 data sources are fetched in parallel (Search Console, PageSpeed, GA4, Moz, historical audits, AI visibility data, etc.) - BEFORE the model is called
  • Everything is serialized to compact markdown (~12 KB) by a DataPackage builder
  • The model is called ONCE with a system prompt specifying exactly 10 sections
  • Output is pure markdown - no JSON, no tool calls during the audit

This discovery fundamentally changed the training strategy: instead of 600 multi-turn tool-use examples, we focused on audit synthesis - the ability to take a large data package and produce a structured report.
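The data-first flow described above can be sketched in a few lines. Note that the fetcher functions, their return values, and the `build_data_package` helper below are hypothetical stand-ins for illustration, not the actual M51 AI OS code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the platform's data fetchers; the real
# workflow pulls ~20 sources (Search Console, PageSpeed, GA4, Moz, ...).
def fetch_search_console(domain): return {"clicks": 1240, "impressions": 58000}
def fetch_pagespeed(domain):      return {"mobile_lcp_s": 4.8}
def fetch_moz(domain):            return {"domain_authority": 34}

FETCHERS = {
    "search_console": fetch_search_console,
    "pagespeed": fetch_pagespeed,
    "moz": fetch_moz,
}

def build_data_package(domain: str) -> str:
    """Fetch all sources in parallel, then serialize to compact markdown.

    The model is called exactly once with this package -- no tool calls
    happen during the audit itself.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, domain) for name, fn in FETCHERS.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    lines = [f"# DataPackage for {domain}"]
    for source, data in sorted(results.items()):
        lines.append(f"## {source}")
        lines.extend(f"- {k}: {v}" for k, v in data.items())
    return "\n".join(lines)

package = build_data_package("example.no")
# The single LLM call would then be something like:
#   audit_markdown = llm.generate(system=TEN_SECTION_PROMPT, user=package)
```

The key property is that the model sees a finished, compact markdown snapshot rather than driving the data collection itself.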

How much training data do you need for domain-specific fine-tuning?

Data pipeline

| Source | Examples | Description |
|---|---|---|
| Production reconstructions | 13 | Real audit runs with reverse-engineered DataPackage input |
| Hand-crafted gold standards | 4 | Claude Opus-generated, covering rich/sparse/anomaly/competitor scenarios |
| Synthetic bulk (Sonnet 4.6) | 788 | Generated by up to 14 parallel agents over 5 rounds |

Token statistics

| Metric | Value |
|---|---|
| Total examples | 805 |
| Total tokens | ~2.5 million |
| Average tokens/example | ~3,100 |
| audit_synthesis share | 57.3% of all tokens |

Industry coverage

The audit_synthesis examples cover 14+ Norwegian industry verticals with domain-specific regulatory expertise:

  • Fintech: Financial Supervisory Authority licensing, MiFID II, PSD2, AML
  • Healthcare: GDPR art. 9, Health Personnel Act, Biotechnology Act
  • Food & Beverage: Alcohol Act, EU Health Claims Regulation 1924/2006
  • Construction: Central approval (DiBK), Startbank, BREEAM, EPD-Norge
  • Legal: Attorney license, CSRD, DORA
  • Real Estate: Aggregator DA gap strategy against dominant portals
  • Manufacturing: Hreflang (NO/EN/DE), subsea ISO/API certifications
  • Plus e-commerce, SaaS, automotive, education, nonprofit, media, travel

Technical details

LoRA configuration

| Parameter | Value |
|---|---|
| Method | PiSSA (SVD-based LoRA initialization) |
| r | 8 |
| alpha | 16 |
| Target modules | q_proj, v_proj (sliding attention layers only) |
| Frozen layers | 10 global attention layers |
| Trainable parameters | 9,216,000 / 31,282,302,512 (0.0295%) |

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 5.0e-6 (cosine schedule) |
| Effective batch | 16 |
| Max length | 4096 |
| Precision | bfloat16 |
| Regularization | NEFTune alpha=5, weight_decay=0.01 |

Hardware

GPU: 1x NVIDIA H100 NVL (94 GB VRAM). Software: PyTorch 2.8 + CUDA 12.8 + transformers 5.5.3 + trl 1.0 + peft 0.18.

Loss curve

Step 10 (epoch 0.20): loss = 5.723
Step 20 (epoch 0.40): loss = 5.051
Step 30 (epoch 0.60): loss = 4.483
Step 40 (epoch 0.80): loss = 4.225
Step 51 (epoch 1.00): loss = 4.076

Monotonically decreasing with no signs of overfitting.
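The logged curve is also internally consistent with the dataset and batch settings: 805 examples at an effective batch of 16 give ceil(805/16) = 51 optimizer steps per epoch, which is exactly where the log ends. A minimal check:

```python
import math

steps_per_epoch = math.ceil(805 / 16)  # examples / effective batch size
assert steps_per_epoch == 51           # matches the final logged step

# Logged training losses, in step order (steps 10, 20, 30, 40, 51):
losses = [5.723, 5.051, 4.483, 4.225, 4.076]
assert all(a > b for a, b in zip(losses, losses[1:]))  # strictly decreasing
```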

Technical challenges during v2 development

  • OOM at max_length=6144: 94 GB VRAM is insufficient for 31B BF16 + LoRA + gradients at 6144 tokens. Solved by reducing to 4096.
  • Gemma 4 chat template incompatibility: trl's assistant_only_loss=True crashes because template returns 0 assistant tokens.
  • Thinking token format: Gemma 4 uses <|channel>thought, NOT <think>. Parameter enable_thinking (not thinking) is required and defaults to False.
  • llama-server reasoning format: --reasoning-format deepseek-legacy is required - plain deepseek empties the content field.
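When server-side reasoning parsing is unavailable or misbehaves, thinking blocks can be stripped client-side before rendering. The sketch below deliberately takes the markers as parameters rather than hard-coding them: `<|channel>thought` is the opening marker per the note above, but the correct closing marker must be taken from the actual chat template, so both arguments should be treated as assumptions to verify:

```python
def strip_reasoning(text: str, open_marker: str, close_marker: str) -> str:
    """Drop everything between the first open marker and its close marker."""
    start = text.find(open_marker)
    if start == -1:
        return text  # no thinking block present
    end = text.find(close_marker, start)
    if end == -1:
        return text[:start]  # unterminated block: drop the tail
    return text[:start] + text[end + len(close_marker):]
```

Usage with placeholder markers: `strip_reasoning(output, "<T>", "</T>")` removes the first `<T>…</T>` span and keeps the surrounding report text.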

Can a fine-tuned open-source model match Claude Sonnet on SEO audits?

We conducted four separate evaluations with increasing depth to avoid overestimating the results.

Level 1: Batch format compliance (5 audits)

| Metric | Result |
|---|---|
| Sections present | 10/10 (100%) |
| Findings with complete 7-field format | 100% (18/18) |
| Tables per audit (average) | 33.6 |
| Thinking block produced | 5/5 (100%) |
| Norwegian bokmål markers | 12/12 (100%) |
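Checks like these are mechanical and easy to automate. The sketch below is a hypothetical checker, not the evaluation harness we actually used; it assumes the audit is markdown with `##` section headings and pipe tables:

```python
import re

REQUIRED_SECTIONS = 10  # the system prompt specifies exactly 10 sections

def check_audit(markdown: str) -> dict:
    """Count the mechanical format signals scored in Level 1."""
    sections = re.findall(r"(?m)^## ", markdown)
    # Each pipe table contributes exactly one separator row like |---|---|
    tables = re.findall(r"(?m)^\|[\s:-]+\|", markdown)
    return {
        "sections_ok": len(sections) == REQUIRED_SECTIONS,
        "section_count": len(sections),
        "table_count": len(tables),
    }

# Toy audit: 10 sections, each containing one small table.
sample = "\n".join(
    f"## Section {i}\n\n|A|B|\n|---|---|\n|1|2|" for i in range(10)
)
report = check_audit(sample)
```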

Level 2: Fair AB test (equal length, two judges)

5 pairwise comparisons with approximately equal output length (~9-10K characters each) and randomized A/B assignment. Two Anthropic models, Claude Haiku and Claude Opus, scored each pair as independent judges.

| Metric | Haiku as judge | Opus as judge |
|---|---|---|
| v2 win rate | 60% (3/5) | 40% (2/5) |
| Avg v2 score | 8.0/10 | 7.4/10 |
| Avg Sonnet score | 8.1/10 | 8.0/10 |

Per dimension (consistent across both judges):

| Dimension | v2 advantage? | Comment |
|---|---|---|
| Prioritization | Yes (+0.4) | Impact/Effort scoring, NOW/NEXT/LATER |
| Insight | No (-0.2 to -1.0) | Sonnet has deeper strategic analysis |
| Structure | Even | Both follow 10-section format |
| Overall | Even | ~50% win rate |

Level 3: Qualitative expert review (Opus 4.6)

All 10 files were evaluated holistically by Claude Opus 4.6 acting as an independent SEO consultant.

I would choose the Sonnet set, but with a clear recommendation to tighten it up. V2 looks nicer and is easier to skim, but it loses too much substance. A CEO reading V2's finding about "Catastrophic Mobile LCP" gets one sentence about remediation. In the Sonnet version, they get a step-by-step plan they can actually send to their developer. The ideal would be a hybrid: V2's consistent structure and compact action plan, combined with Sonnet's forecast depth, financial quantification, and detailed action descriptions.

Level 4: Marketing manager perspective (Gemini 3.1 Pro)

Perhaps the most revealing evaluation came from Gemini 3.1 Pro, which assessed the reports from the perspective of a marketing manager - the actual end user of SEO audits.

On Sonnet:

The analytical deep-dive report. Financial translation is Sonnet's absolute superpower - it translates technical SEO errors into lost revenue. This is invaluable when a marketing manager needs to argue for developer resources with a CFO. But: "wall of text" syndrome. Heavy reading. Weak delegation - actions are embedded in longer paragraphs. Hard to bring straight into a Monday meeting without rewriting as a task list.

On v2:

The operational management tool. Extremely action-oriented - every finding is tagged with Impact and Effort, exactly the language a marketing manager needs to prioritize the backlog with IT/development. The split into "NOW (< 1 week)", "NEXT (1-4 weeks)" and "LATER (> 4 weeks)" with clear ownership is brilliant. It eliminates the friction between report and action. You can cut the table and paste it straight into Trello/Jira. The weakness: Missing the hard financial peg.

Gemini judge's conclusion:

As a marketing manager, I would without doubt choose v2 as my standard reporting. Why? Because a marketing manager's biggest bottleneck is rarely a lack of data, but execution capability. Sonnet is excellent for building a business case once a year, but in everyday work I need a tool that drives the project forward. The ideal compromise? Choose v2 as the template, but require it to include a bullet point in the Executive Summary that translates the technical errors into estimated revenue loss.

Combined evaluation

| Judge | Perspective | Winner | Reasoning |
|---|---|---|---|
| Claude Haiku | Mechanical format scoring | v2 (60%) | Better prioritization and structure |
| Claude Opus | Strategic SEO consultant | Sonnet (60%) | Deeper insight and action descriptions |
| Claude Opus (qualitative) | Holistic expert | Sonnet | Step-by-step plan you can send to the developer |
| Gemini 3.1 Pro | Marketing manager (end user) | v2 | Eliminates friction between report and action |

The decisive insight: Who you ask determines who wins. An SEO analyst prefers Sonnet's depth. A marketing manager prefers v2's actionability. The ideal report combines both.

What does this experiment prove - and what doesn't it?

Proves

  • Domain-specific fine-tuning works: From 13.3% to ~50% win rate with 805 examples. A 31B open-source model can compete evenly against one of the world's strongest commercial LLMs on a format-driven domain.
  • Format compliance can be learned with minimal data: 10/10 sections, 100% finding format, 33 tables per audit - after just 1 epoch of training.
  • Architecture understanding matters more than data volume: Discovering that the workflow is data-first (not tool-use) changed the entire training strategy and was the single most important decision in the project.
  • Different users value different things: A marketing manager prefers v2's operational format. An analyst prefers Sonnet's depth.

Does not prove

  • That v2 can replace Sonnet in production: The expert review is clear - Sonnet delivers deeper insight, better scenario forecasts, and more actionable recommendations for technical implementation.
  • That format compliance = quality: v2 produces perfect format but lacks substance. Impact scores are present, but without the 3-5 implementation steps and monetary estimates that make them actionable.
  • Statistical significance: 5 comparisons do not yield p<0.05. The result is directional, not conclusive.
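The significance point can be made concrete with an exact binomial test. Under the null hypothesis that v2 and Sonnet are equally good (win probability 0.5), the observed 3 wins out of 5 is exactly what chance predicts:

```python
from math import comb

def binom_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of k or more wins in n fair trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_observed = binom_p_at_least(3, 5)   # Haiku-judged result: 3 wins of 5
assert abs(p_observed - 0.5) < 1e-12  # nowhere near p < 0.05

# Even a clean sweep would be marginal at this sample size:
# 5/5 wins gives 1/32 ~ 0.031 one-sided, ~0.063 two-sided.
p_sweep = binom_p_at_least(5, 5)
```

Hence the conclusion that the result is directional rather than conclusive.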

Why we still chose to keep Claude Opus in production

Based on the evaluation results, we decided not to switch to SeoGemma4 v2 in production. Here is the reasoning:

Quality gap on insight

The expert review is unambiguous: Sonnet (and even more so Opus) delivers deeper strategic analysis. In a paid SEO service where clients are Norwegian businesses with real marketing budgets, the difference between "Optimize mobile checkout" and a step-by-step implementation plan with revenue estimates is the difference between a report and an implementation foundation.

A CEO reading v2 gets an Impact score. A CEO reading Sonnet gets a monetary figure she can bring to the board meeting. But v2's action plan is what actually drives execution.

Stability and reliability

Claude Opus 4.6 via API offers 99.9%+ uptime, deterministic output quality, automatic scaling, and no GPU infrastructure to maintain. SeoGemma4 v2 requires a dedicated GPU pod, specific server flags, manual GGUF conversion with each model update, and no automatic fallback on OOM or crash.

Cost-benefit at current volume

At M51 AI OS's current audit volume, the API cost is lower than a dedicated GPU pod. Self-hosting only becomes cost-effective at significantly higher volume.
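The break-even point is simple to estimate. The pod price below is an illustrative assumption, not an M51 figure; only the ~$0.50-2.00 per-audit API range comes from this article:

```python
def break_even_audits_per_month(pod_cost_month: float, api_cost_per_audit: float) -> float:
    """Monthly audit volume above which a dedicated GPU pod beats the API."""
    return pod_cost_month / api_cost_per_audit

# Illustrative only: a hypothetical H100-class pod at $1500/month
# against the ~$0.50-2.00 per-audit API cost range.
low  = break_even_audits_per_month(1500, 2.00)  # 750 audits/month
high = break_even_audits_per_month(1500, 0.50)  # 3000 audits/month
```

At either end of the range, self-hosting only pays off at hundreds to thousands of audits per month.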

Continuous improvement for free

Anthropic continuously improves Claude. Any model upgrade gives us a quality boost without effort. A self-hosted model requires active maintenance and retraining to remain competitive.

How can v3 combine the best of both?

v3 = v2's format + Sonnet's depth + financial translation

All three expert reviews point to the same recipe. Specifically this means:

  • Financial translation per finding: Every finding needs a monetary figure. From "Optimize JavaScript" to "Estimated loss: ~85,000 NOK/month".
  • Detailed implementation steps: Every finding needs 3-5 concrete steps a developer can act on without a follow-up meeting.
  • Scenario forecasts: The anomaly and forecast section needs best/base/worst modeling with quantified outcomes.
  • Honest reporting: v2 cherry-picks positive data. v3 must include negative trends equally prominently.
  • Correct loss masking: v2 was trained without role-based loss masking. v3 implements manual token-level masking.
  • Real production data: Logging the input DataPackage in production to collect 50+ real (input, output) pairs from Opus-generated audits.
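The loss-masking item can be sketched without any framework: in causal LM fine-tuning, labels are a copy of the input ids with ignored positions set to -100, the index that PyTorch's CrossEntropyLoss skips by default. The toy span boundaries below are hypothetical; a real implementation must locate the assistant spans via the tokenizer's chat template:

```python
IGNORE_INDEX = -100  # CrossEntropyLoss(ignore_index=-100) skips these positions

def mask_labels(input_ids: list[int], assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Copy input_ids to labels, keeping loss only on assistant tokens.

    assistant_spans are half-open (start, end) token index ranges.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy sequence: tokens 0-4 are system+user turns, 5-9 the assistant answer.
ids = [11, 12, 13, 14, 15, 21, 22, 23, 24, 25]
labels = mask_labels(ids, [(5, 10)])
```

With this masking, gradient only flows through the audit text itself, not through the prompt and DataPackage tokens.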

Available models

| Model | Size | Use |
|---|---|---|
| m51Lab-SeoGemma4-v2-31B (safetensors) | 59 GB | Inference, further fine-tuning |
| m51Lab-SeoGemma4-v2-31B-F16.gguf | 58 GB | llama-server full precision |
| m51Lab-SeoGemma4-v2-31B-Q8_0.gguf | 31 GB | Recommended for inference (H200/A100) |
| m51Lab-SeoGemma4-v2-31B-Q4_K_M.gguf | 14 GB | Low-cost deploy (A40, RTX 4090+) |

License: Apache 2.0 (inherited from Google Gemma 4).


Frequently asked questions about AI-powered SEO audits

Can an open-source model replace Claude for SEO audits?

Not yet. Our fine-tuned 31B model matches Claude Sonnet 4.6 on format and structure, but Sonnet delivers deeper strategic analysis and financial quantification. For production SEO audits where quality is critical, Claude Opus remains the best choice.

How much training data is needed for domain-specific fine-tuning?

Surprisingly little. 805 examples (~2.5 million tokens) and 1 epoch of training was enough for 100% format compliance. But closing the analytical gap against frontier models will require more sophisticated training data and techniques.

What is the difference between format compliance and analytical quality?

Format compliance means the model produces correct structure: proper sections, tables, Impact/Effort scoring. Analytical quality is about the content: the depth of insights, quality of recommendations, and ability to translate technical findings into business value.

Is open-source or commercial AI better for SEO reporting?

It depends on the user. Our evaluation shows that marketing managers prefer the open-source model's action-oriented format, while SEO analysts prefer Sonnet's depth. The ideal is a hybrid that combines both strengths.


References and resources

Google Gemma 4 model page

PiSSA: Principal Singular Values and Singular Vectors Adaptation (paper)

Read more: How well does AI know the Norwegian market?


m51 Lab is the research and development division of M51 AI OS. The model and complete research log are available as open source.

M51 AI OS uses Claude Opus 4.6 to generate professional SEO audits. Want to see what AI-powered audits can do for your business?
