m51 Lab Research · April 2026 · 20 min

We fine-tuned an open-source AI to write SEO audits - and tested it against Claude Sonnet

805 training examples. 1 epoch. From 13% to 50% win rate. But we still chose to keep Claude Opus in production.

Key Findings (TL;DR)

  • A 31B open-source model (fine-tuned with 805 examples) achieves 40-60% win rate against Claude Sonnet 4.6 on Norwegian SEO audits
  • The open-source model wins on structure and actionability; Sonnet wins on strategic depth and financial quantification
  • Who "wins" depends on the end user: analysts prefer Sonnet, marketing managers prefer v2
  • 805 examples and 1 epoch of training is enough for 100% format compliance on 10-section SEO audits
  • Architecture understanding (data-first vs. tool-use) mattered more than data volume or model size

A 31B open-source model fine-tuned with just 805 examples (~2.5 million tokens) and 1 epoch of training achieves 100% format compliance on 10-section SEO audits and 40-60% win rate against Claude Sonnet 4.6.

Why fine-tune an open-source model for SEO audits?

M51 AI OS is an AI-powered marketing platform that generates professional SEO audits for Norwegian businesses. The platform collects data from Google Search Console, PageSpeed Insights, Moz, GA4, and internal systems, and uses an LLM to synthesize this into structured reports rendered as PDF.

Learn more about the M51 AI OS platform

Until April 2026, the workflow used Claude Opus and Sonnet 4.6 as its primary models. The motivation for exploring an open-source replacement was threefold:

  • Cost: Each audit uses ~$0.50-2.00 in API calls
  • Dependency: No fallback during Anthropic downtime
  • Control: Limited ability for format specialization

First attempt: v1

SeoGemma4 v1 was trained from m51Lab-NorskGemma4-31B (a Norwegian-optimized Gemma 4 with 83.6% NorEval). With 2,590 training examples, v1 achieved an SEO Quality score of 4.08/5, but only 13.3% win rate against Claude Sonnet in pairwise comparison.

The main reason: v1 produced narrative prose without the structured format (Impact/Effort scoring, tables, action plans) that characterizes professional SEO audits.

Read the full story: Smaller model, better results. NorskGemma4-31B

Challenges with v1

  • Machine crash during data generation at 1,475 of 3,060 examples - required crash recovery with 14 parallel agents
  • PyTorch 2.4 compatibility error - required upgrade to 2.8 with specific CUDA 12.8 image
  • Gemma4ForConditionalGeneration doesn't return loss from labels - required custom training_step with manual CrossEntropyLoss
  • Flash Attention incompatible with Gemma 4 (head_dim > 256) - must use eager attention
  • NorskGemma4 base lost native thinking behavior - the model never reasoned before answering
  • packing=True crashes Gemma 4 - a non-negotiable constraint

What changed between v1 and v2?

Base model pivot

The most important decision in v2 was to drop NorskGemma4 as the base and go straight to google/gemma-4-31B-it. Three reasons:

  • Thinking preservation: NorskGemma4's fine-tuning had washed out Gemma 4's native thinking tokens. v2 needed this capability.
  • Native function calling: Gemma 4 has official function calling support that was potentially degraded in NorskGemma4.
  • Norwegian is already good enough: NorEval tests confirmed that base Gemma 4 scores second highest of all tested models on Norwegian.

Architecture discovery: Data-first workflow

A deep analysis of the M51 AI OS workflow revealed that the original v2 plan was fundamentally wrong. We assumed a multi-turn tool-use architecture, but found that the workflow is data-first:

  • 20 data sources are fetched in parallel (Search Console, PageSpeed, GA4, Moz, historical audits, AI visibility data, etc.) - BEFORE the model is called
  • Everything is serialized to compact markdown (~12 KB) by a DataPackage builder
  • The model is called ONCE with a system prompt specifying exactly 10 sections
  • Output is pure markdown - no JSON, no tool calls during the audit

This discovery fundamentally changed the training strategy: instead of 600 multi-turn tool-use examples, we focused on audit synthesis - the ability to take a large data package and produce a structured report.
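The data-first flow described above can be sketched in a few lines. Note that the fetcher functions, their return values, and the `build_data_package` helper below are hypothetical stand-ins for illustration, not the actual M51 AI OS code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the platform's data fetchers; the real
# workflow pulls ~20 sources (Search Console, PageSpeed, GA4, Moz, ...).
def fetch_search_console(domain): return {"clicks": 1240, "impressions": 58000}
def fetch_pagespeed(domain):      return {"mobile_lcp_s": 4.8}
def fetch_moz(domain):            return {"domain_authority": 34}

FETCHERS = {
    "search_console": fetch_search_console,
    "pagespeed": fetch_pagespeed,
    "moz": fetch_moz,
}

def build_data_package(domain: str) -> str:
    """Fetch all sources in parallel, then serialize to compact markdown.

    The model is called exactly once with this package -- no tool calls
    happen during the audit itself.
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, domain) for name, fn in FETCHERS.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    lines = [f"# DataPackage for {domain}"]
    for source, data in sorted(results.items()):
        lines.append(f"## {source}")
        lines.extend(f"- {k}: {v}" for k, v in data.items())
    return "\n".join(lines)

package = build_data_package("example.no")
# The single LLM call would then be something like:
#   audit_markdown = llm.generate(system=TEN_SECTION_PROMPT, user=package)
```

The key property is that the model sees a finished, compact markdown snapshot rather than driving the data collection itself.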

How much training data do you need for domain-specific fine-tuning?

Data pipeline

| Source | Examples | Description |
|---|---|---|
| Production reconstructions | 13 | Real audit runs with reverse-engineered DataPackage input |
| Hand-crafted gold standards | 4 | Claude Opus-generated, covering rich/sparse/anomaly/competitor scenarios |
| Synthetic bulk (Sonnet 4.6) | 788 | Generated by up to 14 parallel agents over 5 rounds |

Token statistics

| Metric | Value |
|---|---|
| Total examples | 805 |
| Total tokens | ~2.5 million |
| Average tokens/example | ~3,100 |
| audit_synthesis share | 57.3% of all tokens |

Industry coverage

The audit_synthesis examples cover 14+ Norwegian industry verticals with domain-specific regulatory expertise:

  • Fintech: Financial Supervisory Authority licensing, MiFID II, PSD2, AML
  • Healthcare: GDPR art. 9, Health Personnel Act, Biotechnology Act
  • Food & Beverage: Alcohol Act, EU Health Claims Regulation 1924/2006
  • Construction: Central approval (DiBK), Startbank, BREEAM, EPD-Norge
  • Legal: Attorney license, CSRD, DORA
  • Real Estate: Aggregator DA gap strategy against dominant portals
  • Manufacturing: Hreflang (NO/EN/DE), subsea ISO/API certifications
  • Plus e-commerce, SaaS, automotive, education, nonprofit, media, travel

Technical details

LoRA configuration

| Parameter | Value |
|---|---|
| Method | PiSSA (SVD-based LoRA initialization) |
| r | 8 |
| alpha | 16 |
| Target modules | q_proj, v_proj (sliding attention layers only) |
| Frozen layers | 10 global attention layers |
| Trainable parameters | 9,216,000 / 31,282,302,512 (0.0295%) |

Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 5.0e-6 (cosine schedule) |
| Effective batch | 16 |
| Max length | 4096 |
| Precision | bfloat16 |
| Regularization | NEFTune alpha=5, weight_decay=0.01 |

Hardware

GPU: 1x NVIDIA H100 NVL (94 GB VRAM). Software: PyTorch 2.8 + CUDA 12.8 + transformers 5.5.3 + trl 1.0 + peft 0.18.

Loss curve

Step 10 (epoch 0.20): loss = 5.723
Step 20 (epoch 0.40): loss = 5.051
Step 30 (epoch 0.60): loss = 4.483
Step 40 (epoch 0.80): loss = 4.225
Step 51 (epoch 1.00): loss = 4.076

Monotonically decreasing with no signs of overfitting.
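The logged curve is also internally consistent with the dataset and batch settings: 805 examples at an effective batch of 16 give ceil(805/16) = 51 optimizer steps per epoch, which is exactly where the log ends. A minimal check:

```python
import math

steps_per_epoch = math.ceil(805 / 16)  # examples / effective batch size
assert steps_per_epoch == 51           # matches the final logged step

# Logged training losses, in step order (steps 10, 20, 30, 40, 51):
losses = [5.723, 5.051, 4.483, 4.225, 4.076]
assert all(a > b for a, b in zip(losses, losses[1:]))  # strictly decreasing
```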

Technical challenges during v2 development

  • OOM at max_length=6144: 94 GB VRAM is insufficient for 31B BF16 + LoRA + gradients at 6144 tokens. Solved by reducing to 4096.
  • Gemma 4 chat template incompatibility: trl's assistant_only_loss=True crashes because template returns 0 assistant tokens.
  • Thinking token format: Gemma 4 uses <|channel>thought, NOT <think>. Parameter enable_thinking (not thinking) is required and defaults to False.
  • llama-server reasoning format: --reasoning-format deepseek-legacy is required - plain deepseek empties the content field.
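When server-side reasoning parsing is unavailable or misbehaves, thinking blocks can be stripped client-side before rendering. The sketch below deliberately takes the markers as parameters rather than hard-coding them: `<|channel>thought` is the opening marker per the note above, but the correct closing marker must be taken from the actual chat template, so both arguments should be treated as assumptions to verify:

```python
def strip_reasoning(text: str, open_marker: str, close_marker: str) -> str:
    """Drop everything between the first open marker and its close marker."""
    start = text.find(open_marker)
    if start == -1:
        return text  # no thinking block present
    end = text.find(close_marker, start)
    if end == -1:
        return text[:start]  # unterminated block: drop the tail
    return text[:start] + text[end + len(close_marker):]
```

Usage with placeholder markers: `strip_reasoning(output, "<T>", "</T>")` removes the first `<T>…</T>` span and keeps the surrounding report text.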

Can a fine-tuned open-source model match Claude Sonnet on SEO audits?

We conducted four separate evaluations with increasing depth to avoid overestimating the results.

Level 1: Batch format compliance (5 audits)

| Metric | Result |
|---|---|
| Sections present | 10/10 (100%) |
| Findings with complete 7-field format | 100% (18/18) |
| Tables per audit (average) | 33.6 |
| Thinking block produced | 5/5 (100%) |
| Norwegian bokmål markers | 12/12 (100%) |
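Checks like these are mechanical and easy to automate. The sketch below is a hypothetical checker, not the evaluation harness we actually used; it assumes the audit is markdown with `##` section headings and pipe tables:

```python
import re

REQUIRED_SECTIONS = 10  # the system prompt specifies exactly 10 sections

def check_audit(markdown: str) -> dict:
    """Count the mechanical format signals scored in Level 1."""
    sections = re.findall(r"(?m)^## ", markdown)
    # Each pipe table contributes exactly one separator row like |---|---|
    tables = re.findall(r"(?m)^\|[\s:-]+\|", markdown)
    return {
        "sections_ok": len(sections) == REQUIRED_SECTIONS,
        "section_count": len(sections),
        "table_count": len(tables),
    }

# Toy audit: 10 sections, each containing one small table.
sample = "\n".join(
    f"## Section {i}\n\n|A|B|\n|---|---|\n|1|2|" for i in range(10)
)
report = check_audit(sample)
```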

Level 2: Fair AB test (equal length, two judges)

5 pairwise comparisons with approximately equal output length (~9-10K characters each) and randomized A/B assignment. Two Anthropic models, Claude Haiku and Claude Opus, scored each pair as independent judges.

| Metric | Haiku as judge | Opus as judge |
|---|---|---|
| v2 win rate | 60% (3/5) | 40% (2/5) |
| Avg v2 score | 8.0/10 | 7.4/10 |
| Avg Sonnet score | 8.1/10 | 8.0/10 |

Per dimension (consistent across both judges):

| Dimension | v2 advantage? | Comment |
|---|---|---|
| Prioritization | Yes (+0.4) | Impact/Effort scoring, NOW/NEXT/LATER |
| Insight | No (-0.2 to -1.0) | Sonnet has deeper strategic analysis |
| Structure | Even | Both follow 10-section format |
| Overall | Even | ~50% win rate |

Level 3: Qualitative expert review (Opus 4.6)

All 10 files were evaluated holistically by Claude Opus 4.6 acting as an independent SEO consultant.

I would choose the Sonnet set, but with a clear recommendation to tighten it up. V2 looks nicer and is easier to skim, but it loses too much substance. A CEO reading V2's finding about "Catastrophic Mobile LCP" gets one sentence about remediation. In the Sonnet version, they get a step-by-step plan they can actually send to their developer. The ideal would be a hybrid: V2's consistent structure and compact action plan, combined with Sonnet's forecast depth, financial quantification, and detailed action descriptions.

Level 4: Marketing manager perspective (Gemini 3.1 Pro)

Perhaps the most revealing evaluation came from Gemini 3.1 Pro, which assessed the reports from the perspective of a marketing manager - the actual end user of SEO audits.

On Sonnet:

The analytical deep-dive report. Financial translation is Sonnet's absolute superpower - it translates technical SEO errors into lost revenue. This is invaluable when a marketing manager needs to argue for developer resources with a CFO. But: "wall of text" syndrome. Heavy reading. Weak delegation - actions are embedded in longer paragraphs. Hard to bring straight into a Monday meeting without rewriting as a task list.

On v2:

The operational management tool. Extremely action-oriented - every finding is tagged with Impact and Effort, exactly the language a marketing manager needs to prioritize the backlog with IT/development. The split into "NOW (< 1 week)", "NEXT (1-4 weeks)" and "LATER (> 4 weeks)" with clear ownership is brilliant. It eliminates the friction between report and action. You can cut the table and paste it straight into Trello/Jira. The weakness: Missing the hard financial peg.

Gemini judge's conclusion:

As a marketing manager, I would without doubt choose v2 as my standard reporting. Why? Because a marketing manager's biggest bottleneck is rarely a lack of data, but execution capability. Sonnet is excellent for building a business case once a year, but in everyday work I need a tool that drives the project forward. The ideal compromise? Choose v2 as the template, but require it to include a bullet point in the Executive Summary that translates the technical errors into estimated revenue loss.

Combined evaluation

| Judge | Perspective | Winner | Reasoning |
|---|---|---|---|
| Claude Haiku | Mechanical format scoring | v2 (60%) | Better prioritization and structure |
| Claude Opus | Strategic SEO consultant | Sonnet (60%) | Deeper insight and action descriptions |
| Claude Opus (qualitative) | Holistic expert | Sonnet | Step-by-step plan you can send to the developer |
| Gemini 3.1 Pro | Marketing manager (end user) | v2 | Eliminates friction between report and action |

The decisive insight: Who you ask determines who wins. An SEO analyst prefers Sonnet's depth. A marketing manager prefers v2's actionability. The ideal report combines both.

What does this experiment prove - and what doesn't it?

Proves

  • Domain-specific fine-tuning works: From 13.3% to ~50% win rate with 805 examples. A 31B open-source model can compete evenly against one of the world's strongest commercial LLMs on a format-driven domain.
  • Format compliance can be learned with minimal data: 10/10 sections, 100% finding format, 33 tables per audit - after just 1 epoch of training.
  • Architecture understanding matters more than data volume: Discovering that the workflow is data-first (not tool-use) changed the entire training strategy and was the single most important decision in the project.
  • Different users value different things: A marketing manager prefers v2's operational format. An analyst prefers Sonnet's depth.

Does not prove

  • That v2 can replace Sonnet in production: The expert review is clear - Sonnet delivers deeper insight, better scenario forecasts, and more actionable recommendations for technical implementation.
  • That format compliance = quality: v2 produces perfect format but lacks substance. Impact scores are present, but without the 3-5 implementation steps and monetary estimates that make them actionable.
  • Statistical significance: 5 comparisons do not yield p<0.05. The result is directional, not conclusive.
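The significance point can be made concrete with an exact binomial test. Under the null hypothesis that v2 and Sonnet are equally good (win probability 0.5), the observed 3 wins out of 5 is exactly what chance predicts:

```python
from math import comb

def binom_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of k or more wins in n fair trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_observed = binom_p_at_least(3, 5)   # Haiku-judged result: 3 wins of 5
assert abs(p_observed - 0.5) < 1e-12  # nowhere near p < 0.05

# Even a clean sweep would be marginal at this sample size:
# 5/5 wins gives 1/32 ~ 0.031 one-sided, ~0.063 two-sided.
p_sweep = binom_p_at_least(5, 5)
```

Hence the conclusion that the result is directional rather than conclusive.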

Why we still chose to keep Claude Opus in production

Based on the evaluation results, we decided not to switch to SeoGemma4 v2 in production. Here is the reasoning:

Quality gap on insight

The expert review is unambiguous: Sonnet (and even more so Opus) delivers deeper strategic analysis. In a paid SEO service where clients are Norwegian businesses with real marketing budgets, the difference between "Optimize mobile checkout" and a step-by-step implementation plan with revenue estimates is the difference between a report and an implementation foundation.

A CEO reading v2 gets an Impact score. A CEO reading Sonnet gets a monetary figure she can bring to the board meeting. But v2's action plan is what actually drives execution.

Stability and reliability

Claude Opus 4.6 via API offers 99.9%+ uptime, deterministic output quality, automatic scaling, and no GPU infrastructure to maintain. SeoGemma4 v2 requires a dedicated GPU pod, specific server flags, manual GGUF conversion with each model update, and no automatic fallback on OOM or crash.

Cost-benefit at current volume

At M51 AI OS's current audit volume, the API cost is lower than a dedicated GPU pod. Self-hosting only becomes cost-effective at significantly higher volume.
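The break-even point is simple to estimate. The pod price below is an illustrative assumption, not an M51 figure; only the ~$0.50-2.00 per-audit API range comes from this article:

```python
def break_even_audits_per_month(pod_cost_month: float, api_cost_per_audit: float) -> float:
    """Monthly audit volume above which a dedicated GPU pod beats the API."""
    return pod_cost_month / api_cost_per_audit

# Illustrative only: a hypothetical H100-class pod at $1500/month
# against the ~$0.50-2.00 per-audit API cost range.
low  = break_even_audits_per_month(1500, 2.00)  # 750 audits/month
high = break_even_audits_per_month(1500, 0.50)  # 3000 audits/month
```

At either end of the range, self-hosting only pays off at hundreds to thousands of audits per month.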

Continuous improvement for free

Anthropic continuously improves Claude. Any model upgrade gives us a quality boost without effort. A self-hosted model requires active maintenance and retraining to remain competitive.

How can v3 combine the best of both?

v3 = v2's format + Sonnet's depth + financial translation

All three expert reviews point to the same recipe. Specifically this means:

  • Financial translation per finding: Every finding needs a monetary figure. From "Optimize JavaScript" to "Estimated loss: ~85,000 NOK/month".
  • Detailed implementation steps: Every finding needs 3-5 concrete steps a developer can act on without a follow-up meeting.
  • Scenario forecasts: The anomaly and forecast section needs best/base/worst modeling with quantified outcomes.
  • Honest reporting: v2 cherry-picks positive data. v3 must include negative trends equally prominently.
  • Correct loss masking: v2 was trained without role-based loss masking. v3 implements manual token-level masking.
  • Real production data: Logging the input DataPackage in production to collect 50+ real (input, output) pairs from Opus-generated audits.
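The loss-masking item can be sketched without any framework: in causal LM fine-tuning, labels are a copy of the input ids with ignored positions set to -100, the index that PyTorch's CrossEntropyLoss skips by default. The toy span boundaries below are hypothetical; a real implementation must locate the assistant spans via the tokenizer's chat template:

```python
IGNORE_INDEX = -100  # CrossEntropyLoss(ignore_index=-100) skips these positions

def mask_labels(input_ids: list[int], assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Copy input_ids to labels, keeping loss only on assistant tokens.

    assistant_spans are half-open (start, end) token index ranges.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy sequence: tokens 0-4 are system+user turns, 5-9 the assistant answer.
ids = [11, 12, 13, 14, 15, 21, 22, 23, 24, 25]
labels = mask_labels(ids, [(5, 10)])
```

With this masking, gradient only flows through the audit text itself, not through the prompt and DataPackage tokens.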

Available models

| Model | Size | Use |
|---|---|---|
| m51Lab-SeoGemma4-v2-31B (safetensors) | 59 GB | Inference, further fine-tuning |
| m51Lab-SeoGemma4-v2-31B-F16.gguf | 58 GB | llama-server full precision |
| m51Lab-SeoGemma4-v2-31B-Q8_0.gguf | 31 GB | Recommended for inference (H200/A100) |
| m51Lab-SeoGemma4-v2-31B-Q4_K_M.gguf | 14 GB | Low-cost deploy (A40, RTX 4090+) |

License: Apache 2.0 (inherited from Google Gemma 4).


Frequently asked questions about AI-powered SEO audits

Can an open-source model replace Claude for SEO audits?

Not yet. Our fine-tuned 31B model matches Claude Sonnet 4.6 on format and structure, but Sonnet delivers deeper strategic analysis and financial quantification. For production SEO audits where quality is critical, Claude Opus remains the best choice.

How much training data is needed for domain-specific fine-tuning?

Surprisingly little. 805 examples (~2.5 million tokens) and 1 epoch of training was enough for 100% format compliance. But closing the analytical gap against frontier models will require more sophisticated training data and techniques.

What is the difference between format compliance and analytical quality?

Format compliance means the model produces correct structure: proper sections, tables, Impact/Effort scoring. Analytical quality is about the content: the depth of insights, quality of recommendations, and ability to translate technical findings into business value.

Is open-source or commercial AI better for SEO reporting?

It depends on the user. Our evaluation shows that marketing managers prefer the open-source model's action-oriented format, while SEO analysts prefer Sonnet's depth. The ideal is a hybrid that combines both strengths.


References and resources

Google Gemma 4 model page

PiSSA: Principal Singular Values and Singular Vectors Adaptation (paper)

Read more: How well does AI know the Norwegian market?


m51 Lab is the research and development division of M51 AI OS. The model and complete research log are available as open source.

M51 AI OS uses Claude Opus 4.6 to generate professional SEO audits. Want to see what AI-powered audits can do for your business?
