Is Claude Opus 4.8 better than Sonnet 4.6 for FP&A work?

On total quality score, Opus 4.8 High and Sonnet 4.6 High tied at 74/100 in this experiment. The difference is in composition: Opus 4.8 produced deeper causal commentary and root-cause attribution, while Sonnet 4.6 produced stronger structural clarity and actionability. For routine monthly reporting, Sonnet 4.6 High is the more resource-efficient choice. For in-depth variance analysis or root-cause diagnosis, Opus 4.8 High provides greater analytical depth.

Does the effort level setting in Claude affect FP&A output quality?

Yes, materially. In this experiment, Sonnet 4.6 High scored 74/100 while Sonnet 4.6 Low scored 66/100. More critically, Sonnet 4.6 Low scored 2/5 on numerical accuracy, meaning it contained material errors in calculated figures. The effort level setting should be treated as a risk-control mechanism, not just a speed trade-off. Low-effort outputs require heavy numerical verification before professional use.

What is the main risk of using AI for FP&A monthly close reporting?

The highest-risk scenario is not an obviously weak output — it is a structurally polished output that contains numerical errors. In this experiment, Sonnet 4.6 Low produced strong structure and actionability but material numerical inaccuracies. A practitioner relying on it without independent verification of the calculated figures could act on incorrect data. Always verify AI-generated financial figures independently before professional use.

Claude for FP&A: A Practitioner Benchmark of Model Choice and Effort Level

Can Claude produce a board-ready CFO report from a single prompt?

No, not in this experiment. Four of the five conditions scored 3 out of 5 on Board Pack Quality, and the best result (Sonnet 4.6 Medium) reached only 4, regardless of model or effort level. Board-ready communication requires a dedicated prompt, a narrower audience definition, explicit board-level questions, and human editorial judgment. A single broad monthly-close prompt is not sufficient to produce a one-pager a CFO could present without editing.

Practitioner Summary

If you only have two minutes, this table is all you need.

One-Page Summary

Research Question	Does a more capable Claude model at higher effort produce meaningfully better FP&A output, and if so, at what resource cost?
Design	Five conditions tested: Opus 4.8 at High and Low; Sonnet 4.6 at High, Medium, and Low. Identical prompt and raw GL dataset supplied to each. Outputs scored on a seven-dimension rubric by the author — an FP&A professional with direct manufacturing CFO-level experience.
Headline Finding	Sonnet 4.6 High matched Opus 4.8 High on total quality (74/100) while consuming materially less observed session capacity (27% versus 49% of the session limit). But the scores diverge at the dimension level: Opus 4.8 produced deeper causal commentary; Sonnet 4.6 produced stronger structure and actionability.
Recommendation	Use Sonnet 4.6 High as the default for routine monthly reporting. Add root-cause commentary yourself. Upgrade to Opus 4.8 High when causal diagnosis is the core deliverable. Avoid Sonnet 4.6 Low for any output you plan to act on without heavy numerical verification.
⚠ Risk Warning	The highest-risk output was not the weakest-looking one. Sonnet 4.6 Low produced strong structure and actionability but weak numerical accuracy (D6 score of 2/5). Polished presentation is not evidence of numerical correctness.
Bottom Line	Sonnet produces the structure; the analyst provides the cause. Opus produces the cause; the analyst refines the communication.
Methodological Note	The dataset, rubric, and scoring were all designed and applied by the same author without blind evaluation. These are results from a structured practitioner benchmark, not a blinded academic evaluation.

Background and Motivation

Anthropic's release of Claude Opus 4.8, together with user-selectable effort intensity levels, created a new decision variable for AI practitioners: not just which model to use, but how hard to ask it to work. For finance professionals, this raises a practical question with real cost implications.

A CFO monthly close reporting package is among the most demanding analytical tasks in FP&A. It requires multi-layer financial reasoning, precise numerical computation, causal attribution, risk escalation judgment, and executive communication ability — all applied simultaneously to raw data. It is exactly the kind of task where model and effort level selection should matter, and where the wrong choice has professional consequences.

This experiment was designed to answer one question with professional specificity: for a complex FP&A task, does a more capable model at higher effort produce meaningfully better output? And if so, at what cost in session resources?

Key Context

Most existing AI finance benchmarks evaluate narrow tasks: sentiment classification, named-entity recognition, or isolated numerical reasoning. This experiment evaluates a complete analytical workflow — ten structured deliverables from raw GL data — which is closer to what FP&A professionals actually do.

The Experiment

Task Design

The task was a CFO-ready monthly close package for Meridian Manufacturing Inc., a fictional mid-sized electrical transformer manufacturer (B2B, Ontario, ~180 employees). The company was designed to resemble the industrial manufacturing environments in which the author has direct FP&A experience, enabling professional-grade evaluation.

The prompt asked the model to produce ten deliverables: Data Quality Summary, Executive Summary, KPI Dashboard, Revenue Analysis, Income Statement and EBITDA Bridge with variance classification (OT/T/R/S/U), Cash Flow, Balance Sheet, Root-Cause and Action Plan, Reforecast Implications, and a Board Pack One-Pager. Every section had to end with a concrete recommended action.

The Dataset

Rather than handing the model pre-aggregated figures, the input was a raw GL transaction file: 42 individual April 2026 journal entries with dates, GL codes, amounts, and notes, plus a 12-month budget sheet. The model had to classify, aggregate, and analyse from first principles.

The dataset embedded seven analytical challenges the model had to detect: a cash runway approaching the 2-month threshold (2.1 months), an EBITDA shortfall of 69% against budget, GP margin compression of 8.1 percentage points from a copper cost increase, two large orders deferred to Q3, a recurring new hire salary variance, a one-time trade show overspend, and uncertain service contract renewals due in May.
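Three of these challenges are simple ratios, and the arithmetic behind them can be sketched as follows. This is a minimal illustration using common FP&A definitions, which are an assumption here: the article does not state the exact formulas used. The input amounts are invented purely so the outputs land on the reported headline values; they are not figures from the Meridian dataset.

```python
def cash_runway_months(cash_balance: float, monthly_net_burn: float) -> float:
    """Months of cash remaining at the current net burn rate (assumed definition)."""
    if monthly_net_burn <= 0:
        raise ValueError("runway is undefined when there is no net cash burn")
    return cash_balance / monthly_net_burn


def shortfall_pct(actual: float, budget: float) -> float:
    """Shortfall against budget as a percentage of budget (positive = below budget)."""
    return (budget - actual) / budget * 100


def margin_change_pp(actual_margin_pct: float, budget_margin_pct: float) -> float:
    """Change in margin, in percentage points (negative = compression)."""
    return actual_margin_pct - budget_margin_pct


# Hypothetical inputs, chosen only to illustrate the arithmetic.
runway = cash_runway_months(cash_balance=420_000, monthly_net_burn=200_000)
ebitda_gap = shortfall_pct(actual=31_000, budget=100_000)
gp_change = margin_change_pp(actual_margin_pct=23.9, budget_margin_pct=32.0)

print(f"Cash runway: {runway:.1f} months")
print(f"EBITDA shortfall: {ebitda_gap:.0f}% of budget")
print(f"GP margin change: {gp_change:+.1f} pp")
```

Recomputing metrics like these from the raw entries is one way to check whether a model has actually detected a challenge or has only restated a label.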

Scoring Rubric

Outputs were scored on seven dimensions, each rated 1–5 by the author. Dimension weights reflect analytical importance in a real FP&A context:

D1 — Output Completeness ×2: All ten sections present and populated
D2 — Variance Classification ×2: Correct OT/T/R/S/U labels on material variances
D3 — Causal Commentary ×3: WHY not WHAT — named specific drivers (highest weight)
D4 — Red Flag Detection ×2: All embedded warning signals caught and escalated
D5 — Board Pack Quality ×2: CFO-usable one-pager without editing
D6 — Numerical Accuracy ×2: All bridges close, variances correct
D7 — Actionability ×1: Each section ends with a specific, ownable action

Total possible weighted score: 70 points (the weights sum to 14, multiplied by the maximum rating of 5), normalised to 100. All five conditions received the identical prompt and dataset. Each was run as a fresh conversation with no prior context, between 9am and 5pm on Monday, June 1, 2026, with all account-level custom skills disabled.

Results

Overall Scores

Condition	Score	Time	Session %	Weekly %	Words
Opus 4.8 / High	74/100	652s	49%	3%	4,628
Opus 4.8 / Low	70/100	821s	41%	4%	3,738
Sonnet 4.6 / High	74/100	803s	27%	1%	3,681
Sonnet 4.6 / Medium	67/100	443s	18%	1%	2,943
Sonnet 4.6 / Low	66/100	286s	17%	1%	2,156

Session % and Weekly % are directional indicators only: they are not equivalent to token counts and cannot be treated as precise cost ratios.

Dimension Breakdown

Condition	D1 ×2	D2 ×2	D3 ×3	D4 ×2	D5 ×2	D6 ×2	D7 ×1	Score
Opus 4.8 / High	3	3	4	4	3	5	4	74
Opus 4.8 / Low	3	4	3	3	3	5	4	70
Sonnet 4.6 / High	3	4	3	4	3	5	5	74
Sonnet 4.6 / Medium	4	3	2	3	4	5	3	67
Sonnet 4.6 / Low	4	3	3	4	3	2	5	66
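
The Score column can be reproduced from the dimension ratings and the rubric weights. The sketch below assumes the normalised score is the weighted sum divided by the 70-point maximum, multiplied by 100 and rounded to the nearest whole number. The rounding rule is an assumption; the article does not state it.

```python
# Rubric weights for D1..D7, as listed in the scoring rubric.
WEIGHTS = [2, 2, 3, 2, 2, 2, 1]
MAX_RATING = 5

# Ratings per condition (D1..D7), copied from the dimension breakdown table.
RATINGS = {
    "Opus 4.8 / High":     [3, 3, 4, 4, 3, 5, 4],
    "Opus 4.8 / Low":      [3, 4, 3, 3, 3, 5, 4],
    "Sonnet 4.6 / High":   [3, 4, 3, 4, 3, 5, 5],
    "Sonnet 4.6 / Medium": [4, 3, 2, 3, 4, 5, 3],
    "Sonnet 4.6 / Low":    [4, 3, 3, 4, 3, 2, 5],
}


def normalised_score(ratings: list[int]) -> int:
    """Weighted rubric score scaled to 0-100 (rounding is an assumption)."""
    if len(ratings) != len(WEIGHTS):
        raise ValueError("expected one rating per rubric dimension")
    weighted = sum(w * r for w, r in zip(WEIGHTS, ratings))
    max_weighted = MAX_RATING * sum(WEIGHTS)  # 5 * 14 = 70
    return round(100 * weighted / max_weighted)


totals = {condition: normalised_score(r) for condition, r in RATINGS.items()}
for condition, total in totals.items():
    print(f"{condition}: {total}/100")
```

Under these assumptions the function returns 74, 70, 74, 67, and 66, matching the published totals. It also shows why the two 74s are not equivalent: both conditions reach 52 weighted points, but through different dimensions.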

Key Findings

Finding 1 — Equal Total Score, Different Composition

Opus 4.8 High and Sonnet 4.6 High tied at 74/100, but the scores diverge at the dimension level in ways that matter for model selection:

Opus 4.8 High was the only condition to score 4 on D3 Causal Commentary — the highest-weighted dimension. It consistently named specific drivers: copper cost increases, the deferred order mechanism, the new hire salary load. It explained WHY results occurred, not just WHAT happened.
Sonnet 4.6 High matched on total score but produced stronger actionability (D7: 5 vs 4) and cleaner structural output. Tables were better organised, section delineation was clearer, and action recommendations were more immediately usable.

The aggregate tie is less informative than the compositional difference. Practitioners should choose based on which dimension matters most for their specific deliverable.

Efficiency Signal

Sonnet 4.6 High consumed 27% of the observed session limit versus 49% for Opus 4.8 High, which suggests meaningfully lower resource use for equal total quality. This is a directional finding only; without direct token counts it cannot be expressed as a precise cost ratio.

Finding 2 — Effort Level is a Risk-Control Setting

Within Sonnet 4.6, Medium effort offered the weakest trade-off: it scored 67/100, well below High (74) and barely above Low (66), while consuming similar session capacity to Low (18% versus 17%). It delivered neither the quality of High nor a meaningful gain in total score over Low.

More importantly, effort level materially changed numerical reliability. Sonnet 4.6 Low scored 5/5 on Actionability but 2/5 on Numerical Accuracy. Sonnet 4.6 High scored 5/5 on both. The effort level selector is a risk-control mechanism, not just a speed dial.

⚠ The Central Risk Warning

The highest-risk output in this study was not the weakest-looking one. Sonnet 4.6 Low produced strong structure, reasonable completeness, and the joint-highest actionability score in the dataset (5/5, tied with Sonnet 4.6 High), while containing material numerical errors. A practitioner reviewing it casually would see a polished, confident-looking report built on incorrect figures. Polished presentation is not evidence of numerical correctness. Always verify AI-generated financial figures independently before professional use.
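
One form of independent verification is to recompute the reported figures outside the model and compare. The sketch below is an illustration under stated assumptions, not the procedure used in this experiment: the function names, the 0.5 rounding tolerance, and all amounts are hypothetical. It checks two things the rubric's D6 dimension cares about, namely that a bridge closes and that a stated variance equals actual minus budget.

```python
from math import isclose


def bridge_closes(opening, movements, reported_closing, tolerance=0.5):
    """Recompute a bridge and compare it with the closing figure in the report.

    Returns (ok, recomputed_closing). `tolerance` absorbs rounding differences.
    """
    recomputed = opening + sum(movements.values())
    return isclose(recomputed, reported_closing, abs_tol=tolerance), recomputed


def variance_matches(actual, budget, reported_variance, tolerance=0.5):
    """Check that a reported variance equals actual minus budget."""
    return isclose(actual - budget, reported_variance, abs_tol=tolerance)


# Hypothetical EBITDA bridge as it might appear in a model-generated report.
opening_budget_ebitda = 100_000
reported_movements = {"volume": -40_000, "material_cost": -25_000, "opex": -9_000}
reported_actual_ebitda = 31_000  # figure stated in the report

ok, recomputed = bridge_closes(
    opening_budget_ebitda, reported_movements, reported_actual_ebitda
)
if not ok:
    gap = reported_actual_ebitda - recomputed
    print(f"Bridge does not close: recomputed {recomputed:,}, unexplained gap {gap:,}")

variance_ok = variance_matches(actual=26_000, budget=100_000, reported_variance=-74_000)
```

In this invented example the movements sum to 26,000 rather than the reported 31,000, so the check flags a 5,000 gap that a casual read of a well-formatted table would not reveal.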

Finding 3 — The Board Pack Problem

No model, at any effort level, produced a board pack one-pager that a CFO could use without editing. Four of five conditions scored 3/5 on Board Pack Quality. This is not primarily a model capability limitation — it is a workflow lesson. Board-ready communication requires a dedicated prompt, a narrower audience definition, explicit board-level questions, and human editorial judgment. A single broad monthly-close prompt is not sufficient.

A Practical Division of Labour

Sonnet produces the structure; the analyst provides the cause.
Opus produces the cause; the analyst refines the communication.

The most practically useful conclusion from this experiment is not a ranking — it is a division of labour. Each model is better suited to a different type of FP&A work, and the best workflow combines model strengths with human judgment rather than delegating the entire task to either.

Sonnet 4.6 consistently produced cleaner structural outputs: better-organised tables, clearer section delineation, more concrete action recommendations. Its outputs are easier to present without heavy reformatting. The gap is that causal attribution — explaining why variances happened — tends to be descriptive rather than analytical at most effort levels.

Opus 4.8 consistently produced deeper analytical reasoning, with stronger causal attribution linking specific drivers to observed variances. Its outputs require more editorial work to make presentable but contain more of the analytical substance that management conversations require.

Neither model can be fully trusted to produce a complete, professional-grade monthly close package without human review. The practical workflow recommendation:

For routine monthly reporting: use Sonnet 4.6 High as the structural foundation. Add root-cause commentary in your own voice, drawing on operational knowledge of the business.
For in-depth variance diagnosis or CFO-level explanation: use Opus 4.8 High for the analytical foundation. Apply editorial judgment to make the output communication-ready.

Practical Model Selection Guide

FP&A Task	Recommended Setting	Why	Human Review Required
Routine monthly reporting	Sonnet 4.6 / High	Strong structure, clarity, and actionability; lower observed session consumption in this experiment	Yes — verify numbers, add causal commentary
Variance / root-cause investigation	Opus 4.8 / High	Stronger causal commentary and diagnostic depth	Yes — validate drivers against business context
Fast first draft	Sonnet 4.6 / Low	Quick structural output and coverage	Heavy numerical verification required before any professional use
Board pack / CFO one-pager	Either model, with a dedicated board-pack prompt	Single-prompt output was not board-ready in any condition	Heavy editorial review required regardless of model
Reforecast implications	Sonnet 4.6 High + analyst judgment	Good structure; business assumptions need professional ownership	Yes — especially assumptions, risks, and full-year sensitivity

Limitations and Disclosures

This is a structured practitioner benchmark, not a blinded academic evaluation. Key limitations to keep in mind:

Single evaluator who designed the experiment. The author designed the dataset, rubric, and embedded analytical traps, then scored the outputs. This dual role introduces potential bias. Future replications should use independent blind evaluators.
Single task type. Findings apply to a CFO monthly close package for a manufacturing company. Other FP&A tasks may produce different model-effort profiles.
No direct token measurement. Session and weekly usage percentages are directional proxies only. Precise cost comparisons require API access with token counts.
Output variability. Claude models do not produce deterministic outputs. A single run per condition means findings reflect one instantiation, not a stable average.
Opus 4.8 Medium not tested. The diminishing returns finding on effort level applies specifically to Sonnet 4.6. No generalisation should be made about Medium effort across all models.

Download the Full Research Paper

17 pages. Full methodology, dimension analysis, literature review, reference list, and appendices including the complete prompt and dataset summary.

Model Choice, Effort Level, and FP&A Output Quality