Practitioner Summary

If you only have two minutes, this table is all you need.

One-Page Summary
Research Question: Does a more capable Claude model at higher effort produce meaningfully better FP&A output, and if so, at what resource cost?
Design: Five conditions tested: Opus 4.8 at High and Low; Sonnet 4.6 at High, Medium, and Low. Identical prompt and raw GL dataset supplied to each. Outputs scored on a seven-dimension rubric by the author, an FP&A professional with direct manufacturing CFO-level experience.
Headline Finding: Sonnet 4.6 High matched Opus 4.8 High on total quality (74/100) while consuming materially less observed session capacity. But the scores diverge at the dimension level: Opus 4.8 produced deeper causal commentary; Sonnet 4.6 produced stronger structure and actionability.
Recommendation: Use Sonnet 4.6 High as the default for routine monthly reporting. Add root-cause commentary yourself. Upgrade to Opus 4.8 High when causal diagnosis is the core deliverable. Avoid Sonnet 4.6 Low for any output you plan to act on without heavy numerical verification.
⚠ Risk Warning: The highest-risk output was not the weakest-looking one. Sonnet 4.6 Low produced strong structure and actionability but weak numerical accuracy (D6 score of 2/5). Polished presentation is not evidence of numerical correctness.
Bottom Line: Sonnet produces the structure; the analyst provides the cause. Opus produces the cause; the analyst refines the communication.
Methodological Note: The dataset, rubric, and scoring were all designed and applied by the same author without blind evaluation. These are results from a structured practitioner benchmark, not a blinded academic evaluation.

Background and Motivation

Anthropic's release of Claude Opus 4.8, together with user-selectable effort intensity levels, created a new decision variable for AI practitioners: not just which model to use, but how hard to ask it to work. For finance professionals, this raises a practical question with real cost implications.

A CFO monthly close reporting package is among the most demanding analytical tasks in FP&A. It requires multi-layer financial reasoning, precise numerical computation, causal attribution, risk escalation judgment, and executive communication ability — all applied simultaneously to raw data. It is exactly the kind of task where model and effort level selection should matter, and where the wrong choice has professional consequences.

This experiment was designed to answer one question with professional specificity: for a complex FP&A task, does a more capable model at higher effort produce meaningfully better output? And if so, at what cost in session resources?

Key Context

Most existing AI finance benchmarks evaluate narrow tasks: sentiment classification, named-entity recognition, or isolated numerical reasoning. This experiment evaluates a complete analytical workflow — ten structured deliverables from raw GL data — which is closer to what FP&A professionals actually do.

The Experiment

Task Design

The task was a CFO-ready monthly close package for Meridian Manufacturing Inc., a fictional mid-sized electrical transformer manufacturer (B2B, Ontario, ~180 employees). The company was designed to resemble the industrial manufacturing environments in which the author has direct FP&A experience, enabling professional-grade evaluation.

The prompt asked the model to produce ten deliverables: Data Quality Summary, Executive Summary, KPI Dashboard, Revenue Analysis, Income Statement and EBITDA Bridge with variance classification (OT/T/R/S/U), Cash Flow, Balance Sheet, Root-Cause and Action Plan, Reforecast Implications, and a Board Pack One-Pager. Every section had to end with a concrete recommended action.

The Dataset

Rather than handing the model pre-aggregated figures, the input was a raw GL transaction file: 42 individual April 2026 journal entries with dates, GL codes, amounts, and notes, plus a 12-month budget sheet. The model had to classify, aggregate, and analyse from first principles.

The dataset embedded seven analytical challenges the model had to detect: a cash runway approaching the 2-month threshold (2.1 months), an EBITDA shortfall of 69% against budget, GP margin compression of 8.1 percentage points from a copper cost increase, two large orders deferred to Q3, a recurring new hire salary variance, a one-time trade show overspend, and uncertain service contract renewals due in May.
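The three headline metrics in that list (cash runway, EBITDA shortfall, margin compression) can be made concrete with a short sketch. The article does not give the formulas used, so the definitions below are common FP&A conventions assumed for illustration, and every input value is a hypothetical placeholder chosen only so the outputs land on the quoted figures. None of it is taken from the Meridian dataset.

```python
def cash_runway_months(cash_balance: float, monthly_net_burn: float) -> float:
    """Months of cash remaining at the current net burn (assumed definition)."""
    if monthly_net_burn <= 0:
        raise ValueError("runway is only defined for a positive net burn")
    return cash_balance / monthly_net_burn


def shortfall_pct(actual: float, budget: float) -> float:
    """Shortfall against budget as a percentage of budget (positive = below budget)."""
    if budget == 0:
        raise ValueError("budget must be non-zero")
    return 100 * (budget - actual) / budget


def margin_change_pp(actual_gp: float, actual_rev: float,
                     budget_gp: float, budget_rev: float) -> float:
    """Change in gross-profit margin versus budget, in percentage points."""
    return (actual_gp / actual_rev - budget_gp / budget_rev) * 100


# Hypothetical inputs, NOT the Meridian figures.
runway = cash_runway_months(cash_balance=1_050_000, monthly_net_burn=500_000)
ebitda_gap = shortfall_pct(actual=62_000, budget=200_000)
gp_shift = margin_change_pp(239_000, 1_000_000, 320_000, 1_000_000)

print(f"Cash runway: {runway:.1f} months")
print(f"EBITDA shortfall vs budget: {ebitda_gap:.0f}%")
print(f"GP margin change vs budget: {gp_shift:.1f} pp")
```

Writing the definitions down like this also gives the reviewer a fixed reference to check model-reported figures against, whichever definitions the organisation actually uses.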

Scoring Rubric

Outputs were scored on seven dimensions (D1–D7), each rated 1–5 by the author. Dimension weights reflect analytical importance in a real FP&A context: D3 carries a weight of 3, D7 a weight of 1, and the remaining five dimensions a weight of 2 each.

Total possible weighted score: 70 points, normalised to 100. All five conditions received the identical prompt and dataset. Each was run as a fresh conversation with no prior context, between 9 a.m. and 5 p.m. on Monday, June 1, 2026, with all account-level custom skills disabled.

Results

Overall Scores

Condition | Score | Time | Session % | Weekly % | Words
Opus 4.8 / High | 74/100 | 652s | 49% | 3% | 4,628
Opus 4.8 / Low | 70/100 | 821s | 41% | 4% | 3,738
Sonnet 4.6 / High | 74/100 | 803s | 27% | 1% | 3,681
Sonnet 4.6 / Medium | 67/100 | 443s | 18% | 1% | 2,943
Sonnet 4.6 / Low | 66/100 | 286s | 17% | 1% | 2,156

Session % and Weekly % are directional indicators only: they are not equivalent to token counts and cannot be treated as precise cost ratios.

Dimension Breakdown

Condition | D1 (×2) | D2 (×2) | D3 (×3) | D4 (×2) | D5 (×2) | D6 (×2) | D7 (×1) | Score
Opus 4.8 / High | 3 | 3 | 4 | 4 | 3 | 5 | 4 | 74
Opus 4.8 / Low | 3 | 4 | 3 | 3 | 3 | 5 | 4 | 70
Sonnet 4.6 / High | 3 | 4 | 3 | 4 | 3 | 5 | 5 | 74
Sonnet 4.6 / Medium | 4 | 3 | 2 | 3 | 4 | 5 | 3 | 67
Sonnet 4.6 / Low | 4 | 3 | 3 | 4 | 3 | 2 | 5 | 66
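The totals in the Score column can be reproduced from the ratings and weights in the table above. The sketch below is a minimal re-computation, not the author's own tooling: the weights and ratings are copied from the table, the 70-point maximum comes from the rubric description, and rounding to the nearest whole number is an assumption that happens to match every published total.

```python
# Weights per dimension, as shown in the table header (D1..D7).
WEIGHTS = [2, 2, 3, 2, 2, 2, 1]
MAX_RATING = 5  # each dimension is rated 1-5

# Ratings per condition, copied from the Dimension Breakdown table.
RATINGS = {
    "Opus 4.8 / High":     [3, 3, 4, 4, 3, 5, 4],
    "Opus 4.8 / Low":      [3, 4, 3, 3, 3, 5, 4],
    "Sonnet 4.6 / High":   [3, 4, 3, 4, 3, 5, 5],
    "Sonnet 4.6 / Medium": [4, 3, 2, 3, 4, 5, 3],
    "Sonnet 4.6 / Low":    [4, 3, 3, 4, 3, 2, 5],
}


def normalised_score(ratings: list[int]) -> int:
    """Weighted sum of ratings, rescaled from the 70-point maximum to 100."""
    if len(ratings) != len(WEIGHTS):
        raise ValueError("expected one rating per dimension")
    raw = sum(w * r for w, r in zip(WEIGHTS, ratings))
    max_raw = MAX_RATING * sum(WEIGHTS)  # 5 * 14 = 70
    return round(100 * raw / max_raw)    # rounding rule is an assumption


for condition, ratings in RATINGS.items():
    print(f"{condition}: {normalised_score(ratings)}/100")
```

Because D3 carries the largest weight, a one-point change there moves the normalised total by roughly four points, whereas a one-point change on D7 moves it by under one and a half.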

Key Findings

Finding 1 — Equal Total Score, Different Composition

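Opus 4.8 High and Sonnet 4.6 High tied at 74/100, but the scores diverge at the dimension level in ways that matter for model selection: Opus 4.8 High leads on D3 (4 vs 3), Sonnet 4.6 High leads on D2 (4 vs 3) and D7 (5 vs 4), and the two are level on D1, D4, D5, and D6.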

The aggregate tie is less informative than the compositional difference. Practitioners should choose based on which dimension matters most for their specific deliverable.

Efficiency Signal

Sonnet 4.6 High consumed 27% of the observed session limit versus 49% for Opus 4.8 High, which points to meaningfully lower resource use for equal total quality. The finding is directional only; without direct token counts it cannot be expressed as a precise cost ratio.
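One way to read the efficiency signal without turning Session % into a cost ratio is an ordinal dominance check: a condition is dominated if another scores at least as well while using no more session capacity. The sketch below applies that check to the totals and Session % values from the Overall Scores table. The `Condition` type and the function names are illustrative, and the comparison uses only the ordering of Session %, consistent with treating it as directional.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    name: str
    score: int        # normalised total out of 100
    session_pct: int  # observed session consumption, directional only


# Values copied from the Overall Scores table.
CONDITIONS = [
    Condition("Opus 4.8 / High", 74, 49),
    Condition("Opus 4.8 / Low", 70, 41),
    Condition("Sonnet 4.6 / High", 74, 27),
    Condition("Sonnet 4.6 / Medium", 67, 18),
    Condition("Sonnet 4.6 / Low", 66, 17),
]


def dominates(a: Condition, b: Condition) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.session_pct <= b.session_pct
    strictly_better = a.score > b.score or a.session_pct < b.session_pct
    return no_worse and strictly_better


def undominated(conditions: list[Condition]) -> list[str]:
    """Names of conditions that no other condition dominates."""
    return [c.name for c in conditions
            if not any(dominates(other, c) for other in conditions)]


print(undominated(CONDITIONS))
```

On total score alone, both Opus conditions are dominated by Sonnet 4.6 High. That result says nothing about the dimension-level differences, so it does not settle the choice when causal depth is the deliverable.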

Finding 2 — Effort Level is a Risk-Control Setting

Within Sonnet 4.6, Medium effort was the least informative setting: it scored 67/100 — below High (74) but barely above Low (66) — while consuming similar session capacity to Low. It delivered neither the quality of High nor meaningful resource savings versus Low.

More importantly, effort level materially changed numerical reliability. Sonnet 4.6 Low scored 5/5 on Actionability but 2/5 on Numerical Accuracy. Sonnet 4.6 High scored 5/5 on both. The effort level selector is a risk-control mechanism, not just a speed dial.

⚠ The Central Risk Warning

The highest-risk output in this study was not the weakest-looking one. Sonnet 4.6 Low produced strong structure, reasonable completeness, and the joint-highest actionability score in the dataset (5/5, tied with Sonnet 4.6 High), while containing material numerical errors. A practitioner reviewing it casually would see a polished, confident-looking report. The numbers would be wrong. Polished presentation is not evidence of numerical correctness. Always verify AI-generated financial figures independently before professional use.
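Independent verification can be as simple as re-aggregating the raw GL entries and comparing the result with the totals the model reported. The sketch below is one minimal way to do that, not the procedure used in the study. The GL codes, amounts, function names, and one-cent tolerance are all hypothetical, and `Decimal` is used so that currency sums are exact.

```python
from decimal import Decimal


def total_by_code(entries: list[tuple[str, str]]) -> dict[str, Decimal]:
    """Sum journal-entry amounts per GL code using exact decimal arithmetic."""
    totals: dict[str, Decimal] = {}
    for code, amount in entries:
        totals[code] = totals.get(code, Decimal("0")) + Decimal(amount)
    return totals


def find_mismatches(entries: list[tuple[str, str]],
                    reported: dict[str, str],
                    tolerance: Decimal = Decimal("0.01")
                    ) -> dict[str, tuple[Decimal, Decimal]]:
    """Return {code: (recomputed, reported)} where the two differ beyond tolerance."""
    recomputed = total_by_code(entries)
    mismatches = {}
    # Check codes from both sides so omitted or invented accounts are caught too.
    for code in sorted(set(recomputed) | set(reported)):
        expected = recomputed.get(code, Decimal("0"))
        claimed = Decimal(reported.get(code, "0"))
        if abs(expected - claimed) > tolerance:
            mismatches[code] = (expected, claimed)
    return mismatches


# Hypothetical data: the reported total for code 5000 has two digits transposed.
gl_entries = [("4000", "1200.00"), ("4000", "800.00"), ("5000", "-450.25")]
model_reported = {"4000": "2000.00", "5000": "-540.25"}

for code, (expected, claimed) in find_mismatches(gl_entries, model_reported).items():
    print(f"GL {code}: recomputed {expected}, model reported {claimed}")
```

A check like this only covers arithmetic on the source data. Classification choices, such as which entries count as one-time, still need a human reviewer.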

Finding 3 — The Board Pack Problem

No model, at any effort level, produced a board pack one-pager that a CFO could use without editing. Four of five conditions scored 3/5 on Board Pack Quality. This is not primarily a model capability limitation — it is a workflow lesson. Board-ready communication requires a dedicated prompt, a narrower audience definition, explicit board-level questions, and human editorial judgment. A single broad monthly-close prompt is not sufficient.

A Practical Division of Labour

Sonnet produces the structure; the analyst provides the cause.
Opus produces the cause; the analyst refines the communication.

The most practically useful conclusion from this experiment is not a ranking — it is a division of labour. Each model is better suited to a different type of FP&A work, and the best workflow combines model strengths with human judgment rather than delegating the entire task to either.

Sonnet 4.6 consistently produced cleaner structural outputs: better-organised tables, clearer section delineation, more concrete action recommendations. Its outputs are easier to present without heavy reformatting. The gap is that causal attribution — explaining why variances happened — tends to be descriptive rather than analytical at most effort levels.

Opus 4.8 consistently produced deeper analytical reasoning, with stronger causal attribution linking specific drivers to observed variances. Its outputs require more editorial work to make presentable but contain more of the analytical substance that management conversations require.

Neither model can be fully trusted to produce a complete, professional-grade monthly close package without human review. The guide below sets out the practical workflow recommendation for each type of task.

Practical Model Selection Guide

FP&A Task | Recommended Setting | Why | Human Review Required
Routine monthly reporting | Sonnet 4.6 / High | Strong structure, clarity, and actionability; lower observed session consumption in this experiment | Yes: verify numbers, add causal commentary
Variance / root-cause investigation | Opus 4.8 / High | Stronger causal commentary and diagnostic depth | Yes: validate drivers against business context
Fast first draft | Sonnet 4.6 / Low | Quick structural output and coverage | Heavy numerical verification required before any professional use
Board pack / CFO one-pager | Either model, with a dedicated board-pack prompt | Single-prompt output was not board-ready in any condition | Heavy editorial review required regardless of model
Reforecast implications | Sonnet 4.6 High + analyst judgment | Good structure; business assumptions need professional ownership | Yes: especially assumptions, risks, and full-year sensitivity

Limitations and Disclosures

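This is a structured practitioner benchmark, not a blinded academic evaluation. Two limitations noted earlier matter most: the dataset, rubric, and scoring were all designed and applied by the same author without blind evaluation, and Session % and Weekly % are directional indicators rather than token counts.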

Download the Full Research Paper

17 pages. Full methodology, dimension analysis, literature review, reference list, and appendices including the complete prompt and dataset summary.
