Practitioner Summary
If you only have two minutes, this table is all you need.
| Item | Summary |
|---|---|
| Research Question | Does a more capable Claude model at higher effort produce meaningfully better FP&A output, and if so, at what resource cost? |
| Design | Five conditions tested: Opus 4.8 at High and Low; Sonnet 4.6 at High, Medium, and Low. Identical prompt and raw GL dataset supplied to each. Outputs scored on a seven-dimension rubric by the author — an FP&A professional with direct manufacturing CFO-level experience. |
| Headline Finding | Sonnet 4.6 High matched Opus 4.8 High on total quality (74/100) while consuming materially less observed session capacity. But the scores diverge at the dimension level: Opus 4.8 produced deeper causal commentary; Sonnet 4.6 produced stronger structure and actionability. |
| Recommendation | Use Sonnet 4.6 High as the default for routine monthly reporting. Add root-cause commentary yourself. Upgrade to Opus 4.8 High when causal diagnosis is the core deliverable. Avoid Sonnet 4.6 Low for any output you plan to act on without heavy numerical verification. |
| ⚠ Risk Warning | The highest-risk output was not the weakest-looking one. Sonnet 4.6 Low produced strong structure and actionability but weak numerical accuracy (D6 score of 2/5). Polished presentation is not evidence of numerical correctness. |
| Bottom Line | Sonnet produces the structure; the analyst provides the cause. Opus produces the cause; the analyst refines the communication. |
| Methodological Note | The dataset, rubric, and scoring were all designed and applied by the same author without blind evaluation. These are results from a structured practitioner benchmark, not a blinded academic evaluation. |
Background and Motivation
Anthropic's release of Claude Opus 4.8, together with user-selectable effort intensity levels, created a new decision variable for AI practitioners: not just which model to use, but how hard to ask it to work. For finance professionals, this raises a practical question with real cost implications.
A CFO monthly close reporting package is among the most demanding analytical tasks in FP&A. It requires multi-layer financial reasoning, precise numerical computation, causal attribution, risk escalation judgment, and executive communication ability — all applied simultaneously to raw data. It is exactly the kind of task where model and effort level selection should matter, and where the wrong choice has professional consequences.
This experiment was designed to answer a two-part question with professional specificity: for a complex FP&A task, does a more capable model at higher effort produce meaningfully better output, and if so, at what cost in session resources?
Most existing AI finance benchmarks evaluate narrow tasks: sentiment classification, named-entity recognition, or isolated numerical reasoning. This experiment evaluates a complete analytical workflow — ten structured deliverables from raw GL data — which is closer to what FP&A professionals actually do.
The Experiment
Task Design
The task was a CFO-ready monthly close package for Meridian Manufacturing Inc., a fictional mid-sized electrical transformer manufacturer (B2B, Ontario, ~180 employees). The company was designed to resemble the industrial manufacturing environments in which the author has direct FP&A experience, enabling professional-grade evaluation.
The prompt asked the model to produce ten deliverables: Data Quality Summary, Executive Summary, KPI Dashboard, Revenue Analysis, Income Statement and EBITDA Bridge with variance classification (OT/T/R/S/U), Cash Flow, Balance Sheet, Root-Cause and Action Plan, Reforecast Implications, and a Board Pack One-Pager. Every section had to end with a concrete recommended action.
The Dataset
Rather than handing the model pre-aggregated figures, the input was a raw GL transaction file: 42 individual April 2026 journal entries with dates, GL codes, amounts, and notes, plus a 12-month budget sheet. The model had to classify, aggregate, and analyse from first principles.
The dataset embedded seven analytical challenges the model had to detect: a cash runway approaching the 2-month threshold (2.1 months), an EBITDA shortfall of 69% against budget, GP margin compression of 8.1 percentage points from a copper cost increase, two large orders deferred to Q3, a recurring new hire salary variance, a one-time trade show overspend, and uncertain service contract renewals due in May.
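To make the "from first principles" step concrete, here is a minimal sketch of aggregating raw journal entries and then computing two of the embedded signals. Every GL code, amount, category mapping, budget figure, and cash figure below is a hypothetical stand-in chosen for illustration, not a value from the Meridian dataset. The runway definition (closing cash divided by monthly net cash burn) is also an assumption, since the definition used in the experiment is not stated here.

```python
from collections import defaultdict

# Hypothetical journal entries: (date, gl_code, amount, note).
# Revenue is positive, costs are negative. Not the real dataset.
ENTRIES = [
    ("2026-04-03", "4000", 900_000.0, "Transformer sales"),
    ("2026-04-10", "5000", -610_000.0, "Copper and materials"),
    ("2026-04-15", "6100", -140_000.0, "Salaries"),
    ("2026-04-22", "6400", -26_000.0, "Trade show"),
]

# Hypothetical mapping from the first digit of a GL code to a P&L category.
CATEGORY_BY_PREFIX = {"4": "revenue", "5": "cogs", "6": "opex"}


def aggregate(entries):
    """Sum journal entry amounts into P&L categories."""
    totals = defaultdict(float)
    for _date, gl_code, amount, _note in entries:
        totals[CATEGORY_BY_PREFIX[gl_code[0]]] += amount
    return dict(totals)


def shortfall_pct(actual, budget):
    """Shortfall against budget, as a percentage of budget."""
    return (budget - actual) / budget * 100


def runway_months(closing_cash, monthly_burn):
    """Assumed definition: months of cash left at the current burn rate."""
    return closing_cash / monthly_burn


totals = aggregate(ENTRIES)
ebitda = sum(totals.values())  # costs are already negative

HYPOTHETICAL_BUDGET_EBITDA = 400_000.0
print(round(shortfall_pct(ebitda, HYPOTHETICAL_BUDGET_EBITDA), 1))
print(round(runway_months(1_050_000.0, 500_000.0), 1))
```

The illustrative inputs were picked so that the two printed figures land on the 69% shortfall and 2.1-month runway quoted above; a real check would start from the actual GL file and budget sheet.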
Scoring Rubric
Outputs were scored on seven dimensions, each rated 1–5 by the author. Dimension weights reflect analytical importance in a real FP&A context:
- D1 — Output Completeness ×2: All ten sections present and populated
- D2 — Variance Classification ×2: Correct OT/T/R/S/U labels on material variances
- D3 — Causal Commentary ×3: WHY not WHAT — named specific drivers (highest weight)
- D4 — Red Flag Detection ×2: All embedded warning signals caught and escalated
- D5 — Board Pack Quality ×2: CFO-usable one-pager without editing
- D6 — Numerical Accuracy ×2: All bridges close, variances correct
- D7 — Actionability ×1: Each section ends with a specific, ownable action
The maximum weighted score is 70 points (a rating of 5 on every dimension, multiplied by weights that sum to 14), normalised to 100. All five conditions received the identical prompt and dataset. Each was run as a fresh conversation with no prior context, between 9am and 5pm on Monday, June 1, 2026, with all account-level custom skills disabled.
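The weighting and normalisation can be sketched as follows. The weights come from the rubric above and the example ratings are the Opus 4.8 / High row of the Dimension Breakdown table. Rounding to the nearest whole number is an assumption, made because it reproduces the reported totals.

```python
# Dimension weights from the rubric.
WEIGHTS = {"D1": 2, "D2": 2, "D3": 3, "D4": 2, "D5": 2, "D6": 2, "D7": 1}
MAX_RATING = 5


def normalised_score(ratings):
    """Weighted rubric score rescaled to 0-100 and rounded (rounding is assumed)."""
    weighted = sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)
    max_weighted = MAX_RATING * sum(WEIGHTS.values())  # 5 * 14 = 70
    return round(100 * weighted / max_weighted)


# Ratings for Opus 4.8 / High, in D1..D7 order.
opus_high = dict(zip(WEIGHTS, [3, 3, 4, 4, 3, 5, 4]))
print(normalised_score(opus_high))  # 52 weighted points out of 70 -> 74
```

Applying the same function to the other four rows of the dimension table gives 70, 74, 67, and 66, matching the reported scores.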
Results
Overall Scores
| Condition | Score | Time | Session % | Weekly % | Words |
|---|---|---|---|---|---|
| Opus 4.8 / High | 74/100 | 652s | 49% | 3% | 4,628 |
| Opus 4.8 / Low | 70/100 | 821s | 41% | 4% | 3,738 |
| Sonnet 4.6 / High | 74/100 | 803s | 27% | 1% | 3,681 |
| Sonnet 4.6 / Medium | 67/100 | 443s | 18% | 1% | 2,943 |
| Sonnet 4.6 / Low | 66/100 | 286s | 17% | 1% | 2,156 |
Session % and Weekly % are directional indicators only. They are not equivalent to token counts and cannot be treated as precise cost ratios.
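Because the usage percentages are directional, one cautious way to read the table is by ordering alone rather than by ratios. The sketch below flags conditions that are "dominated", meaning another condition scored at least as well while using no more observed session capacity. This dominance framing is an illustrative assumption and not part of the original scoring method; the figures are copied from the table above.

```python
# (condition, score out of 100, observed session %), copied from the results table.
RESULTS = [
    ("Opus 4.8 / High", 74, 49),
    ("Opus 4.8 / Low", 70, 41),
    ("Sonnet 4.6 / High", 74, 27),
    ("Sonnet 4.6 / Medium", 67, 18),
    ("Sonnet 4.6 / Low", 66, 17),
]


def dominates(a, b):
    """True if a is no worse than b on both measures and strictly better on one."""
    _, score_a, session_a = a
    _, score_b, session_b = b
    no_worse = score_a >= score_b and session_a <= session_b
    strictly_better = score_a > score_b or session_a < session_b
    return no_worse and strictly_better


def non_dominated(results):
    """Names of conditions that no other condition dominates."""
    return [r[0] for r in results if not any(dominates(o, r) for o in results)]


print(non_dominated(RESULTS))
# ['Sonnet 4.6 / High', 'Sonnet 4.6 / Medium', 'Sonnet 4.6 / Low']
```

This comparison uses the aggregate score only and reflects a single run per condition, so it says nothing about dimension-level differences such as causal commentary or numerical accuracy.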
Dimension Breakdown
| Condition | D1 ×2 | D2 ×2 | D3 ×3 | D4 ×2 | D5 ×2 | D6 ×2 | D7 ×1 | Score |
|---|---|---|---|---|---|---|---|---|
| Opus 4.8 / High | 3 | 3 | 4 | 4 | 3 | 5 | 4 | 74 |
| Opus 4.8 / Low | 3 | 4 | 3 | 3 | 3 | 5 | 4 | 70 |
| Sonnet 4.6 / High | 3 | 4 | 3 | 4 | 3 | 5 | 5 | 74 |
| Sonnet 4.6 / Medium | 4 | 3 | 2 | 3 | 4 | 5 | 3 | 67 |
| Sonnet 4.6 / Low | 4 | 3 | 3 | 4 | 3 | 2 | 5 | 66 |
Key Findings
Finding 1 — Equal Total Score, Different Composition
Opus 4.8 High and Sonnet 4.6 High tied at 74/100, but the scores diverge at the dimension level in ways that matter for model selection:
- Opus 4.8 High was the only condition to score 4 on D3 Causal Commentary — the highest-weighted dimension. It consistently named specific drivers: copper cost increases, the deferred order mechanism, the new hire salary load. It explained WHY results occurred, not just WHAT happened.
- Sonnet 4.6 High matched on total score but produced stronger actionability (D7: 5 vs 4) and cleaner structural output. Tables were better organised, section delineation was clearer, and action recommendations were more immediately usable.
The aggregate tie is less informative than the compositional difference. Practitioners should choose based on which dimension matters most for their specific deliverable.
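The tie can be decomposed directly from the dimension table. The sketch below lists the weighted point gap (Opus 4.8 High minus Sonnet 4.6 High) on each dimension where the two ratings differ; the helper name `weighted_gap` is introduced here purely for illustration.

```python
WEIGHTS = {"D1": 2, "D2": 2, "D3": 3, "D4": 2, "D5": 2, "D6": 2, "D7": 1}

# Ratings in D1..D7 order, copied from the Dimension Breakdown table.
OPUS_HIGH = dict(zip(WEIGHTS, [3, 3, 4, 4, 3, 5, 4]))
SONNET_HIGH = dict(zip(WEIGHTS, [3, 4, 3, 4, 3, 5, 5]))


def weighted_gap(a, b):
    """Weighted points (a minus b) for each dimension where the ratings differ."""
    return {dim: WEIGHTS[dim] * (a[dim] - b[dim]) for dim in WEIGHTS if a[dim] != b[dim]}


gap = weighted_gap(OPUS_HIGH, SONNET_HIGH)
print(gap)                # {'D2': -2, 'D3': 3, 'D7': -1}
print(sum(gap.values()))  # 0
```

Opus 4.8 High's one-point lead on the triple-weighted D3 is worth 3 weighted points, which is exactly offset by Sonnet 4.6 High's one-point leads on D2 (2 points) and D7 (1 point).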
Sonnet 4.6 High consumed 27% of the observed session limit versus 49% for Opus 4.8 High, which suggests meaningfully lower resource use for equal total quality. This is a directional finding only; without direct token counts it cannot be expressed as a precise cost ratio.
Finding 2 — Effort Level is a Risk-Control Setting
Within Sonnet 4.6, Medium effort was the least informative setting: it scored 67/100 — below High (74) but barely above Low (66) — while consuming similar session capacity to Low. It delivered neither the quality of High nor meaningful resource savings versus Low.
More importantly, effort level materially changed numerical reliability. Sonnet 4.6 Low scored 5/5 on Actionability but 2/5 on Numerical Accuracy. Sonnet 4.6 High scored 5/5 on both. The effort level selector is a risk-control mechanism, not just a speed dial.
The highest-risk output in this study was not the weakest-looking one. Sonnet 4.6 Low produced strong structure, reasonable completeness, and the joint-highest actionability score in the study (5/5, tied with Sonnet 4.6 High), while containing material numerical errors. A practitioner reviewing it casually would see a polished, confident-looking report with wrong numbers. Polished presentation is not evidence of numerical correctness. Always verify AI-generated financial figures independently before professional use.
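One simple independent check is whether a reported bridge actually closes, which is what D6 tests. The sketch below assumes a budget-to-actual EBITDA bridge expressed as a list of signed driver variances. All figures, driver amounts, and the tolerance are hypothetical, and `bridge_closes` is an illustrative helper, not a function from any library.

```python
def bridge_closes(opening, drivers, closing, tolerance=0.5):
    """Check that opening + sum of driver variances lands on closing.

    Returns (closes, gap), where gap is the unexplained amount
    (closing minus the bridged total).
    """
    gap = closing - (opening + sum(drivers.values()))
    return abs(gap) <= tolerance, gap


# Hypothetical figures standing in for an AI-generated bridge.
budget_ebitda = 400_000.0
actual_ebitda = 124_000.0
reported_drivers = {
    "Deferred orders": -150_000.0,
    "Copper cost": -90_000.0,
    "New hire salaries": -20_000.0,
    "Trade show": -6_000.0,
}

closes, gap = bridge_closes(budget_ebitda, reported_drivers, actual_ebitda)
print(closes, gap)  # False -10000.0
```

In this made-up case the drivers explain 266,000 of a 276,000 shortfall, so 10,000 is unaccounted for. A gap like that is easy to miss in a well-formatted table, which is why the check is worth running against the source data rather than against the report's own subtotals.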
Finding 3 — The Board Pack Problem
No model, at any effort level, produced a board pack one-pager that a CFO could use without editing. Four of five conditions scored 3/5 on Board Pack Quality. This is not primarily a model capability limitation — it is a workflow lesson. Board-ready communication requires a dedicated prompt, a narrower audience definition, explicit board-level questions, and human editorial judgment. A single broad monthly-close prompt is not sufficient.
A Practical Division of Labour
Sonnet produces the structure; the analyst provides the cause.
Opus produces the cause; the analyst refines the communication.
The most practically useful conclusion from this experiment is not a ranking — it is a division of labour. Each model is better suited to a different type of FP&A work, and the best workflow combines model strengths with human judgment rather than delegating the entire task to either.
Sonnet 4.6 consistently produced cleaner structural outputs: better-organised tables, clearer section delineation, more concrete action recommendations. Its outputs are easier to present without heavy reformatting. The gap is that causal attribution — explaining why variances happened — tends to be descriptive rather than analytical at most effort levels.
Opus 4.8 consistently produced deeper analytical reasoning, with stronger causal attribution linking specific drivers to observed variances. Its outputs require more editorial work to make presentable but contain more of the analytical substance that management conversations require.
Neither model can be fully trusted to produce a complete, professional-grade monthly close package without human review. The practical workflow recommendation:
- For routine monthly reporting: use Sonnet 4.6 High as the structural foundation. Add root-cause commentary in your own voice, drawing on operational knowledge of the business.
- For in-depth variance diagnosis or CFO-level explanation: use Opus 4.8 High for the analytical foundation. Apply editorial judgment to make the output communication-ready.
Practical Model Selection Guide
| FP&A Task | Recommended Setting | Why | Human Review Required |
|---|---|---|---|
| Routine monthly reporting | Sonnet 4.6 / High | Strong structure, clarity, and actionability; lower observed session consumption in this experiment | Yes — verify numbers, add causal commentary |
| Variance / root-cause investigation | Opus 4.8 / High | Stronger causal commentary and diagnostic depth | Yes — validate drivers against business context |
| Fast first draft | Sonnet 4.6 / Low | Quick structural output and coverage | Heavy numerical verification required before any professional use |
| Board pack / CFO one-pager | Either model, with a dedicated board-pack prompt | Single-prompt output was not board-ready in any condition | Heavy editorial review required regardless of model |
| Reforecast implications | Sonnet 4.6 High + analyst judgment | Good structure; business assumptions need professional ownership | Yes — especially assumptions, risks, and full-year sensitivity |
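For teams that want to encode this guide in a script or internal tool, a minimal lookup might look like the sketch below. The task keys and the `recommend` function are hypothetical names; the settings and review notes are condensed from the table above.

```python
# Hypothetical task keys mapped to (recommended setting, human review note).
GUIDE = {
    "routine_monthly_reporting": (
        "Sonnet 4.6 / High",
        "verify numbers, add causal commentary",
    ),
    "root_cause_investigation": (
        "Opus 4.8 / High",
        "validate drivers against business context",
    ),
    "fast_first_draft": (
        "Sonnet 4.6 / Low",
        "heavy numerical verification before any professional use",
    ),
    "board_pack": (
        "Either model, with a dedicated board-pack prompt",
        "heavy editorial review regardless of model",
    ),
    "reforecast_implications": (
        "Sonnet 4.6 / High + analyst judgment",
        "check assumptions, risks, and full-year sensitivity",
    ),
}


def recommend(task):
    """Return the recommended setting and the required human review for a task."""
    if task not in GUIDE:
        raise KeyError(f"No recommendation for task: {task!r}")
    setting, review = GUIDE[task]
    return f"{setting} (human review: {review})"


print(recommend("fast_first_draft"))
```

Every entry carries a review note by design, since no condition in this experiment produced output that could be used without human checking.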
Limitations and Disclosures
This is a structured practitioner benchmark, not a blinded academic evaluation. Key limitations to keep in mind:
- Single evaluator who designed the experiment. The author designed the dataset, rubric, and embedded analytical traps, then scored the outputs. This dual role introduces potential bias. Future replications should use independent blind evaluators.
- Single task type. Findings apply to a CFO monthly close package for a manufacturing company. Other FP&A tasks may produce different model-effort profiles.
- No direct token measurement. Session and weekly usage percentages are directional proxies only. Precise cost comparisons require API access with token counts.
- Output variability. Claude models do not produce deterministic outputs. A single run per condition means findings reflect one instantiation, not a stable average.
- Opus 4.8 Medium not tested. The finding that Medium effort offered little advantage applies specifically to Sonnet 4.6. No generalisation should be made about Medium effort across all models.
Download the Full Research Paper
17 pages. Full methodology, dimension analysis, literature review, reference list, and appendices including the complete prompt and dataset summary.