Last Tuesday I handed ChatGPT o3 a messy 240,000-row sales CSV with seven encoding errors and a duplicated header row. It cleaned, joined, forecasted, and charted the thing in 94 seconds. Then on a simpler task two hours later, it confidently invented a column that did not exist.
That is o3 in 2026. Brilliant, occasionally delusional, and dramatically better at numbers than the GPT-4 generation it replaced. After three weeks of using it daily for client work, I ran six controlled tests to see exactly where it earns its $20 ChatGPT Plus seat — and where you still need a human or a real notebook.
The test setup
Every task was run three times in fresh chats to check consistency. I compared o3 against Claude 3.7 Sonnet (the closest competitor at the time of writing) and, where it mattered, against the same task executed in a Jupyter notebook by yours truly. Datasets ranged from 5,000 to 1.2 million rows: e-commerce orders, ad spend logs, SaaS churn data, and a finance ledger I borrowed from a portfolio company.
The six tests, scored honestly
| Test | o3 | Claude 3.7 | Winner |
|---|---|---|---|
| 1. Clean & dedupe a messy 240k-row CSV | 9/10 | 7/10 | o3 |
| 2. SQL query generation (Postgres, 12 joins) | 8/10 | 9/10 | Claude |
| 3. 90-day revenue forecast with seasonality | 8/10 | 6/10 | o3 |
| 4. Anomaly detection in ad spend | 9/10 | 7/10 | o3 |
| 5. Cohort retention analysis | 7/10 | 8/10 | Claude |
| 6. Executive summary chart deck | 9/10 | 6/10 | o3 |
Where o3 genuinely shines
Two patterns emerged. First, anything that requires running code beats anything that requires reading code. o3's Advanced Data Analysis mode quietly iterates — it tries a join, sees the row count looks wrong, fixes it, and reruns. I watched it correct itself four times on the anomaly test without being asked.
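That iterate-and-check pattern is worth stealing for your own scripts. A minimal sketch of a join guarded by the same row-count sanity check (the frames and column names are hypothetical stand-ins for the anomaly-test data):

```python
import pandas as pd

# Hypothetical frames standing in for the ad-spend join
orders = pd.DataFrame({"campaign_id": [1, 2, 2, 3], "revenue": [100, 50, 75, 20]})
spend = pd.DataFrame({"campaign_id": [1, 2, 3], "spend": [30, 40, 10]})

# validate="many_to_one" makes pandas itself raise if spend has duplicate keys
merged = orders.merge(spend, on="campaign_id", how="left", validate="many_to_one")

# The check o3 runs implicitly: a left join should never change the row count
assert len(merged) == len(orders), "row count changed after join -- investigate"
print(merged.shape)  # (4, 3)
```

The `validate` argument is the cheap insurance here: it turns a silent fan-out join into a loud error.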
Second, o3 writes shockingly good matplotlib. The charts it produced for test six were boardroom-ready: proper titles, sensible color choices, no overlapping labels. Claude's charts looked like 2019.
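For reference, the qualities that made those charts boardroom-ready take only a few deliberate lines of matplotlib; a sketch with invented numbers:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 95, 140, 160]  # illustrative values

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.bar(months, revenue, color="#4C72B0")   # one restrained color, not a rainbow
ax.set_title("Quarterly Revenue by Month")  # a real title
ax.set_ylabel("Revenue ($k)")               # labeled axis with units
fig.tight_layout()                          # prevents clipped or overlapping labels
fig.savefig("revenue.png", dpi=150)
```

None of this is exotic; the point is that o3 does it unprompted, and most hand-rolled charts skip at least one of these steps.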
Where it broke
Test five, the cohort analysis, is where things got ugly. On the third run, o3 hallucinated a `subscription_tier` column and built an entire retention table around it. The column did not exist. The output looked completely plausible. If I had not eyeballed the source data first, I would have shipped a fabricated chart to a client.
This is the o3 trap. When it is right, it is precise. When it is wrong, it is articulate. There is no middle warning state.
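You can blunt this failure mode mechanically: before trusting generated analysis code, diff every column it references against the real schema. A sketch (the frame and the claimed column list are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "signup_date": ["2026-01-01", "2026-02-01"]})

def missing_columns(df, claimed):
    """Return columns the generated code references that the data lacks."""
    return sorted(set(claimed) - set(df.columns))

# subscription_tier is exactly the kind of plausible column o3 invented in test five
gaps = missing_columns(df, ["user_id", "signup_date", "subscription_tier"])
print(gaps)  # ['subscription_tier']
```

Two lines of set arithmetic would have caught the fabricated cohort table before it reached a client.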
A repeatable workflow that actually works
- Always paste the schema first. Before you upload anything, tell o3 the exact column names and types. This single step cut my hallucination rate by roughly two-thirds.
- Ask for the plan before the code. Prompt: "Outline the steps you'll take, then wait for me to approve." o3 follows this faithfully and catches its own bad assumptions.
- Force it to print row counts. After every join or filter, ask for `df.shape`. Silent row loss is the #1 way analyses go wrong.
- Verify one number manually. Pick a metric, compute it yourself in a pivot table, and confirm o3 matches. If yes, trust the rest. If no, restart.
- Export the code, not just the answer. Have o3 hand you the Python script. Run it in Colab to reproduce. This is your audit trail.
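Once you have the exported script, the row-count and spot-check steps reduce to a few lines; a sketch with made-up numbers (the "claimed" figure stands in for whatever o3 reported):

```python
import pandas as pd

orders = pd.DataFrame({
    "month": ["2026-01", "2026-01", "2026-02"],
    "revenue": [120.0, 80.0, 150.0],
})

# Print the shape after every transform so silent row loss is visible
monthly = orders.groupby("month", as_index=False)["revenue"].sum()
print(monthly.shape)  # (2, 2)

# Recompute one number independently and compare to the model's figure
model_claimed_jan = 200.0  # hypothetical value reported by o3
my_jan = orders.loc[orders["month"] == "2026-01", "revenue"].sum()
assert abs(my_jan - model_claimed_jan) < 1e-9, "mismatch -- restart the session"
```

If the one number matches, you have reasonable grounds to trust the rest; if it does not, the session is compromised and patching it is riskier than starting over.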
What it costs versus the alternatives
ChatGPT Plus is $20/month and gets you o3 with reasonable message limits. ChatGPT Team at $25/user/month raises those limits and adds shared workspaces. The o3 API runs at $2 per million input tokens and $8 per million output tokens as of OpenAI's April 2026 pricing page — meaningful if you are wiring it into a product, irrelevant if you are using it through the web app.
For comparison, a junior data analyst on Upwork runs $25-60/hour. One careful o3 session replaces roughly two to four of those hours on routine work. The math is not subtle.
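To make the API side of that math concrete, here is the arithmetic at the quoted rates; the token counts are invented for illustration:

```python
# o3 API rates quoted above, dollars per million tokens
INPUT_RATE = 2.00
OUTPUT_RATE = 8.00

# Hypothetical heavy session: pasted schema plus a large data summary in,
# code and commentary out
input_tokens = 150_000
output_tokens = 40_000

cost = input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
print(f"${cost:.2f}")  # $0.62
```

Even a token-hungry session costs well under a dollar over the API, which is why the comparison against hourly analyst rates is so lopsided.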
FAQ
Is o3 better than the old Advanced Data Analysis (Code Interpreter)?
Yes, meaningfully. It reasons longer before writing code, catches its own errors mid-run, and produces cleaner visualizations. The speedup on complex tasks is roughly 2x in my testing.
Can o3 handle datasets over one million rows?
Inside ChatGPT, you will hit memory limits around 1.5M rows.