A freelance ops consultant I know spent eleven hours last quarter cleaning a single export from a client's CRM: 47,000 rows, three date formats, country names in four languages, and a "phone" column that sometimes held emails. She redid the job last month in nine minutes using ChatGPT's Advanced Data Analysis. Same file. Same output. A fraction of the effort.
That gap is the story. Advanced Data Analysis (formerly Code Interpreter) is bundled into ChatGPT Plus at $20/month and ships a sandboxed Python environment with pandas, numpy, and openpyxl preloaded. You upload a file, describe the mess in plain English, and it writes, runs, and debugs the cleaning code while you watch. No local Python install. No Stack Overflow tabs.
Here's how to use it properly — and where it still bites.
Why this beats your usual workflow
Excel chokes above a few hundred thousand rows. OpenRefine is powerful but has a learning curve measured in weekends. Hiring a VA on Upwork at $15-30/hour means waiting overnight and re-explaining the rules every time the file format shifts.
ChatGPT sits in the middle: faster than a human, more flexible than a macro, and it explains every transformation it makes. That last part matters. When a client asks why 412 rows got dropped, you get a real answer, not a shrug.
| Tool | Best for | Cost | Row ceiling |
|---|---|---|---|
| Excel / Google Sheets | Quick visual fixes | $0-12/mo | ~1M (slow past 100k) |
| OpenRefine | Repeatable, auditable cleanup | Free | Several million |
| ChatGPT Advanced Data Analysis | One-off messy files, fast | $20/mo | ~500k practical |
| Custom Python script | Recurring pipelines | Dev time | Unlimited |
The seven-step cleanup
- Upload the file. Drag your CSV into the chat. Files up to 512MB are accepted per OpenAI's current documentation, though anything above ~200MB gets sluggish.
- Ask for a profile, not a fix. Start with: "Load this CSV and give me a column-by-column profile: dtype, null count, unique count, and five sample values." You need to see the mess before you treat it.
- Define your rules out loud. Tell it the canonical format. Example: "Dates should be ISO 8601. Country should match ISO 3166 alpha-2. Emails lowercase. Trim whitespace everywhere."
- Dedupe with a key. Don't let it guess. Say: "Treat rows as duplicates when normalized email + normalized phone match. Keep the most recent by signup_date."
- Quarantine, don't delete. Ask it to move questionable rows to a second sheet — invalid emails, dates outside 1990-2026, negative revenue — instead of dropping them silently.
- Request a diff summary. "How many rows changed, by column?" This is your audit trail.
- Export both files. Cleaned CSV plus a quarantine CSV. Download both before closing the chat — the sandbox is ephemeral.
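The steps above can be sketched in pandas roughly as follows. This is a minimal illustration of the kind of code the sandbox might write, not what ChatGPT will actually generate for your file. The column names (`email`, `phone`, `signup_date`, `revenue`), the email regex, and the tiny inline dataset are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample standing in for an uploaded CSV.
df = pd.DataFrame({
    "email": ["Ann@Example.com ", "ann@example.com", "bob@example.com", "not-an-email"],
    "phone": ["555-0100", "5550100", "555-0199", "555-0142"],
    "signup_date": ["2023-01-05", "2024-03-10", "2022-07-01", "2021-02-02"],
    "revenue": [120.0, 150.0, -40.0, 75.0],
})

# Profile each column before changing anything.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "unique": df.nunique(),
})

# Apply the stated rules: trim, lowercase, digits-only phone key, parsed dates.
clean = df.copy()
clean["email"] = clean["email"].str.strip().str.lower()
clean["phone_key"] = clean["phone"].str.replace(r"\D", "", regex=True)
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")

# Audit trail: count how many email values the normalization changed.
changed_email = int((clean["email"] != df["email"]).sum())

# Dedupe on an explicit key, keeping the most recent signup.
clean = (
    clean.sort_values("signup_date")
    .drop_duplicates(subset=["email", "phone_key"], keep="last")
)

# Quarantine questionable rows instead of dropping them.
bad_email = ~clean["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
bad_revenue = clean["revenue"] < 0
bad_date = clean["signup_date"].isna()
suspect = bad_email | bad_revenue | bad_date
quarantine = clean[suspect]
cleaned = clean[~suspect]

print(profile)
print({"input": len(df), "cleaned": len(cleaned), "quarantined": len(quarantine)})
```

From here, `cleaned.to_csv("cleaned.csv", index=False)` and the same call on `quarantine` would produce the two export files. Keeping the quarantine mask as separate named conditions makes it easy to report why each row was set aside.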
Prompts that actually work
Vague prompts produce vague code. These don't.
For mixed date formats: "Parse the signup_date column. It contains a mix of MM/DD/YYYY, DD-MM-YYYY, and Unix timestamps. Use the source_country column to disambiguate ambiguous dates — US rows are MM/DD, everything else is DD/MM. Output ISO 8601."
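A date prompt like this one might translate into something along these lines. It is a sketch using only the standard library, and it rests on assumptions not stated in the prompt: all-digit values are treated as Unix timestamps in seconds (UTC), the separator can be either `/` or `-`, and the function name `parse_signup_date` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

def parse_signup_date(raw: str, source_country: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if the value can't be parsed."""
    value = raw.strip()
    # Assumption: an all-digit value is a Unix timestamp in seconds (UTC).
    if value.isdigit():
        return datetime.fromtimestamp(int(value), tz=timezone.utc).date().isoformat()
    # US rows are month-first; everything else is day-first.
    month_first = source_country.strip().upper() == "US"
    sep = "/" if "/" in value else "-"
    fmt = f"%m{sep}%d{sep}%Y" if month_first else f"%d{sep}%m{sep}%Y"
    try:
        return datetime.strptime(value, fmt).date().isoformat()
    except ValueError:
        # Unparseable values are returned as None so they can be quarantined.
        return None

print(parse_signup_date("03/04/2024", "US"))  # month-first reading
print(parse_signup_date("03/04/2024", "DE"))  # day-first reading
```

The same string yields two different dates depending on the country, which is why the prompt spells out the disambiguation rule instead of leaving it to a guess.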
For inconsistent categoricals: "The industry column has 340 unique values but probably represents about 20 real categories. Cluster them using fuzzy matching, show me the proposed mapping as a table, and wait for my approval before applying it."
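To make the "proposed mapping" idea concrete, here is one way such a clustering step could look. It is a deliberately simple greedy approach built on the standard library's `difflib`, not necessarily the method ChatGPT would pick; the `propose_mapping` helper, the 0.8 threshold, and the sample values are assumptions for illustration.

```python
from difflib import SequenceMatcher

def propose_mapping(values, threshold=0.8):
    """Greedily map each raw value to the first canonical label it resembles."""
    canonical = []
    mapping = {}
    for raw in values:
        key = raw.strip().lower()
        # Reuse an existing label when the similarity ratio clears the threshold.
        match = next(
            (c for c in canonical if SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(key)
            match = key
        mapping[raw] = match
    return mapping

raw_values = ["Software", "software ", "Sofware", "Healthcare", "Health care", "Retail"]
proposed = propose_mapping(raw_values)

# Show the mapping for review before anything is applied to the data.
for raw, label in proposed.items():
    print(f"{raw!r:15} -> {label}")
```

Nothing is applied to the column here. The function only returns a mapping, which mirrors the review-then-approve checkpoint the prompt asks for.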
For dirty phone numbers: "Normalize the phone column to E.164. Use country_code where available, otherwise infer from the country column. Flag any that can't be parsed — don't drop them."
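The flag-don't-drop behavior in that prompt can be sketched as below. This is a simplified illustration, not a full E.164 validator: the `CALLING_CODES` table covers only four countries, the `to_e164` helper is hypothetical, and stripping a leading zero as a trunk prefix is not correct for every country. A production job would more likely lean on a dedicated library such as `phonenumbers`.

```python
import re

# Hypothetical lookup covering only the countries in this example.
CALLING_CODES = {"US": "1", "GB": "44", "DE": "49", "FR": "33"}

def to_e164(raw, country):
    """Return (number, flag); number is None when the value can't be normalized."""
    text = raw.strip()
    digits = re.sub(r"\D", "", text)
    if text.startswith("+"):
        # The value already carries an international prefix.
        candidate = digits
    else:
        code = CALLING_CODES.get(country.strip().upper())
        if code is None:
            return None, "unknown_country"
        # Drop a leading national trunk 0 before adding the calling code
        # (a simplification that does not hold for every country).
        candidate = code + digits.lstrip("0")
    # E.164 allows at most 15 digits; the lower bound of 8 is an assumption.
    if not 8 <= len(candidate) <= 15:
        return None, "bad_length"
    return "+" + candidate, None

print(to_e164("(415) 555-0100", "US"))
print(to_e164("ann@example.com", "US"))
```

Returning a flag alongside the result keeps unparseable values, such as an email sitting in the phone column, visible for the quarantine file.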
That "wait for my approval" pattern is underrated. It turns a black box into a checkpoint.
Where it breaks
Advanced Data Analysis has real limits. The sandbox times out after roughly 10 minutes of inactivity, taking your dataframes with it. Files over 200MB hit memory issues during groupby operations. And the model occasionally hallucinates column names that don't exist, so always ask it to print df.columns.tolist() first.
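A column check of that kind is only a few lines. The sketch below assumes a hypothetical frame and a hypothetical list of columns the prompt is about to reference; the comparison is case-sensitive, so `Email` and `email` count as different names.

```python
import pandas as pd

# Hypothetical frame standing in for the uploaded file.
df = pd.DataFrame({"Email": ["a@example.com"], "signup_date": ["2024-01-01"]})

# Columns the cleaning prompt is about to reference.
expected = ["email", "signup_date", "phone"]

actual = df.columns.tolist()
missing = [name for name in expected if name not in actual]

print("actual:", actual)
print("missing:", missing)
```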
It also can't reach the internet. If you need to enrich rows with live data — say, validating emails through a service like ZeroBounce or NeverBounce — you'll do that step outside the chat.
FAQ
Is my data safe inside ChatGPT?
Per OpenAI's data