
CSV Data Cleanup: Practical Solutions for Real Problems

Most CSV problems are predictable. Encoding corruption, wrong delimiters, misquoted fields, duplicates, date format chaos — this guide covers each one with concrete fixes, including command-line recipes, regex patterns, and SQL approaches for large files.

Encoding Problems

Garbled characters are almost always an encoding mismatch. The file was written in one encoding and you're reading it as another.

What it looks like

JosÃ© GarcÃ­a       → José García     (UTF-8 read as Windows-1252)
FranÃ§ois           → François        (same problem)
â€œsmart quotesâ€   → "smart quotes"  (same problem, with curly quotes)

The pattern Ã© is the two-byte UTF-8 sequence for é (bytes C3 A9) being read as Windows-1252 characters — each byte gets decoded separately. Once you recognize this pattern, you know the file is UTF-8 being misread as something else.

Diagnosing encoding

# Linux/Mac: detect encoding
file -i data.csv

→ text/plain; charset=utf-8

Python: statistical detection

import chardet

with open('data.csv', 'rb') as f:
    result = chardet.detect(f.read(100000))
print(result)

→ {'encoding': 'ISO-8859-1', 'confidence': 0.73}

chardet works by statistics — it's right most of the time but not always, especially for short files. If it returns low confidence, try the encoding it suggests and look at accented characters to confirm visually.
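A common fallback when detection is ambiguous is to try a short list of likely encodings in order and keep the first that decodes without error — a minimal sketch (the candidate list is an assumption; put iso-8859-1 last because it accepts any byte sequence):

```python
def detect_by_trial(raw_bytes, candidates=('utf-8', 'windows-1252', 'iso-8859-1')):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# b'Jos\xc3\xa9' is valid UTF-8 for "José", so UTF-8 wins
print(detect_by_trial(b'Jos\xc3\xa9'))  # → utf-8
```

This confirms a file is *decodable* as an encoding, not that the decode is *correct* — still eyeball accented characters afterwards.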

Converting to UTF-8

# iconv (Linux/Mac)
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
iconv -f SHIFT-JIS -t UTF-8 japanese.csv > japanese_utf8.csv

Python

with open('input.csv', encoding='windows-1252') as f_in:
    data = f_in.read()
with open('output.csv', 'w', encoding='utf-8') as f_out:
    f_out.write(data)

BOM issues

Excel adds a UTF-8 BOM (bytes EF BB BF) to CSV files it exports. Most parsers handle this silently, but some don't — they include the BOM as part of the first column header, so instead of Name you get ï»¿Name. Strip it with:

# Python: read with utf-8-sig to auto-strip BOM
with open('excel-export.csv', encoding='utf-8-sig') as f:
content = f.read()

Or strip manually

tail -c +4 input.csv > output.csv # Linux/Mac

Quick fix in browser: Use the Text Encoding Converter to detect and convert CSV files without any software install. All processing is local.

Delimiters and Quoting

European semicolon CSVs

Excel in European locales uses semicolons by default because commas are decimal separators (e.g., 1.234,56 in German). If you open a German-exported CSV expecting commas, every row looks like one long field.

# Check what delimiter the file actually uses:
head -1 data.csv | cat -A

If you see ; instead of , between fields, it's semicolon-delimited
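The stdlib's csv.Sniffer can also guess the delimiter from a sample — a sketch (the sample string is illustrative; sniffing can fail on ambiguous input, so keep the head/cat check as a cross-check):

```python
import csv
import io

# Sample rows from a semicolon-delimited export (illustrative data)
sample = 'Name;City;Amount\nMüller;Berlin;42\nSchmidt;Köln;7\n'

# Restrict candidates to the delimiters you consider plausible
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
print(dialect.delimiter)  # → ;

reader = csv.reader(io.StringIO(sample), dialect)
header = next(reader)
print(header)  # → ['Name', 'City', 'Amount']
```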

Convert with Python (handles quoted fields correctly)

import csv

with open('semi.csv', newline='') as f_in, open('comma.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out)
    writer.writerows(reader)

Never do this (breaks quoted fields containing semicolons):

sed 's/;/,/g' data.csv # WRONG for complex data

Quoting problems

RFC 4180 requires double-quoting fields that contain the delimiter, newlines, or quote characters. Quotes within a quoted field get doubled:

# Valid RFC 4180 quoting
"Johnson, Inc.",12345,Active               # comma in field
"He said ""Hello""",2024-01-15,Note        # quote in field
"Line 1
Line 2",Value,Data                         # newline in field

Common broken quoting (backslash escape instead of doubling)

"He said \"Hello\"",2024-01-15,Note   # NOT valid RFC 4180; most parsers reject or misread this

If you're getting parse errors on rows with quotes, check whether the source system used backslash escaping instead of doubling. Python's csv module handles RFC 4180 by default; for backslash escaping, set escapechar='\\' and quoting=csv.QUOTE_NONE.
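A sketch of reading such a backslash-escaped row with the csv module (note the doubled backslashes in the Python string literal; the raw file contains single ones):

```python
import csv
import io

# A row using backslash escaping instead of RFC 4180 quote doubling
data = 'He said \\"Hello\\",2024-01-15,Note\n'

reader = csv.reader(io.StringIO(data), escapechar='\\', quoting=csv.QUOTE_NONE)
row = next(reader)
print(row)  # → ['He said "Hello"', '2024-01-15', 'Note']
```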

Pipe-delimited and fixed-width

Database exports often use pipe (|) as a delimiter to avoid quoting issues with data containing commas. Fixed-width formats (columns at fixed character positions) require knowing the column layout from documentation or schema files. Convert fixed-width to CSV with:

pip install csvkit
in2csv --format fixed --schema schema.csv data.txt > output.csv
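If you'd rather stay in Python, pandas reads fixed-width directly — a sketch with assumed column positions (the colspecs and names below are placeholders; take the real offsets from your layout documentation):

```python
import pandas as pd
import io

# Illustrative fixed-width data: 10-char name, 10-char surname, 10-char date
fixed = (
    "John      Smith     2024-01-15\n"
    "Alice     Johnson   2024-02-20\n"
)

# colspecs are (start, end) character offsets for each column
df = pd.read_fwf(io.StringIO(fixed),
                 colspecs=[(0, 10), (10, 20), (20, 30)],
                 names=['first', 'last', 'date'])
df.to_csv('output.csv', index=False)
```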

The Data Converter tools handle TSV, pipe-delimited, and fixed-width to CSV conversions in the browser.

Excel Gotchas

The double-click trap

Never open a CSV by double-clicking it in Windows. Excel guesses the encoding and delimiter, gets it wrong half the time, and silently corrupts data. Always use Data > Get External Data > From Text and specify encoding and delimiter explicitly.

Scientific notation on long numbers

Product codes, phone numbers, order IDs, and similar long numeric strings get auto-converted to scientific notation. 1234567890123 becomes 1.23457E+12. Once Excel does this, the precision is gone — the trailing digits are lost forever.

Prevention: Import the column as Text in the import wizard, or prefix values with an apostrophe ('1234567890123) to force text treatment.
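The same protection in Python is forcing string dtype for ID-like columns so pandas never parses them as (lossy) floats — a sketch, with the column name as an assumption:

```python
import pandas as pd
import io

csv_data = "order_id,amount\n1234567890123,19.99\n"

# dtype=str keeps the full 13 digits instead of a float approximation
df = pd.read_csv(io.StringIO(csv_data), dtype={'order_id': str})
print(df['order_id'][0])  # → 1234567890123
```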

Date format interpretation

Excel interprets 01/02/2024 as January 2 in US locale and February 1 in European locale. Worse, it stores dates as internal serial numbers, so if you re-export to CSV, you may get 45293 instead of a readable date.

Always store dates in ISO 8601 format (2024-01-15). Excel won't auto-convert these to its date type, but every other system will parse them correctly.
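Converting a US-format column to ISO 8601 with pandas — a sketch that assumes the input really is month-first; verify that assumption on a sample before trusting it:

```python
import pandas as pd

dates = pd.Series(['01/02/2024', '12/31/2023'])

# Explicit format avoids silent day/month swaps from guessing
iso = pd.to_datetime(dates, format='%m/%d/%Y').dt.strftime('%Y-%m-%d')
print(list(iso))  # → ['2024-01-02', '2023-12-31']
```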

Trailing commas

Excel sometimes exports trailing commas on every row if the selection was wider than the data. This creates phantom empty columns that confuse parsers. Check with:

awk -F',' '{print NF}' data.csv | sort | uniq -c

If rows have different field counts, you have a structure problem
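If the check confirms phantom trailing columns, one fix is to truncate every row to the header's real width — a sketch using in-memory data, assuming the phantom columns appear as empty trailing header names:

```python
import csv
import io

# Excel-style export with a phantom trailing column
raw = io.StringIO('Name,Email,\nJohn,john@example.com,\nJane,jane@example.com,\n')
out = io.StringIO()

reader = csv.reader(raw)
writer = csv.writer(out, lineterminator='\n')

header = next(reader)
# Phantom columns show up as empty trailing header names
while header and header[-1] == '':
    header.pop()
writer.writerow(header)
for row in reader:
    writer.writerow(row[:len(header)])

print(out.getvalue())
```

This keeps legitimate empty cells in the middle of a row; it only drops columns past the header's true width.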

Cleaning Values

Whitespace

Leading and trailing spaces break string matching. "john@example.com" and " john@example.com" are different strings to every database and API, but look identical in a spreadsheet. Also watch for non-breaking spaces (U+00A0) copied from web pages — these don't trim with a regular space strip.

# Python: trim and normalize whitespace
import re
def clean(val):
# Strip normal and non-breaking spaces, collapse multiple spaces
return re.sub(r'\s+', ' ', val.replace('\u00a0', ' ')).strip()

Null representations

A single column might contain five different ways to say "no value":

Name,Email,Phone
John,john@example.com,
Jane,jane@example.com,NULL
Bob,bob@example.com,N/A
Alice,alice@example.com,none
Carol,carol@example.com,—

Standardize before import. In Python: nulls = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', '—', ''}, then replace any match with actual None or an empty string depending on target system requirements.
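A minimal sketch of that normalization applied to a parsed row (the replacement value is a choice that depends on your target system):

```python
NULLS = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', '—', ''}

def normalize_nulls(row, replacement=''):
    """Replace any recognized null spelling with a single canonical value."""
    return [replacement if field.strip() in NULLS else field for field in row]

print(normalize_nulls(['Jane', 'jane@example.com', 'N/A']))
# → ['Jane', 'jane@example.com', '']
```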

Phone numbers and IDs

Phone number formatting is inconsistent across sources: 555-1234, (555) 1234, 5551234, +1-555-1234. Strip to digits only if you just need to match records; normalize to a standard format if you need to display them:

import re
def normalize_phone(s):
digits = re.sub(r'\D', '', s)
# Remove leading 1 for US numbers
if len(digits) == 11 and digits[0] == '1':
digits = digits[1:]
return digits  # returns '5551234' regardless of input format

Deduplication

Duplicates come in three flavors, each needing a different strategy:

Type        Example                           Strategy
Exact       Identical rows                    Hash or sort + unique
Key-based   Same email, different name/date   Sort by date, keep first or last
Fuzzy       "john doe" vs "John Doe"          Normalize first, then key-based

import csv

Exact deduplication using set of tuples

seen = set()
with open('input.csv') as f_in, open('deduped.csv', 'w') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)
    writer.writerow(header)
    for row in reader:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

Key-based: deduplicate by email, keep most recent

import pandas as pd

df = pd.read_csv('input.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=False)
df = df.drop_duplicates(subset=['email'], keep='first')
df.to_csv('deduped.csv', index=False)

Regex Fixes

Regex is the fastest way to fix systematic problems across an entire file. These patterns handle the most common cases:

# Strip currency symbols and commas from numeric fields

"$1,234.56" → "1234.56"

re.sub(r'[$,€£¥]', '', value)

Normalize dates: "03/15/2024" or "3-15-2024" → "2024-03-15"

m = re.match(r'(\d{1,2})[/-](\d{1,2})[/-](\d{4})', value)
if m:
    iso = f"{m.group(3)}-{m.group(1):0>2}-{m.group(2):0>2}"

Extract email from messy field: "John Doe john@example.com"

m = re.search(r'[\w.+-]+@[\w.-]+\.\w+', value)
email = m.group() if m else ''

Normalize header names to snake_case

import re

def to_snake(header):
    h = header.strip().lower()
    h = re.sub(r'[^a-z0-9]+', '_', h)
    return h.strip('_')

"E-mail Address" → "e_mail_address"

"Phone Number (Primary)" → "phone_number_primary"

Avoid regex on HTML or quoted CSV fields. Use a real CSV parser to split fields first, then apply regex to individual field values. Never use regex to split CSV rows — quoted fields containing commas will break.
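Combining the two safely — the csv module splits the rows, and the regex only ever touches an individual field value:

```python
import csv
import io
import re

# Quoted company names contain commas; currency values need cleanup
raw = io.StringIO('"Johnson, Inc.","$1,234.56"\n"Acme","$99.00"\n')

cleaned = []
for row in csv.reader(raw):
    name, amount = row
    # Regex runs on one field, never on the raw line
    cleaned.append([name, re.sub(r'[$,]', '', amount)])

print(cleaned)  # → [['Johnson, Inc.', '1234.56'], ['Acme', '99.00']]
```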

Large Files

Browser-based editors start struggling above 100MB. Excel tops out around 1 million rows. For anything larger, use streaming tools.

csvkit — the Swiss Army knife

pip install csvkit

Preview column names and types

csvstat data.csv

Reorder/select columns (fast, streaming)

csvcut -c FirstName,LastName,Email data.csv > small.csv

Filter rows

csvgrep -c status -m active data.csv > active.csv

Count rows without loading entire file

wc -l data.csv

awk for multi-gigabyte files

# Print columns 1, 3, 5 (awk is ~10x faster than Python for simple ops)
awk -F',' 'BEGIN{OFS=","} {print $1,$3,$5}' data.csv > out.csv

Filter rows where column 4 equals "active"

awk -F',' '$4 == "active"' data.csv > filtered.csv

Note: awk doesn't handle quoted fields with embedded commas. For those, use the Python csv module or csvkit instead.

Chunked processing in Python

# Process in chunks without loading entire file into memory
import pandas as pd

chunks = pd.read_csv('huge.csv', chunksize=100000)
with open('output.csv', 'w') as f_out:
    for i, chunk in enumerate(chunks):
        # Apply cleanup to each chunk
        chunk['email'] = chunk['email'].str.strip().str.lower()
        chunk.drop_duplicates(subset='email', inplace=True)
        chunk.to_csv(f_out, header=(i == 0), index=False)

Splitting large files

# Split the data (minus the header) into 100k-row chunks
tail -n +2 data.csv | split -l 100000 - chunk_

Add header back to each chunk:

head -1 data.csv > header.csv
for f in chunk_*; do
    cat header.csv "$f" > "${f}.csv"
done

SQL for CSV Cleanup

For complex transformations — joins, aggregations, conditional logic — loading CSV into SQLite is often faster to write and more reliable than Python scripts.

# Load CSV into SQLite (sqlite3 ships with Python; pandas makes loading one-liner easy)
import sqlite3
import pandas as pd

df = pd.read_csv('data.csv')
conn = sqlite3.connect(':memory:')
df.to_sql('contacts', conn, index=False)

Now use SQL for cleanup

cleaned = pd.read_sql("""
    SELECT TRIM(LOWER(email)) AS email,
           TRIM(first_name) AS first_name,
           TRIM(last_name) AS last_name,
           phone,
           MAX(created_at) AS created_at  -- SQLite: bare columns come from this row
    FROM contacts
    WHERE email IS NOT NULL
      AND email LIKE '%@%.%'
    GROUP BY LOWER(TRIM(email))           -- deduplicate by email, keep most recent
""", conn)

cleaned.to_csv('output.csv', index=False)

-- Useful SQL patterns for data cleanup:

-- Standardize mixed-case emails
UPDATE contacts SET email = LOWER(TRIM(email));

-- Mark rows with invalid phone numbers (REGEXP needs an extension in stock SQLite)
SELECT * FROM contacts WHERE phone NOT REGEXP '^\+?[0-9\s()-]{7,}$';

-- Find duplicate emails
SELECT email, COUNT(*) AS cnt
FROM contacts
GROUP BY email
HAVING cnt > 1
ORDER BY cnt DESC;

You can also run SQL directly against CSV files using DuckDB syntax:

# DuckDB: SQL on CSV without loading to database
import duckdb
result = duckdb.query("""
SELECT email, count(*) as cnt
FROM 'data.csv'
GROUP BY email HAVING cnt > 1
""").df()

Validation Before Import

Running a quick validation before importing to a database or CRM catches problems before they become harder-to-fix bad data in production.

Structural checks

import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    expected_cols = len(header)

    errors = []
    for i, row in enumerate(reader, start=2):
        if len(row) != expected_cols:
            errors.append(f"Row {i}: expected {expected_cols} cols, got {len(row)}")

if errors:
    for e in errors[:20]:  # show first 20
        print(e)
else:
    print("Structure OK")

Data quality checks

Check                         Why it matters                                     Quick test
No duplicate primary keys     Database rejects or silently overwrites            df['id'].duplicated().sum()
Required fields populated     Import fails or creates orphan records             df['email'].isna().sum()
Dates parse correctly         Wrong format causes rejection                      pd.to_datetime(df['date'], errors='coerce').isna().sum()
Numeric columns are numeric   String "N/A" in a number column fails type check   pd.to_numeric(df['price'], errors='coerce').isna().sum()
Foreign keys exist            Referential integrity violation                    Merge against reference table and find unmatched rows
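A sketch running the first four checks together with pandas (the column names and sample data are assumptions; each count flags the deliberately broken second row):

```python
import pandas as pd
import io

csv_data = "id,email,date,price\n1,a@x.com,2024-01-15,9.99\n1,,13/45/2024,abc\n"
df = pd.read_csv(io.StringIO(csv_data))

dup_keys  = df['id'].duplicated().sum()                               # duplicate primary keys
no_email  = df['email'].isna().sum()                                  # missing required field
bad_dates = pd.to_datetime(df['date'], errors='coerce').isna().sum()  # unparseable dates
bad_nums  = pd.to_numeric(df['price'], errors='coerce').isna().sum()  # non-numeric prices

print(dup_keys, no_email, bad_dates, bad_nums)  # each check flags the second row
```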

Test with a sample first

Never import a 500,000-row file without first importing 50 rows. Most database systems roll back on error, but some don't. Create a sample with head -51 data.csv > sample.csv (header + 50 rows), import it, fix any errors you find, then run the full import.

Automation Scripts

If you run the same cleanup regularly — weekly CRM exports, monthly data feeds — a script saves time and eliminates manual errors.

Python cleanup pipeline

import pandas as pd
import re

def clean_csv(input_path, output_path):
    df = pd.read_csv(input_path, encoding='utf-8-sig')  # handles BOM
    rows_in = len(df)

    # Normalize headers
    df.columns = [re.sub(r'[^a-z0-9]+', '_', c.strip().lower()).strip('_')
                  for c in df.columns]

    # Trim all string columns
    str_cols = df.select_dtypes(include='object').columns
    df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())

    # Standardize nulls
    nulls = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', ''}
    df.replace(list(nulls), pd.NA, inplace=True)

    # Normalize emails
    if 'email' in df.columns:
        df['email'] = df['email'].str.lower()

    # Drop exact duplicates
    df.drop_duplicates(inplace=True)

    # Report row count change
    print(f"Input: {rows_in} rows → Output: {len(df)} rows")

    df.to_csv(output_path, index=False, encoding='utf-8')

clean_csv('raw_export.csv', 'clean_export.csv')

clean_csv('raw_export.csv', 'clean_export.csv')

Makefile for pipeline automation

clean: raw_export.csv
	python clean_csv.py raw_export.csv clean.csv

import: clean
	mysql -u user -p database < load.sql

.PHONY: clean import

Tools

CSV to JSON / JSON to CSV

Convert between formats for API workflows and data pipelines.

Data Converter Tools
Text Encoding Converter

Fix Windows-1252, Shift-JIS, GBK and other encoding issues in your CSV files.

Encoding Tools
CSV to Excel / Excel to CSV

Convert between CSV and XLSX formats preserving column types.

Spreadsheet Tools
CSV to SQL / SQLite

Load CSV into a queryable SQLite database for cleanup and analysis.

Spreadsheet Tools

Command-line libraries

  • csvkit — pip install csvkit — csvcut, csvgrep, csvstat, csvsql
  • pandaspip install pandas — full data manipulation, best for complex transforms
  • DuckDBpip install duckdb — SQL directly on CSV files, extremely fast
  • Miller (mlr) — standalone binary, excellent streaming CSV processor
