CSV Data Cleanup: Practical Solutions for Real Problems
Most CSV problems are predictable. Encoding corruption, wrong delimiters, misquoted fields, duplicates, date format chaos — this guide covers each one with concrete fixes, including command-line recipes, regex patterns, and SQL approaches for large files.
Encoding Problems
Garbled characters are almost always an encoding mismatch: the file was written in one encoding and you're reading it as another.
What it looks like
JosÃ© GarcÃ­a → José García (UTF-8 read as Windows-1252)
FranÃ§ois → François (same problem)
â€œsmart quotesâ€ → "smart quotes" (same problem; the closing-quote byte has no Windows-1252 mapping, so it often drops)
The pattern Ã© is the two-byte UTF-8 sequence for é (bytes C3 A9) being read as Windows-1252 characters — each byte gets decoded separately. Once you recognize this pattern, you know the file is UTF-8 being misread as something else.
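If the garbled text is all you have, you can often reverse the damage by re-encoding it the same wrong way it was decoded. A minimal sketch, assuming UTF-8 misread as Windows-1252:

# Reverse mojibake: encode the misread text back to its original bytes,
# then decode those bytes correctly
garbled = 'JosÃ© GarcÃ­a'
fixed = garbled.encode('windows-1252').decode('utf-8')
print(fixed)  # José García

The ftfy library (pip install ftfy) automates this kind of repair when you don't know which encoding pair caused the damage.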
Diagnosing encoding
# Linux/Mac: detect encoding
file -i data.csv
# → text/plain; charset=utf-8

# Python: statistical detection
import chardet
with open('data.csv', 'rb') as f:
    result = chardet.detect(f.read(100000))
print(result)
# → {'encoding': 'ISO-8859-1', 'confidence': 0.73}
chardet works by statistics — it's right most of the time but not always, especially for short files. If it returns low confidence, try the encoding it suggests and look at accented characters to confirm visually.
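When confidence is low, a brute-force check is a reasonable fallback: decode a sample with each plausible encoding and see which ones fail. A sketch; the candidate list is an assumption, so order it by what your sources usually produce:

# Try candidate encodings against a sample of the file
candidates = ['utf-8', 'windows-1252', 'shift-jis']
with open('data.csv', 'rb') as f:
    raw = f.read(100000)
for enc in candidates:
    try:
        raw.decode(enc)
        print(f'{enc}: decodes cleanly')
    except UnicodeDecodeError as e:
        print(f'{enc}: fails at byte offset {e.start}')

Keep in mind that single-byte encodings like Windows-1252 accept almost any byte stream, so a clean decode is necessary but not sufficient; confirm visually on accented characters.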
Converting to UTF-8
# iconv (Linux/Mac)
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
iconv -f SHIFT-JIS -t UTF-8 japanese.csv > japanese_utf8.csv
# Python
with open('input.csv', encoding='windows-1252') as f_in:
    data = f_in.read()
with open('output.csv', 'w', encoding='utf-8') as f_out:
    f_out.write(data)
BOM issues
Excel adds a UTF-8 BOM (bytes EF BB BF) to CSV files it exports. Most parsers handle this silently, but some don't — they include the BOM as part of the first column header, so instead of Name you get ï»¿Name. Strip it with:
# Python: read with utf-8-sig to auto-strip BOM
with open('excel-export.csv', encoding='utf-8-sig') as f:
    content = f.read()

# Or strip manually (the BOM is the first 3 bytes)
tail -c +4 input.csv > output.csv  # Linux/Mac
Delimiters and Quoting
European semicolon CSVs
Excel in European locales uses semicolons by default because commas are decimal separators (e.g., 1.234,56 in German). If you open a German-exported CSV expecting commas, every row looks like one long field.
# Check what delimiter the file actually uses:
head -1 data.csv | cat -A
# If you see ; instead of , between fields, it's semicolon-delimited
# Convert with Python (handles quoted fields correctly)
import csv
with open('semi.csv', newline='') as f_in, open('comma.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out)
    writer.writerows(reader)
Never do this (breaks quoted fields containing semicolons):
sed 's/;/,/g' data.csv # WRONG for complex data
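When files arrive from many sources, csv.Sniffer can guess the delimiter for you. A sketch; Sniffer can misfire on small or single-column samples, so treat the result as a hint:

# Auto-detect the delimiter from a sample of the file
import csv
with open('data.csv', newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(4096), delimiters=',;\t|')
print(f'Detected delimiter: {dialect.delimiter!r}')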
Quoting problems
RFC 4180 requires double-quoting fields that contain the delimiter, newlines, or quote characters. Quotes within a quoted field get doubled:
# Valid RFC 4180 quoting
"Johnson, Inc.",12345,Active # comma in field
"He said ""Hello""",2024-01-15,Note # quote in field
"Line 1
Line 2",Value,Data # newline in fieldCommon broken quoting (backslash escape instead of doubling)
"He said "Hello"",2024-01-15,Note # NOT valid RFC 4180 # most parsers reject or misread this
If you're getting parse errors on rows with quotes, check whether the source system used backslash escaping instead of doubling. Python's csv module handles RFC 4180 by default; for backslash escaping, set escapechar='\\' and quoting=csv.QUOTE_NONE.
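A sketch of reading such a file with those settings (the filename is a placeholder):

# Read a backslash-escaped CSV that doesn't follow RFC 4180
import csv
with open('legacy-export.csv', newline='') as f:
    reader = csv.reader(f, escapechar='\\', quoting=csv.QUOTE_NONE)
    rows = list(reader)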
Pipe-delimited and fixed-width
Database exports often use pipe (|) as a delimiter to avoid quoting issues with data containing commas. Fixed-width formats (columns at fixed character positions) require knowing the column layout from documentation or schema files. Convert fixed-width to CSV with:
pip install csvkit
in2csv --format fixed --schema schema.csv data.txt > output.csv
The Data Converter tools handle TSV, pipe-delimited, and fixed-width to CSV conversions in the browser.
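If you'd rather stay in Python, pandas reads fixed-width files directly. A sketch; the column offsets and names are placeholders, so take the real layout from your schema documentation:

# Convert fixed-width to CSV with pandas
import pandas as pd
df = pd.read_fwf('data.txt', colspecs=[(0, 10), (10, 40), (40, 50)],
                 names=['id', 'name', 'date'])
df.to_csv('output.csv', index=False)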
Excel Gotchas
The double-click trap
Never open a CSV by double-clicking it in Windows. Excel guesses the encoding and delimiter, gets it wrong half the time, and silently corrupts data. Always use Data > Get External Data > From Text and specify encoding and delimiter explicitly.
Scientific notation on long numbers
Product codes, phone numbers, order IDs, and similar long numeric strings get auto-converted to scientific notation. 1234567890123 becomes 1.23457E+12. Once Excel does this, the precision is gone — the trailing digits are lost forever.
Prevention: Import the column as Text in the import wizard, or prefix values with an apostrophe ('1234567890123) to force text treatment.
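The same trap exists in pandas: long IDs read as numbers lose leading zeros, and any blank in the column forces a float dtype that can lose trailing digits. A sketch of forcing text, assuming a column named order_id:

# Force ID-like columns to strings so no digits are lost
import pandas as pd
df = pd.read_csv('orders.csv', dtype={'order_id': str})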
Date format interpretation
Excel interprets 01/02/2024 as January 2 in US locale and February 1 in European locale. Worse, it stores dates as internal serial numbers, so if you re-export to CSV, you may get 45293 instead of a readable date.
Always store dates in ISO 8601 format (2024-01-15). The format is unambiguous in every locale, and every system will parse it correctly.
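A sketch of normalizing a month-first column to ISO 8601 with pandas (assumes the input really is US-style mm/dd/yyyy; verify a few rows before trusting the output):

# Normalize US-style dates to ISO 8601
import pandas as pd
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y').dt.strftime('%Y-%m-%d')
df.to_csv('data_iso.csv', index=False)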
Trailing commas
Excel sometimes exports trailing commas on every row if the selection was wider than the data. This creates phantom empty columns that confuse parsers. Check with:
awk -F',' '{print NF}' data.csv | sort | uniq -c
# If rows have different field counts, you have a structure problem
# (note: quoted fields containing commas also inflate NF)
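Once confirmed, phantom all-empty columns are easy to drop. A sketch using pandas; dropna removes any column that is entirely blank, so check that the extra columns really contain nothing first:

# Drop columns with no data at all (e.g. from trailing commas)
import pandas as pd
df = pd.read_csv('data.csv')
df = df.dropna(axis=1, how='all')
df.to_csv('fixed.csv', index=False)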
Cleaning Values
Whitespace
Leading and trailing spaces break string matching. "john@example.com" and " john@example.com" are different strings to every database and API, but look identical in a spreadsheet. Also watch for non-breaking spaces (U+00A0) copied from web pages — these don't trim with a regular space strip.
# Python: trim and normalize whitespace
import re
def clean(val):
    # Strip normal and non-breaking spaces, collapse multiple spaces
    return re.sub(r'\s+', ' ', val.replace('\u00a0', ' ')).strip()

Null representations
A single column might contain five different ways to say "no value":
Name,Email,Phone
John,john@example.com,
Jane,jane@example.com,NULL
Bob,bob@example.com,N/A
Alice,alice@example.com,none
Carol,carol@example.com,—
Standardize before import. In Python: nulls = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', '—', ''}, then replace any match with actual None or an empty string depending on target system requirements.
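A minimal sketch of that replacement for csv-module pipelines:

# Standardize common null spellings to empty strings
NULLS = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', '—', ''}

def standardize_null(val):
    return '' if val.strip() in NULLS else val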
Phone numbers and IDs
Phone number formatting is inconsistent across sources: 555-1234, (555) 1234, 5551234, +1-555-1234. Strip to digits only if you just need to match records; normalize to a standard format if you need to display them:
import re
def normalize_phone(s):
    digits = re.sub(r'\D', '', s)
    # Remove leading 1 for US numbers
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]
    return digits  # returns '5551234' regardless of input format

Deduplication
Duplicates come in three flavors, each needing a different strategy:
| Type | Example | Strategy |
|---|---|---|
| Exact | Identical rows | Hash or sort + unique |
| Key-based | Same email, different name/date | Sort by date, keep first or last |
| Fuzzy | "john doe" vs "John Doe" | Normalize first, then key-based |
# Exact deduplication using a set of row tuples
import csv

seen = set()
with open('input.csv', newline='') as f_in, open('deduped.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)
    writer.writerow(header)
    for row in reader:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)
# Key-based: deduplicate by email, keep most recent
import pandas as pd

df = pd.read_csv('input.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=False)
df = df.drop_duplicates(subset=['email'], keep='first')
df.to_csv('deduped.csv', index=False)
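For the fuzzy case from the table above, normalize before keying. A minimal sketch that treats names as equal regardless of case and surrounding whitespace:

# Fuzzy: build a normalized key, deduplicate on it, then discard it
import pandas as pd
df = pd.read_csv('input.csv')
df['name_key'] = df['name'].str.strip().str.lower()
df = df.drop_duplicates(subset=['name_key']).drop(columns='name_key')
df.to_csv('deduped.csv', index=False)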
Regex Fixes
Regex is the fastest way to fix systematic problems across an entire file. These patterns handle the most common cases:
# Strip currency symbols and commas from numeric fields: "$1,234.56" → "1234.56"
re.sub(r'[$,€£¥]', '', value)
# Normalize dates: "03/15/2024" or "3-15-2024" → "2024-03-15"
m = re.match(r'(\d{1,2})[/-](\d{1,2})[/-](\d{4})$', value)
if m:
    iso = f"{m.group(3)}-{m.group(1):0>2}-{m.group(2):0>2}"
# Extract email from messy field: "John Doe john@example.com"
m = re.search(r'[\w.+-]+@[\w.-]+\.\w+', value)
email = m.group() if m else ''
# Normalize header names to snake_case
import re
def to_snake(header):
    h = header.strip().lower()
    h = re.sub(r'[^a-z0-9]+', '_', h)
    return h.strip('_')
"E-mail Address" → "e_mail_address"
"Phone Number (Primary)" → "phone_number_primary"
Large Files
Browser-based editors start struggling above 100MB. Excel tops out around 1 million rows. For anything larger, use streaming tools.
csvkit — the Swiss Army knife
pip install csvkit
# Preview column names and types
csvstat data.csv

# Reorder/select columns (fast, streaming)
csvcut -c FirstName,LastName,Email data.csv > small.csv

# Filter rows
csvgrep -c status -m active data.csv > active.csv

# Count lines without loading the file into memory
# (quoted fields with embedded newlines inflate the count)
wc -l data.csv
awk for multi-gigabyte files
# Print columns 1, 3, 5 (awk is ~10x faster than Python for simple ops)
awk -F',' 'BEGIN{OFS=","} {print $1,$3,$5}' data.csv > out.csv
# Filter rows where column 4 equals "active"
awk -F',' '$4 == "active"' data.csv > filtered.csv
Note: awk doesn't handle quoted fields with embedded commas.
For those, use Python csv module or csvkit instead.
Chunked processing in Python
# Process in chunks without loading entire file into memory
import pandas as pd
chunks = pd.read_csv('huge.csv', chunksize=100000)
with open('output.csv', 'w') as f_out:
    for i, chunk in enumerate(chunks):
        # Apply cleanup to each chunk
        chunk['email'] = chunk['email'].str.strip().str.lower()
        # Removes duplicates within a chunk only; duplicates that
        # span two chunks survive
        chunk.drop_duplicates(subset='email', inplace=True)
        chunk.to_csv(f_out, header=(i == 0), index=False)
Splitting large files
# Split into 100k-row chunks (strip the header first, re-add it below)
tail -n +2 data.csv | split -l 100000 - chunk_
# Add the header back to each chunk:
head -1 data.csv > header.csv
for f in chunk_*; do
    cat header.csv "$f" > "${f}.csv"
done
SQL for CSV Cleanup
For complex transformations — joins, aggregations, conditional logic — loading CSV into SQLite is often faster to write and more reliable than Python scripts.
# Load CSV into SQLite (sqlite3 ships with Python; pandas handles the load)
import sqlite3
import pandas as pd

df = pd.read_csv('data.csv')
conn = sqlite3.connect(':memory:')
df.to_sql('contacts', conn, index=False)
# Now use SQL for cleanup
cleaned = pd.read_sql("""
    SELECT TRIM(LOWER(email)) AS email,
           TRIM(first_name)   AS first_name,
           TRIM(last_name)    AS last_name,
           phone,
           MAX(created_at)    AS created_at
    FROM contacts
    WHERE email IS NOT NULL
      AND email LIKE '%@%.%'
    GROUP BY LOWER(TRIM(email))  -- deduplicate by email; SQLite takes the
                                 -- other columns from the MAX(created_at) row
""", conn)
cleaned.to_csv('output.csv', index=False)
-- Useful SQL patterns for data cleanup:

-- Standardize mixed-case emails
UPDATE contacts SET email = LOWER(TRIM(email));
-- Mark rows with invalid phone numbers
-- (REGEXP requires an extension in SQLite; it's built into MySQL)
SELECT * FROM contacts WHERE phone NOT REGEXP '^\+?[0-9() -]{7,}$';
-- Find duplicate emails
SELECT email, COUNT(*) AS cnt
FROM contacts
GROUP BY email
HAVING cnt > 1
ORDER BY cnt DESC;
You can also run SQL directly against CSV files using DuckDB syntax:
# DuckDB: SQL on CSV without loading to database
import duckdb
result = duckdb.query("""
SELECT email, count(*) as cnt
FROM 'data.csv'
GROUP BY email HAVING cnt > 1
""").df()Running a quick validation before importing to a database or CRM catches problems before they become harder-to-fix bad data in production.Validation Before Import
Structural checks
import csv
with open('data.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    expected_cols = len(header)
    errors = []
    for i, row in enumerate(reader, start=2):
        if len(row) != expected_cols:
            errors.append(f"Row {i}: expected {expected_cols} cols, got {len(row)}")

if errors:
    for e in errors[:20]:  # show first 20
        print(e)
else:
    print("Structure OK")
Data quality checks
| Check | Why it matters | Quick test |
|---|---|---|
| No duplicate primary keys | Database rejects or silently overwrites | df['id'].duplicated().sum() |
| Required fields populated | Import fails or creates orphan records | df['email'].isna().sum() |
| Dates parse correctly | Wrong format causes rejection | pd.to_datetime(df['date'], errors='coerce').isna().sum() |
| Numeric columns are numeric | String "N/A" in a number column fails type check | pd.to_numeric(df['price'], errors='coerce').isna().sum() |
| Foreign keys exist | Referential integrity violation | Merge against reference table and find unmatched rows |
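A sketch that runs the table's checks in one pass (the column names are examples; substitute your own):

# Pre-import quality report
import pandas as pd
df = pd.read_csv('data.csv')
report = {
    'duplicate ids': df['id'].duplicated().sum(),
    'missing emails': df['email'].isna().sum(),
    'unparseable dates': pd.to_datetime(df['date'], errors='coerce').isna().sum(),
    'non-numeric prices': pd.to_numeric(df['price'], errors='coerce').isna().sum(),
}
for check, count in report.items():
    print(f'{check}: {count}')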
Test with a sample first
Never import a 500,000-row file without first importing 50 rows. Most database systems roll back on error, but some don't. Create a sample with head -51 data.csv > sample.csv (header + 50 rows), import it, fix any errors you find, then run the full import.
Automation Scripts
If you run the same cleanup regularly — weekly CRM exports, monthly data feeds — a script saves time and eliminates manual errors.
Python cleanup pipeline
<code">import pandas as pd import redef clean_csv(input_path, output_path): df = pd.read_csv(input_path, encoding='utf-8-sig') # handles BOM
# Normalize headers df.columns = [re.sub(r'[^a-z0-9]+', '', c.strip().lower()).strip('') for c in df.columns]
# Trim all string columns str_cols = df.select_dtypes(include='object').columns df[str_cols] = df[str_cols].apply(lambda c: c.str.strip())
# Standardize nulls nulls = {'NULL', 'null', 'N/A', 'n/a', 'none', 'None', '-', ''} df.replace(list(nulls), pd.NA, inplace=True)
# Normalize emails if 'email' in df.columns: df['email'] = df['email'].str.lower()
# Drop exact duplicates df.drop_duplicates(inplace=True)
# Validate row count change print(f"Input: {len(df) + df.duplicated().sum()} rows → Output: {len(df)} rows")
df.to_csv(output_path, index=False, encoding='utf-8')
clean_csv('raw_export.csv', 'clean_export.csv')
Makefile for pipeline automation
<code">clean: raw_export.csv python clean_csv.py raw_export.csv clean.csvimport: clean mysql -u user -p database < load.sql
.PHONY: clean import
Tools
CSV to JSON / JSON to CSV — convert between formats for API workflows and data pipelines. (Data Converter Tools)
Text Encoding Converter — fix Windows-1252, Shift-JIS, GBK and other encoding issues in your CSV files. (Encoding Tools)
CSV to Excel / Excel to CSV — convert between CSV and XLSX formats preserving column types. (Spreadsheet Tools)
CSV to SQL / SQLite — load CSV into a queryable SQLite database for cleanup and analysis. (Spreadsheet Tools)

Command-line libraries
- csvkit — pip install csvkit — csvcut, csvgrep, csvstat, csvsql
- pandas — pip install pandas — full data manipulation, best for complex transforms
- DuckDB — pip install duckdb — SQL directly on CSV files, extremely fast
- Miller (mlr) — standalone binary, excellent streaming CSV processor