Text Encoding: UTF-8, Unicode, Mojibake, and How to Fix It
Character encoding is invisible until it breaks. This guide covers how it works, where it goes wrong, and how to fix it — from ASCII and Latin-1 to Shift-JIS and GB2312, with concrete examples for databases, HTTP headers, and code.
Computers store everything as numbers. Text encoding is the agreement that maps characters to numbers and back. When the sender and receiver use the same mapping, text is legible. When they don't, you get garbage.
How Encoding Works
Three concepts are often confused:
- Character set — the collection of characters (e.g., Latin alphabet, Kanji)
- Code point — the number assigned to each character (e.g., A = 65)
- Encoding — how those numbers are stored as bytes (e.g., UTF-8 stores code points using 1-4 bytes)
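The distinction between code point and encoding is easy to see from Python, where `ord()` gives the code point and `.encode()` applies an encoding (a minimal sketch; the characters are arbitrary examples):

```python
# Code point: the abstract number assigned to a character
assert ord('A') == 65
assert ord('é') == 0xE9

# Encoding: how that number is serialized to bytes
assert 'é'.encode('utf-8') == b'\xc3\xa9'   # two bytes in UTF-8
assert 'é'.encode('latin-1') == b'\xe9'     # one byte in Latin-1
```

Same code point, different byte representations: that gap is where every encoding bug lives.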
ASCII (1963) is the bedrock. It maps 128 characters to numbers 0–127 using 7 bits. That covers uppercase and lowercase English letters, digits, punctuation, and 33 control characters (newline, tab, etc.).

```
65 = A    97 = a    48 = 0
66 = B    98 = b    32 = space
67 = C    99 = c    10 = newline (LF)
                    13 = carriage return (CR)
```

ASCII has no accents, no non-Latin scripts, and no symbols beyond basic punctuation. Nearly every byte-oriented encoding designed after it keeps it as a subset — the 0–127 range maps identically in ASCII, Latin-1, Windows-1252, UTF-8, and most others. That's why English text often opens without problems even in the wrong encoding.
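That shared 0–127 range is easy to verify in Python; pure-ASCII bytes decode identically under all of these encodings:

```python
data = b'Hello, World! 123'

# The 0-127 range maps identically in ASCII and its byte-oriented descendants
decoded = {data.decode(enc) for enc in ('ascii', 'latin-1', 'cp1252', 'utf-8')}
assert decoded == {'Hello, World! 123'}
```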
Unicode and UTF-8
Unicode
Unicode (1991) solved the pre-internet chaos of incompatible encodings by defining a single universal character set. Every character in every writing system gets a unique number called a code point, written as U+XXXX.
```
U+0041  = A  (Latin)
U+00E9  = é  (Latin with accent)
U+03B1  = α  (Greek alpha)
U+4E2D  = 中 (Chinese "middle")
U+1F600 = 😀 (emoji)
U+0410  = А  (Cyrillic capital A)
```

Unicode 15.0 defines 149,186 characters across 161 scripts. The standard is maintained by the Unicode Consortium and updated annually. Code points go up to U+10FFFF — that's over 1 million possible values, though most are unassigned.
UTF-8: the dominant encoding
UTF-8 is how most systems store Unicode. It uses 1 to 4 bytes per character depending on the code point:
| Bytes | Code point range | Bit pattern | Examples |
|---|---|---|---|
| 1 | U+0000–U+007F | 0xxxxxxx | ASCII (A–Z, 0–9) |
| 2 | U+0080–U+07FF | 110xxxxx 10xxxxxx | é, ñ, ü, Greek, Cyrillic |
| 3 | U+0800–U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Chinese, Japanese, Korean |
| 4 | U+10000–U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Emoji, ancient scripts |
```
A  = U+0041  → 0x41 (1 byte)
é  = U+00E9  → 0xC3 0xA9 (2 bytes)
中 = U+4E2D  → 0xE4 0xB8 0xAD (3 bytes)
😀 = U+1F600 → 0xF0 0x9F 0x98 0x80 (4 bytes)
```

UTF-8 won because it's backward-compatible with ASCII (plain English text is byte-identical), self-synchronizing (you can find character boundaries after a corrupt byte), and space-efficient for Western languages. 98%+ of websites use UTF-8 today.
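The byte counts in the table can be checked directly in Python, and so can the self-synchronization property: every continuation byte starts with the bits `10`:

```python
# 1 to 4 bytes per character, depending on the code point
for ch, expected in [('A', 1), ('é', 2), ('中', 3), ('😀', 4)]:
    assert len(ch.encode('utf-8')) == expected

# Continuation bytes always match 10xxxxxx, so a decoder can
# resynchronize at the next character boundary after a corrupt byte
b = '中'.encode('utf-8')  # b'\xe4\xb8\xad'
assert all(byte >> 6 == 0b10 for byte in b[1:])
```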
UTF-16 and UTF-32
UTF-16 uses 2 bytes for most characters; code points above U+FFFF are encoded as two 2-byte units called a surrogate pair (4 bytes total). Java, C#/.NET, JavaScript, and Windows internals use UTF-16 strings natively. It's efficient for East Asian text, where most characters need 3 bytes in UTF-8 but only 2 in UTF-16.
UTF-32 always uses 4 bytes per character. Simple to index (character N is at byte 4N) but wastes space — English text takes 4x more space than ASCII. Rarely used in practice except in some internal processing contexts.
Both UTF-16 and UTF-32 have an endianness issue: a 2-byte value 0x0041 could be stored as 00 41 (big-endian) or 41 00 (little-endian). The BOM character (see below) distinguishes them. UTF-8 has no endianness problem.
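Python's codecs make the endianness difference concrete: the `utf-16-le`/`utf-16-be` codecs encode without a BOM, while the generic `utf-16` codec prepends one so readers can tell the byte orders apart:

```python
# Same code point, opposite byte orders
assert 'A'.encode('utf-16-be') == b'\x00A'
assert 'A'.encode('utf-16-le') == b'A\x00'

# The generic codec writes a BOM first (native order, so either is possible)
assert 'A'.encode('utf-16')[:2] in (b'\xff\xfe', b'\xfe\xff')

# Code points above U+FFFF need a surrogate pair: 4 bytes in UTF-16
assert len('😀'.encode('utf-16-le')) == 4
```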
Legacy Encodings
These encodings predate Unicode and are still encountered in old files, legacy databases, and regional systems.
Western European
ISO-8859-1 (Latin-1) — Covers Western European languages by adding 96 characters (128–255) to ASCII. Includes é, ü, ñ, etc. Used as the default in early HTTP and email. Doesn't include the € sign (added later in ISO-8859-15).
Windows-1252 (CP1252) — Microsoft's extension of Latin-1. Adds smart quotes, em dash, €, and other characters in the 0x80–0x9F range that Latin-1 left as control codes. Often mislabeled as ISO-8859-1 in old web pages. The difference matters when you encounter ’ (the UTF-8 bytes for ' read as Windows-1252).
| Byte | ISO-8859-1 | Windows-1252 |
|---|---|---|
| 0x80 | (C1 control) | € |
| 0x91 | (C1 control) | ' (left single quote) |
| 0x93 | (C1 control) | " (left double quote) |
| 0x96 | (C1 control) | – (en dash) |
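The divergence only matters for bytes 0x80–0x9F, which a quick Python check confirms:

```python
# Byte 0x93: a curly quote in Windows-1252, a C1 control code in Latin-1
assert bytes([0x93]).decode('cp1252') == '\u201c'  # LEFT DOUBLE QUOTATION MARK
assert bytes([0x93]).decode('latin-1') == '\x93'   # unprintable control character

# Outside 0x80-0x9F the two encodings agree
assert bytes([0xE9]).decode('cp1252') == bytes([0xE9]).decode('latin-1') == 'é'
```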
Japanese encodings
Shift-JIS — The dominant Japanese encoding for Windows systems. Covers Hiragana, Katakana, Kanji, and ASCII. Has tricky byte sequences where some bytes can look like ASCII characters, making it prone to false detections. Also called CP932 or Windows-31J in Microsoft's variant.
EUC-JP — The Unix/Linux variant for Japanese. ASCII-compatible and less ambiguous than Shift-JIS, but both are obsolete for new work — use UTF-8 instead.
Chinese encodings
GB2312 — 1980 mainland China standard, covers 7,445 Simplified Chinese characters plus ASCII. Superseded by:
GBK — Extends GB2312 to 21,886 characters including Traditional Chinese used in Hong Kong. Used on Chinese Windows systems.
GB18030 — The current official Chinese standard. A superset of GBK that also covers all Unicode code points, making it the only legacy encoding that can represent everything UTF-8 can.
Big5 — Traditional Chinese for Taiwan and Hong Kong. ~13,000 characters. Has variants (Big5-HKSCS for Hong Kong) with different character mappings.
Cyrillic encodings
Windows-1251 — Russian, Bulgarian, Serbian in Cyrillic script. The standard for Russian Windows systems.
KOI8-R — A clever design where Cyrillic letters are placed at positions matching their Latin phonetic equivalents, so stripping the high bit gives readable (if wrong) transliteration. Used on Unix systems and old Russian internet.
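The bit-stripping property can be demonstrated in a few lines of Python (the word is an arbitrary example; note that KOI8-R's layout swaps letter case, so lowercase Cyrillic strips to uppercase Latin):

```python
raw = 'привет'.encode('koi8-r')

# Clearing the high bit of each byte yields a rough Latin transliteration
translit = bytes(b & 0x7F for b in raw).decode('ascii')
print(translit)  # PRIWET
```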
Other ISO-8859 family
- ISO-8859-2 — Central European (Polish, Czech, Slovak, Hungarian, Slovenian, Croatian)
- ISO-8859-7 — Greek
- ISO-8859-8 — Hebrew
- ISO-8859-9 — Turkish
- ISO-8859-15 — Latin-1 with the € symbol, updating the gap at 0xA4
Mojibake: Garbled Text
Mojibake (文字化け, Japanese for "character transformation") is what happens when text encoded in one system gets decoded by another. The bytes are intact but the interpretation is wrong.
Common patterns and their causes
| What you see | What it should be | Cause |
|---|---|---|
| é ü ñ | é ü ñ | UTF-8 decoded as Windows-1252 or Latin-1 |
| ’ “ †| ' " — | Windows-1252 smart quotes decoded as UTF-8 |
| Ð Ñ Ð± | Cyrillic | Cyrillic UTF-8 decoded as Latin-1 |
| 譌・譛ャ | 日本語 | Japanese UTF-8 decoded as Shift-JIS |
| H\0e\0l\0l\0o\0 | Hello | UTF-16 LE decoded as UTF-8 or Latin-1 |
| ? or □ or � | Various | Byte has no mapping in target encoding |
Diagnosing from the garbled text
The é pattern is diagnostic: é in UTF-8 is bytes C3 A9. In Windows-1252, byte C3 = à and A9 = ©. So é always means UTF-8 file opened as Windows-1252. Similarly, ’ is the three UTF-8 bytes for Windows-1252 byte 0x92 (right single quote).
Root causes
- Missing or wrong encoding declaration — HTML without `<meta charset="UTF-8">`, or with the wrong charset
- Editor mismatch — file saved as Windows-1252 by Notepad, then opened as UTF-8 by another editor
- Database connection encoding mismatch — application connecting to MySQL with `latin1` while the column stores UTF-8
- FTP in ASCII mode — ASCII-mode transfers rewrite line endings and can corrupt high bytes
- Double-encoding — UTF-8 bytes decoded as Latin-1 and re-encoded as UTF-8, so each multibyte character becomes two or more garbled characters, now stored as perfectly valid UTF-8
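Double-encoding is easy to reproduce deliberately, which also shows why it is pernicious: the result is valid UTF-8, so nothing errors out (a minimal Python sketch):

```python
original = 'é'  # U+00E9, UTF-8 bytes C3 A9

# Store UTF-8 bytes, read them back as Latin-1, save again as UTF-8
double = original.encode('utf-8').decode('latin-1').encode('utf-8')

assert double == b'\xc3\x83\xc2\xa9'   # four bytes instead of two
assert double.decode('utf-8') == 'Ã©'  # decodes "cleanly", but to mojibake
```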
The BOM Problem
The Byte Order Mark (U+FEFF) is a zero-width character placed at the start of a file to signal its encoding. For UTF-16 and UTF-32, it's necessary to distinguish big-endian from little-endian byte order. For UTF-8, it's optional and usually causes problems.
| Encoding | BOM bytes (hex) | Needed? |
|---|---|---|
| UTF-8 | EF BB BF | No — causes problems in most contexts |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
UTF-8 BOM problems
Windows Notepad and Excel add a UTF-8 BOM when saving. This causes:
- PHP — "headers already sent" errors (the BOM counts as output before `header()` calls)
- JSON — invalid JSON; parsers reject files starting with bytes other than `{` or `[`
- CSV headers — first column named `\ufeffName` instead of `Name` (invisible BOM character prepended)
- Shell scripts — shebang line `#!/bin/bash` preceded by BOM bytes; the shell can't find the interpreter
- File concatenation — merging BOM-prefixed files puts BOM characters in the middle of the result
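The JSON failure mode is easy to reproduce with Python's standard library (the document content is an arbitrary example):

```python
import json

clean = '{"ok": true}'
with_bom = '\ufeff' + clean  # what you get after decoding a BOM-prefixed file

assert json.loads(clean) == {'ok': True}
try:
    json.loads(with_bom)     # the leading U+FEFF is not valid JSON
except json.JSONDecodeError:
    print('rejected')
```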
Removing BOM
```bash
# Detect BOM: look for "ef bb bf" at offset 0
hexdump -C file.txt | head -1

# Remove a 3-byte UTF-8 BOM (Linux/Mac)
tail -c +4 file.txt > file_clean.txt
```

```python
# Python: the utf-8-sig codec strips a BOM automatically if present
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

# Write back without BOM
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)
```
Exception: Excel requires a UTF-8 BOM to recognize that a CSV is UTF-8. If you generate CSV files for Excel users, write the BOM first:

```python
import csv

# utf-8-sig prepends the BOM; newline='' is required by the csv module
with open('for_excel.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)
```

Detecting Encoding
There is no guaranteed way to detect encoding — bytes are just bytes, and most encodings share the 0–127 range. But these methods work most of the time:
1. Check for BOM
```python
with open('file.txt', 'rb') as f:
    raw = f.read(4)

# Check longer BOMs first: the UTF-16 LE BOM (FF FE) is a prefix
# of the UTF-32 LE BOM (FF FE 00 00)
boms = [
    (b'\xef\xbb\xbf', 'UTF-8 with BOM'),
    (b'\xff\xfe\x00\x00', 'UTF-32 LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32 BE'),
    (b'\xff\xfe', 'UTF-16 LE'),
    (b'\xfe\xff', 'UTF-16 BE'),
]
for bom, name in boms:
    if raw.startswith(bom):
        print(name)
        break
```

2. Statistical detection
```bash
# Python
pip install chardet
```

```python
import chardet

with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
# → {'encoding': 'Shift_JIS', 'confidence': 0.99}
```

Command line

```bash
chardetect file.txt
```

JavaScript (Node.js)

```bash
npm install jschardet
```

```javascript
const jschardet = require('jschardet');
const buf = require('fs').readFileSync('file.txt');
console.log(jschardet.detect(buf));
```
chardet works well for files with enough content (500+ characters). For short strings, confidence scores below 0.7 are unreliable.
3. Context clues
- File came from a Japanese Windows user → try Shift-JIS first
- File came from a German government system → try Windows-1252 or ISO-8859-1
- File has `<meta charset>` or an XML declaration → trust it (but verify)
- HTTP response has a `Content-Type: ...; charset=` header → trust it
4. Visual confirmation
Open the file in Notepad++ or VS Code, try switching encodings, and look at the actual characters. If you expect "café" and see "café", you're reading UTF-8 as Latin-1. Switch to UTF-8 and verify.
Converting Files
iconv (Linux/Mac)
```bash
# Basic conversion
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
iconv -f ISO-8859-1 -t UTF-8 input.html > output.html

# Skip invalid sequences instead of stopping on error
iconv -f GB2312 -t UTF-8//IGNORE input.txt > output.txt

# List available encoding names
iconv -l
```
Python
Detect and convert:

```python
import chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()

detected = chardet.detect(raw)['encoding']
text = raw.decode(detected, errors='replace')

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```

Error handling options for `decode()`:

- `errors='strict'` — raises UnicodeDecodeError on a bad byte (the default)
- `errors='ignore'` — silently drops undecodable bytes
- `errors='replace'` — substitutes U+FFFD (replacement character) for bad bytes
- `errors='backslashreplace'` — shows a `\xXX` escape for bad bytes
Batch conversion

```python
import chardet
from pathlib import Path

src = Path('legacy-files')
dst = Path('utf8-files')
dst.mkdir(exist_ok=True)

for f in src.glob('**/*.txt'):
    raw = f.read_bytes()
    # detect() can return {'encoding': None}, so fall back with `or`
    enc = chardet.detect(raw).get('encoding') or 'utf-8'
    try:
        text = raw.decode(enc)
    except (UnicodeDecodeError, LookupError):
        text = raw.decode('utf-8', errors='replace')
    (dst / f.name).write_text(text, encoding='utf-8')
```
For browser-based conversion, the Text Encoding Converter handles all the encodings above, including Shift-JIS, GBK, Windows-1251, and all ISO-8859 variants.
Database Charset Settings
MySQL: use utf8mb4, not utf8
MySQL's utf8 charset is a historical mistake — it supports only 3-byte UTF-8, which means it cannot store emoji (which need 4 bytes) or characters above U+FFFF. Use utf8mb4 for real UTF-8:
```sql
-- Create database with utf8mb4
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert existing table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check current encoding
SHOW CREATE TABLE users\G
SHOW VARIABLES LIKE 'character_set%';
```

The connection string must also specify the charset, e.g. `mysql://user:pass@host/db?charset=utf8mb4`.
Double-encoding trap
A common MySQL mistake: store UTF-8 data in a latin1 column by connecting with charset latin1. The bytes are stored correctly but MySQL thinks they're Latin-1. When you read them back as Latin-1 through a UTF-8 connection, MySQL "converts" them — each UTF-8 byte gets incorrectly decoded, producing garbage. Fix:
```sql
-- Column stores UTF-8 bytes but is declared latin1.
-- Step 1: convert to binary (bypasses charset conversion)
ALTER TABLE t MODIFY col VARBINARY(255);
-- Step 2: declare the same bytes as UTF-8
ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET utf8mb4;
```
PostgreSQL
PostgreSQL's UTF8 is true UTF-8 with no 3-byte limitation. Set it at database creation:
```sql
CREATE DATABASE myapp ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8';

-- Set client encoding for a session
SET client_encoding = 'UTF8';

-- Check current encoding
SHOW client_encoding;
SELECT pg_encoding_to_char(encoding) FROM pg_database
WHERE datname = current_database();
```
HTTP and HTML Encoding Declarations
HTTP Content-Type header
Browsers use the HTTP header's charset parameter first, before looking at the HTML meta tag. Your server should set it correctly:
```
Content-Type: text/html; charset=utf-8
Content-Type: text/plain; charset=utf-8
Content-Type: application/json; charset=utf-8
Content-Type: text/csv; charset=utf-8
```
Note: JSON (RFC 8259) requires UTF-8 and prohibits the BOM. If your server returns JSON, you should not include charset in the Content-Type since UTF-8 is mandatory, but including it doesn't break anything.
HTML
```html
<!-- HTML5: must appear within the first 1024 bytes -->
<meta charset="UTF-8">

<!-- HTML4/XHTML style (still valid) -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<!-- XML declaration (XHTML) -->
<?xml version="1.0" encoding="UTF-8"?>
```
XML
XML defaults to UTF-8 if no encoding is declared. Always declare it anyway for clarity:
```xml
<?xml version="1.0" encoding="UTF-8"?>
```
CSS
Add the charset declaration at the very top of CSS files containing non-ASCII characters (font names, content values):
```css
@charset "UTF-8";
```
Code Examples by Language
Python 3
```python
# Always specify encoding — don't rely on locale defaults
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Write with explicit encoding
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

# Encode/decode bytes ↔ strings
text = 'café'
encoded = text.encode('utf-8')     # b'caf\xc3\xa9'
decoded = encoded.decode('utf-8')  # 'café'

# Handle errors gracefully: bad bytes become U+FFFD (replacement character)
text = b'\xff\xfe invalid bytes'.decode('utf-8', errors='replace')
```
JavaScript / Node.js
```javascript
// Browser: TextEncoder / TextDecoder (built-in)
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('café');
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);

// TextDecoder supports legacy encodings too
const win1252Decoder = new TextDecoder('windows-1252');
const text2 = win1252Decoder.decode(buffer);

// Node.js: fs encoding
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('out.txt', content, 'utf8');

// Node.js: iconv-lite for legacy encodings
const iconv = require('iconv-lite');
const buf = fs.readFileSync('file.txt');
const text3 = iconv.decode(buf, 'shift-jis');
```
Java
```java
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Always specify a charset — don't use new String(bytes) without one
String text = new String(bytes, StandardCharsets.UTF_8);
byte[] bytes2 = text.getBytes(StandardCharsets.UTF_8);

// File I/O
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
```
Common mistakes
- MySQL: using `utf8` instead of `utf8mb4` — emoji and some Chinese characters get rejected
- Python 2 legacy: mixing byte strings and unicode strings without explicit encode/decode
- Node.js: calling `buffer.toString()` without specifying `'utf8'` — it defaults to UTF-8, but being explicit is safer
- PHP: not setting `mb_internal_encoding('UTF-8')` — plain string functions operate byte-by-byte and mangle multibyte characters
- Java: using `new String(bytes)` instead of `new String(bytes, StandardCharsets.UTF_8)` — relies on the platform default charset, which varies