Text Encoding Explained: UTF-8, ASCII, and Character Sets
A comprehensive guide to understanding text encoding, from ASCII and legacy character sets to Unicode and UTF-8. Learn how to handle international text, fix encoding issues, and avoid mojibake.
What is Text Encoding?
Text encoding is the system that maps human-readable characters to numbers (bytes) that computers can store and process. Every text file, database field, and string in memory uses some encoding scheme.
Why It Matters
- International Support: Different languages require different character sets
- Data Integrity: Wrong encoding causes garbled text or data loss
- Compatibility: Systems must agree on encoding to exchange data
- Storage Efficiency: Different encodings use different amounts of space
History: From ASCII to Unicode
ASCII (1963)
The American Standard Code for Information Interchange was the first widely-adopted text encoding.
- 7-bit encoding: 128 characters total (0-127)
- Coverage: English letters, numbers, basic punctuation
- Range: 0-31 (control characters), 32-126 (printable), 127 (delete)
ASCII Examples:
65 (0x41) = 'A'
97 (0x61) = 'a'
48 (0x30) = '0'
32 (0x20) = space
10 (0x0A) = newline
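These mappings are directly visible from Python's ord() and chr(); a quick sketch:

ord('A')   # 65 (0x41)
chr(97)    # 'a'
ord('0')   # 48 (0x30)
chr(10)    # '\n' (newline)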
Extended ASCII & Code Pages
When computers moved to 8-bit bytes (256 values), the extra 128 positions (128-255) were used for extended characters. But different systems used them differently, creating incompatible code pages.
| Code Page | Region | Characters |
|---|---|---|
| CP437 | Original IBM PC | Box drawing, accented letters |
| Windows-1252 | Western Europe | €, smart quotes, dashes |
| ISO-8859-1 | Western Europe | Similar to Windows-1252 |
| Windows-1251 | Cyrillic | Russian, Bulgarian, Serbian |
| Windows-1256 | Arabic | Arabic script |
Problem: A file encoded in Windows-1252 shows gibberish when opened with Windows-1251. No way to mix languages in one document.
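A minimal Python sketch makes the problem concrete: the same byte means different letters under different code pages.

# The byte 0xE4 is 'ä' in Windows-1252 but 'д' in Windows-1251
raw = bytes([0xE4])
print(raw.decode('windows-1252'))  # ä (Western European)
print(raw.decode('windows-1251'))  # д (Cyrillic)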
Unicode (1991)
Unicode solved the chaos by creating a universal character set that assigns a unique number (code point) to every character across all writing systems.
- Code Points: Written as U+XXXX (e.g., U+0041 for 'A')
- Coverage: 149,186 characters across 161 scripts (as of Unicode 15.0)
- Includes: All languages, emoji, mathematical symbols, ancient scripts
Unicode Code Points:
U+0041 = A (Latin)
U+03B1 = α (Greek alpha)
U+4E2D = 中 (Chinese "middle")
U+1F600 = 😀 (grinning face emoji)
U+0410 = А (Cyrillic capital A)

Important: Unicode is the "what" (which characters and their numbers). UTF-8/UTF-16/UTF-32 are the "how" (how to store those numbers as bytes).
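In Python terms, a str holds code points (the "what"), and encode() picks the byte representation (the "how"). A quick sketch:

ch = '中'
hex(ord(ch))            # '0x4e2d' — the code point (the "what")
ch.encode('utf-8')      # 3 bytes: E4 B8 AD (one "how")
ch.encode('utf-16-be')  # 2 bytes: 4E 2D (another "how")
ch.encode('utf-32-be')  # 4 bytes: 00 00 4E 2D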
UTF-8 Explained
UTF-8 (Unicode Transformation Format - 8-bit) is the dominant text encoding on the web and in modern systems.
How UTF-8 Works
UTF-8 is a variable-width encoding: characters use 1-4 bytes depending on their code point.
| Byte Count | Code Point Range | Byte Pattern | Examples |
|---|---|---|---|
| 1 byte | U+0000 - U+007F | 0xxxxxxx | ASCII (A-Z, 0-9, punctuation) |
| 2 bytes | U+0080 - U+07FF | 110xxxxx 10xxxxxx | Latin accents (é, ñ), Greek, Cyrillic |
| 3 bytes | U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Chinese, Japanese, Korean, most emoji |
| 4 bytes | U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Rare emoji, ancient scripts |
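To see how a code point is packed into those marker bits, here is U+00E9 (é) hand-encoded in Python — a sketch matching the 2-byte row above:

cp = 0x00E9                       # code point of é (11 significant bits max)
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx -> 0xC3 (top 5 bits)
byte2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx -> 0xA9 (low 6 bits)
bytes([byte1, byte2])             # b'\xc3\xa9'
'é'.encode('utf-8')               # b'\xc3\xa9' — matches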
Examples
Character | Code Point | UTF-8 Bytes (Hex)
----------|------------|------------------
A | U+0041 | 41
é | U+00E9 | C3 A9
中 | U+4E2D | E4 B8 AD
😀 | U+1F600 | F0 9F 98 80
€ | U+20AC | E2 82 AC

Why UTF-8 Won
- Backward Compatible: Valid ASCII is valid UTF-8 (first 128 characters identical)
- Space Efficient: English text uses same space as ASCII (1 byte per char)
- Universal: Supports every language and emoji
- Self-Synchronizing: Can find character boundaries after errors
- No Byte Order Issues: Unlike UTF-16, no endianness concerns
- Web Standard: Used by 98%+ of websites
UTF-16 and UTF-32
UTF-16: Uses 2 or 4 bytes per character. Common in Java, Windows internals, JavaScript strings. Less efficient for English text.
UTF-32: Always 4 bytes per character. Simple but wastes space. Rarely used in practice.
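A quick size comparison in Python (the -le codecs emit no BOM, so the counts are pure character data):

s = 'Hello, 世界'           # 7 ASCII chars + 2 CJK chars
len(s.encode('utf-8'))      # 13 bytes (7×1 + 2×3)
len(s.encode('utf-16-le'))  # 18 bytes (9×2)
len(s.encode('utf-32-le'))  # 36 bytes (9×4)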
Common Legacy Encodings
You may encounter these when working with old files or legacy systems:
Japanese Encodings
Shift-JIS (Shift Japanese Industrial Standards)
- Coverage: Japanese (Hiragana, Katakana, Kanji)
- Usage: Dominant in Japanese Windows systems
- Variants: Windows-31J (Microsoft's version), CP932
- Issues: Some trailing bytes collide with ASCII characters such as 0x5C (backslash), confusing naive parsers
EUC-JP (Extended Unix Code for Japanese)
- Coverage: Japanese
- Usage: Unix/Linux systems in Japan
- Advantage: ASCII-compatible (ASCII bytes unchanged)
ISO-2022-JP
- Coverage: Japanese
- Usage: Email (historically)
- Method: Uses escape sequences to switch between ASCII and Japanese
Chinese Encodings
GB2312 / GBK / GB18030
- GB2312: Simplified Chinese (1980), 7,445 characters
- GBK: Extension of GB2312, 21,886 characters
- GB18030: Official Chinese standard, includes all Unicode characters
- Usage: Mainland China systems
Big5
- Coverage: Traditional Chinese (Taiwan, Hong Kong)
- Characters: ~13,000
- Issues: Multiple variants (Big5-HKSCS, etc.)
Korean Encoding
EUC-KR
- Coverage: Korean (Hangul, Hanja)
- Usage: South Korean systems
- Extension: CP949 (Windows version with more characters)
Cyrillic Encodings
Windows-1251
- Coverage: Russian, Bulgarian, Serbian (Cyrillic alphabet)
- Usage: Windows systems in Russia/Eastern Europe
KOI8-R (Kod Obmena Informatsiey, 8-bit)
- Coverage: Russian Cyrillic
- Usage: Unix/Linux systems, Russian internet (historically)
- Unique: Cyrillic letters map to Latin letter positions
Western European Encodings
ISO-8859-1 (Latin-1)
- Coverage: Western European languages
- Characters: ASCII + Western European accents (é, ü, ñ, etc.)
- Usage: Default in early web, HTTP, email
Windows-1252 (CP1252)
- Coverage: Similar to ISO-8859-1 with additions
- Extra Characters: Smart quotes (“ ” ‘ ’), €, •, ™, etc.
- Usage: Default Windows encoding for Western languages
- Issue: Often mislabeled as ISO-8859-1
Other ISO-8859 Family
- ISO-8859-2: Central European (Polish, Czech, Hungarian)
- ISO-8859-5: Cyrillic
- ISO-8859-7: Greek
- ISO-8859-8: Hebrew
- ISO-8859-9: Turkish
- ISO-8859-15: Latin-1 with € symbol
Encoding Problems & How to Fix Them
What is Mojibake?
Mojibake (文字化け, Japanese for "garbled characters") is the scrambled text that results from decoding bytes with the wrong encoding.
Common Examples
| Original | Wrong Encoding | Result |
|---|---|---|
| café | UTF-8 read as Windows-1252 | cafÃ© |
| 日本語 | UTF-8 read as Shift-JIS | 譌・譛ャ隱� |
| Hello | UTF-16 read as UTF-8 | H\0e\0l\0l\0o\0 |
| €100 | Windows-1252 read as ISO-8859-1 | 100 (€ becomes the invisible control character U+0080) |
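Each of these rows is easy to reproduce in Python by deliberately mismatching codecs; a sketch:

# UTF-8 bytes decoded as Windows-1252 -> 'cafÃ©'
print('café'.encode('utf-8').decode('windows-1252'))

# UTF-16 bytes decoded as UTF-8 -> a NUL byte after every letter
print('Hello'.encode('utf-16-le').decode('utf-8'))  # 'H\x00e\x00l\x00l\x00o\x00'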
Causes of Encoding Problems
- Missing Declaration: HTML without <meta charset="UTF-8">
- Editor Mismatch: Saving in one encoding, opening in another
- Database Issues: Connection encoding differs from table encoding
- File Transfer: FTP/email converting encodings
- Copy-Paste: Pasting between applications with different encodings
- Default Assumptions: Software guessing wrong encoding
Diagnostic Patterns
| If you see... | Likely cause |
|---|---|
| Ã©, Ã , Ã¼ | UTF-8 decoded as Windows-1252 |
| â€™, â€œ, â€ | UTF-8 smart quotes decoded as Windows-1252 |
| � | Replacement character (decoding failed) |
| \u0000 or null bytes | UTF-16 decoded as an 8-bit encoding |
| Chinese/Japanese becoming squares | Asian encoding decoded as Western |
| Ð or Ñ followed by odd characters | UTF-8 Cyrillic decoded as a Western encoding |

How to Fix Encoding Issues
1. Identify the Encodings
# Linux: Use 'file' command
file -i document.txt
# Output: text/plain; charset=utf-8

# Python: Use the chardet library
import chardet
with open('file.txt', 'rb') as f:
result = chardet.detect(f.read())
print(result['encoding']) # e.g., 'utf-8', 'ISO-8859-1'
Browser: Check Network tab -> Headers -> Content-Type
2. Convert to UTF-8
# Linux: Use iconv
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
# Python
with open('input.txt', 'r', encoding='shift-jis') as f:
content = f.read()
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
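If the damage already happened upstream (UTF-8 text was decoded as Windows-1252 and saved that way), the round trip can often be reversed — a hedged Python sketch that only works when no bytes were lost along the way:

mangled = 'cafÃ©'  # UTF-8 data that was wrongly decoded as Windows-1252
fixed = mangled.encode('windows-1252').decode('utf-8')
print(fixed)  # café

The ftfy library automates this kind of repair for many common mojibake patterns.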
// Node.js (using iconv-lite)
const iconv = require('iconv-lite');
const fs = require('fs');
const buffer = fs.readFileSync('input.txt');
const str = iconv.decode(buffer, 'shift-jis');
fs.writeFileSync('output.txt', str, 'utf8');
3. Fix Database Encoding
-- MySQL: Check encoding
SHOW CREATE TABLE my_table;
-- Convert table to UTF-8
ALTER TABLE my_table
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- PostgreSQL: Set client encoding
SET client_encoding = 'UTF8';
-- SQLite: Always stores as UTF-8/UTF-16
4. HTML/Web Fixes
<!-- Always declare charset in HTML -->
<meta charset="UTF-8">
<!-- In HTTP headers (server config) -->
Content-Type: text/html; charset=utf-8
<!-- CSS files -->
@charset "UTF-8";
<!-- XML files -->
<?xml version="1.0" encoding="UTF-8"?>
Byte Order Mark (BOM)
The Byte Order Mark is a special Unicode character (U+FEFF) placed at the beginning of a file to indicate encoding and byte order.
BOM by Encoding
| Encoding | BOM Bytes (Hex) | Required? |
|---|---|---|
| UTF-8 | EF BB BF | No (optional, rarely used) |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
Should You Use UTF-8 BOM?
Generally NO for web files and most modern systems.
Problems with UTF-8 BOM
- PHP: BOM causes "headers already sent" errors
- JSON: Invalid JSON (must not have BOM)
- Scripts: Shebangs (#!/bin/bash) fail if preceded by BOM
- Parsing: Some parsers choke on unexpected BOM
- Concatenation: Merging files creates BOM in middle
When UTF-8 BOM is OK
- Plain text files opened in Windows Notepad
- CSV files for Excel (Excel needs the BOM to detect UTF-8; see the sketch below)
- Legacy Windows applications that expect it
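For the Excel case, Python's utf-8-sig codec writes the BOM automatically; a minimal sketch (the file name is just a placeholder):

import csv

# 'utf-8-sig' prepends EF BB BF so Excel recognizes the CSV as UTF-8
with open('report.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['José', 'São Paulo'])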
Detecting and Removing BOM
# Check for BOM (Linux/Mac)
hexdump -C file.txt | head -n 1
# Look for: ef bb bf at start

# Remove BOM (Linux/Mac)
tail -c +4 file.txt > file_no_bom.txt
# Python: Remove BOM
with open('file.txt', 'r', encoding='utf-8-sig') as f:
content = f.read()
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
Detecting File Encoding
There's no 100% reliable way to detect encoding (bytes are just bytes), but these methods work most of the time:
1. Check File Headers/BOM
# First 4 bytes tell you UTF-16/32 with BOM
EF BB BF → UTF-8 with BOM
FE FF → UTF-16 BE
FF FE → UTF-16 LE (or UTF-32 LE if followed by 00 00)
00 00 FE FF → UTF-32 BE
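A small Python sketch that sniffs these signatures — note the order: UTF-32 LE must be tested before UTF-16 LE, because FF FE is a prefix of both:

def sniff_bom(path):
    # Read the first 4 bytes and match them against known BOMs
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(b'\x00\x00\xfe\xff'):
        return 'utf-32-be'
    if head.startswith(b'\xff\xfe\x00\x00'):
        return 'utf-32-le'
    if head.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    if head.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if head.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    return None  # no BOM; fall back to the other methods below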
2. Check Metadata
- HTML: <meta charset="...">
- XML: <?xml version="1.0" encoding="..."?>
- HTTP Headers: Content-Type charset parameter
- Email: Content-Type in MIME headers
3. Use Detection Tools
# Linux: file command
file -i document.txt
# Python: chardet library
pip install chardet
chardetect document.txt
# JavaScript: jschardet
npm install jschardet
const jschardet = require('jschardet');
const fs = require('fs');
const buffer = fs.readFileSync('file.txt');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);
Text editors
- Notepad++: Encoding menu
- VS Code: Bottom right status bar
- Sublime Text: View → Encoding
4. Try Common Encodings
Based on language/region of content:
- English/Western: Try UTF-8, Windows-1252, ISO-8859-1
- Japanese: Try UTF-8, Shift-JIS, EUC-JP
- Chinese (Simplified): Try UTF-8, GBK, GB2312
- Chinese (Traditional): Try UTF-8, Big5
- Korean: Try UTF-8, EUC-KR
- Russian: Try UTF-8, Windows-1251, KOI8-R
5. Validation Techniques
Some encodings have patterns that help validation:
UTF-8 validity check:
- No byte can be 0xC0, 0xC1, or 0xF5-0xFF
- Multi-byte sequences must follow patterns
- Continuation bytes must be 10xxxxxx
ASCII validity check:
- All bytes must be 0x00-0x7F
- No high bit set
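In practice, the cheapest version of both checks is a strict decode attempt; a Python sketch:

def is_valid(data: bytes, encoding: str) -> bool:
    # Strict mode raises UnicodeDecodeError on any invalid sequence
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

is_valid(b'\xe4\xb8\xad', 'utf-8')  # True  (valid 3-byte sequence: 中)
is_valid(b'\xe4\xb8\xad', 'ascii')  # False (high bits set)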
Best Practices for International Text
1. Always Use UTF-8
- For all new files, databases, APIs, and web content
- Set UTF-8 in HTML meta tags, HTTP headers, database connections
- Configure text editors to save as UTF-8 (without BOM)
2. Declare Encoding Explicitly
<!-- HTML5 -->
<meta charset="UTF-8">
# Python files (PEP 263)
# -*- coding: utf-8 -*-
// Database connection (MySQL)
mysql://user:pass@host/db?charset=utf8mb4
// HTTP Response
Content-Type: text/html; charset=utf-8
3. Database Configuration
-- MySQL: Use utf8mb4 (not utf8!)
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- PostgreSQL
CREATE DATABASE mydb
ENCODING 'UTF8';
-- Connection string
mysql://user:pass@host/db?charset=utf8mb4
4. File I/O
# Python 3: Always specify encoding
with open('file.txt', 'r', encoding='utf-8') as f:
content = f.read()
// Node.js
fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('file.txt', content, 'utf8');
// Java
Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
5. Testing
- Test with non-ASCII characters: é, ñ, 中, 日, Ω, € (see the sketch after this list)
- Test with emoji: 😀 🎉 👍
- Test with right-to-left languages: العربية, עברית
- Use actual user data from target markets
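A minimal round-trip test along these lines (sample strings and the test name are illustrative):

SAMPLES = ['é ñ 中 日 Ω €', '😀 🎉 👍', 'العربية', 'עברית']

def test_utf8_roundtrip():
    for s in SAMPLES:
        # Encoding then decoding must reproduce the original text exactly
        assert s.encode('utf-8').decode('utf-8') == s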
6. Version Control
# .gitattributes: normalize line endings, ensure UTF-8
* text=auto eol=lf
*.txt text encoding=utf-8
*.md text encoding=utf-8
7. Avoid These Mistakes
- Don't use utf8 in MySQL (use utf8mb4; MySQL's utf8 is a 3-byte subset that can't store emoji)
- Don't assume default encoding (always specify)
- Don't mix encodings in the same project
- Don't trust user-supplied encoding claims (validate)
- Don't use UTF-8 BOM for web files
Programming Tips for Encoding Conversion
Python
# Read file with specific encoding
with open('file.txt', 'r', encoding='shift-jis') as f:
content = f.read()
# Write as UTF-8
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
# Handle errors gracefully
with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
content = f.read() # Invalid bytes become �
# Convert bytes
text_bytes = b'\xe3\x81\x82' # UTF-8 bytes for あ
text = text_bytes.decode('utf-8') # → 'あ'
back = text.encode('utf-8') # → b'\xe3\x81\x82'
# Auto-detect encoding
import chardet
with open('file.txt', 'rb') as f:
rawdata = f.read()
result = chardet.detect(rawdata)
encoding = result['encoding']
text = rawdata.decode(encoding)
Node.js
const fs = require('fs');
const iconv = require('iconv-lite');
// Read with specific encoding
const buffer = fs.readFileSync('file.txt');
const text = iconv.decode(buffer, 'shift-jis');
// Write as UTF-8
fs.writeFileSync('output.txt', text, 'utf8');
// Convert buffer between encodings
const utf8Buffer = iconv.encode(text, 'utf-8');
// Check encoding support
if (iconv.encodingExists('windows-1252')) {
// ...
}
// Detect encoding (using jschardet)
const jschardet = require('jschardet');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);
Java
import java.nio.charset.*;
import java.nio.file.*;
// Read with specific encoding
String content = Files.readString(
Paths.get("file.txt"),
Charset.forName("Shift_JIS")
);
// Write as UTF-8
Files.writeString(
Paths.get("output.txt"),
content,
StandardCharsets.UTF_8
);
// Convert between encodings: decode with the actual source charset
byte[] sjisBytes = Files.readAllBytes(Paths.get("file.txt"));  // Shift_JIS bytes
String text = new String(sjisBytes, Charset.forName("Shift_JIS"));
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
// String is UTF-16 internally in Java
// Always specify charset for bytes ↔ String conversion
PHP
<?php
// Convert encoding
$text = mb_convert_encoding($input, 'UTF-8', 'SJIS');
// Auto-detect source encoding
$encoding = mb_detect_encoding($input, ['UTF-8', 'SJIS', 'EUC-JP']);
$text = mb_convert_encoding($input, 'UTF-8', $encoding);
// Set internal encoding
mb_internal_encoding('UTF-8');
// Read file
$content = file_get_contents('file.txt');
$utf8 = mb_convert_encoding($content, 'UTF-8', 'auto');
// Write file
file_put_contents('output.txt', $utf8);
?>
Command Line Tools
# iconv: Convert between encodings
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f UTF-16LE -t UTF-8 input.txt > output.txt
# List available encodings
iconv -l
# recode (alternative to iconv)
recode shift-jis..utf-8 input.txt
# dos2unix: Fix line endings (also converts UTF-16 input to UTF-8)
dos2unix file.txt
# uchardet: Detect encoding
uchardet file.txt
Quick Reference