Text Encoding Explained: UTF-8, ASCII, and Character Sets
A comprehensive guide to understanding text encoding, from ASCII and legacy character sets to Unicode and UTF-8. Learn how to handle international text, fix encoding issues, and avoid mojibake.
What is Text Encoding?
Text encoding is the system that maps human-readable characters to numbers (bytes) that computers can store and process. Every text file, database field, and string in memory uses some encoding scheme.
Why It Matters
- International Support: Different languages require different character sets
- Data Integrity: Wrong encoding causes garbled text or data loss
- Compatibility: Systems must agree on encoding to exchange data
- Storage Efficiency: Different encodings use different amounts of space
History: From ASCII to Unicode
ASCII (1963)
The American Standard Code for Information Interchange was the first widely-adopted text encoding.
- 7-bit encoding: 128 characters total (0-127)
- Coverage: English letters, numbers, basic punctuation
- Range: 0-31 (control characters), 32-126 (printable), 127 (delete)
ASCII Examples:
65 (0x41) = 'A'
97 (0x61) = 'a'
48 (0x30) = '0'
32 (0x20) = space
10 (0x0A) = newline
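These mappings are directly visible from Python's ord() and chr(); a quick sketch:

ord('A')   # 65 (0x41)
chr(97)    # 'a'
ord('0')   # 48 (0x30)
chr(10)    # '\n' (newline)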
Extended ASCII & Code Pages
When computers moved to 8-bit bytes (256 values), the extra 128 positions (128-255) were used for extended characters. But different systems used them differently, creating incompatible code pages.
| Code Page | Region | Characters |
|---|---|---|
| CP437 | Original IBM PC | Box drawing, accented letters |
| Windows-1252 | Western Europe | €, smart quotes, dashes |
| ISO-8859-1 | Western Europe | Similar to Windows-1252 |
| Windows-1251 | Cyrillic | Russian, Bulgarian, Serbian |
| Windows-1256 | Arabic | Arabic script |
Problem: A file encoded in Windows-1252 shows gibberish when opened with Windows-1251. No way to mix languages in one document.
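A minimal Python sketch makes the problem concrete: the same byte means different letters under different code pages.

# The byte 0xE4 is 'ä' in Windows-1252 but 'д' in Windows-1251
raw = bytes([0xE4])
print(raw.decode('windows-1252'))  # ä (Western European)
print(raw.decode('windows-1251'))  # д (Cyrillic)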
Unicode (1991)
Unicode solved the chaos by creating a universal character set that assigns a unique number (code point) to every character across all writing systems.
- Code Points: Written as U+XXXX (e.g., U+0041 for 'A')
- Coverage: 149,186 characters across 161 scripts (as of Unicode 15.0)
- Includes: All languages, emoji, mathematical symbols, ancient scripts
Unicode Code Points:
U+0041 = A (Latin)
U+03B1 = α (Greek alpha)
U+4E2D = 中 (Chinese "middle")
U+1F600 = 😀 (grinning face emoji)
U+0410 = А (Cyrillic capital A)

Important: Unicode is the "what" (which characters and their numbers). UTF-8/UTF-16/UTF-32 are the "how" (how to store those numbers as bytes).
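In Python terms, a str holds code points (the "what"), and encode() picks the byte representation (the "how"). A quick sketch:

ch = '中'
hex(ord(ch))            # '0x4e2d' — the code point (the "what")
ch.encode('utf-8')      # 3 bytes: E4 B8 AD (one "how")
ch.encode('utf-16-be')  # 2 bytes: 4E 2D (another "how")
ch.encode('utf-32-be')  # 4 bytes: 00 00 4E 2D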
UTF-8 Explained
UTF-8 (Unicode Transformation Format - 8-bit) is the dominant text encoding on the web and in modern systems.
How UTF-8 Works
UTF-8 is a variable-width encoding: characters use 1-4 bytes depending on their code point.
| Byte Count | Code Point Range | Byte Pattern | Examples |
|---|---|---|---|
| 1 byte | U+0000 - U+007F | 0xxxxxxx | ASCII (A-Z, 0-9, punctuation) |
| 2 bytes | U+0080 - U+07FF | 110xxxxx 10xxxxxx | Latin accents (é, ñ), Greek, Cyrillic |
| 3 bytes | U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Chinese, Japanese, Korean, most emoji |
| 4 bytes | U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Rare emoji, ancient scripts |
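To see how a code point is packed into those marker bits, here is U+00E9 (é) hand-encoded in Python — a sketch matching the 2-byte row above:

cp = 0x00E9                       # code point of é (11 significant bits max)
byte1 = 0b11000000 | (cp >> 6)    # 110xxxxx -> 0xC3 (top 5 bits)
byte2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx -> 0xA9 (low 6 bits)
bytes([byte1, byte2])             # b'\xc3\xa9'
'é'.encode('utf-8')               # b'\xc3\xa9' — matches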
Examples
Character | Code Point | UTF-8 Bytes (Hex)
----------|------------|------------------
A | U+0041 | 41
é | U+00E9 | C3 A9
中 | U+4E2D | E4 B8 AD
😀 | U+1F600 | F0 9F 98 80
€ | U+20AC | E2 82 AC

Why UTF-8 Won
- Backward Compatible: Valid ASCII is valid UTF-8 (first 128 characters identical)
- Space Efficient: English text uses same space as ASCII (1 byte per char)
- Universal: Supports every language and emoji
- Self-Synchronizing: Can find character boundaries after errors
- No Byte Order Issues: Unlike UTF-16, no endianness concerns
- Web Standard: Used by 98%+ of websites
UTF-16 and UTF-32
UTF-16: Uses 2 or 4 bytes per character. Common in Java, Windows internals, JavaScript strings. Less efficient for English text.
UTF-32: Always 4 bytes per character. Simple but wastes space. Rarely used in practice.
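A quick size comparison in Python (the -le codecs emit no BOM, so the counts are pure character data):

s = 'Hello, 世界'           # 7 ASCII chars + 2 CJK chars
len(s.encode('utf-8'))      # 13 bytes (7×1 + 2×3)
len(s.encode('utf-16-le'))  # 18 bytes (9×2)
len(s.encode('utf-32-le'))  # 36 bytes (9×4)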
Common Legacy Encodings
You may encounter these when working with old files or legacy systems:
Japanese Encodings
Shift-JIS (Shift Japanese Industrial Standards)
- Coverage: Japanese (Hiragana, Katakana, Kanji)
- Usage: Dominant in Japanese Windows systems
- Variants: Windows-31J (Microsoft's version), CP932
- Issues: Some trailing bytes collide with ASCII characters such as 0x5C (backslash), confusing naive parsers
EUC-JP (Extended Unix Code for Japanese)
- Coverage: Japanese
- Usage: Unix/Linux systems in Japan
- Advantage: ASCII-compatible (ASCII bytes unchanged)
ISO-2022-JP
- Coverage: Japanese
- Usage: Email (historically)
- Method: Uses escape sequences to switch between ASCII and Japanese
Chinese Encodings
GB2312 / GBK / GB18030
- GB2312: Simplified Chinese (1980), 7,445 characters
- GBK: Extension of GB2312, 21,886 characters
- GB18030: Official Chinese standard, includes all Unicode characters
- Usage: Mainland China systems
Big5
- Coverage: Traditional Chinese (Taiwan, Hong Kong)
- Characters: ~13,000
- Issues: Multiple variants (Big5-HKSCS, etc.)
Korean Encoding
EUC-KR
- Coverage: Korean (Hangul, Hanja)
- Usage: South Korean systems
- Extension: CP949 (Windows version with more characters)
Cyrillic Encodings
Windows-1251
- Coverage: Russian, Bulgarian, Serbian (Cyrillic alphabet)
- Usage: Windows systems in Russia/Eastern Europe
KOI8-R (Kod Obmena Informatsiey, 8-bit)
- Coverage: Russian Cyrillic
- Usage: Unix/Linux systems, Russian internet (historically)
- Unique: Cyrillic letters map to Latin letter positions
Western European Encodings
ISO-8859-1 (Latin-1)
- Coverage: Western European languages
- Characters: ASCII + Western European accents (é, ü, ñ, etc.)
- Usage: Default in early web, HTTP, email
Windows-1252 (CP1252)
- Coverage: Similar to ISO-8859-1 with additions
- Extra Characters: Smart quotes (“ ” ‘ ’), €, •, ™, etc.
- Usage: Default Windows encoding for Western languages
- Issue: Often mislabeled as ISO-8859-1
Other ISO-8859 Family
- ISO-8859-2: Central European (Polish, Czech, Hungarian)
- ISO-8859-5: Cyrillic
- ISO-8859-7: Greek
- ISO-8859-8: Hebrew
- ISO-8859-9: Turkish
- ISO-8859-15: Latin-1 with € symbol
Encoding Problems & How to Fix Them
What is Mojibake?
Mojibake (文字化け, Japanese for "garbled characters") is the scrambled text that results from decoding bytes with the wrong encoding.
Common Examples
| Original | Wrong Encoding | Result |
|---|---|---|
| café | UTF-8 read as Windows-1252 | cafÃ© |
| 日本語 | UTF-8 read as Shift-JIS | 譌・譛ャ隱� |
| Hello | UTF-16 read as UTF-8 | H\0e\0l\0l\0o\0 |
| €100 | Windows-1252 read as ISO-8859-1 | 100 (€ becomes the invisible control character U+0080) |
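Each of these rows is easy to reproduce in Python by deliberately mismatching codecs; a sketch:

# UTF-8 bytes decoded as Windows-1252 -> 'cafÃ©'
print('café'.encode('utf-8').decode('windows-1252'))

# UTF-16 bytes decoded as UTF-8 -> a NUL byte after every letter
print('Hello'.encode('utf-16-le').decode('utf-8'))  # 'H\x00e\x00l\x00l\x00o\x00'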
Causes of Encoding Problems
- Missing Declaration: HTML without <meta charset="UTF-8">
- Editor Mismatch: Saving in one encoding, opening in another
- Database Issues: Connection encoding differs from table encoding
- File Transfer: FTP/email converting encodings
- Copy-Paste: Pasting between applications with different encodings
- Default Assumptions: Software guessing wrong encoding
Diagnostic Patterns
| If you see... | Likely cause |
|---|---|
| Ã©, Ã , Ã¼ | UTF-8 decoded as Windows-1252 |
| â€™, â€œ, â€ | UTF-8 smart quotes decoded as Windows-1252 |
| � | Replacement character (decoding failed) |
| \u0000 or null bytes | UTF-16 decoded as an 8-bit encoding |
| Chinese/Japanese becoming squares | Asian encoding decoded as Western |
| Ð or Ñ followed by odd characters | UTF-8 Cyrillic decoded as a Western encoding |

How to Fix Encoding Issues
1. Identify the Encodings
# Linux: Use 'file' command
file -i document.txt
# Output: text/plain; charset=utf-8

# Python: Use the chardet library
import chardet
with open('file.txt', 'rb') as f:
result = chardet.detect(f.read())
print(result['encoding']) # e.g., 'utf-8', 'ISO-8859-1'
Browser: Check Network tab -> Headers -> Content-Type
2. Convert to UTF-8
# Linux: Use iconv
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
# Python
with open('input.txt', 'r', encoding='shift-jis') as f:
content = f.read()
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
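If the damage already happened upstream (UTF-8 text was decoded as Windows-1252 and saved that way), the round trip can often be reversed — a hedged Python sketch that only works when no bytes were lost along the way:

mangled = 'cafÃ©'  # UTF-8 data that was wrongly decoded as Windows-1252
fixed = mangled.encode('windows-1252').decode('utf-8')
print(fixed)  # café

The ftfy library automates this kind of repair for many common mojibake patterns.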
// Node.js (using iconv-lite)
const iconv = require('iconv-lite');
const fs = require('fs');
const buffer = fs.readFileSync('input.txt');
const str = iconv.decode(buffer, 'shift-jis');
fs.writeFileSync('output.txt', str, 'utf8');
3. Fix Database Encoding
-- MySQL: Check encoding
SHOW CREATE TABLE my_table;
-- Convert table to UTF-8
ALTER TABLE my_table
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- PostgreSQL: Set client encoding
SET client_encoding = 'UTF8';
-- SQLite: Always stores as UTF-8/UTF-16
4. HTML/Web Fixes
<!-- Always declare charset in HTML -->
<meta charset="UTF-8">
<!-- In HTTP headers (server config) -->
Content-Type: text/html; charset=utf-8
<!-- CSS files -->
@charset "UTF-8";
<!-- XML files -->
<?xml version="1.0" encoding="UTF-8"?>
Byte Order Mark (BOM)
The Byte Order Mark is a special Unicode character (U+FEFF) placed at the beginning of a file to indicate encoding and byte order.
BOM by Encoding
| Encoding | BOM Bytes (Hex) | Required? |
|---|---|---|
| UTF-8 | EF BB BF | No (optional, rarely used) |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
Should You Use UTF-8 BOM?
Generally NO for web files and most modern systems.
Problems with UTF-8 BOM
- PHP: BOM causes "headers already sent" errors
- JSON: Invalid JSON (must not have BOM)
- Scripts: Shebangs (#!/bin/bash) fail if preceded by BOM
- Parsing: Some parsers choke on unexpected BOM
- Concatenation: Merging files creates BOM in middle
When UTF-8 BOM is OK
- Plain text files opened in Windows Notepad
- CSV files for Excel (Excel needs the BOM to detect UTF-8; see the sketch below)
- Legacy Windows applications that expect it
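For the Excel case, Python's utf-8-sig codec writes the BOM automatically; a minimal sketch (the file name is just a placeholder):

import csv

# 'utf-8-sig' prepends EF BB BF so Excel recognizes the CSV as UTF-8
with open('report.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['José', 'São Paulo'])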
Detecting and Removing BOM
# Check for BOM (Linux/Mac)
hexdump -C file.txt | head -n 1
# Look for: ef bb bf at start

# Remove BOM (Linux/Mac)
tail -c +4 file.txt > file_no_bom.txt
# Python: Remove BOM
with open('file.txt', 'r', encoding='utf-8-sig') as f:
content = f.read()
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
Detecting File Encoding
There's no 100% reliable way to detect encoding (bytes are just bytes), but these methods work most of the time:
1. Check File Headers/BOM
# First 4 bytes tell you UTF-16/32 with BOM
EF BB BF → UTF-8 with BOM
FE FF → UTF-16 BE
FF FE → UTF-16 LE (or UTF-32 LE if followed by 00 00)
00 00 FE FF → UTF-32 BE
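A small Python sketch that sniffs these signatures — note the order: UTF-32 LE must be tested before UTF-16 LE, because FF FE is a prefix of both:

def sniff_bom(path):
    # Read the first 4 bytes and match them against known BOMs
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(b'\x00\x00\xfe\xff'):
        return 'utf-32-be'
    if head.startswith(b'\xff\xfe\x00\x00'):
        return 'utf-32-le'
    if head.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    if head.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if head.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    return None  # no BOM; fall back to the other methods below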
2. Check Metadata
- HTML: <meta charset="...">
- XML: <?xml version="1.0" encoding="..."?>
- HTTP Headers: Content-Type charset parameter
- Email: Content-Type in MIME headers
3. Use Detection Tools
# Linux: file command
file -i document.txt
# Python: chardet library
pip install chardet
chardetect document.txt
# JavaScript: jschardet
npm install jschardet
const jschardet = require('jschardet');
const fs = require('fs');
const buffer = fs.readFileSync('file.txt');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);
Text editors
- Notepad++: Encoding menu
- VS Code: Bottom right status bar
- Sublime Text: View → Encoding
4. Try Common Encodings
Based on language/region of content:
- English/Western: Try UTF-8, Windows-1252, ISO-8859-1
- Japanese: Try UTF-8, Shift-JIS, EUC-JP
- Chinese (Simplified): Try UTF-8, GBK, GB2312
- Chinese (Traditional): Try UTF-8, Big5
- Korean: Try UTF-8, EUC-KR
- Russian: Try UTF-8, Windows-1251, KOI8-R
5. Validation Techniques
Some encodings have patterns that help validation:
UTF-8 validity check:
- No byte can be 0xC0, 0xC1, or 0xF5-0xFF
- Multi-byte sequences must follow patterns
- Continuation bytes must be 10xxxxxx
ASCII validity check:
- All bytes must be 0x00-0x7F
- No high bit set
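In practice, the cheapest version of both checks is a strict decode attempt; a Python sketch:

def is_valid(data: bytes, encoding: str) -> bool:
    # Strict mode raises UnicodeDecodeError on any invalid sequence
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

is_valid(b'\xe4\xb8\xad', 'utf-8')  # True  (valid 3-byte sequence: 中)
is_valid(b'\xe4\xb8\xad', 'ascii')  # False (high bits set)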
Best Practices for International Text
1. Always Use UTF-8
- For all new files, databases, APIs, and web content
- Set UTF-8 in HTML meta tags, HTTP headers, database connections
- Configure text editors to save as UTF-8 (without BOM)
2. Declare Encoding Explicitly
<!-- HTML5 -->
<meta charset="UTF-8">
# Python files (PEP 263)
# -*- coding: utf-8 -*-
// Database connection (MySQL)
mysql://user:pass@host/db?charset=utf8mb4
// HTTP Response
Content-Type: text/html; charset=utf-8
3. Database Configuration
-- MySQL: Use utf8mb4 (not utf8!)
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- PostgreSQL
CREATE DATABASE mydb
ENCODING 'UTF8';
-- Connection string
mysql://user:pass@host/db?charset=utf8mb4
4. File I/O
# Python 3: Always specify encoding
with open('file.txt', 'r', encoding='utf-8') as f:
content = f.read()
// Node.js
fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('file.txt', content, 'utf8');
// Java
Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
5. Testing
- Test with non-ASCII characters: é, ñ, 中, 日, Ω, € (see the sketch after this list)
- Test with emoji: 😀 🎉 👍
- Test with right-to-left languages: العربية, עברית
- Use actual user data from target markets
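A minimal round-trip test along these lines (sample strings and the test name are illustrative):

SAMPLES = ['é ñ 中 日 Ω €', '😀 🎉 👍', 'العربية', 'עברית']

def test_utf8_roundtrip():
    for s in SAMPLES:
        # Encoding then decoding must reproduce the original text exactly
        assert s.encode('utf-8').decode('utf-8') == s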
6. Version Control
# .gitattributes: normalize line endings, ensure UTF-8
* text=auto eol=lf
*.txt text encoding=utf-8
*.md text encoding=utf-8
7. Avoid These Mistakes
- Don't use utf8 in MySQL (use utf8mb4; MySQL's utf8 is a 3-byte subset that can't store emoji)
- Don't assume default encoding (always specify)
- Don't mix encodings in the same project
- Don't trust user-supplied encoding claims (validate)
- Don't use UTF-8 BOM for web files
Programming Tips for Encoding Conversion
Python
# Read file with specific encoding
with open('file.txt', 'r', encoding='shift-jis') as f:
content = f.read()
# Write as UTF-8
with open('output.txt', 'w', encoding='utf-8') as f:
f.write(content)
# Handle errors gracefully
with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
content = f.read() # Invalid bytes become �
# Convert bytes
text_bytes = b'\xe3\x81\x82' # UTF-8 bytes for あ
text = text_bytes.decode('utf-8') # → 'あ'
back = text.encode('utf-8') # → b'\xe3\x81\x82'
# Auto-detect encoding
import chardet
with open('file.txt', 'rb') as f:
rawdata = f.read()
result = chardet.detect(rawdata)
encoding = result['encoding']
text = rawdata.decode(encoding)
Node.js
const fs = require('fs');
const iconv = require('iconv-lite');
// Read with specific encoding
const buffer = fs.readFileSync('file.txt');
const text = iconv.decode(buffer, 'shift-jis');
// Write as UTF-8
fs.writeFileSync('output.txt', text, 'utf8');
// Convert buffer between encodings
const utf8Buffer = iconv.encode(text, 'utf-8');
// Check encoding support
if (iconv.encodingExists('windows-1252')) {
// ...
}
// Detect encoding (using jschardet)
const jschardet = require('jschardet');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);
Java
import java.nio.charset.*;
import java.nio.file.*;
// Read with specific encoding
String content = Files.readString(
Paths.get("file.txt"),
Charset.forName("Shift_JIS")
);
// Write as UTF-8
Files.writeString(
Paths.get("output.txt"),
content,
StandardCharsets.UTF_8
);
// Convert between encodings: decode with the actual source charset
byte[] sjisBytes = Files.readAllBytes(Paths.get("file.txt"));  // Shift_JIS bytes
String text = new String(sjisBytes, Charset.forName("Shift_JIS"));
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
// String is UTF-16 internally in Java
// Always specify charset for bytes ↔ String conversion
PHP
<?php
// Convert encoding
$text = mb_convert_encoding($input, 'UTF-8', 'SJIS');
// Auto-detect source encoding
$encoding = mb_detect_encoding($input, ['UTF-8', 'SJIS', 'EUC-JP']);
$text = mb_convert_encoding($input, 'UTF-8', $encoding);
// Set internal encoding
mb_internal_encoding('UTF-8');
// Read file
$content = file_get_contents('file.txt');
$utf8 = mb_convert_encoding($content, 'UTF-8', 'auto');
// Write file
file_put_contents('output.txt', $utf8);
?>
Command Line Tools
# iconv: Convert between encodings
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f UTF-16LE -t UTF-8 input.txt > output.txt
# List available encodings
iconv -l
# recode (alternative to iconv)
recode shift-jis..utf-8 input.txt
# dos2unix: Fix line endings (also converts UTF-16 input to UTF-8)
dos2unix file.txt
# uchardet: Detect encoding
uchardet file.txt
Quick Reference