Text Encoding Explained: UTF-8, ASCII, and Character Sets

A comprehensive guide to understanding text encoding, from ASCII and legacy character sets to Unicode and UTF-8. Learn how to handle international text, fix encoding issues, and avoid mojibake.

What is Text Encoding?

Text encoding is the system that maps human-readable characters to numbers (bytes) that computers can store and process. Every text file, database field, and string in memory uses some encoding scheme.

Why It Matters

  • International Support: Different languages require different character sets
  • Data Integrity: Wrong encoding causes garbled text or data loss
  • Compatibility: Systems must agree on encoding to exchange data
  • Storage Efficiency: Different encodings use different amounts of space

Quick Answer: Use UTF-8 for everything new. It's the universal standard that supports all languages, emoji, and symbols while remaining compatible with ASCII.

History: From ASCII to Unicode

ASCII (1963)

The American Standard Code for Information Interchange was the first widely adopted text encoding.

  • 7-bit encoding: 128 characters total (0-127)
  • Coverage: English letters, numbers, basic punctuation
  • Range: 0-31 (control characters), 32-126 (printable), 127 (delete)

ASCII Examples:
65 (0x41) = 'A'
97 (0x61) = 'a'
48 (0x30) = '0'
32 (0x20) = space
10 (0x0A) = newline
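
Python's built-in ord() and chr() make these mappings easy to check; a quick sketch using only the standard library:

# Python: Inspect ASCII mappings
print(ord('A'), hex(ord('A')))  # 65 0x41
print(chr(97))                  # a
print('A'.encode('ascii'))      # b'A' (one byte per character)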

Extended ASCII & Code Pages

When computers moved to 8-bit bytes (256 values), the extra 128 positions (128-255) were used for extended characters. But different systems used them differently, creating code pages.

Code Page    | Region          | Characters
-------------|-----------------|--------------------------------
CP437        | Original IBM PC | Box drawing, accented letters
Windows-1252 | Western Europe  | €, smart quotes, accents
ISO-8859-1   | Western Europe  | Similar to Windows-1252
Windows-1251 | Cyrillic        | Russian, Bulgarian, Serbian
Windows-1256 | Arabic          | Arabic script

Problem: A file encoded in Windows-1252 shows gibberish when opened with Windows-1251. No way to mix languages in one document.

Unicode (1991)

Unicode solved the chaos by creating a universal character set that assigns a unique number (code point) to every character across all writing systems.

  • Code Points: Written as U+XXXX (e.g., U+0041 for 'A')
  • Coverage: more than 140,000 characters across 150+ scripts, growing with each Unicode release
  • Includes: All languages, emoji, mathematical symbols, ancient scripts

Unicode Code Points:
U+0041 = A (Latin)
U+03B1 = α (Greek alpha)
U+4E2D = 中 (Chinese "middle")
U+1F600 = 😀 (grinning face emoji)
U+0410 = А (Cyrillic capital A)

Important: Unicode is the "what" (which characters and their numbers). UTF-8/UTF-16/UTF-32 are the "how" (how to store those numbers as bytes).
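
A short Python sketch makes the distinction concrete: the code point stays fixed while the bytes depend on the encoding you choose:

# Python: One code point, different byte representations
ch = '中'
print(f'U+{ord(ch):04X}')            # U+4E2D (the "what")
print(ch.encode('utf-8').hex())      # e4b8ad (the "how" in UTF-8: 3 bytes)
print(ch.encode('utf-16-be').hex())  # 4e2d (the "how" in UTF-16: 2 bytes)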

UTF-8 Explained

UTF-8 (Unicode Transformation Format - 8-bit) is the dominant text encoding on the web and in modern systems.

How UTF-8 Works

UTF-8 is a variable-width encoding: characters use 1-4 bytes depending on their code point.

Byte Count | Code Point Range   | Byte Pattern                        | Examples
-----------|--------------------|-------------------------------------|---------------------------------------
1 byte     | U+0000 - U+007F    | 0xxxxxxx                            | ASCII (A-Z, 0-9, punctuation)
2 bytes    | U+0080 - U+07FF    | 110xxxxx 10xxxxxx                   | Latin accents (é, ñ), Greek, Cyrillic
3 bytes    | U+0800 - U+FFFF    | 1110xxxx 10xxxxxx 10xxxxxx          | Chinese, Japanese, Korean, symbols
4 bytes    | U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Most emoji, rare and ancient scripts

Examples

Character | Code Point | UTF-8 Bytes (Hex)
----------|------------|------------------
A         | U+0041     | 41
é         | U+00E9     | C3 A9
中        | U+4E2D     | E4 B8 AD
😀        | U+1F600    | F0 9F 98 80
€         | U+20AC     | E2 82 AC
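
You can reproduce this table with str.encode(); a minimal sketch (the separator argument to hex() needs Python 3.8+):

# Python: Code point and UTF-8 bytes for each sample character
for ch in ['A', 'é', '中', '😀', '€']:
    utf8 = ch.encode('utf-8')
    print(ch, f'U+{ord(ch):04X}', utf8.hex(' ').upper(), f'({len(utf8)} bytes)')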

Why UTF-8 Won

  • Backward Compatible: Valid ASCII is valid UTF-8 (first 128 characters identical)
  • Space Efficient: English text uses same space as ASCII (1 byte per char)
  • Universal: Supports every language and emoji
  • Self-Synchronizing: Can find character boundaries after errors
  • No Byte Order Issues: Unlike UTF-16, no endianness concerns
  • Web Standard: Used by 98%+ of websites

UTF-16 and UTF-32

UTF-16: Uses 2 or 4 bytes per character. Common in Java, Windows internals, JavaScript strings. Less efficient for English text.

UTF-32: Always 4 bytes per character. Simple but wastes space. Rarely used in practice.
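
A quick Python comparison shows the size trade-off for ASCII-only text (lengths exclude any BOM):

# Python: Encoded size of the same string in UTF-8/16/32
s = 'Hello'
print(len(s.encode('utf-8')))      # 5 bytes (1 per ASCII character)
print(len(s.encode('utf-16-le')))  # 10 bytes (2 per character)
print(len(s.encode('utf-32-le')))  # 20 bytes (4 per character)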

Common Legacy Encodings

You may encounter these when working with old files or legacy systems:

Japanese Encodings

Shift-JIS (Shift Japanese Industrial Standards)

  • Coverage: Japanese (Hiragana, Katakana, Kanji)
  • Usage: Dominant in Japanese Windows systems
  • Variants: Windows-31J (Microsoft's version), CP932
  • Issues: Second bytes of two-byte characters can overlap with printable ASCII (e.g., the backslash, 0x5C), which trips up naive parsers

EUC-JP (Extended Unix Code for Japanese)

  • Coverage: Japanese
  • Usage: Unix/Linux systems in Japan
  • Advantage: ASCII-compatible (ASCII bytes unchanged)

ISO-2022-JP

  • Coverage: Japanese
  • Usage: Email (historically)
  • Method: Uses escape sequences to switch between ASCII and Japanese
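
Python's standard iso2022_jp codec can show this escape-sequence switching; a tiny sketch (the bytes between the escapes vary with the text):

# Python: ISO-2022-JP brackets Japanese text with escape sequences
data = '日本語'.encode('iso2022_jp')
print(data)  # b'\x1b$B...\x1b(B' (ESC $ B enters Japanese mode, ESC ( B returns to ASCII)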

Chinese Encodings

GB2312 / GBK / GB18030

  • GB2312: Simplified Chinese (1980), 7,445 characters
  • GBK: Extension of GB2312, 21,886 characters
  • GB18030: Official Chinese standard, includes all Unicode characters
  • Usage: Mainland China systems

Big5

  • Coverage: Traditional Chinese (Taiwan, Hong Kong)
  • Characters: ~13,000
  • Issues: Multiple variants (Big5-HKSCS, etc.)

Korean Encoding

EUC-KR

  • Coverage: Korean (Hangul, Hanja)
  • Usage: South Korean systems
  • Extension: CP949 (Windows version with more characters)

Cyrillic Encodings

Windows-1251

  • Coverage: Russian, Bulgarian, Serbian (Cyrillic alphabet)
  • Usage: Windows systems in Russia/Eastern Europe

KOI8-R (Kod Obmena Informatsiey, 8-bit)

  • Coverage: Russian Cyrillic
  • Usage: Unix/Linux systems, Russian internet (historically)
  • Unique: Designed so that stripping the eighth bit leaves a readable Latin transliteration of the Cyrillic text

Western European Encodings

ISO-8859-1 (Latin-1)

  • Coverage: Western European languages
  • Characters: ASCII + Western European accents (é, ü, ñ, etc.)
  • Usage: Default in early web, HTTP, email

Windows-1252 (CP1252)

  • Coverage: Similar to ISO-8859-1 with additions
  • Extra Characters: Smart quotes (“ ” ‘ ’), €, •, ™, œ, etc.
  • Usage: Default Windows encoding for Western languages
  • Issue: Often mislabeled as ISO-8859-1

Other ISO-8859 Family

  • ISO-8859-2: Central European (Polish, Czech, Hungarian)
  • ISO-8859-5: Cyrillic
  • ISO-8859-7: Greek
  • ISO-8859-8: Hebrew
  • ISO-8859-9: Turkish
  • ISO-8859-15: Latin-1 with € symbol

Encoding Problems & How to Fix Them

What is Mojibake?

Mojibake (文字化け, "character transformation") is garbled text caused by decoding text with the wrong encoding.

Common Examples

Original | Wrong Encoding             | Result
---------|----------------------------|------------------
café     | UTF-8 read as Windows-1252 | cafÃ©
日本語   | UTF-8 read as Shift-JIS    | 譌・譛ャ隱�
Hello    | UTF-16 read as UTF-8       | H\0e\0l\0l\0o\0
€100     | UTF-8 read as Windows-1252 | â‚¬100
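
Mojibake is easy to reproduce on purpose, which helps when diagnosing it; a minimal Python sketch of the first row:

# Python: Encode correctly, then decode with the wrong code page
good = 'café'
garbled = good.encode('utf-8').decode('windows-1252')
print(garbled)  # cafÃ©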

Causes of Encoding Problems

  1. Missing Declaration: HTML without <meta charset="UTF-8">
  2. Editor Mismatch: Saving in one encoding, opening in another
  3. Database Issues: Connection encoding differs from table encoding
  4. File Transfer: FTP/email converting encodings
  5. Copy-Paste: Pasting between applications with different encodings
  6. Default Assumptions: Software guessing wrong encoding

Diagnostic Patterns

If you see...                | Likely cause
-----------------------------|---------------------------------------------
Ã©, Ã , Ã¼                   | UTF-8 decoded as Windows-1252
â€™, â€œ, â€                 | UTF-8 smart quotes decoded as Windows-1252
�                            | Replacement character (decoding failed)
\u0000 or null bytes         | UTF-16 decoded as an 8-bit encoding
CJK text becoming squares    | Asian encoding decoded as a Western one
Strings full of Ð and Ñ      | Cyrillic UTF-8 decoded as a Western encoding
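
When the damage is a single wrong decode, reversing the two steps often recovers the original. A sketch for the most common case (UTF-8 misread as Windows-1252); the ftfy Python library automates this kind of repair:

# Python: Undo UTF-8-decoded-as-Windows-1252 mojibake
garbled = 'cafÃ©'
fixed = garbled.encode('windows-1252').decode('utf-8')
print(fixed)  # café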

How to Fix Encoding Issues

1. Identify the Encodings

# Linux: Use 'file' command
file -i document.txt

Output: text/plain; charset=utf-8

Python: Use chardet library

import chardet

with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result['encoding'])  # e.g., 'utf-8', 'ISO-8859-1'

Browser: Check Network tab -> Headers -> Content-Type

2. Convert to UTF-8

# Linux: Use iconv
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt

Python

with open('input.txt', 'r', encoding='shift-jis') as f:
    content = f.read()

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

Node.js (using iconv-lite)

const iconv = require('iconv-lite');
const fs = require('fs');

const buffer = fs.readFileSync('input.txt');
const str = iconv.decode(buffer, 'shift-jis');
fs.writeFileSync('output.txt', str, 'utf8');

3. Fix Database Encoding

-- MySQL: Check encoding
SHOW CREATE TABLE my_table;

-- Convert table to UTF-8
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- PostgreSQL: Set client encoding
SET client_encoding = 'UTF8';

-- SQLite: Always stores text as UTF-8/UTF-16

4. HTML/Web Fixes

<!-- Always declare charset in HTML -->
<meta charset="UTF-8">

<!-- In HTTP headers (server config) -->
Content-Type: text/html; charset=utf-8

<!-- CSS files -->
@charset "UTF-8";

<!-- XML files -->
<?xml version="1.0" encoding="UTF-8"?>

Byte Order Mark (BOM)

The Byte Order Mark is a special Unicode character (U+FEFF) placed at the beginning of a file to indicate encoding and byte order.

BOM by Encoding

Encoding  | BOM Bytes (Hex) | Required?
----------|-----------------|----------------------------
UTF-8     | EF BB BF        | No (optional, rarely used)
UTF-16 BE | FE FF           | Recommended
UTF-16 LE | FF FE           | Recommended
UTF-32 BE | 00 00 FE FF     | Recommended
UTF-32 LE | FF FE 00 00     | Recommended

Should You Use UTF-8 BOM?

Generally NO for web files and most modern systems.

Problems with UTF-8 BOM

  • PHP: BOM causes "headers already sent" errors
  • JSON: Invalid JSON (must not have BOM)
  • Scripts: Shebangs (#!/bin/bash) fail if preceded by BOM
  • Parsing: Some parsers choke on unexpected BOM
  • Concatenation: Merging files creates BOM in middle

When UTF-8 BOM is OK

  • Plain text files opened in Windows Notepad
  • CSV files for Excel (Excel uses the BOM to auto-detect UTF-8; see the sketch after this list)
  • Legacy Windows applications that expect it
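
For the Excel case, Python's 'utf-8-sig' codec writes the BOM for you; a minimal sketch (report.csv is a hypothetical output file):

# Python: Write a CSV that Excel will auto-detect as UTF-8
import csv

with open('report.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['José', 'São Paulo'])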

Detecting and Removing BOM

# Check for BOM (Linux/Mac)
hexdump -C file.txt | head -n 1

Look for: ef bb bf at start

Remove BOM (Linux/Mac)

tail -c +4 file.txt > file_no_bom.txt

Python: Remove BOM

with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

Detecting File Encoding

There's no 100% reliable way to detect encoding (bytes are just bytes), but these methods work most of the time:

1. Check File Headers/BOM

# The first bytes identify the encoding when a BOM is present
EF BB BF        → UTF-8 with BOM
FE FF           → UTF-16 BE
FF FE           → UTF-16 LE (or UTF-32 LE if followed by 00 00)
00 00 FE FF     → UTF-32 BE
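
The Python standard library exposes these BOM constants, so a sniffer is a few lines; a sketch (file.txt stands in for the file you are checking). The UTF-32 LE test must run before UTF-16 LE, because FF FE is a prefix of FF FE 00 00:

# Python: Identify a BOM from the first four bytes
import codecs

with open('file.txt', 'rb') as f:
    head = f.read(4)

if head.startswith(codecs.BOM_UTF32_LE):    # FF FE 00 00
    print('UTF-32 LE')
elif head.startswith(codecs.BOM_UTF32_BE):  # 00 00 FE FF
    print('UTF-32 BE')
elif head.startswith(codecs.BOM_UTF8):      # EF BB BF
    print('UTF-8 with BOM')
elif head.startswith(codecs.BOM_UTF16_LE):  # FF FE
    print('UTF-16 LE')
elif head.startswith(codecs.BOM_UTF16_BE):  # FE FF
    print('UTF-16 BE')
else:
    print('No BOM')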

2. Check Metadata

  • HTML: <meta charset="...">
  • XML: <?xml encoding="..."?>
  • HTTP Headers: Content-Type charset parameter
  • Email: Content-Type in MIME headers

3. Use Detection Tools

# Linux: file command
file -i document.txt

Python: chardet library

pip install chardet
chardetect document.txt

JavaScript: jschardet

npm install jschardet

const jschardet = require('jschardet');
const fs = require('fs');

const buffer = fs.readFileSync('file.txt');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);

Text editors

  • Notepad++: Encoding menu
  • VS Code: bottom-right of the status bar
  • Sublime Text: View → Encoding

4. Try Common Encodings

Based on language/region of content:

  • English/Western: Try UTF-8, Windows-1252, ISO-8859-1
  • Japanese: Try UTF-8, Shift-JIS, EUC-JP
  • Chinese (Simplified): Try UTF-8, GBK, GB2312
  • Chinese (Traditional): Try UTF-8, Big5
  • Korean: Try UTF-8, EUC-KR
  • Russian: Try UTF-8, Windows-1251, KOI8-R

5. Validation Techniques

Some encodings have patterns that help validation; a short sketch follows the two checklists below:

UTF-8 validity check:

  • No byte can be 0xC0, 0xC1, or 0xF5-0xFF
  • Multi-byte sequences must follow patterns
  • Continuation bytes must be 10xxxxxx

ASCII validity check:

  • All bytes must be 0x00-0x7F
  • No high-bit set
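
In Python, a strict decode doubles as a validity test, so both checks reduce to a few lines; a sketch, not a full validator:

# Python: Strict decoding raises on invalid byte sequences
def is_valid(data: bytes, encoding: str) -> bool:
    try:
        data.decode(encoding)  # errors='strict' is the default
        return True
    except UnicodeDecodeError:
        return False

sample = 'héllo'.encode('utf-8')
print(is_valid(sample, 'ascii'))  # False (0xC3 has the high bit set)
print(is_valid(sample, 'utf-8'))  # True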

Best Practices for International Text

1. Always Use UTF-8

  • For all new files, databases, APIs, and web content
  • Set UTF-8 in HTML meta tags, HTTP headers, database connections
  • Configure text editors to save as UTF-8 (without BOM)

2. Declare Encoding Explicitly

<!-- HTML5 -->
<meta charset="UTF-8">

Python files (PEP 263)

# -*- coding: utf-8 -*-

Database connection (MySQL)

mysql://user:pass@host/db?charset=utf8mb4

HTTP response header

Content-Type: text/html; charset=utf-8

3. Database Configuration

-- MySQL: Use utf8mb4 (the legacy 'utf8' stores at most 3 bytes per character and rejects emoji)
CREATE DATABASE mydb
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

-- PostgreSQL
CREATE DATABASE mydb ENCODING 'UTF8';

-- Connection string
mysql://user:pass@host/db?charset=utf8mb4

4. File I/O

# Python 3: Always specify encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

Node.js

fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('file.txt', content, 'utf8');

Java

Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);

5. Testing

  • Test with non-ASCII characters: é, ñ, 中, 日, Ω, €
  • Test with emoji: 😀 🎉 👍
  • Test with right-to-left languages: العربية, עברית
  • Use actual user data from target markets (a minimal round-trip sketch follows)
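
A minimal Python sketch of such a round-trip test, assuming UTF-8 file I/O end to end (samples.txt is a hypothetical scratch file):

# Python: Round-trip sample strings through a file and compare
samples = ['é ñ 中 日 Ω €', '😀 🎉 👍', 'العربية עברית']

with open('samples.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(samples))

with open('samples.txt', 'r', encoding='utf-8') as f:
    assert f.read().split('\n') == samples
print('all samples round-trip cleanly')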

6. Version Control

# Git: Normalize line endings (Git assumes UTF-8 for text files)

.gitattributes:

*       text=auto eol=lf
*.txt   text
*.md    text

7. Avoid These Mistakes

  • Don't use utf8 in MySQL (use utf8mb4)
  • Don't assume default encoding (always specify)
  • Don't mix encodings in the same project
  • Don't trust user-supplied encoding claims (validate)
  • Don't use UTF-8 BOM for web files

Programming Tips for Encoding Conversion

Python

# Read file with specific encoding
with open('file.txt', 'r', encoding='shift-jis') as f:
    content = f.read()

Write as UTF-8

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

Handle errors gracefully

with open('file.txt', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()  # Invalid bytes become �

Convert bytes

text_bytes = b'\xe3\x81\x82'       # UTF-8 bytes for あ
text = text_bytes.decode('utf-8')  # → 'あ'
back = text.encode('utf-8')        # → b'\xe3\x81\x82'

Auto-detect encoding

import chardet

with open('file.txt', 'rb') as f:
    rawdata = f.read()
result = chardet.detect(rawdata)
encoding = result['encoding']
text = rawdata.decode(encoding)

Node.js

const fs = require('fs');
const iconv = require('iconv-lite');

// Read with specific encoding
const buffer = fs.readFileSync('file.txt');
const text = iconv.decode(buffer, 'shift-jis');

// Write as UTF-8
fs.writeFileSync('output.txt', text, 'utf8');

// Convert a string to bytes in another encoding
const utf8Buffer = iconv.encode(text, 'utf-8');

// Check encoding support
if (iconv.encodingExists('windows-1252')) { /* ... */ }

// Detect encoding (using jschardet)
const jschardet = require('jschardet');
const detected = jschardet.detect(buffer);
console.log(detected.encoding);

Java

import java.nio.charset.*;
import java.nio.file.*;

// Read with specific encoding
String content = Files.readString(Paths.get("file.txt"), Charset.forName("Shift_JIS"));

// Write as UTF-8
Files.writeString(Paths.get("output.txt"), content, StandardCharsets.UTF_8);

// Convert a String to bytes in a target encoding (decode with the same charset)
byte[] sjisBytes = content.getBytes(Charset.forName("Shift_JIS"));
String roundTripped = new String(sjisBytes, Charset.forName("Shift_JIS"));

// Java Strings are UTF-16 internally
// Always specify the charset when converting between bytes and String

PHP

<?php
// Convert encoding
$text = mb_convert_encoding($input, 'UTF-8', 'SJIS');

// Auto-detect source encoding
$encoding = mb_detect_encoding($input, ['UTF-8', 'SJIS', 'EUC-JP']);
$text = mb_convert_encoding($input, 'UTF-8', $encoding);

// Set internal encoding
mb_internal_encoding('UTF-8');

// Read file
$content = file_get_contents('file.txt');
$utf8 = mb_convert_encoding($content, 'UTF-8', 'auto');

// Write file
file_put_contents('output.txt', $utf8);
?>

Command Line Tools

# iconv: Convert between encodings
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f UTF-16LE -t UTF-8 input.txt > output.txt

# List available encodings
iconv -l

# recode (alternative to iconv)
recode shift-jis..utf-8 input.txt

# dos2unix: Fix line endings (also converts UTF-16 input to UTF-8)
dos2unix file.txt

# uchardet: Detect encoding
uchardet file.txt

Quick Reference

Always Use

  • UTF-8 for new projects
  • utf8mb4 in MySQL
  • Explicit encoding declarations
  • No BOM for web files

Never Do

  • Assume default encoding
  • Use MySQL 'utf8' (use utf8mb4)
  • Mix encodings in one project
  • Ignore encoding in file I/O

Last updated: December 2024
