Text Encoding: UTF-8, Unicode, Mojibake, and How to Fix It
Character encoding is invisible until it breaks. This guide covers how it works, where it goes wrong, and how to fix it — from ASCII and Latin-1 to Shift-JIS and GB2312, with concrete examples for databases, HTTP headers, and code.
Computers store everything as numbers. Text encoding is the agreement that maps characters to numbers and back. When the sender and receiver use the same mapping, text is legible. When they don't, you get garbage.
How Encoding Works
Three concepts are often confused:
- Character set — the collection of characters (e.g., Latin alphabet, Kanji)
- Code point — the number assigned to each character (e.g., A = 65)
- Encoding — how those numbers are stored as bytes (e.g., UTF-8 stores code points using 1-4 bytes)
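The distinction between code point and encoding is easy to see from Python, where `ord()` gives the code point and `.encode()` applies an encoding (a minimal sketch; the characters are arbitrary examples):

```python
# Code point: the abstract number assigned to a character
assert ord('A') == 65
assert ord('é') == 0xE9

# Encoding: how that number is serialized to bytes
assert 'é'.encode('utf-8') == b'\xc3\xa9'   # two bytes in UTF-8
assert 'é'.encode('latin-1') == b'\xe9'     # one byte in Latin-1
```

Same code point, different byte representations: that gap is where every encoding bug lives.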
ASCII (1963) is the bedrock. It maps 128 characters to numbers 0–127 using 7 bits. That covers uppercase and lowercase English letters, digits, punctuation, and 33 control characters (newline, tab, etc.).

```
65 = A    97 = a    48 = 0
66 = B    98 = b    32 = space
67 = C    99 = c    10 = newline (LF)
                    13 = carriage return (CR)
```

ASCII has no accents, no non-Latin scripts, and no symbols beyond basic punctuation. Nearly every byte-oriented encoding designed after it keeps it as a subset — the 0–127 range maps identically in ASCII, Latin-1, Windows-1252, UTF-8, and most others. That's why English text often opens without problems even in the wrong encoding.
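That shared 0–127 range is easy to verify in Python; pure-ASCII bytes decode identically under all of these encodings:

```python
data = b'Hello, World! 123'

# The 0-127 range maps identically in ASCII and its byte-oriented descendants
decoded = {data.decode(enc) for enc in ('ascii', 'latin-1', 'cp1252', 'utf-8')}
assert decoded == {'Hello, World! 123'}
```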
Unicode and UTF-8
Unicode
Unicode (1991) solved the pre-internet chaos of incompatible encodings by defining a single universal character set. Every character in every writing system gets a unique number called a code point, written as U+XXXX.
```
U+0041  = A  (Latin)
U+00E9  = é  (Latin with accent)
U+03B1  = α  (Greek alpha)
U+4E2D  = 中 (Chinese "middle")
U+1F600 = 😀 (emoji)
U+0410  = А  (Cyrillic capital A)
```

Unicode 15.0 defines 149,186 characters across 161 scripts. The standard is maintained by the Unicode Consortium and updated annually. Code points go up to U+10FFFF — that's over 1 million possible values, though most are unassigned.
UTF-8: the dominant encoding
UTF-8 is how most systems store Unicode. It uses 1 to 4 bytes per character depending on the code point:
| Bytes | Code point range | Bit pattern | Examples |
|---|---|---|---|
| 1 | U+0000–U+007F | 0xxxxxxx | ASCII (A–Z, 0–9) |
| 2 | U+0080–U+07FF | 110xxxxx 10xxxxxx | é, ñ, ü, Greek, Cyrillic |
| 3 | U+0800–U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx | Chinese, Japanese, Korean |
| 4 | U+10000–U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Emoji, ancient scripts |
```
A  = U+0041  → 0x41 (1 byte)
é  = U+00E9  → 0xC3 0xA9 (2 bytes)
中 = U+4E2D  → 0xE4 0xB8 0xAD (3 bytes)
😀 = U+1F600 → 0xF0 0x9F 0x98 0x80 (4 bytes)
```

UTF-8 won because it's backward-compatible with ASCII (plain English text is byte-identical), self-synchronizing (you can find character boundaries after a corrupt byte), and space-efficient for Western languages. 98%+ of websites use UTF-8 today.
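The byte counts in the table can be checked directly in Python, and so can the self-synchronization property: every continuation byte starts with the bits `10`:

```python
# 1 to 4 bytes per character, depending on the code point
for ch, expected in [('A', 1), ('é', 2), ('中', 3), ('😀', 4)]:
    assert len(ch.encode('utf-8')) == expected

# Continuation bytes always match 10xxxxxx, so a decoder can
# resynchronize at the next character boundary after a corrupt byte
b = '中'.encode('utf-8')  # b'\xe4\xb8\xad'
assert all(byte >> 6 == 0b10 for byte in b[1:])
```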
UTF-16 and UTF-32
UTF-16 uses 2 bytes for most characters; code points above U+FFFF are encoded as two 2-byte units called a surrogate pair (4 bytes total). Java, C#/.NET, JavaScript, and Windows internals use UTF-16 strings natively. It's efficient for East Asian text, where most characters need 3 bytes in UTF-8 but only 2 in UTF-16.
UTF-32 always uses 4 bytes per character. Simple to index (character N is at byte 4N) but wastes space — English text takes 4x more space than ASCII. Rarely used in practice except in some internal processing contexts.
Both UTF-16 and UTF-32 have an endianness issue: a 2-byte value 0x0041 could be stored as 00 41 (big-endian) or 41 00 (little-endian). The BOM character (see below) distinguishes them. UTF-8 has no endianness problem.
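Python's codecs make the endianness difference concrete: the `utf-16-le`/`utf-16-be` codecs encode without a BOM, while the generic `utf-16` codec prepends one so readers can tell the byte orders apart:

```python
# Same code point, opposite byte orders
assert 'A'.encode('utf-16-be') == b'\x00A'
assert 'A'.encode('utf-16-le') == b'A\x00'

# The generic codec writes a BOM first (native order, so either is possible)
assert 'A'.encode('utf-16')[:2] in (b'\xff\xfe', b'\xfe\xff')

# Code points above U+FFFF need a surrogate pair: 4 bytes in UTF-16
assert len('😀'.encode('utf-16-le')) == 4
```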
Legacy Encodings
These encodings predate Unicode and are still encountered in old files, legacy databases, and regional systems.
Western European
ISO-8859-1 (Latin-1) — Covers Western European languages by adding 96 characters (128–255) to ASCII. Includes é, ü, ñ, etc. Used as the default in early HTTP and email. Doesn't include the € sign (added later in ISO-8859-15).
Windows-1252 (CP1252) — Microsoft's extension of Latin-1. Adds smart quotes, em dash, €, and other characters in the 0x80–0x9F range that Latin-1 left as control codes. Often mislabeled as ISO-8859-1 in old web pages. The difference matters when you encounter ’ (the UTF-8 bytes for ' read as Windows-1252).
| Byte | ISO-8859-1 | Windows-1252 |
|---|---|---|
| 0x80 | (C1 control) | € |
| 0x91 | (C1 control) | ' (left single quote) |
| 0x93 | (C1 control) | " (left double quote) |
| 0x96 | (C1 control) | – (en dash) |
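The divergence only matters for bytes 0x80–0x9F, which a quick Python check confirms:

```python
# Byte 0x93: a curly quote in Windows-1252, a C1 control code in Latin-1
assert bytes([0x93]).decode('cp1252') == '\u201c'  # LEFT DOUBLE QUOTATION MARK
assert bytes([0x93]).decode('latin-1') == '\x93'   # unprintable control character

# Outside 0x80-0x9F the two encodings agree
assert bytes([0xE9]).decode('cp1252') == bytes([0xE9]).decode('latin-1') == 'é'
```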
Japanese encodings
Shift-JIS — The dominant Japanese encoding for Windows systems. Covers Hiragana, Katakana, Kanji, and ASCII. Has tricky byte sequences where some bytes can look like ASCII characters, making it prone to false detections. Also called CP932 or Windows-31J in Microsoft's variant.
EUC-JP — The Unix/Linux variant for Japanese. ASCII-compatible and less ambiguous than Shift-JIS, but both are obsolete for new work — use UTF-8 instead.
Chinese encodings
GB2312 — 1980 mainland China standard, covers 7,445 Simplified Chinese characters plus ASCII. Superseded by:
GBK — Extends GB2312 to 21,886 characters including Traditional Chinese used in Hong Kong. Used on Chinese Windows systems.
GB18030 — The current official Chinese standard. A superset of GBK that also covers all Unicode code points, making it the only legacy encoding that can represent everything UTF-8 can.
Big5 — Traditional Chinese for Taiwan and Hong Kong. ~13,000 characters. Has variants (Big5-HKSCS for Hong Kong) with different character mappings.
Cyrillic encodings
Windows-1251 — Russian, Bulgarian, Serbian in Cyrillic script. The standard for Russian Windows systems.
KOI8-R — A clever design where Cyrillic letters are placed at positions matching their Latin phonetic equivalents, so stripping the high bit gives readable (if wrong) transliteration. Used on Unix systems and old Russian internet.
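The bit-stripping property can be demonstrated in a few lines of Python (the word is an arbitrary example; note that KOI8-R's layout swaps letter case, so lowercase Cyrillic strips to uppercase Latin):

```python
raw = 'привет'.encode('koi8-r')

# Clearing the high bit of each byte yields a rough Latin transliteration
translit = bytes(b & 0x7F for b in raw).decode('ascii')
print(translit)  # PRIWET
```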
Other ISO-8859 family
- ISO-8859-2 — Central European (Polish, Czech, Slovak, Hungarian, Slovenian, Croatian)
- ISO-8859-7 — Greek
- ISO-8859-8 — Hebrew
- ISO-8859-9 — Turkish
- ISO-8859-15 — Latin-1 with the € symbol, updating the gap at 0xA4
Mojibake: Garbled Text
Mojibake (文字化け, Japanese for "character transformation") is what happens when text encoded in one system gets decoded by another. The bytes are intact but the interpretation is wrong.
Common patterns and their causes
| What you see | What it should be | Cause |
|---|---|---|
| é ü ñ | é ü ñ | UTF-8 decoded as Windows-1252 or Latin-1 |
| ’ “ †| ' " — | Windows-1252 smart quotes decoded as UTF-8 |
| Ð Ñ Ð± | Cyrillic | Cyrillic UTF-8 decoded as Latin-1 |
| 譌・譛ャ | 日本語 | Japanese UTF-8 decoded as Shift-JIS |
| H\0e\0l\0l\0o\0 | Hello | UTF-16 LE decoded as UTF-8 or Latin-1 |
| ? or □ or � | Various | Byte has no mapping in target encoding |
Diagnosing from the garbled text
The é pattern is diagnostic: é in UTF-8 is bytes C3 A9. In Windows-1252, byte C3 = à and A9 = ©. So é always means UTF-8 file opened as Windows-1252. Similarly, ’ is the three UTF-8 bytes for Windows-1252 byte 0x92 (right single quote).
Root causes
- Missing or wrong encoding declaration — HTML without `<meta charset="UTF-8">`, or with the wrong charset
- Editor mismatch — file saved as Windows-1252 by Notepad, then opened as UTF-8 by another editor
- Database connection encoding mismatch — application connecting to MySQL with `latin1` while the column stores UTF-8
- FTP in ASCII mode — ASCII-mode transfers rewrite line endings and can corrupt high bytes
- Double-encoding — UTF-8 bytes decoded as Latin-1 and re-encoded as UTF-8, so each multibyte character becomes two or more garbled characters, now stored as perfectly valid UTF-8
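Double-encoding is easy to reproduce deliberately, which also shows why it is pernicious: the result is valid UTF-8, so nothing errors out (a minimal Python sketch):

```python
original = 'é'  # U+00E9, UTF-8 bytes C3 A9

# Store UTF-8 bytes, read them back as Latin-1, save again as UTF-8
double = original.encode('utf-8').decode('latin-1').encode('utf-8')

assert double == b'\xc3\x83\xc2\xa9'   # four bytes instead of two
assert double.decode('utf-8') == 'Ã©'  # decodes "cleanly", but to mojibake
```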
The BOM Problem
The Byte Order Mark (U+FEFF) is a zero-width character placed at the start of a file to signal its encoding. For UTF-16 and UTF-32, it's necessary to distinguish big-endian from little-endian byte order. For UTF-8, it's optional and usually causes problems.
| Encoding | BOM bytes (hex) | Needed? |
|---|---|---|
| UTF-8 | EF BB BF | No — causes problems in most contexts |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
UTF-8 BOM problems
Windows Notepad and Excel add a UTF-8 BOM when saving. This causes:
- PHP — "headers already sent" errors (the BOM counts as output before `header()` calls)
- JSON — invalid JSON; parsers reject files starting with bytes other than `{` or `[`
- CSV headers — first column named `\ufeffName` instead of `Name` (invisible BOM character prepended)
- Shell scripts — shebang line `#!/bin/bash` preceded by BOM bytes; the shell can't find the interpreter
- File concatenation — merging BOM-prefixed files puts BOM characters in the middle of the result
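The JSON failure mode is easy to reproduce with Python's standard library (the document content is an arbitrary example):

```python
import json

clean = '{"ok": true}'
with_bom = '\ufeff' + clean  # what you get after decoding a BOM-prefixed file

assert json.loads(clean) == {'ok': True}
try:
    json.loads(with_bom)     # the leading U+FEFF is not valid JSON
except json.JSONDecodeError:
    print('rejected')
```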
Removing BOM
```bash
# Detect BOM: look for "ef bb bf" at offset 0
hexdump -C file.txt | head -1

# Remove a 3-byte UTF-8 BOM (Linux/Mac)
tail -c +4 file.txt > file_clean.txt
```

```python
# Python: the utf-8-sig codec strips a BOM automatically if present
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

# Write back without BOM
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)
```
Exception: Excel requires a UTF-8 BOM to recognize that a CSV is UTF-8. If you generate CSV files for Excel users, write the BOM first:

```python
import csv

# utf-8-sig prepends the BOM; newline='' is required by the csv module
with open('for_excel.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)
```

Detecting Encoding
There is no guaranteed way to detect encoding — bytes are just bytes, and most encodings share the 0–127 range. But these methods work most of the time:
1. Check for BOM
```python
with open('file.txt', 'rb') as f:
    raw = f.read(4)

# Check longer BOMs first: the UTF-16 LE BOM (FF FE) is a prefix
# of the UTF-32 LE BOM (FF FE 00 00)
boms = [
    (b'\xef\xbb\xbf', 'UTF-8 with BOM'),
    (b'\xff\xfe\x00\x00', 'UTF-32 LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32 BE'),
    (b'\xff\xfe', 'UTF-16 LE'),
    (b'\xfe\xff', 'UTF-16 BE'),
]
for bom, name in boms:
    if raw.startswith(bom):
        print(name)
        break
```

2. Statistical detection
```bash
# Python
pip install chardet
```

```python
import chardet

with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
# → {'encoding': 'Shift_JIS', 'confidence': 0.99}
```

Command line

```bash
chardetect file.txt
```

JavaScript (Node.js)

```bash
npm install jschardet
```

```javascript
const jschardet = require('jschardet');
const buf = require('fs').readFileSync('file.txt');
console.log(jschardet.detect(buf));
```
chardet works well for files with enough content (500+ characters). For short strings, confidence scores below 0.7 are unreliable.
3. Context clues
- File came from a Japanese Windows user → try Shift-JIS first
- File came from a German government system → try Windows-1252 or ISO-8859-1
- File has `<meta charset>` or an XML declaration → trust it (but verify)
- HTTP response has a `Content-Type: ...; charset=` header → trust it
4. Visual confirmation
Open the file in Notepad++ or VS Code, try switching encodings, and look at the actual characters. If you expect "café" and see "café", you're reading UTF-8 as Latin-1. Switch to UTF-8 and verify.
Converting Files
iconv (Linux/Mac)
```bash
# Basic conversion
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
iconv -f ISO-8859-1 -t UTF-8 input.html > output.html

# Skip invalid sequences instead of stopping on error
iconv -f GB2312 -t UTF-8//IGNORE input.txt > output.txt

# List available encoding names
iconv -l
```
Python
Detect and convert:

```python
import chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()

detected = chardet.detect(raw)['encoding']
text = raw.decode(detected, errors='replace')

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```

Error handling options for `decode()`:

- `errors='strict'` — raises UnicodeDecodeError on a bad byte (the default)
- `errors='ignore'` — silently drops undecodable bytes
- `errors='replace'` — substitutes U+FFFD (replacement character) for bad bytes
- `errors='backslashreplace'` — shows a `\xXX` escape for bad bytes
Batch conversion

```python
import chardet
from pathlib import Path

src = Path('legacy-files')
dst = Path('utf8-files')
dst.mkdir(exist_ok=True)

for f in src.glob('**/*.txt'):
    raw = f.read_bytes()
    # detect() can return {'encoding': None}, so fall back with `or`
    enc = chardet.detect(raw).get('encoding') or 'utf-8'
    try:
        text = raw.decode(enc)
    except (UnicodeDecodeError, LookupError):
        text = raw.decode('utf-8', errors='replace')
    (dst / f.name).write_text(text, encoding='utf-8')
```
For browser-based conversion, the Text Encoding Converter handles all the encodings above, including Shift-JIS, GBK, Windows-1251, and all ISO-8859 variants.
Database Charset Settings
MySQL: use utf8mb4, not utf8
MySQL's utf8 charset is a historical mistake — it supports only 3-byte UTF-8, which means it cannot store emoji (which need 4 bytes) or characters above U+FFFF. Use utf8mb4 for real UTF-8:
```sql
-- Create database with utf8mb4
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert existing table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check current encoding
SHOW CREATE TABLE users\G
SHOW VARIABLES LIKE 'character_set%';
```

The connection string must also specify the charset, e.g. `mysql://user:pass@host/db?charset=utf8mb4`.
Double-encoding trap
A common MySQL mistake: store UTF-8 data in a latin1 column by connecting with charset latin1. The bytes are stored correctly but MySQL thinks they're Latin-1. When you read them back as Latin-1 through a UTF-8 connection, MySQL "converts" them — each UTF-8 byte gets incorrectly decoded, producing garbage. Fix:
```sql
-- Column stores UTF-8 bytes but is declared latin1.
-- Step 1: convert to binary (bypasses charset conversion)
ALTER TABLE t MODIFY col VARBINARY(255);
-- Step 2: declare the same bytes as UTF-8
ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET utf8mb4;
```
PostgreSQL
PostgreSQL's UTF8 is true UTF-8 with no 3-byte limitation. Set it at database creation:
```sql
CREATE DATABASE myapp ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8';

-- Set client encoding for a session
SET client_encoding = 'UTF8';

-- Check current encoding
SHOW client_encoding;
SELECT pg_encoding_to_char(encoding) FROM pg_database
WHERE datname = current_database();
```
HTTP and HTML Encoding Declarations
HTTP Content-Type header
Browsers use the HTTP header's charset parameter first, before looking at the HTML meta tag. Your server should set it correctly:
```
Content-Type: text/html; charset=utf-8
Content-Type: text/plain; charset=utf-8
Content-Type: application/json; charset=utf-8
Content-Type: text/csv; charset=utf-8
```
Note: JSON (RFC 8259) requires UTF-8 and prohibits the BOM. If your server returns JSON, you should not include charset in the Content-Type since UTF-8 is mandatory, but including it doesn't break anything.
HTML
```html
<!-- HTML5: must appear within the first 1024 bytes -->
<meta charset="UTF-8">

<!-- HTML4/XHTML style (still valid) -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<!-- XML declaration (XHTML) -->
<?xml version="1.0" encoding="UTF-8"?>
```
XML
XML defaults to UTF-8 if no encoding is declared. Always declare it anyway for clarity:
```xml
<?xml version="1.0" encoding="UTF-8"?>
```
CSS
Add the charset declaration at the very top of CSS files containing non-ASCII characters (font names, content values):
```css
@charset "UTF-8";
```
Code Examples by Language
Python 3
```python
# Always specify encoding — don't rely on locale defaults
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Write with explicit encoding
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

# Encode/decode bytes ↔ strings
text = 'café'
encoded = text.encode('utf-8')     # b'caf\xc3\xa9'
decoded = encoded.decode('utf-8')  # 'café'

# Handle errors gracefully: bad bytes become U+FFFD (replacement character)
text = b'\xff\xfe invalid bytes'.decode('utf-8', errors='replace')
```
JavaScript / Node.js
```javascript
// Browser: TextEncoder / TextDecoder (built-in)
const encoder = new TextEncoder(); // always UTF-8
const bytes = encoder.encode('café');
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);

// TextDecoder supports legacy encodings too
const win1252Decoder = new TextDecoder('windows-1252');
const text2 = win1252Decoder.decode(buffer);

// Node.js: fs encoding
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('out.txt', content, 'utf8');

// Node.js: iconv-lite for legacy encodings
const iconv = require('iconv-lite');
const buf = fs.readFileSync('file.txt');
const text3 = iconv.decode(buf, 'shift-jis');
```
Java
```java
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Always specify a charset — don't use new String(bytes) without one
String text = new String(bytes, StandardCharsets.UTF_8);
byte[] bytes2 = text.getBytes(StandardCharsets.UTF_8);

// File I/O
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);
```
Common mistakes
- MySQL: using `utf8` instead of `utf8mb4` — emoji and some Chinese characters get rejected
- Python 2 legacy: mixing byte strings and unicode strings without explicit encode/decode
- Node.js: calling `buffer.toString()` without specifying `'utf8'` — it defaults to UTF-8, but being explicit is safer
- PHP: not setting `mb_internal_encoding('UTF-8')` — plain string functions operate byte-by-byte and mangle multibyte characters
- Java: using `new String(bytes)` instead of `new String(bytes, StandardCharsets.UTF_8)` — relies on the platform default charset, which varies