
Text Encoding: UTF-8, Unicode, Mojibake, and How to Fix It

Character encoding is invisible until it breaks. This guide covers how it works, where it goes wrong, and how to fix it — from ASCII and Latin-1 to Shift-JIS and GB2312, with concrete examples for databases, HTTP headers, and code.

How Encoding Works

Computers store everything as numbers. Text encoding is the agreement that maps characters to numbers and back. When the sender and receiver use the same mapping, text is legible. When they don't, you get garbage.

Three concepts are often confused:

  • Character set — the collection of characters (e.g., Latin alphabet, Kanji)
  • Code point — the number assigned to each character (e.g., A = 65)
  • Encoding — how those numbers are stored as bytes (e.g., UTF-8 stores code points using 1-4 bytes)
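In Python the three layers are directly visible: `ord()` returns the code point, and `.encode()` applies an encoding to turn that code point into bytes.

```python
ch = 'é'
print(ord(ch))               # code point: 233 (U+00E9)
print(ch.encode('utf-8'))    # UTF-8:   b'\xc3\xa9' (two bytes)
print(ch.encode('latin-1'))  # Latin-1: b'\xe9'     (one byte, same character)
```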

ASCII

ASCII (1963) is the bedrock. It maps 128 characters to numbers 0–127 using 7 bits. That covers uppercase and lowercase English letters, digits, punctuation, and 33 control characters (newline, tab, etc.).

65 = A    97 = a    48 = 0
66 = B    98 = b    32 = space
67 = C    99 = c    10 = newline (LF)
                    13 = carriage return (CR)

ASCII has no accents, no non-Latin scripts, no symbols beyond basic punctuation. Every encoding designed after ASCII includes it — the 0–127 range maps identically in ASCII, Latin-1, Windows-1252, UTF-8, and most others. That's why English text often opens without problems even in the wrong encoding.
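That shared range is easy to verify: ASCII-only text produces byte-identical output under several otherwise-incompatible encodings.

```python
s = 'Hello, World!'
encodings = ['ascii', 'latin-1', 'cp1252', 'utf-8']
byte_versions = [s.encode(e) for e in encodings]
# All four byte sequences are identical for ASCII-only text
assert all(b == byte_versions[0] for b in byte_versions)
```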

Unicode and UTF-8

Unicode

Unicode (1991) solved the pre-internet chaos of incompatible encodings by defining a single universal character set. Every character in every writing system gets a unique number called a code point, written as U+XXXX.

U+0041 = A          (Latin)
U+00E9 = é          (Latin with accent)
U+03B1 = α          (Greek alpha)
U+4E2D = 中          (Chinese "middle")
U+1F600 = 😀        (emoji)
U+0410 = А          (Cyrillic capital A)

Unicode 15.0 defines 149,186 characters across 161 scripts. The standard is maintained by the Unicode Consortium and updated annually. Code points go up to U+10FFFF — that's over 1 million possible values, though most are unassigned.

UTF-8: the dominant encoding

UTF-8 is how most systems store Unicode. It uses 1 to 4 bytes per character depending on the code point:

Bytes   Code point range    Bit pattern                            Examples
1       U+0000–U+007F       0xxxxxxx                               ASCII (A–Z, 0–9)
2       U+0080–U+07FF       110xxxxx 10xxxxxx                      é, ñ, ü, Greek, Cyrillic
3       U+0800–U+FFFF       1110xxxx 10xxxxxx 10xxxxxx             Chinese, Japanese, Korean
4       U+10000–U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    Emoji, ancient scripts

A   = U+0041 → 0x41         (1 byte)
é   = U+00E9 → 0xC3 0xA9   (2 bytes)
中  = U+4E2D → 0xE4 0xB8 0xAD  (3 bytes)
😀 = U+1F600 → 0xF0 0x9F 0x98 0x80  (4 bytes)

UTF-8 won because it's backward-compatible with ASCII (plain English text is byte-identical), self-synchronizing (you can find character boundaries after a corrupt byte), and space-efficient for Western languages. 98%+ of websites use UTF-8 today.
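The tiers in the table above can be checked directly; each character's UTF-8 byte length follows from its code point range.

```python
for ch in ['A', 'é', '中', '😀']:
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(" ")}')
# A is 1 byte, é is 2, 中 is 3, 😀 is 4
```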

UTF-16 and UTF-32

UTF-16 uses 2 bytes for most characters and 4 bytes for the rest (code points above U+FFFF, which are encoded as surrogate pairs). Java, C#/.NET, JavaScript, and Windows internals use UTF-16 strings natively. It's efficient for East Asian text, where most characters need 3 bytes in UTF-8 but only 2 in UTF-16.

UTF-32 always uses 4 bytes per character. Simple to index (character N is at byte 4N) but wastes space — English text takes 4x more space than ASCII. Rarely used in practice except in some internal processing contexts.

Both UTF-16 and UTF-32 have an endianness issue: a 2-byte value 0x0041 could be stored as 00 41 (big-endian) or 41 00 (little-endian). The BOM character (see below) distinguishes them. UTF-8 has no endianness problem.
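The byte-order difference is visible by encoding the same character with Python's explicit `utf-16-be` and `utf-16-le` codecs (which omit the BOM; the plain `utf-16` codec prepends one):

```python
print('A'.encode('utf-16-be'))  # b'\x00A'  (big-endian:    00 41)
print('A'.encode('utf-16-le'))  # b'A\x00'  (little-endian: 41 00)
```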

Legacy Encodings

These encodings predate Unicode and are still encountered in old files, legacy databases, and regional systems.

Western European

ISO-8859-1 (Latin-1) — Covers Western European languages by adding 96 characters (128–255) to ASCII. Includes é, ü, ñ, etc. Used as the default in early HTTP and email. Doesn't include the € sign (added later in ISO-8859-15).

Windows-1252 (CP1252) — Microsoft's extension of Latin-1. Adds smart quotes, em dash, €, and other characters in the 0x80–0x9F range that Latin-1 left as control codes. Often mislabeled as ISO-8859-1 in old web pages. The difference matters when you encounter â€™ (the UTF-8 bytes for ’ read as Windows-1252).

Byte    ISO-8859-1       Windows-1252
0x80    (control code)   € (euro sign)
0x91    (control code)   ‘ (left single quote)
0x93    (control code)   “ (left double quote)
0x96    (control code)   – (en dash)
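The practical difference shows up when decoding: the same bytes yield punctuation under cp1252 but invisible C1 control characters under Latin-1.

```python
raw = b'\x93quoted\x94 \x96 dash'
print(raw.decode('cp1252'))   # smart quotes and an en dash: “quoted” – dash
print(raw.decode('latin-1'))  # same bytes become invisible C1 control characters
```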

Japanese encodings

Shift-JIS — The dominant Japanese encoding for Windows systems. Covers Hiragana, Katakana, Kanji, and ASCII. Has tricky byte sequences: the second byte of a two-byte character can fall in the ASCII range (including 0x5C, the backslash), which confuses byte-oriented code and makes detection error-prone. Also called CP932 or Windows-31J in Microsoft's variant.

EUC-JP — The Unix/Linux variant for Japanese. ASCII-compatible and less ambiguous than Shift-JIS, but both are obsolete for new work — use UTF-8 instead.

Chinese encodings

GB2312 — The 1980 mainland China standard: 7,445 characters, including 6,763 Simplified Chinese hanzi, plus ASCII. Superseded by:

GBK — Extends GB2312 to 21,886 characters including Traditional Chinese used in Hong Kong. Used on Chinese Windows systems.

GB18030 — The current official Chinese standard. A superset of GBK that also covers all Unicode code points, making it the only legacy encoding that can represent everything UTF-8 can.

Big5 — Traditional Chinese for Taiwan and Hong Kong. ~13,000 characters. Has variants (Big5-HKSCS for Hong Kong) with different character mappings.
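GB18030's full Unicode coverage is easy to demonstrate with an emoji, which GBK rejects but GB18030 encodes (in a four-byte form, mirroring UTF-8's four-byte tier):

```python
print('😀'.encode('gb18030'))   # a 4-byte GB18030 sequence
try:
    '😀'.encode('gbk')
except UnicodeEncodeError:
    print('GBK cannot represent this character')
```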

Cyrillic encodings

Windows-1251 — Russian, Bulgarian, Serbian in Cyrillic script. The standard for Russian Windows systems.

KOI8-R — A clever design where Cyrillic letters are placed at positions matching their Latin phonetic equivalents, so stripping the high bit gives readable (if wrong) transliteration. Used on Unix systems and old Russian internet.
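The bit-stripping trick can be reproduced in a few lines (the result is case-swapped Latin, by design):

```python
# KOI8-R places Cyrillic letters so that masking off the high bit
# yields their Latin phonetic counterparts (with case swapped)
data = 'привет'.encode('koi8-r')
stripped = bytes(b & 0x7F for b in data).decode('ascii')
print(stripped)  # PRIWET
```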

Other ISO-8859 family

  • ISO-8859-2 — Central European (Polish, Czech, Slovak, Hungarian, Slovenian, Croatian)
  • ISO-8859-7 — Greek
  • ISO-8859-8 — Hebrew
  • ISO-8859-9 — Turkish
  • ISO-8859-15 — Latin-1 with the € symbol, replacing the little-used currency sign ¤ at 0xA4

Mojibake: Garbled Text

Mojibake (文字化け, Japanese for "character transformation") is what happens when text encoded in one system gets decoded by another. The bytes are intact but the interpretation is wrong.

Common patterns and their causes

What you see       What it should be   Cause
Ã© ü ñ            é ü ñ               UTF-8 decoded as Windows-1252 or Latin-1
â€™ â€œ â€”        ’ “ —               UTF-8 smart quotes decoded as Windows-1252
Ð¿Ñ€Ð¸Ð²ÐµÑ‚       привет              Cyrillic UTF-8 decoded as Windows-1252
譌・譛ャ           日本語              Japanese UTF-8 decoded as Shift-JIS
H\0e\0l\0l\0o\0    Hello               UTF-16 LE decoded as UTF-8 or Latin-1
? or □ or �        Various             Byte has no mapping in target encoding

Diagnosing from the garbled text

The Ã© pattern is diagnostic: é in UTF-8 is bytes C3 A9. In Windows-1252, byte C3 = Ã and A9 = ©. So Ã© almost always means a UTF-8 file opened as Windows-1252. Similarly, â€™ is ’ (U+2019, Windows-1252 byte 0x92) encoded as UTF-8 and then read back as Windows-1252.
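When you know the exact mistake and no bytes were lost, mojibake is reversible: re-encode with the codec that was wrongly used for display, then decode with the right one. A sketch of the Ã© case:

```python
# The garbled string as code points: 'caf' + U+00C3 + U+00A9 ("cafÃ©")
garbled = 'caf\xc3\xa9'
fixed = garbled.encode('cp1252').decode('utf-8')  # undo the wrong display codec
print(fixed)  # café
```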

Root causes

  1. Missing or wrong encoding declaration — HTML without <meta charset="UTF-8">, or with the wrong charset
  2. Editor mismatch — File saved as Windows-1252 by Notepad, then opened as UTF-8 by another editor
  3. Database connection encoding mismatch — Application connecting to MySQL with latin1 while the column stores UTF-8
  4. FTP in ASCII mode — FTP transfers in ASCII mode rewrite line endings and can corrupt high bytes
  5. Double-encoding — UTF-8 bytes misread as Latin-1 and re-encoded as UTF-8, so every multi-byte character's bytes are encoded twice; the mojibake itself ends up permanently stored where readable text should be
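Double-encoding (cause 5) can be reproduced in three lines:

```python
original = 'é'
once = original.encode('utf-8')                  # b'\xc3\xa9'
twice = once.decode('latin-1').encode('utf-8')   # each byte re-encoded separately
print(twice)                  # b'\xc3\x83\xc2\xa9'
print(twice.decode('utf-8'))  # Ã© now baked into the stored text
```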

Fix encoding issues in the browser with the Text Encoding Converter — detect encoding automatically and convert to UTF-8 without uploading files.

The BOM Problem

The Byte Order Mark (U+FEFF) is a zero-width character placed at the start of a file to signal its encoding. For UTF-16 and UTF-32, it's necessary to distinguish big-endian from little-endian byte order. For UTF-8, it's optional and usually causes problems.

Encoding    BOM bytes (hex)   Needed?
UTF-8       EF BB BF          No — causes problems in most contexts
UTF-16 BE   FE FF             Recommended
UTF-16 LE   FF FE             Recommended
UTF-32 BE   00 00 FE FF       Recommended
UTF-32 LE   FF FE 00 00       Recommended

UTF-8 BOM problems

Windows Notepad and Excel add a UTF-8 BOM when saving. This causes:

  • PHP — "headers already sent" errors (the BOM counts as output before header() calls)
  • JSON — Parse errors; RFC 8259 prohibits a BOM, and many parsers reject input that begins with one
  • CSV headers — First column named ï»¿Name instead of Name (the invisible BOM character prepended to the header, which breaks column matching)
  • Shell scripts — Shebang line #!/bin/bash preceded by BOM bytes; shell can't find the interpreter
  • File concatenation — Merging BOM-prefixed files puts BOM characters in the middle of the result

Removing BOM

# Detect BOM
hexdump -C file.txt | head -1
# Look for: ef bb bf at position 0

# Remove BOM (Linux/Mac): start output at byte 4, skipping the 3 BOM bytes
tail -c +4 file.txt > file_clean.txt

# Python: read with the utf-8-sig codec (strips a BOM automatically)
with open('file.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

# Write back without BOM
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

Exception: Excel requires a UTF-8 BOM to recognize that a CSV is UTF-8. If you generate CSV files for Excel users, write the BOM first:

import csv

with open('for_excel.csv', 'w', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerows(data)

Detecting Encoding

There is no guaranteed way to detect encoding — bytes are just bytes, and most encodings share the 0–127 range. But these methods work most of the time:

1. Check for BOM

with open('file.txt', 'rb') as f:
    raw = f.read(4)

# Check 4-byte BOMs first: the UTF-16 LE BOM (FF FE) is a prefix
# of the UTF-32 LE BOM (FF FE 00 00)
boms = [
    (b'\xff\xfe\x00\x00', 'UTF-32 LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32 BE'),
    (b'\xef\xbb\xbf',     'UTF-8 with BOM'),
    (b'\xff\xfe',         'UTF-16 LE'),
    (b'\xfe\xff',         'UTF-16 BE'),
]
for bom, name in boms:
    if raw.startswith(bom):
        print(name)
        break

2. Statistical detection

# Python: pip install chardet
import chardet

with open('file.txt', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
# → {'encoding': 'SHIFT_JIS', 'confidence': 0.99}

Command line

chardetect file.txt

JavaScript (Node.js)

// npm install jschardet
const jschardet = require('jschardet');
const buf = require('fs').readFileSync('file.txt');
console.log(jschardet.detect(buf));

chardet works well for files with enough content (500+ characters). For short strings, confidence scores below 0.7 are unreliable.

3. Context clues

  • File came from a Japanese Windows user → try Shift-JIS first
  • File came from a German government system → try Windows-1252 or ISO-8859-1
  • File has <meta charset> or XML declaration → trust it (but verify)
  • HTTP response has Content-Type: charset= → trust it

4. Visual confirmation

Open the file in Notepad++ or VS Code, try switching encodings, and look at the actual characters. If you expect "café" and see "cafÃ©", you're reading UTF-8 as Latin-1 or Windows-1252. Switch to UTF-8 and verify.
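These heuristics combine naturally into a try-the-likely-candidates helper; the candidate list below is an assumption, so order it to match where your files come from. Strict UTF-8 goes first because random legacy bytes rarely form valid UTF-8, while cp1252 decodes almost any byte and acts as a near catch-all.

```python
def try_decode(raw: bytes, candidates=('utf-8', 'cp1252', 'shift_jis')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: never fails, bad bytes become U+FFFD
    return raw.decode('utf-8', errors='replace'), 'utf-8 (lossy)'

text, enc = try_decode('café'.encode('utf-8'))
print(text, enc)  # café utf-8
```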

Converting Files

iconv (Linux/Mac)

# Basic conversion
iconv -f SHIFT-JIS -t UTF-8 input.txt > output.txt
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
iconv -f ISO-8859-1 -t UTF-8 input.html > output.html

# Handle invalid sequences (don't stop on error)

iconv -f GB2312 -t UTF-8//IGNORE input.txt > output.txt

# List available encoding names

iconv -l

Python

import chardet

# Detect and convert
with open('unknown.txt', 'rb') as f:
    raw = f.read()

# chardet can return None for 'encoding', so fall back to UTF-8
detected = chardet.detect(raw)['encoding'] or 'utf-8'
text = raw.decode(detected, errors='replace')

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

Convert with error handling options:

  • errors='strict' → raises UnicodeDecodeError on a bad byte (default)
  • errors='ignore' → silently drops undecodable bytes
  • errors='replace' → substitutes U+FFFD (replacement character) for bad bytes
  • errors='backslashreplace' → shows a \xXX escape for bad bytes
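Applied to the same invalid input, the four handlers behave like this:

```python
bad = b'caf\xe9'   # Latin-1 bytes for "café", invalid as UTF-8
print(bad.decode('utf-8', errors='ignore'))            # caf
print(bad.decode('utf-8', errors='replace'))           # caf followed by U+FFFD
print(bad.decode('utf-8', errors='backslashreplace'))  # caf\xe9
# errors='strict' (the default) would raise UnicodeDecodeError here
```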

Batch conversion

import chardet
from pathlib import Path

src = Path('legacy-files')
dst = Path('utf8-files')
dst.mkdir(exist_ok=True)

for f in src.glob('**/*.txt'):
    raw = f.read_bytes()
    # chardet returns {'encoding': None, ...} on failure, so guard with `or`
    enc = chardet.detect(raw).get('encoding') or 'utf-8'
    try:
        text = raw.decode(enc)
    except (UnicodeDecodeError, LookupError):
        text = raw.decode('utf-8', errors='replace')
    (dst / f.name).write_text(text, encoding='utf-8')

For browser-based conversion, the Text Encoding Converter handles all the encodings above, including Shift-JIS, GBK, Windows-1251, and all ISO-8859 variants.

Database Charset Settings

MySQL: use utf8mb4, not utf8

MySQL's utf8 charset is a historical mistake — it supports only 3-byte UTF-8, which means it cannot store emoji (which need 4 bytes) or characters above U+FFFF. Use utf8mb4 for real UTF-8:

-- Create database with utf8mb4
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Convert existing table
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Connection string must also specify charset:
--   mysql://user:pass@host/db?charset=utf8mb4

-- Check current encoding
SHOW CREATE TABLE users\G
SHOW VARIABLES LIKE 'character_set%';

Double-encoding trap

A common MySQL mistake: connect with charset latin1 and store UTF-8 data in a latin1 column. The bytes land intact, but MySQL believes they're Latin-1. When you later read them through a UTF-8 connection, MySQL "converts" latin1 to UTF-8, re-encoding each byte of every multi-byte character separately and producing garbage. Fix:

-- If the column was wrongly stored: bytes are UTF-8 but MySQL thinks latin1
-- Step 1: Go through binary (bypasses charset conversion)
ALTER TABLE t MODIFY col VARBINARY(255);
-- Step 2: Declare it UTF-8
ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET utf8mb4;

PostgreSQL

PostgreSQL's UTF8 is true UTF-8 with no 3-byte limitation. Set it at database creation:

CREATE DATABASE myapp ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8' TEMPLATE template0;

-- Set client encoding for a session
SET client_encoding = 'UTF8';

-- Check current encoding
SHOW client_encoding;
SELECT pg_encoding_to_char(encoding)
  FROM pg_database WHERE datname = current_database();

HTTP and HTML Encoding Declarations

HTTP Content-Type header

Browsers use the HTTP header's charset parameter first, before looking at the HTML meta tag. Your server should set it correctly:

Content-Type: text/html; charset=utf-8
Content-Type: text/plain; charset=utf-8
Content-Type: application/json; charset=utf-8
Content-Type: text/csv; charset=utf-8

Note: JSON (RFC 8259) requires UTF-8 and prohibits the BOM. No charset parameter is defined for application/json — UTF-8 is mandatory — so you can omit it, and including it doesn't break anything.

HTML

<!-- HTML5: must appear within the first 1024 bytes -->
<meta charset="UTF-8">

<!-- HTML4/XHTML style (still valid) -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<!-- XML declaration (XHTML) -->
<?xml version="1.0" encoding="UTF-8"?>

XML

XML defaults to UTF-8 if no encoding is declared. Always declare it anyway for clarity:

<?xml version="1.0" encoding="UTF-8"?>

CSS

Add the charset declaration at the very top of CSS files containing non-ASCII characters (font names, content values):

@charset "UTF-8";

Code Examples by Language

Python 3

# Always specify encoding — don't rely on locale defaults
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Write with explicit encoding
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(content)

# Encode/decode: bytes ↔ strings
text = 'café'
encoded = text.encode('utf-8')     # b'caf\xc3\xa9'
decoded = encoded.decode('utf-8')  # 'café'

# Handle errors gracefully
text = b'\xff\xfe invalid bytes'.decode('utf-8', errors='replace')
# Replaces bad bytes with the U+FFFD replacement character

JavaScript / Node.js

// Browser: TextEncoder / TextDecoder (built-in)
const encoder = new TextEncoder();  // always UTF-8
const bytes = encoder.encode('café');

const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);

// TextDecoder handles legacy encodings too
const win1252Decoder = new TextDecoder('windows-1252');
const text2 = win1252Decoder.decode(buffer);

// Node.js: fs encoding
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
fs.writeFileSync('out.txt', content, 'utf8');

// Node.js: iconv-lite for legacy encodings
const iconv = require('iconv-lite');
const buf = fs.readFileSync('file.txt');
const text3 = iconv.decode(buf, 'shift-jis');

Java

// Always specify charset — don't use new String(bytes) without one
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

String text = new String(bytes, StandardCharsets.UTF_8);
byte[] bytes2 = text.getBytes(StandardCharsets.UTF_8);

// File I/O
String content = Files.readString(path, StandardCharsets.UTF_8);
Files.writeString(path, content, StandardCharsets.UTF_8);

Common mistakes

  • MySQL: using utf8 instead of utf8mb4 — emoji and some Chinese characters get rejected
  • Python 2 legacy: mixing byte strings and unicode strings without explicit encode/decode
  • Node.js: calling buffer.toString() without specifying 'utf8' — defaults to UTF-8 but better to be explicit
  • PHP: not setting mb_internal_encoding('UTF-8') — string functions operate byte-by-byte and mangle multibyte chars
  • Java: using String(bytes) instead of String(bytes, StandardCharsets.UTF_8) — relies on platform default, which varies
