Text Manipulation for Developers: Essential Techniques
String processing is something developers do constantly and mostly take for granted until something breaks in an unexpected way—Unicode characters that don't compare equal, regexes that match too much, slugs with invisible characters. This guide covers the patterns that matter, with the edge cases that trip people up.
Case Conversion and Naming Conventions
Different parts of a codebase use different conventions. Converting between them is something you'll do when normalizing API responses, generating code, or processing user input.
| Convention | Example | Where it appears |
|---|---|---|
| lowercase | myvalue | SQL keywords (by convention), URL paths |
| UPPERCASE | MYVALUE | SQL keywords, acronyms |
| Title Case | My Value | UI labels, headings, proper nouns |
| Sentence case | My value | Body text, descriptions, error messages |
| camelCase | myValue | JavaScript/Java variables, JSON keys |
| PascalCase | MyValue | Classes, React components, TypeScript types |
| snake_case | my_value | Python, Ruby, database columns, Rust variables |
| kebab-case | my-value | CSS classes, HTML attributes, URL slugs, filenames |
| SCREAMING_SNAKE | MY_VALUE | Environment variables, C macros, constants |
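All of these conversions reduce to the same two steps: split the identifier into words, then re-join under the target convention's rules. A minimal sketch of that factoring (the `toWords` helper is ours, not part of the conversions below):

```javascript
// Split an identifier in any common convention into its words.
function toWords(id) {
  return id
    // insert a space at acronym boundaries: "HTTPRequest" -> "HTTP Request"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    // insert a space at lower-to-upper boundaries: "myValue" -> "my Value"
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    // treat underscores and hyphens as separators
    .split(/[\s_-]+/)
    .filter(Boolean);
}

// Any target convention is then just a re-join:
const words = toWords('parseHTTPRequest'); // ['parse', 'HTTP', 'Request']
const snake = words.map(w => w.toLowerCase()).join('_'); // 'parse_http_request'
const kebab = words.map(w => w.toLowerCase()).join('-'); // 'parse-http-request'
```

Factoring it this way means one tricky function (the splitter) instead of one regex per conversion pair.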
JavaScript conversions
// camelCase → snake_case
function toSnakeCase(str) {
  return str
    .replace(/([A-Z])/g, '_$1')
    .toLowerCase()
    .replace(/^_/, ''); // Strip leading underscore if input started with a capital
}
toSnakeCase('helloWorldFoo'); // 'hello_world_foo'
// camelCase → kebab-case
function toKebabCase(str) {
  return str
    .replace(/([A-Z])/g, '-$1')
    .toLowerCase()
    .replace(/^-/, '');
}
toKebabCase('myVariableName'); // 'my-variable-name'
// snake_case or kebab-case → camelCase
function toCamelCase(str) {
  return str
    .toLowerCase()
    .replace(/[-_](\w)/g, (_, char) => char.toUpperCase());
}
toCamelCase('hello_world'); // 'helloWorld'
toCamelCase('hello-world'); // 'helloWorld'
// camelCase → PascalCase
function toPascalCase(str) {
  const camel = toCamelCase(str);
  return camel.charAt(0).toUpperCase() + camel.slice(1);
}
// Title Case (naive — doesn't handle articles)
function toTitleCase(str) {
  return str.replace(/\b\w/g, char => char.toUpperCase());
}
// Proper Title Case (skips articles/prepositions mid-sentence)
const minorWords = new Set(['a','an','the','and','but','or','in','on','at','to','for','of']);
function toProperTitleCase(str) {
  return str
    .toLowerCase()
    .replace(/\b\w+/g, (word, offset) => {
      if (offset === 0 || !minorWords.has(word)) {
        return word.charAt(0).toUpperCase() + word.slice(1);
      }
      return word;
    });
}
toProperTitleCase('the quick brown fox'); // 'The Quick Brown Fox'
toProperTitleCase('war and peace'); // 'War and Peace'
Python
import re

camelCase → snake_case
def to_snake_case(s):
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', s)
    s = re.sub(r'([a-z])([A-Z])', r'\1_\2', s)
    return s.lower()

to_snake_case('helloWorldFoo')     # 'hello_world_foo'
to_snake_case('parseHTTPRequest')  # 'parse_http_request' (handles acronyms)
snake_case → camelCase
def to_camel_case(s):
    parts = s.split('_')
    return parts[0] + ''.join(w.capitalize() for w in parts[1:])

to_camel_case('hello_world_foo')  # 'helloWorldFoo'
Regex Patterns Every Developer Needs
These patterns cover the extractions and validations you'll reach for repeatedly. None of them are perfect—the RFC specs for email and URLs are byzantine—but they handle 99% of real-world input.
// Email addresses
const EMAIL = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
// URLs (http and https)
const URL = /https?:\/\/[^\s<>"{}|\\^`\[\]]+/g;
// IP addresses (IPv4)
const IPV4 = /\b(?:25[0-5]|2[0-4]\d|[01]?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){3}\b/g;
// CIDR notation
const CIDR = /\b(?:\d{1,3}\.){3}\d{1,3}\/(?:[0-9]|[1-2][0-9]|3[0-2])\b/g;
// Dates (YYYY-MM-DD)
const DATE_ISO = /\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/g;
// Phone numbers (loose — matches many formats)
const PHONE = /(?:\+?1[-.\s]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g;
// Credit card numbers (with or without spaces/dashes)
const CC = /\b(?:\d[ -]?){13,16}\b/g;
// Hex color codes
const HEX_COLOR = /#(?:[0-9a-fA-F]{3}){1,2}\b/g;
// HTML tags (for stripping, not parsing HTML generally)
const HTML_TAG = /<[^>]+>/g;
// Markdown links — capture link text and URL
const MD_LINK = /\[([^\]]+)\]\(([^)]+)\)/g;
// UUIDs (v1-v5)
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}/gi;
// Extracting matches
const text = 'Contact us at hello@example.com or see https://example.com';
const emails = [...text.matchAll(EMAIL)].map(m => m[0]);
const urls = [...text.matchAll(URL)].map(m => m[0]);
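The patterns above use the `g` flag for extraction. Validating that an entire string matches is a different job: anchor the pattern with `^…$` and drop `g`. A sketch using the ISO date pattern:

```javascript
// Extraction pattern (finds matches anywhere in text)
const DATE_ISO_FIND = /\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/g;
// Validation pattern (whole string must be a date) — anchored, no g flag
const DATE_ISO_VALID = /^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$/;

DATE_ISO_VALID.test('2026-03-15');           // true
DATE_ISO_VALID.test('on 2026-03-15 we met'); // false — extra text rejected

const found = 'from 2026-03-15 to 2026-04-01'.match(DATE_ISO_FIND); // two dates
```

Dropping `g` matters for validators: a `g`-flagged regex keeps `lastIndex` state between `.test()` calls, so repeated validation calls can return alternating results.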
Named capture groups
// Named groups make extraction readable
const LOG_LINE = /^(?<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+Z)\s+(?<level>ERROR|WARN|INFO|DEBUG)\s+(?<message>.+)$/;
const line = '2026-03-15T14:23:01Z ERROR Connection timeout after 30s';
const match = line.match(LOG_LINE);
if (match) {
  const { timestamp, level, message } = match.groups;
  console.log(timestamp, level, message);
}

// Extract all named matches from multiple lines
function parseLogs(logText) {
  return [...logText.matchAll(new RegExp(LOG_LINE.source, 'gm'))]
    .map(m => m.groups);
}
Encoding and Escaping
Getting encoding wrong causes security vulnerabilities (XSS), broken URLs, and corrupted data. Each context has its own escaping requirements.
HTML escaping
// Escape for insertion into HTML content (not attributes)
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
// Never do: element.innerHTML = userInput
// Always: element.textContent = userInput (auto-escapes)
// Or: element.innerHTML = escapeHtml(userInput)
// WARNING: Don't call .toUpperCase() on already-escaped HTML
// escapeHtml('<script>').toUpperCase()
// → '&LT;SCRIPT&GT;' — HTML5 still decodes the legacy all-caps
// entities &LT; and &GT;, undoing your escaping.
// Always escape AFTER any case transformations.
URL encoding
// encodeURIComponent — encode a parameter value
const param = 'hello world & more';
encodeURIComponent(param); // 'hello%20world%20%26%20more'

// encodeURI — encode a complete URL (preserves : / ? # & = etc.)
encodeURI('https://example.com/path with spaces');
// 'https://example.com/path%20with%20spaces'

// Build a query string properly
function buildQueryString(params) {
  return Object.entries(params)
    .map(([k, v]) => encodeURIComponent(k) + '=' + encodeURIComponent(v))
    .join('&');
}

buildQueryString({ q: 'hello world', page: 2, filter: 'a&b' });
// 'q=hello%20world&page=2&filter=a%26b'

// Parse a query string
function parseQueryString(qs) {
  return Object.fromEntries(new URLSearchParams(qs));
}
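Modern runtimes can also build the query string for you with `URLSearchParams`. One difference worth knowing: it encodes spaces as `+` (form encoding) where `encodeURIComponent` uses `%20`; both are valid in query strings.

```javascript
// URLSearchParams builds and encodes in one step
const qs = new URLSearchParams({ q: 'hello world', filter: 'a&b' }).toString();
// 'q=hello+world&filter=a%26b' — note '+' for space, '%26' for '&'

// And it round-trips cleanly back to an object
const params = Object.fromEntries(new URLSearchParams(qs));
// { q: 'hello world', filter: 'a&b' }
```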
Base64
// Browser
btoa('hello world'); // 'aGVsbG8gd29ybGQ='
atob('aGVsbG8gd29ybGQ='); // 'hello world'
// Node.js (Buffer-based, handles binary correctly)
Buffer.from('hello world').toString('base64'); // encode
Buffer.from('aGVsbG8gd29ybGQ=', 'base64').toString('utf8'); // decode
// URL-safe Base64 (replaces + with -, / with _, strips = padding)
function toBase64Url(str) {
  return btoa(str).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}
function fromBase64Url(str) {
  str = str.replace(/-/g, '+').replace(/_/g, '/');
  while (str.length % 4) str += '=';
  return atob(str);
}
Unicode Normalization
Unicode has multiple ways to represent the same visible character. This causes comparison failures that are infuriating to debug because the strings look identical on screen.
// The problem
const s1 = 'é';        // U+00E9 — precomposed
const s2 = 'e\u0301';  // e + combining acute accent — decomposed
s1 === s2;   // false — different byte sequences
s1.length;   // 1
s2.length;   // 2

// The fix: normalize before comparing
s1.normalize('NFC') === s2.normalize('NFC'); // true

// Normalization forms:
// NFC  — Canonical Decomposition, then Canonical Composition (precomposed)
//        Best for storage, comparison, most string operations
// NFD  — Canonical Decomposition (fully decomposed)
//        Useful when you want to strip accents
// NFKC — Compatibility Decomposition, then Composition
//        Folds ligatures, subscripts, Roman numerals to base chars
// NFKD — Compatibility Decomposition
//        Most aggressive normalization

// Strip accents using NFD + filter
function stripAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
stripAccents('Ångström café naïve'); // 'Angstrom cafe naive'

// Unicode-aware string length (grapheme clusters)
// The family emoji is one visible character (four emoji joined by U+200D), but:
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // 👨‍👩‍👧‍👦
family.length;       // 11 (UTF-16 code units)
[...family].length;  // 7 (code points — spread operator)

// For true grapheme count, use Intl.Segmenter:
const seg = new Intl.Segmenter();
[...seg.segment(family)].length; // 1 (actual visible characters)
Safe string reversal
// Naive — breaks surrogate pairs
'hello 👋'.split('').reverse().join(''); // 'hello ' + broken bytes — 👋 is two code units
// Code-point safe — works for most emoji
[...'hello 👋'].reverse().join(''); // '👋 olleh' ✓
// Grapheme-cluster safe — handles emoji with modifiers
function reverseString(str) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(str)].map(s => s.segment).reverse().join('');
}
reverseString('hello 👍🏿'); // '👍🏿 olleh' — skin tone modifier preserved
Slug Generation
Slugs are URL path segments. The rules: lowercase, alphanumerics and hyphens only, no leading/trailing/consecutive hyphens, ASCII only. The tricky parts are accent stripping and what to do with non-Latin scripts.
function slugify(str, separator = '-') {
  return str
    .normalize('NFD')                 // Decompose accented chars
    .replace(/[\u0300-\u036f]/g, '')  // Strip combining marks (accents)
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')         // Remove non-word chars except spaces and hyphens
    .replace(/[\s_-]+/g, separator)   // Spaces, underscores, hyphens → separator
    .replace(new RegExp(`^${separator}+|${separator}+$`, 'g'), ''); // Trim separator
}
slugify('Hello World!'); // 'hello-world'
slugify('Café au lait'); // 'cafe-au-lait'
slugify(' Multiple Spaces '); // 'multiple-spaces'
slugify('C++ Programming'); // 'c-programming'
slugify('Ångström Measurement'); // 'angstrom-measurement'
slugify('hello_world', '_'); // 'hello_world' (custom separator)
For non-Latin scripts (Chinese, Arabic, Japanese), transliteration libraries like slugify (npm) or Python's python-slugify handle conversion to ASCII romanization. The native approach above just strips them, which gives empty slugs for CJK-only titles.
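If you stick with the native approach, it's worth guarding against that empty-slug case. A defensive sketch (the fallback idea and the `slugifyOrFallback` name are ours, not part of the guide's `slugify`):

```javascript
// Same pipeline as the guide's slugify, but fall back to a placeholder
// when stripping leaves nothing — e.g. a CJK-only title.
function slugifyOrFallback(str, fallback = 'untitled') {
  const slug = str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')
    .replace(/[\s_-]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return slug || fallback;
}

slugifyOrFallback('日本語タイトル'); // 'untitled' — everything was stripped
slugifyOrFallback('Hello World');   // 'hello-world'
```

In practice the fallback would be something unique (a post ID or short hash) rather than a constant, so two CJK titles don't collide.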
# Python
from slugify import slugify  # pip install python-slugify

slugify('Hello World!')            # 'hello-world'
slugify('Café Résumé')             # 'cafe-resume'
slugify('日本語タイトル')           # romanized to ASCII via transliteration
slugify('foo bar', separator='_')  # 'foo_bar'
Deduplication and Sorting
Removing duplicate lines
// Preserve insertion order (first occurrence wins)
function deduplicateLines(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join('\n');
}

// Case-insensitive deduplication (keeps original case of first occurrence)
function deduplicateCaseInsensitive(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const key = line.toLowerCase().trim();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .join('\n');
}

// Remove empty lines too
function deduplicateAndClean(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const trimmed = line.trim();
      if (!trimmed || seen.has(trimmed)) return false;
      seen.add(trimmed);
      return true;
    })
    .join('\n');
}
Sorting strategies
const lines = text.split('\n').filter(Boolean);
// Alphabetical (case-insensitive)
lines.sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()));
// Natural sort — file1, file2, file10 (not file1, file10, file2)
lines.sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
// Sort by line length (shortest first)
lines.sort((a, b) => a.length - b.length);
// Sort numerically (when lines are numbers)
lines.sort((a, b) => parseFloat(a) - parseFloat(b));
// Fisher-Yates shuffle (random order)
function shuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}
Text Diffing
Computing diffs between text versions is useful for code review, changelog generation, and content audit tools.
Simple line diff (Myers algorithm)
// Using the 'diff' npm package (widely used)
import { createTwoFilesPatch, diffLines } from 'diff';

const original = 'line one\nline two\nline three';
const modified = 'line one\nline TWO\nline three\nline four';

// Unified diff format
const patch = createTwoFilesPatch('original.txt', 'modified.txt', original, modified);
console.log(patch);

// Line-by-line diff with change objects
const changes = diffLines(original, modified);
for (const change of changes) {
  const prefix = change.added ? '+' : change.removed ? '-' : ' ';
  process.stdout.write(change.value.replace(/^/gm, prefix));
}

// Character-level diff within lines
import { diffChars } from 'diff';
const charDiff = diffChars('the cat sat', 'the dog sat');
charDiff.forEach(part => {
  const color = part.added ? '\x1b[32m' : part.removed ? '\x1b[31m' : '';
  process.stdout.write(color + part.value + '\x1b[0m');
});
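Under the hood, line diffs are built on longest-common-subsequence matching. For intuition only, here's a dependency-free sketch (use the library for real work; `simpleDiff` is our name, and this is plain O(n·m) dynamic programming rather than the library's optimized algorithm):

```javascript
// Minimal LCS-based line diff: '  ' unchanged, '- ' removed, '+ ' added.
function simpleDiff(oldText, newText) {
  const a = oldText.split('\n'), b = newText.split('\n');
  // lcs[i][j] = LCS length of a[i:] and b[j:]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--)
    for (let j = b.length - 1; j >= 0; j--)
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
  // Walk the table, emitting keeps, removals, and additions
  const out = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push('  ' + a[i]); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) out.push('- ' + a[i++]);
    else out.push('+ ' + b[j++]);
  }
  while (i < a.length) out.push('- ' + a[i++]);
  while (j < b.length) out.push('+ ' + b[j++]);
  return out.join('\n');
}

simpleDiff('a\nb\nc', 'a\nB\nc'); // '  a\n- b\n+ B\n  c'
```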
Python difflib
# Built into Python standard library
import difflib

original = ['line one\n', 'line two\n', 'line three\n']
modified = ['line one\n', 'line TWO\n', 'line three\n', 'line four\n']

Unified diff
diff = difflib.unified_diff(original, modified, fromfile='original', tofile='modified')
print(''.join(diff))

HTML diff
htmldiff = difflib.HtmlDiff()
html = htmldiff.make_table(original, modified, fromdesc='Before', todesc='After')

Similarity ratio (0.0 to 1.0)
ratio = difflib.SequenceMatcher(None, 'hello world', 'hello earth').ratio()
# 0.636...
Word and Character Counting
Word count sounds trivial until you hit edge cases: hyphenated words, contractions, CJK text (no spaces), emoji clusters.
function analyzeText(text) {
  // Normalize line endings
  const normalized = text.replace(/\r\n/g, '\n').replace(/\r/g, '\n');

  // Character counts
  const charCount = normalized.length; // includes whitespace
  const charNoSpaces = normalized.replace(/\s/g, '').length;

  // Line count
  const lineCount = normalized.split('\n').length;
  const nonEmptyLines = normalized.split('\n').filter(l => l.trim()).length;

  // Word count — split on whitespace, filter empty strings
  const words = normalized.trim().split(/\s+/).filter(Boolean);
  const wordCount = words.length;

  // Reading time (average 200 words per minute)
  const readingMinutes = Math.ceil(wordCount / 200);

  // Word frequency map
  const frequency = {};
  for (const word of words) {
    const key = word.toLowerCase().replace(/[^\w]/g, '');
    if (key) frequency[key] = (frequency[key] || 0) + 1;
  }
  const topWords = Object.entries(frequency)
    .sort(([,a], [,b]) => b - a)
    .slice(0, 10);

  // Sentence count (rough)
  const sentences = normalized.split(/[.!?]+/).filter(s => s.trim()).length;

  return { charCount, charNoSpaces, lineCount, nonEmptyLines,
           wordCount, readingMinutes, topWords, sentences };
}
CSV and TSV Parsing
CSV seems simple. Then you encounter quoted fields, embedded commas, newlines inside quotes, and the lack of a definitive standard. Don't write your own parser for anything beyond toy use cases.
The edge cases that break naive parsers
// This is a single CSV row with three fields:
// '"Smith, John",30,"New York, NY"'
//   Field 1: Smith, John  (quoted because it contains a comma)
//   Field 2: 30
//   Field 3: New York, NY

// Embedded newlines are valid in quoted fields:
// '"First line\nSecond line",value2'
//   Field 1 spans two lines — a line-by-line reader breaks here

// Escaped quotes (double-quote escaping):
// '"She said ""hello""",value2'
//   Field 1: She said "hello"

// A naive split(',') fails on all of these
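To see why a real parser is a state machine rather than a split, here's a toy illustration (ours, for teaching only — it skips error handling and streaming, which is why you use a library in production):

```javascript
// Minimal CSV state machine: handles quoted fields, embedded commas,
// embedded newlines, and "" quote escaping.
function parseCsv(text) {
  const rows = [[]];
  let field = '', inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;                        // closing quote
      else field += ch;                      // anything, including , and \n
    } else if (ch === '"') inQuotes = true;
    else if (ch === ',') { rows[rows.length - 1].push(field); field = ''; }
    else if (ch === '\n') { rows[rows.length - 1].push(field); field = ''; rows.push([]); }
    else if (ch !== '\r') field += ch;       // ignore CR in CRLF endings
  }
  rows[rows.length - 1].push(field);
  return rows;
}

parseCsv('"Smith, John",30\n"She said ""hi""",x');
// [['Smith, John', '30'], ['She said "hi"', 'x']]
```

The single `inQuotes` flag is the whole trick: it changes the meaning of `,` and `\n`, which is exactly what `split(',')` can't express.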
Parsing with a library
// Browser or Node.js — Papa Parse (most popular)
import Papa from 'papaparse';

const result = Papa.parse(csvString, {
  header: true,         // First row is header
  dynamicTyping: true,  // Auto-convert numbers/booleans
  skipEmptyLines: true,
  trimHeaders: true
});

// result.data — array of objects (if header: true)
// result.errors — parse errors
// result.meta — delimiter detected, line count, etc.

// Streaming large files (Node.js)
Papa.parse(fs.createReadStream('large.csv'), {
  header: true,
  step: (row) => processRow(row.data),
  complete: () => console.log('Done')
});

// Generate CSV from data
const csv = Papa.unparse([
  { name: 'Alice', age: 30 },
  { name: 'Bob', age: 25 }
]);
// → 'name,age\r\nAlice,30\r\nBob,25'
# Python — csv module (standard library)
import csv

# Reading
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

Writing
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows([{'name': 'Alice', 'age': 30}])
TSV — just change the delimiter
reader = csv.DictReader(f, delimiter='\t')
TSV vs CSV
TSV (tab-separated values) is simpler than CSV: tabs in field values are rare, so quoting is rarely needed. It's the preferred export format for spreadsheet data that might contain commas (addresses, descriptions). The tradeoff: tabs are invisible in many text editors and can be accidentally converted to spaces.
Placeholder Text Generation
Lorem ipsum is Latin from Cicero's De Finibus, scrambled to look like natural text without being readable. It's been the standard placeholder since the 1960s when Letraset used it for dry-transfer lettering sheets.
// Generate n words of lorem ipsum (browser or Node)
// Using the 'lorem-ipsum' npm package
import { LoremIpsum } from 'lorem-ipsum';

const lorem = new LoremIpsum({
  sentencesPerParagraph: { max: 8, min: 4 },
  wordsPerSentence: { max: 16, min: 4 }
});

lorem.generateWords(10);     // 10 random words
lorem.generateSentences(3);  // 3 sentences
lorem.generateParagraphs(2); // 2 paragraphs
// For testing with varied, realistic-looking text,
// use 'casual' package or Faker.js instead:
# Python
from faker import Faker  # pip install faker

fake = Faker()
fake.text(max_nb_chars=200)     # Paragraph of random text
fake.sentence()                 # Single sentence
fake.paragraph(nb_sentences=5)  # 5-sentence paragraph
fake.name()                     # Realistic name
fake.address()                  # Realistic address
CLI Tools
The Unix text processing tools are worth knowing even if you work primarily in Python or JavaScript. They're faster for quick one-liners on files than writing a script.
| Command | What it does | Essential flag |
|---|---|---|
| sort | Sort lines | -n numeric, -r reverse, -u unique |
| uniq | Remove consecutive duplicates | -c count, -i case-insensitive; must sort first |
| tr | Translate or delete characters | -d delete, -s squeeze repeats |
| sed | Stream editor (substitution, deletion) | -i in-place edit |
| awk | Column processing, arithmetic | -F delimiter |
| grep | Search by regex | -o print only match, -v invert, -E extended regex |
| cut | Extract fields by delimiter or position | -d delimiter, -f field number |
| wc | Count words, lines, characters | -l lines, -w words, -c bytes |
Common one-liners
# Sort and deduplicate
sort file.txt | uniq

# Top 10 most frequent lines
sort file.txt | uniq -c | sort -rn | head -10

# Extract all email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt | sort -u

# Extract all URLs
grep -oE 'https?://[^[:space:]]+' file.txt | sort -u

# Convert to uppercase
tr '[:lower:]' '[:upper:]' < file.txt

# Remove blank lines
sed '/^[[:space:]]*$/d' file.txt

# Strip leading/trailing whitespace from each line
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt

# Replace all occurrences of a word in-place
sed -i 's/oldword/newword/g' file.txt

# Extract second column from CSV
cut -d',' -f2 file.csv

# Count words in file
wc -w file.txt

# Reverse line order (cat backwards)
tac file.txt

# Count occurrences of each unique line
sort file.txt | uniq -c

# Find lines in file1 not in file2
comm -23 <(sort file1.txt) <(sort file2.txt)
Online Tools
For quick transformations without writing code:

Case Converter: convert between camelCase, PascalCase, snake_case, kebab-case, Title Case, and more.
Remove Duplicates: deduplicate lines with options for case sensitivity, whitespace trimming, and blank line removal.
Sort Lines: sort text alphabetically, numerically, by length, randomly, or in natural order.
Slug Generator: generate URL-friendly slugs with accent stripping and custom separators.