
Text Manipulation for Developers: Essential Techniques

String processing is something developers do constantly and mostly take for granted until something breaks in an unexpected way—Unicode characters that don't compare equal, regexes that match too much, slugs with invisible characters. This guide covers the patterns that matter, with the edge cases that trip people up.

Case Conversion and Naming Conventions

Different parts of a codebase use different conventions. Converting between them is something you'll do when normalizing API responses, generating code, or processing user input.

Convention | Example | Where it appears
lowercase | myvalue | SQL keywords (by convention), URL paths
UPPERCASE | MY_VALUE | Constants, environment variables, SQL identifiers
Title Case | My Value | UI labels, headings, proper nouns
Sentence case | My value | Body text, descriptions, error messages
camelCase | myValue | JavaScript/Java variables, JSON keys
PascalCase | MyValue | Classes, React components, TypeScript types
snake_case | my_value | Python, Ruby, database columns, Rust variables
kebab-case | my-value | CSS classes, HTML attributes, URL slugs, filenames
SCREAMING_SNAKE | MY_VALUE | Environment variables, C macros, constants

JavaScript conversions

// camelCase → snake_case
function toSnakeCase(str) {
  return str
    .replace(/([A-Z])/g, '_$1')
    .toLowerCase()
    .replace(/^_/, '');  // Strip leading underscore if input started with a capital
}
toSnakeCase('helloWorldFoo'); // 'hello_world_foo'
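The single-pass version above turns acronym runs into one underscore per letter ('parseHTTPRequest' becomes 'parse_h_t_t_p_request'). A two-pass variant keeps capital runs together; a sketch (the function name is ours):

```javascript
// Acronym-aware camelCase → snake_case (two passes keep runs of capitals intact).
function toSnakeCaseSmart(str) {
  return str
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1_$2')  // HTTPRequest → HTTP_Request
    .replace(/([a-z\d])([A-Z])/g, '$1_$2')      // helloWorld → hello_World
    .toLowerCase();
}
toSnakeCaseSmart('parseHTTPRequest'); // 'parse_http_request'
toSnakeCaseSmart('helloWorldFoo');    // 'hello_world_foo'
```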

// camelCase → kebab-case
function toKebabCase(str) {
  return str
    .replace(/([A-Z])/g, '-$1')
    .toLowerCase()
    .replace(/^-/, '');
}
toKebabCase('myVariableName'); // 'my-variable-name'

// snake_case or kebab-case → camelCase
function toCamelCase(str) {
  return str
    .toLowerCase()
    .replace(/[-_](\w)/g, (_, char) => char.toUpperCase());
}
toCamelCase('hello_world'); // 'helloWorld'
toCamelCase('hello-world'); // 'helloWorld'

// camelCase → PascalCase
function toPascalCase(str) {
  const camel = toCamelCase(str);
  return camel.charAt(0).toUpperCase() + camel.slice(1);
}

// Title Case (naive — doesn't handle articles)
function toTitleCase(str) {
  return str.replace(/\b\w/g, char => char.toUpperCase());
}

// Proper Title Case (skips articles/prepositions mid-sentence)
const minorWords = new Set(['a','an','the','and','but','or','in','on','at','to','for','of']);
function toProperTitleCase(str) {
  return str
    .toLowerCase()
    .replace(/\b\w+/g, (word, offset) => {
      if (offset === 0 || !minorWords.has(word)) {
        return word.charAt(0).toUpperCase() + word.slice(1);
      }
      return word;
    });
}
toProperTitleCase('the quick brown fox'); // 'The Quick Brown Fox'
toProperTitleCase('war and peace');       // 'War and Peace'

Python

import re

# camelCase → snake_case
def to_snake_case(s):
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', s)
    s = re.sub(r'([a-z])([A-Z])', r'\1_\2', s)
    return s.lower()

to_snake_case('helloWorldFoo')     # 'hello_world_foo'
to_snake_case('parseHTTPRequest')  # 'parse_http_request' (handles acronyms)

# snake_case → camelCase
def to_camel_case(s):
    parts = s.split('_')
    return parts[0] + ''.join(w.capitalize() for w in parts[1:])

to_camel_case('hello_world_foo')   # 'helloWorldFoo'

Regex Patterns Every Developer Needs

These patterns cover the extractions and validations you'll reach for repeatedly. None of them are perfect—the RFC specs for email and URLs are byzantine—but they handle 99% of real-world input.

// Email addresses
const EMAIL = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// URLs (http and https)
const URL = /https?:\/\/[^\s<>"{}|\\^`\[\]]+/g;

// IP addresses (IPv4)
const IPV4 = /\b(?:25[0-5]|2[0-4]\d|[01]?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){3}\b/g;

// CIDR notation
const CIDR = /\b(?:\d{1,3}\.){3}\d{1,3}\/(?:[0-9]|[1-2][0-9]|3[0-2])\b/g;

// Dates (YYYY-MM-DD)
const DATE_ISO = /\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/g;

// Phone numbers (loose — matches many formats)
const PHONE = /(?:\+?1[-.\s]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g;

// Credit card numbers (with or without spaces/dashes)
const CC = /\b(?:\d[ -]?){13,16}\b/g;

// Hex color codes
const HEX_COLOR = /#(?:[0-9a-fA-F]{3}){1,2}\b/g;

// HTML tags (for stripping, not parsing HTML generally)
const HTML_TAG = /<[^>]+>/g;

// Markdown links [text](url)
const MD_LINK = /\[([^\]]+)\]\(([^)]+)\)/g;

// UUIDs (v1-v5)
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}/gi;

// Extracting matches
const text = 'Contact us at hello@example.com or see https://example.com';
const emails = [...text.matchAll(EMAIL)].map(m => m[0]);
const urls = [...text.matchAll(URL)].map(m => m[0]);
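One caveat: these patterns are written for extraction. To validate that a whole string matches, anchor the pattern with ^ and $ and drop the g flag (a global regex keeps lastIndex state across repeated .test() calls). A sketch with the ISO date pattern:

```javascript
// Extraction pattern (finds a date anywhere) vs validation pattern (whole string).
const ISO_DATE = /\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])/;
const ISO_DATE_EXACT = new RegExp('^' + ISO_DATE.source + '$');

ISO_DATE.test('released 2026-03-15 today');        // true — found somewhere in the string
ISO_DATE_EXACT.test('released 2026-03-15 today');  // false — extra text rejected
ISO_DATE_EXACT.test('2026-03-15');                 // true
```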

Named capture groups

// Named groups make extraction readable
const LOG_LINE = /^(?<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+Z)\s+(?<level>ERROR|WARN|INFO|DEBUG)\s+(?<message>.+)$/;

const line = '2026-03-15T14:23:01Z ERROR Connection timeout after 30s';
const match = line.match(LOG_LINE);
if (match) {
  const { timestamp, level, message } = match.groups;
  console.log(timestamp, level, message);
}

// Extract all named matches from multiple lines
function parseLogs(logText) {
  return [...logText.matchAll(new RegExp(LOG_LINE.source, 'gm'))]
    .map(m => m.groups);
}

Encoding and Escaping

Getting encoding wrong causes security vulnerabilities (XSS), broken URLs, and corrupted data. Each context has its own escaping requirements.

HTML escaping

// Escape for insertion into HTML content (not attributes)
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Never do: element.innerHTML = userInput
// Always:   element.textContent = userInput (auto-escapes)
// Or:       element.innerHTML = escapeHtml(userInput)

// WARNING: Don't call .toUpperCase() on already-escaped HTML
// escapeHtml('<script>').toUpperCase()
// → '&LT;SCRIPT&GT;' — &LT; is a valid HTML5 named reference, so it still decodes to a tag!
// Always escape AFTER any case transformations.

URL encoding

// encodeURIComponent — encode a parameter value
const param = 'hello world & more';
encodeURIComponent(param);  // 'hello%20world%20%26%20more'

// encodeURI — encode a complete URL (preserves : / ? # & = etc.)
encodeURI('https://example.com/path with spaces');
// 'https://example.com/path%20with%20spaces'

// Build a query string properly
function buildQueryString(params) {
  return Object.entries(params)
    .map(([k, v]) => encodeURIComponent(k) + '=' + encodeURIComponent(v))
    .join('&');
}

buildQueryString({ q: 'hello world', page: 2, filter: 'a&b' });
// 'q=hello%20world&page=2&filter=a%26b'

// Parse a query string
function parseQueryString(qs) {
  return Object.fromEntries(new URLSearchParams(qs));
}
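For anything beyond one-off parameters, the standard URL and URLSearchParams APIs do this bookkeeping for you (note they serialize spaces as + in the query string, per the form-encoding rules):

```javascript
// Building a URL with the standard URL API (browser and Node).
const url = new URL('https://example.com/search');
url.searchParams.set('q', 'hello world');
url.searchParams.set('filter', 'a&b');
url.toString(); // 'https://example.com/search?q=hello+world&filter=a%26b'

// Reading parameters back — decoding is handled for you
const parsed = new URL('https://example.com/search?q=hello+world');
parsed.searchParams.get('q'); // 'hello world'
```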

Base64

// Browser
btoa('hello world');        // 'aGVsbG8gd29ybGQ='
atob('aGVsbG8gd29ybGQ=');  // 'hello world'
// Note: btoa throws on characters outside Latin-1 — encode to UTF-8 bytes first

// Node.js (Buffer-based, handles binary correctly)
Buffer.from('hello world').toString('base64');               // encode
Buffer.from('aGVsbG8gd29ybGQ=', 'base64').toString('utf8'); // decode

// URL-safe Base64 (replaces + with -, / with _, strips = padding)
function toBase64Url(str) {
  return btoa(str).replace(/\+/g, '-').replace(/\//g, '_').replace(/=/g, '');
}
function fromBase64Url(str) {
  str = str.replace(/-/g, '+').replace(/_/g, '/');
  while (str.length % 4) str += '=';
  return atob(str);
}

Unicode Normalization

Unicode has multiple ways to represent the same visible character. This causes comparison failures that are infuriating to debug because the strings look identical on screen.

// The problem
const s1 = 'é';          // U+00E9 — precomposed
const s2 = 'e\u0301';    // e + combining acute accent — decomposed
s1 === s2;               // false — different byte sequences
s1.length;               // 1
s2.length;               // 2

// The fix: normalize before comparing
s1.normalize('NFC') === s2.normalize('NFC'); // true

// Normalization forms:
// NFC  — Canonical Decomposition, then Canonical Composition (precomposed)
//        Best for storage, comparison, most string operations
// NFD  — Canonical Decomposition (fully decomposed)
//        Useful when you want to strip accents
// NFKC — Compatibility Decomposition, then Composition
//        Folds ligatures, subscripts, Roman numerals to base chars
// NFKD — Compatibility Decomposition
//        Most aggressive normalization

// Strip accents using NFD + filter
function stripAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
stripAccents('Ångström café naïve'); // 'Angstrom cafe naive'

// Unicode-aware string length (grapheme clusters)
// '👨‍👩‍👧‍👦' is a family emoji — one visible character, but:
'👨‍👩‍👧‍👦'.length;        // 11 (UTF-16 code units)
[...'👨‍👩‍👧‍👦'].length;   // 7 (code points — spread operator)
// For true grapheme count, use Intl.Segmenter:
const seg = new Intl.Segmenter();
[...seg.segment('👨‍👩‍👧‍👦')].length; // 1 (actual visible characters)

Safe string reversal

// Naive — breaks surrogate pairs
'hello 👋'.split('').reverse().join('');   // reversed surrogate halves → mojibake

// Code-point safe — works for most emoji
[...'hello 👋'].reverse().join(''); // '👋 olleh' ✓

// Grapheme-cluster safe — handles emoji with modifiers
function reverseString(str) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(str)].map(s => s.segment).reverse().join('');
}
reverseString('hello 👍🏿'); // '👍🏿 olleh' — skin tone modifier preserved

Slug Generation

Slugs are URL path segments. The rules: lowercase, alphanumerics and hyphens only, no leading/trailing/consecutive hyphens, ASCII only. The tricky parts are accent stripping and what to do with non-Latin scripts.

<code">function slugify(str, separator = '-') {
return str
.normalize('NFD')                          // Decompose accented chars
.replace(/[\u0300-\u036f]/g, '')           // Strip combining marks (accents)
.toLowerCase()
.trim()
.replace(/[^\w\s-]/g, '')                  // Remove non-word chars except spaces and hyphens
.replace(/[\s_-]+/g, separator)            // Spaces, underscores, hyphens → separator
.replace(new RegExp(^${separator}+|${separator}+$, 'g'), '');  // Trim separator
}

slugify('Hello World!');           // 'hello-world'
slugify('Café au lait');           // 'cafe-au-lait'
slugify(' Multiple Spaces ');      // 'multiple-spaces'
slugify('C++ Programming');        // 'c-programming'
slugify('Ångström Measurement');   // 'angstrom-measurement'
slugify('hello_world', '_');       // 'hello_world' (custom separator)

For non-Latin scripts (Chinese, Arabic, Japanese), transliteration libraries like slugify (npm) or Python's python-slugify handle conversion to ASCII romanization. The native approach above just strips them, which gives empty slugs for CJK-only titles.
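If you do stick with the native approach, it's worth guarding against the empty-slug case. A sketch (the fallback value is an arbitrary choice of ours):

```javascript
// Minimal slugify with a fallback for titles that strip to nothing (e.g. CJK-only).
function slugifyWithFallback(str, fallback = 'untitled') {
  const slug = str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')
    .replace(/[\s_-]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return slug || fallback;  // CJK-only input strips to '' — use the fallback
}

slugifyWithFallback('Café au lait');    // 'cafe-au-lait'
slugifyWithFallback('日本語タイトル');   // 'untitled'
```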

# Python
from slugify import slugify  # pip install python-slugify

slugify('Hello World!')            # 'hello-world'
slugify('Café Résumé')             # 'cafe-resume'
slugify('日本語タイトル')           # 'ri-ben-yu-taitorudang' (romanized)
slugify('foo bar', separator='_')  # 'foo_bar'

Deduplication and Sorting

Removing duplicate lines

// Preserve insertion order (first occurrence wins)
function deduplicateLines(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join('\n');
}

// Case-insensitive deduplication (keeps original case of first occurrence)
function deduplicateCaseInsensitive(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const key = line.toLowerCase().trim();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .join('\n');
}

// Remove empty lines too
function deduplicateAndClean(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const trimmed = line.trim();
      if (!trimmed || seen.has(trimmed)) return false;
      seen.add(trimmed);
      return true;
    })
    .join('\n');
}

Sorting strategies

const lines = text.split('\n').filter(Boolean);

// Alphabetical (case-insensitive)
lines.sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()));

// Natural sort — file1, file2, file10 (not file1, file10, file2)
lines.sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));

// Sort by line length (shortest first)
lines.sort((a, b) => a.length - b.length);

// Sort numerically (when lines are numbers)
lines.sort((a, b) => parseFloat(a) - parseFloat(b));

// Fisher-Yates shuffle (random order)
function shuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

Text Diffing

Computing diffs between text versions is useful for code review, changelog generation, and content audit tools.

Simple line diff (Myers algorithm)

// Using the 'diff' npm package (widely used)
import { createTwoFilesPatch, diffLines, diffChars } from 'diff';

const original = 'line one\nline two\nline three';
const modified = 'line one\nline TWO\nline three\nline four';

// Unified diff format
const patch = createTwoFilesPatch('original.txt', 'modified.txt', original, modified);
console.log(patch);

// Line-by-line diff with change objects
const changes = diffLines(original, modified);
for (const change of changes) {
  const prefix = change.added ? '+' : change.removed ? '-' : ' ';
  process.stdout.write(change.value.replace(/^/gm, prefix));
}

// Character-level diff within lines
const charDiff = diffChars('the cat sat', 'the dog sat');
charDiff.forEach(part => {
  const color = part.added ? '\x1b[32m' : part.removed ? '\x1b[31m' : '';
  process.stdout.write(color + part.value + '\x1b[0m');
});

Python difflib

# Built into Python standard library
import difflib

original = ['line one\n', 'line two\n', 'line three\n']
modified = ['line one\n', 'line TWO\n', 'line three\n', 'line four\n']

# Unified diff
diff = difflib.unified_diff(original, modified, fromfile='original', tofile='modified')
print(''.join(diff))

# HTML diff
htmldiff = difflib.HtmlDiff()
html = htmldiff.make_table(original, modified, fromdesc='Before', todesc='After')

# Similarity ratio (0.0 to 1.0)
ratio = difflib.SequenceMatcher(None, 'hello world', 'hello earth').ratio()
# 0.636... (2 × 7 matching characters / 22 total characters)

Word and Character Counting

Word count sounds trivial until you hit edge cases: hyphenated words, contractions, CJK text (no spaces), emoji clusters.

function analyzeText(text) {
  // Normalize line endings
  const normalized = text.replace(/\r\n/g, '\n').replace(/\r/g, '\n');

  // Character counts
  const charCount = normalized.length;  // includes whitespace
  const charNoSpaces = normalized.replace(/\s/g, '').length;

  // Line count
  const lineCount = normalized.split('\n').length;
  const nonEmptyLines = normalized.split('\n').filter(l => l.trim()).length;

  // Word count — split on whitespace, filter empty strings
  const words = normalized.trim().split(/\s+/).filter(Boolean);
  const wordCount = words.length;

  // Reading time (average 200 words per minute)
  const readingMinutes = Math.ceil(wordCount / 200);

  // Word frequency map
  const frequency = {};
  for (const word of words) {
    const key = word.toLowerCase().replace(/[^\w]/g, '');
    if (key) frequency[key] = (frequency[key] || 0) + 1;
  }
  const topWords = Object.entries(frequency)
    .sort(([,a], [,b]) => b - a)
    .slice(0, 10);

  // Sentence count (rough)
  const sentences = normalized.split(/[.!?]+/).filter(s => s.trim()).length;

  return { charCount, charNoSpaces, lineCount, nonEmptyLines,
           wordCount, readingMinutes, topWords, sentences };
}
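The whitespace-split word count above undercounts CJK text, which has no spaces. Intl.Segmenter with word granularity handles this; a minimal sketch:

```javascript
// Word counting via Intl.Segmenter — counts word-like segments, so it also
// works for scripts without spaces (CJK), unlike split(/\s+/).
function countWords(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'word' });
  return [...segmenter.segment(text)].filter(s => s.isWordLike).length;
}

countWords('hello, world!'); // 2 — punctuation segments are not word-like
```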

CSV and TSV Parsing

CSV seems simple. Then you encounter quoted fields, embedded commas, newlines inside quotes, and the lack of a definitive standard. Don't write your own parser for anything beyond toy use cases.

The edge cases that break naive parsers

// This is a single CSV row with three fields:
'"Smith, John",30,"New York, NY"'
// Field 1: Smith, John  (quoted because it contains a comma)
// Field 2: 30
// Field 3: New York, NY

// Embedded newlines are valid in quoted fields:
'"First line\nSecond line",value2'
// Field 1 spans two lines — a line-by-line reader breaks here

// Escaped quotes (double-quote escaping):
'"She said ""hello""",value2'
// Field 1: She said "hello"

// A naive split(',') fails on all of these

Parsing with a library

// Browser or Node.js — Papa Parse (most popular)
import Papa from 'papaparse';

const result = Papa.parse(csvString, {
  header: true,                    // First row is header
  dynamicTyping: true,             // Auto-convert numbers/booleans
  skipEmptyLines: true,
  transformHeader: h => h.trim()   // Trim whitespace from header names
});

// result.data   — array of objects (if header: true)
// result.errors — parse errors
// result.meta   — delimiter detected, line count, etc.

// Streaming large files (Node.js)
Papa.parse(fs.createReadStream('large.csv'), {
  header: true,
  step: (row) => processRow(row.data),
  complete: () => console.log('Done')
});

// Generate CSV from data
const csv = Papa.unparse([
  { name: 'Alice', age: 30 },
  { name: 'Bob', age: 25 }
]);
// → 'name,age\r\nAlice,30\r\nBob,25'

# Python — csv module (standard library)
import csv

# Reading
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Writing
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows([{'name': 'Alice', 'age': 30}])

# TSV — just change the delimiter
reader = csv.DictReader(f, delimiter='\t')

TSV vs CSV

TSV (tab-separated values) is simpler than CSV: tabs in field values are rare, so quoting is rarely needed. It's the preferred export format for spreadsheet data that might contain commas (addresses, descriptions). The tradeoff: tabs are invisible in many text editors and can be accidentally converted to spaces.
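A minimal TSV writer is short enough to sketch. Since this version skips CSV-style quoting, it just replaces tabs and newlines inside values with spaces (a lossy simplification, flagged in the comment):

```javascript
// Minimal TSV writer. Assumption: sanitizing tabs/newlines out of values is
// acceptable — CSV-style quoting is the lossless alternative.
function toTsv(rows) {
  return rows
    .map(row => row.map(v => String(v).replace(/[\t\n\r]/g, ' ')).join('\t'))
    .join('\n');
}

toTsv([['name', 'city'], ['Alice', 'New York, NY']]);
// 'name\tcity\nAlice\tNew York, NY' — the comma needs no escaping in TSV
```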

Placeholder Text Generation

Lorem ipsum is Latin from Cicero's de Finibus, scrambled to look like natural text without being readable. It's been the standard placeholder since the 1960s when Letraset used it for dry-transfer lettering sheets.

// Generate n words of lorem ipsum (browser or Node)
// Using the 'lorem-ipsum' npm package
import { LoremIpsum } from 'lorem-ipsum';

const lorem = new LoremIpsum({
  sentencesPerParagraph: { max: 8, min: 4 },
  wordsPerSentence: { max: 16, min: 4 }
});

lorem.generateWords(10);      // 10 random words
lorem.generateSentences(3);   // 3 sentences
lorem.generateParagraphs(2);  // 2 paragraphs

// For testing with varied, realistic-looking text,
// use the 'casual' package or Faker.js instead:

# Python
from faker import Faker  # pip install faker
fake = Faker()

fake.text(max_nb_chars=200)     # Paragraph of random text
fake.sentence()                 # Single sentence
fake.paragraph(nb_sentences=5)  # 5-sentence paragraph
fake.name()                     # Realistic name
fake.address()                  # Realistic address

CLI Tools

The Unix text processing tools are worth knowing even if you work primarily in Python or JavaScript. They're faster for quick one-liners on files than writing a script.

Command | What it does | Essential flags
sort | Sort lines | -n numeric, -r reverse, -u unique
uniq | Remove consecutive duplicates | -c count, -i case-insensitive; must sort first
tr | Translate or delete characters | -d delete, -s squeeze repeats
sed | Stream editor (substitution, deletion) | -i in-place edit
awk | Column processing, arithmetic | -F delimiter
grep | Search by regex | -o print only match, -v invert, -E extended regex
cut | Extract fields by delimiter or position | -d delimiter, -f field number
wc | Count words, lines, characters | -l lines, -w words, -c bytes

Common one-liners

# Sort and deduplicate
sort file.txt | uniq

# Top 10 most frequent lines
sort file.txt | uniq -c | sort -rn | head -10

# Extract all email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt | sort -u

# Extract all URLs
grep -oE 'https?://[^[:space:]]+' file.txt | sort -u

# Convert to uppercase
tr '[:lower:]' '[:upper:]' < file.txt

# Remove blank lines
sed '/^[[:space:]]*$/d' file.txt

# Strip leading/trailing whitespace from each line
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt

# Replace all occurrences of a word in-place (GNU sed; BSD/macOS needs -i '')
sed -i 's/oldword/newword/g' file.txt

# Extract second column from CSV
cut -d',' -f2 file.csv

# Count words in file
wc -w file.txt

# Reverse line order (cat in reverse)
tac file.txt

# Count occurrences of each unique line
sort file.txt | uniq -c

# Find lines in file1 not in file2
comm -23 <(sort file1.txt) <(sort file2.txt)

Online Tools

For quick transformations without writing code:

Case Converter: Convert between camelCase, PascalCase, snake_case, kebab-case, Title Case, and more.

Remove Duplicates: Deduplicate lines with options for case sensitivity, whitespace trimming, and blank line removal.

Sort Lines: Sort text alphabetically, numerically, by length, randomly, or in natural order.

Slug Generator: Generate URL-friendly slugs with accent stripping and custom separators.

Text Diff: Compare two blocks of text and highlight additions, deletions, and changes.

Word Counter: Count words, characters, lines, sentences, and reading time for any text.

Last updated: March 2026. All text tools on ToolsDock run in your browser—no data is sent to any server.
