Text Manipulation for Developers: Essential Techniques
String processing is something developers do constantly and mostly take for granted until something breaks in an unexpected way—Unicode characters that don't compare equal, regexes that match too much, slugs with invisible characters. This guide covers the patterns that matter, with the edge cases that trip people up.
Case Conversion and Naming Conventions
Different parts of a codebase use different conventions. Converting between them is something you'll do when normalizing API responses, generating code, or processing user input.
| Convention | Example | Where it appears |
|---|---|---|
| lowercase | myvalue | SQL keywords (by convention), URL paths |
| UPPERCASE | MYVALUE | SQL keywords, acronyms |
| Title Case | My Value | UI labels, headings, proper nouns |
| Sentence case | My value | Body text, descriptions, error messages |
| camelCase | myValue | JavaScript/Java variables, JSON keys |
| PascalCase | MyValue | Classes, React components, TypeScript types |
| snake_case | my_value | Python, Ruby, database columns, Rust variables |
| kebab-case | my-value | CSS classes, HTML attributes, URL slugs, filenames |
| SCREAMING_SNAKE | MY_VALUE | Environment variables, C macros, constants |
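All of these conversions reduce to the same two steps: split the identifier into words, then re-join under the target convention's rules. A minimal sketch of that factoring (the `toWords` helper is ours, not part of the conversions below):

```javascript
// Split an identifier in any common convention into its words.
function toWords(id) {
  return id
    // insert a space at acronym boundaries: "HTTPRequest" -> "HTTP Request"
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
    // insert a space at lower-to-upper boundaries: "myValue" -> "my Value"
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    // treat underscores and hyphens as separators
    .split(/[\s_-]+/)
    .filter(Boolean);
}

// Any target convention is then just a re-join:
const words = toWords('parseHTTPRequest'); // ['parse', 'HTTP', 'Request']
const snake = words.map(w => w.toLowerCase()).join('_'); // 'parse_http_request'
const kebab = words.map(w => w.toLowerCase()).join('-'); // 'parse-http-request'
```

Factoring it this way means one tricky function (the splitter) instead of one regex per conversion pair.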
JavaScript conversions
// camelCase → snake_case
function toSnakeCase(str) {
  return str
    .replace(/([A-Z])/g, '_$1')
    .toLowerCase()
    .replace(/^_/, ''); // Strip leading underscore if input started with a capital
}
toSnakeCase('helloWorldFoo'); // 'hello_world_foo'
// camelCase → kebab-case
function toKebabCase(str) {
  return str
    .replace(/([A-Z])/g, '-$1')
    .toLowerCase()
    .replace(/^-/, '');
}
toKebabCase('myVariableName'); // 'my-variable-name'
// snake_case or kebab-case → camelCase
function toCamelCase(str) {
  return str
    .toLowerCase()
    .replace(/[-_](\w)/g, (_, char) => char.toUpperCase());
}
toCamelCase('hello_world'); // 'helloWorld'
toCamelCase('hello-world'); // 'helloWorld'
// camelCase → PascalCase
function toPascalCase(str) {
  const camel = toCamelCase(str);
  return camel.charAt(0).toUpperCase() + camel.slice(1);
}
// Title Case (naive — doesn't handle articles)
function toTitleCase(str) {
  return str.replace(/\b\w/g, char => char.toUpperCase());
}
// Proper Title Case (skips articles/prepositions mid-sentence)
const minorWords = new Set(['a','an','the','and','but','or','in','on','at','to','for','of']);
function toProperTitleCase(str) {
  return str
    .toLowerCase()
    .replace(/\b\w+/g, (word, offset) => {
      if (offset === 0 || !minorWords.has(word)) {
        return word.charAt(0).toUpperCase() + word.slice(1);
      }
      return word;
    });
}
toProperTitleCase('the quick brown fox'); // 'The Quick Brown Fox'
toProperTitleCase('war and peace'); // 'War and Peace'
Python
import re

camelCase → snake_case
def to_snake_case(s):
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', s)
    s = re.sub(r'([a-z])([A-Z])', r'\1_\2', s)
    return s.lower()

to_snake_case('helloWorldFoo')     # 'hello_world_foo'
to_snake_case('parseHTTPRequest')  # 'parse_http_request' (handles acronyms)
snake_case → camelCase
def to_camel_case(s):
    parts = s.split('_')
    return parts[0] + ''.join(w.capitalize() for w in parts[1:])

to_camel_case('hello_world_foo')  # 'helloWorldFoo'
Regex Patterns Every Developer Needs
These patterns cover the extractions and validations you'll reach for repeatedly. None of them are perfect—the RFC specs for email and URLs are byzantine—but they handle 99% of real-world input.
// Email addresses
const EMAIL = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
// URLs (http and https)
const URL = /https?:\/\/[^\s<>"{}|\\^`\[\]]+/g;
// IP addresses (IPv4)
const IPV4 = /\b(?:25[0-5]|2[0-4]\d|[01]?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){3}\b/g;
// CIDR notation
const CIDR = /\b(?:\d{1,3}\.){3}\d{1,3}\/(?:[0-9]|[1-2][0-9]|3[0-2])\b/g;
// Dates (YYYY-MM-DD)
const DATE_ISO = /\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/g;
// Phone numbers (loose — matches many formats)
const PHONE = /(?:\+?1[-.\s]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}/g;
// Credit card numbers (with or without spaces/dashes)
const CC = /\b(?:\d[ -]?){13,16}\b/g;
// Hex color codes
const HEX_COLOR = /#(?:[0-9a-fA-F]{3}){1,2}\b/g;
// HTML tags (for stripping, not parsing HTML generally)
const HTML_TAG = /<[^>]+>/g;
// Markdown links — capture link text and URL
const MD_LINK = /\[([^\]]+)\]\(([^)]+)\)/g;
// UUIDs (v1-v5)
const UUID = /[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}/gi;
// Extracting matches
const text = 'Contact us at hello@example.com or see https://example.com';
const emails = [...text.matchAll(EMAIL)].map(m => m[0]);
const urls = [...text.matchAll(URL)].map(m => m[0]);
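The patterns above use the `g` flag for extraction. Validating that an entire string matches is a different job: anchor the pattern with `^…$` and drop `g`. A sketch using the ISO date pattern:

```javascript
// Extraction pattern (finds matches anywhere in text)
const DATE_ISO_FIND = /\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b/g;
// Validation pattern (whole string must be a date) — anchored, no g flag
const DATE_ISO_VALID = /^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$/;

DATE_ISO_VALID.test('2026-03-15');           // true
DATE_ISO_VALID.test('on 2026-03-15 we met'); // false — extra text rejected

const found = 'from 2026-03-15 to 2026-04-01'.match(DATE_ISO_FIND); // two dates
```

Dropping `g` matters for validators: a `g`-flagged regex keeps `lastIndex` state between `.test()` calls, so repeated validation calls can return alternating results.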
Named capture groups
// Named groups make extraction readable
const LOG_LINE = /^(?<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+Z)\s+(?<level>ERROR|WARN|INFO|DEBUG)\s+(?<message>.+)$/;
const line = '2026-03-15T14:23:01Z ERROR Connection timeout after 30s';
const match = line.match(LOG_LINE);
if (match) {
  const { timestamp, level, message } = match.groups;
  console.log(timestamp, level, message);
}

// Extract all named matches from multiple lines
function parseLogs(logText) {
  return [...logText.matchAll(new RegExp(LOG_LINE.source, 'gm'))]
    .map(m => m.groups);
}
Encoding and Escaping
Getting encoding wrong causes security vulnerabilities (XSS), broken URLs, and corrupted data. Each context has its own escaping requirements.
HTML escaping
// Escape for insertion into HTML content (not attributes)
function escapeHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
// Never do: element.innerHTML = userInput
// Always: element.textContent = userInput (auto-escapes)
// Or: element.innerHTML = escapeHtml(userInput)
// WARNING: Don't call .toUpperCase() on already-escaped HTML
// escapeHtml('<script>').toUpperCase()
// → '&LT;SCRIPT&GT;' — HTML5 still decodes the legacy all-caps
// entities &LT; and &GT;, undoing your escaping.
// Always escape AFTER any case transformations.
URL encoding
// encodeURIComponent — encode a parameter value
const param = 'hello world & more';
encodeURIComponent(param); // 'hello%20world%20%26%20more'

// encodeURI — encode a complete URL (preserves : / ? # & = etc.)
encodeURI('https://example.com/path with spaces');
// 'https://example.com/path%20with%20spaces'

// Build a query string properly
function buildQueryString(params) {
  return Object.entries(params)
    .map(([k, v]) => encodeURIComponent(k) + '=' + encodeURIComponent(v))
    .join('&');
}

buildQueryString({ q: 'hello world', page: 2, filter: 'a&b' });
// 'q=hello%20world&page=2&filter=a%26b'

// Parse a query string
function parseQueryString(qs) {
  return Object.fromEntries(new URLSearchParams(qs));
}
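Modern runtimes can also build the query string for you with `URLSearchParams`. One difference worth knowing: it encodes spaces as `+` (form encoding) where `encodeURIComponent` uses `%20`; both are valid in query strings.

```javascript
// URLSearchParams builds and encodes in one step
const qs = new URLSearchParams({ q: 'hello world', filter: 'a&b' }).toString();
// 'q=hello+world&filter=a%26b' — note '+' for space, '%26' for '&'

// And it round-trips cleanly back to an object
const params = Object.fromEntries(new URLSearchParams(qs));
// { q: 'hello world', filter: 'a&b' }
```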
Base64
// Browser
btoa('hello world'); // 'aGVsbG8gd29ybGQ='
atob('aGVsbG8gd29ybGQ='); // 'hello world'
// Node.js (Buffer-based, handles binary correctly)
Buffer.from('hello world').toString('base64'); // encode
Buffer.from('aGVsbG8gd29ybGQ=', 'base64').toString('utf8'); // decode
// URL-safe Base64 (replaces + with -, / with _, strips = padding)
function toBase64Url(str) {
  return btoa(str).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}
function fromBase64Url(str) {
  str = str.replace(/-/g, '+').replace(/_/g, '/');
  while (str.length % 4) str += '=';
  return atob(str);
}
Unicode Normalization
Unicode has multiple ways to represent the same visible character. This causes comparison failures that are infuriating to debug because the strings look identical on screen.
// The problem
const s1 = 'é';        // U+00E9 — precomposed
const s2 = 'e\u0301';  // e + combining acute accent — decomposed
s1 === s2;   // false — different byte sequences
s1.length;   // 1
s2.length;   // 2

// The fix: normalize before comparing
s1.normalize('NFC') === s2.normalize('NFC'); // true

// Normalization forms:
// NFC  — Canonical Decomposition, then Canonical Composition (precomposed)
//        Best for storage, comparison, most string operations
// NFD  — Canonical Decomposition (fully decomposed)
//        Useful when you want to strip accents
// NFKC — Compatibility Decomposition, then Composition
//        Folds ligatures, subscripts, Roman numerals to base chars
// NFKD — Compatibility Decomposition
//        Most aggressive normalization

// Strip accents using NFD + filter
function stripAccents(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
stripAccents('Ångström café naïve'); // 'Angstrom cafe naive'

// Unicode-aware string length (grapheme clusters)
// The family emoji is one visible character (four emoji joined by U+200D), but:
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // 👨‍👩‍👧‍👦
family.length;       // 11 (UTF-16 code units)
[...family].length;  // 7 (code points — spread operator)

// For true grapheme count, use Intl.Segmenter:
const seg = new Intl.Segmenter();
[...seg.segment(family)].length; // 1 (actual visible characters)
Safe string reversal
// Naive — breaks surrogate pairs
'hello 👋'.split('').reverse().join(''); // 'hello ' + broken bytes — 👋 is two code units
// Code-point safe — works for most emoji
[...'hello 👋'].reverse().join(''); // '👋 olleh' ✓
// Grapheme-cluster safe — handles emoji with modifiers
function reverseString(str) {
  const seg = new Intl.Segmenter();
  return [...seg.segment(str)].map(s => s.segment).reverse().join('');
}
reverseString('hello 👍🏿'); // '👍🏿 olleh' — skin tone modifier preserved
Slug Generation
Slugs are URL path segments. The rules: lowercase, alphanumerics and hyphens only, no leading/trailing/consecutive hyphens, ASCII only. The tricky parts are accent stripping and what to do with non-Latin scripts.
function slugify(str, separator = '-') {
  return str
    .normalize('NFD')                 // Decompose accented chars
    .replace(/[\u0300-\u036f]/g, '')  // Strip combining marks (accents)
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')         // Remove non-word chars except spaces and hyphens
    .replace(/[\s_-]+/g, separator)   // Spaces, underscores, hyphens → separator
    .replace(new RegExp(`^${separator}+|${separator}+$`, 'g'), ''); // Trim separator
}
slugify('Hello World!'); // 'hello-world'
slugify('Café au lait'); // 'cafe-au-lait'
slugify(' Multiple Spaces '); // 'multiple-spaces'
slugify('C++ Programming'); // 'c-programming'
slugify('Ångström Measurement'); // 'angstrom-measurement'
slugify('hello_world', '_'); // 'hello_world' (custom separator)
For non-Latin scripts (Chinese, Arabic, Japanese), transliteration libraries like slugify (npm) or Python's python-slugify handle conversion to ASCII romanization. The native approach above just strips them, which gives empty slugs for CJK-only titles.
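If you stick with the native approach, it's worth guarding against that empty-slug case. A defensive sketch (the fallback idea and the `slugifyOrFallback` name are ours, not part of the guide's `slugify`):

```javascript
// Same pipeline as the guide's slugify, but fall back to a placeholder
// when stripping leaves nothing — e.g. a CJK-only title.
function slugifyOrFallback(str, fallback = 'untitled') {
  const slug = str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .trim()
    .replace(/[^\w\s-]/g, '')
    .replace(/[\s_-]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return slug || fallback;
}

slugifyOrFallback('日本語タイトル'); // 'untitled' — everything was stripped
slugifyOrFallback('Hello World');   // 'hello-world'
```

In practice the fallback would be something unique (a post ID or short hash) rather than a constant, so two CJK titles don't collide.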
# Python
from slugify import slugify  # pip install python-slugify

slugify('Hello World!')            # 'hello-world'
slugify('Café Résumé')             # 'cafe-resume'
slugify('日本語タイトル')           # romanized to ASCII via transliteration
slugify('foo bar', separator='_')  # 'foo_bar'
Deduplication and Sorting
Removing duplicate lines
// Preserve insertion order (first occurrence wins)
function deduplicateLines(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      if (seen.has(line)) return false;
      seen.add(line);
      return true;
    })
    .join('\n');
}

// Case-insensitive deduplication (keeps original case of first occurrence)
function deduplicateCaseInsensitive(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const key = line.toLowerCase().trim();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .join('\n');
}

// Remove empty lines too
function deduplicateAndClean(text) {
  const seen = new Set();
  return text.split('\n')
    .filter(line => {
      const trimmed = line.trim();
      if (!trimmed || seen.has(trimmed)) return false;
      seen.add(trimmed);
      return true;
    })
    .join('\n');
}
Sorting strategies
const lines = text.split('\n').filter(Boolean);
// Alphabetical (case-insensitive)
lines.sort((a, b) => a.toLowerCase().localeCompare(b.toLowerCase()));
// Natural sort — file1, file2, file10 (not file1, file10, file2)
lines.sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
// Sort by line length (shortest first)
lines.sort((a, b) => a.length - b.length);
// Sort numerically (when lines are numbers)
lines.sort((a, b) => parseFloat(a) - parseFloat(b));
// Fisher-Yates shuffle (random order)
function shuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}
Text Diffing
Computing diffs between text versions is useful for code review, changelog generation, and content audit tools.
Simple line diff (Myers algorithm)
// Using the 'diff' npm package (widely used)
import { createTwoFilesPatch, diffLines } from 'diff';

const original = 'line one\nline two\nline three';
const modified = 'line one\nline TWO\nline three\nline four';

// Unified diff format
const patch = createTwoFilesPatch('original.txt', 'modified.txt', original, modified);
console.log(patch);

// Line-by-line diff with change objects
const changes = diffLines(original, modified);
for (const change of changes) {
  const prefix = change.added ? '+' : change.removed ? '-' : ' ';
  process.stdout.write(change.value.replace(/^/gm, prefix));
}

// Character-level diff within lines
import { diffChars } from 'diff';
const charDiff = diffChars('the cat sat', 'the dog sat');
charDiff.forEach(part => {
  const color = part.added ? '\x1b[32m' : part.removed ? '\x1b[31m' : '';
  process.stdout.write(color + part.value + '\x1b[0m');
});
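Under the hood, line diffs are built on longest-common-subsequence matching. For intuition only, here's a dependency-free sketch (use the library for real work; `simpleDiff` is our name, and this is plain O(n·m) dynamic programming rather than the library's optimized algorithm):

```javascript
// Minimal LCS-based line diff: '  ' unchanged, '- ' removed, '+ ' added.
function simpleDiff(oldText, newText) {
  const a = oldText.split('\n'), b = newText.split('\n');
  // lcs[i][j] = LCS length of a[i:] and b[j:]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--)
    for (let j = b.length - 1; j >= 0; j--)
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
  // Walk the table, emitting keeps, removals, and additions
  const out = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push('  ' + a[i]); i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) out.push('- ' + a[i++]);
    else out.push('+ ' + b[j++]);
  }
  while (i < a.length) out.push('- ' + a[i++]);
  while (j < b.length) out.push('+ ' + b[j++]);
  return out.join('\n');
}

simpleDiff('a\nb\nc', 'a\nB\nc'); // '  a\n- b\n+ B\n  c'
```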
Python difflib
# Built into Python standard library
import difflib

original = ['line one\n', 'line two\n', 'line three\n']
modified = ['line one\n', 'line TWO\n', 'line three\n', 'line four\n']

Unified diff
diff = difflib.unified_diff(original, modified, fromfile='original', tofile='modified')
print(''.join(diff))

HTML diff
htmldiff = difflib.HtmlDiff()
html = htmldiff.make_table(original, modified, fromdesc='Before', todesc='After')

Similarity ratio (0.0 to 1.0)
ratio = difflib.SequenceMatcher(None, 'hello world', 'hello earth').ratio()
# 0.636...
Word and Character Counting
Word count sounds trivial until you hit edge cases: hyphenated words, contractions, CJK text (no spaces), emoji clusters.
function analyzeText(text) {
  // Normalize line endings
  const normalized = text.replace(/\r\n/g, '\n').replace(/\r/g, '\n');

  // Character counts
  const charCount = normalized.length; // includes whitespace
  const charNoSpaces = normalized.replace(/\s/g, '').length;

  // Line count
  const lineCount = normalized.split('\n').length;
  const nonEmptyLines = normalized.split('\n').filter(l => l.trim()).length;

  // Word count — split on whitespace, filter empty strings
  const words = normalized.trim().split(/\s+/).filter(Boolean);
  const wordCount = words.length;

  // Reading time (average 200 words per minute)
  const readingMinutes = Math.ceil(wordCount / 200);

  // Word frequency map
  const frequency = {};
  for (const word of words) {
    const key = word.toLowerCase().replace(/[^\w]/g, '');
    if (key) frequency[key] = (frequency[key] || 0) + 1;
  }
  const topWords = Object.entries(frequency)
    .sort(([,a], [,b]) => b - a)
    .slice(0, 10);

  // Sentence count (rough)
  const sentences = normalized.split(/[.!?]+/).filter(s => s.trim()).length;

  return { charCount, charNoSpaces, lineCount, nonEmptyLines,
           wordCount, readingMinutes, topWords, sentences };
}
CSV and TSV Parsing
CSV seems simple. Then you encounter quoted fields, embedded commas, newlines inside quotes, and the lack of a definitive standard. Don't write your own parser for anything beyond toy use cases.
The edge cases that break naive parsers
// This is a single CSV row with three fields:
// '"Smith, John",30,"New York, NY"'
//   Field 1: Smith, John  (quoted because it contains a comma)
//   Field 2: 30
//   Field 3: New York, NY

// Embedded newlines are valid in quoted fields:
// '"First line\nSecond line",value2'
//   Field 1 spans two lines — a line-by-line reader breaks here

// Escaped quotes (double-quote escaping):
// '"She said ""hello""",value2'
//   Field 1: She said "hello"

// A naive split(',') fails on all of these
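To see why a real parser is a state machine rather than a split, here's a toy illustration (ours, for teaching only — it skips error handling and streaming, which is why you use a library in production):

```javascript
// Minimal CSV state machine: handles quoted fields, embedded commas,
// embedded newlines, and "" quote escaping.
function parseCsv(text) {
  const rows = [[]];
  let field = '', inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { field += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;                        // closing quote
      else field += ch;                      // anything, including , and \n
    } else if (ch === '"') inQuotes = true;
    else if (ch === ',') { rows[rows.length - 1].push(field); field = ''; }
    else if (ch === '\n') { rows[rows.length - 1].push(field); field = ''; rows.push([]); }
    else if (ch !== '\r') field += ch;       // ignore CR in CRLF endings
  }
  rows[rows.length - 1].push(field);
  return rows;
}

parseCsv('"Smith, John",30\n"She said ""hi""",x');
// [['Smith, John', '30'], ['She said "hi"', 'x']]
```

The single `inQuotes` flag is the whole trick: it changes the meaning of `,` and `\n`, which is exactly what `split(',')` can't express.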
Parsing with a library
// Browser or Node.js — Papa Parse (most popular)
import Papa from 'papaparse';

const result = Papa.parse(csvString, {
  header: true,         // First row is header
  dynamicTyping: true,  // Auto-convert numbers/booleans
  skipEmptyLines: true,
  trimHeaders: true
});

// result.data — array of objects (if header: true)
// result.errors — parse errors
// result.meta — delimiter detected, line count, etc.

// Streaming large files (Node.js)
Papa.parse(fs.createReadStream('large.csv'), {
  header: true,
  step: (row) => processRow(row.data),
  complete: () => console.log('Done')
});

// Generate CSV from data
const csv = Papa.unparse([
  { name: 'Alice', age: 30 },
  { name: 'Bob', age: 25 }
]);
// → 'name,age\r\nAlice,30\r\nBob,25'
# Python — csv module (standard library)
import csv

# Reading
with open('data.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

Writing
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows([{'name': 'Alice', 'age': 30}])
TSV — just change the delimiter
reader = csv.DictReader(f, delimiter='\t')
TSV vs CSV
TSV (tab-separated values) is simpler than CSV: tabs in field values are rare, so quoting is rarely needed. It's the preferred export format for spreadsheet data that might contain commas (addresses, descriptions). The tradeoff: tabs are invisible in many text editors and can be accidentally converted to spaces.
Placeholder Text Generation
Lorem ipsum is Latin from Cicero's De Finibus, scrambled to look like natural text without being readable. It's been the standard placeholder since the 1960s when Letraset used it for dry-transfer lettering sheets.
// Generate n words of lorem ipsum (browser or Node)
// Using the 'lorem-ipsum' npm package
import { LoremIpsum } from 'lorem-ipsum';

const lorem = new LoremIpsum({
  sentencesPerParagraph: { max: 8, min: 4 },
  wordsPerSentence: { max: 16, min: 4 }
});

lorem.generateWords(10);     // 10 random words
lorem.generateSentences(3);  // 3 sentences
lorem.generateParagraphs(2); // 2 paragraphs
// For testing with varied, realistic-looking text,
// use 'casual' package or Faker.js instead:
# Python
from faker import Faker  # pip install faker

fake = Faker()
fake.text(max_nb_chars=200)     # Paragraph of random text
fake.sentence()                 # Single sentence
fake.paragraph(nb_sentences=5)  # 5-sentence paragraph
fake.name()                     # Realistic name
fake.address()                  # Realistic address
CLI Tools
The Unix text processing tools are worth knowing even if you work primarily in Python or JavaScript. They're faster for quick one-liners on files than writing a script.
| Command | What it does | Essential flag |
|---|---|---|
| sort | Sort lines | -n numeric, -r reverse, -u unique |
| uniq | Remove consecutive duplicates | -c count, -i case-insensitive; must sort first |
| tr | Translate or delete characters | -d delete, -s squeeze repeats |
| sed | Stream editor (substitution, deletion) | -i in-place edit |
| awk | Column processing, arithmetic | -F delimiter |
| grep | Search by regex | -o print only match, -v invert, -E extended regex |
| cut | Extract fields by delimiter or position | -d delimiter, -f field number |
| wc | Count words, lines, characters | -l lines, -w words, -c bytes |
Common one-liners
# Sort and deduplicate
sort file.txt | uniq

# Top 10 most frequent lines
sort file.txt | uniq -c | sort -rn | head -10

# Extract all email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt | sort -u

# Extract all URLs
grep -oE 'https?://[^[:space:]]+' file.txt | sort -u

# Convert to uppercase
tr '[:lower:]' '[:upper:]' < file.txt

# Remove blank lines
sed '/^[[:space:]]*$/d' file.txt

# Strip leading/trailing whitespace from each line
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' file.txt

# Replace all occurrences of a word in-place
sed -i 's/oldword/newword/g' file.txt

# Extract second column from CSV
cut -d',' -f2 file.csv

# Count words in file
wc -w file.txt

# Reverse line order (cat backwards)
tac file.txt

# Count occurrences of each unique line
sort file.txt | uniq -c

# Find lines in file1 not in file2
comm -23 <(sort file1.txt) <(sort file2.txt)
Online Tools
For quick transformations without writing code:

Case Converter: convert between camelCase, PascalCase, snake_case, kebab-case, Title Case, and more.
Remove Duplicates: deduplicate lines with options for case sensitivity, whitespace trimming, and blank line removal.
Sort Lines: sort text alphabetically, numerically, by length, randomly, or in natural order.
Slug Generator: generate URL-friendly slugs with accent stripping and custom separators.