Regular Expressions: A Practical Guide

Q: What does the dot (.) mean in regex?

The dot matches any single character except newline. Pattern 'c.t' matches 'cat', 'cot', 'cut', 'c@t', but not 'ct' (needs exactly one character between c and t) or 'cart' (. matches only one character). To match a literal dot, escape it: \.

Q: What's the difference between * and + quantifiers?

* matches zero or more occurrences (optional, can repeat). + matches one or more (required at least once). Pattern 'ab*c' matches 'ac', 'abc', 'abbc'. Pattern 'ab+c' matches 'abc', 'abbc' but NOT 'ac' because at least one 'b' is required.

Q: How do I match the beginning or end of a string?

Use ^ for start and $ for end. Pattern '^Hello' matches 'Hello world' but not 'Say Hello'. Pattern 'world$' matches 'Hello world' but not 'world peace'. Combine them: '^exact$' matches only the exact string 'exact'.

Q: What is a capturing group and when do I use it?

Parentheses () create capturing groups that extract matched text for later use. Pattern '(\d{4})-(\d{2})-(\d{2})' on '2024-03-15' captures: group 1 = '2024', group 2 = '03', group 3 = '15'. Use for extracting parts of matches or backreferences.

Q: Why isn't my regex matching what I expect?

Common issues: forgetting to escape special characters (use \. for literal dot), using greedy instead of lazy quantifiers (use .*? instead of .*), missing anchors (^$), or not accounting for whitespace. Test with a regex tester to see exactly what matches.

Q: What's the difference between greedy and lazy matching?

Greedy quantifiers (*, +, {n,}) match as much as possible. Lazy quantifiers (*?, +?, {n,}?) match as little as possible. On '<div>text</div>', pattern '<.*>' greedily matches the entire string, while '<.*?>' lazily matches just '<div>'.

Regex looks intimidating at first. But once you understand the handful of core ideas it's built on, you can write patterns for almost anything — email validation, log parsing, date extraction, find-and-replace at scale. This guide walks through everything from literal characters to lookaheads, with real examples you can test immediately.

What Regex Actually Is

A regular expression is a pattern that describes a set of strings. You write one pattern, and the regex engine tests it against text — telling you whether it matches, where it matches, and what it captured.

That's it. Everything else (character classes, quantifiers, groups) is just syntax for making patterns more expressive.

Regex engines are built into virtually every language and many command-line tools. The syntax is largely shared — what works in JavaScript mostly works in Python, grep, sed, vim, and your code editor. There are dialect differences (especially for advanced features), but the fundamentals are universal.

When Regex Is the Right Tool

Validation: Does this input match a required format? Email, phone, postal code, credit card number.
Extraction: Pull all dates from a log file. Find every URL in a document. Grab IP addresses from access logs.
Transformation: Reformat dates from MM/DD/YYYY to YYYY-MM-DD. Strip HTML tags. Normalize whitespace.
Search: Find lines matching a complex condition across thousands of files.

When Regex Is the Wrong Tool

Regex is not the right choice for parsing HTML or XML (use a parser), deeply nested structures, or anything that's genuinely context-sensitive. The famous Stack Overflow answer about parsing HTML with regex exists for good reason — some things look like they have patterns but don't.

Practice as you read: Use the Regex Tester to try every example in this guide against your own text.

Basic Syntax

Literal Characters

The simplest regex is just plain text. The pattern cat matches the substring "cat" wherever it appears:

Pattern: cat
Text:    "The cat sat on the mat"
Match:       ^^^

Case matters. cat won't match "Cat" unless you add the case-insensitive flag.

The Dot: Match Any Character

A period matches any single character except a newline:

Pattern: c.t
Matches: "cat", "cot", "cut", "c1t", "c@t"
No match: "ct" (nothing between c and t)
No match: "cart" (dot matches exactly one character)

Escaping Special Characters

Fourteen characters have special meaning in regex: . * + ? ^ $ { } [ ] \ | ( )

To match them literally, prefix with a backslash:

Pattern: 3\.14        Matches "3.14" (not "3x14")
Pattern: \$100        Matches "$100"
Pattern: \(optional\) Matches "(optional)"

Forgetting to escape the dot is one of the most common regex bugs. file.txt as a pattern matches "fileXtxt" and "file_txt" too, because the dot matches any single character. Use file\.txt for a literal match.
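When the literal text comes from a variable, hand-escaping is easy to get wrong. In Python, the standard library's re.escape does the escaping for you. The sketch below is illustrative only; the filenames are made-up sample data.

```python
import re

filename = "file.txt"

# Unescaped: the dot in the pattern matches any single character
unescaped = re.compile(filename)
# Escaped: re.escape turns "." into "\." so it only matches a literal dot
escaped = re.compile(re.escape(filename))

loose_hit = unescaped.fullmatch("fileXtxt") is not None    # True
strict_hit = escaped.fullmatch("fileXtxt") is not None     # False
literal_hit = escaped.fullmatch("file.txt") is not None    # True
```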

Character Classes

Square brackets define a set — the pattern matches any one character from the set.

Pattern: [aeiou]
Matches any single vowel

Pattern: [0-9]
Matches any digit. Ranges work inside brackets.

Pattern: [a-zA-Z]
Matches any letter, upper or lower case

Pattern: [a-zA-Z0-9_]
Matches any word character (same as \w)

Pattern: [^0-9]
^ inside brackets means "not" — matches anything except a digit

The only characters with special meaning inside [] are ], \, ^ (when first), and - (between characters). Everything else is literal.

Shorthand Classes

Shorthand	Equivalent	Meaning
`\d`	`[0-9]`	Any digit
`\D`	`[^0-9]`	Any non-digit
`\w`	`[a-zA-Z0-9_]`	Word character
`\W`	`[^a-zA-Z0-9_]`	Non-word character
`\s`	`[ \t\n\r\f]`	Whitespace (space, tab, newline)
`\S`	`[^ \t\n\r\f]`	Non-whitespace

Uppercase is always the negation of lowercase. \D = not a digit, \W = not a word character, \S = not whitespace.

You can mix shorthands inside brackets: [\w\s] matches any word character or whitespace.
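A short Python sketch of the classes above, using re.findall and re.match from the standard library. The sample text is invented for illustration.

```python
import re

text = "Order #42 shipped to Zone-7B on 2024-03-15"

# A set matches exactly one character from the brackets
vowels = re.findall(r"[aeiou]", "regex")                # ['e', 'e']

# \d+ collects each run of digits
digit_runs = re.findall(r"\d+", text)                   # ['42', '7', '2024', '03', '15']

# A negated set: everything up to the first digit
leading_text = re.match(r"[^0-9]+", text).group()       # 'Order #'

# Shorthands can be mixed inside brackets
word_or_space = re.findall(r"[\w\s]+", "a-b c")         # ['a', 'b c']
```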

Quantifiers

Quantifiers say how many times the preceding element must appear.

Quantifier	Meaning	Example
`*`	0 or more	`ab*c` matches "ac", "abc", "abbc", "abbbc"
`+`	1 or more	`ab+c` matches "abc", "abbc" — but not "ac"
`?`	0 or 1 (optional)	`colou?r` matches "color" and "colour"
`{n}`	Exactly n times	`\d{4}` matches exactly 4 digits like "2024"
`{n,}`	n or more	`\d{2,}` matches 2 or more digits
`{n,m}`	Between n and m	`\d{2,4}` matches 2, 3, or 4 digits

Quantifiers apply to the immediately preceding element — a single character, a character class, or a group:

Pattern: \d+        One or more digits
Pattern: [a-z]{3}   Exactly 3 lowercase letters
Pattern: (ab)+      The sequence "ab" repeated one or more times
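The quantifier table can be checked directly in Python. This is a minimal sketch; re.fullmatch is used so that each sample must match in its entirety, and the sample strings are chosen only for illustration.

```python
import re

samples = ["ac", "abc", "abbc"]

# * allows zero b's, + requires at least one
star = [s for s in samples if re.fullmatch(r"ab*c", s)]    # ['ac', 'abc', 'abbc']
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]    # ['abc', 'abbc']

# ? makes the preceding "u" optional, but does not allow two of them
words = ["color", "colour", "colouur"]
optional = [w for w in words if re.fullmatch(r"colou?r", w)]   # ['color', 'colour']

# {2,4} takes up to four digits at a time, so a six-digit run is split in two
bounded = re.findall(r"\d{2,4}", "7 42 2024 123456")       # ['42', '2024', '1234', '56']
```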

Anchors and Boundaries

Anchors match positions in the string, not characters. They let you constrain where in the text a pattern must appear.

Anchor	Meaning	Example
`^`	Start of string (or line with multiline flag)	`^Hello` matches "Hello world" but not "Say Hello"
`$`	End of string (or line with multiline flag)	`\.pdf$` matches strings ending in ".pdf"
`\b`	Word boundary	`\bcat\b` matches "cat" but not "catch" or "bobcat"
`\B`	Non-word boundary	`\Bcat\B` matches the "cat" inside "scatter" but not the standalone word "cat"

Anchors for Full-String Validation

Without anchors, a pattern can match anywhere in the string. With anchors, you enforce the entire string must match:

Pattern: \d+
Text: "abc 123 def"
Matches: "123" (anywhere in the string)

Pattern: ^\d+$
Text: "abc 123 def"
No match (string doesn't consist entirely of digits)

Text: "12345"
Match: "12345" (entire string is digits)

Always use ^ and $ when validating input. Without them, \d{4} would "validate" a phone number by matching any 4-digit substring within it, not the whole value.
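In Python, re.fullmatch is a convenient way to get the same whole-string guarantee. The sketch below also shows one Python-specific detail worth knowing: $ matches just before a trailing newline. The helper name is_all_digits is hypothetical.

```python
import re

def is_all_digits(value: str) -> bool:
    # fullmatch succeeds only if the pattern consumes the entire string
    return re.fullmatch(r"\d+", value) is not None

# search finds the pattern anywhere, so it is the wrong tool for validation
substring_hit = re.search(r"\d+", "abc 123 def").group()    # '123'

# In Python, $ also matches just before a trailing newline
dollar_accepts_newline = re.search(r"^\d+$", "12345\n") is not None    # True
```

If a trailing newline must be rejected, prefer re.fullmatch (or \Z in place of $).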

Word Boundaries in Practice

Text: "the cat scattered the cats"

Pattern: cat          Matches 3 times: "cat", "scat(tered)", "cat(s)"
Pattern: \bcat\b      Matches 1 time: only the standalone word "cat"
Pattern: \bcat        Matches 2 times: "cat" and "cat" in "cats"
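The counts above can be reproduced with re.findall in Python; the last line adds \B for contrast. A minimal sketch on the same sample text:

```python
import re

text = "the cat scattered the cats"

plain = re.findall(r"cat", text)          # 3 hits: "cat", inside "scattered", inside "cats"
whole = re.findall(r"\bcat\b", text)      # 1 hit: the standalone word
prefix = re.findall(r"\bcat", text)       # 2 hits: "cat" and the start of "cats"
inner = re.findall(r"\Bcat\B", text)      # 1 hit: only the "cat" buried inside "scattered"
```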

Groups and Capturing

Grouping with Parentheses

Parentheses group part of a pattern so you can apply a quantifier to the whole group:

Pattern: (ab)+
Matches: "ab", "abab", "ababab"

Pattern: (Mr|Mrs|Ms)\.?\s\w+
Matches: "Mr. Smith", "Mrs Jones", "Ms. Lee"

Capturing Groups

Parentheses also capture the matched text, making it available after the match. This is how you extract structured data:

Pattern: (\d{4})-(\d{2})-(\d{2})
Text: "2024-03-15"

Group 0 (full match): "2024-03-15"
Group 1: "2024"
Group 2: "03"
Group 3: "15"

In JavaScript: match[1], match[2], match[3]. In replacement strings: $1, $2, $3.

Named Capturing Groups

Named groups are supported in Python, JavaScript (ES2018+), and most modern engines, and they are much more readable than numbered groups:

Pattern: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})   (Python)
Pattern: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})       (JavaScript)

Access by name: match.groups.year, match.groups.month in JavaScript; m.group('year'), m.group('month') in Python.

Non-Capturing Groups

Use (?:...) when you need to group but don't need to capture. Slightly faster, and avoids polluting your group numbering:

Pattern: (?:Mr|Mrs|Ms)\.?\s(\w+)
Only group 1 captures the name; the title prefix is grouped but not captured
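The effect on group numbering is easy to see in Python. This sketch compares the capturing and non-capturing versions of the title pattern; the names are sample data.

```python
import re

capturing = re.compile(r"(Mr|Mrs|Ms)\.?\s(\w+)")
non_capturing = re.compile(r"(?:Mr|Mrs|Ms)\.?\s(\w+)")

# With a capturing title group, the surname is pushed to group 2
both = capturing.match("Mr. Smith").groups()            # ('Mr', 'Smith')
# With (?:...), the surname is the only group
only_name = non_capturing.match("Mr. Smith").groups()   # ('Smith',)

# findall returns tuples when there are several groups, plain strings when there is one
pairs = capturing.findall("Mr. Smith and Ms. Lee")      # [('Mr', 'Smith'), ('Ms', 'Lee')]
names = non_capturing.findall("Mr. Smith and Ms. Lee")  # ['Smith', 'Lee']
```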

Backreferences

Reference a previously captured group within the same pattern using \1, \2, etc.:

Pattern: (\w+)\s+\1
Matches repeated words: "the the", "is is", "had had"

Pattern: (['"]).*?\1
Matches quoted strings — ensures opening and closing quotes match
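A small Python sketch of the repeated-word pattern, with word boundaries added so that partial words are not compared. The sentence is made up; the second call uses the same backreference to collapse the duplicates.

```python
import re

text = "It is is what the the report said"

# \1 must repeat exactly what group 1 captured
repeated = re.findall(r"\b(\w+)\s+\1\b", text)          # ['is', 'the']

# Replace each run of a repeated word with a single copy
cleaned = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text)     # 'It is what the report said'
```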

Lookahead and Lookbehind

Lookaheads and lookbehinds are zero-width assertions — they check what's around the match without including it in the match result. This is useful when you want to match something only when it's preceded or followed by something specific.

Syntax	Type	Meaning
`(?=...)`	Positive lookahead	Match if followed by...
`(?!...)`	Negative lookahead	Match if NOT followed by...
`(?<=...)`	Positive lookbehind	Match if preceded by...
`(?<!...)`	Negative lookbehind	Match if NOT preceded by...

Lookahead Examples

// Match a price amount only when followed by "USD"
Pattern: \d+(?=\s*USD)
Text: "100 USD and 50 EUR"
Matches: "100" (not "50" — it's followed by EUR)

// Match whole words not followed by a comma
Pattern: \b\w+\b(?!,)
Text: "one, two"
Matches: "two"
// Without the trailing \b, \w+ can backtrack one character and match "on" in "one,"

Lookbehind Examples

// Match numbers preceded by "$"
Pattern: (?<=\$)\d+
Text: "Pay $100 or €200"
Matches: "100" (not "200" — not preceded by $)

// Match file extensions not preceded by "backup"
Pattern: (?<!backup)\.(jpg|png)$
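The lookahead and lookbehind examples above can be tried in Python as follows. This is a sketch on the sample texts from this section plus two invented filenames. Note that Python's re module requires lookbehind patterns to have a fixed width, which all of these do.

```python
import re

# Lookahead: digits only when followed by optional spaces and "USD"
usd = re.findall(r"\d+(?=\s*USD)", "100 USD and 50 EUR")      # ['100']

# Negative lookahead: without \b the engine backtracks to a shorter match
naive = re.findall(r"\w+(?!,)", "one, two")                   # ['on', 'two']
whole_words = re.findall(r"\b\w+\b(?!,)", "one, two")         # ['two']

# Lookbehind: digits only when preceded by a dollar sign
dollars = re.findall(r"(?<=\$)\d+", "Pay $100 or €200")       # ['100']

# Negative lookbehind: image extension not directly preceded by "backup"
ext = re.compile(r"(?<!backup)\.(jpg|png)$")
keeps = ext.search("holiday.jpg") is not None                 # True
skips = ext.search("holiday_backup.jpg") is not None          # False
```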

Password Validation with Lookaheads

Lookaheads are perfect for rules like "must contain at least one X":

Pattern: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Breakdown:
^                Start of string
(?=.*[a-z])      Must contain at least one lowercase letter
(?=.*[A-Z])      Must contain at least one uppercase letter
(?=.*\d)         Must contain at least one digit
(?=.*[@$!%*?&])  Must contain at least one special character
[A-Za-z\d@$!%*?&]{8,}  8+ characters from the allowed set
$                End of string

Each lookahead is checked independently from position 0, so they all apply to the whole string. The order doesn't matter for correctness, only for readability.
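One drawback of the single combined pattern is that it cannot say which rule failed. A possible alternative, sketched below in Python, checks each condition separately. The RULES table and the missing_rules helper are hypothetical names introduced for this example, not part of any library.

```python
import re

# The combined pattern from this section
COMBINED = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
)

# The same conditions, one per entry, so failures can be reported by name
RULES = {
    "lowercase letter": r"[a-z]",
    "uppercase letter": r"[A-Z]",
    "digit": r"\d",
    "special character": r"[@$!%*?&]",
    "8+ allowed characters": r"\A[A-Za-z\d@$!%*?&]{8,}\Z",
}

def missing_rules(password: str) -> list[str]:
    # Return the names of the rules the password does not satisfy
    return [name for name, pattern in RULES.items()
            if not re.search(pattern, password)]

strong = missing_rules("Passw0rd!")    # []
weak = missing_rules("password")       # ['uppercase letter', 'digit', 'special character']
```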

Greedy vs Lazy Matching

By default, quantifiers are greedy: they match as much text as possible while still allowing the overall pattern to match. Add a ? after a quantifier to make it lazy: it matches as little as possible.

Text: <div>Hello</div><div>World</div>

Greedy: <.*>
Matches: the entire string from first < to last >
Result: "<div>Hello</div><div>World</div>"

Lazy: <.*?>
Matches each tag separately
Results: "<div>", "</div>", "<div>", "</div>"

This is a classic issue when parsing HTML-like structures. The greedy .* swallows everything between the first opening and last closing tag. The lazy .*? stops at the first opportunity.

When to Use Each

Greedy (default): Most of the time. Works correctly when the delimiter after the quantifier is unambiguous.
Lazy: When you need to match repeated short items between delimiters. Extracting quoted strings, HTML tags, parenthesized expressions.

// Extract contents of quoted strings
Pattern: ".*?"   (lazy)
Text: '"hello" and "world"'
Matches: "hello", "world"

Pattern: ".*"    (greedy)
Text: '"hello" and "world"'
Matches: '"hello" and "world"' — one big match including the middle

All quantifiers have a lazy version: *?, +?, ??, {n,m}?.
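The greedy and lazy results above, reproduced in Python. The third pattern is a common alternative that is not covered in this section: a negated character class, which avoids the question by never letting the match cross a closing bracket.

```python
import re

html = "<div>Hello</div><div>World</div>"

greedy = re.findall(r"<.*>", html)       # one match: the whole string
lazy = re.findall(r"<.*?>", html)        # ['<div>', '</div>', '<div>', '</div>']
negated = re.findall(r"<[^>]*>", html)   # same four tags as the lazy version

# A capturing group returns just the text between the quotes
quoted = re.findall(r'"(.*?)"', '"hello" and "world"')    # ['hello', 'world']
```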

Common Patterns

Email Address

Pattern: ^[\w.+-]+@[\w-]+\.[\w.]{2,}$

Breakdown:
^            Start
[\w.+-]+     Local part: word chars, dots, plus, hyphen
@            Literal @
[\w-]+       Domain name
\.           Literal dot
[\w.]{2,}    TLD (2+ chars, allows dots for things like .co.uk)
$            End

Matches: user@example.com, john.doe+filter@company.co.uk
No match: @example.com, user@, user@.com

Note: A regex can check the shape of an email address but can't verify it actually exists or that the domain is valid. For production, validate server-side by sending a confirmation email.
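To make the "shape only" caveat concrete, here is a minimal Python sketch that runs the pattern against the examples above, plus one invented address that is clearly not deliverable but still passes.

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]{2,}$")

def has_email_shape(value: str) -> bool:
    return EMAIL_RE.match(value) is not None

accepted = [e for e in ["user@example.com", "john.doe+filter@company.co.uk"]
            if has_email_shape(e)]
rejected = [e for e in ["@example.com", "user@", "user@.com"]
            if not has_email_shape(e)]

# Shape only: \w includes the underscore, so this nonsense address passes too
nonsense_passes = has_email_shape("a@b.__")    # True
```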

US Phone Number

Pattern: ^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$

Matches all these formats:
(555) 123-4567
555-123-4567
555.123.4567
5551234567
555 123 4567
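Because the three digit blocks are captured, the same pattern can normalize every accepted format to a single form. A sketch in Python; normalize_phone is a hypothetical helper, and the output format is just one reasonable choice.

```python
import re
from typing import Optional

PHONE_RE = re.compile(r"^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$")

def normalize_phone(raw: str) -> Optional[str]:
    # Return the number as 555-123-4567, or None if the shape is wrong
    m = PHONE_RE.match(raw)
    if m is None:
        return None
    return "-".join(m.groups())

formats = ["(555) 123-4567", "555-123-4567", "555.123.4567",
           "5551234567", "555 123 4567"]
normalized = {normalize_phone(f) for f in formats}    # {'555-123-4567'}

# Limitation: the parentheses are independently optional, so this also passes
unbalanced = normalize_phone("(555-123-4567")         # '555-123-4567'
```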

IPv4 Address

Pattern: ^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$

Breakdown:
25[0-5]     250-255
2[0-4]\d    200-249
[01]?\d\d?  0-199

Matches: 192.168.1.1, 0.0.0.0, 255.255.255.255
No match: 999.1.1.1, 1.2.3 (only 3 octets)

URL

Pattern: https?://[\w.-]+(?:/[\w./?%&=-]*)?

Matches: http://example.com, https://sub.domain.com/path?q=test
Note: URL parsing is complex — consider using the URL API instead of regex

ISO Date (YYYY-MM-DD)

Pattern: ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

Matches: 2024-03-15, 2000-01-01
No match: 2024-13-01 (month 13), 2024-00-15 (month 0)
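The pattern checks ranges per field, so it still accepts impossible dates such as 2024-02-31. One way to close that gap, sketched here in Python, is to let datetime.date do the calendar check after the regex has confirmed the shape. The helper name is_real_date is hypothetical.

```python
import re
from datetime import date

ISO_DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def is_real_date(value: str) -> bool:
    # Step 1: the regex checks the YYYY-MM-DD shape and field ranges
    if ISO_DATE_RE.fullmatch(value) is None:
        return False
    # Step 2: date() rejects days that do not exist in that month
    year, month, day = map(int, value.split("-"))
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True

shape_only = ISO_DATE_RE.match("2024-02-31") is not None    # True: the regex alone accepts it
calendar_ok = is_real_date("2024-02-31")                    # False
```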

Credit Card Number (Basic)

Pattern: ^(?:\d{4}[-\s]?){3}\d{4}$

Matches: 4111111111111111, 4111 1111 1111 1111, 4111-1111-1111-1111
Note: This checks format only, not validity — use Luhn for that
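For reference, a compact Python sketch of the Luhn checksum mentioned in the note, combined with the format pattern. The function names are hypothetical; 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
import re

CARD_RE = re.compile(r"^(?:\d{4}[-\s]?){3}\d{4}$")

def luhn_valid(number: str) -> bool:
    # Keep ASCII digits only, ignoring spaces and hyphens
    digits = [int(ch) for ch in number if ch in "0123456789"]
    total = 0
    # Walk from the right, doubling every second digit
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def looks_like_card(value: str) -> bool:
    # Format check first, then the checksum
    return CARD_RE.fullmatch(value) is not None and luhn_valid(value)

good = looks_like_card("4111 1111 1111 1111")         # True
bad_checksum = looks_like_card("1234-5678-9012-3456") # False: right shape, wrong checksum
```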

Hex Color

Pattern: ^#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$

Matches: #FF5733, #fff, #a0b1c2
No match: #GG0000, #12345 (5 digits)

Date Reformatting (MM/DD/YYYY to YYYY-MM-DD)

Pattern: (\d{2})/(\d{2})/(\d{4})
Replace: $3-$1-$2

Input:  03/15/2024
Output: 2024-03-15
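In Python the same replacement uses \1, \2, \3 (or \g<1>) rather than $1, $2, $3. A minimal sketch with re.sub; the word boundaries are an addition to keep the pattern from matching inside longer digit runs, and us_to_iso is a hypothetical helper name.

```python
import re

def us_to_iso(text: str) -> str:
    # Groups: 1 = month, 2 = day, 3 = year
    return re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\1-\2", text)

converted = us_to_iso("Due 03/15/2024 and 12/01/2023")
# 'Due 2024-03-15 and 2023-12-01'
```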

Extract Hashtags

Pattern: #\w+

Text: "Learning #regex is #awesome #programming"
Matches: #regex, #awesome, #programming

Regex in JavaScript, Python, and grep

JavaScript

Regex literals use slashes. Flags go after the closing slash:

const pattern = /^\d{4}-\d{2}-\d{2}$/;
const dateStr = "2024-03-15";

// Test (returns boolean)
pattern.test(dateStr); // true

// Match (returns array or null)
const match = "Price: $42.50".match(/\$(\d+\.\d{2})/);
// match[0] = "$42.50", match[1] = "42.50"

// Replace
"hello world".replace(/\bworld\b/, "regex");
// "hello regex"

// Replace all (use g flag or replaceAll)
"aaa bbb aaa".replace(/aaa/g, "xxx");
// "xxx bbb xxx"

// Extract all matches (matchAll, returns iterator)
const text = "cat 123 dog 456";
const nums = [...text.matchAll(/\d+/g)];
// nums[0][0] = "123", nums[1][0] = "456"

// Common flags:
// i = case insensitive
// g = global (find all matches)
// m = multiline (^ and $ match line boundaries)
// s = dotall (. matches newlines too)

Python

Python uses the re module. Raw strings (r"...") avoid having to double-escape backslashes:

import re

pattern = re.compile(r'^\d{4}-\d{2}-\d{2}$')

# Test
bool(pattern.match("2024-03-15"))  # True

# Search (anywhere in string)
m = re.search(r'\$(\d+\.\d{2})', "Price: $42.50")
m.group(0)  # "$42.50"
m.group(1)  # "42.50"

# Find all
re.findall(r'\d+', "cat 123 dog 456")
# ['123', '456']

# Replace
re.sub(r'\bworld\b', 'regex', "hello world")
# "hello regex"

# Named groups
m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})', "2024-03")
m.group('year')   # "2024"
m.group('month')  # "03"
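Python passes flags as constants from the re module rather than as letters after the pattern. The sketch below shows the counterparts of i, m, and s, plus re.VERBOSE, which allows whitespace and comments inside the pattern. The log text is invented sample data.

```python
import re

log = "INFO start\nerror: disk full\nERROR: retry failed"

# IGNORECASE + MULTILINE: ^ matches at the start of every line
line_starts = re.findall(r"^error", log, flags=re.IGNORECASE | re.MULTILINE)
# ['error', 'ERROR']

# By default the dot does not match a newline; DOTALL changes that
dot_default = re.search(r"start.error", log)                    # None
dot_all = re.search(r"start.error", log, flags=re.DOTALL)       # matches "start\nerror"

# VERBOSE: whitespace is ignored and # starts a comment
date_re = re.compile(r"""
    (?P<year>\d{4}) -     # four-digit year
    (?P<month>\d{2}) -    # two-digit month
    (?P<day>\d{2})        # two-digit day
""", re.VERBOSE)
month = date_re.fullmatch("2024-03-15").group("month")          # '03'
```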

grep (Command Line)

# Basic search
grep "error" logfile.txt

# Case insensitive
grep -i "error" logfile.txt

# Extended regex (enables +, ?, |, grouping)
grep -E "^(ERROR|WARN)" logfile.txt

# Perl-compatible regex (GNU grep only; enables lookaheads, lookbehinds, etc.)
grep -P "(?<=ERROR: ).*" logfile.txt

# Count matches per file
grep -c "pattern" *.log

# Show filenames only
grep -l "pattern" *.txt

# Show line numbers
grep -n "pattern" file.txt

# Invert match (lines NOT matching)
grep -v "debug" logfile.txt

# Recursive search in directory
grep -r "TODO" ./src/

# Common log analysis patterns
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" app.log   # Lines starting with date
grep -E "\b(5[0-9]{2})\b" access.log             # 5xx HTTP status codes

Debugging Regex

Use a Tester

Don't try to debug regex in your head or by running your application. Use a live tester that shows you exactly what's matching and why. The Regex Tester lets you paste text and pattern, see matches highlighted in real time, and inspect groups.

Common Bugs and Fixes

Pattern matches too much:

// Bug: matches "filename.txt.backup" when you wanted just ".txt" files
Pattern: .+\.txt

// Fix: anchor the extension to end of string
Pattern: .+\.txt$

Unescaped special character:

// Bug: . matches any character, so "configXtxt" matches
Pattern: config.txt

// Fix: escape the dot
Pattern: config\.txt

Greedy match swallows too much:

// Bug: extracts from first <b> to last </b>
Pattern: <b>(.*)</b>
Text: "Say <b>hello</b> and <b>goodbye</b>"
Match: "hello</b> and <b>goodbye"

// Fix: lazy quantifier
Pattern: <b>(.*?)</b>
Matches: "hello", then "goodbye"

Missing anchors in validation:

// Bug: accepts "abc123xyz" because \d{3} matches "123" in the middle
Pattern: \d{3}

// Fix: anchor to full string
Pattern: ^\d{3}$

Double-escaped backslash in strings:

// In JavaScript, string "\\d" becomes the two characters \d in regex
// These are equivalent:
const p1 = /\d+/;
const p2 = new RegExp("\\d+");  // String needs double backslash

// Python raw strings avoid this
import re
p = re.compile(r'\d+')  # r"" prefix — backslashes are literal

Break Complex Patterns Apart

If a long regex isn't working, test each piece independently. Build the pattern incrementally — start with the simplest version that partially works, then add complexity one piece at a time.

Check for Catastrophic Backtracking

Patterns like (a+)+$ on input like "aaaaaaaaaaaaaaaaaaaaaaaab" can cause exponential backtracking: when the match fails at the final "b", the engine tries every way of splitting the run of a's between the inner and outer quantifier before giving up. This can hang your application for seconds or longer. Test suspicious patterns with long non-matching inputs and watch for timeouts.
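A rough way to test for this in Python is to time a nested-quantifier pattern against a short non-matching input and compare it with a flattened equivalent. The sketch below keeps the input to 18 characters so it finishes quickly; on a backtracking engine such as Python's re, each extra "a" is expected to roughly double the work for the nested version. timed_match is a hypothetical helper, and actual timings will vary by machine.

```python
import re
import time

def timed_match(pattern: str, text: str) -> tuple[bool, float]:
    # Return whether the pattern matched and how long the attempt took
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

# A run of a's followed by a character that forces the match to fail
bad_input = "a" * 18 + "b"

# Nested quantifiers: many ways to split the a's, all of them tried
nested_ok, nested_seconds = timed_match(r"(a+)+$", bad_input)

# Flattened pattern: accepts the same strings, fails quickly on this input
flat_ok, flat_seconds = timed_match(r"a+$", bad_input)

print(f"nested: {nested_seconds:.4f}s, flat: {flat_seconds:.6f}s")
```

Rewriting nested quantifiers into a single quantifier, as in the second pattern, is usually the simplest fix when the two are equivalent.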

Quick Reference

Characters

.     Any character (not newline)
\d    Digit [0-9]
\D    Non-digit
\w    Word char [a-zA-Z0-9_]
\W    Non-word char
\s    Whitespace
\S    Non-whitespace

Quantifiers

*     0 or more (greedy)
+     1 or more (greedy)
?     0 or 1
{3}   Exactly 3
{3,}  3 or more
{3,5} Between 3 and 5
*?    0 or more (lazy)
+?    1 or more (lazy)

Anchors

^     Start of string/line
$     End of string/line
\b    Word boundary
\B    Non-word boundary

Groups

(...)     Capturing group
(?:...)   Non-capturing group
(?=...)   Positive lookahead
(?!...)   Negative lookahead
(?<=...)  Positive lookbehind
(?<!...)  Negative lookbehind
\1        Backreference to group 1
|         Alternation (or)

Flags

i   Case insensitive
g   Global (all matches)
m   Multiline (^ $ per line)
s   Dotall (. matches \n)
x   Extended (allow whitespace and comments; not available in JavaScript)

Character Classes

[abc]   One of: a, b, c
[a-z]   Any lowercase letter
[^abc]  Not a, b, or c
[a-zA-Z0-9]  Any alphanumeric

Practice Tools

Regex Tester — test patterns with live highlighting

Regex Replacer — find and replace with regex

Frequently Asked Questions

Q: How do I match a whole word only?

Use word boundaries \b around the pattern. Pattern '\bcat\b' matches 'cat' in 'the cat sat' but not in 'category' or 'bobcat'. \b matches the position between a word character (\w) and a non-word character, or between a word character and the start or end of the string.

Q: What do \d, \w, and \s mean?

These are shorthand character classes. \d matches any digit [0-9]. \w matches word characters [a-zA-Z0-9_]. \s matches whitespace (space, tab, newline). Uppercase versions match the opposite: \D (non-digit), \W (non-word), \S (non-whitespace).

Q: Is regex case-sensitive?

Yes, by default regex is case-sensitive. 'Cat' won't match 'cat'. Use the 'i' flag for case-insensitive matching: /cat/i matches 'Cat', 'CAT', 'cat'. In character classes, [a-zA-Z] explicitly matches both cases without flags.