Regular Expressions Mastery Across Languages
Introduction
Prerequisites
| Requirement | Details |
|---|---|
| Basic setup and tooling | Basic setup and tooling |
Figure: Code pattern examples for regular expressions mastery across languages—syntax comparison, idiomatic approaches, performance characteristics, and common pitfalls.
Figure: Best practices implementation for regular expressions mastery across languages—error handling, testing strategies, maintainability patterns, and documentation standards.
Figure: Production readiness checklist for regular expressions mastery across languages—logging, monitoring, performance tuning, and security hardening.
Regular expressions (regex) provide powerful pattern matching for text processing. This guide covers regex syntax (character classes, quantifiers, anchors, groups, capturing, and lookaheads/lookbehinds) with practical examples for validation, extraction, and replacement in JavaScript, Python, and C#.
Basic Patterns
Literal Characters and Metacharacters
Simple matching:
// JavaScript
const text = "Hello World";
// Literal match
/Hello/.test(text); // true
/hello/.test(text); // false (case-sensitive by default)
// Case-insensitive flag
/hello/i.test(text); // true
// Match any single character (.)
/H.llo/.test("Hello"); // true
/H.llo/.test("Hallo"); // true
/H.llo/.test("H123lo"); // false (. matches one char)
// Escape metacharacters
/contoso\.com/.test("contoso.com"); // true
/\$19\.99/.test("$19.99"); // true
// Metacharacters requiring escape: . ^ $ * + ? { } [ ] \ | ( )
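When the text to match comes from a variable rather than a hand-written pattern, escaping by hand is error-prone. As a minimal sketch in Python (the `price` and `text` values are made up for illustration), `re.escape` backslash-escapes the metacharacters for you:

```python
import re

# Hypothetical input that happens to contain metacharacters
price = "$19.99"
text = "Sale: $19.99 (was $29.99)"

# Unescaped: '$' acts as an end anchor and '.' matches any character,
# so the pattern cannot match the literal price.
unescaped = re.search(price, text)

# Escaped: every metacharacter is treated as a literal character.
escaped = re.search(re.escape(price), text)

print(unescaped)        # None
print(escaped.group())  # $19.99
```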
Character Classes
Predefined classes:
# Python
import re
## \d = digit [0-9]
re.search(r'\d+', 'Order 12345') # Matches '12345'
## \w = word character [a-zA-Z0-9_]
re.search(r'\w+', 'hello_world') # Matches 'hello_world'
## \s = whitespace [ \t\n\r\f\v]
re.search(r'\s+', 'hello world') # Matches ' '
## Negated classes:
## \D = non-digit [^0-9]
## \W = non-word character [^a-zA-Z0-9_]
## \S = non-whitespace
## Custom character class
re.search(r'[aeiou]', 'hello') # Matches 'e' (first vowel)
re.search(r'[0-9]', 'abc123') # Matches '1'
re.search(r'[^0-9]', '123abc') # Matches 'a' (first non-digit)
## Ranges
re.search(r'[a-z]+', 'Hello') # Matches 'ello'
re.search(r'[A-Z]+', 'Hello') # Matches 'H'
re.search(r'[a-zA-Z]+', 'Hello123') # Matches 'Hello'
re.search(r'[0-9a-fA-F]+', 'FF00AA') # Matches 'FF00AA' (hex)
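One caveat to the bracketed equivalents in the comments above: for Python 3 `str` patterns, `\d`, `\w`, and `\s` are Unicode-aware by default, so they match more than the ASCII ranges shown. A short sketch of the difference, using the standard `re.ASCII` flag on a made-up sample string:

```python
import re

# 'café' followed by the Arabic-Indic digits for 123 (written as escapes
# so the source file encoding does not matter)
text = "caf\u00e9 \u0661\u0662\u0663"

unicode_words = re.findall(r'\w+', text)           # ['café', '١٢٣']
ascii_words = re.findall(r'\w+', text, re.ASCII)   # ['caf']
unicode_digits = re.findall(r'\d+', text)          # ['١٢٣']
ascii_digits = re.findall(r'\d+', text, re.ASCII)  # []

print(unicode_words, ascii_words, unicode_digits, ascii_digits)
```

If the input must be plain ASCII digits (for example, numeric IDs), `[0-9]` or the `re.ASCII` flag is the safer choice.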
C# examples:
using System.Text.RegularExpressions;
// Character class matching
Regex.IsMatch("Hello123", @"[a-z]+"); // true ("ello" matches; IsMatch searches anywhere in the input)
Regex.IsMatch("Hello123", @"[a-zA-Z]+"); // true
Regex.IsMatch("user@contoso.com", @"[\w@.]+"); // true
// Extract digits
var match = Regex.Match("Price: $199.99", @"\d+\.\d+");
Console.WriteLine(match.Value); // "199.99"
Quantifiers
Repetition Patterns
Basic quantifiers:
// JavaScript
const patterns = {
  '*': 'Zero or more',
  '+': 'One or more',
  '?': 'Zero or one (optional)',
  '{n}': 'Exactly n times',
  '{n,}': 'At least n times',
  '{n,m}': 'Between n and m times'
};
// Examples
/\d+/.test('123'); // true - one or more digits
/\d*/.test(''); // true - zero or more digits
/colou?r/.test('color'); // true - 'u' is optional
/colou?r/.test('colour');// true
// Specific counts
/\d{4}/.test('2025'); // true - exactly 4 digits
/\d{2,4}/.test('99'); // true - 2 to 4 digits
/\d{2,4}/.test('12345'); // true - matches first 4
/\w{3,}/.test('hello'); // true - at least 3 word chars
// Phone number pattern
/\d{3}-\d{3}-\d{4}/.test('555-123-4567'); // true
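Note that `test` succeeds when the pattern matches anywhere in the input, which is why `'12345'` passes `\d{2,4}` above. A small Python sketch of the same distinction, contrasting `re.search` (match anywhere) with `re.fullmatch` (the whole string must match):

```python
import re

pattern = r'\d{2,4}'

partial = re.search(pattern, '12345')   # matches the first four digits
whole = re.fullmatch(pattern, '12345')  # None: five digits is one too many
exact = re.fullmatch(pattern, '2025')   # matches the whole string

print(partial.group())  # 1234
print(whole)            # None
print(exact.group())    # 2025
```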
Greedy vs lazy (non-greedy):
## Python
import re
text = "<div>Content</div><div>More</div>"
## Greedy (default) - matches as much as possible
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group()) # '<div>Content</div><div>More</div>'
## Lazy (non-greedy) - matches as little as possible
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group()) # '<div>Content</div>'
## Password validation (8-20 chars)
pattern = r'^.{8,20}$'
re.match(pattern, 'password123') # Valid
re.match(pattern, 'short') # None (too short)
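To collect every element rather than only the first, the lazy pattern can be combined with `re.findall`. The sketch below also shows a negated character class as an alternative to `.*?`; it assumes the content between the tags contains no `<` character.

```python
import re

text = "<div>Content</div><div>More</div>"

# Lazy quantifier: stop at the first closing tag each time
lazy_all = re.findall(r'<div>(.*?)</div>', text)

# Negated class: match anything except '<', so the match cannot run past a tag
negated_all = re.findall(r'<div>([^<]*)</div>', text)

print(lazy_all)     # ['Content', 'More']
print(negated_all)  # ['Content', 'More']
```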
Anchors and Boundaries
Position Matching
Start and end anchors:
// JavaScript
// ^ = start of string
// $ = end of string
/^Hello/.test('Hello World'); // true
/^Hello/.test('Say Hello'); // false
/World$/.test('Hello World'); // true
/World$/.test('World is big'); // false
// Exact match (start + end)
/^Hello World$/.test('Hello World'); // true
/^Hello World$/.test('Hello World!'); // false
/^Hello World$/.test('Say Hello World'); // false
// Validate format exactly
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
emailPattern.test('user@contoso.com'); // true
emailPattern.test('invalid email'); // false
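Anchors behave slightly differently per engine. In Python, for instance, `$` also matches just before a trailing newline, so `^...$` is not a strict whole-string check. A short sketch of the difference, using `\Z` or `re.fullmatch` when the entire string must match (the sample value is made up):

```python
import re

value = "12345\n"  # e.g. a line read from a file, newline still attached

loose = re.search(r'^\d+$', value)    # matches: '$' tolerates the final '\n'
strict = re.search(r'^\d+\Z', value)  # None: '\Z' matches only at the very end
full = re.fullmatch(r'\d+', value)    # None: the newline is not a digit

print(loose is not None)  # True
print(strict)             # None
print(full)               # None
```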
Word boundaries:
## Python
import re
## \b = word boundary (between \w and \W)
## \B = non-word boundary
text = "The cat in the cathedral"
## Match whole word 'cat'
re.search(r'\bcat\b', text) # Matches 'cat' (standalone)
re.search(r'\bcat\b', 'cathedral') # None (part of word)
## Find all whole words
words = re.findall(r'\b\w+\b', "Hello, world! How are you?")
print(words) # ['Hello', 'world', 'How', 'are', 'you']
## Replace whole word only
result = re.sub(r'\bcat\b', 'dog', text)
print(result) # "The dog in the cathedral"
Groups and Capturing
Parentheses for Grouping
Capturing groups:
// JavaScript
// ( ) = capturing group
const text = "John Doe (555-1234)";
const pattern = /(\w+) (\w+) \((\d{3}-\d{4})\)/;
const match = text.match(pattern);
console.log(match[0]); // "John Doe (555-1234)" - full match
console.log(match[1]); // "John" - first capture group
console.log(match[2]); // "Doe" - second capture group
console.log(match[3]); // "555-1234" - third capture group
// Named capturing groups (ES2018)
const namedPattern = /(?<firstName>\w+) (?<lastName>\w+) \((?<phone>[\d-]+)\)/;
const namedMatch = text.match(namedPattern);
console.log(namedMatch.groups.firstName); // "John"
console.log(namedMatch.groups.lastName); // "Doe"
console.log(namedMatch.groups.phone); // "555-1234"
Python named groups:
## Python
import re
## Named groups with ?P<name>
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, 'Date: 2025-05-15')
print(match.group('year')) # '2025'
print(match.group('month')) # '05'
print(match.group('day')) # '15'
## Access as dictionary
print(match.groupdict())
## {'year': '2025', 'month': '05', 'day': '15'}
## Extract email components
email_pattern = r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)\.(?P<tld>\w+)'
email_match = re.search(email_pattern, 'user@contoso.com')
print(email_match.group('user')) # 'user'
print(email_match.group('domain')) # 'contoso'
print(email_match.group('tld')) # 'com'
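Named groups can also be referenced later: `\g<name>` in a replacement string and `(?P=name)` inside the pattern itself. A brief sketch using the date pattern from above and a made-up sentence:

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# \g<name> inserts the text captured by a named group
us_style = date_pattern.sub(r'\g<month>/\g<day>/\g<year>', 'Date: 2025-05-15')
print(us_style)  # Date: 05/15/2025

# (?P=name) matches the same text the named group captured earlier
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'this is is a test')
print(doubled.group('word'))  # is
```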
C# named groups:
// C#
using System.Text.RegularExpressions;
var pattern = @"(?<area>\d{3})-(?<exchange>\d{3})-(?<number>\d{4})";
var match = Regex.Match("555-123-4567", pattern);
if (match.Success)
{
    Console.WriteLine(match.Groups["area"].Value); // "555"
    Console.WriteLine(match.Groups["exchange"].Value); // "123"
    Console.WriteLine(match.Groups["number"].Value); // "4567"
}
Non-capturing groups:
// (?: ) = non-capturing group (for grouping without capturing)
// Without non-capturing group
const withCapture = /(\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withCapture); // ['555-123-4567', '555', '123', '4567']
// With non-capturing group
const withoutCapture = /(?:\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withoutCapture); // ['555-123-4567', '123', '4567']
// Useful for alternation
/(https?|ftp):\/\//.test('https://contoso.com'); // true
/(?:https?|ftp):\/\//.test('ftp://files.com'); // true
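The choice between capturing and non-capturing groups also changes what some APIs return. As an illustration in Python, `re.findall` returns the group contents when the pattern has a capturing group, and the whole match otherwise:

```python
import re

text = "https://contoso.com and ftp://files.com"

# One capturing group: findall returns only what the group captured
with_capture = re.findall(r'(https?|ftp)://\S+', text)

# Non-capturing group: findall returns the full match
without_capture = re.findall(r'(?:https?|ftp)://\S+', text)

print(with_capture)     # ['https', 'ftp']
print(without_capture)  # ['https://contoso.com', 'ftp://files.com']
```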
Lookaheads and Lookbehinds
Zero-Width Assertions
Positive lookahead (?=):
// JavaScript
// (?= ) = positive lookahead (match if followed by pattern)
// Password must contain digit
/^(?=.*\d).{8,}$/.test('password123'); // true
/^(?=.*\d).{8,}$/.test('password'); // false
// Password must contain uppercase AND lowercase AND digit
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('Pass1234'); // true
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('password1'); // false
// Extract word before comma
/\w+(?=,)/.exec('apple,banana,orange'); // ['apple']
Negative lookahead (?!):
## Python
import re
## (?! ) = negative lookahead (match if NOT followed by pattern)
## Find 'q' not followed by 'u'
pattern = r'q(?!u)'
re.findall(pattern, 'Iraq Qatar queue') # ['q'] (only in Iraq)
## Username: letters, digits, or underscores, but cannot start with a digit
username_pattern = r'^(?!\d)[a-zA-Z0-9_]{3,16}$'
re.match(username_pattern, 'user123') # Valid
re.match(username_pattern, '123user') # None (starts with digit)
Positive lookbehind (?<=):
## Python
## (?<= ) = positive lookbehind (match if preceded by pattern)
## Find price (digits after $)
pattern = r'(?<=\$)\d+(?:\.\d{2})?'
re.findall(pattern, 'Items: $19.99, $5, $150.00')
## ['19.99', '5', '150.00']
## Extract @mentions (alphanumeric after @)
mentions_pattern = r'(?<=@)\w+'
text = "Hello @alice and @bob_123!"
re.findall(mentions_pattern, text) # ['alice', 'bob_123']
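A limitation worth knowing: Python's built-in `re` module only accepts fixed-width lookbehinds (true of current releases at the time of writing). The sketch below shows the error raised for a variable-width lookbehind and one workaround, a capturing group, on a made-up input string:

```python
import re

text = "Items: $ 19.99, $5"

try:
    re.compile(r'(?<=\$\s*)\d+')  # '\s*' makes the lookbehind variable-width
    fixed_width_only = False
except re.error:
    fixed_width_only = True

# Workaround: match the prefix normally and capture only the number
prices = re.findall(r'\$\s*(\d+(?:\.\d{2})?)', text)

print(fixed_width_only)  # True
print(prices)            # ['19.99', '5']
```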
Negative lookbehind (?<!):
// C#
using System.Text.RegularExpressions;
// (?<! ) = negative lookbehind (match if NOT preceded by pattern)
// Find digits not preceded by $
// (the lookbehind also excludes a preceding digit; otherwise "00" from "$100" would match)
var pattern = @"(?<![\$\d])\d+";
var matches = Regex.Matches("Price: $100 and 50 items", pattern);
// "100" in "$100" is skipped; "50" is matched
foreach (Match match in matches)
{
    Console.WriteLine(match.Value); // "50"
}
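A subtlety with this kind of pattern: a negative lookbehind is checked at every position, so the engine can skip the first digit after `$` and still match the remaining digits. A Python sketch of the pitfall and one way to avoid it (also forbidding a preceding digit):

```python
import re

text = "Price: $100 and 50 items"

# Only '$' is forbidden before the match, so matching can start at the second digit
naive = re.findall(r'(?<!\$)\d+', text)

# Forbid '$' or a digit before the match, so a number is matched whole or not at all
guarded = re.findall(r'(?<![$\d])\d+', text)

print(naive)    # ['00', '50']
print(guarded)  # ['50']
```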
Practical Examples
Email Validation
Basic email pattern:
// JavaScript
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
// Valid emails
emailPattern.test('user@contoso.com'); // true
emailPattern.test('john.doe@company.org'); // true
emailPattern.test('test_123@sub.domain.co.uk'); // true
// Invalid emails
emailPattern.test('invalid'); // false
emailPattern.test('@contoso.com'); // false
emailPattern.test('user@'); // false
// More comprehensive email validation
const strictEmail = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
Phone Number Formats
Multiple formats:
## Python
import re
def validate_phone(phone):
    """Validate US phone number in various formats."""
    patterns = [
        r'^\d{3}-\d{3}-\d{4}$',       # 555-123-4567
        r'^\(\d{3}\) \d{3}-\d{4}$',   # (555) 123-4567
        r'^\d{10}$',                  # 5551234567
        r'^\+1-\d{3}-\d{3}-\d{4}$',   # +1-555-123-4567
    ]
    return any(re.match(pattern, phone) for pattern in patterns)
## Test
print(validate_phone('555-123-4567'))    # True
print(validate_phone('(555) 123-4567'))  # True
print(validate_phone('5551234567'))      # True
print(validate_phone('invalid'))         # False
## Extract and normalize phone numbers
def extract_phone(text):
    """Extract phone number and normalize to XXX-XXX-XXXX format."""
    pattern = r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})'
    match = re.search(pattern, text)
    if match:
        return f'{match.group(1)}-{match.group(2)}-{match.group(3)}'
    return None
print(extract_phone('Call me at (555) 123-4567'))  # '555-123-4567'
print(extract_phone('Phone: 555.123.4567'))        # '555-123-4567'
## URL Parsing
**Extract URL components:**
```javascript
// JavaScript
const urlPattern = /^(https?):\/\/([^:\/\s]+)(?::(\d+))?(\/[^\s]*)?$/;
const url = 'https://contoso.com:8080/path/to/page?query=value';
const match = url.match(urlPattern);
if (match) {
  console.log('Protocol:', match[1]); // 'https'
  console.log('Domain:', match[2]); // 'contoso.com'
  console.log('Port:', match[3]); // '8080'
  console.log('Path:', match[4]); // '/path/to/page?query=value'
}
// Extract all URLs from text
const text = "Visit https://contoso.com or http://test.org for more info";
const urls = text.match(/https?:\/\/[^\s]+/g);
console.log(urls); // ['https://contoso.com', 'http://test.org']
```
Data Extraction
Parse log files:
## Python
import re
from datetime import datetime
log_pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)'
log_lines = [
    '2025-05-05 14:30:00 [INFO] Application started',
    '2025-05-05 14:30:15 [ERROR] Database connection failed',
    '2025-05-05 14:30:20 [WARN] Retrying connection',
]
for line in log_lines:
    match = re.match(log_pattern, line)
    if match:
        timestamp = datetime.strptime(match.group('timestamp'), '%Y-%m-%d %H:%M:%S')
        level = match.group('level')
        message = match.group('message')
        print(f'{level}: {message} at {timestamp}')
**Extract data from HTML:**
```csharp
// C#
using System.Text.RegularExpressions;
// Extract all links from HTML
var html = @"
    <a href='/home'>Home</a>
    <a href='https://contoso.com'>Example</a>
    <a href='/contact'>Contact</a>
";
var linkPattern = @"<a\s+href=['""]([^'""]+)['""]>([^<]+)</a>";
var matches = Regex.Matches(html, linkPattern);
foreach (Match match in matches)
{
    var url = match.Groups[1].Value;
    var text = match.Groups[2].Value;
    Console.WriteLine($"{text}: {url}");
}
// Output:
// Home: /home
// Example: https://contoso.com
// Contact: /contact
```
String Replacement
Find and replace:
// JavaScript
// Simple replacement
'hello world'.replace(/world/, 'JavaScript'); // 'hello JavaScript'
// Global replacement (all occurrences)
'foo bar foo'.replace(/foo/g, 'baz'); // 'baz bar baz'
// Case-insensitive replacement
'Hello WORLD'.replace(/world/gi, 'JavaScript'); // 'Hello JavaScript'
// Replacement with capturing groups
const date = '2025-05-15';
const formatted = date.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');
console.log(formatted); // '05/15/2025'
// Replacement with function
const text = 'Total: $100, Tax: $8, Shipping: $5';
const doubled = text.replace(/\$(\d+)/g, (match, amount) => {
  return '$' + (parseInt(amount) * 2);
});
console.log(doubled); // 'Total: $200, Tax: $16, Shipping: $10'
Python substitution:
## Python
import re
## Simple substitution
re.sub(r'apple', 'orange', 'I like apple pie') # 'I like orange pie'
## Using captured groups
text = 'Name: John Doe, Age: 30'
result = re.sub(r'Name: (\w+) (\w+)', r'Name: \2, \1', text)
print(result) # 'Name: Doe, John, Age: 30'
## Substitution with function
def uppercase_match(match):
    return match.group().upper()
text = 'hello world from python'
result = re.sub(r'\b\w+\b', uppercase_match, text)
print(result) # 'HELLO WORLD FROM PYTHON'
## Remove HTML tags
html = '<p>Hello <strong>world</strong>!</p>'
clean = re.sub(r'<[^>]+>', '', html)
print(clean) # 'Hello world!'
## Language-Specific Features
### JavaScript Flags
```javascript
// i = case-insensitive
/hello/i.test('HELLO'); // true
// g = global (find all matches)
'foo bar foo'.match(/foo/g); // ['foo', 'foo']
// m = multiline (^ and $ match line boundaries)
const text = 'Line 1\nLine 2';
text.match(/^Line/gm); // ['Line', 'Line']
// s = dotAll (. matches newlines)
/hello.world/s.test('hello\nworld'); // true
// u = unicode
/\u{1F600}/u.test('😀'); // true
// y = sticky (matches at exact position)
const pattern = /foo/y;
pattern.lastIndex = 4;
pattern.test('foo foo'); // true (matches at position 4)
```
Python re Module
import re
## Compile pattern for reuse
pattern = re.compile(r'\d+')
pattern.findall('123 abc 456') # ['123', '456']
## Verbose mode (comments and whitespace ignored)
email_pattern = re.compile(r'''
    [\w.-]+   # username
    @         # at symbol
    [\w.-]+   # domain
    \.        # dot
    \w{2,}    # TLD
''', re.VERBOSE)
## Methods
re.search(pattern, string) # Find first match
re.match(pattern, string) # Match at start
re.findall(pattern, string) # Find all matches (list)
re.finditer(pattern, string) # Find all matches (iterator)
re.sub(pattern, repl, string) # Replace
re.split(pattern, string) # Split by pattern
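A short runnable sketch of those calls on a made-up string, showing what each one returns (a compiled pattern exposes the same methods as the module-level functions):

```python
import re

pattern = re.compile(r'\d+')
text = "a1b22c333"

first = pattern.search(text).group()                # '1'
at_start = pattern.match(text)                      # None: text starts with 'a'
all_matches = pattern.findall(text)                 # ['1', '22', '333']
spans = [m.span() for m in pattern.finditer(text)]  # [(1, 2), (3, 5), (6, 9)]
replaced = pattern.sub('#', text)                   # 'a#b#c#'
parts = pattern.split(text)                         # ['a', 'b', 'c', '']

print(first, at_start, all_matches, spans, replaced, parts)
```

Note the trailing empty string from `split`: the input ends with a match, so there is an empty piece after it.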
C# Regex Options
Figure: Visual Studio C# – CodeLens, refactoring, and build output.
using System.Text.RegularExpressions;
// RegexOptions enumeration
var pattern = @"hello";
// Case-insensitive
Regex.IsMatch("HELLO", pattern, RegexOptions.IgnoreCase);
// Multiline
var text = "Line 1\nLine 2";
Regex.Matches(text, @"^Line", RegexOptions.Multiline);
// Compiled (faster for repeated use)
var compiled = new Regex(@"\d+", RegexOptions.Compiled);
// Timeout (prevent catastrophic backtracking)
var regex = new Regex(@"a+b+c+", RegexOptions.None, TimeSpan.FromSeconds(1));
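The timeout above is specific to .NET. Python's standard `re` module has no timeout parameter, so the usual defence there is to rewrite the pattern so it cannot backtrack excessively. A hedged sketch with an illustrative pattern pair: `(a+)+b` nests one quantifier inside another, while `a+b` accepts the same strings without the nesting. The failing input is kept short on purpose, because the nested version slows down sharply as it grows.

```python
import re

nested = re.compile(r'(a+)+b')  # nested quantifiers: many ways to split the a's
flat = re.compile(r'a+b')       # accepts the same strings, one way to match

ok_input = 'aaaa' + 'b'
bad_input = 'a' * 16 + 'c'      # no 'b', so both patterns must fail

# Both patterns agree on a matching input...
same_on_match = bool(nested.fullmatch(ok_input)) and bool(flat.fullmatch(ok_input))

# ...and on a non-matching one, but the nested pattern does far more work to fail
same_on_failure = nested.search(bad_input) is None and flat.search(bad_input) is None

print(same_on_match, same_on_failure)  # True True
```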
Best Practices
- Start Simple: Begin with basic patterns, add complexity gradually
- Test Thoroughly: Use regex testers (regex101.com, regexr.com)
- Use Non-Capturing Groups: (?:) when you don't need to capture
- Watch Greedy Quantifiers: Prefer lazy quantifiers (.*?) when matching delimited text such as HTML/XML tags
- Escape Metacharacters: Escape . $ ^ * + ? { } [ ] \ | ( ) whenever you need to match them literally
- Comment Complex Patterns: Use verbose mode in Python, comments in code
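The "test thoroughly" advice can also be automated alongside an online tester. A minimal sketch of a table-driven check in Python, reusing the basic email pattern from this guide; the case table is illustrative and would normally live in a unit test:

```python
import re

# Basic email pattern from this guide; fullmatch makes the anchors unnecessary
EMAIL = re.compile(r'[\w.-]+@[\w.-]+\.\w{2,}')

# Input -> whether it should be accepted
cases = {
    'user@contoso.com': True,
    'john.doe@company.org': True,
    'invalid': False,
    '@contoso.com': False,
    'user@': False,
}

failures = [s for s, expected in cases.items()
            if bool(EMAIL.fullmatch(s)) != expected]

print(failures)  # []
```

Adding each newly discovered edge case to the table keeps later pattern changes from silently breaking earlier behavior.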
Validation and Versioning
- Last validated: April 2026
Key Takeaways
- Character classes ([a-z], \d, \w) match specific character sets
- Quantifiers (*, +, ?, {n,m}) control repetition
- Anchors (^, $, \b) match positions, not characters
- Groups () capture submatches, (?:) groups without capturing
- Lookaheads/lookbehinds (?=, ?!, ?<=, ?<!) enable zero-width assertions
- Named groups improve readability and maintenance
Next Steps
- Learn atomic groups (?>...) for performance optimization
- Explore Unicode properties (\p{L}, \p{N}) for international text
- Master conditional patterns (?(condition)yes|no)
- Study catastrophic backtracking and prevention strategies
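As a taste of the conditional syntax, Python's `re` supports `(?(1)yes|no)`, which tests whether group 1 took part in the match. A small sketch (the pattern and inputs are made up) that accepts a word either bare or wrapped in angle brackets, but not half-wrapped:

```python
import re

# Group 1 is the optional '<'; the conditional requires '>' only if group 1 matched
pattern = re.compile(r'^(<)?\w+(?(1)>)$')

results = {s: bool(pattern.match(s)) for s in ['word', '<word>', '<word', 'word>']}

print(results)
# {'word': True, '<word>': True, '<word': False, 'word>': False}
```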
Match patterns, not headaches.