What is a regular expression?

Part of this discussion is based on page 94 of "Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman.

A regular expression is a pattern, written as a sequence of symbols, that represents a state machine or mini-program capable of matching particular sequences of characters. Regular expressions have their roots in lexical analysis and tokenization, where a set of lexemes has to be recognized before being passed on to a parser. Since then, regular expressions have taken on a life of their own, appearing in such languages as AWK, TCL, and of course Perl, for all sorts of textual data extraction and manipulation purposes.

The most basic regular expression syntax consists of 4 operations. Let A and B each represent an alphabet (a set of characters) and s and t represent members of those alphabets.

Operation                  Representation   Meaning
Union of A and B           A|B              s is such that s is in A or s is in B
Concatenation of A and B   AB               st are such that s is in A and t is in B
Kleene closure of A        A*               Zero or more concatenations of A
Positive closure of A      A+               One or more concatenations of A
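
The four operations in the table can be tried out in most regular expression engines. The following is a minimal sketch using Python's re module (an assumption made for illustration; the discussion here does not prescribe a language), with the single characters a and b standing in for the alphabets A and B. The helper name accepts is hypothetical.

```python
import re

# Union: the string is either "a" or "b".
union = re.compile(r"a|b")
# Concatenation: "a" followed immediately by "b".
concat = re.compile(r"ab")
# Kleene closure: zero or more repetitions of "a" (the empty string matches).
kleene = re.compile(r"a*")
# Positive closure: one or more repetitions of "a" (the empty string does not match).
positive = re.compile(r"a+")

def accepts(pattern, text):
    """Return True when the whole of text matches pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None

print(accepts(union, "b"))       # True
print(accepts(concat, "ab"))     # True
print(accepts(kleene, ""))       # True
print(accepts(positive, ""))     # False
print(accepts(positive, "aaa"))  # True
```

The only difference between the two closures is whether the empty string is accepted.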

Using this notation you can define a regular expression for positive integers as follows:

digit+

Here digit represents the set of characters 0 through 9. A range of characters like this can be represented in most regular expression languages as [0-9]. Because this is such a common expression, some languages provide a shorthand escape sequence for it: \d.
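
As a rough illustration, the digit+ expression might be written as follows in Python's re module (an assumption for the sake of example; other engines use the same [0-9] and \d notation). The names POSITIVE_INTEGER and is_digit_string are hypothetical. In Python, \d on text strings also matches non-ASCII digits unless the re.ASCII flag is given, so the flag is used here to keep \d equivalent to [0-9].

```python
import re

# [0-9]+ is the positive closure of the digit set: one or more digits.
POSITIVE_INTEGER = re.compile(r"[0-9]+")

def is_digit_string(text):
    """Return True when text consists of one or more digits and nothing else."""
    return POSITIVE_INTEGER.fullmatch(text) is not None

print(is_digit_string("2024"))  # True
print(is_digit_string(""))      # False
print(is_digit_string("12a"))   # False

# The \d shorthand finds the same digit runs as [0-9] when restricted to ASCII.
sample = "room 101, floor 7"
with_range = re.findall(r"[0-9]+", sample)
with_shorthand = re.findall(r"\d+", sample, flags=re.ASCII)
print(with_range)      # ['101', '7']
print(with_shorthand)  # ['101', '7']
```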

Learning a new regular expression language is quite simple once you've learned one, because most of the operations are the same. Only the notation changes.

Perl5 regular expressions

Here we summarize the syntax of Perl5 regular expressions. For a definitive reference, however, you should consult the perlre man page that accompanies the Perl5 distribution and also the book Programming Perl, 2nd Edition from O'Reilly & Associates. We need to point out here that, for efficiency reasons, the character set operator [...] is limited to 8-bit characters (Unicode characters 0 through 255). Other than that restriction, all Unicode characters should be usable in the package's regular expressions.

By default, a quantified subpattern is greedy. In other words, it matches as many times as possible while still allowing the rest of the pattern to match. To make a quantifier match the minimum number of times possible instead, again while still allowing the rest of the pattern to match, place a "?" immediately after the quantifier.

*?
Match 0 or more times
+?
Match 1 or more times
??
Match 0 or 1 time
{n}?
Match exactly n times
{n,}?
Match at least n times
{n,m}?
Match at least n but not more than m times
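
The difference between greedy and minimal matching is easiest to see side by side. The sketch below uses Python's re module, which is assumed here to follow the Perl5 behavior described above for these quantifiers; the sample strings and variable names are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)
# Minimal: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group(0)

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# The same applies to counted quantifiers.
greedy_count = re.search(r"\d{2,4}", "123456").group(0)
lazy_count = re.search(r"\d{2,4}?", "123456").group(0)

print(greedy_count)  # 1234
print(lazy_count)    # 12
```

Note that {n}? behaves the same as {n}, since an exact count leaves no choice between more and fewer repetitions.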

Perl5 extended regular expressions are fully supported.

(?#text)
An embedded comment causing text to be ignored.
(?:regexp)
Groups things like "()" but doesn't cause the group match to be saved.
(?=regexp)
A zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including whitespace in the MatchResult.
(?!regexp)
A zero-width negative lookahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match ad: the a is followed by a character that is not b (the d), and because the assertion consumes no input, that same d then matches the d in the pattern.
(?imsx)
One or more embedded pattern-match modifiers. i enables case insensitivity, m enables multiline treatment of the input, s enables single line treatment of the input, and x enables extended mode, in which whitespace in the pattern is ignored and comments are allowed.
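
The extended constructs listed above can be exercised with a short script. This is only a sketch in Python's re module, which is assumed to accept the same syntax for these particular constructs; it is not the package described here, and the sample strings and variable names are invented for illustration. Python requires the embedded modifiers to appear at the start of the pattern, so they are placed there.

```python
import re

# (?#text): an embedded comment; it does not change what is matched.
commented = re.findall(r"\d+(?#one or more digits)", "a1 b22")
print(commented)  # ['1', '22']

# (?:regexp): groups the alternation without saving it, so only the name is saved.
title_match = re.search(r"(?:Mr|Ms)\. (\w+)", "Dear Ms. Lovelace,")
saved_groups = title_match.groups()
print(saved_groups)  # ('Lovelace',)

# (?=regexp): a word followed by whitespace; the whitespace is not in the match.
words = re.findall(r"\w+(?=\s)", "one two three")
print(words)  # ['one', 'two']

# (?!regexp): "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(foo_starts)  # [7]

# The assertion is zero-width, so a(?!b)d matches "ad".
zero_width = re.search(r"a(?!b)d", "ad").group(0)
print(zero_width)  # ad

# (?i): case-insensitive matching.
any_case = re.findall(r"(?i)perl", "Perl PERL perl")
print(any_case)  # ['Perl', 'PERL', 'perl']

# (?m): ^ matches at the start of every line, not only the start of the input.
line_starts = re.findall(r"(?m)^\w+", "first line\nsecond line")
print(line_starts)  # ['first', 'second']

# (?s): . also matches a newline.
dot_newline = re.search(r"(?s)a.b", "a\nb") is not None
print(dot_newline)  # True

# (?x): whitespace in the pattern is ignored and # starts a comment.
extended = re.findall(r"(?x) \d+ \s [a-z]+  # number, space, word",
                      "10 apples, 3 pears")
print(extended)  # ['10 apples', '3 pears']
```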