Friday, December 11, 2009

Perl Regular Expression

SkyHi @ Friday, December 11, 2009

How it is used

  • Test whether a string, or a substring of it, matches a pattern.
      • For example, check whether user input in a form is all digits, or follows a legal phone number, credit card number, or date pattern.
  • Replace or substitute text that matches a pattern in a string.
      • For example, remove all tags from a web page and leave only the text content.
  • Extract a substring from a string based on a text pattern.
      • For example, given a URL, extract the protocol, domain name, port number, and URI fields for further processing such as web crawling, web indexing/searching, or copying web pages for offline reading.
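
The three uses listed above can be sketched in a few lines of Perl. This is a minimal illustration rather than a definitive implementation: the sub names (is_all_digits, strip_tags, split_url) and the port 8080 in the sample URL are made up for the example, and the tag-stripping pattern is deliberately naive (it is not a real HTML parser).

```perl
use strict;
use warnings;

# 1. Test: does the input consist only of digits?
sub is_all_digits {
    my ($input) = @_;
    return $input =~ /\A\d+\z/ ? 1 : 0;
}

# 2. Substitute: delete anything that looks like a tag (naive approach).
sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# 3. Extract: split a URL into protocol, host, optional port, and path.
#    Returns an empty list if the URL does not fit the pattern.
sub split_url {
    my ($url) = @_;
    if ($url =~ m{\A(\w+)://([^/:]+)(?::(\d+))?(/.*)?\z}) {
        return ($1, $2, $3, $4);
    }
    return;
}

print is_all_digits("7195551234") ? "digits only\n" : "not digits only\n";
print strip_tags("<p>Hello <b>world</b></p>"), "\n";

# The port number here is a hypothetical value added for illustration.
my ($proto, $host, $port, $path) =
    split_url("http://cs.uccs.edu:8080/~cs301/testreg.html");
print "$proto $host $port $path\n";
```

Note that the port and path groups are optional, so $port or $path may be undefined for URLs that omit them.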

Web Page for Testing Your Regular Expressions with Provided Data

http://cs.uccs.edu/~cs301/testreg.html

Reference:

  • Mastering Regular Expressions by Jeff Friedl, O'Reilly.
  • perlre man page ("man perlre")

Perl Metacharacter Summary

Items that match a single character

. dot Match any one character (except newline, unless the /s modifier is used)
[...] character class Match any character listed
[^...] negated character class Match any character not listed
\t tab Match HT or TAB character
\n new line Match LF or NL character
\r return Match CR character
\f form feed Match FF (Form Feed) character
\a alarm Match BELL character
\e escape Match ESC character
\nnn Character in octal, e.g. \033 Match equivalent character
\xnn Character in hexadecimal, e.g. \x1B Match equivalent character
\cX Control character, e.g. \cA for Ctrl-A Match control character
\l lowercase next character
\u uppercase next character
\L lowercase characters till \E
\U uppercase characters till \E
\E end case modification
\Q quote (disable) pattern metacharacters till \E

Example 1: character class
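# $string is assumed to hold the text to test.
# [01][0-9] matches two adjacent digits in the range 00 to 19, anywhere in the string.
if ($string =~ /[01][0-9]/) {
    print "$string contains a two-digit number from 00 to 19\n";
} else {
    print "$string does not contain a two-digit number from 00 to 19\n";
}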

Example 2: negated character class
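# [^A-Za-z] matches any character that is not an ASCII letter.
# (The range A-z would also cover the punctuation [ \ ] ^ _ ` that sits between Z and a.)
if ($string =~ /[^A-Za-z]/) { print "$string contains non-letter characters\n" }
else { print "$string does not contain non-letter characters.\n" }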
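
Examples 1 and 2 cover character classes; the escape and quoting items from the table can be tried in the same way. The following is a small sketch with made-up sample strings and a hypothetical helper named contains_literal. It shows a character escape in a pattern, \Q...\E protecting an interpolated variable, and the case-modification escapes in a double-quoted string.

```perl
use strict;
use warnings;

# \t inside a pattern matches a literal TAB character.
my $line = "name\tvalue";
print "line contains a tab\n" if $line =~ /\t/;

# \x1B and \e both denote the ESC character.
my $esc = "\x1B";
print "ESC matched\n" if $esc =~ /\e/;

# \Q...\E disables metacharacters in the interpolated variable, so the
# "." and "+" in $needle are matched literally.
sub contains_literal {
    my ($text, $needle) = @_;
    return $text =~ /\Q$needle\E/ ? 1 : 0;
}
print contains_literal("x a.b+c y", "a.b+c") ? "found\n" : "not found\n";
print contains_literal("x aXbbc y", "a.b+c") ? "found\n" : "not found\n";

# \L lowercases up to \E; \u uppercases the next character.
my $name = "pERL";
print "\u\L$name\E\n";   # prints "Perl"
```

Without \Q, the pattern a.b+c would also match "aXbbc", which is why quoting matters whenever user-supplied text is interpolated into a pattern.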

Class Shorthand: Items that match a single character in a predefined character class

\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
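
As a rough illustration of the shorthand classes, the sketch below counts how many characters of a sample string fall into \d, \w, and \s, and uses \D to strip everything except digits. The sub names and the sample inputs are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Count digit, word, and whitespace characters in a string.
# The "= () =" idiom forces list context and yields the number of matches.
sub class_counts {
    my ($text) = @_;
    my $digits = () = $text =~ /\d/g;
    my $words  = () = $text =~ /\w/g;
    my $spaces = () = $text =~ /\s/g;
    return ($digits, $words, $spaces);
}

# Remove every non-digit character, e.g. to normalize a phone number.
sub digits_only {
    my ($text) = @_;
    (my $out = $text) =~ s/\D//g;
    return $out;
}

my ($d, $w, $s) = class_counts("room_42 is open");
print "digits=$d word=$w space=$s\n";    # digits=2 word=13 space=2
print digits_only("(719) 555-1234"), "\n";   # 7195551234
```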

Quantifiers: Items appended to provide "Counting"

* Match 0 or more times
+ Match 1 or more times
? Match 0 or 1 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but no more than m times
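
The quantifiers can be combined with the class shorthands to describe simple input formats. The sketch below is only an illustration: the formats (a seven-digit local phone number, a 4 to 6 digit PIN, a signed decimal) and the sub names are hypothetical choices, not rules taken from any standard.

```perl
use strict;
use warnings;

# {n}: exactly three digits, a hyphen, exactly four digits (e.g. 555-1234).
sub looks_like_local_phone {
    my ($s) = @_;
    return $s =~ /\A\d{3}-\d{4}\z/ ? 1 : 0;
}

# {n,m}: between four and six digits.
sub looks_like_pin {
    my ($s) = @_;
    return $s =~ /\A\d{4,6}\z/ ? 1 : 0;
}

# ?, +, *: optional sign, one or more digits, optional dot, zero or more digits.
sub looks_like_decimal {
    my ($s) = @_;
    return $s =~ /\A[+-]?\d+\.?\d*\z/ ? 1 : 0;
}

for my $candidate ("555-1234", "55-1234", "12345", "-3.14", "abc") {
    printf "%-9s phone=%d pin=%d decimal=%d\n", $candidate,
        looks_like_local_phone($candidate),
        looks_like_pin($candidate),
        looks_like_decimal($candidate);
}
```

Anchoring with \A and \z makes the quantifiers apply to the whole string; without the anchors, "abc555-1234xyz" would also pass the phone check.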

Items That Match Positions

^ Caret, Match start of the string; with /m (multiline matching), also match at the start of each line
$ Match end of the string, or before a newline at the end; with /m (multiline matching), also match at the end of each line
\b Match a word boundary
\B Match any position that is not a word boundary
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
\G Match only where previous m//g left off (works only with /g)
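
Position items consume no characters; they only constrain where a match may occur. The following sketch, using made-up sample strings, contrasts \b with an unanchored match, ^ with and without /m, and \z with \Z.

```perl
use strict;
use warnings;

my $text = "cat concat\ncatalog";

# \b on both sides restricts the match to the whole word "cat".
my $whole_words = () = $text =~ /\bcat\b/g;
# Without \b, "cat" is also found inside "concat" and "catalog".
my $any_cat     = () = $text =~ /cat/g;

# Without /m, ^ matches only at the start of the string.
my $starts_plain = () = $text =~ /^cat/g;
# With /m, ^ also matches just after the embedded newline.
my $starts_multi = () = $text =~ /^cat/mg;

# \z matches only at the very end; \Z also allows one trailing newline.
my $with_newline = "done\n";
my $small_z = $with_newline =~ /done\z/ ? 1 : 0;
my $big_z   = $with_newline =~ /done\Z/ ? 1 : 0;

print "whole words: $whole_words, any: $any_cat\n";          # 1, 3
print "^ without /m: $starts_plain, with /m: $starts_multi\n"; # 1, 2
print "\\z: $small_z, \\Z: $big_z\n";                          # 0, 1
```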

Grouping and Alternation

| Alternation, Match either expression it separates
(...) Limit scope of alternation, Provide grouping for the quantifiers, Capture matched substrings for backreferences.
\1, \2, ... Backreference, Match text previously matched within first, second, ..., set of parentheses.
(?:...) Grouping only, non-capturing parentheses
(?=...) Positive lookahead, non-capturing parentheses
(?!...) Negative lookahead, non-capturing parentheses
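
To make the grouping constructs more concrete, here is a short sketch of each one in use. The sub names and sample inputs are hypothetical; the comma-insertion pattern is one common way to apply a positive lookahead, not the only one.

```perl
use strict;
use warnings;

# Alternation inside non-capturing parentheses: "Saturday" or "Sunday".
sub is_weekend {
    my ($day) = @_;
    return $day =~ /\A(?:Satur|Sun)day\z/ ? 1 : 0;
}

# Backreference: \1 must repeat whatever the first group captured,
# so this finds a word that appears twice in a row.
sub doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

# Positive lookahead: insert a comma after any digit that is followed by
# one or more complete groups of three digits up to the end of the string.
sub commify {
    my ($number) = @_;
    (my $out = $number) =~ s/(\d)(?=(?:\d{3})+\z)/$1,/g;
    return $out;
}

# Negative lookahead: "foo" that is not immediately followed by "bar".
sub has_foo_without_bar {
    my ($text) = @_;
    return $text =~ /foo(?!bar)/ ? 1 : 0;
}

print is_weekend("Sunday"), "\n";                  # 1
print doubled_word("this is is a test"), "\n";     # is
print commify("1234567"), "\n";                    # 1,234,567
print has_foo_without_bar("foobar foobaz"), "\n";  # 1
```

Because lookaheads are zero-width, the digits examined by (?=...) in commify are not consumed and remain available for later matches in the same /g pass.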

Modes (modifiers), appended after the closing delimiter of the regular expression

i ignore case
g global; in a substitution s/.../.../g, replace every match instead of only the first
m multiline matching mode (^ and $ also match at embedded newlines)
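
The effect of each mode is easiest to see by running the same pattern with and without it. The sketch below uses a made-up three-line sample string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "Perl is fun.\nperl is terse.\nI like PERL.";

# /i: ignore case. /g in list context returns every match.
my $case_sensitive   = () = $text =~ /perl/g;
my $case_insensitive = () = $text =~ /perl/ig;

# Without /g, s/// replaces only the first match; with /g, all of them.
(my $first_only = $text) =~ s/\bis\b/was/;
(my $all        = $text) =~ s/\bis\b/was/g;

# /m: ^ matches at the start of every line, not just the start of the string.
my $line_starts = () = $text =~ /^perl/img;

print "case-sensitive: $case_sensitive, with /i: $case_insensitive\n"; # 1, 3
print "$first_only\n---\n$all\n";
print "lines starting with perl (any case): $line_starts\n";           # 2
```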

Reference: http://cs.uccs.edu/~cs301/perl/re.htm