Thursday, November 20, 2008

Regular Expression

What is a Regular Expression?

A regular expression is a sequence of characters that specifies a pattern. Regular expressions are used when you want to search for specific lines of text containing a particular pattern. Most of the UNIX utilities operate on ASCII files a line at a time. Regular expressions search for patterns on a single line, and not for patterns that start on one line and end on another.

It is simple to search for a specific word or string of characters. Almost every editor on every computer system can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You can search for words with four or more vowels that end with an "s". Numbers, punctuation characters, you name it, a regular expression can find it. What happens once the program you are using finds it is another matter. Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the matched string with a new string.
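The three behaviours just described (test for a match, print the matching lines, replace the matched text) can be sketched with Python's standard re module. The sample lines and the digit pattern below are made up for illustration and do not come from any particular utility.

```python
import re

# Hypothetical sample text, one entry per line.
lines = ["apples 12", "no digits here", "pears 7"]

# One digit followed by zero or more digits.
pattern = re.compile(r"[0-9][0-9]*")

# 1) Just search: does each line contain the pattern?
found = [pattern.search(line) is not None for line in lines]

# 2) Print the lines that contain the pattern.
matching = [line for line in lines if pattern.search(line)]
for line in matching:
    print(line)

# 3) Replace the matched text, as an editor would.
replaced = [pattern.sub("N", line) for line in lines]
```

The same pattern drives all three operations; only what the program does with the match changes.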

Parts of a Regular Expression

There are three important parts to a regular expression.

1) Anchors are used to specify the position of the pattern in relation to a line of text.

2) Character Sets match one or more characters in a single position.

3) Modifiers specify how many times the previous character set is repeated.

A simple example that demonstrates all three parts is the regular expression "^#*".

The caret ("^") is an anchor that indicates the beginning of the line. The character "#" is a simple character set that matches the single character "#". The asterisk is a modifier: it means zero or more repetitions of the preceding character set.
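As a rough illustration of how the three parts cooperate, the sketch below applies "^#*" to a few invented lines using Python's re module. The helper name leading_hashes is hypothetical.

```python
import re

pattern = re.compile(r"^#*")

def leading_hashes(line):
    # "^" pins the match to the start of the line;
    # "#*" then takes zero or more "#" characters.
    return pattern.search(line).group(0)

samples = ["## heading", "# comment", "plain text", "a # later"]
runs = [leading_hashes(s) for s in samples]
```

Note that every line matches, because zero "#" characters is also a valid match. For lines that do not start with "#", the matched text is simply empty.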

The Anchor Characters: ^ and $

Pattern   Matches
^A        "A" at the beginning of a line
A$        "A" at the end of a line
A^        "A^" anywhere on a line
$A        "$A" anywhere on a line
^^        "^" at the beginning of a line
$$        "$" at the end of a line

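The use of "^" and "$" as indicators of the beginning or end of a line is a convention. In the patterns above, "^" acts as an anchor only when it is the first character of the expression, and "$" only when it is the last; anywhere else they are ordinary characters.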
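The anchor patterns can be tried out with Python's re module, with one caveat: Python treats "^" and "$" as anchors wherever they appear, so the literal occurrences are escaped with a backslash here. The test strings are invented for illustration.

```python
import re

def matches(pattern, line):
    # True when the pattern is found somewhere in the line.
    return re.search(pattern, line) is not None

cases = [
    (r"^A",  "Apple"),     # "A" at the beginning of the line
    (r"^A",  "an Apple"),  # "A" present, but not at the beginning
    (r"A$",  "DATA"),      # "A" at the end of the line
    (r"A\^", "xA^y"),      # literal "A^" anywhere on the line
    (r"\$A", "x$Ay"),      # literal "$A" anywhere on the line
    (r"^\^", "^start"),    # literal "^" at the beginning
    (r"\$$", "cost $"),    # literal "$" at the end
]
results = [matches(p, s) for p, s in cases]
```

Only the second case fails to match, since its "A" is not the first character on the line.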

Matching a character with a character set

The simplest character set is a single character. The regular expression "may" contains three character sets: "m", "a" and "y". It will match any line with the string "may" inside it. This would also match the word "mayur". To prevent this, put spaces before and after the pattern: " may ".
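A small sketch of the "may" example, using Python's re module and invented sample lines, shows both the fix and its limitation:

```python
import re

def contains(pattern, line):
    # True when the pattern is found somewhere in the line.
    return re.search(pattern, line) is not None

bare = contains("may", "mayur was here")            # matches inside "mayur"
spaced_name = contains(" may ", "mayur was here")   # the spaces rule out "mayur"
spaced_word = contains(" may ", "you may go")       # a standalone "may" still matches
line_start = contains(" may ", "may I go")          # no space before "may" here
```

The space-padded pattern also misses "may" at the very beginning or end of a line, because there is no space on that side to match.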

Match any character with . (DOT)

The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is "^.$".
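As a quick check of "^.$", the following sketch filters a handful of made-up lines with Python's re module and keeps those that are exactly one character long:

```python
import re

single = re.compile(r"^.$")

# Hypothetical lines, already split so none contains a newline.
lines = ["a", "ab", "", "#", " "]

one_char_lines = [line for line in lines if single.search(line)]
```

The empty line is rejected because "." must consume one character, and "ab" is rejected because the anchors leave room for only one.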

Specifying a Range of Characters with [...]

If you want to match specific characters, you can use the square brackets to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit is

^[0123456789]$

You can also use the hyphen between two characters to specify a range:

^[0-9]$

You can also mix explicit characters with character ranges. This pattern will match a single character that is a letter, number, or underscore:

[A-Za-z0-9_]
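The bracket patterns above can be exercised with Python's re module. This is only a sketch with invented inputs; it checks that the explicit list and the range behave the same, and picks out the first letter, digit, or underscore in a string.

```python
import re

explicit = re.compile(r"^[0123456789]$")
ranged = re.compile(r"^[0-9]$")
word_char = re.compile(r"[A-Za-z0-9_]")

samples = ["7", "42", "x", ""]

# Both spellings of the digit set should agree on every sample.
explicit_results = [explicit.search(s) is not None for s in samples]
ranged_results = [ranged.search(s) is not None for s in samples]

# First character that is a letter, digit, or underscore.
first_word_chars = [word_char.search(s).group(0) for s in ["--x9", "_id", "  Z"]]

# A string with none of those characters gives no match at all.
no_match = word_char.search("!?- ")
```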

Rules in Short.

Regular Expression   Class      Type            Meaning
.                    all        Character Set   A single character (except newline)
^                    all        Anchor          Beginning of line
$                    all        Anchor          End of line
[...]                all        Character Set   Range of characters
*                    all        Modifier        Zero or more duplicates
\<                   Basic      Anchor          Beginning of word
\>                   Basic      Anchor          End of word
\(..\)               Basic      Backreference   Remembers pattern
\1..\9               Basic      Reference       Recalls pattern
+                    Extended   Modifier        One or more duplicates
?                    Extended   Modifier        Zero or one duplicate
\{M,N\}              Basic      Modifier        M to N duplicates (written {M,N} in extended syntax)
(...|...)            Extended   Grouping        Shows alternation
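The modifiers and references in the table can be tried in Python's re module, with the caveat that Python has a single syntax closer to the extended class: "+", "?", "{M,N}" and "(...|...)" are written without backslashes, and a pattern is remembered with plain parentheses and recalled with "\1". Python does not accept "\<" and "\>", so they are left out. The helper name full and the test words are assumptions made for this sketch.

```python
import re

def full(pattern, text):
    # True when the whole text matches the pattern.
    return re.fullmatch(pattern, text) is not None

# "*": zero or more duplicates of "b".
star = [full(r"ab*", s) for s in ["a", "ab", "abbb"]]

# "+": one or more duplicates of "b".
plus = [full(r"ab+", s) for s in ["a", "ab", "abbb"]]

# "?": zero or one "u".
optional = [full(r"colou?r", s) for s in ["color", "colour", "colouur"]]

# "{M,N}": between 2 and 4 digits.
interval = [full(r"[0-9]{2,4}", s) for s in ["7", "42", "2008", "12345"]]

# "(...|...)": either alternative, followed by "s".
either = [full(r"(cat|dog)s", s) for s in ["cats", "dogs", "cows"]]

# "(...)" remembers a letter and "\1" recalls it: a doubled letter.
doubled = re.search(r"([a-z])\1", "letter").group(0)
```

Comparing star and plus on the input "a" shows the difference between "zero or more" and "one or more".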