What is a Regular Expression?

A regular expression is a sequence of characters that specifies a pattern. Regular expressions are used when you want to search for specific lines of text containing a particular pattern. Most of the UNIX utilities operate on ASCII files a line at a time. Regular expressions search for patterns on a single line, and not for patterns that start on one line and end on another.

It is simple to search for a specific word or string of characters. Almost every editor on every computer system can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You can search for a word with four or more vowels that ends with an "s". Numbers, punctuation characters, you name it, a regular expression can find it. What happens once the program you are using finds it is another matter. Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the matched string with a new string.
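The three behaviours just described (search only, print the matching line, replace the match) can be sketched with Python's `re` module. This is only an illustration: the sample lines and the pattern "cat" are made up for the example, and Python's syntax is closer to extended than to basic regular expressions.

```python
import re

# Made-up sample input, processed one line at a time.
lines = [
    "the cat sat",
    "a dog barked",
    "cats and dogs",
]

pattern = re.compile(r"cat")

# Search only: report whether each line contains the pattern.
found = [bool(pattern.search(line)) for line in lines]

# Print the lines that contain the pattern, as grep does.
matching = [line for line in lines if pattern.search(line)]
for line in matching:
    print(line)

# Replace the matched text, as an editor's substitute command does.
replaced = [pattern.sub("bird", line) for line in lines]
```

Note that "cats and dogs" is selected as well, because the pattern only has to appear somewhere on the line.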

Parts of a Regular Expression

There are three important parts to a regular expression.

1) Anchors are used to specify the position of the pattern in relation to a line of text.

2) Character Sets match one or more characters in a single position.

3) Modifiers specify how many times the previous character set is repeated.

A simple example that demonstrates all three parts is the regular expression "^#*".

The caret ("^") is an anchor that indicates the beginning of the line. The character "#" is a simple character set that matches the single character "#". The asterisk is a modifier that lets the preceding character set repeat zero or more times.
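A small sketch of this pattern in Python's `re` module may help. The helper name `leading_hashes` is invented for the example. Because the asterisk also accepts zero repetitions, the pattern succeeds on every line and simply matches an empty string when the line does not start with "#".

```python
import re

# Anchor "^", character set "#", modifier "*".
pattern = re.compile(r"^#*")

def leading_hashes(line):
    """Return the run of '#' characters at the start of the line."""
    # The match always succeeds because "*" allows zero repetitions.
    return pattern.match(line).group()

print(repr(leading_hashes("## heading")))
print(repr(leading_hashes("plain text")))
print(repr(leading_hashes("a # b")))
```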

The Anchor Characters: ^ and $

Pattern	Matches
^A	"A" at the beginning of a line
A$	"A" at the end of a line
A^	"A^" anywhere on a line
$A	"$A" anywhere on a line
^^	"^" at the beginning of a line
$$	"$" at the end of a line

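The use of "^" and "$" as indicators of the beginning or end of a line is a convention. As the table shows, in basic regular expressions "^" acts as an anchor only when it is the first character of the pattern, and "$" only when it is the last. Anywhere else they match themselves.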
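The anchors can be tried out with a short Python sketch. One assumption to keep in mind: Python's `re` module treats an unescaped "^" or "$" as an anchor wherever it appears, so the literal cases from the table are written here with a backslash. The helper `matches` and the sample strings are made up for the example.

```python
import re

def matches(pattern, line):
    """True if the pattern is found anywhere on the line."""
    return re.search(pattern, line) is not None

# "A" at the beginning or end of a line.
print(matches(r"^A", "Apple"))        # anchored at the start
print(matches(r"^A", "an Apple"))     # "A" is not first, so no match
print(matches(r"A$", "NASA"))         # anchored at the end
print(matches(r"A$", "NASA rocks"))   # "A" is not last, so no match

# Literal "^" and "$", escaped because Python is stricter than basic regex.
print(matches(r"^\^", "^caret first"))
print(matches(r"\$$", "costs $"))
print(matches(r"A\^", "xA^y"))
```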

Matching a Character with a Character Set

The simplest character set is a single character. The regular expression "may" contains three character sets: "m", "a" and "y". It will match any line with the string "may" inside it. This would also match the word "mayur". To prevent this, put spaces before and after the pattern: " may ".
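Here is a quick Python check of both patterns, using made-up sample lines. It also shows a limitation of the space trick: a line that begins or ends with the word has no space on that side, so " may " does not match it.

```python
import re

def matches(pattern, line):
    """True if the pattern is found anywhere on the line."""
    return re.search(pattern, line) is not None

# The bare pattern also matches inside a longer word.
print(matches("may", "mayur wrote this"))

# With surrounding spaces, the longer word is no longer matched.
print(matches(" may ", "mayur wrote this"))
print(matches(" may ", "it may rain"))

# Limitation: no space before the word at the start of a line.
print(matches(" may ", "may it rain"))
```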

Match any character with . (DOT)

The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is ^.$
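As a rough illustration in Python, the following filters a made-up list of lines down to those that hold exactly one character. An empty line and a two-character line are both rejected, while a line holding a single space is accepted.

```python
import re

# "^" and "$" pin the single "." to the whole line.
single = re.compile(r"^.$")

lines = ["a", "", "ab", "7", " "]
one_char_lines = [line for line in lines if single.search(line)]
print(one_char_lines)
```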

Specifying a Range of Characters with [...]

If you want to match specific characters, you can use the square brackets to identify the exact characters you are searching for. The pattern that will match any line of text that contains exactly one digit, and nothing else, is:

^[0123456789]$

You can also use the hyphen between two characters to specify a range:

^[0-9]$

You can also mix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore:

[A-Za-z0-9_]
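Both bracket patterns can be exercised with a short Python sketch. The variable names and test strings are invented for the example.

```python
import re

one_digit = re.compile(r"^[0-9]$")        # a line holding exactly one digit
word_char = re.compile(r"[A-Za-z0-9_]")   # one letter, digit, or underscore

print(bool(one_digit.search("7")))    # a single digit
print(bool(one_digit.search("42")))   # two digits, so no match
print(bool(one_digit.search("x")))    # not a digit

# findall() collects every single-character match in the string.
word_chars = word_char.findall("a-b_9!")
print(word_chars)
```

The hyphen and the exclamation mark are skipped because they are not listed inside the brackets.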

Rules in Short.

Regular Expression	Class	Type	Meaning
.	all	Character Set	A single character (except newline)
^	all	Anchor	Beginning of line
$	all	Anchor	End of line
[...]	all	Character Set	Range of characters
*	all	Modifier	Zero or more duplicates
\<	Basic	Anchor	Beginning of word
\>	Basic	Anchor	End of word
\(..\)	Basic	Backreference	Remembers pattern
\1..\9	Basic	Reference	Recalls pattern
+	Extended	Modifier	One or more duplicates
?	Extended	Modifier	Zero or one duplicate
{M,N}	Extended	Modifier	M to N duplicates (written \{M,N\} in basic regular expressions)
(...|...)	Extended	Alternation	Matches either alternative
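A few rows of the table can be tried in Python, whose `re` syntax resembles the extended class: "+", "?", "{M,N}" and "(...|...)" are written without backslashes, and plain parentheses remember a pattern for "\1". Python has no "\<" or "\>", so the word anchors are left out of this sketch. The helper `full_line` and the sample words are assumptions made for the example.

```python
import re

def full_line(pattern, line):
    """True if the pattern matches the entire line."""
    return re.search("^" + pattern + "$", line) is not None

# "+" : one or more duplicates of the previous character set.
print(full_line(r"ab+c", "abbbc"), full_line(r"ab+c", "ac"))

# "?" : zero or one duplicate.
print(full_line(r"colou?r", "color"), full_line(r"colou?r", "colour"))

# "{M,N}" : between M and N duplicates.
print(full_line(r"[0-9]{2,3}", "123"), full_line(r"[0-9]{2,3}", "1234"))

# "(...|...)" : either alternative.
print(full_line(r"(cat|dog)s", "dogs"), full_line(r"(cat|dog)s", "birds"))

# "(...)" remembers a pattern and "\1" recalls it: a doubled letter.
doubled = re.search(r"([a-z])\1", "letter").group()
print(doubled)
```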

MayurS Blog

Thursday, November 20, 2008

Regular Expression