Once upon a time

When I was new to PHP I wanted to make a BBCode parser, so I started looking for ways to find a specific string and replace each occurrence of it with an appropriate string or tag, for example to replace the smiley symbol : ) with <img src='Images/smile.jpg'>. That’s when I came across something called regular expressions. The first look at one sent chills down my spine. I mean, look at it.



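<?php

// Sample text containing one http link.
$text='<a href="http://www.wordpress.com">Wordpress</a>';

// The part in parentheses (the link itself) ends up in $matches[1].
preg_match_all("/<a href=['\"](http:\/\/www\..*\.[a-z]{1,4})['\"]>.*<\/a>/i",$text,$matches);

echo "Found matches\n";

foreach($matches[1] as $match)echo $match."\n";

?>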



And they lived happily ever after. Now don’t worry if you don’t understand it, nobody does the first time, and it’s really easy once you get the hang of it. In case you’re wondering what the code above does, it gets every http link from the text in $text and stores them as an array in $matches. I know, I know, I should be starting from the basics. Sorry, I got a little ahead of myself, I just wanted to show you what’s in store for you. I’ll explain everything I know about regular expressions now.

The beast



We’ve been hearing a lot about regular expressions, or regexps, as that’s easier to type. But wth is a regexp? Well, think of it as a “find a word” function: it looks for a specific pattern in a text, matches it, and if you ask it to store the resulting matches in a variable, it does so. In even simpler words, if you want to find out whether the word “foo” is present in “foobar”, it does just that for you. It’s a very useful feature of PHP and of other languages like Perl, but we’ll focus on PHP for now.

What good is it?

Well, there are a lot of applications and cool uses for regular expressions. Some of them are.

1. Like I said before, you can use them to parse smileys and bbcodes.

2. You can use them to count the occurrences of a word or letter in a string or text file.

3. You can use them to process the results returned by your cURL Google search script.

4. Get links and image urls from webpages that you scrape.

5. Find and replace words, or use them as a word filter.

6. Validate form data like an email address or a date of birth by checking for a specified pattern.

Prerequisite

The only prerequisite to understand or make use of this article is that you should know the basics of PHP.

Now for some coding

Alright, enough talk, how in the name of everything holy do we use regexps and do some cool stuff with them? Well, to begin with we need to learn 3 functions.

1. preg_match:

It should be obvious that it’s the function used to match a “pattern” in a string. Now what’s a pattern? In short, if you look at a simplified version of the first example, /<a href=.*(http:\/\/www\..*\..*)>.*<\/a>/i, this stuff is called a pattern. It tells preg_match what you want it to match. You might be thinking, wth do we need all those weird looking symbols and characters for? Well, you don’t really, all those symbols and stuff are for complex pattern matching. If you just want to match the word “foo” in the string “foobar”, your pattern could look like this.



"/foo/"



Delimiters

And what are those two unholy “/” symbols doing at both ends of our pattern? Those are called “delimiters”. They mark the beginning and end of our pattern. It doesn’t really need to be “/”, you can use any character that isn’t a letter, a digit, a backslash or whitespace, as long as you use the same one to mark the beginning and end of the pattern. That brings us to our next question, why do we need to mark the beginning and end of the pattern? We need them so that we can use something called “pattern modifiers”.
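To see the simple /foo/ pattern and the delimiter rule in action, here’s a minimal sketch. The sample strings are made up for illustration. preg_match returns 1 when the pattern matches and 0 when it doesn’t.

```php
<?php
// Made-up sample text for illustration.
$text = "foobar";

// The plain pattern from above, with "/" as the delimiter.
$slash = preg_match("/foo/", $text);

// The same idea with other delimiters; the "i" after the end
// delimiter is a modifier (case insensitive matching).
$hash  = preg_match("#foo#", $text);
$tilde = preg_match("~FOO~i", $text);

// With "#" as the delimiter, the slashes in a URL need no escaping.
$isHttp = preg_match("#http://#", "http://www.wordpress.com");

echo "$slash $hash $tilde $isHttp\n";
```

Picking a delimiter that doesn’t show up inside the pattern saves you a lot of backslashes.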

Pattern Modifiers

Yay, more new terms :D. Modifiers are kind of advanced features that help us do more advanced searches on text: stuff like making the search case insensitive, greedy or ungreedy, doing extra analysis of the pattern, or treating the whole string as one line by ignoring newlines. I know some of this may look strange to you but it’ll make sense soon. Modifiers come immediately after the end delimiter, like this: “/pattern/i”. Here i is the modifier, and it makes the search caseless, so Pattern and pattern look the same to it. Some of the modifiers are.

i: This modifier sets caseless searching, so that preg_match does case insensitive matching.

s: If this modifier is set, the “.” metacharacter matches all characters. By default the “.” metacharacter (we’ll talk about metacharacters next) matches everything except the “\n” character. Using this modifier makes the dot match it too.

x: This modifier makes preg_match ignore whitespace in the pattern unless it’s escaped or inside a character class. I’ve been talking about a lot of terms that I haven’t explained. Please humor me for a moment here. I’ll explain it all after this section.

U: It’s also called the ungreedy modifier. The thing with preg_match is that if you make a pattern like /\[b\](.*)\[\/b\]/ to match all the text between the bold bbcode tags, it’ll match everything between the first occurrence of [b] and the last occurrence of [/b]. So something like [b]bold text one[/b] something else [b]bold text two[/b] would, after parsing, come out entirely bold, “something else” included, because the pattern matched everything between the first [b] and the second [/b]. So we add the U modifier so that it matches each [b] text separately.
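Here’s a small sketch of the four modifiers side by side. The sample strings and variable names are made up for illustration, and the comments show what each call should give back.

```php
<?php
// i: caseless matching.
$caseless = preg_match('/pattern/i', "My PATTERN");   // 1
$caseful  = preg_match('/pattern/', "My PATTERN");    // 0

// s: lets the dot match a newline too.
$twoLines = "first\nsecond";
$withoutS = preg_match('/first.second/', $twoLines);  // 0
$withS    = preg_match('/first.second/s', $twoLines); // 1

// x: whitespace inside the pattern is ignored.
$spaced = preg_match('/f o o/x', "foobar");           // 1

// U: ungreedy, so (.*) stops at the first closing tag it can.
$bb = "[b]one[/b] and [b]two[/b]";
preg_match('/\[b\](.*)\[\/b\]/', $bb, $greedy);
preg_match('/\[b\](.*)\[\/b\]/U', $bb, $ungreedy);

echo $greedy[1] . "\n";   // one[/b] and [b]two
echo $ungreedy[1] . "\n"; // one
```

The patterns are in single quotes here so that PHP leaves the backslashes alone and hands them straight to the regexp engine.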

Metacharacters

Did you notice all the “.[]{}\()*” characters? Well, turns out they’re called metacharacters. What they do is help us make our patterns even more specific. Let’s say, for example, we want to check if a line ends with foo. We can do it using the “$” metacharacter, which tells the function to match the word at the end of the string.

The most commonly used metacharacters are.

$ : This metacharacter anchors the match to the end of the string, so the characters before it must come last.

For example:

/end$/ matches text like blend, will send, and end.

^ : The circumflex symbol anchors the match to the beginning of the string, so the characters after it must come first.

For example:

/^start/ matches text like started, startle, and start the engine.

. : The dot character matches any character except “\n”, the newline, unless you set the s modifier.

\ : Is an escape character. We can use it to escape special characters like $,[,],{,},*,.,(,), and ^ in case we use them in our pattern. For example, if we want to check if a text contains a $ sign, we could use the pattern /\$/.

See what happened up there? The $ character is escaped using a \. If you don’t do this, the $ is treated as the end-of-string metacharacter and the pattern won’t look for a literal $ sign at all.

We can also use it to escape the delimiter if it appears inside our pattern, like this: /s\/ash/
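A quick sketch of the anchors and the escape character, using made-up sample strings. The patterns are in single quotes so that PHP doesn’t try to do anything clever with the $ and \ characters before the regexp engine sees them.

```php
<?php
// $ anchors the match to the end of the string.
$endsWithEnd = preg_match('/end$/', "will send");           // 1
$endInMiddle = preg_match('/end$/', "ending");              // 0

// ^ anchors the match to the beginning of the string.
$startsWithStart = preg_match('/^start/', "start the engine"); // 1
$startInMiddle   = preg_match('/^start/', "restart");          // 0

// \$ looks for a literal dollar sign.
$hasDollar = preg_match('/\$/', 'costs $5');                // 1
$noDollar  = preg_match('/\$/', 'costs nothing');           // 0

// \/ lets us use the delimiter inside the pattern.
$hasSlash = preg_match('/s\/ash/', 's/ash');                // 1

echo "$endsWithEnd $endInMiddle $startsWithStart $startInMiddle\n";
```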

Character Class

Just what we need, more terms :P. Character classes are like baskets or menus or lists in which we can specify the set of characters we want preg_match to match. For example, if we want a [b] bb tag to match only letters and not numbers, our pattern could look like this: /\[b\]([a-zA-Z]{1,})\[\/b\]/. Aha, two new signs, the “{” and “}”. Those are called minimum and maximum quantifiers, and we’ll get to them in a moment. Inside the square brackets, the “-” character marks a range of characters, for example “0-9”, “a-z” or “A-Z”.

Quantifiers

Holy cow, what’s that? Nah, don’t worry, it’s not as complex as its name sounds. There are only two kinds of quantifiers here, minimum and maximum. They set the limit on how many times we want the preceding character or character class to match. They’re used like this: {min, max}, where min and max specify the minimum and maximum number of characters. For example.

{1,2} means that we want to match a minimum of one and a maximum of two characters.

{1,} means that we want the function to match any number of characters from 1 upwards :P.

preg_match("/[a-z]{1,}/", $text): Matches text like abc, cde, and thingislong

preg_match("/[a-z]{1,2}[0-9]{1,}/", $text): Matches text like ab123, ab1, and jk123123123
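Here’s a small sketch of character classes and the {min,max} quantifiers working together. The sample strings are made up. One thing worth seeing for yourself: preg_match is happy with a match anywhere in the text, so the ^ and $ anchors are added when the whole string has to fit the pattern.

```php
<?php
// Letters only between the bold tags; the letters land in $m[1].
$ok  = preg_match('/\[b\]([a-zA-Z]{1,})\[\/b\]/', "[b]bold[/b]", $m);
$bad = preg_match('/\[b\]([a-zA-Z]{1,})\[\/b\]/', "[b]b0ld[/b]");

// One or two letters followed by at least one digit, anchored to
// the whole string.
$fits    = preg_match('/^[a-z]{1,2}[0-9]{1,}$/', "ab123");   // 1
$tooLong = preg_match('/^[a-z]{1,2}[0-9]{1,}$/', "abc123");  // 0

// Without the anchors the same text still matches, just not from
// the start: the match is "bc123".
preg_match('/[a-z]{1,2}[0-9]{1,}/', "abc123", $part);

echo $m[1] . "\n";    // bold
echo $part[0] . "\n"; // bc123
```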

Another thing about character classes is that a “^” symbol placed right after the opening “[” signifies the characters we want to exclude, that is, the characters the matching text shouldn’t contain. For example.

preg_match("/^[^0-9]{1,}$/", $text): Matches words like abce and word, and doesn’t match a word like abc123.

The ^ metacharacter outside the [ and ] characters of the character class is not the same as the one inside it. Outside, it marks the beginning of the string. Inside, as the first character, it negates the class.
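The two meanings of ^ are easy to mix up, so here’s a minimal sketch with made-up sample strings showing both, plus what happens when ^ is not the first character inside the brackets.

```php
<?php
// ^ right after "[" negates the class: "anything but a digit".
$noDigits  = preg_match('/^[^0-9]{1,}$/', "word");    // 1
$hasDigits = preg_match('/^[^0-9]{1,}$/', "abc123");  // 0

// ^ outside the brackets means "start of the string".
$startsWithDigit = preg_match('/^[0-9]/', "1abc");    // 1
$digitLater      = preg_match('/^[0-9]/', "abc1");    // 0

// ^ anywhere else inside the brackets is just a literal "^".
$literalCaret = preg_match('/[a-z^]/', "^");          // 1

echo "$noDigits $hasDigits $startsWithDigit $digitLater $literalCaret\n";
```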

More quantifiers

Just when you thought you wouldn’t hear from the quantifiers any more. There are two more quantifiers we need to know about. The first one is “*”, the 0 or more quantifier: it matches zero or more of the character or character class that precedes it. And “+”, the 1 or more quantifier, matches one or more of them. For example.

preg_match("/\w{1,}.*/", $text): Matches w, wo1, wo% and pretty much any word followed by 0 or more characters other than “\n” or newline.

preg_match("/\W\w+/", $text): Matches a non-word character followed by one or more word characters, like the “ bar” in “foo bar”.
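The difference between “*” and “+” only shows up when there’s nothing to match, so here’s a small sketch with made-up strings where the letter b is either missing or repeated.

```php
<?php
// * is fine with zero b's, + insists on at least one.
$starZero = preg_match('/ab*c/', "ac");     // 1
$plusZero = preg_match('/ab+c/', "ac");     // 0

// Both are happy when there are several.
$starMany = preg_match('/ab*c/', "abbbc");  // 1
$plusMany = preg_match('/ab+c/', "abbbc");  // 1

// \W\w+ : one non-word character, then one or more word characters.
preg_match('/\W\w+/', "foo bar", $m);

echo "[" . $m[0] . "]\n"; // [ bar]
```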

SubPatterns

SubPatterns are like, well, sub patterns. Suppose you want to match a smiley : D in a text “yay : D” so that you can replace it with the appropriate image. If we do it using a pattern like this.

preg_match("/.*: D/",$text,$match);

It’ll return an array holding only the whole matched text, and that’s the reason we need sub patterns. The start and end of a sub pattern are marked with the “(” and “)” characters. Now if we do this.

preg_match("/.*(: D)/",$text,$match);

We’ll get an array with the whole match in $match[0] and the sub pattern, the “: D” we want, in $match[1]. We can use this to replace it with the appropriate image tag and source. If we add more sub patterns, we just increase the index number, like $match[2], $match[3] and so on.
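Here’s that smiley example as a runnable sketch, plus a second pattern with two sub patterns to show how the index numbers line up. The second pattern and the $parts variable are made up for illustration.

```php
<?php
$text = "yay : D";

// One sub pattern: index 0 is the whole match, index 1 the smiley.
preg_match('/.*(: D)/', $text, $match);

// Two sub patterns: the word goes in [1], the smiley in [2].
preg_match('/(\w+) (: D)/', $text, $parts);

echo $match[0] . "\n"; // yay : D
echo $match[1] . "\n"; // : D
echo $parts[1] . "\n"; // yay
echo $parts[2] . "\n"; // : D
```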

Alright, now that we’ve covered the basics of syntax and preg_match, let’s move on to the other two functions. Woohoo, more fun :P.

preg_match_all:

preg_match_all is a function almost like preg_match, except that it finds every part of the string or text that matches the pattern. The thing with preg_match is that it stops once it finds the first match, so it can only be used to find a single match or to check whether a text matches a particular pattern. If you wanna do some text collection, preg_match_all’s the function you’re looking for.

You can use it like this.

preg_match_all("/pa[t]{2}ern/","text for the pattern to match",$matchStoringVariable);

You’ll see some sample applications for this at the end of this article.
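To see how the collected matches are laid out, here’s a minimal sketch with a made-up sample text. preg_match_all returns the number of matches it found, $matches[0] holds the full matches, and each sub pattern gets its own list in $matches[1], $matches[2] and so on.

```php
<?php
$text = "one pattern, two PATTERNs, no patern";

// Every full match ends up in $matches[0].
$count = preg_match_all('/pa[t]{2}ern/i', $text, $matches);

// preg_match on the same text stops after the first hit.
preg_match('/pa[t]{2}ern/i', $text, $single);

// With sub patterns, each one gets its own list.
preg_match_all('/(\d+) (\w+)/', "3 apples, 12 pears", $fruit);

echo $count . "\n";                     // 2
echo implode(",", $matches[0]) . "\n";  // pattern,PATTERN
echo implode(",", $fruit[1]) . "\n";    // 3,12
echo implode(",", $fruit[2]) . "\n";    // apples,pears
```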

preg_replace:

Now for the last function, preg_replace. We can use it to replace strings that match a particular pattern. For example, if we want to replace every double p in a text with just one p, we can do it like this.

$text=preg_replace("/[p]{2}/i","p",$text);

Or we can replace an array of patterns with an array of replacements.

$smilies=array("/:\)/","/: D/","/: p/","/: \(/");

$replace=array("images/smile.png","images/lau.png","images/ton.png","images/sa.png");

$text=preg_replace($smilies,$replace,$text);

Or if we want to grab some matches from a string and reuse the matched text in the replacement, we can do this.

$bbcode=array("/\[b\](.*)\[\/b\]/Ui","/\[i\](.*)\[\/i\]/Ui");

$replace=array('<strong>$1</strong>','<em>$1</em>');

$text=preg_replace($bbcode,$replace,$text);

You must have noticed the $1 up there. That’s the text matched by the first sub pattern of the pattern. If we want more sub patterns we just need to increase the number next to the “$” sign, like $2, $3 and so on. It’s that easy.
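Here’s a runnable sketch of those preg_replace uses with made-up sample text. Keep in mind that preg_replace doesn’t change $text itself, it returns the new string, so you have to store the result somewhere.

```php
<?php
// A single pattern and a single replacement.
$single = preg_replace('/[p]{2}/i', 'p', "apple pie");

// Arrays of patterns and replacements, applied in order.
$text    = "I am [b]happy[/b] and [i]hungry[/i]";
$bbcode  = array('/\[b\](.*)\[\/b\]/Ui', '/\[i\](.*)\[\/i\]/Ui');
$replace = array('<strong>$1</strong>', '<em>$1</em>');
$html    = preg_replace($bbcode, $replace, $text);

// $1 and $2 can be used in any order in the replacement.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', "hello world");

echo $single . "\n";  // aple pie
echo $html . "\n";    // I am <strong>happy</strong> and <em>hungry</em>
echo $swapped . "\n"; // world hello
```

The replacement strings are in single quotes so nobody has to wonder whether PHP will treat $1 as a variable.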

Special characters

There are some special non-printable characters and character types that can be represented using a backslash. They can help you make your patterns more precise. Some of them are.

\a Alarm or the BEL character. It usually makes a sound when the script is run in a terminal console.

\f Formfeed

\t Tab



\n Newline

\r Carriage return, similar to newline

\d Any decimal digit

\D Any character that is not a decimal digit

\h and \v Horizontal and vertical whitespace

\H and \V Any character that is not a horizontal or vertical whitespace

\s Any whitespace character

\S Any character that is not a whitespace

\w Any word character.

\W Any nonword character.

\b Word boundary

\B Not a word boundary

\G First matching position in a pattern or string

\A Start of the string

\Z End of the string, or just before a newline at the very end of it

The characters have not been sorted on the basis of anything. I just wrote it down :P.
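A few of these in action, as a minimal sketch with made-up sample strings.

```php
<?php
// \d+ : runs of digits.
preg_match_all('/\d+/', "Room 12, floor 3", $numbers);

// \D : everything that is not a digit gets thrown away.
$digitsOnly = preg_replace('/\D/', '', "555-1234");

// \s+ : any run of whitespace (spaces, tabs, newlines) becomes one space.
$tidy = preg_replace('/\s+/', ' ', "too   many\tspaces");

// \b : "cat" only counts when it stands as a word of its own.
$wholeWord  = preg_match('/\bcat\b/', "the cat sat");  // 1
$insideWord = preg_match('/\bcat\b/', "concatenate");  // 0

echo implode(",", $numbers[0]) . "\n"; // 12,3
echo $digitsOnly . "\n";               // 5551234
echo $tidy . "\n";                     // too many spaces
```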

And they regexp’ed happily ever after

I think I’ve explained everything, or almost everything, I know about regular expressions in PHP. If you’d like to, please take a look at the sample code. It’s not much, but I’ll add more as I make them.

Sample functions

Do XSS filtering

<?php

// Replaces <, > and " with their HTML entities so they show up as plain text.
function XSSFilter($str){

$unsafe=array("/</","/>/","/\"/");

$safe=array("&lt;","&gt;","&quot;");

$str=preg_replace($unsafe,$safe,$str);

return $str;

}

?>

Check if mail address is valid
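A rough sketch of such a check, assuming a deliberately simple idea of what an address looks like: some characters, an @, a domain, a dot, and at least two letters. The function name validMail is made up for illustration, and the pattern will reject some unusual but legal addresses. PHP’s built-in filter_var() with FILTER_VALIDATE_EMAIL is the stricter option if you need one.

```php
<?php
// Hypothetical helper: returns true when $str looks like an email address.
function validMail($str){
    // ^ and $ make sure the whole string fits the pattern.
    $pattern = '/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$/i';
    return preg_match($pattern, $str) === 1;
}

$good = validMail("someone@example.com"); // true
$bad  = validMail("not-an-email");        // false
$half = validMail("someone@example");     // false, no dot and ending

echo ($good ? "valid" : "invalid") . "\n";
```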

Check if a field has only numbers or if a field has only letters

<?php

function onlyNum($str){

$ret=false;

if(!preg_match("/[^0-9]/",$str))$ret=true;

return $ret;

}

function onlyLet($str){

$ret=false;

if(!preg_match("/[^a-zA-Z]/",$str))$ret=true;

return $ret;

}

?>

BBCode parser

<?php

function bbCode($str){

$bbcodes=array("/\[b\](.*)\[\/b\]/Ui"); //Add more bbcodes here

$replace=array('<strong>$1</strong>'); //Add the replacement texts or tags here

$str=preg_replace($bbcodes,$replace,$str);

return $str;

}

?>

I’ll add more later :P. Sorry I couldn’t do it now.