Custom Tags Parsing Using Regular Expressions

In the last post, we created a simple
custom tag parsing script using PHP string functions. In this post,
we continue our discussion of custom tag parsing, this time using
regular expressions. We will see how regular expressions can be used to parse
strings, and also where to use them and where not to.
Before continuing, I expect you to have a working knowledge of regular expressions;
if not, please first check out an introductory tutorial on them.

Let us first recreate the previous custom tag parsing script using regular expressions:


<form name="form1" method="get" action="">
  <p>
    <!-- textarea should display previously written text -->
    <textarea name="content" cols="35" rows="12" id="content"><?php if (isset($_GET['content'])) echo htmlspecialchars($_GET['content']); ?></textarea>
  </p>
  <p>
    <input name="parse" type="submit" id="parse" value="Parse">
  </p>
</form>

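<?php

if (isset($_GET['parse']))
{
    $content = $_GET['content'];

    //escape HTML special characters so that the submitted text
    //cannot inject its own markup into the page
    $content = htmlspecialchars($content);

    //convert newlines in the text to HTML "<br />"
    //required to keep formatting (newlines)
    $content = nl2br($content);

    //PHP function 'preg_replace' replaces all occurrences of the pattern with the replacement
    //(it is used here because 'eregi_replace' was deprecated in PHP 5.3 and removed in PHP 7)
    //'\\1' is the string matched by the parentheses '()' in the regular expression
    //it is a feature of the regex functions, not of the PHP language itself
    //'i' makes the match case-insensitive, 's' lets '.' match newlines too
    //'.+?' matches as little as possible, so two tag pairs are not merged into one
    $content = preg_replace('#\.b\.(.+?)\./b\.#is', '<strong>\\1</strong>', $content);
    $content = preg_replace('#\.i\.(.+?)\./i\.#is', '<i>\\1</i>', $content);

    //now the variable $content contains HTML formatted text
    //display it
    echo '<hr />';
    echo $content;
}

?>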

But should we use regular expressions here? The answer is no. Regular
expressions generally run slower than plain string functions, and they add a
fair bit of complexity where the same thing could have been done easily using
just string functions.
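
To make the comparison concrete, here is a minimal sketch of the same replacement done with a plain string function. The function name parse_dot_tags is hypothetical, and this is not necessarily how the script from the previous post was written; it only assumes the same '.b.' and '.i.' tags used above.

```php
<?php
// Hypothetical helper (not from the original post): replaces the
// '.b.' / './b.' and '.i.' / './i.' tags without any regular expression.
function parse_dot_tags($text)
{
    $tags = array('.b.', './b.', '.i.', './i.');
    $html = array('<strong>', '</strong>', '<i>', '</i>');

    // str_ireplace is the case-insensitive variant of str_replace,
    // so '.B.' is accepted as well as '.b.'
    return str_ireplace($tags, $html, $text);
}

echo parse_dot_tags('Hello .b.World./b. and .i.you./i.');
// prints: Hello <strong>World</strong> and <i>you</i>
```

Unlike the regular expression version, this sketch replaces each tag on its own, so an opening tag with no closing tag is still converted.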

The reason I started this post with something that contradicts its theme
is that people tend to avoid regular expressions, thinking that
the same thing can be done some other way (I just gave them one more excuse!). Well,
that may be the case sometimes, but in many other cases, where complex string manipulation
has to be done efficiently, there is only one choice: regular expressions. The
next example will illustrate this.

For this example, we will treat ‘*’ (asterisk) and ‘_’
(underscore) as markers for bold and italic text, as Google Talk and other IM applications do.
The following text:

Hello *World*. Hello _World_.

will be parsed and displayed as:

Hello World. Hello World. (with the first “World” in bold and the second in italics)

Notice that for each of these tags the start and end markers are the same character.
Now let us see how this can be implemented using regular expressions.


<form name="form1" method="get" action="">
  <p>
    <!-- textarea should display previously written text -->
    <textarea name="content" cols="35" rows="12" id="content"><?php if (isset($_GET['content'])) echo htmlspecialchars($_GET['content']); ?></textarea>
  </p>
  <p>
    <input name="parse" type="submit" id="parse" value="Parse">
  </p>
</form>

<?php

if (isset($_GET['parse']))
{
    $content = $_GET['content'];

    //escape HTML special characters so that the submitted text
    //cannot inject its own markup into the page
    $content = htmlspecialchars($content);

    //convert newlines in the text to HTML "<br />"
    //required to keep formatting (newlines)
    $content = nl2br($content);

    //match anything between the tags except the tag character itself
    //otherwise '*hello* world *hello*'
    //would be printed as 'hello* world *hello' in bold
    //and not as 'hello(in bold) world hello(again in bold)'
    //'preg_replace' is used because 'eregi_replace' was removed in PHP 7
    //'[^*]+' means one or more characters that are not '*'
    $content = preg_replace('/\*([^*]+)\*/', '<strong>\\1</strong>', $content);
    $content = preg_replace('/_([^_]+)_/', '<i>\\1</i>', $content);

    //now the variable $content contains HTML formatted text
    //display it
    echo '<hr />';
    echo $content;
}

?>
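
The comment in the script about '*hello* world *hello*' can be checked with a small standalone sketch. The function names parse_im_tags and parse_im_tags_greedy are hypothetical and exist only for this comparison; the sketch assumes the preg_replace function from PHP's PCRE extension and contrasts a greedy '.+' with the negated character class '[^*]+'.

```php
<?php
// Hypothetical helper (not from the original post): converts *bold* and
// _italic_ markers using negated character classes.
function parse_im_tags($text)
{
    $text = htmlspecialchars($text);
    $text = preg_replace('/\*([^*]+)\*/', '<strong>$1</strong>', $text);
    $text = preg_replace('/_([^_]+)_/', '<i>$1</i>', $text);
    return $text;
}

// Hypothetical counter-example: a greedy '.+' runs on to the LAST asterisk.
function parse_im_tags_greedy($text)
{
    return preg_replace('/\*(.+)\*/', '<strong>$1</strong>', $text);
}

echo parse_im_tags('Hello *World*. Hello _World_.'), "\n";
// prints: Hello <strong>World</strong>. Hello <i>World</i>.

echo parse_im_tags('*hello* world *hello*'), "\n";
// prints: <strong>hello</strong> world <strong>hello</strong>

echo parse_im_tags_greedy('*hello* world *hello*'), "\n";
// prints: <strong>hello* world *hello</strong>
```

The negated class stops at the first closing marker, which is why each marked word is wrapped separately.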