Regular expressions in PHP

In this part of the PHP tutorial, we cover regular expressions in PHP.

Regular expressions are used for text searching and more advanced text manipulation. They are built into tools like grep and sed, text editors like vi and emacs, and programming languages like Tcl, Perl, and Python. PHP has built-in support for regular expressions too.

In PHP, there are two modules for regular expressions: the POSIX Regex and the PCRE. The POSIX Regex is deprecated; its functions were removed in PHP 7. In this chapter, we will use the PCRE examples. PCRE stands for Perl Compatible Regular Expressions.

Two things are needed when we work with regular expressions: a regex function and a pattern.

A pattern is a regular expression that defines the text we are searching for or manipulating. It consists of text literals and metacharacters. The pattern is placed inside two delimiters. These are usually //, ##, or @@ characters. They inform the regex function where the pattern starts and ends.
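The choice of delimiter matters when the delimiter character also occurs in the pattern. The following is a minimal sketch; the URL is a made-up value used only for illustration. It shows the same pattern written once with / and once with # as the delimiter.

```php
<?php

// Hypothetical URL, used only for illustration.
$url = 'https://example.com/docs/php';

// With / as the delimiter, each literal slash in the pattern must be escaped.
$withSlashes = preg_match('/^https:\/\/example\.com\/docs/', $url);

// With # as the delimiter, the slashes can stay as they are.
$withHashes = preg_match('#^https://example\.com/docs#', $url);

// Both calls return 1.
var_dump($withSlashes, $withHashes);
```

Both forms match the same text; the second is simply easier to read.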

Here is a partial list of metacharacters used in PCRE.

.      Matches any single character.
*      Matches the preceding element zero or more times.
[ ]    Bracket expression. Matches a character within the brackets.
[^ ]   Matches a single character that is not contained within the brackets.
^      Matches the starting position within the string.
$      Matches the ending position within the string.
|      Alternation operator.

PCRE functions

We introduce four PCRE regex functions: preg_split(), preg_match(), preg_replace(), and preg_grep(). They all have a preg prefix.

Next we will have an example for each function.

php > print_r(preg_split("@\s@", "Jane\tKate\nLucy Marion"));
Array
(
    [0] => Jane
    [1] => Kate
    [2] => Lucy
    [3] => Marion
)

We have four names divided by whitespace: a tab, a newline, and a space. The \s is a character class that stands for any whitespace character. The preg_split() function returns the split strings in an array.
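When names are separated by runs of mixed whitespace, a single \s would produce empty elements between consecutive separators. The sketch below, with an input string made up for illustration, uses \s+ together with the PREG_SPLIT_NO_EMPTY flag to avoid that.

```php
<?php

// Hypothetical input: names separated by irregular runs of whitespace.
$line = "Jane \t Kate\n\nLucy   Marion";

// \s+ treats each run of whitespace as one separator;
// PREG_SPLIT_NO_EMPTY drops any empty pieces from the result.
$names = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);

// Prints the four names: Jane, Kate, Lucy, Marion.
print_r($names);
```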

php > echo preg_match("#[a-z]#", "s");
1

The preg_match() function checks whether the 's' character is in the character class [a-z]. The class stands for all lowercase characters from a to z. The function returns 1 when the pattern matches.
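Besides 1, preg_match() can return 0 (no match) or false (an error, such as a pattern that does not compile). A small sketch of the three cases follows; the broken pattern is deliberately invalid, and the @ operator is used here only to silence the warning it triggers.

```php
<?php

$hit  = preg_match('#[a-z]#', 's');   // 1: 's' is in the class
$miss = preg_match('#[a-z]#', 'S');   // 0: 'S' is not a lowercase letter

// The closing ] is missing, so the pattern cannot be compiled.
$bad = @preg_match('#[a-z#', 's');    // false

var_dump($hit, $miss, $bad);
```

Because 0 and false are both falsy, a strict comparison such as `=== false` is needed when the two cases must be told apart.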

php > echo preg_replace("/Jane/","Beky","I saw Jane. Jane was beautiful.");
I saw Beky. Beky was beautiful.

The preg_replace() function replaces all occurrences of the word 'Jane' with the word 'Beky'.
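The preg_replace() function also accepts an optional limit and an optional counter. The following sketch reuses the sentence from the example and replaces only the first occurrence.

```php
<?php

$text = 'I saw Jane. Jane was beautiful.';

// The fourth argument limits the number of replacements;
// the fifth receives how many replacements were actually made.
$onlyFirst = preg_replace('/Jane/', 'Beky', $text, 1, $count);

echo $onlyFirst, "\n";          // I saw Beky. Jane was beautiful.
echo "Replacements: $count\n";  // Replacements: 1
```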

php > print_r(preg_grep("#Jane#", ["Jane", "jane", "Joan", "JANE"]));
Array
(
    [0] => Jane
)

The preg_grep() function returns an array of words that match the given pattern. In this example, only one word is returned in the array. This is because by default, the search is case sensitive.

php > print_r(preg_grep("#Jane#i", ["Jane", "jane", "Joan", "JANE"]));
Array
(
    [0] => Jane
    [1] => jane
    [3] => JANE
)

In this example, we perform a case insensitive grep. We put the i modifier after the right delimiter. The returned array now has three words. Note that the original array keys (0, 1, and 3) are preserved.
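Since preg_grep() keeps the keys of the input array, the result may have gaps in its numbering. The sketch below shows one way to renumber it with array_values(), and how the PREG_GREP_INVERT flag returns the non-matching elements instead.

```php
<?php

$names = ['Jane', 'jane', 'Joan', 'JANE'];

// preg_grep() keeps the original keys: 0, 1, and 3 here.
$matches = preg_grep('#Jane#i', $names);

// array_values() renumbers the result from zero.
$reindexed = array_values($matches);

// PREG_GREP_INVERT returns the elements that do NOT match.
$others = preg_grep('#Jane#i', $names, PREG_GREP_INVERT);

print_r($reindexed);
print_r($others);
```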

The dot metacharacter

The . (dot) metacharacter stands for any single character in the text.

single.php
<?php

$words = [ "Seven", "even", "Maven", "Amen", "Leven" ];
$pattern = "/.even/";

foreach ($words as $word) {

    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>

In the $words array, we have five words.

$pattern = "/.even/";

Here we define the search pattern. The pattern is a string. The regular expression is placed within delimiters. The delimiters are mandatory. In our case, we use forward slashes / / as delimiters. Note that we can use different delimiters if we want. The dot character stands for any single character.

if (preg_match($pattern, $word)) {
    echo "$word matches the pattern\n";
} else {
    echo "$word does not match the pattern\n";
}

We test whether each of the five words matches the pattern.

$ php single.php 
Seven matches the pattern
even does not match the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern

The Seven and Leven words match our search pattern.
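One detail worth knowing is that, by default, the dot does not match a newline character. The following sketch uses a two-line string made up for illustration and shows how the s modifier changes that behaviour.

```php
<?php

// Hypothetical subject: 'a' and 'b' separated by a newline.
$text = "a\nb";

$default = preg_match('/a.b/', $text);   // 0: the dot does not match "\n"
$dotAll  = preg_match('/a.b/s', $text);  // 1: with s, the dot matches "\n" too

var_dump($default, $dotAll);
```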

Anchors

Anchors match positions inside a given text rather than characters.

In the next example, we check whether a string is located at the beginning of a sentence.

anchors.php
<?php

$sentence1 = "Everywhere I look I see Jane";
$sentence2 = "Jane is the best thing that happened to me";

if (preg_match("/^Jane/", $sentence1)) {
    echo "Jane is at the beginning of the \$sentence1\n";
} else {
    echo "Jane is not at the beginning of the \$sentence1\n";
}

if (preg_match("/^Jane/", $sentence2)) {
    echo "Jane is at the beginning of the \$sentence2\n";
} else {
    echo "Jane is not at the beginning of the \$sentence2\n";
}

?>

We have two sentences. The pattern is ^Jane. The pattern checks whether the 'Jane' string is located at the beginning of the text.

$ php anchors.php 
Jane is not at the beginning of the $sentence1
Jane is at the beginning of the $sentence2

php > echo preg_match("#Jane$#", "I love Jane");
1
php > echo preg_match("#Jane$#", "Jane does not love me");
0

The Jane$ pattern matches a string in which the word Jane is at the end.
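By default, $ also matches just before a newline that ends the string, which can matter when the input comes from a file or a form. The sketch below, using a made-up string with a trailing newline, compares $ with the D modifier and with the \z assertion, both of which accept only the very end of the string.

```php
<?php

// Hypothetical subject with a trailing newline.
$text = "I love Jane\n";

$loose   = preg_match('#Jane$#', $text);   // 1: $ matches before the final "\n"
$strictD = preg_match('#Jane$#D', $text);  // 0: D makes $ match only at the very end
$strictZ = preg_match('#Jane\z#', $text);  // 0: \z always means the very end

var_dump($loose, $strictD, $strictZ);
```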

Exact word match

In the following examples we show how to look for exact word matches.

php > echo preg_match("/mother/", "mother");
1
php > echo preg_match("/mother/", "motherboard");
1
php > echo preg_match("/mother/", "motherland");
1

The mother pattern fits the words mother, motherboard and motherland. Say, we want to look just for exact word matches. We will use the aforementioned anchor ^ and $ characters.

php > echo preg_match("/^mother$/", "motherland");
0
php > echo preg_match("/^mother$/", "Who is your mother?");
0
php > echo preg_match("/^mother$/", "mother");
1

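Using the anchor characters, we get an exact match: the whole string must equal mother. This is why the sentence "Who is your mother?" does not match, even though it contains the word.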
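To find a whole word inside a longer sentence, PCRE offers the \b word boundary assertion. The following sketch is an alternative to the ^ and $ anchors, not part of the examples above; it matches mother only when it is not part of a longer word.

```php
<?php

$pattern = '/\bmother\b/';

$inSentence = preg_match($pattern, 'Who is your mother?');  // 1
$alone      = preg_match($pattern, 'mother');               // 1
$board      = preg_match($pattern, 'motherboard');          // 0
$land       = preg_match($pattern, 'motherland');           // 0

var_dump($inSentence, $alone, $board, $land);
```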

Quantifiers

A quantifier after a token or a group specifies how often that preceding element is allowed to occur.

 ?     - 0 or 1 match
 *     - 0 or more
 +     - 1 or more
 {n}   - exactly n
 {n,}  - n or more
 {0,n} - n or fewer (write the 0; older PCRE versions read {,n} as literal text)
 {n,m} - range n to m

The above is a list of common quantifiers.
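The quantifiers can be compared side by side with a small experiment. In the sketch below, the test strings and the ^ab…c$ frame are made up for illustration; only the number of b characters varies, so each line of output shows which counts a quantifier accepts.

```php
<?php

// Hypothetical test strings with zero to three b's.
$subjects = ['ac', 'abc', 'abbc', 'abbbc'];
$quantifiers = ['?', '*', '+', '{2}', '{2,}', '{0,2}', '{1,2}'];

$results = [];
foreach ($quantifiers as $q) {
    // Build e.g. /^ab{2,}c$/ so that only the quantifier on b changes.
    $pattern = '/^ab' . $q . 'c$/';
    $results[$q] = array_values(preg_grep($pattern, $subjects));
    printf("b%-6s %s\n", $q, implode(', ', $results[$q]));
}
```

For example, the line for b{2,} lists abbc and abbbc, while the line for b? lists ac and abc.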

The question mark ? indicates there is zero or one of the preceding element.

zeroorone.php
<?php

$words = [ "color", "colour", "comic", "colourful", "colored", 
    "cosmos", "coloseum", "coloured", "colourful" ];
$pattern = "/colou?r/";

foreach ($words as $word) {
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>

We have nine words in the $words array.

$pattern = "/colou?r/";

Color is used in American English, colour in British English. This pattern matches both cases.

$ php zeroorone.php 
color matches the pattern
colour matches the pattern
comic does not match the pattern
colourful matches the pattern
colored matches the pattern
cosmos does not match the pattern
coloseum does not match the pattern
coloured matches the pattern
colourful matches the pattern

This is the output of the zeroorone.php script.

The * metacharacter matches the preceding element zero or more times.

zeroormore.php
<?php

$words = [ "Seven", "even", "Maven", "Amen", "Leven" ];

$pattern = "/.*even/";

foreach ($words as $word) {
    
    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>

In the above script, we have added the * metacharacter. The .* combination means zero or more occurrences of any character.

$ php zeroormore.php 
Seven matches the pattern
even matches the pattern
Maven does not match the pattern
Amen does not match the pattern
Leven matches the pattern

Now the pattern matches three words: Seven, even and Leven.
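Because preg_match() searches anywhere in the string, a leading .* does not change whether a pattern matches. The sketch below checks this on the same five words by comparing /.*even/ with the plain /even/ pattern.

```php
<?php

$words = ['Seven', 'even', 'Maven', 'Amen', 'Leven'];

$same = [];
foreach ($words as $word) {
    $withStar = preg_match('/.*even/', $word);
    $plain    = preg_match('/even/', $word);

    // Record whether both patterns agree for this word.
    $same[$word] = ($withStar === $plain);
    printf("%-6s %d %d\n", $word, $withStar, $plain);
}
```

The two columns agree for every word. The .* becomes significant once the pattern is anchored or the matched text is extracted.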

php > print_r(preg_grep("#o{2}#", ["gool", "root", "foot", "dog"]));
Array
(
    [0] => gool
    [1] => root
    [2] => foot
)

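The o{2} pattern matches strings that contain two consecutive 'o' characters. Because the pattern is not anchored, a string with a longer run, such as 'fooot', would match too.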
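To see the difference between "contains oo" and "contains exactly two o characters", the list can be extended with a word that has three of them. In the sketch below, fooot is a made-up word added for the test, and the second pattern is one possible way to demand exactly two.

```php
<?php

// 'fooot' is a made-up word added to show the difference.
$words = ['gool', 'root', 'foot', 'dog', 'fooot'];

// Unanchored: any string containing "oo" matches, including "fooot".
$atLeastTwo = array_values(preg_grep('#o{2}#', $words));

// Anchored, with no other o allowed before or after the pair.
$exactlyTwo = array_values(preg_grep('#^[^o]*o{2}[^o]*$#', $words));

print_r($atLeastTwo);  // gool, root, foot, fooot
print_r($exactlyTwo);  // gool, root, foot
```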

php > print_r(preg_grep("#^\d{2,4}$#", ["1", "12", "123", "1234", "12345"]));
Array
(
    [1] => 12
    [2] => 123
    [3] => 1234
)

We have this ^\d{2,4}$ pattern. The \d is a character set; it stands for digits. The pattern matches numbers that have 2, 3, or 4 digits.

Alternation

The next example explains the alternation operator |. This operator lets us create a regular expression with several choices.

alternation.php
<?php

$names = [ "Jane", "Thomas", "Robert", "Lucy", "Beky", 
    "John", "Peter", "Andy" ];

$pattern = "/Jane|Beky|Robert/";

foreach ($names as $name) {

    if (preg_match($pattern, $name)) {
        echo "$name is my friend\n";
    } else {
        echo "$name is not my friend\n";
    }
}

?>

We have eight names in the $names array.

$pattern = "/Jane|Beky|Robert/";

This is the search pattern. The pattern looks for the 'Jane', 'Beky', or 'Robert' strings.

$ php alternation.php 
Jane is my friend
Thomas is not my friend
Robert is my friend
Lucy is not my friend
Beky is my friend
John is not my friend
Peter is not my friend
Andy is not my friend

This is the output of the script.
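The pattern above is not anchored, so a longer name that merely contains one of the alternatives would be accepted as well. The sketch below uses Janet as a made-up example and shows a common pitfall: anchors bind more tightly than |, so the alternatives have to be grouped in parentheses before anchoring.

```php
<?php

// 'Janet' is a hypothetical name that contains 'Jane'.
$name = 'Janet';

$loose   = preg_match('/Jane|Beky|Robert/', $name);      // 1: contains 'Jane'
$pitfall = preg_match('/^Jane|Beky|Robert$/', $name);    // 1: read as ^Jane, Beky, or Robert$
$strict  = preg_match('/^(Jane|Beky|Robert)$/', $name);  // 0: the whole string must be one name

var_dump($loose, $pitfall, $strict);
```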

Subpatterns

We can use parentheses () to create subpatterns inside patterns.

php > echo preg_match("/book(worm)?$/", "bookworm");
1
php > echo preg_match("/book(worm)?$/", "book");
1
php > echo preg_match("/book(worm)?$/", "worm");
0

We have the following regex pattern: book(worm)?$. The (worm) is a subpattern. The ? character follows the subpattern, which means that the subpattern may appear zero times or once in the matched text. The $ character is here for the exact end match of the string. Without it, words like bookstore and bookmania would match too.

php > echo preg_match("/book(shelf|worm)?$/", "book");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookshelf");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookworm");
1
php > echo preg_match("/book(shelf|worm)?$/", "bookstore");
0

Subpatterns are often used with alternation. The (shelf|worm) subpattern lets us create several word combinations.
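A subpattern also captures the text it matched. As a small sketch, the code below passes a third argument to preg_match() to find out which alternative, if any, followed book; the $suffixes array is a helper introduced only for this illustration.

```php
<?php

$words = ['book', 'bookshelf', 'bookworm'];

// Helper array for this illustration: word => captured suffix.
$suffixes = [];

foreach ($words as $word) {
    if (preg_match('/book(shelf|worm)?$/', $word, $m)) {
        // $m[1] holds the text captured by (shelf|worm), if there was any.
        $suffixes[$word] = $m[1] ?? '';
        echo $word, ': ', $suffixes[$word] === '' ? 'no suffix' : $suffixes[$word], "\n";
    }
}
```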

Character classes

We can combine characters into character classes with the square brackets. A character class matches any character that is specified in the brackets.

characterclass.php
<?php

$words = [ "sit", "MIT", "fit", "fat", "lot" ];

$pattern = "/[fs]it/";

foreach ($words as $word) {

    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>

We define a character set with two characters.

$pattern = "/[fs]it/";

This is our pattern. The [fs] is the character class. Note that we work only with one character at a time. We either consider f, or s, but not both.

$ php characterclass.php 
sit matches the pattern
MIT does not match the pattern
fit matches the pattern
fat does not match the pattern
lot does not match the pattern

This is the outcome of the script.

We can also use shorthand metacharacters for character classes. The \w stands for word characters (letters, digits, and the underscore), \d for digits, and \s for whitespace characters.

shorthand.php
<?php

$words = [ "Prague", "111978", "terry2", "mitt##" ];
$pattern = "/\w{6}/";

foreach ($words as $word) {

    if (preg_match($pattern, $word)) {
        echo "$word matches the pattern\n";
    } else {
        echo "$word does not match the pattern\n";
    }
}

?>

In the above script, we test for words that contain six word characters in a row. The \w{6} stands for six consecutive word characters. Only the word mitt## does not match, because it has just four word characters before the # signs.
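Two properties of \w{6} are easy to overlook: \w accepts the underscore, and without anchors a longer word matches as well. The sketch below illustrates both with made-up strings.

```php
<?php

// Hypothetical test strings.
$underscore = preg_match('/\w{6}/', 'snake_');     // 1: the underscore counts as \w
$longer     = preg_match('/\w{6}/', 'Prague2');    // 1: seven characters contain six
$anchored   = preg_match('/^\w{6}$/', 'Prague2');  // 0: exactly six are required
$symbols    = preg_match('/^\w{6}$/', 'mitt##');   // 0: # is not a word character

var_dump($underscore, $longer, $anchored, $symbols);
```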

php > echo preg_match("#[^a-z]{3}#", "ABC");
1

The #[^a-z]{3}# pattern stands for three characters that are not in the class a-z. The "ABC" characters match the condition.

php > print_r(preg_grep("#\d{2,4}#", [ "32", "234", "2345", "3d3", "2"]));
Array
(
    [0] => 32
    [1] => 234
    [2] => 2345
)

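In the above example, the pattern matches strings that contain a run of at least two digits. The quantifier consumes at most four of them, but because the pattern is not anchored, a longer number such as 12345 would match too.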
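The effect of the anchors can be checked by adding a five-digit value to the list. In this sketch, 12345 is added for the comparison; the anchored pattern is the one used earlier in the quantifiers section.

```php
<?php

$values = ['32', '234', '2345', '3d3', '2', '12345'];

// Unanchored: any string containing two to four digits in a row.
$contains = array_values(preg_grep('#\d{2,4}#', $values));

// Anchored: the whole string must consist of two to four digits.
$whole = array_values(preg_grep('#^\d{2,4}$#', $values));

print_r($contains);  // 32, 234, 2345, 12345
print_r($whole);     // 32, 234, 2345
```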

Extracting matches

The preg_match() function takes an optional third parameter. If it is provided, it is filled with the results of the search. The variable is an array whose first element contains the text that matched the full pattern, the second element contains the first captured parenthesized subpattern, and so on.

extract_matches.php
<?php

$times = [ "10:10:22", "23:23:11", "09:06:56" ];

$pattern = "/(\d\d):(\d\d):(\d\d)/";

foreach ($times as $time) {

    $r = preg_match($pattern, $time, $match);
    
    if ($r) {
        
        echo "The $match[0] is split into:\n";
        
        echo "Hour: $match[1]\n";
        echo "Minute: $match[2]\n";
        echo "Second: $match[3]\n";
    } 
}

?>

In the example, we extract parts of a time string.

$times = [ "10:10:22", "23:23:11", "09:06:56" ];

We have three time strings in English locale.

$pattern = "/(\d\d):(\d\d):(\d\d)/";

The pattern is divided into three subpatterns using parentheses. We want to refer to each of these parts individually.

$r = preg_match($pattern, $time, $match);

We pass a third parameter to the preg_match() function. In case of a match, it contains text parts of the matched string.

if ($r) {
    
    echo "The $match[0] is split into:\n";
    
    echo "Hour: $match[1]\n";
    echo "Minute: $match[2]\n";
    echo "Second: $match[3]\n";
} 

The $match[0] contains the text that matched the full pattern, $match[1] contains text that matched the first subpattern, $match[2] the second, and $match[3] the third.

$ php extract_matches.php 
The 10:10:22 is split into:
Hour: 10
Minute: 10
Second: 22
The 23:23:11 is split into:
Hour: 23
Minute: 23
Second: 11
The 09:06:56 is split into:
Hour: 09
Minute: 06
Second: 56

This is the output of the example.
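Numbered indexes become hard to follow as patterns grow. As an alternative sketch, the subpatterns can be given names with the (?<name>...) syntax; the names hour, minute, and second are chosen here for illustration.

```php
<?php

// Same pattern as above, with a name assigned to each subpattern.
$pattern = '/(?<hour>\d\d):(?<minute>\d\d):(?<second>\d\d)/';

if (preg_match($pattern, '23:23:11', $match)) {
    // Named groups are available by name as well as by number.
    echo "Hour: {$match['hour']}\n";
    echo "Minute: {$match['minute']}\n";
    echo "Second: {$match['second']}\n";
}
```

The numbered entries $match[1] to $match[3] are still filled, so existing code that relies on them keeps working.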

Email example

Next we have a practical example. We create a regex pattern for checking email addresses.

emails.php
<?php

$emails = [ "luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", 
    "f344@gmail.com"];

# regular expression for emails
$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$/";

foreach ($emails as $email) {

    if (preg_match($pattern, $email)) {
        echo "$email matches \n";
    } else {
        echo "$email does not match\n";
    }
}

?>

Note that this example provides only one solution. It does not have to be the best one.

$pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$/";

This is the pattern. The leading ^ and the trailing $ characters are here to get an exact pattern match. No characters before or after the pattern are allowed. The email is divided into five parts.

The first part is the local part. This is usually the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ class lists the characters that we allow in the local part. They can be used one or more times. The second part is the literal @ character.

The third part is the domain part. It is usually the domain name of the email provider, like yahoo or gmail. The [a-zA-Z0-9-]+ is a character class providing all characters that can be used in the domain name. The + quantifier requires one or more of these characters. The fourth part is the dot character. It is preceded by the escape character (\). This is because the dot character is a metacharacter and has a special meaning. By escaping it, we get a literal dot.

The final part is the top level domain. The pattern is as follows: [a-zA-Z.]{2,18}. Top level domains can have from 2 to 18 characters, like sk, net, info, travel, cleaning, travelinsurance. The maximum length can be 63 characters, but most domains are shorter than 18 characters today. The class also contains a dot character. This is because some addresses end in a two-part suffix; for example co.uk.

$ php emails.php 
luke@gmail.com matches 
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches 

This is the output of the emails.php example.
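PHP also ships a ready-made validator for this task. As an alternative to a hand-written pattern, the sketch below runs the same four addresses through filter_var() with the FILTER_VALIDATE_EMAIL filter; its rules are not identical to the pattern above, so the two approaches may disagree on unusual addresses.

```php
<?php

$emails = ['luke@gmail.com', 'andy@yahoocom', '34234sdfa#2345', 'f344@gmail.com'];

foreach ($emails as $email) {
    // filter_var() returns the address on success and false on failure.
    $ok = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
    echo $email, $ok ? " looks valid\n" : " looks invalid\n";
}
```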

Recap

Finally, we provide a quick recap of the regex patterns.

Jane           the 'Jane' string
^Jane          'Jane' at the start of a string
Jane$          'Jane' at the end of a string
^Jane$         exact match of the string 'Jane'
[abc]          a, b, or c
[a-z]          any lowercase letter
[^A-Z]         any character that is not an uppercase letter
(Jane|Becky)   matches either 'Jane' or 'Becky'
[a-z]+         one or more lowercase letters
^[98]?$        digit 9, digit 8, or an empty string
([wx])([yz])   wy, wz, xy, or xz
[0-9]          any digit
[^A-Za-z0-9]   any symbol (not a number or a letter)

In this chapter, we have covered regular expressions in PHP.