|
One of PHP's most useful features is its string processing
abilities. Feed PHP any string, and it can process it in any number of different
ways with a multitude of built-in functions.
Finding letter occurrences, replacing certain words, limiting the number of
characters, etc. - it's all made very easy.
One very useful function in
particular is preg_replace(), which allows you to find certain
occurrences of words in an advanced, customized way and replace
them with a string of your choice. The search pattern can be a simple string
(although I recommend you use str_replace() for that case due to
its superior speed), or it can be a regular expression (REGEX). These
regular expressions are like targeted wildcards, albeit MUCH more complex.
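As a rough sketch of the difference (the sample sentence is made up for illustration, and the `\b` word boundary is an extra not covered in this tutorial), the two functions could be compared like this:

```php
<?php
// Hypothetical sample text; any string would do.
$text = 'The cat sat next to another cat and a Cat.';

// Plain string search: str_replace() needs no pattern at all.
$plain = str_replace('cat', 'dog', $text);

// Regular expression: match "cat" or "Cat" as a whole word.
// \b is a word boundary, [Cc] accepts either letter.
$regex = preg_replace('/\b[Cc]at\b/', 'dog', $text);

echo $plain, "\n"; // The dog sat next to another dog and a Cat.
echo $regex, "\n"; // The dog sat next to another dog and a dog.
```

The plain version misses "Cat" because it only knows one exact string; the pattern version catches both spellings in a single call.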
The aim of this tutorial is to describe how various
REGEX expressions are put together, what they do, and how to customize them for your own
purposes. As you can guess, this is an ADVANCED tutorial, so no effort
will be made to explain the preg_replace() or str_replace()
functions themselves. If you need this tutorial, you are more than likely able to
read the PHP manual anyway... ;)
Basic REGEX
Make no mistake - REGEX is widely used today - even searches in Microsoft Windows use
something similar to some degree. Let me point you towards a simple example:
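*.* -
This is a wildcard pattern, a simpler cousin of REGEX, and in Windows it means "find any file with any extension in a
given directory". In PHP it would not work as written: the dot (.) means "any character" and the asterisk (*) means
"zero or more of whatever comes before it", so a pattern cannot start with an asterisk. The PHP version would be
.+\..+ which means "one or more characters followed by
a dot followed by one or more characters".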
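A minimal sketch of how the Windows-style wildcard and a PHP pattern behave differently (the file names are hypothetical):

```php
<?php
// Hypothetical file name used only for illustration.
$file = 'report.txt';

// "*.*" cannot be compiled as a pattern, because the leading "*" has
// nothing to repeat; preg_match() returns false (the @ hides the warning).
$invalid = @preg_match('/*.*/', $file);

// ".+" is one or more of any character, "\." is a literal dot.
$valid = preg_match('/.+\..+/', $file);

var_dump($invalid); // bool(false)
var_dump($valid);   // int(1)
```

Note that preg_match() returns 1 for a match, 0 for no match and false for a broken pattern, so it is worth checking for false explicitly.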
Let us enhance that a little:
[A-Z]*.* -
The "[A-Z]" is a character class and it basically means any uppercase letter from A to Z.
If you want to collect lowercase letters you would enter "[a-z]". If you would like to
collect any letter, the obvious solution would be "[A-Za-z]".
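To see the three classes at work, here is a small sketch that counts matching characters in a made-up string (the input is an assumption chosen for illustration):

```php
<?php
// Hypothetical input string for illustration.
$input = 'Hello World 42';

// preg_match_all() returns the number of matches it found.
$upper   = preg_match_all('/[A-Z]/', $input);    // H and W
$lower   = preg_match_all('/[a-z]/', $input);    // e, l, l, o, o, r, l, d
$letters = preg_match_all('/[A-Za-z]/', $input); // every letter

echo "$upper uppercase, $lower lowercase, $letters letters\n";
// 2 uppercase, 8 lowercase, 10 letters
```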
TIP: If you want to check for a custom range of characters you could always use
[g-p], etc.
Occurrence-Counting REGEX
A character class followed by a "*" means "zero or more characters from
the selected character class". So this string: [a-z]* would mean "zero or more lowercase letters". If you need to check for at
least one occurrence of a letter you would use:
[a-z]+ -
A "+" basically means "one or more occurrences". You could also do:
[a-z]{1} -
This means "exactly one occurrence of a lowercase letter". A range is written with a comma, so
"two to three occurrences of a lowercase letter" would be: [a-z]{2,3}
If you want to check for an optional character you use the question mark (?),
like this: [a-z]? -
And the explanation of this line is "an optional lowercase character". Now
that we have this covered let's move on...
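The quantifiers above can be tried out with a small sketch like the following. The helper matchesWhole() is hypothetical, and the ^ and $ it wraps around each pattern (so that the whole string has to match) are an assumption made for this demo:

```php
<?php
// Hypothetical helper: true when the WHOLE subject matches the pattern.
// The ^ and $ anchors it adds are an assumption made for this demo.
function matchesWhole(string $pattern, string $subject): bool
{
    return preg_match('/^' . $pattern . '$/', $subject) === 1;
}

var_dump(matchesWhole('[a-z]*', ''));         // bool(true):  zero letters is fine
var_dump(matchesWhole('[a-z]+', ''));         // bool(false): needs at least one
var_dump(matchesWhole('[a-z]{1}', 'a'));      // bool(true):  exactly one
var_dump(matchesWhole('[a-z]{2,3}', 'abc'));  // bool(true):  two to three
var_dump(matchesWhole('[a-z]{2,3}', 'abcd')); // bool(false): four is too many
var_dump(matchesWhole('x[a-z]?', 'x'));       // bool(true):  the letter is optional
```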
Character-Counting REGEX
^(.){4,6}$ -
In PHP REGEX the caret (^) symbol means the beginning of the line, and
the dollar ($) symbol means the end of the line. Without any modifiers "the line" is the whole string: the end of the line
is reached at the end of the string, or just before a final '\n' character. So this expression means "the start of the line
followed by 4 to 6 of any character followed by the end of the line". Yes, the dot
(.) character means "any character" (except a newline, by default). So the line: (.*)
would mean "any amount of any character". The caret (^) character can also be
used for negating character classes. By negating I mean matching only
characters that are NOT in the specified range. So a string like ^[^0-9]*$ would mean "the start of the line followed by zero or more characters that are NOT
digits followed by the end of the line".
The Zen of Brackets
By now you have probably noticed all
the different brackets that are used. All of them have a different meaning. Let me explain:
-
The parentheses "(" and ")" are used to group different expressions together.
If you are using preg_replace() you can refer back to a group in the replacement string using a simple
"$n", where n is a digit giving the position of the group, counted from left to right,
in the REGEX string. So, if you want to extract the text from the second
group in this: ^([a-z]+)[A-Z]?([0-5]{1,3})$ you would have to use "$2" (the
first group is ([a-z]+) and the second is ([0-5]{1,3})). And, of course, the usual
translation of the string to human language is "the start of the line followed by one or
more lowercase letters followed by an optional uppercase letter followed by 1 to
3 digits no higher than 5 followed by the end of the line".
-
The curly brackets "{" and "}" represent the widely used minimum/maximum
values. As explained earlier, they can be used to further customize checking for
characters in a string instead of the usual "one or more" or "zero or more".
Syntax would be: {n} for exactly n e.g. {1}, {n,} for n or more e.g. {2,}, or {n,m} for
no fewer than n
characters and no more than m characters e.g. {3,7}
-
And finally, of course, there are the square brackets "[" and "]". These represent a character
class, usually written as a range, which was also explained earlier. The syntax for this one is: [a-b]
where
a is the range start and b is the range end e.g. [A-Z]
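The anchors, the negated class and the numbered groups described above can be sketched together like this. The input strings are hypothetical and were picked only so that each result is easy to check by eye:

```php
<?php
// ^(.){4,6}$ : a line of 4 to 6 characters.
$short = preg_match('/^(.){4,6}$/', 'abc');   // 0: only three characters
$fits  = preg_match('/^(.){4,6}$/', 'abcde'); // 1: five characters

// ^[^0-9]*$ : a line without any digits.
$noDigits  = preg_match('/^[^0-9]*$/', 'no digits here'); // 1
$hasDigits = preg_match('/^[^0-9]*$/', 'route 66');       // 0

// Groups are numbered from left to right; $2 is the second group.
// "abcX123" is a made-up input that matches every part of the pattern.
$second = preg_replace('/^([a-z]+)[A-Z]?([0-5]{1,3})$/', '$2', 'abcX123');

echo $second, "\n"; // 123
```

The replacement string is in single quotes on purpose, so PHP does not try to treat "$2" as a variable.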
Of course, a pattern doesn't have to be all REGEX. You can also check for
occurrences of words in a more advanced way. If, for example, you would like to search for a
string containing the word "military" followed by an optional digit followed by
the end of the line, you would write something like this: [Mm]ilitary[0-9]?$
Take note that the "[Mm]" is also a character class - it specifies a search for
either character in the brackets. You can use all kinds of characters in your
searches, but if you want to use a special character (e.g. a bracket) as a literal you will need to escape it using the all-saving backslash (\).
This is, of course, the rule for PHP in general anyway! So,
for example,
if you want to search for "[word]" you would write the REGEX
like this: (\[word\])
Commonly Used Examples
Now that we have all the advanced theory out of the way, here
are some frequently used reference REGEX expressions found in popular PHP-driven
scripts:
\[b\](.*?)\[/b\] -
What you see here is REGEX used to search for text enclosed in [b] and
[/b]
tags. The "?" after the "*" makes the match lazy, so it stops at the first [/b] instead of the last one. This is used very
widely in forums, news systems of all kinds, etc.
^[0-9A-Za-z]{8,15}$
-
This could be used in scripts that utilise registration with passwords. This
REGEX only accepts a
string made up of digits and letters, with a minimum of 8 and a maximum of 15 characters.
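Here is a minimal sketch of both examples in use. The sample post, the choice of <strong> as the output tag and the test passwords are all assumptions for the demo. The password pattern is anchored with ^ and $ so that nothing else may appear in the string:

```php
<?php
// [b]...[/b] to <strong>...</strong>. "#" is used as the delimiter so that
// the "/" in [/b] needs no escaping.
$post = 'This is [b]bold[/b] and so is [b]this[/b].';
$html = preg_replace('#\[b\](.*?)\[/b\]#', '<strong>$1</strong>', $post);

echo $html, "\n";
// This is <strong>bold</strong> and so is <strong>this</strong>.

// Password check: 8 to 15 letters or digits and nothing else.
$pattern = '/^[0-9A-Za-z]{8,15}$/';
$ok       = preg_match($pattern, 'secret12');    // 1: eight valid characters
$tooShort = preg_match($pattern, 'short1');      // 0: only six characters
$badChar  = preg_match($pattern, 'no spaces 1'); // 0: spaces are not allowed

var_dump($ok, $tooShort, $badChar);
```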
The Speed Issue & Techniques
Using preg_replace() is definitely convenient, but it isn't too fast
considering that PHP has to
compile the pattern and interpret its metacharacters first instead of proceeding straight to the
searching. I can't stress this enough: if you want to search a rather large text
file for the word "cat" then, FOR THE LOVE OF GOD, use the strstr()
function (or strpos(), which the PHP manual itself recommends for this job) instead
of preg_match(). Don't use preg functions when you're not using REGEX.
Trust me on this one!
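A quick sketch of the plain-string alternative. The text is generated here just for the demo; in a real script it might come from a file:

```php
<?php
// Hypothetical text built for illustration only.
$text = str_repeat('The dog barked. ', 1000) . 'The cat ran away.';

// Plain substring search: no pattern has to be compiled.
// strpos() returns an offset or false, so compare with !== false.
$foundPlain = strpos($text, 'cat') !== false;

// The same check with a regular expression works, but is overkill here.
$foundRegex = preg_match('/cat/', $text) === 1;

var_dump($foundPlain); // bool(true)
var_dump($foundRegex); // bool(true)
```

The `!== false` comparison matters: a match at the very start of the string gives offset 0, which a loose check would mistake for "not found".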
Also, many newcomers don't see the magic of arrays and take the clumsy
route of writing 30 preg_match() calls one after the other instead of just
putting the patterns in an array and working through that instead. Arrays are faster,
more convenient
and, most of all, they won't make your code look messy. Incidentally, if you are
still rusty with arrays, you will do well to check out
Scrowler's tutorial on arrays, also on Biorust...
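One way the array approach could look, as a sketch: preg_replace() accepts an array of patterns and an array of replacements, so several rules can run in a single call. The tags and the HTML they map to are assumptions for this demo:

```php
<?php
// Hypothetical tag patterns; "#" is the delimiter so "/" needs no escaping.
$patterns = [
    '#\[b\](.*?)\[/b\]#',
    '#\[i\](.*?)\[/i\]#',
    '#\[u\](.*?)\[/u\]#',
];
// Each replacement is paired with the pattern at the same index.
$replacements = [
    '<strong>$1</strong>',
    '<em>$1</em>',
    '<u>$1</u>',
];

$post = '[b]Bold[/b], [i]italic[/i] and [u]underlined[/u].';

// One call applies every pattern in order.
$html = preg_replace($patterns, $replacements, $post);

echo $html, "\n";
// <strong>Bold</strong>, <em>italic</em> and <u>underlined</u>.
```

Adding a new tag then means adding one line to each array rather than another function call.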
Well, this is the end of the tutorial, so if you have any questions (or just want
to flame me for writing some inane babble) then head over to the
Biorust forums and leave your opinions there. I promise someone will get back to you.
- Tutorial written by Blodo
|