Start at the beginning.
This page is a little long, but I felt it was better to keep all this material together rather than split it up. Much of the content is examples, so you should be able to get through it fairly quickly.
Even though we are only looking at the basic set of regular expression characters here, you will find that you can still use them to create quite useful search patterns.
A regular expression is a description of a pattern of characters. The most basic pattern we can describe is an exact string (or sequence) of characters. For example, I may want to search for the characters th (or, in more specific terms, the character t followed directly by the character h).
You may be wondering why one th in there was not picked up as a match. The reason is that it contains a capital T rather than the lowercase t that the regular expression was searching for. We know that they are the same letter, just in a different form, but regular expressions do not. They do not interpret any meaning from the search pattern. All they do is look for exact matches to what the pattern describes.
It is possible to make a regular expression look for matches in a case insensitive way but you'll learn about that later on.
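As a minimal sketch of the exact-match behaviour described above, here is the th search run with Python's re module. The sample sentence is made up for illustration and is not the example from the original page.

```python
import re

# Hypothetical sample text, chosen so that it contains both "th" and "Th".
text = "The weather is nice this Thursday."

# The pattern th matches a lowercase t followed directly by a lowercase h.
matches = re.findall(r"th", text)
positions = [m.start() for m in re.finditer(r"th", text)]

print(matches)    # ['th', 'th']  (from "weather" and "this")
print(positions)  # [7, 20]
```

The Th in "The" and "Thursday" is skipped because the capital T is a different character from the lowercase t in the pattern.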
A very basic expression like this is really no different to a search you may do in a search engine or in your favourite word processor. It's not really that exciting. From here on, however, things do start to get more interesting.
Regular expressions can sometimes be a bit hard to get your head around at first, so if the material below seems a little confusing, don't worry too much. With practice it will start to make more sense. If you find yourself getting stuck, it may be worth reviewing the section on Learning Regular Expressions on the previous page.
The dot ( . ) (or full stop) character is what we refer to as a metacharacter. Metacharacters are characters which have a special meaning. They help us to create more interesting patterns than just a string of specific characters. Pretty much everything we look at from here on will be metacharacters.
The dot ( . ) represents any single character (in most implementations, any character except a line break). So with the regular expression below, what we are looking for is the character b, followed by any character, followed by the character g.
It is important to note that the . matches only a single character. We may get it to match more than a single character using multipliers, which we'll look at further below. Alternatively, you could use multiple dots like so:
In the above example we are matching an l, followed by two characters, followed by an e.
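The two dot patterns described above can be tried out with a short Python sketch. The patterns b.g and l..e are reconstructed from the descriptions in the text, and the sample sentence is an assumption made up for illustration.

```python
import re

# Hypothetical sample text.
text = "The big bag of bugs was left beside the lake on the lane."

# b, then any one character, then g.
single_dot = re.findall(r"b.g", text)
print(single_dot)  # ['big', 'bag', 'bug']

# l, then any two characters, then e.
double_dot = re.findall(r"l..e", text)
print(double_dot)  # ['lake', 'lane']

# The dot stands for exactly one character, so neither "bg" nor "boog" matches.
no_match = re.findall(r"b.g", "bg boog")
print(no_match)  # []
```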
The . allows us to match any character. Sometimes we would like to be a bit more specific than that. This is where ranges come in useful. We specify a range of characters by enclosing them within square brackets ( [ ] ).
In the regular expression above we are looking for the character t followed by either the character e or o, followed by the character d.
There is no limit to how many characters you may place inside the square brackets. You could place a single character, e.g. [y] (which would be a bit silly, but nevertheless it is legal), or you could have many, e.g. [grf4s2#lknx].
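Here is a small Python sketch of the square-bracket behaviour just described. The pattern t[eo]d is reconstructed from the description in the text, and the sample strings are hypothetical.

```python
import re

# t, then either e or o, then d.
set_matches = re.findall(r"t[eo]d", "ted tod tad t3d")
print(set_matches)  # ['ted', 'tod']

# A single character in brackets is legal; it behaves like the bare character.
single_matches = re.findall(r"[y]es", "yes Yes")
print(single_matches)  # ['yes']
```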
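Let's say we wanted to find the presence of a digit between 1 and 8. We could use a range like so [12345678], but there is a shortcut we may use to make things a bit easier: a dash between two characters stands for every character from the first to the second, so [1-8] means the same thing.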
You can combine a set of characters along with other characters.
In the above regular expression we are searching for the digits 1, 2, 3, 4 or 9.
We can also combine multiple sets. In the regular expression below we are looking for 1, 2, 3, 4, 5, a, b, c, d, e, f or x.
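The range shortcuts described above can be checked with a short Python sketch. The three patterns are reconstructed from the descriptions in the text (a digit from 1 to 8, the digits 1 to 4 or 9, and the combined set), so treat them as assumptions rather than the page's original examples.

```python
import re

# A dash between two characters covers everything from the first to the second.
one_to_eight = re.findall(r"[1-8]", "0 3 8 9")
print(one_to_eight)  # ['3', '8']

# A range combined with one extra character: 1, 2, 3, 4 or 9.
range_plus_char = re.findall(r"[1-49]", "0123456789")
print(range_plus_char)  # ['1', '2', '3', '4', '9']

# Two ranges and one extra character: 1-5, a-f or x.
combined = re.findall(r"[1-5a-fx]", "0 5 6 a g x z")
print(combined)  # ['5', 'a', 'x']
```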
Using ranges of characters can sometimes lead to odd behaviour. For example, you may use the range [a-f] and find that it matches D. This has to do with the character ordering the system is using. Most systems use an ordering (such as ASCII) where all the uppercase letters come first, then the lowercase letters, e.g. ABCD...XYZabcd...xyz. A few systems, however, alternate the lowercase and uppercase letters, e.g. aAbBcCdD...yYzZ. If you encounter some strange behaviour and you're using ranges, this is the first place to check.
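As a rough illustration of how character ordering affects ranges, the sketch below uses Python, whose re module compares characters by their code point rather than by locale. Other tools may behave differently, so this only shows one possible ordering.

```python
import re

# In Python's ordering, uppercase letters come before lowercase letters.
uppercase_first = ord("A") < ord("a")
print(uppercase_first)  # True

# With this ordering, [a-f] covers lowercase letters only.
lower_range = re.findall(r"[a-f]", "aDf")
print(lower_range)  # ['a', 'f']

# A range that starts in uppercase and ends in lowercase spans both.
mixed_range = re.findall(r"[A-f]", "aDf")
print(mixed_range)  # ['a', 'D', 'f']
```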
Sometimes we may want to find the presence of a character which is not one of a given set of characters. We can do this by placing a caret ( ^ ) at the beginning of the range.
The following regular expression searches for the character t, followed by a character which is neither e nor o, followed by the character d.
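A minimal Python sketch of the negated set described above. The pattern t[^eo]d is reconstructed from the description, and the sample string is hypothetical.

```python
import re

# t, then any one character that is not e or o, then d.
negated = re.findall(r"t[^eo]d", "ted tod tad t3d td")
print(negated)  # ['tad', 't3d']
```

Note that "td" does not match: the negated set still has to match one character, it just cannot be e or o.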
Most characters which would normally have a special meaning (metacharacters) lose that meaning and are treated literally when inside a range. The main exception is the caret ( ^ ), which gains a new meaning, not, when it is the first character inside the brackets. (The dash used to build ranges such as [a-f] is also special inside brackets.)
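The following Python sketch shows metacharacters being treated literally inside square brackets. The sample strings are made up for illustration.

```python
import re

# Inside brackets, the dot, star and plus are just ordinary characters.
literal_meta = re.findall(r"[.*+]", "1+1=2. 3*3=9")
print(literal_meta)  # ['+', '.', '*']

# A caret that is not the first character in the brackets is also literal.
literal_caret = re.findall(r"[e^o]", "e^o")
print(literal_caret)  # ['e', '^', 'o']
```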
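Multipliers allow us to control the number of times an item may occur in our regular expression. Here is the basic set of multipliers:
* - the item occurs zero or more times.
+ - the item occurs one or more times.
? - the item occurs zero or one times.
{5} - the item occurs exactly five times.
{3,7} - the item occurs between 3 and 7 times.
{2,} - the item occurs at least 2 times.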
Their effect will be applied to whatever is directly in front of them. It could be a normal character, eg:
In the above example we are looking for the character 'l' followed by the character 'o' zero or more times. That is why the 'l' in silk is also matched (it is an 'l' followed by zero 'o's).
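Here is a short Python sketch of a multiplier applied to a normal character. The pattern lo* is reconstructed from the description above; the sample string and the other three patterns are assumptions added for comparison.

```python
import re

# Hypothetical sample text.
text = "silk lot loop"

# l followed by zero or more o's: the l in silk matches on its own.
star = re.findall(r"lo*", text)
print(star)  # ['l', 'lo', 'loo']

# l followed by one or more o's: silk no longer matches.
plus = re.findall(r"lo+", text)
print(plus)  # ['lo', 'loo']

# l followed by zero or one o: at most one o is consumed.
optional = re.findall(r"lo?", text)
print(optional)  # ['l', 'lo', 'lo']

# l followed by exactly two o's.
exactly_two = re.findall(r"lo{2}", text)
print(exactly_two)  # ['loo']
```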
Or it could be a metacharacter, eg:
Now this one may seem a bit odd to you at first. The '.*' matches zero or more of any character. It is normal to think that it will come across the next 'k' and then say 'yep, I've found a match', but what it actually does is say 'k is also any character, so let's see how far we can take this', and it keeps going until it finds the final 'k' in the string. This is what's referred to as greedy matching. Its normal behaviour is to try to find the largest string it can which matches the pattern. We may reverse this behaviour and make it non-greedy (or lazy) by placing a question mark ( ? ) after the multiplier (which can seem a little confusing, as the question mark is a multiplier itself, but you'll get the hang of it).
Sometimes we may actually want to search for one of the characters which is a metacharacter. To do this we use a feature called escaping. By placing the backslash ( \ ) in front of a metacharacter we can remove its special meaning. (In some implementations of regular expressions we may also use escaping to introduce a special meaning to characters which normally don't have one, but more on that in the intermediate section of this tutorial.) Let's say we wanted to find instances of the word 'this' which are the last word of a sentence. If we did the following:
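The difference between greedy and lazy matching can be seen in this Python sketch. The pattern k.*k follows the description above; the sample sentence is hypothetical.

```python
import re

# Hypothetical sample text containing several k's.
text = "a kite and a kayak in the park"

# Greedy: .* grabs as much as it can, so the match runs to the final k.
greedy = re.search(r"k.*k", text).group()
print(greedy)  # 'kite and a kayak in the park'

# Lazy: .*? grabs as little as it can, so the match stops at the next k.
lazy = re.search(r"k.*?k", text).group()
print(lazy)  # 'kite and a k'
```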
It would match the 'this.' at the end of the sentence but it also matches 'this ' in the middle of the sentence because the full stop in the regular expression normally matches any character. If we want to make sure it is limited to only 'this.' we may escape the full stop like so:
It is easy to forget to escape metacharacters when they are part of your search string. If you're getting weird behaviour in your regular expressions keep an eye out for any metacharacters you may have forgotten to escape.
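The escaping example above can be sketched in Python as follows. The patterns this. and this\. come from the description in the text; the sample sentence is made up. The last step uses re.escape, a standard library helper that escapes metacharacters in a literal string for you.

```python
import re

# Hypothetical sample text.
text = "Is this it? Yes, it is this."

# Unescaped dot: matches "this" followed by any character, including a space.
unescaped = re.findall(r"this.", text)
print(unescaped)  # ['this ', 'this.']

# Escaped dot: matches only "this" followed by a literal full stop.
escaped = re.findall(r"this\.", text)
print(escaped)  # ['this.']

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.findall(re.escape("this."), text)
print(auto_escaped)  # ['this.']
```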
Now that you have a reasonable idea of what regular expressions are, the next step in improving your regular expression skills is a good understanding of the underlying mechanism that is used to apply regular expressions over text. When you understand the mechanism, it is easier to troubleshoot when things start going wrong.
The way it works is that we have a pointer which is moved progressively through the search string. Once it comes across a character which matches the beginning of the regular expression it stops. Now a second pointer is started which moves forward from the first pointer, character by character, checking with each step if the pattern still holds or if it fails. If we get to the end of the pattern and it still holds then we have found a match. If it fails at any point then the second pointer is discarded and the main pointer continues through the string.
Let's say that we are looking for the letter p, followed by any character, followed by the letter t. The example below illustrates how the mechanism works:
The reason that the main pointer continues from its location, as opposed to where the second pointer either failed or completed a match, is illustrated in the example above. It is possible that another match may be found within the set of characters we just checked.
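The two-pointer mechanism described above can be sketched in a few lines of Python. This is a simplified illustration, not how any real regular expression engine is implemented: the hypothetical function find_matches only understands literal characters and the dot, and the sample text is made up.

```python
import re


def find_matches(pattern, text):
    """Return (start, matched_text) pairs for a pattern of literals and dots."""
    matches = []
    for main in range(len(text)):      # main pointer moves through the text
        second = main                  # second pointer starts at the main pointer
        for expected in pattern:
            if second >= len(text):
                break                  # ran out of text: the pattern fails here
            if expected != "." and text[second] != expected:
                break                  # character mismatch: the pattern fails here
            second += 1
        else:
            # The whole pattern held, so record the match.
            matches.append((main, text[main:second]))
        # Either way, the main pointer continues from its own location.
    return matches


text = "a pot pptt"
found = find_matches("p.t", text)
print(found)  # [(2, 'pot'), (6, 'ppt'), (7, 'ptt')]

# For comparison: Python's own engine resumes after the end of each match,
# so it reports only non-overlapping matches.
builtin = re.findall(r"p.t", text)
print(builtin)  # ['pot', 'ppt']
```

Because the main pointer in the sketch always advances by one character, it finds 'ptt' inside the characters that were already checked for 'ppt', which is the situation the paragraph above describes.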