Now there is looking back.
Now that you've got a feel for regular expressions, we'll add a bit more complexity. In demonstrating the features on this page we will also be using features introduced in the Basic and Intermediate sections of this tutorial. If some of this stuff seems a bit confusing it may be worth reviewing those sections first. Once you complete this section (and understand it) you won't be a complete Regular Expressions guru but you will be well on your way and you should be armed with enough Regular Expressions ammo to tackle the majority of problems you encounter.
We may group several characters together in our regular expression using brackets '( )' (also referred to as parentheses). There are then various things which can be done with that group. Some of these we'll look at further down this page. They also allow us to add a multiplier to that group of characters (as a whole).
So, for instance, we may want to find out if a particular person is mentioned. Their name is John Reginald Smith but the middle name may or may not be present.
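A minimal sketch of such a pattern, written here with Python's built-in `re` module (the choice of Python and the variable name `name` are assumptions made for illustration; the same pattern should behave the same way in most regular expression tools):

```python
import re

# "(Reginald )?" groups the middle name together with its trailing space,
# and the "?" multiplier makes that whole group optional.
name = re.compile(r"John (Reginald )?Smith")

for text in ["John Reginald Smith", "John Smith", "John R Smith"]:
    print(text, "->", bool(name.search(text)))
```

The first two strings match and the third does not, because 'R ' is not the optional group 'Reginald '.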
Notice where the spaces are and aren't in the regular expression above. It's important to remember that spaces are part of your regular expression, so you need to make sure they appear exactly where you intend and nowhere else.
The above tip is very important and a common source of problems when people first start playing with regular expressions. Below is a common mistake that people make.
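One plausible version of this mistake (an assumption on our part, sketched with Python's built-in `re` module) is to leave the space outside the optional group. The pattern then demands a space on both sides of the group, even when the middle name is absent:

```python
import re

# The space after "Reginald" sits OUTSIDE the optional group, so when the
# middle name is missing the pattern still expects two spaces in a row.
mistaken = re.compile(r"John (Reginald)? Smith")

print(bool(mistaken.search("John Reginald Smith")))  # True
print(bool(mistaken.search("John Smith")))           # False: only one space
print(bool(mistaken.search("John  Smith")))          # True: two spaces
```

Moving the space inside the brackets, as in `John (Reginald )?Smith`, avoids the problem.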
We aren't limited to just normal characters in the brackets. You may include special characters in there (including multipliers) as well.
For instance, maybe we would like to find instances of IP addresses. An IP address is a set of 4 numbers (each between 0 and 255) separated by full stops (e.g. 192.168.0.5).
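One way this could be written is sketched below using Python's built-in `re` module (the variable name `ip` and the sample text are made up for illustration). Note that this simple pattern only checks the shape of the address; it does not check that each number is 255 or less.

```python
import re

# \b          a word boundary, so we don't start in the middle of a longer number
# (\d{1,3}\.) a group: 1 to 3 digits followed by a literal full stop
# {3}         the group as a whole is repeated exactly 3 times
# \d{1,3}     the final 1 to 3 digits (no trailing full stop)
# \b          a word boundary to end the address
ip = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

match = ip.search("The server at 192.168.0.5 responded")
print(match.group(0))  # 192.168.0.5

# The range 0 to 255 is not enforced, so this also matches:
print(bool(ip.search("999.999.999.999")))  # True
```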
Let's break it down as this is starting to get a little complex:
The above expression uses elements that have been covered in the previous sections of this tutorial. Be sure to review these sections if need be.
As you can see, regular expressions can soon get hard to read once you get various brackets and backslashes in there. This makes it easy to make silly mistakes by missing or misplacing one of these characters and the mistakes can be hard to spot. Remember the strategies for handling this.
Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The first set of brackets is referred to with \1, the second set of brackets with \2 and so on.
Let's say we want to find lines with two mentions of a person whose last name is Smith. We don't know what their first name may be, however. We could do the following:
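A sketch of one such pattern, using Python's built-in `re` module (the variable name `twice` and the sample lines are assumptions for illustration):

```python
import re

# ([A-Z]\w+)  capture a capitalised word (the first name) as group 1
#  Smith      followed by a space and the surname
# .*          anything in between
# \1 Smith    the SAME first name that group 1 captured, then " Smith" again
twice = re.compile(r"\b([A-Z]\w+) Smith\b.*\b\1 Smith\b")

line = "John Smith met Harold, then John Smith left"
match = twice.search(line)
print(match.group(1))  # John

# Two different Smiths on one line do not match:
print(twice.search("John Smith met Harold Smith"))  # None
```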
In the above example you'll notice that we matched the text between the two instances of John Smith as well, but in this case that is ok as we are not too concerned with what was matched, only that there was a match.
With alternation we are looking for something or something else. We have seen a very basic example of alternation with the range operator. This allows us to perform alternation with a single character, but sometimes we would like to perform the operation with a larger set of characters. We can achieve this with the pipe symbol ( | ) which means or.
So for instance, if we wanted to find all instances of either 'dog' or 'cat' we could do the following:
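A minimal sketch with Python's built-in `re` module (the sample sentence and the name `pets` are made up for illustration):

```python
import re

# The pipe means "or": match 'dog' or match 'cat'.
pets = re.findall(r"dog|cat", "The dog chased the cat up a tree")
print(pets)  # ['dog', 'cat']
```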
We can also use more than one | to include more options.
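For example, a third option could be added like so (a sketch using Python's built-in `re` module; 'bird' and the name `animals` are just illustrative choices):

```python
import re

# Each | adds another alternative to choose from.
animals = re.findall(r"dog|cat|bird", "A bird, a cat and a dog")
print(animals)  # ['bird', 'cat', 'dog']
```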
Maybe we only want alternation to happen on a part of the regular expression instead of the whole regular expression. To achieve this we use brackets.
Maybe we want to match Harold Smith or John Smith but not any other Smith.
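A sketch of how that might look with Python's built-in `re` module (the variable names are assumptions for illustration). It also shows what happens if the brackets are left out:

```python
import re

# The brackets limit the alternation to the first name only.
grouped = re.compile(r"(John|Harold) Smith")
print(bool(grouped.search("Harold Smith")))  # True
print(bool(grouped.search("John Smith")))    # True
print(bool(grouped.search("Jane Smith")))    # False

# Without brackets the | splits the WHOLE pattern, giving
# "John" or "Harold Smith", so a lone "John" matches too.
ungrouped = re.compile(r"John|Harold Smith")
print(bool(ungrouped.search("John Doe")))    # True
```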
Lookaheads and lookbehinds are the final thing we are going to introduce in this tutorial and they can be one of the trickiest things you will encounter in regular expressions. Both of them operate in one of two modes: positive, where the given pattern must match, or negative, where the given pattern must not match.
The main idea of both the lookahead and lookbehind is to see if something matches (or doesn't) and then to throw away what was actually matched.
With a lookahead we want to look ahead (hence the name) in our string and see if it matches the given pattern, but then disregard it and move on. The concept is best illustrated with an example.
Let's say we wish to identify numbers greater than 4000 but less than 5000. This is a problem which seems simple but is in fact a little trickier than you suspect. A common first attempt is to try the following:
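A likely first attempt is a '4' followed by any three digits. The sketch below (Python's built-in `re` module, with `first_try` as an illustrative name) shows the catch: it also accepts 4000 itself, which is not greater than 4000.

```python
import re

# A '4' followed by any three digits, as a whole word.
first_try = re.compile(r"\b4\d\d\d\b")

print(bool(first_try.search("4321")))  # True, as intended
print(bool(first_try.search("5000")))  # False, as intended
print(bool(first_try.search("4000")))  # True, but 4000 should be excluded
```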
Then you realise that the way we can tackle this is to say we are looking for a '4' followed by 3 digits, where at least one of those digits is not a '0'. For us as humans that seems like a simple thing to look for but with what we have learnt so far in regular expressions, it is not so easy. We could try something like:
That is, use alternation to check three different scenarios, each with a different one of the three digits not being '0'.
I reckon you're probably looking at the above and thinking that's a lot of regular expression to match just 4 characters. Worse still, think about how that would increase if instead of between 4000 and 5000 we wanted between 40000 and 50000. It soon becomes clear that the above regular expression works but it is not elegant and it doesn't scale.
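One way to spell out those three scenarios is sketched below with Python's built-in `re` module (the name `by_alternation` is an assumption for illustration):

```python
import re

# After the '4', one of three cases must hold:
#   [1-9]\d\d   the first digit is not 0
#   \d[1-9]\d   the second digit is not 0
#   \d\d[1-9]   the third digit is not 0
by_alternation = re.compile(r"\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b")

for number in ["4000", "4001", "4010", "4100", "4999", "5000"]:
    print(number, "->", bool(by_alternation.search(number)))
```

Only 4000 and 5000 fail to match here.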
It turns out that a negative lookahead can solve problems like this quite well. A negative lookahead is set up as follows:
(?!x)
Our negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace x with what it is you don't want to match.
Now we can set up our regular expression as follows:
That might seem a little confusing so let's break it down:
In plain English we could say: "We are looking for a '4' which is not followed by 3 '0's but is followed by 3 digits".
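That sentence might translate into a pattern along these lines (a sketch using Python's built-in `re` module; the variable names are illustrative):

```python
import re

# \b       start at a word boundary
# 4        a literal '4'
# (?!000)  look ahead: the next three characters must NOT be '000'
#          (nothing is consumed, we stay just after the '4')
# \d\d\d   now actually match three digits
# \b       end at a word boundary
between = re.compile(r"\b4(?!000)\d\d\d\b")

for number in ["4000", "4001", "4999", "5000"]:
    print(number, "->", bool(between.search(number)))

# Scaling up to 40000..50000 only needs one more '0' and one more digit.
bigger = re.compile(r"\b4(?!0000)\d{4}\b")
print(bool(bigger.search("40001")))  # True
print(bool(bigger.search("40000")))  # False
```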
A positive lookahead works in the same way but the characters inside the lookahead have to match rather than not match. The syntax for a positive lookahead is as follows:
(?=x)
All we need to do is replace the '!' with an '='.
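As a small illustration (our own example, sketched with Python's built-in `re` module), here is a positive lookahead that finds 'John' only when it is followed by ' Smith'. Notice that the text checked by the lookahead is not part of what gets matched:

```python
import re

# (?= Smith) checks that " Smith" comes next, then throws it away.
john = re.compile(r"John(?= Smith)")

match = john.search("John Smith")
print(match.group(0))             # John
print(john.search("John Jones"))  # None
```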
Lookbehinds work similarly to lookaheads but instead of looking forwards then throwing it away, we look backwards and then throw it away. Similar to lookaheads, they are available in both positive and negative. They follow a similar syntax but include a '<' after the '?' (Think of it as an arrow pointing backwards).
(?<=x) and (?<!x)
These are the syntax for a positive lookbehind and a negative lookbehind respectively.
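A short sketch of both forms using Python's built-in `re` module (the names and sample strings are made up for illustration). The text inside each lookbehind is checked but is not included in the match:

```python
import re

# Positive lookbehind: 'Smith' only if "John " comes directly before it.
after_john = re.compile(r"(?<=John )Smith")
# Negative lookbehind: 'Smith' only if "John " does NOT come directly before it.
not_after_john = re.compile(r"(?<!John )Smith")

print(after_john.search("John Smith").group(0))       # Smith
print(after_john.search("Harold Smith"))              # None
print(not_after_john.search("Harold Smith").group(0)) # Smith
print(not_after_john.search("John Smith"))            # None
```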
Let's say we would like to find instances of the name 'Smith' but only if it is a surname. To achieve this we look at the word before it, and if that word begins with a capital letter we'll assume 'Smith' is a surname (the more astute of you will have already seen the flaw in this, i.e. what if Smith is the second word in a sentence, but we'll ignore that for now).
Lookaheads and lookbehinds can be a bit tough to get your head around at first. I would suggest you experiment with a few different searches yourself to get the hang of it.
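In tools that allow it, this could be written as a lookbehind such as `(?<=[A-Z]\w* )Smith`. Python's built-in `re` module only accepts lookbehinds of a fixed width, so the sketch below swaps in a different technique: it matches the preceding word normally and uses a capturing group to pick out just 'Smith'. It is an approximation of the idea, not a lookbehind.

```python
import re

# \b[A-Z]\w*  a word starting with a capital letter
# (space)     the space between the two words
# (Smith)     group 1: the part we are actually interested in
surname = re.compile(r"\b[A-Z]\w* (Smith)\b")

text = "We met Harold Smith today"
match = surname.search(text)
print(match.group(1))                # Smith
print(text[match.start(1):])         # Smith today

# No capitalised word directly before 'Smith', so no match:
print(surname.search("a Smith worked here"))  # None
```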
Applications and programming languages differ in how they implement lookaheads and lookbehinds. Some will allow you to use other regular expression features within a lookahead and lookbehind, some will not. Some will allow some features but not all of them. If you are getting unexpected behaviour you may need to find out which features are and aren't implemented for your particular application or programming language.
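As one concrete case, Python's built-in `re` module (at the time of writing) accepts a lookbehind only if it has a fixed width. The sketch below uses a small helper, `compiles`, which is our own hypothetical function and not part of any library, to check whether a pattern is accepted:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed width (always exactly 5 characters): accepted.
print(compiles(r"(?<=John )Smith"))       # True
# Variable width (\w* can be any length): rejected by Python's re module.
print(compiles(r"(?<=[A-Z]\w* )Smith"))   # False
```

Other tools may accept the second pattern, which is why it pays to test against the one you are actually using.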
You've now learnt enough about regular expressions to get you through the majority of problems you will probably face. You've really only been introduced to the building blocks though. Learning how to put the building blocks together into effective patterns is something which will take time and practice. Don't worry if some of this stuff is still a little confusing at this point in time. With practice it will all become clearer and you will become very powerful in terms of the things you can achieve.