Regular Expressions Advanced!

Now there is looking back.

Introduction

Now that you've got a feel for regular expressions, we'll add a bit more complexity. In demonstrating the features on this page we will also be using features introduced in the Basic and Intermediate sections of this tutorial. If some of this stuff seems a bit confusing it may be worth reviewing those sections first. Once you complete this section (and understand it) you won't be a complete Regular Expressions guru but you will be well on your way and you should be armed with enough Regular Expressions ammo to tackle the majority of problems you encounter.

Grouping

We may group several characters together in our regular expression using brackets '( )' (also referred to as parentheses). There are then various things which can be done with that group, some of which we'll look at further down this page. Grouping also allows us to apply a multiplier to that group of characters as a whole.

So, for instance, we may want to find out if a particular person is mentioned. Their name is John Reginald Smith but the middle name may or may not be present.

John (Reginald )?Smith
John Reginald Smith is sometimes just called John Smith.

Notice where the spaces are and aren't in the regular expression above. Spaces are part of your regular expression, so you need to make sure each one is exactly where it should be and nowhere else.

The above tip is very important and a common source of problems when people first start playing with regular expressions. Below is a common mistake that people make.

John (Reginald)? Smith
The problem with this regular expression is that it will match John Reginald Smith perfectly fine and John  Smith (two spaces between John and Smith) but not John Smith. Can you see why?
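
If you would like to check the two patterns yourself, here is a minimal sketch in Python using its built-in re module (the choice of Python and the variable names are ours, not part of the tutorial). It runs both patterns against the three spellings discussed above.

```python
import re

# The space sits inside the optional group, so it disappears with the middle name.
correct = re.compile(r"John (Reginald )?Smith")
# The space sits outside the group, so it is always required.
mistaken = re.compile(r"John (Reginald)? Smith")

samples = ["John Reginald Smith", "John Smith", "John  Smith"]

# For each sample, record whether each pattern matches the whole string.
results = {
    text: (bool(correct.fullmatch(text)), bool(mistaken.fullmatch(text)))
    for text in samples
}

for text, (ok_correct, ok_mistaken) in results.items():
    print(repr(text), "correct:", ok_correct, "mistaken:", ok_mistaken)
```

The first pattern accepts the full name and the single-space short form. The second accepts the full name and the two-space form only, because its space before Smith is required whether or not Reginald was found.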

We aren't limited to just normal characters in the brackets. You may include special characters in there (including multipliers) as well.

For instance, maybe we would like to find instances of IP addresses. An IP address is a set of 4 numbers (each between 0 and 255) separated by full stops (e.g. 192.168.0.5).

\b(\d{1,3}\.){3}\d{1,3}\b
The server has an address of 10.18.0.20 and the printer has an address of 10.18.0.116.

Let's break it down as this is starting to get a little complex:

  • \b indicates a word boundary so we can be sure the IP address is not part of something else.
  • We treat the IP address as 3 chunks, each consisting of a number between 0 and 255 followed by a full stop, and then a final number between 0 and 255.
  • In the brackets we handle the first 3 chunks: \d{1,3} indicates we are looking for between 1 and 3 digits, and we remember to escape the full stop to remove its special meaning. We are looking for exactly 3 of these so we place the multiplier {3} just outside the brackets. Note that \d{1,3} only checks the number of digits, so a value above 255 (such as 999) will also match.
  • Finally we include the fourth number with \d{1,3} and end with another word boundary.
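
The breakdown above can be tried out with a short sketch, again assuming Python's built-in re module (the variable names are ours). It pulls the addresses out of the sample sentence.

```python
import re

ip_pattern = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

text = ("The server has an address of 10.18.0.20 and the printer "
        "has an address of 10.18.0.116.")

# finditer with group(0) gives the whole match. In Python, findall would
# instead return only the text captured by the bracketed group.
addresses = [m.group(0) for m in ip_pattern.finditer(text)]
print(addresses)

# The pattern counts digits only, so out-of-range numbers still match.
accepts_out_of_range = ip_pattern.fullmatch("999.999.999.999") is not None
print(accepts_out_of_range)
```

The second check is a reminder that this expression finds things shaped like an IP address. It does not confirm that each number is between 0 and 255.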

The above expression uses elements that have been covered in the previous sections of this tutorial. Be sure to review these sections if need be.

As you can see, regular expressions can soon get hard to read once you get various brackets and backslashes in there. This makes it easy to make silly mistakes by missing or misplacing one of these characters and the mistakes can be hard to spot. Remember the strategies for handling this.

Back references

Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The first set of brackets is referred to with \1, the second set of brackets with \2 and so on.

Let's say we want to find lines with two mentions of a person whose last name is Smith. We don't know what their first name may be, however. We could do the following:

(\b[A-Z]\w+\b) Smith.*\1 Smith
Harold Smith went to meet John Smith but John Smith was not there.

In the above example you'll notice that we also matched the text between the two instances of John Smith, but in this case that is ok as we are not too concerned with what was matched, only that there was a match.
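
To see what the back reference actually captured, here is a small sketch assuming Python's built-in re module (other tools may write back references differently). It runs the expression against the sample line and prints both the whole match and the remembered first name.

```python
import re

pattern = re.compile(r"(\b[A-Z]\w+\b) Smith.*\1 Smith")

text = "Harold Smith went to meet John Smith but John Smith was not there."

match = pattern.search(text)

# group(0) is everything that was matched, including the text in between.
whole_match = match.group(0)
# group(1) is what the first set of brackets remembered, which \1 refers to.
first_name = match.group(1)

print(whole_match)
print(first_name)
```

Harold Smith is tried first but there is no second Harold Smith later in the line, so the match that succeeds is the one starting at the first John Smith.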

Alternation

With alternation we are looking for something or something else. We have seen a very basic example of alternation with the range operator. This allows us to perform alternation with a single character, but sometimes we would like to perform the operation with a larger set of characters. We can achieve this with the pipe symbol ( | ), which means 'or'.

So, for instance, if we wanted to find all instances of either 'dog' or 'cat' we could do the following:

dog|cat
Harold Smith has two dogs and one cat.

We can also use more than one | to include more options.

dog|cat|bird
Harold Smith has two dogs, one cat and three birds.

Maybe we only want alternation to happen on a part of the regular expression instead of the whole regular expression. To achieve this we use brackets.

Maybe we want to match Harold Smith or John Smith but not any other Smith.

(John|Harold) Smith
Harold Smith went to meet John Smith but instead bumped into Jane Smith.
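
The sketch below, assuming Python's built-in re module, runs the alternation examples above. The last pattern, without brackets, is our own addition to show why the brackets matter.

```python
import re

pets = "Harold Smith has two dogs, one cat and three birds."
pet_matches = re.findall(r"dog|cat|bird", pets)
print(pet_matches)

people = "Harold Smith went to meet John Smith but instead bumped into Jane Smith."

# With brackets the alternation covers only the first name.
grouped = [m.group(0) for m in re.finditer(r"(John|Harold) Smith", people)]
print(grouped)

# Without brackets the choice is between 'John' and 'Harold Smith'.
ungrouped = [m.group(0) for m in re.finditer(r"John|Harold Smith", people)]
print(ungrouped)
```

Without the brackets the pipe splits the whole expression in two, so the second option is 'Harold Smith' but the first is just 'John'.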

Lookahead and Lookbehind

Lookaheads and Lookbehinds are the final thing we are going to introduce in this tutorial and they can be one of the trickiest things you will encounter in regular expressions. Both of them operate in one of two modes:

  • Positive - in which we are seeking to find something which matches.
  • Negative - in which we are seeking to find something which doesn't match.

The main idea of both the lookahead and lookbehind is to see if something matches (or doesn't) and then to throw away what was actually matched.

Lookaheads

With a lookahead we want to look ahead (hence the name) in our string and see if it matches the given pattern, but then disregard it and move on. The concept is best illustrated with an example.

Let's say we wish to identify numbers greater than 4000 but less than 5000. This is a problem which seems simple but is in fact a little trickier than you might suspect. A common first attempt is to try the following:

\b4\d\d\d\b
This looks promising with 4021 but unfortunately also matches 4000.

Then you realise that the way we can tackle this is to say we are looking for a '4' followed by 3 digits, where at least one of those digits is not a '0'. For us as humans that seems like a simple thing to look for but with what we have learnt so far in regular expressions, it is not so easy. We could try something like:

\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b
Now we will match 4010 but not 4000.

That is, we use alternation to check three different scenarios, each with a different one of the three digits not being '0'.

I reckon you're probably looking at the above and thinking that's a lot of regular expression to match just 4 characters. Worse still, think about how that would increase if instead of between 4000 and 5000 we wanted between 40000 and 50000. It soon becomes clear that the above regular expression works but it is not elegant and it doesn't scale.

It turns out that a negative lookahead can solve problems like this quite well. A negative lookahead is set up as follows:

(?!x)

Our negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace x with what it is you don't want to match.

Now we can set up our regular expression as follows:

\b4(?!000)\d\d\d\b
Now we still match 4010 but not 4000.

That might seem a little confusing so let's break it down:

  • First we look for the character '4'.
  • When we find a '4' the negative lookahead returns true if the next 3 characters are not '000'.
  • If this returns true we go back to just after the '4' and continue with our regular expression.

In plain English we could say: "We are looking for a '4' which is not followed by 3 '0's but is followed by 3 digits".
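
Here is a sketch comparing the three attempts from this section, assuming Python's built-in re module. The test numbers and the five-digit variant at the end are our own additions for illustration.

```python
import re

naive = re.compile(r"\b4\d\d\d\b")
alternation = re.compile(r"\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b")
lookahead = re.compile(r"\b4(?!000)\d\d\d\b")

numbers = ["3999", "4000", "4001", "4010", "4999", "5000"]

def accepted(pattern):
    """Return the test numbers which the given pattern matches in full."""
    return [n for n in numbers if pattern.fullmatch(n)]

print("naive:      ", accepted(naive))
print("alternation:", accepted(alternation))
print("lookahead:  ", accepted(lookahead))

# Scaling up to 40000-50000 only changes the counts, not the shape.
wider = re.compile(r"\b4(?!0000)\d{4}\b")
print(bool(wider.fullmatch("40000")), bool(wider.fullmatch("40001")))
```

The alternation and lookahead versions accept the same numbers here, but only the lookahead version stays the same size when more digits are added.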

A positive lookahead works in the same way but the characters inside the lookahead have to match rather than not match. The syntax for a positive lookahead is as follows:

(?=x)

All we need to do is replace the '!' with an '='.
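
As a small illustration of a positive lookahead, here is a sketch assuming Python's built-in re module. The sentence and the pattern are made up for this example and are not from earlier in the tutorial. We look for a number only when it is followed by the word 'dollars'.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "The book costs 20 dollars and weighs 3 kilos."

# \d+ must be followed by ' dollars', but ' dollars' is not part of the match.
price = re.compile(r"\d+(?= dollars)")

amounts = price.findall(text)
print(amounts)

# The match ends right after the digits. The looked-ahead text is left over.
first = price.search(text)
remainder = text[first.end():]
print(repr(remainder))
```

The '3' is skipped because it is followed by ' kilos', and the matched text is just the digits because what the lookahead sees is thrown away.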

Lookbehinds

Lookbehinds work similarly to lookaheads but instead of looking forwards then throwing it away, we look backwards and then throw it away. Similar to lookaheads, they are available in both positive and negative forms. They follow a similar syntax but include a '<' after the '?' (think of it as an arrow pointing backwards).

(?<=x) and (?<!x)

This is the syntax for a positive lookbehind and negative lookbehind respectively.

Let's say we would like to find instances of the name 'Smith' but only if they are a surname. To achieve this we will look at the word before it, and if that word begins with a capital letter we'll assume Smith is a surname (the more astute of you will have already seen the flaw in this, i.e. what if Smith is the second word in a sentence, but we'll ignore that for now).

(?<=[A-Z]\w* )Smith
Now we won't identify Smith Francis but we will identify Harold Smith.

Lookaheads and lookbehinds can be a bit tough to get your head around at first. I would suggest you experiment with a few different searches yourself to get the hang of it.

Applications and programming languages differ in how they implement lookaheads and lookbehinds. Some will allow you to use other regular expression features within a lookahead and lookbehind, some will not. Some will allow some features but not all of them. If you are getting unexpected behaviour you may need to find out which features are and aren't implemented for your particular application or programming language.
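
As one concrete case of these differences, here is a sketch assuming Python's built-in re module, which only accepts lookbehinds that match a fixed number of characters. The lookbehind used for Smith above contains \w*, so that module rejects it. The two alternative patterns below are our own illustrations, not part of the tutorial.

```python
import re

text = "Now we won't identify Smith Francis but we will identify Harold Smith."

# Python's re refuses a lookbehind whose length can vary (here because of \w*).
try:
    re.compile(r"(?<=[A-Z]\w* )Smith")
    variable_width_supported = True
except re.error:
    variable_width_supported = False
print(variable_width_supported)

# A fixed-width lookbehind is accepted, but only covers one known first name.
fixed = re.compile(r"(?<=Harold )Smith")
fixed_start = fixed.search(text).start()

# A possible workaround: match the previous word normally and capture Smith.
workaround = re.compile(r"\b[A-Z]\w* (Smith)\b")
surname_starts = [m.start(1) for m in workaround.finditer(text)]
print(fixed_start, surname_starts)
```

Both alternatives land on the Smith in Harold Smith and skip the one in Smith Francis. The workaround matches the first name as well, so we read the surname from the bracketed group and not from the whole match.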

Where to from here?

You've now learnt enough about regular expressions to get you through the majority of problems you will probably face. You've really only been introduced to the building blocks though. Learning how to put the building blocks together into effective patterns is something which will take time and practice. Don't worry if some of this stuff is still a little confusing at this point in time. With practice it will all become clearer and you will become very powerful in terms of the things you can achieve.

Summary

( )
Group part of the regular expression.
\1 \2 etc
Refer to something matched by a previous grouping.
|
Match what is on either the left or right of the pipe symbol.
(?=x)
Positive lookahead.
(?!x)
Negative lookahead.
(?<=x)
Positive lookbehind.
(?<!x)
Negative lookbehind.