Regular Expressions Basics!

Start at the beginning.

Introduction

This page is a little long, but I felt it was better to keep all this material together rather than split it up. Much of the content is examples, so you should be able to get through it fairly quickly.

Even though we are only looking at the basic set of regular expression characters here you will find that you can still use them to create quite useful search patterns.

The most basic example

A regular expression is a description of a pattern of characters. The most basic pattern we can describe is an exact string (or sequence) of characters. For example, I may want to search for the characters th (or, in more specific terms, the character t followed directly by the character h).

th
There is no theory of evolution. Only a list of animals Chuck Norris allows to live.

You may be wondering why the Th in There was not picked up as a match. The reason is that it contains a capital T, whereas the regular expression was searching for a lowercase t. We know that they are the same letter, just in a different form. Regular expressions, however, do not. They do not interpret any meaning from the search pattern. All they do is look for exact matches to specifically what the pattern describes.
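If you would like to try this outside the page, here is a minimal sketch using Python's built-in re module (the choice of Python and the variable names are assumptions for illustration, not part of the original example). It runs the same pattern over the same sentence.

```python
import re

text = "There is no theory of evolution. Only a list of animals Chuck Norris allows to live."

# findall returns every non-overlapping match, scanning left to right.
matches = re.findall("th", text)
print(matches)  # ['th']  (only the th in "theory")

# The capital T is a different character as far as the pattern is concerned.
capital_matches = re.findall("Th", text)
print(capital_matches)  # ['Th']  (the start of "There")
```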

It is possible to make a regular expression look for matches in a case insensitive way but you'll learn about that later on.

A very basic expression like this is really no different to a search you may do in a search engine or in your favourite word processor. It's not really that exciting. From here on, however, things do start to get more interesting.

Regular expressions can sometimes be a bit hard to get your head around at first, so if the material below seems a little confusing don't worry too much. With practice it will start to make more sense. If you find yourself getting stuck, it may be worth revising the material on the previous page, Learning Regular Expressions.

The dot - any character

The dot ( . ) (or full stop) character is what we refer to as a metacharacter. Metacharacters are characters which have a special meaning. They help us to create more interesting patterns than just a string of specific characters. Pretty much everything we look at from here on will be metacharacters.

The dot ( . ) represents any character. So with the regular expression below, what we are looking for is the character b followed by any character, followed by the character g.

b.g
The big bag of bits was bugged.

It is important to note that the . matches only a single character. We may get it to match more than a single character using multipliers which we'll look at further below. Alternatively, you could also use multiple .'s like so:

l..e
You can live like a king but make sure it isn't a lie.

In the above example we are matching an l, followed by two characters, followed by an e.
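The two dot examples can be checked with a short Python sketch (again assuming Python's re module; the variable names are made up for illustration). Notice that lie is not matched by l..e, because the fourth character after the l is the full stop at the end of the sentence, not an e.

```python
import re

big_text = "The big bag of bits was bugged."
live_text = "You can live like a king but make sure it isn't a lie."

# b, then any single character, then g
dot_matches = re.findall("b.g", big_text)
print(dot_matches)  # ['big', 'bag', 'bug']

# l, then any two characters, then e
two_dot_matches = re.findall("l..e", live_text)
print(two_dot_matches)  # ['live', 'like']
```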

Ranges of Characters

The . allows us to match any character. Sometimes we would like to be a bit more specific than that. This is where ranges come in useful. We specify a range of characters by enclosing them within square brackets ( [ ] ).

t[eo]d
When today is over Ted will have a tedious time tidying up.

In the regular expression above we are looking for the character t followed by either the character e or o, followed by the character d.

There is no limit to how many characters you may place inside the square brackets. You could place a single character, e.g. [y] (which would be a bit silly, but nevertheless it is legal), or you could have many, e.g. [grf4s2#lknx].
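As a quick check of the range example, here is a sketch in Python (an assumption; any regular expression tool would do). The second pattern, t[i]d, is a made-up illustration of a range holding a single character, which behaves just like writing tid.

```python
import re

text = "When today is over Ted will have a tedious time tidying up."

# t, then either e or o, then d
set_matches = re.findall("t[eo]d", text)
print(set_matches)  # ['tod', 'ted']  ("Ted" is skipped because of its capital T)

# A range with a single character is legal, if a bit pointless.
single_matches = re.findall("t[i]d", text)
print(single_matches)  # ['tid']
```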

Shortcut for characters in a row

Let's say we wanted to find the presence of a digit between 1 and 8. We could use a range like so [12345678] but there is a shortcut we may use to make things a bit easier.

[1-8]
Room Allocations: G4 G9 F2 H1 L0 K7 M9

You can combine a set of characters along with other characters.

[1-49]
Room Allocations: G4 G9 F2 H1 L0 K7 M9

In the above regular expression we are searching for the digits 1, 2, 3, 4 or 9.
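Both room allocation examples can be run as follows. This is only a sketch, assuming Python's re module, and the variable names are illustrative.

```python
import re

rooms = "Room Allocations: G4 G9 F2 H1 L0 K7 M9"

# Any digit from 1 to 8
one_to_eight = re.findall("[1-8]", rooms)
print(one_to_eight)  # ['4', '2', '1', '7']

# Any digit from 1 to 4, or the digit 9
one_to_four_or_nine = re.findall("[1-49]", rooms)
print(one_to_four_or_nine)  # ['4', '9', '2', '1', '9']
```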

We can also combine multiple sets. In the regular expression below we are looking for 1, 2, 3, 4, 5, a, b, c, d, e, f, x.

[1-5a-fx]
A random set of characters: y, w, a, r, f, 4, 9, 6, 3, p, x, t
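To see that the shortcut really is just shorthand, the sketch below (Python assumed) compares [1-5a-fx] with the same set written out in full. To keep the output short it searches only the list of characters, not the "A random set of characters:" label in front of it, which would contribute extra matches such as the a in random.

```python
import re

listed = "y, w, a, r, f, 4, 9, 6, 3, p, x, t"

shortcut = re.findall("[1-5a-fx]", listed)
spelled_out = re.findall("[12345abcdefx]", listed)

print(shortcut)                 # ['a', 'f', '4', '3', 'x']
print(shortcut == spelled_out)  # True
```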

Using sets of characters can sometimes lead to odd behaviour. For example, you may use the range [a-f] and find that it matches D. This has to do with the character table (or sort order) the system is using. Most systems use a table, such as ASCII, where all the uppercase letters come first, then the lowercase letters, e.g. ABCD...XYZ...abcd...xyz. A few systems, however, alternate the lowercase and uppercase letters, e.g. aAbBcCdD...yYzZ, so that uppercase letters such as D fall between a and f. If you encounter some strange behaviour and you're using ranges, this is the first place to check.
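One way to find out how your own tool behaves is simply to test it. The sketch below assumes Python, whose re module compares characters by their numeric code, so there [a-f] does not pick up any capitals. Other tools and locale settings may give a different answer.

```python
import re

mixed = "aAbBcCdDeEfF"

# In Python the range covers only the codes from 'a' to 'f'.
lower_only = re.findall("[a-f]", mixed)
print(lower_only)  # ['a', 'b', 'c', 'd', 'e', 'f']

# The numeric codes show why: D sits well outside the span from a to f.
print(ord("D"), ord("a"), ord("f"))  # 68 97 102
```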

Negating - Find characters that aren't

Sometimes we may want to find the presence of a character which is not in a range of characters. We can do this by placing a caret ( ^ ) at the beginning of the range.

The following regular expression searches for the character t followed by a character which is not either e or o, followed by the character d.

t[^eo]d
When today is over Ted will have a tedious time tidying up.

Most characters which would normally have a special meaning (metacharacters) lose their special meaning and are treated literally when inside a range. The main exceptions are the caret ( ^ ), which gains the new meaning not when it is the first character in the range, and the hyphen ( - ), which, as we saw above, marks a run of characters.
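Here is a sketch of both points in Python (an assumption, as before). The first search is the negated range from the example. The second uses a made-up string, a.b*c, to show the dot and the star being treated as ordinary characters inside square brackets.

```python
import re

text = "When today is over Ted will have a tedious time tidying up."

# t, then any character that is not e or o, then d
negated = re.findall("t[^eo]d", text)
print(negated)  # ['tid']

# Inside the brackets, . and * are just a full stop and an asterisk.
literal = re.findall("[.*]", "a.b*c")
print(literal)  # ['.', '*']
```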

Multipliers

Multipliers allow us to increase the number of times an item may occur in our regular expression. Here is the basic set of multipliers:

  • * - item occurs zero or more times.
  • + - item occurs one or more times.
  • ? - item occurs zero or one times.
  • {5} - item occurs five times.
  • {3,7} - item occurs between 3 and 7 times.
  • {2,} - item occurs at least 2 times.
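The list above can be explored with a small experiment. This is a sketch in Python under a few assumptions: the test data (strings of zero to eight a characters) and the helper name accepted_lengths are invented for illustration, and re.fullmatch is used so that a string only counts if the whole of it fits the pattern.

```python
import re

# Hypothetical test data: "", "a", "aa", ... up to eight a's.
candidates = ["a" * n for n in range(9)]

def accepted_lengths(pattern):
    """Return the lengths of the candidate strings that the whole pattern matches."""
    return [len(s) for s in candidates if re.fullmatch(pattern, s)]

for pattern in ["a*", "a+", "a?", "a{5}", "a{3,7}", "a{2,}"]:
    print(pattern, accepted_lengths(pattern))
```

Each line of output shows how many a's that multiplier allows, within the zero to eight that were tried.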

Their effect will be applied to whatever is directly in front of them. It could be a normal character, eg:

lo*
Are you looking at the lock or the silk?

In the above example we are looking for the character 'l' followed by the character 'o' zero or more times. That is why the 'l' in silk is also matched (it is an 'l' followed by zero 'o's).
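Run as a Python sketch (Python being an assumption here), the same search gives three matches, one for each l in the sentence:

```python
import re

text = "Are you looking at the lock or the silk?"

# l followed by zero or more o's
matches = re.findall("lo*", text)
print(matches)  # ['loo', 'lo', 'l']  (from looking, lock and silk)
```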

Or it could be a metacharacter, eg:

l.*k
Are you looking at the lock or the silk?

Now this one may seem a bit odd at first. The '.*' matches zero or more of any character. It is natural to think that it will come across the first 'k' and say 'yep, I've found a match', but what it actually does is say 'k is also any character, so let's see how far we can take this', and it keeps going until it finds the final 'k' in the string. This is what's referred to as greedy matching. Its normal behaviour is to try to find the largest string it can which matches the pattern. We may reverse this behaviour and make it non-greedy (or lazy) by placing a question mark ( ? ) after the multiplier (which can seem a little confusing, as the question mark is a multiplier itself, but you'll get the hang of it).

l.*?k
Are you looking at the lock or the silk?
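The difference between the two is easiest to see side by side. A minimal sketch, assuming Python's re module:

```python
import re

text = "Are you looking at the lock or the silk?"

# Greedy: one long match, from the first l to the last k.
greedy = re.findall("l.*k", text)
print(greedy)  # ['looking at the lock or the silk']

# Lazy: each match stops at the nearest k.
lazy = re.findall("l.*?k", text)
print(lazy)  # ['look', 'lock', 'lk']
```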

Escaping Metacharacters

Sometimes we may actually want to search for one of the characters which is a metacharacter. To do this we use a feature called escaping. By placing a backslash ( \ ) in front of a metacharacter we can remove its special meaning. (In some implementations of regular expressions we may also use escaping to introduce a special meaning to characters which normally don't have one, but more on that in the intermediate section of this tutorial.) Let's say we wanted to find instances of the word 'this' which are the last word of a sentence. If we did the following:

this.
Surely this regular expression should match this.

It would match the 'this.' at the end of the sentence but it also matches 'this ' in the middle of the sentence because the full stop in the regular expression normally matches any character. If we want to make sure it is limited to only 'this.' we may escape the full stop like so:

this\.
Surely this regular expression should match this.
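Here are both versions as a Python sketch (Python is an assumption). The escaped pattern is written as a raw string, r"...", so that Python itself leaves the backslash alone and passes it through to the regular expression.

```python
import re

text = "Surely this regular expression should match this."

# Unescaped: the dot matches the space as well as the full stop.
unescaped = re.findall("this.", text)
print(unescaped)  # ['this ', 'this.']

# Escaped: only a literal full stop will do.
escaped = re.findall(r"this\.", text)
print(escaped)  # ['this.']
```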

It is easy to forget to escape metacharacters when they are part of your search string. If you're getting weird behaviour in your regular expressions keep an eye out for any metacharacters you may have forgotten to escape.

The Mechanism

Now that you have a reasonable idea what regular expressions are, the next step in improving your skills is a good understanding of the underlying mechanism that is used to apply regular expressions to text. When you understand the mechanism, it is easier to troubleshoot when things start going wrong.

The way it works is that we have a pointer which is moved progressively through the search string. Once it comes across a character which matches the beginning of the regular expression it stops. Now a second pointer is started which moves forward from the first pointer, character by character, checking with each step if the pattern still holds or if it fails. If we get to the end of the pattern and it still holds then we have found a match. If it fails at any point then the second pointer is discarded and the main pointer continues through the string.

Let's say that we are looking for the letter p followed by any character followed by the letter t. The example below illustrates how the mechanism works:

p.t
My appetite is huge.

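The reason that the main pointer continues from its location, as opposed to where the second pointer either failed or completed a match, is illustrated in the example above. In appetite, the attempt starting at the first p fails (p, p, e does not fit p.t), but the second p, which was already examined during that failed attempt, is where the match pet begins. It is possible that another match may be found within the set of characters we just checked.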
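The two-pointer idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any real regular expression engine is written: the function name find_first is made up, and it assumes the pattern contains only ordinary characters and the dot (no ranges, multipliers or escapes).

```python
import re

def find_first(pattern, text):
    """Return (start, end) of the first match, or None if there is none.

    Simplified: the pattern may contain only literal characters and the dot.
    """
    # Main pointer: try each starting position in turn.
    for start in range(len(text) - len(pattern) + 1):
        offset = 0  # second pointer, counted from the main pointer
        while offset < len(pattern):
            wanted = pattern[offset]
            if wanted != "." and wanted != text[start + offset]:
                break  # the pattern failed here; discard the second pointer
            offset += 1
        if offset == len(pattern):
            # Reached the end of the pattern and it still holds.
            return (start, start + offset)
    return None

text = "My appetite is huge."
span = find_first("p.t", text)
print(span, text[span[0]:span[1]])     # (5, 8) pet
print(re.search("p.t", text).span())   # (5, 8), the same place Python's own engine finds
```

The attempt at position 4 (the first p) breaks off at the e, and the main pointer then moves on by just one character to position 5, which is where the match is found.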

Summary

dot (.)
Match any character.
[ ]
Match a range of characters contained within the square brackets.
[^ ]
Match a character which is not one of those contained within the square brackets.
*
Match zero or more of the preceding item.
+
Match one or more of the preceding item.
?
Match zero or one of the preceding item.
{n}
Match exactly n of the preceding item.
{n,m}
Match between n and m of the preceding item.
{n,}
Match n or more of the preceding item.
\
Escape, or remove the special meaning of the next character.
String
A sequence of characters.