Regular Expressions Intermediate!

Now it starts to get interesting.

Introduction

Now that you've got a feel for regular expressions, we'll add a bit more complexity. The features you'll find below have to do with identifying particular types of characters and locations within a string.

Shorthand Character Classes

In the previous section of this tutorial we looked at the range operator ( [] ). That allowed us to specify a set of characters, any of which could be matched. There are some ranges that are used frequently so a set of shortcuts has been created to refer to them. We access these by using the escape character ' \ ' followed by a letter. (In this case the escape character introduces a special meaning rather than taking it away.)

  • \s - matches anything which is considered whitespace. This could be a space, tab, line break etc.
  • \S - matches the opposite of \s, that is anything which is not considered whitespace.
  • \d - matches anything which is considered a digit, i.e. 0-9 (it is effectively a shortcut for [0-9]).
  • \D - matches the opposite of \d, that is anything which is not considered a digit.
  • \w - matches anything which is considered a word character. That is [A-Za-z0-9_]. Note the inclusion of the underscore character '_'. This is because in programming and other areas we regularly use the underscore as part of, say, a variable or function name.
  • \W - matches the opposite of \w, that is anything which is not considered a word character.

The last pair of shortcuts above, the ones dealing with word characters, is where things start to get interesting. See the section on word boundaries below for more information on this.

In the above list you'll notice that the same letter but in uppercase always matches the opposite of what the letter in lowercase does.
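The shorthand classes listed above can be tried out with a short sketch in Python's re module. The sample string and variable names are made up for illustration. One assumption to note: Python's \d, \w and \s also accept non-ASCII Unicode characters by default, so the re.ASCII flag is passed here to stick to the definitions given in this tutorial.

```python
import re

text = "Room 42, bay_7"

# re.ASCII keeps \d, \w and \s to the ASCII definitions used in this tutorial;
# without it, Python also accepts other Unicode digits, letters and spaces.
digits = re.findall(r"\d", text, flags=re.ASCII)
non_digit_runs = re.findall(r"\D+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
non_word_chars = re.findall(r"\W", text, flags=re.ASCII)
whitespace = re.findall(r"\s", text, flags=re.ASCII)
non_space_runs = re.findall(r"\S+", text, flags=re.ASCII)

print(digits)          # ['4', '2', '7']
print(non_digit_runs)  # ['Room ', ', bay_']
print(words)           # ['Room', '42', 'bay_7']
print(non_word_chars)  # [' ', ',', ' ']
print(whitespace)      # [' ', ' ']
print(non_space_runs)  # ['Room', '42,', 'bay_7']
```

Notice that bay_7 comes back as a single word because the underscore counts as a word character.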

As with the elements we saw in the previous section, these will match a single character but may have a multiplier placed after them to increase that. For instance, if we wanted to find any dollar amounts which had four digits in them we could create a regular expression as follows:

\$\d{4}
Today I earned $58327 but lost $3826.

You'll notice that in the above example I have escaped the '$' sign. The '$' has a special meaning which you'll learn about below. You'll also notice that we found a match in the first number even though it has more than 4 digits. This can sometimes be a little confusing, but you have to remember that regular expressions don't consider the meaning of the content, only whether a string of characters matches the given pattern.

It can be easy for us as humans to overlook potential matches as we tend to look at things and perceive their meaning. We have to get into the habit of remembering that regular expressions don't do this.
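To see this behaviour for yourself, here is a minimal sketch of the dollar example in Python. The second pattern is just one possible way to insist on exactly four digits; it borrows the \b word boundary that is covered later in this section.

```python
import re

text = "Today I earned $58327 but lost $3826."

# Four digits after a literal dollar sign. The first amount still matches,
# because its first four digits fit the pattern.
four_digits = re.findall(r"\$\d{4}", text)
print(four_digits)  # ['$5832', '$3826']

# One possible refinement: require a word boundary after the fourth digit,
# so that an amount with a fifth digit is rejected.
exactly_four = re.findall(r"\$\d{4}\b", text)
print(exactly_four)  # ['$3826']
```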

Non Printable Characters

As well as our normal characters, there are a few other characters which we don't actually see but which help in formatting our text. These are the:

  • Tab - represented in regular expressions as \t
  • Carriage return - represented in regular expressions as \r
  • Line feed (or newline) - represented in regular expressions as \n

The tab character you should be familiar with (it prints a larger gap than a normal space) but the other two are a bit more interesting.
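As a quick illustration of matching a character we can't see, the sketch below splits a made-up tab-separated row on \t using Python. The row contents and variable names are assumptions for the example only.

```python
import re

# A made-up row of tab-separated values.
row = "name\tage\tcity"

# Split the row wherever a tab character occurs.
fields = re.split(r"\t", row)
print(fields)  # ['name', 'age', 'city']

# Count how many tabs the row contains.
tab_count = len(re.findall(r"\t", row))
print(tab_count)  # 2
```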

The concepts of carriage return and line feed came about with mechanical typewriters. The carriage return function moved the carriage (and so the typing position) from the end of the line back to the beginning of the line. The line feed function moved the paper down a line.

Depending on the OS you are using, one or a combination of these can be used to signify a new line.

  • Windows - uses the sequence \r\n (in that order)
  • Classic Mac OS (version 9 and earlier) - uses the character \r
  • Unix/Linux and macOS (formerly OS X) - uses the character \n

Thankfully, with the power of regular expressions, we can create a pattern that will identify a newline regardless of which OS the data has come from. To do this, however, we need some of the features that will be introduced in the advanced section of this tutorial. If you know for certain which line ending your data uses, you can simply match that sequence directly.
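When the line ending is known in advance, matching it directly might look like the following Python sketch. The sample strings are invented; the last call shows what can go wrong if Windows data is treated as if it used Unix line endings.

```python
import re

# Made-up sample data using Windows and Unix line endings respectively.
windows_text = "first\r\nsecond\r\nthird"
unix_text = "first\nsecond\nthird"

# Match the exact sequence each kind of data is known to use.
windows_lines = re.split(r"\r\n", windows_text)
unix_lines = re.split(r"\n", unix_text)
print(windows_lines)  # ['first', 'second', 'third']
print(unix_lines)     # ['first', 'second', 'third']

# Splitting Windows data on \n alone leaves a stray carriage return behind.
wrong_split = re.split(r"\n", windows_text)
print(wrong_split)    # ['first\r', 'second\r', 'third']
```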

\n and \r are present in some regular expression implementations but not in others. If you are getting some weird behaviour, it could be that they are not implemented in the particular tool you are using.

Anchors - ^ and $

Building upon the idea of new lines we introduce two particular locations on a line which are the beginning and the end of the line. We can refer to these locations in our regular expressions using the following special characters:

  • ^ (caret) - represents the beginning of the line.
  • $ (dollar) - represents the end of the line.

It's important to understand exactly what these represent.

In the line above, people would usually say that the I in It's and the full stop at the end are the beginning and end of the line. With respect to regular expressions this is incorrect. The beginning of the line is actually a zero-width position just before the I, and the end of the line is another zero-width position just after the full stop. By zero-width we mean that they take up no space and consume no characters: they are there, but we don't see them.

These two positions on the line are referred to as anchors as they allow us to anchor our pattern to a particular point on the line.
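The zero-width nature of the anchors can be checked with a small Python sketch. It searches for each anchor on its own and looks at where the match starts and ends; the variable names are illustrative only.

```python
import re

line = "It's important to understand exactly what these represent."

start = re.search(r"^", line)
end = re.search(r"$", line)

# Both matches are empty strings: the anchors consume no characters.
print(repr(start.group()), repr(end.group()))  # '' ''

# span() gives (start, end) offsets. Each match begins and ends at the same
# offset: just before the I, and just after the full stop.
print(start.span())                       # (0, 0)
print(end.span() == (len(line), len(line)))  # True
```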

Let's say we want to identify a number but only if it is the very first thing on the line.

^\d+
13 cats escaped from the 5 cages at the vet's clinic.
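Here is the same example as a minimal Python sketch, with the unanchored pattern alongside for comparison.

```python
import re

text = "13 cats escaped from the 5 cages at the vet's clinic."

# Anchored: only a number at the very start of the line is found.
anchored = re.findall(r"^\d+", text)
print(anchored)    # ['13']

# Unanchored: every run of digits is found.
unanchored = re.findall(r"\d+", text)
print(unanchored)  # ['13', '5']
```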

Using the two together can be a useful way to identify something which is the only thing on the line. Maybe we want to identify any lines which contain only a single word which is either bat or bit or but.

^b[aiu]t$
This line does not match but the next line does.
bat
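A sketch of this example in Python follows. One thing to be aware of, and an assumption specific to Python's re module rather than something every tool shares: by default ^ and $ anchor to the start and end of the whole string, and the re.MULTILINE flag is needed to make them apply to each line.

```python
import re

text = "This line does not match but the next line does.\nbat"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
per_line = re.findall(r"^b[aiu]t$", text, flags=re.MULTILINE)
print(per_line)      # ['bat']

# Without the flag, the anchors refer to the whole string, so nothing matches.
whole_string = re.findall(r"^b[aiu]t$", text)
print(whole_string)  # []
```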

Word Boundaries

Word boundaries are another zero-width position often used within regular expressions. A word boundary is the very beginning or end of a word. They may be identified using the following:

  • \< - represents the beginning of a word.
  • \> - represents the end of a word.
  • \b - represents either the beginning or end of a word.

The first two items listed above aren't available in all regular expression tools, but \b generally is, so it is the safer one to use.

A word is generally considered to be a string of characters that would be matched by the \w character class (that is, A-Z, a-z, 0-9 and _). Note that this doesn't include punctuation such as the apostrophe ( ' ) as may be seen in the example below.

\bt\w+\b
Now that's the truth and you know it.
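The example above can be reproduced with the short Python sketch below. Python's re module is one of the tools that does not support \< and \> as word boundaries; it treats \< as a literal '<' character, which the second call demonstrates.

```python
import re

text = "Now that's the truth and you know it."

# Whole words beginning with a lowercase t. The apostrophe is not a word
# character, so the match on "that's" stops after "that".
t_words = re.findall(r"\bt\w+\b", text)
print(t_words)       # ['that', 'the', 'truth']

# In Python, \< is not a word boundary; it simply matches a literal '<'.
literal_angle = re.findall(r"\<", "a<b")
print(literal_angle)  # ['<']
```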

Summary

\s
Match any character which is considered whitespace (space, tab etc).
\S
Match any character which is not whitespace.
\d
Match any character which is a digit ( 0 - 9 ).
\D
Match any character which is not a digit.
\w
Match any character which is a word character ( A-Z, a-z, 0-9 and _ ).
\W
Match any character which is not a word character.
\t
Match a tab.
\r
Match a carriage return.
\n
Match a line feed (or newline).
^ (caret)
An anchor which matches the beginning of the line.
$ (dollar)
An anchor which matches the end of the line.
\b
Matches the beginning or end of a word.
\<
Matches the beginning of a word.
\>
Matches the end of a word.