Now it starts to get interesting.
Now that you've got a feel for regular expressions, we'll add a bit more complexity. The features you'll find below have to do with identifying particular types of characters and locations within a string.
In the previous section of this tutorial we looked at the range operator ( [ ] ). That allowed us to specify a set of characters, any of which could be matched. Some ranges are used so frequently that a set of shortcuts has been created to refer to them. We access these by using the escape character ' \ ' followed by a letter. (In this case the escape character introduces a special meaning rather than taking it away.)
\s - anything which is considered whitespace (a space, a tab and so on).
\S - anything which is not whitespace.
\d - a digit (0-9).
\D - anything which is not a digit.
\w - a word character (A-Z, a-z, 0-9 and _).
\W - anything which is not a word character.
The last pair of shortcuts above, the ones dealing with words, is where things start to get interesting. See the section on word boundaries below for more information on this.
In the above list you'll notice that the same letter but in uppercase always matches the opposite of what the letter in lowercase does.
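To see the shortcuts and their uppercase opposites side by side, here is a minimal sketch in Python using the standard re module. The sample text and variable names are made up for illustration, and the sample is plain ASCII (Python's \d and \w also accept non-ASCII digits and letters by default).

```python
import re

# Hypothetical sample text, chosen only to exercise each shortcut.
text = "Room 42 is_free!"

digits = re.findall(r"\d", text)       # single digits
non_digits = re.findall(r"\D+", text)  # runs of anything that is not a digit
blanks = re.findall(r"\s", text)       # single whitespace characters
chunks = re.findall(r"\S+", text)      # runs of anything that is not whitespace
words = re.findall(r"\w+", text)       # runs of word characters
others = re.findall(r"\W", text)       # anything that is not a word character

print(digits)      # ['4', '2']
print(non_digits)  # ['Room ', ' is_free!']
print(blanks)      # [' ', ' ']
print(chunks)      # ['Room', '42', 'is_free!']
print(words)       # ['Room', '42', 'is_free']
print(others)      # [' ', ' ', '!']
```

Note that the underscore counts as a word character while the exclamation mark does not.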
As with the elements we saw in the previous section, these will match a single character but may have a multiplier placed after them to increase that. For instance, if we wanted to find any dollar amounts which had four digits in them we could create a regular expression as follows:
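A minimal sketch of such a pattern in Python, using the standard re module. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text: one 5-digit, one 4-digit and one 3-digit amount.
text = "The car cost $15000, the bike $2500 and the helmet $250."

# \$ is a literal dollar sign; \d{4} is exactly four digits in a row.
pattern = re.compile(r"\$\d{4}")

matches = pattern.findall(text)
print(matches)  # ['$1500', '$2500']
```

The first result comes from $15000: the pattern is satisfied by the dollar sign and the first four digits, and nothing in it says that a fifth digit may not follow.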
You'll notice that in the above example I have escaped the '$' sign. The '$' has a special meaning which you'll learn about below. You'll also notice that we found a match in the first number even though it had more than 4 digits. This can sometimes be a little confusing, but you have to remember that regular expressions don't consider the meaning of the content, only whether a string of characters matches the given pattern.
It can be easy for us as humans to overlook potential matches as we tend to look at things and perceive their meaning. We have to get into the habit of remembering that regular expressions don't do this.
As well as our normal characters, there are a few other characters which we don't actually see but which help in formatting our text. These are the tab ( \t ), the carriage return ( \r ) and the line feed, or new line ( \n ).
The tab character you should be familiar with (it prints a larger gap than a normal space) but the other two are a bit more interesting.
The concepts of carriage return and line feed came about with mechanical typewriters. The carriage return function moved the cursor from the end of the line to the beginning of the line. The line feed function moved down a line.
Depending on the OS you are using, one or a combination of these can be used to signify a new line. Windows uses a carriage return followed by a line feed ( \r\n ), Unix, Linux and macOS use a line feed on its own ( \n ), and the classic Mac OS used a carriage return on its own ( \r ).
Thankfully, with the power of regular expressions, we can create a pattern that will identify a newline regardless of which OS the data has come from. To do this, however, we need to use some of the features that will be introduced in the advanced section of this tutorial. If you know for certain which characters the data you are searching uses, then you can just use those.
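As a preview, here is one possible OS-independent newline pattern, sketched in Python with the standard re module. It relies on alternation ( | ), which is assumed here to be one of the advanced features mentioned above. The sample string and names are made up.

```python
import re

# One pattern covering all three conventions. \r\n is listed first so that a
# carriage return followed by a line feed counts as one newline, not two.
NEWLINE = re.compile(r"\r\n|\r|\n")

# Hypothetical text mixing the three newline styles.
mixed = "unix\nwindows\r\nold mac\rlast"

lines = NEWLINE.split(mixed)
newline_count = len(NEWLINE.findall(mixed))

print(lines)          # ['unix', 'windows', 'old mac', 'last']
print(newline_count)  # 3
```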
\n and \r are present in some regular expression implementations but not in others. If you are getting some weird behaviour, it could be that they are not implemented in the particular tool you are using.
Building upon the idea of new lines, we introduce two particular locations on a line: the beginning and the end of the line. We can refer to these locations in our regular expressions using the special characters ^ (the beginning of the line) and $ (the end of the line).
It's important to understand exactly what these represent.
In the line above it is usual for people to say that the I in It's and the full stop at the end represent the beginning and end of the line. This is in fact incorrect (with respect to regular expressions). The beginning of the line is actually a zero width character just before the I, and the end of the line is another zero width character just after the full stop. By zero width we mean that they take up no space: they mark a position on the line rather than a character we can see.
These two positions on the line are referred to as anchors as they allow us to anchor our pattern to a particular point on the line.
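The zero width nature of the anchors can be checked directly. In this minimal Python sketch (standard re module, variable names made up), each anchor matches an empty string, and the start and end positions of each match are the same.

```python
import re

line = "It's important to understand exactly what these represent."

start = re.search(r"^", line)  # the position just before the I
end = re.search(r"$", line)    # the position just after the full stop

# Both matches are empty: they consume no characters at all.
print(repr(start.group()), start.span())
print(repr(end.group()), end.span() == (len(line), len(line)))
```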
Let's say we want to identify a number but only if it is the very first thing on the line.
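A minimal sketch in Python, using the standard re module and made-up sample text. One Python-specific assumption: by default ^ only matches at the start of the whole string, so the re.MULTILINE flag is passed to make it match at the start of every line.

```python
import re

# Hypothetical three-line sample; only lines 1 and 3 begin with a number.
text = "42 apples\nroom 7\n2024 was busy"

# ^ anchors the match to the start of a line; \d+ is one or more digits.
leading_numbers = re.findall(r"^\d+", text, flags=re.MULTILINE)

print(leading_numbers)  # ['42', '2024']
```

The 7 on the second line is a number, but it is not the first thing on its line, so it is not matched.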
Using the two together can be a useful way to identify something which is the only thing on the line. Maybe we want to identify any lines which contain only a single word which is either bat or bit or but.
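Sketched in Python with the standard re module, again with re.MULTILINE so that the anchors apply to each line. The sample lines are made up.

```python
import re

# Hypothetical sample: six lines, three of which are exactly bat, bit or but.
text = "bat\nbit me\nbut\nbet\nabut\nbit"

# ^ and $ pin the pattern to both ends of the line; [aiu] is the middle letter.
only_word = re.findall(r"^b[aiu]t$", text, flags=re.MULTILINE)

print(only_word)  # ['bat', 'but', 'bit']
```

"bit me" and "abut" both contain a matching word, but each has something else on the line, so the anchors rule them out.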
Word boundaries are an example of another zero width character used often within regular expressions. A word boundary is the very beginning or end of a word. They may be identified using the following:
\< - the beginning of a word.
\> - the end of a word.
\b - either the beginning or the end of a word.
The first two items listed above aren't available in all regular expression tools, but \b generally is, so it is the safer one to use.
A word is generally considered to be a string of characters that would be matched by the \w character class (that is, A-Z, a-z, 0-9 and _). Note that this doesn't include punctuation such as the apostrophe ( ' ) as may be seen in the example below.
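A minimal sketch in Python, whose standard re module provides \b. The sample sentence and variable names are made up for illustration.

```python
import re

text = "It's a cat, not a category."

# Every run of word characters with a boundary on each side.
words = re.findall(r"\b\w+\b", text)
print(words)  # ['It', 's', 'a', 'cat', 'not', 'a', 'category']

# With boundaries, cat is only found as a whole word.
whole = re.findall(r"\bcat\b", text)
print(whole)  # ['cat']

# Without them, the cat inside category is found too.
loose = re.findall(r"cat", text)
print(loose)  # ['cat', 'cat']
```

Because the apostrophe is not a word character, It's is split into two words, It and s, with a word boundary on either side of the apostrophe.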