Examples - Regular Expressions Tutorial

Regular Expressions Examples!

RE for the real world.

Introduction

Regular expressions can be a bit of an abstract concept to get your head around. Let's take a look at some real-world examples to give you a better idea of how they are actually used and what types of problems you can solve with them. Remember, the examples below are just a taste of what you can do with regular expressions. They have many uses and you are really only limited by your imagination and creativity.

In the examples below you may hover over the breakdowns to highlight the various components.

Hovers on this page.

To create the hover functionality on this page, where hovering over a description or example highlights the part of the regular expression it refers to, a small amount of JavaScript is used. If you want to see the code you can right-click and 'view source'. I set IDs on the relevant sections, then the JavaScript uses a regular expression to extract the relevant data and, from that, identify the corresponding content to highlight as well. For instance, I have an item identified as eg1trigger3. When you hover over it, a regular expression is used to extract the example number (1) and trigger number (3) as follows:

eg(\d+)trigger(\d+)

Broken down that is:

The beginning of the intended identifier - eg.
The example number, one or more digits. By placing it within round brackets (a capture group) we can use it in the JavaScript later on.
The trigger section of the identifier.
The trigger number, one or more digits. By placing it within round brackets (as with the example number) we can use it in the JavaScript later on.
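
As a rough illustration of how the two captured numbers could be read back in JavaScript, here is a minimal sketch. The function name parseTriggerId is hypothetical and is not the code this page actually uses. The ^ and $ anchors are also an addition, assumed here so that only a complete identifier matches.

```javascript
// Hypothetical helper, not the page's real code.
// ^ and $ are added so the whole identifier has to match.
const TRIGGER_ID = /^eg(\d+)trigger(\d+)$/;

function parseTriggerId(id) {
  const match = TRIGGER_ID.exec(id);
  if (match === null) {
    return null; // not a trigger identifier
  }
  // match[1] and match[2] hold the text captured by the two groups.
  return {
    example: Number(match[1]),
    trigger: Number(match[2]),
  };
}

console.log(parseTriggerId("eg1trigger3")); // { example: 1, trigger: 3 }
console.log(parseTriggerId("intro"));       // null
```

From the example number and trigger number, a script could then build the ID of the matching element to highlight.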

Credit card numbers

Form validation is an area where regular expressions are really useful. This example and the next few all apply here.

A credit card number is typically 16 digits, usually divided into 4 groups of 4 digits. Don't you hate it when you are given just a single field and not told whether it accepts the number as one big number or as four 4-digit groups? What if it could elegantly handle both? While we're at it, let's also allow the separator to be a space, dash or comma.

\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}

Broken down that is:

4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits, followed by
either a dash, comma or space, zero or one times, followed by
4 digits

If you want different separators it's a simple matter of changing what is inside the square brackets. It's also easy to adapt this pattern to other digit-based items such as membership numbers or IP addresses.
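
Here is a minimal sketch of how the pattern might be used to check a form field in JavaScript, with all three separators optional as described in the breakdown. The function names are made up for illustration. The ^ and $ anchors are an assumption added so that the whole field has to match, not just part of it. This only checks the format, not whether the number is a real card.

```javascript
// Anchored version of the pattern, with every separator optional.
const CARD_NUMBER = /^\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}$/;

// Hypothetical helper: true if the input looks like 16 digits,
// optionally split into four groups of four.
function isCardNumberFormat(input) {
  return CARD_NUMBER.test(input.trim());
}

// Hypothetical helper: strip the separators so the value is plain digits.
function normaliseCardNumber(input) {
  return input.replace(/[-, ]/g, "");
}

console.log(isCardNumberFormat("1234567812345678"));     // true
console.log(isCardNumberFormat("1234-5678-1234-5678"));  // true
console.log(isCardNumberFormat("1234 5678,1234-5678"));  // true
console.log(isCardNumberFormat("1234-5678-1234"));       // false
console.log(normaliseCardNumber("1234 5678 1234 5678")); // 1234567812345678
```

Note that the pattern lets the separators be mixed within one number. Whether that is acceptable depends on how strict you want the form to be.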

It's possible to expand on this to accommodate different brands of credit cards. This is because each brand has unique characteristics that we can identify.
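
As a rough sketch of the idea, the leading digits can be used to guess the brand. The prefixes below (4 for Visa, 51 to 55 for Mastercard) are a simplified assumption for illustration only. Real issuer ranges are broader and change over time, so check current documentation before relying on them. The names BRAND_PATTERNS and detectBrand are hypothetical.

```javascript
// Simplified, illustrative prefixes only. Real ranges are broader.
const BRAND_PATTERNS = [
  { brand: "visa", pattern: /^4\d{15}$/ },
  { brand: "mastercard", pattern: /^5[1-5]\d{14}$/ },
];

// Hypothetical helper: strip separators, then test each brand pattern in turn.
function detectBrand(input) {
  const digits = input.replace(/[-, ]/g, "");
  const found = BRAND_PATTERNS.find(({ pattern }) => pattern.test(digits));
  return found ? found.brand : "unknown";
}

console.log(detectBrand("4111 1111 1111 1111")); // visa
console.log(detectBrand("5500-0000-0000-0004")); // mastercard
console.log(detectBrand("1234567812345678"));    // unknown
```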

Email addresses

Email addresses are an interesting case. There are a variety of approaches depending on how pedantic you want to be. What constitutes valid syntax can be quite complex. Here is a less pedantic approach.

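[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}

Broken down that is:

The recipient name (the dash is placed last inside the square brackets so that it is treated as a literal dash rather than a range), followed by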
The @ symbol, followed by
The main part of the domain, followed by
The top level domain part of the domain

You may be wondering why the top level domain section is between 2 and 63 characters and not something like 2 and 4 characters. Things used to be simple, with nearly all domains being something like .com or .net, with maybe a .au to signify country. Now things have gotten silly though and domains such as .sandvikcoromant exist (with more likely to follow). The official specification allows for up to 63 characters for the top level domain so it is probably safest to check for that.
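
Here is a minimal sketch of the less pedantic check in JavaScript. The function name looksLikeEmail and the sample addresses are made up for illustration. The ^ and $ anchors are an assumption added so that the whole input has to match. The dash sits last inside the first set of square brackets so that it is a literal dash and does not form a range with its neighbours.

```javascript
// Anchored version of the pattern. The dash is last in the first class
// so it is a literal character, not a range.
const EMAIL = /^[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}$/;

// Hypothetical helper: a loose syntax check, not proof the address exists.
function looksLikeEmail(input) {
  return EMAIL.test(input);
}

console.log(looksLikeEmail("jane.doe+news@example.com"));   // true
console.log(looksLikeEmail("sales@tools.sandvikcoromant")); // true
console.log(looksLikeEmail("jane.doe@example"));            // false
console.log(looksLikeEmail("jane doe@example.com"));        // false
```

A check like this only tells you the text is shaped like an email address. Sending a confirmation message is still the only way to know the address works.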

HTML Mangling

Things like HTML, where our content follows a certain syntax, are great for utilising regular expressions when we want to identify certain elements. I regularly use regular expressions with a search and replace to update certain parts of a page. Let's say for instance that we wish to identify img tags which don't contain an alt attribute and which are contained within an opening and closing tag of the same type, on a single line. (If you want to learn a bit more about HTML you can check out our HTML tutorial.)

<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>

Broken down that is:

The name of the opening tag. We place this in round brackets (a capture group) so we can match it at the end as well.
Other potential stuff within the opening tag, attributes etc.
Any content before the img tag.
An img tag. Using square brackets to identify it in either lowercase or uppercase or a mixture.
Here we use a negative lookahead to make sure that alt= is not somewhere after the start of the img tag and before its closing >.
Potential content after the img tag and before the closing tag.
The closing tag. \1 becomes the same as what was matched in the brackets around the opening tag name.
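
Here is a minimal sketch of running this kind of search over a few lines of HTML in JavaScript. The sample lines are made up for illustration. The sketch assumes the tag name is captured with * (zero or more further characters) so that names longer than two characters such as div are captured whole. The / in the closing tag is escaped because the pattern is written as a JavaScript regex literal.

```javascript
// Assumes the tag name group uses * so longer names like "div" are captured.
const IMG_WITHOUT_ALT =
  /<([a-zA-Z][a-zA-Z0-9]*).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*<\/\1>/;

// Made-up sample lines, each tested on its own.
const lines = [
  '<p>Logo: <img src="logo.png"></p>',                    // img has no alt
  '<p>Logo: <img src="logo.png" alt="Company logo"></p>', // img has alt
  '<div><img src="a.png"></p>',                           // tags do not match
];

// Keep only the lines where an img without alt sits inside a matching pair.
const flagged = lines.filter((line) => IMG_WITHOUT_ALT.test(line));

console.log(flagged.length); // 1
console.log(flagged[0]);     // <p>Logo: <img src="logo.png"></p>
```

A pattern like this is handy for a quick search and replace, but it is not a full HTML parser, so expect it to miss tags that span several lines.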