RE for the real world.
Regular expressions can be a bit of an abstract concept to get your head around. Let's take a look at some real world examples to give you a better idea of how they are actually used and what types of problems you can solve with them. Remember, the examples below are just a taste of what you can do with regular expressions. They have many uses and you are really only limited by your imagination and creativity.
In the examples below you may hover over the breakdowns to highlight the various components.
To create the functionality on this page, where you can hover over descriptions and examples to see which parts of the regular expression they refer to, a small amount of JavaScript is used. If you want to see the code, you can right-click and choose 'view source'. I set IDs on the relevant sections, and the JavaScript then uses a regular expression to extract the relevant data and, from that, identify the corresponding content to highlight as well. For instance, I have an item identified as eg1trigger3. When you hover over it, a regular expression is used to extract the example number (1) and trigger number (3) as follows:
eg(\d+)trigger(\d+)
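Broken down that is:
eg : the literal characters eg
(\d+) : one or more digits, captured as group 1 (the example number)
trigger : the literal characters trigger
(\d+) : one or more digits, captured as group 2 (the trigger number)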
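To see the pattern at work outside the browser, here is a minimal sketch in Python rather than the JavaScript this page actually uses. The function name parse_trigger_id is made up for illustration, and requiring the whole ID to match is an assumption about how strict the check should be.

```python
import re

# The pattern from the text: two captured runs of digits.
ID_PATTERN = re.compile(r"eg(\d+)trigger(\d+)")


def parse_trigger_id(element_id):
    """Return (example_number, trigger_number), or None if the ID does not match.

    Hypothetical helper: the page's own JavaScript may be structured differently.
    """
    match = ID_PATTERN.fullmatch(element_id)
    if match is None:
        return None
    # Group 1 is the example number, group 2 is the trigger number.
    return int(match.group(1)), int(match.group(2))


print(parse_trigger_id("eg1trigger3"))   # (1, 3)
print(parse_trigger_id("eg12trigger7"))  # (12, 7)
print(parse_trigger_id("intro"))         # None
```

The round brackets are what make this useful: each pair captures the digits it matched, so the two numbers can be pulled out separately.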
Form validation is an area where regular expressions are really useful. This example and the next few show how they can help.
We know that a credit card number is typically 16 digits, usually divided into 4 groups of 4 digits. Don't you hate it when you are given just a single field and not told whether it accepts the number as one big number or as four 4-digit groups? What if it could elegantly handle both? While we're at it, let's also allow the separator to be a space, dash or comma.
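\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}
Broken down that is:
\d{4} : exactly four digits
[-, ]? : an optional separator, which may be a dash, a comma or a space
These two parts then repeat until there are four groups of four digits, with an optional separator between each pair of groups.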
If you want different separators it's a simple matter of changing what is inside the range operators. It's also easy to adapt this pattern to other digit based items such as membership numbers or IP addresses etc.
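As a rough sketch of how this could sit inside form validation, the Python below applies the pattern with every separator optional. The function names are made up for illustration. Requiring the whole field to match (via fullmatch) is an assumption. The pattern on its own would also match a card number buried inside longer text.

```python
import re

# Four groups of four digits, each optionally separated by a dash, comma or space.
CARD_PATTERN = re.compile(r"\d{4}[-, ]?\d{4}[-, ]?\d{4}[-, ]?\d{4}")


def looks_like_card_number(text):
    """Return True if the whole field has the shape of a 16 digit card number."""
    return CARD_PATTERN.fullmatch(text.strip()) is not None


def strip_separators(text):
    """Remove the permitted separators so only the digits remain."""
    return re.sub(r"[-, ]", "", text)


for candidate in ["1234567812345678", "1234 5678 1234 5678",
                  "1234-5678-1234-5678", "1234 5678 1234"]:
    print(candidate, looks_like_card_number(candidate))
```

This only checks the shape of the input. It says nothing about whether the number belongs to a real card.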
It's possible to expand on this to accommodate different brands of credit cards. This is because each brand has unique characteristics that we can identify.
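As an illustration only, the sketch below tells two brands apart by their leading digits. The prefixes used (Visa starting with 4, Mastercard starting with 51 to 55) are assumptions based on commonly published numbering ranges and are far from complete, and the names BRAND_PATTERNS and guess_brand are made up for this example.

```python
import re

# Assumed, incomplete prefixes: check current issuer ranges before relying on these.
BRAND_PATTERNS = {
    "visa": re.compile(r"4\d{15}"),            # 16 digits starting with 4
    "mastercard": re.compile(r"5[1-5]\d{14}"),  # 16 digits starting with 51 to 55
}


def guess_brand(card_number):
    """Return a brand name for a 16 digit number, or None if no pattern matches."""
    # Drop the separators allowed by the pattern above before testing prefixes.
    digits = re.sub(r"[-, ]", "", card_number)
    for brand, pattern in BRAND_PATTERNS.items():
        if pattern.fullmatch(digits):
            return brand
    return None


print(guess_brand("4111 1111 1111 1111"))  # visa
print(guess_brand("5500-0000-0000-0004"))  # mastercard
print(guess_brand("9999999999999999"))     # None
```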
Email addresses are an interesting case. There are a variety of approaches depending on how pedantic you want to be. What constitutes valid syntax can be quite complex. Here is a less pedantic approach.
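[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}
Broken down that is:
[a-zA-Z0-9.+_-]+ : one or more letters, digits, dots, plus signs, underscores or dashes (the part before the @). The dash sits last inside the brackets so that it is read as a literal dash and not as the middle of a range.
@ : the literal @ character
[a-zA-Z0-9.-]+ : one or more letters, digits, dots or dashes (the domain name)
\. : a literal dot, escaped so that it does not mean 'any character'
[a-zA-Z]{2,63} : between 2 and 63 letters (the top level domain)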
You may be wondering why the top level domain section is between 2 and 63 characters and not something like 2 and 4 characters. Things used to be simple, with nearly all domains being something like .com or .net, with maybe a .au to signify country. Now things have gotten silly though, and domains such as .sandvikcoromant exist (with more likely to follow). The official specification allows for up to 63 characters for the top level domain, so it is probably safest to check for that.
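A minimal Python sketch of this less pedantic check follows. It writes the first set of brackets with the dash placed last so that it is read as a literal dash. The function name is made up for illustration, and matching the whole input with fullmatch is an assumption about how the check would be used on a form field.

```python
import re

# Less pedantic email shape: local part, @, domain, dot, 2 to 63 letter top level domain.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9.+_-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}")


def looks_like_email(text):
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_PATTERN.fullmatch(text) is not None


for candidate in ["jane.doe+news@example.com", "info@tools.sandvikcoromant",
                  "user@localhost", "a<b@example.com"]:
    print(candidate, looks_like_email(candidate))
```

The second address passes only because the top level domain is allowed to be longer than 4 letters. The last two fail because one has no dot after the @ and the other contains a character that is not in the first set of brackets.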
Things like HTML, where our content follows a certain syntax, are great for utilising regular expressions when we want to identify certain elements. I regularly use regular expressions with a search and replace to update certain parts of a page. Let's say, for instance, that we wish to identify img tags which don't contain an alt attribute and which are contained within an opening and closing tag of the same type, on a single line. (If you want to learn a bit more about HTML you can check out our HTML tutorial.)
<([a-zA-Z][a-zA-Z0-9]?).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>
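Broken down that is:
< : the opening angle bracket of the surrounding tag
([a-zA-Z][a-zA-Z0-9]?) : the tag name, a letter optionally followed by one more letter or digit, captured as group 1
.*> : anything else in the opening tag (such as attributes), up to its closing angle bracket
.* : any content before the img tag
<[iI][mM][gG] : the start of an img tag in any mix of upper and lower case, followed by a space
(?![^>]*alt=) : a negative lookahead, which rejects the match if alt= appears before the end of the img tag
.*> : the rest of the img tag
.* : any content after the img tag
</\1> : a closing tag whose name is whatever was captured in group 1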
Let's look at an example to illustrate it (hover your mouse over sections to see how they line up with the regular expression above):
<p class='important'>This is a picture of my dog <img src='./mydog.jpg'>. He is awesome.</p>
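To try this outside of an editor's search and replace, here is a small Python sketch that reports the lines on which the pattern finds a match. The function name and the second sample line (the cat) are made up for illustration, and checking line by line is an assumption that follows from the 'on a single line' condition above.

```python
import re

# The pattern from the text: an img tag without alt=, wrapped in a matching tag pair.
MISSING_ALT = re.compile(
    r"<([a-zA-Z][a-zA-Z0-9]?).*>.*<[iI][mM][gG] (?![^>]*alt=).*>.*</\1>"
)


def lines_missing_alt(html):
    """Return the lines of html on which the pattern finds a match."""
    return [line for line in html.splitlines() if MISSING_ALT.search(line)]


sample = """<p class='important'>This is a picture of my dog <img src='./mydog.jpg'>. He is awesome.</p>
<p>This is my cat <img src='./mycat.jpg' alt='My cat'>. She is aloof.</p>"""

for line in lines_missing_alt(sample):
    print(line)  # only the dog line is printed
```

One thing to be aware of is that, as written, the tag name part captures at most two characters. The pattern therefore finds images inside tags such as p or h1, but not inside longer names such as div, because the closing tag no longer lines up with what was captured.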