Intro to Regular Expressions

The Value of Regular Expressions:

Let's say you've been given a text file containing thousands of random words, one word per line, and you're looking for the words in the file that both start with a lowercase "v" and consist of exactly 6 letters.

You have several options.
  1. Go through the list, look at each word, and note down the ones that start with the letter "v" and are exactly 6 letters long.
  2. Hire someone else to do it for you.
  3. Use regular expressions.

At this point, you're probably screaming internally at me to tell you what a regular expression is. A regular expression is a string. An incredibly special string. It consists of a series of characters that describe the particular string or group of strings you are searching for. This makes it an extremely powerful tool for searching and comparing text. You can use a single regular expression to navigate through endless pages of text to find what you are looking for.

This concept is best illustrated through an example. Let's take another look at the text file full of random words.

In order to find all of the six-letter words that start with the letter "v" in a huge text file, we can use the following regular expression: "^v[a-z]{5}$".

Let's decode the regular expression:
  • The ^ signifies that the character after it has to be the first character in the string that is being compared. In this case, "^v" makes sure that the first character in the string is a "v".
  • Brackets [ ] describe a set of characters, any one of which may appear at that position. In this case, [a-z] represents the lowercase letters that may come after the first letter, "v".
  • A number inside braces { } says how many times the expression right before the braces must appear. In this case, the 5 within the braces signifies that there must be exactly five lowercase letters (any letter from a to z) following the first letter, "v". One "v" plus five more letters gives the six letters we want.
  • The $ signifies that the character before it has to be the last character in the string. Without it, a longer word such as "velvety" would also match, because its first six letters fit the rest of the pattern.
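
To try the word search out, here is a minimal sketch in Java. It uses the pattern "^v[a-z]{5}$": a "v" plus five more lowercase letters makes six letters in total, and the closing $ rules out longer words. The class name, the method name, and the sample words are made up for illustration; a real program would read the words from the text file instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class VWordFinder {
    // "v", then exactly five more lowercase letters, then the end of the string.
    static final Pattern SIX_LETTER_V_WORD = Pattern.compile("^v[a-z]{5}$");

    // Returns the words that match the pattern, in their original order.
    static List<String> findMatches(List<String> words) {
        List<String> matches = new ArrayList<>();
        for (String word : words) {
            if (SIX_LETTER_V_WORD.matcher(word).find()) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of the text file (hypothetical sample data).
        List<String> words = List.of("violet", "vanish", "vase", "velvety", "Violet", "banana");
        System.out.println(findMatches(words)); // prints [violet, vanish]
    }
}
```

"vase" is too short, "velvety" is too long, "Violet" starts with an uppercase "V", and "banana" does not start with "v", so only two words survive.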
Whenever you forget what one of the sequences of characters used in a regular expression is used for, do yourself a favor and look it up below. Doing this will eventually help you know these sequences like the back of your hand. 

Regular Expression Sequences

^
Matches a character at the beginning of the line
The character after ^ must be the first character in the matched string
“^d   ...” would match “dino” or “dirty socks”
“^abc …” would match “abcdef” or “abc de”

$
Matches a character at the end of the line
The character right before $ must be the last character in the matched string
“…   d$” would match “sad” or “very mad”
“… abc$” would match “zyxabc” or “sdkn abc”
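
As a rough illustration of the two anchors, the sketch below searches a few of the sample strings with Java's Matcher.find(), which looks for the pattern anywhere in the input, so the anchors are the only thing pinning the match to the start or the end. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the pattern is found anywhere in the input (a search, not a full match).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^d", "dino"));      // true: the first character is "d"
        System.out.println(found("^d", "sad"));       // false: "d" is not the first character
        System.out.println(found("d$", "very mad"));  // true: the last character is "d"
        System.out.println(found("d$", "dino"));      // false: "d" is not the last character
        System.out.println(found("abc$", "zyxabc"));  // true: the string ends with "abc"
    }
}
```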

.
Matches any character (except a new line)
“h.s” would match “his” or “h%s” or “h1s”

*
Matches 0 or more occurrences of the preceding expression
“y*s” would match “s” or “ys” or “yyyyys”

+
Matches 1 or more occurrences of the preceding expression
“boi+” would match “boi” or “boiiiii”

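|
Either/or option
Matches either a or b in a|b
“c@ts|d0gs” would match either “c@ts” or “d0gs”
(a $ is left out of this example on purpose: inside a regular expression it means "end of the line", not a literal dollar sign)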
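
The dot, the star, the plus, and the either/or bar can be checked with a small Java sketch like the one below. It uses Pattern.matches, which requires the whole input to match the pattern. The class and method names are hypothetical, and "cat|dog" is an extra example chosen for the sketch.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only if the whole input matches the pattern from start to end.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesFully("h.s", "h%s"));      // true: "." accepts "%"
        System.out.println(matchesFully("h.s", "hs"));       // false: "." needs exactly one character
        System.out.println(matchesFully("y*s", "s"));        // true: zero "y"s is allowed
        System.out.println(matchesFully("y*s", "yyyyys"));   // true: any number of "y"s is allowed
        System.out.println(matchesFully("boi+", "boiiiii")); // true: one or more "i"s
        System.out.println(matchesFully("boi+", "bo"));      // false: "+" needs at least one "i"
        System.out.println(matchesFully("cat|dog", "dog"));  // true: the second option matches
        System.out.println(matchesFully("cat|dog", "cow"));  // false: neither option matches
    }
}
```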

[ ]
Matches any single character in the brackets
“A[bc]D” matches “AbD” or “AcD”
Use hyphen (-) to specify a range
“A[a-z]Z” matches “AbZ” or “AtZ”
“A[0-9]Z” matches “A1Z” or “A8Z”
“A[a-zA-Z0-9]Z” matches “AXZ” or “A7Z” or “AqZ”

{  }
Matches the preceding character or group the number of times specified in the braces
“el{3}f” matches “elllf”
Use comma to indicate number range
“el{2,5}f” matches “elllf” or “ellf” or “elllllf”

(  )
Used to group part of an expression
“(tin){2}” matches “tintin”

[^   ]
Matches any character not between the brackets
“b[^a-z]t” matches “bAt” or “b1t”
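
Brackets, braces, parentheses, and negated brackets can be tried out the same way. The sketch below is only an illustration with hypothetical class and method names. Note that the negated example uses "b[^a-z]t", with the ^ right after the opening bracket, since that ^ is what turns the set into "anything except a to z".

```java
import java.util.regex.Pattern;

class BracketDemo {
    // True only if the whole input matches the pattern from start to end.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesFully("A[bc]D", "AcD"));        // true: "c" is in the set
        System.out.println(matchesFully("A[a-z]Z", "A1Z"));       // false: "1" is not a lowercase letter
        System.out.println(matchesFully("A[a-zA-Z0-9]Z", "A7Z")); // true: digits are in the set
        System.out.println(matchesFully("el{3}f", "elllf"));      // true: exactly three "l"s
        System.out.println(matchesFully("el{3}f", "ellf"));       // false: only two "l"s
        System.out.println(matchesFully("el{2,5}f", "elllllf"));  // true: five "l"s is within 2 to 5
        System.out.println(matchesFully("el{2,5}f", "elf"));      // false: one "l" is too few
        System.out.println(matchesFully("(tin){2}", "tintin"));   // true: the group "tin" appears twice
        System.out.println(matchesFully("(tin){2}", "tin"));      // false: the group appears only once
        System.out.println(matchesFully("b[^a-z]t", "bAt"));      // true: "A" is not a lowercase letter
        System.out.println(matchesFully("b[^a-z]t", "bat"));      // false: "a" is excluded by the set
    }
}
```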

\w
Matches any letter, digit, or underscore character
(reminder: in a Java string, you have to write two backslashes instead of one)
“b\\wt” matches “bat” or “bZt” or “b2t”

\W
Matches any character that is not a letter, digit, or underscore
“b\\Wt” matches “b#t” or “b@t”

\s
Matches any white space character
“b\\st” matches “b t”

\S
Matches any non-white space character
“b\\St” matches “b1t” or “b%t”

\d
Matches any digit character
“1\\d4” matches “104” or “124”

\D
Matches any non-digit character
“1\\D4” matches “1a4” or “1#4”
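
The backslash sequences are where the Java reminder about doubling backslashes matters: the regular expression \w has to be typed as "\\w" inside a Java string literal. Here is a minimal sketch with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class ShorthandDemo {
    // True only if the whole input matches the pattern from start to end.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Each "\\" below is a single backslash by the time the regex engine sees it.
        System.out.println(matchesFully("b\\wt", "b2t"));  // true: "2" is a word character
        System.out.println(matchesFully("b\\wt", "b_t"));  // true: underscore also counts for \w
        System.out.println(matchesFully("b\\wt", "b#t"));  // false: "#" is not a word character
        System.out.println(matchesFully("b\\Wt", "b#t"));  // true: \W is the opposite of \w
        System.out.println(matchesFully("b\\st", "b t"));  // true: a space is white space
        System.out.println(matchesFully("b\\St", "b t"));  // false: \S rejects white space
        System.out.println(matchesFully("1\\d4", "124"));  // true: "2" is a digit
        System.out.println(matchesFully("1\\D4", "1a4"));  // true: "a" is not a digit
        System.out.println(matchesFully("1\\D4", "104"));  // false: "0" is a digit
    }
}
```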