Regex character classes

In the previous article we saw how regular characters match themselves and how the dot (.) can match any character except a newline.

A character class is something in between those two extremes. A character class is a list of characters that can be matched.

The list is placed in square brackets [].

For example [abc] will match either 'a' or 'b' or 'c'.

Just as a regular character or a . matches exactly one character, so does a character class. Later we are going to learn about quantifiers that will allow us to say how many of something we would like to match, but for now remember that a character class always matches exactly one character. If it cannot match one, the whole regex fails to match.
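Here is a minimal sketch of the "exactly one character" rule. The ^ and $ anchors are an assumption added for this illustration (they are not covered in this article); they force the whole string to be matched by the single character class.

```perl
use strict;
use warnings;

# [abc] must consume exactly one character: 'a', 'b', or 'c'.
# The ^ and $ anchors are added for this illustration only, so that
# the string may contain nothing else.
my @inputs = ('a', 'b', 'c', 'd', 'ab', '');
my %result;
for my $str (@inputs) {
    $result{$str} = $str =~ /^[abc]$/ ? 'match' : 'no match';
    print "'$str': $result{$str}\n";
}
```

Only the strings 'a', 'b', and 'c' should be reported as matching. The two-character string 'ab' and the empty string fail because the class can neither consume two characters nor zero.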

So what if we have a bunch of strings and we would like to make sure only strings containing any of the following will match? #a#, #b#, #c#, #d#, #e#, #f#, #@# or #.# That is, we would like the string to have a # character, followed by 'a', 'b', 'c', 'd', 'e', 'f', '@', or '.', followed by another # character. (We are using # in this example in order to get you used to seeing 'strange' characters that have no special meaning.)

The regex that will match those looks like this: /#[abcdef@.]#/.

It says: match a #, then match any one(!) of the characters in the square bracket, then match another #.

This will match:

"#a#"
"ab #z#a# "
"ab #.# "

but will not match any of the following:

"ab #q# "
"ab ## "
"##"
"#ab#"
"#aa#"
"# #"
"###"
"#-#"

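Three notes:

The regex won't match "##" or "#ab#" because the character class must match exactly one character between the two '#' characters.
The string "ab #z#a# " matches even though #z# does not: the regex may match anywhere in the string, and the second '#' of #z# is followed by 'a' and another '#', which together form #a#.
The '.' inside the character class lost its special meaning of "everything except newline" and can match a single '.' only.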
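The lists above can be checked with a short script. This is only a sketch: the variable names are made up for the illustration, and the strings are copied from the two lists.

```perl
use strict;
use warnings;

my $re = qr/#[abcdef@.]#/;

my @should_match     = ('#a#', 'ab #z#a# ', 'ab #.# ');
my @should_not_match = ('ab #q# ', 'ab ## ', '##', '#ab#', '#aa#', '# #', '###', '#-#');

# Report what the regex does with every string from both lists.
for my $str (@should_match, @should_not_match) {
    my $verdict = $str =~ $re ? 'matches' : 'does not match';
    print "\"$str\" $verdict\n";
}

# Count how many strings in each list behave as expected.
my $matched_ok  = grep { $_ =~ $re } @should_match;
my $rejected_ok = grep { $_ !~ $re } @should_not_match;
print "expected matches: $matched_ok, expected rejections: $rejected_ok\n";
```

If the lists are right, the last line reports 3 expected matches and 8 expected rejections.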

In general, most special characters lose their special meaning inside a character class, but there are of course exceptions. There are even characters that gain special meaning inside a character class.

Range in a character class

Programmers are lazy, and typing in all the characters between 'a' and 'f' in the regex /#[abcdef@.]#/ was really tiring. If we had to type in all the characters between 'a' and 'z', that would be even worse, and it would be very error-prone. What if I miss one of the characters? Instead, regexes allow us to define a range of characters from the ASCII table using a dash (-). The previous regex could be written as /#[a-f@.]#/.

So, as you can see, a dash (-), which has no special meaning outside of a character class, takes on a special "range-making" meaning inside one.
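To convince ourselves that the range is only a shorthand, we can compare the two regexes on every ASCII character placed between two '#' characters. This is a sketch; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $long  = qr/#[abcdef@.]#/;   # every character listed one by one
my $short = qr/#[a-f@.]#/;      # the same class written with a range

my $differences = 0;
my @accepted;
for my $code (0 .. 127) {
    my $str      = '#' . chr($code) . '#';
    my $by_long  = $str =~ $long  ? 1 : 0;
    my $by_short = $str =~ $short ? 1 : 0;
    $differences++ if $by_long != $by_short;
    push @accepted, chr($code) if $by_short;
}

print "differences: $differences\n";
print "accepted characters: ", join('', @accepted), "\n";
```

The two regexes should agree on all 128 characters, and the accepted characters should be '.', '@', and 'a' through 'f'.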

Of course you will then want to know how to express that one of the characters you'd like to match in the character class is a dash. The answer is that if you place the dash as the first or the last character in the character class, it is just a plain dash. So /#[a-f@.-]#/ will match all the above and also "#-#".
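A small sketch of the plain dash in both positions. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $dash_last  = qr/#[a-f@.-]#/;   # dash as the last character of the class
my $dash_first = qr/#[-a-f@.]#/;   # dash as the first character of the class

my %seen;
for my $str ('#-#', '#c#', '#g#') {
    my $last  = $str =~ $dash_last  ? 1 : 0;
    my $first = $str =~ $dash_first ? 1 : 0;
    $seen{$str} = [ $last, $first ];
    print "$str  dash last: $last  dash first: $first\n";
}
```

Both versions should accept "#-#" and "#c#", and both should reject "#g#", because 'g' is outside the a-f range.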

Another frequently asked question at this point is how to include a closing square bracket ] in a character class. That's simple too. You just need to "escape" it by preceding it with a backslash: \].
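For example, a character class that accepts 'a' through 'f' or a closing square bracket could be sketched like this. The regex and the test strings are made up for the illustration.

```perl
use strict;
use warnings;

# \] inside the class is a literal closing bracket;
# the unescaped ] after it ends the class.
my $re = qr/#[a-f\]]#/;

my %matched;
for my $str ('#]#', '#a#', '#[#') {
    $matched{$str} = $str =~ $re ? 1 : 0;
    print "$str ", ($matched{$str} ? 'matches' : 'does not match'), "\n";
}
```

Here "#]#" and "#a#" should match, while "#[#" should not, because the opening bracket is not in the list.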

Negated character class

What if we would like to allow the matching of any character between two '#' characters except 'a', 'b', or 'c'? We would need to construct a character class with all the characters in the world, and Unicode has more than 110,000 characters. That would be a lot of work to type in. Instead of that, Perl allows us to negate a character class. If we put a caret (^) as the first character in the character class, it means the character class can match any one character except those mentioned in the character class. So [^abc] would match exactly one character that is neither 'a', nor 'b', nor 'c'. Our full regex would then look like /#[^abc]#/.

This regex will match these strings:

"abc #z# z"
"#z#"

but will not match any of these strings:

"abc #a# z"
"#xyz#"
"##"

Note, it won't match the string '##' or the string "#xyz#", because the negated character class still has to match exactly one character.
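The same kind of check works for the negated class. This is a sketch using the strings from the two lists above; the variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $re = qr/#[^abc]#/;

my @should_match     = ('abc #z# z', '#z#');
my @should_not_match = ('abc #a# z', '#xyz#', '##');

for my $str (@should_match, @should_not_match) {
    my $verdict = $str =~ $re ? 'matches' : 'does not match';
    print "\"$str\" $verdict\n";
}

# Count how many strings in each list behave as expected.
my $matched_ok  = grep { $_ =~ $re } @should_match;
my $rejected_ok = grep { $_ !~ $re } @should_not_match;
print "expected matches: $matched_ok, expected rejections: $rejected_ok\n";
```

The last line should report 2 expected matches and 3 expected rejections.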

Summary

/a[bc]a/      # aba, aca
/a[2#=x?.]a/  # a2a, a#a, a=a, axa, a?a, a.a
              # inside the character class most of the special characters lose
              # their special meaning, BUT there are some new special characters
/a[2-8]a/     # is the same as /a[2345678]a/
/a[2-]a/      # a2a, a-a        - has no special meaning at the ends
/a[-8]a/      # a8a, a-a
/a[6-C]a/     # a6a, a7a ... aCa
              #      characters from the ASCII table: 6789:;<=>?@ABC but this is not recommended, don't use it!
/a[C-6]a/     # syntax error

/a[^xy]a/     # "aba", "aca"  but not "aya", "axa" and remember, not "aa"
              # ^ as the first character in a character class means 
              # a character that is not in the list
/a[b^x]a/     # aba, a^a, axa,  but not aza
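A few rows of this summary can be verified with a table-driven sketch. The test strings are chosen for the illustration. The reversed range is compiled from a string inside eval, because writing it as a literal regex would stop the whole script from compiling.

```perl
use strict;
use warnings;

# Each entry: regex, test string, 1 if a match is expected, 0 otherwise.
my @cases = (
    [ qr/a[2-8]a/, 'a5a', 1 ],
    [ qr/a[2-8]a/, 'a9a', 0 ],
    [ qr/a[2-]a/,  'a-a', 1 ],
    [ qr/a[-8]a/,  'a-a', 1 ],
    [ qr/a[6-C]a/, 'a:a', 1 ],   # ':' sits between '6' and 'C' in ASCII
    [ qr/a[^xy]a/, 'aba', 1 ],
    [ qr/a[^xy]a/, 'aa',  0 ],   # the class still needs one character
    [ qr/a[b^x]a/, 'a^a', 1 ],   # ^ is literal when it is not first
    [ qr/a[b^x]a/, 'aza', 0 ],
);

my $failures = 0;
for my $case (@cases) {
    my ($re, $str, $expected) = @$case;
    my $got = $str =~ $re ? 1 : 0;
    $failures++ if $got != $expected;
    print "$str against $re : ", ($got ? 'match' : 'no match'), "\n";
}
print "unexpected results: $failures\n";

# A reversed range is rejected when the pattern is compiled.
my $reversed = 'a[C-6]a';
my $compiled = eval { qr/$reversed/ };
print defined $compiled ? "compiled\n" : "rejected: reversed range\n";
```

All rows should behave as the summary says, and the reversed range should be rejected.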

Comments

/a[b^x]a/ # aba, a^a, axa, but not aza

I didn't get this.

--- 'z' is not in the character class, therefore you cannot match it. It has to match 'a', then any character in the [] ( 'b' or '^' or 'x'), then 'a'. Hope it helps!!!

I didn't understand why /#[abcdef@.]#/ matches the string "ab #z#a# ".

Can you please explain this?

Written by
Gabor Szabo

Published on 2014-11-09