Character classes

Consider a practical task: we have a phone number like "+7(903)-123-45-67", and we need to reduce it to digits only, 79031234567.

To do so, we can find and remove anything that's not a digit. Character classes can help with that.

A character class is a special notation that matches any symbol from a certain set.

To start, let's explore the "digit" class. It's written as \d and corresponds to "any single digit".

For instance, let's find the first digit in the phone number:

let str = "+7(903)-123-45-67";

let regexp = /\d/;

alert( str.match(regexp) ); // 7

Without the flag g, the regular expression only looks for the first match, that is the first digit \d.

Let's add the g flag to find all digits:

let str = "+7(903)-123-45-67";

let regexp = /\d/g;

alert( str.match(regexp) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7

// let's make the digits-only phone number of them:
alert( str.match(regexp).join('') ); // 79031234567
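The two modes also differ in the shape of the result. The following is a minimal sketch; it uses console.log instead of alert so that it also runs outside the browser, and the variable names are only illustrative:

```javascript
const str = "+7(903)-123-45-67";

// Without g: an array holding the first match, plus extra properties such as index
const first = str.match(/\d/);
console.log(first[0]);    // 7
console.log(first.index); // 1 (position of the match in the string)

// With g: a plain array of all matches
const all = str.match(/\d/g);
console.log(all.length);   // 11
console.log(all.join("")); // 79031234567
```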

That was a character class for digits. There are other character classes as well.

The most used are:

\d ("d" is from "digit"): A digit: a character from 0 to 9.
\s ("s" is from "space"): A space symbol: includes spaces, tabs \t, newlines \n and a few other rare characters, such as \v, \f and \r.
\w ("w" is from "word"): A "wordly" character: either a letter of the Latin alphabet, a digit, or an underscore _. Non-Latin letters (like Cyrillic or Hindi) do not belong to \w.

For instance, \d\s\w means a "digit" followed by a "space character" followed by a "wordly character", such as 1 a.
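As a quick check of that pattern, here is a small sketch. It relies on the regexp method test, which returns true or false depending on whether there is a match; the sample strings are made up for illustration:

```javascript
// \d\s\w: a digit, then a space character, then a "wordly" character
const regexp = /\d\s\w/;

console.log(regexp.test("1 a"));      // true
console.log(regexp.test("1\ta"));     // true: a tab is also matched by \s
console.log(regexp.test("1a"));       // false: there is no character for \s
console.log(regexp.test("1 \u044F")); // false: \u044F is a Cyrillic letter, not in \w
```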

A regexp may contain both regular symbols and character classes.

For instance, CSS\d matches a string CSS with a digit after it:

let str = "Is there CSS4?";
let regexp = /CSS\d/;

alert( str.match(regexp) ); // CSS4

We can also use several character classes:

alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'

In the match, each regexp character class corresponds to one character of the result: \s matches the space, the four \w match H, T, M and L, and \d matches 5.

Inverse classes

For every character class there exists an "inverse class", denoted with the same letter, but uppercased.

The "inverse" means that it matches all other characters, for instance:

\D: Non-digit: any character except \d, for instance a letter.
\S: Non-space: any character except \s, for instance a letter.
\W: Non-wordly character: anything but \w, e.g. a non-Latin letter or a space.
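To see the three inverse classes side by side, here is a short sketch on a made-up sample string (console.log is used instead of alert so that it also runs outside the browser):

```javascript
const str = "Room 42_B!";

// Everything except digits
console.log(str.match(/\D/g).join("")); // "Room _B!"

// Everything except space characters
console.log(str.match(/\S/g).join("")); // "Room42_B!"

// Everything except Latin letters, digits and underscores
console.log(str.match(/\W/g).join("")); // " !"
```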

At the beginning of the chapter we saw how to make a number-only phone number from a string like +7(903)-123-45-67: find all digits and join them.

let str = "+7(903)-123-45-67";

alert( str.match(/\d/g).join('') ); // 79031234567

An alternative, shorter way is to find non-digits \D and remove them from the string:

let str = "+7(903)-123-45-67";

alert( str.replace(/\D/g, "") ); // 79031234567
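There is also a practical difference between the two approaches: match with the g flag returns null when nothing matches, so calling join on its result throws for a string without digits, while replace just returns an empty string. A minimal sketch, in which the helper name digitsOnly is hypothetical:

```javascript
// Hypothetical helper: keeps only the digits of a string
function digitsOnly(str) {
  return str.replace(/\D/g, "");
}

console.log(digitsOnly("+7(903)-123-45-67")); // 79031234567
console.log(digitsOnly("no digits here"));    // "" (empty string)

// The match-based variant would need a guard, because match returns null here
console.log("no digits here".match(/\d/g));   // null
```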

A dot is "any character"

A dot . is a special character class that matches "any character except a newline".

For instance:

alert( "Z".match(/./) ); // Z

Or in the middle of a regexp:

let regexp = /CS.4/;

alert( "CSS4".match(regexp) ); // CSS4
alert( "CS-4".match(regexp) ); // CS-4
alert( "CS 4".match(regexp) ); // CS 4 (space is also a character)

Please note that a dot means "any character", but not the "absence of a character". There must be a character to match it:

alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot

Dot as literally any character with "s" flag

By default, a dot doesn't match the newline character \n.

For instance, the regexp A.B matches A, and then B with any character between them, except a newline \n:

alert( "A\nB".match(/A.B/) ); // null (no match)

There are many situations when we'd like a dot to mean literally "any character", newline included.

That's what the s flag does. If a regexp has it, then a dot . matches literally any character:

alert( "A\nB".match(/A.B/s) ); // A\nB (match!)

The s flag is not supported in IE.

Luckily, there's an alternative that works everywhere. We can use a regexp like [\s\S] to match "any character" (this pattern will be covered in the article Sets and ranges [...]).

alert( "A\nB".match(/A[\s\S]B/) ); // A\nB (match!)

The pattern [\s\S] literally says: "a space character OR not a space character". In other words, "anything". We could use another pair of complementary classes, such as [\d\D]; it doesn't matter which. Or even [^], which means "match any character except nothing".
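The equivalence of these alternatives can be checked with a short sketch (it uses the regexp method test, which returns true or false, and console.log instead of alert):

```javascript
const str = "A\nB";

console.log(/A.B/.test(str));      // false: the dot doesn't match a newline
console.log(/A[\s\S]B/.test(str)); // true
console.log(/A[\d\D]B/.test(str)); // true
console.log(/A[^]B/.test(str));    // true
```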

We can also use this trick if we want both kinds of "dots" in the same pattern: the actual dot . behaving the regular way ("not including a newline"), and also a way to match "any character" with [\s\S] or the like.
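For illustration, here is a sketch of a pattern that mixes both kinds of "dots"; the pattern and the sample strings are invented for this example:

```javascript
// The dot must match a non-newline character, while [\s\S] may match anything
const regexp = /A.B[\s\S]C/;

console.log(regexp.test("A-B\nC"));  // true: "-" for the dot, "\n" for [\s\S]
console.log(regexp.test("A-B C"));   // true: [\s\S] matches a space as well
console.log(regexp.test("A\nB\nC")); // false: the dot cannot match "\n"
```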

Usually we pay little attention to spaces. For us, the strings 1-5 and 1 - 5 are nearly identical.

But if a regexp doesn't take spaces into account, it may fail to work.

Let's try to find digits separated by a hyphen:

alert( "1 - 5".match(/\d-\d/) ); // null, no match!

Let's fix it by adding spaces to the regexp \d - \d:

alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
// or we can use \s class:
alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works

A space is a character, equal in importance to any other character.

We can't add or remove spaces from a regular expression and expect it to work the same.

In other words, in a regular expression all characters matter, spaces too.
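The two fixes shown above are not interchangeable either: a literal space matches only a space, while \s matches any space character. A small sketch, with illustrative variable names:

```javascript
const literalSpaces = /\d - \d/;  // exactly one space on each side of the hyphen
const anySpaceChar = /\d\s-\s\d/; // one space character (space, tab, ...) on each side

console.log(literalSpaces.test("1 - 5"));   // true
console.log(anySpaceChar.test("1 - 5"));    // true

console.log(literalSpaces.test("1\t-\t5")); // false: a tab is not a literal space
console.log(anySpaceChar.test("1\t-\t5"));  // true

console.log(literalSpaces.test("1-5"));     // false
console.log(anySpaceChar.test("1-5"));      // false: \s needs a character to match
```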

Summary

The following character classes exist:

\d - digits.
\D - non-digits.
\s - space symbols, tabs, newlines.
\S - all but \s.
\w - Latin letters, digits, underscore '_'.
\W - all but \w.
. - any character if the regexp has the 's' flag, otherwise any character except a newline \n.

...But that's not all!

The Unicode encoding, used by JavaScript for strings, provides many properties for characters, such as which language a letter belongs to (if it's a letter), whether it's a punctuation sign, etc.

We can search by these properties as well. That requires flag u, covered in the next article.
