Consider a practical task: we have a phone number like "+7(903)-123-45-67", and we need to turn it into digits only: 79031234567.
To do so, we can find and remove anything that's not a digit. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
To start, let's explore the "digit" class. It's written as \d and corresponds to "any single digit".
For instance, let's find the first digit in the phone number:
let str = "+7(903)-123-45-67";
let regexp = /\d/;
alert( str.match(regexp) ); // 7
Without the flag g, the regular expression only looks for the first match, that is the first digit \d.
Let's add the g flag to find all digits:
let str = "+7(903)-123-45-67";
let regexp = /\d/g;
alert( str.match(regexp) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
// let's join them into a digits-only phone number:
alert( str.match(regexp).join('') ); // 79031234567
That was a character class for digits. There are other character classes as well.
The most used are:
\d ("d" is from "digit") - A digit: a character from 0 to 9.
\s ("s" is from "space") - A space symbol: includes spaces, tabs \t, newlines \n and a few other rare characters, such as \v, \f and \r.
\w ("w" is from "word") - A "wordly" character: either a letter of the Latin alphabet, a digit, or an underscore _. Non-Latin letters (like Cyrillic or Hindi) do not belong to \w.
For instance, \d\s\w means a "digit" followed by a "space character" followed by a "wordly character", such as 1 a.
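The three classes can be tried side by side. The following is a minimal sketch: the sample string is made up for illustration, and console.log is used instead of alert so that the snippet also runs outside a browser.

```javascript
// Sample string made up for this sketch; "Я" is a Cyrillic letter.
const sample = "Room 7 a_b\tЯ";

// With the g flag, match() returns every character that fits the class.
const digits = sample.match(/\d/g); // ["7"]
const spaces = sample.match(/\s/g); // [" ", " ", "\t"]
const wordly = sample.match(/\w/g); // Latin letters, the digit and "_", but not "Я"

// Three classes in a row: a digit, then a space character, then a wordly character.
const combined = sample.match(/\d\s\w/)[0];

console.log(digits.join("")); // 7
console.log(spaces.length);   // 3
console.log(wordly.join("")); // Room7a_b
console.log(combined);        // 7 a
```

Note that the Cyrillic letter is skipped by \w, in line with the description above.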
A regexp may contain both regular symbols and character classes.
For instance, CSS\d matches a string CSS with a digit after it:
let str = "Is there CSS4?";
let regexp = /CSS\d/;
alert( str.match(regexp) ); // CSS4
We can also use several character classes:
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'
In the match, each regexp character class corresponds to one character of the result: \s matches the space, the four \w match H, T, M and L, and \d matches 5.
Inverse classes
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
The "inverse" means that it matches all other characters, for instance:
\D - Non-digit: any character except \d, for instance a letter.
\S - Non-space: any character except \s, for instance a letter.
\W - Non-wordly character: anything but \w, e.g. a non-Latin letter or a space.
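To see that each inverse class is the exact complement of its lowercase counterpart, here is a small sketch. The sample text and the helper name isDigitXorNonDigit are assumptions introduced only for this illustration.

```javascript
// Sample text made up for this sketch: a letter, a digit, a space,
// a Cyrillic letter and an underscore.
const text = "a1 Я_";

const nonDigits = text.match(/\D/g); // everything except "1"
const nonSpaces = text.match(/\S/g); // everything except the space
const nonWordly = text.match(/\W/g); // the space and the Cyrillic letter

// Hypothetical helper: for a single character, exactly one of \d and \D matches.
function isDigitXorNonDigit(ch) {
  return /\d/.test(ch) !== /\D/.test(ch);
}

console.log(nonDigits.join("")); // "a Я_"
console.log(nonSpaces.join("")); // "a1Я_"
console.log(nonWordly.join("")); // " Я"
console.log([...text].every(isDigitXorNonDigit)); // true
```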
At the beginning of the chapter we saw how to make a digits-only phone number from a string like "+7(903)-123-45-67": find all digits and join them.
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
An alternative, shorter way is to find non-digits \D and remove them from the string:
let str = "+7(903)-123-45-67";
alert( str.replace(/\D/g, "") ); // 79031234567
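One practical difference between the two approaches shows up when the string contains no digits at all: match with the g flag returns null, so calling join on the result would throw, while replace simply returns an empty string. The sketch below wraps both approaches in helpers; the function names are hypothetical and exist only for this example.

```javascript
// Hypothetical helper: collect digits with \d, guarding against a null result.
function digitsViaMatch(str) {
  const found = str.match(/\d/g); // null when there are no digits
  return found ? found.join("") : "";
}

// Hypothetical helper: remove non-digits with \D; no null check is needed.
function digitsViaReplace(str) {
  return str.replace(/\D/g, "");
}

console.log(digitsViaMatch("+7(903)-123-45-67"));   // 79031234567
console.log(digitsViaReplace("+7(903)-123-45-67")); // 79031234567
console.log(digitsViaMatch("n/a") === "");          // true
console.log(digitsViaReplace("n/a") === "");        // true
```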
A dot is "any character"
A dot . is a special character class that matches "any character except a newline".
For instance:
alert( "Z".match(/./) ); // Z
Or in the middle of a regexp:
let regexp = /CS.4/;
alert( "CSS4".match(regexp) ); // CSS4
alert( "CS-4".match(regexp) ); // CS-4
alert( "CS 4".match(regexp) ); // CS 4 (space is also a character)
Please note that a dot means "any character", but not the "absence of a character". There must be a character to match it:
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
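The cases above can be checked in one pass. This is only a sketch: the list of candidate strings is made up, and it adds two cases not shown earlier (two characters in place of the dot, and a newline in place of the dot).

```javascript
const regexp = /CS.4/;

// Candidate strings chosen for illustration.
const candidates = ["CSS4", "CS-4", "CS 4", "CS4", "CS--4", "CS\n4"];

// The dot stands for exactly one character, and by default not for a newline.
const results = candidates.map((s) => regexp.test(s));

candidates.forEach((s, i) => {
  // JSON.stringify makes the newline visible as \n in the output.
  console.log(JSON.stringify(s), results[i]);
});
// "CSS4" true, "CS-4" true, "CS 4" true,
// "CS4" false, "CS--4" false, "CS\n4" false
```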
Dot as literally any character with "s" flag
By default, a dot doesn't match the newline character \n.
For instance, the regexp A.B matches A, and then B with any character between them, except a newline \n:
alert( "A\nB".match(/A.B/) ); // null (no match)
There are many situations when we'd like a dot to mean literally "any character", newline included.
That's what flag s does. If a regexp has it, then a dot . matches literally any character:
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
The s flag is not supported in IE.
Luckily, there's an alternative that works everywhere. We can use a regexp like [\s\S] to match "any character" (this pattern will be covered in the article Sets and ranges [...]).
alert( "A\nB".match(/A[\s\S]B/) ); // A\nB (match!)
The pattern [\s\S] literally says: "a space character OR not a space character". In other words, "anything". We could use another pair of complementary classes, such as [\d\D]; it doesn't matter which. Or even [^], which means "match any character except nothing".
We can also use this trick if we want both kinds of "dots" in the same pattern: the actual dot . behaving the regular way ("not including a newline"), and a way to match "any character" with [\s\S] or the like.
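The alternatives can be compared directly. The following sketch uses made-up strings and variable names; it assumes an engine that supports the s flag (so not IE, as noted above).

```javascript
const text = "A\nB";

const plainDot = /A.B/.test(text);        // false: the dot skips newlines
const dotAll = /A.B/s.test(text);         // true: the s flag lifts that limit
const spaceOrNot = /A[\s\S]B/.test(text); // true: works without the s flag
const emptyNegation = /A[^]B/.test(text); // true: "any character except nothing"

// Both kinds of "dot" in one pattern: the regular dot must not be a newline,
// while [\s\S] may be anything, including a newline.
const mixed = "ab\ncd".match(/a.[\s\S]c/);
const onlyDots = "ab\ncd".match(/a..c/);

console.log(plainDot, dotAll, spaceOrNot, emptyNegation); // false true true true
console.log(JSON.stringify(mixed[0])); // "ab\nc"
console.log(onlyDots);                 // null
```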
Usually we pay little attention to spaces. For us strings 1-5 and 1 - 5 are nearly identical.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a hyphen:
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
Let's fix it by adding spaces to the regexp \d - \d:
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
// or we can use \s class:
alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works
A space is a character, equal in importance to any other character.
We can't add or remove spaces from a regular expression and expect it to work the same.
In other words, in a regular expression all characters matter, spaces too.
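The sketch below shows that the dependency goes both ways: a pattern without spaces rejects a spaced string, and a pattern with spaces rejects a compact one. As one possible workaround (an assumption of this example, not the only option), the input can be stripped of space characters with \s before matching.

```javascript
const compact = /\d-\d/;  // no spaces allowed around the hyphen
const spaced = /\d - \d/; // exactly one space required on each side

console.log(compact.test("1-5"));   // true
console.log(compact.test("1 - 5")); // false
console.log(spaced.test("1 - 5"));  // true
console.log(spaced.test("1-5"));    // false

// Workaround for this sketch: remove all space characters first.
const normalized = "1 - 5".replace(/\s/g, "");
console.log(normalized);               // 1-5
console.log(compact.test(normalized)); // true
```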
Summary
The following character classes exist:
\d - digits.
\D - non-digits.
\s - space symbols, tabs, newlines.
\S - all but \s.
\w - Latin letters, digits, underscore '_'.
\W - all but \w.
. - any character if the regexp has the 's' flag, otherwise any except a newline \n.
...But that's not all!
Unicode, the encoding used by JavaScript for strings, provides many properties for characters, such as which language a letter belongs to (if it's a letter), whether it's a punctuation sign, etc.
We can search by these properties as well. That requires flag u, covered in the next article.