About The Author

Faraz is a professional frontend developer, consultant, and writer who is passionate about moving the web forward and promoting patterns and ideas that will …

If you have ever done any sort of sophisticated text processing and manipulation in JavaScript, you’ll appreciate the new features introduced in ES2018. In this article, we take a good look at how the ninth edition of the standard improves the text processing capability of JavaScript.

There’s a good reason the majority of programming languages support regular expressions: they are extremely powerful tools for manipulating text. Text processing tasks that require dozens of lines of code can often be accomplished with a single line of regular expression code. While the built-in functions in most languages are usually sufficient to perform search and replace operations on strings, more complex operations — such as validating text inputs — often require the use of regular expressions.

Regular expressions have been part of the JavaScript language since the third edition of the ECMAScript standard, which was introduced in 1999. ECMAScript 2018 (or ES2018 for short) is the ninth edition of the standard and further improves the text processing capability of JavaScript by introducing four new features:

  • Lookbehind assertions
  • Named capture groups
  • s (dotAll) flag
  • Unicode property escapes

These new features are explained in detail in the subsections that follow.

Lookbehind Assertions

The ability to match a sequence of characters based on what follows or precedes it enables you to discard potentially undesired matches. This is especially important when you need to process a large string and the chance of undesired matches is high. Fortunately, most regular expression flavors provide the lookbehind and lookahead assertions for this purpose.

Prior to ES2018, only lookahead assertions were available in JavaScript. A lookahead allows you to assert that a pattern is immediately followed by another pattern.

There are two versions of lookahead assertions: positive and negative. The syntax for a positive lookahead is (?=...). For example, the regex /Item(?= 10)/ matches Item only when it is followed, with an intervening space, by number 10:

const re = /Item(?= 10)/;

console.log(re.exec('Item'));
// → null

console.log(re.exec('Item5'));
// → null

console.log(re.exec('Item 5'));
// → null

console.log(re.exec('Item 10'));
// → ["Item", index: 0, input: "Item 10", groups: undefined]

This code uses the exec() method to search for a match in a string. If a match is found, exec() returns an array whose first element is the matched string. The index property of the array holds the index of the matched string, and the input property holds the entire string that the search was performed on. Finally, if named capture groups are used in the regular expression, they are placed on the groups property. In this case, groups has a value of undefined because there is no named capture group.

The construct for a negative lookahead is (?!...). A negative lookahead asserts that a pattern is not followed by a specific pattern. For example, the pattern /Red(?!head)/ matches Red only if it is not followed by head:

const re = /Red(?!head)/;

console.log(re.exec('Redhead'));
// → null

console.log(re.exec('Redberry'));
// → ["Red", index: 0, input: "Redberry", groups: undefined]

console.log(re.exec('Redjay'));
// → ["Red", index: 0, input: "Redjay", groups: undefined]

console.log(re.exec('Red'));
// → ["Red", index: 0, input: "Red", groups: undefined]

ES2018 complements lookahead assertions by bringing lookbehind assertions to JavaScript. Denoted by (?<=...), a lookbehind assertion allows you to match a pattern only if it is preceded by another pattern.

Let’s suppose you need to retrieve the price of a product in euro without capturing the euro symbol. With a lookbehind, this task becomes a lot simpler:

const re = /(?<=€)\d+(\.\d*)?/;

console.log(re.exec('199'));
// → null

console.log(re.exec('$199'));
// → null

console.log(re.exec('€199'));
// → ["199", undefined, index: 1, input: "€199", groups: undefined]
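
The same lookbehind can be combined with the global flag to collect every euro amount in a longer string. The following is a minimal sketch; the sample text and variable names are made up for illustration:

```javascript
// Hypothetical sample text; the pattern is the lookbehind from the example above
// with the g flag added so that match() returns every occurrence.
const text = 'Keyboard €49.99, mouse $25, monitor €199';
const euroPrices = text.match(/(?<=€)\d+(\.\d*)?/g);

console.log(euroPrices);
// → ["49.99", "199"]

// The matches are strings, so convert them if you need numbers
console.log(euroPrices.map(Number));
// → [49.99, 199]
```

Because the euro symbol is only asserted and never consumed, the results need no extra trimming before conversion.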

Note: Lookahead and lookbehind assertions are often referred to as “lookarounds”.

The negative version of lookbehind is denoted by (?<!...) and enables you to match a pattern that is not preceded by the pattern specified within the lookbehind. For example, the regular expression /(?<!\d{3}) meters/ matches the word “meters” if three digits do not come before it:

const re = /(?<!\d{3}) meters/;

console.log(re.exec('10 meters'));
// → [" meters", index: 2, input: "10 meters", groups: undefined]

console.log(re.exec('100 meters'));    
// → null

As with lookaheads, you can use several lookbehinds (negative or positive) in succession to create a more complex pattern. Here’s an example:

const re = /(?<=\d{2})(?<!35) meters/;

console.log(re.exec('35 meters'));
// → null

console.log(re.exec('meters'));
// → null

console.log(re.exec('4 meters'));
// → null

console.log(re.exec('14 meters'));
// → [" meters", index: 2, input: "14 meters", groups: undefined]

This regex matches a string containing meters only if it is immediately preceded by any two digits other than 35. The positive lookbehind ensures that the pattern is preceded by two digits, and then the negative lookbehind ensures that the digits are not 35.
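
Lookbehinds and lookaheads can also be mixed in one pattern. As a rough sketch (the helper name is hypothetical, and the input is assumed to be a string of digits only), the following inserts thousands separators by matching empty positions rather than characters:

```javascript
// Hypothetical helper: inserts a comma at every position that has a digit
// before it (lookbehind) and a multiple of three digits after it up to the
// end of the string (lookahead). Assumes the input contains digits only.
function addThousandsSeparators(digits) {
  return digits.replace(/(?<=\d)(?=(\d{3})+$)/g, ',');
}

console.log(addThousandsSeparators('1234567'));    // → 1,234,567
console.log(addThousandsSeparators('123456'));     // → 123,456
console.log(addThousandsSeparators('123'));        // → 123
```

Without the lookbehind, a six-digit input such as 123456 would get a comma at the very start of the string, because that position is also followed by a multiple of three digits.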

Named Capture Groups

You can group a part of a regular expression by encapsulating the characters in parentheses. This allows you to restrict alternation to a part of the pattern or apply a quantifier on the whole group. Furthermore, you can extract the value matched by the parentheses for further processing.

The following code gives an example of how to find a file name with .jpg extension in a string and then extract the file name:

const re = /(\w+)\.jpg/;
const str = 'File name: cat.jpg';
const match = re.exec(str);
const fileName = match[1];

// The second element in the resulting array holds the portion of the string that parentheses matched
console.log(match);
// → ["cat.jpg", "cat", index: 11, input: "File name: cat.jpg", groups: undefined]

console.log(fileName);
// → cat

In more complex patterns, referencing a group using a number just makes the already cryptic regular expression syntax more confusing. For example, suppose you want to match a date. Since the position of day and month is swapped in some regions, it’s not clear which group refers to the month and which group refers to the day:

const re = /(\d{4})-(\d{2})-(\d{2})/;
const match = re.exec('2020-03-04');

console.log(match[0]);    // → 2020-03-04
console.log(match[1]);    // → 2020
console.log(match[2]);    // → 03
console.log(match[3]);    // → 04

ES2018’s solution to this problem is named capture groups, which use a more expressive syntax in the form of (?<name>...):

const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = re.exec('2020-03-04');

console.log(match.groups);          // → {year: "2020", month: "03", day: "04"}
console.log(match.groups.year);     // → 2020
console.log(match.groups.month);    // → 03
console.log(match.groups.day);      // → 04

Because the resulting object may contain a property with the same name as a named group, all named groups are defined under a separate object called groups.

A similar construct exists in many new and traditional programming languages. Python, for example, uses the (?P<name>...) syntax for named groups. Not surprisingly, Perl supports named groups with syntax identical to JavaScript's (JavaScript modeled its regular expression syntax on Perl's). Java also uses the same syntax as Perl.

In addition to being able to access a named group through the groups object, you can access a group using a numbered reference — similar to a regular capture group:

const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = re.exec('2020-03-04');

console.log(match[0]);    // → 2020-03-04
console.log(match[1]);    // → 2020
console.log(match[2]);    // → 03
console.log(match[3]);    // → 04

The new syntax also works well with destructuring assignment:

const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const [match, year, month, day] = re.exec('2020-03-04');

console.log(match);    // → 2020-03-04
console.log(year);     // → 2020
console.log(month);    // → 03
console.log(day);      // → 04
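
Array destructuring relies on the position of each group. To pick values by name instead, you can destructure the groups object. The sketch below wraps this in a hypothetical parseIsoDate() helper that also handles the case where exec() returns null:

```javascript
// Hypothetical helper: returns the date parts as numbers, or null if the
// text contains no date in YYYY-MM-DD form.
function parseIsoDate(text) {
  const match = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.exec(text);

  // exec() returns null when nothing matches, so guard before destructuring
  if (match === null) {
    return null;
  }

  // Object destructuring selects groups by name, regardless of their order
  const { year, month, day } = match.groups;
  return { year: Number(year), month: Number(month), day: Number(day) };
}

console.log(parseIsoDate('2020-03-04'));      // → {year: 2020, month: 3, day: 4}
console.log(parseIsoDate('no date here'));    // → null
```

With this approach, reordering the groups in the pattern does not affect the code that reads them.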

The groups property is always present on the result array, even if the regular expression contains no named group (in that case its value is undefined):

const re = /\d+/;
const match = re.exec('123');

console.log('groups' in match);    // → true

If an optional named group does not participate in the match, the groups object will still have a property for that named group but the property will have a value of undefined:

const re = /\d+(?<ordinal>st|nd|rd|th)?/;

let match = re.exec('2nd');

console.log('ordinal' in match.groups);    // → true
console.log(match.groups.ordinal);         // → nd

match = re.exec('2');

console.log('ordinal' in match.groups);    // → true
console.log(match.groups.ordinal);         // → undefined

You can refer to a regular capture group later in the pattern with a backreference in the form of \1. For example, the following code uses a capture group that matches two letters in a row, then recalls it later in the pattern:

console.log(/(\w\w)\1/.test('abab'));    // → true

// if the last two letters are not the same
// as the first two, the match will fail
console.log(/(\w\w)\1/.test('abcd'));    // → false

To recall a named capture group later in the pattern, you can use the \k<name> syntax. Here is an example:

const re = /\b(?<dup>\w+)\s+\k<dup>\b/;

const match = re.exec("I'm not lazy, I'm on on energy saving mode");

console.log(match.index);    // → 18
console.log(match[0]);       // → on on

This regular expression finds consecutive duplicate words in a sentence. If you prefer, you can also recall a named capture group using a numbered backreference:

const re = /\b(?<dup>\w+)\s+\1\b/;

const match = re.exec("I'm not lazy, I'm on on energy saving mode");

console.log(match.index);    // → 18
console.log(match[0]);       // → on on

It’s also possible to use a numbered backreference and a named backreference at the same time:

const re = /(?<digit>\d):\1:\k<digit>/;

const match = re.exec('5:5:5');

console.log(match[0]);    // → 5:5:5
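
A common practical use of named backreferences is matching text wrapped in quotes, where the closing quote must be the same character as the opening one. The following is a sketch under that assumption; the group names and the sample sentence are chosen for illustration:

```javascript
// The quote group captures either quote character, and \k<quote> requires
// the same character to close the string.
const quoted = /(?<quote>['"])(?<text>.*?)\k<quote>/;

const match = quoted.exec('He said "it\'s fine" and left');

console.log(match.groups.quote);    // → "
console.log(match.groups.text);     // → it's fine

// A string that opens with one quote type and closes with the other does not match
console.log(quoted.test('"mismatched\''));    // → false
```

The apostrophe inside the quoted text does not end the match, because the backreference only accepts the double quote that opened it.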

Similar to numbered capture groups, named capture groups can be inserted into the replacement value of the replace() method. To do that, you will need to use the $<name> construct. For example:

const str = 'War & Peace';

console.log(str.replace(/(War) & (Peace)/, '$2 & $1'));    
// → Peace & War

console.log(str.replace(/(?<War>War) & (?<Peace>Peace)/, '$<Peace> & $<War>'));    
// → Peace & War
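
The $<name> construct is particularly readable when reordering parts of a match. As a small sketch (the sample text is made up), the following converts every date from year-month-day order to day/month/year:

```javascript
// The g flag makes replace() rewrite every date, not just the first one
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
const text = 'Released 2020-03-04, patched 2020-11-30';

const reordered = text.replace(isoDate, '$<day>/$<month>/$<year>');

console.log(reordered);
// → Released 04/03/2020, patched 30/11/2020
```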

If you want to use a function to perform the replacement, you can reference the named groups the same way you would reference numbered groups. The value of the first capture group will be available as the second argument to the function, and the value of the second capture group will be available as the third argument:

const str = 'War & Peace';

const result = str.replace(/(?<War>War) & (?<Peace>Peace)/, function(match, group1, group2, offset, string) {
    return group2 + ' & ' + group1;
});

console.log(result);    // → Peace & War
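
When the pattern contains named groups, the replacement function also receives the groups object as its final argument, after the offset and the input string. The sketch below reads it from the end of the argument list so that it keeps working if the number of capture groups changes:

```javascript
const str = 'War & Peace';

const result = str.replace(/(?<War>War) & (?<Peace>Peace)/, function (...args) {
  // With named groups in the pattern, the last argument is the groups object
  const groups = args[args.length - 1];
  return groups.Peace + ' & ' + groups.War;
});

console.log(result);    // → Peace & War
```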

s (dotAll) Flag

By default, the dot (.) metacharacter in a regex pattern matches any character with the exception of line break characters, including line feed (\n) and carriage return (\r):

console.log(/./.test('\n'));    // → false
console.log(/./.test('\r'));    // → false

Despite this shortcoming, JavaScript developers could still match all characters by using two opposite shorthand character classes like [\w\W], which instructs the regex engine to match a character that’s a word character (\w) or a non-word character (\W):

console.log(/[\w\W]/.test('\n'));    // → true
console.log(/[\w\W]/.test('\r'));    // → true

ES2018 aims to fix this problem by introducing the s (dotAll) flag. When this flag is set, it changes the behavior of the dot (.) metacharacter to match line break characters as well:

console.log(/./s.test('\n'));    // → true
console.log(/./s.test('\r'));    // → true

The s flag can be used on a per-regex basis and thus does not break existing patterns that rely on the old behavior of the dot metacharacter. Besides JavaScript, the s flag is available in a number of other languages such as Perl and PHP.
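
The flag matters most when the text you want to capture spans several lines. Here is a minimal sketch with a made-up input string that extracts the body of a comment, and also checks the dotAll property that reflects whether the flag is set:

```javascript
// Hypothetical input: a comment whose body contains a line break
const source = 'before <!-- first line\nsecond line --> after';

const withoutFlag = /<!--(.*?)-->/;
const withFlag = /<!--(.*?)-->/s;

// Without the s flag, the dot stops at the line break and the match fails
console.log(withoutFlag.test(source));    // → false

// With the s flag, the capture group spans both lines
console.log(withFlag.exec(source)[1] === ' first line\nsecond line ');    // → true

// The dotAll property tells you whether a regex was created with the s flag
console.log(withoutFlag.dotAll);    // → false
console.log(withFlag.dotAll);       // → true
```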

Unicode Property Escapes

Among the new features introduced in ES2015 was Unicode awareness. However, shorthand character classes were still unable to match Unicode characters, even if the u flag was set.

Consider the following example:

const str = '𝟠';

console.log(/\d/.test(str));     // → false
console.log(/\d/u.test(str));    // → false

𝟠 is considered a digit, but \d can only match ASCII [0-9], so the test() method returns false. Because changing the behavior of shorthand character classes would break existing regular expression patterns, it was decided to introduce a new type of escape sequence.

In ES2018, Unicode property escapes, denoted by \p{...}, are available in regular expressions when the u flag is set. Now to match any Unicode number, you can simply use \p{Number}, as shown below:

const str = '𝟠';
console.log(/\p{Number}/u.test(str));     // → true

And to match any Unicode alphabetic character, you can use \p{Alphabetic}:

const str = '漢';

console.log(/\p{Alphabetic}/u.test(str));     // → true

// the \w shorthand cannot match 漢
console.log(/\w/u.test(str));    // → false

\P{...} is the negated version of \p{...} and matches any character that \p{...} does not:

console.log(/\P{Number}/u.test('𝟠'));    // → false
console.log(/\P{Number}/u.test('漢'));    // → true

console.log(/\P{Alphabetic}/u.test('𝟠'));    // → true
console.log(/\P{Alphabetic}/u.test('漢'));    // → false
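
Property escapes are also handy for splitting mixed-language text into words, something \w cannot do. The following sketch uses a made-up sample string, the Letter general category, and the Script property to pick out one writing system:

```javascript
// Hypothetical sample text mixing Cyrillic, Han, and Latin words
const text = 'Привет мир 漢字 hello';

// \p{Letter} matches a letter from any script
const words = text.match(/\p{Letter}+/gu);
console.log(words);
// → ["Привет", "мир", "漢字", "hello"]

// \p{Script=...} restricts the match to a single script
const cyrillic = text.match(/\p{Script=Cyrillic}+/gu);
console.log(cyrillic);
// → ["Привет", "мир"]

// For comparison, \w only finds the ASCII word
console.log(text.match(/\w+/g));
// → ["hello"]
```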

A full list of supported properties is available in the current specification proposal.

Note that using an unsupported property causes a SyntaxError:

console.log(/\p{undefined}/u.test('漢'));    // → SyntaxError

Compatibility Table

Desktop Browsers

Feature                    Chrome   Firefox   Safari   Edge
Lookbehind Assertions      62       X         X        X
Named Capture Groups       64       X         11.1     X
s (dotAll) Flag            62       X         11.1     X
Unicode Property Escapes   64       X         11.1     X

Mobile Browsers

Feature                    Chrome for Android   Firefox for Android   iOS Safari   Edge Mobile   Samsung Internet   Android Webview
Lookbehind Assertions      62                   X                     X            X             8.2                62
Named Capture Groups       64                   X                     11.3         X             X                  64
s (dotAll) Flag            62                   X                     11.3         X             8.2                62
Unicode Property Escapes   64                   X                     11.3         X             X                  64

Node.js

  • 8.3.0 (requires --harmony runtime flag)
  • 8.10.0 (support for s (dotAll) flag and lookbehind assertions)
  • 10.0.0 (full support)
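
Since support varies between engines, it can be useful to detect these features at runtime before relying on them. One possible approach, sketched below with a hypothetical helper name, is to build each pattern with the RegExp constructor and catch the SyntaxError that an engine throws for syntax or flags it does not understand:

```javascript
// Hypothetical helper: returns true if the engine can compile the pattern.
// Using the RegExp constructor (rather than a regex literal) keeps an
// unsupported pattern from breaking the parsing of the whole script.
function supportsRegexSyntax(source, flags = '') {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    return false;
  }
}

const support = {
  lookbehind: supportsRegexSyntax('(?<=a)b'),
  namedGroups: supportsRegexSyntax('(?<name>a)'),
  dotAll: supportsRegexSyntax('.', 's'),
  unicodePropertyEscapes: supportsRegexSyntax('\\p{Number}', 'u'),
};

// The values depend on the engine running the code
console.log(support);
```

Any pattern that depends on a missing feature would then also need to be created with the RegExp constructor, behind the corresponding check.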

Wrapping Up

ES2018 continues the work of previous editions of ECMAScript by making regular expressions more useful. New features include lookbehind assertions, named capture groups, the s (dotAll) flag, and Unicode property escapes. Lookbehind assertions allow you to match a pattern only if it is preceded by another pattern. Named capture groups use a more expressive syntax compared to regular capture groups. The s (dotAll) flag changes the behavior of the dot (.) metacharacter to match line break characters. Finally, Unicode property escapes provide a new type of escape sequence in regular expressions.

When building complicated patterns, it’s often helpful to use a regular-expressions tester. A good tester provides an interface to test a regular expression against a string and displays every step taken by the engine, which can be especially useful when trying to understand patterns written by others. It can also detect syntax errors that may occur within your regex pattern. Regex101 and RegexBuddy are two popular regex testers worth checking out.

Do you have some other tools to recommend? Share them in the comments!

Smashing Editorial(dm, il)


