Friday, February 26, 2016

RegExp lookbehind assertions

Introduced with the third edition of the ECMA-262 specification, regular expressions have been part of Javascript since 1999. In functionality and expressiveness, JavaScript’s implementation of regular expressions roughly mirrors that of other programming languages.

One feature in JavaScript’s RegExp that is often overlooked, but can be quite useful at times, is lookahead assertions. For example, to match a sequence of digits that is followed by a percent sign, we can use /\d+(?=%)/. The percent sign itself is not part of the match result. The negation thereof, /\d+(?!%)/, would match a sequence of digits not followed by a percent sign:
/\d+(?=%)/.exec("100% of US presidents have been male")  // ["100"]
/\d+(?!%)/.exec("that’s all 44 of them")                 // ["44"]

The opposite of lookahead, lookbehind assertions, have been missing in JavaScript, but are available in other regular expression implementations, such as that of the .NET framework. Instead of reading ahead, the regular expression engine reads backwards for the match inside the assertion. A sequence of digits following a dollar sign can be matched by /(?<=\$)\d+/, where the dollar sign would not be part of the match result. The negation thereof, /(?<!\$)\d+/, matches a sequence of digits following anything but a dollar sign.
/(?<=\$)\d+/.exec("Benjamin Franklin is on the $100 bill")  // ["100"]
/(?<!\$)\d+/.exec("it’s is worth about €90")                // ["90"]

Generally, there are two ways to implement lookbehind assertions. Perl, for example, requires lookbehind patterns to have a fixed length. That means that quantifiers such as * or + are not allowed. This way, the regular expression engine can step back by that fixed length, and match the lookbehind the exact same way as it would match a lookahead, from the stepped back position.

The regular expression engine in the .NET framework takes a different approach. Instead of needing to know how many characters the lookbehind pattern will match, it simply matches the lookbehind pattern backwards, while reading characters against the normal read direction. This means that the lookbehind pattern can take advantage of the full regular expression syntax and match patterns of arbitrary length.

Clearly, the second option is more powerful than the first. That is why the V8 team, and the TC39 champions for this feature, have agreed that JavaScript should adopt the more expressive version, even though its implementation is slightly more complex.

Because lookbehind assertions match backwards, there are some subtle behaviors that would otherwise be considered surprising. For example, a capturing group with a quantifier captures the last match. Usually, that is the right-most match. But inside a lookbehind assertion, we match from right to left, therefore the left-most match is captured:
/h(?=(\w)+)/.exec("hodor") // ["h", "r"]
/(?<=(\w)+)r/.exec("hodor") // ["r", "h"]

A capturing group can be referenced via back reference after it has been captured. Usually, the back reference has to be to the right of the capture group. Otherwise, it would match the empty string, as nothing has been captured yet. However, inside a lookbehind assertion, the match direction is reversed:
/(?<=(o)d\1)r/.exec("hodor")  // null
/(?<=\1d(o))r/.exec("hodor")  // ["r", "o"]

Lookbehind assertions are currently in a very early stage in the TC39 specification process. However, because they are such an obvious extension to the RegExp syntax, we decided to prioritize their implementation. You can already experiment with lookbehind assertions by running V8 version 4.9 or later with --harmony, or by enabling experimental JavaScript features (use about:flags) in Chrome from version 49 onwards.

Yang Guo, Regular Expression Engineer

Thursday, February 4, 2016

V8 Extras

V8 implements a large subset of the JavaScript language’s built-in objects and functions in JavaScript itself. For example, you can see our promises implementation is written in JavaScript. Such built-ins are called self-hosted. These implementations are included in our startup snapshot so that new contexts can be quickly created without needing to setup and initialize the self-hosted built-ins at runtime.

Embedders of V8, such as Chromium, sometimes desire to write APIs in JavaScript too. This works especially well for platform features that are self-contained, like streams, or for features that are part of a “layered platform” of higher-level capabilities built on top of pre-existing lower-level ones. Although it’s always possible to run extra code at startup time to bootstrap embedder APIs (as is done in Node.js, for example), ideally embedders should be able to get the same speed benefits for their self-hosted APIs that V8 does.

V8 extras are a new feature of V8, as of our 4.8 release, designed with the goal of allowing embedders to write high-performance, self-hosted APIs via a simple interface. Extras are embedder-provided JavaScript files which are compiled directly into the V8 snapshot. They also have access to a few helper utilities that make it easier to write secure APIs in JavaScript.

An Example

A V8 extra file is simply a JavaScript file with a certain structure:

(function(global, binding, v8) {
  'use strict';
  const Object = global.Object;
  const x = v8.createPrivateSymbol('x');
  const y = v8.createPrivateSymbol('y');
 
  class Vec2 {
    constructor(theX, theY) {
      this[x] = theX;
      this[y] = theY;
    }

    norm() {
      return binding.computeNorm(this[x], this[y]);
    }
  }

  Object.defineProperty(global, 'Vec2', {
    value: Vec2,
    enumerable: false,
    configurable: true,
    writable: true
  });

  binding.Vec2 = Vec2;
});


There are a few things to notice here:
  • The global object is not present in the scope chain, so any access to it (such as that for Object) has to be done explicitly through the provided global argument.
  • The binding object is a place to store values for or retrieve values from the embedder. A C++ API v8::Context::GetExtrasBindingObject() provides access to the binding object from the embedder’s side. In our toy example, we let the embedder perform norm computation; in a real example you might delegate to the embedder for something trickier like URL resolution. We also add the Vec2 constructor to the binding object, so that embedder code can create Vec2 instances without going through the potentially-mutable global object.
  • The v8 object provides a small number of APIs to allow you to write secure code. Here we create private symbols to store our internal state in a way that cannot be manipulated from the outside. (Private symbols are a V8-internal concept and do not make sense in standard JavaScript code.) V8’s built-ins often use “%-function calls” for these sort of things, but V8 extras cannot use %-functions since they are an internal implementation detail of V8 and not suitable for embedders to depend on.
You might be curious about where these objects come from. All three of them are initialized in V8’s bootstrapper, which installs some basic properties but mostly leaves the initialization to V8’s self-hosted JavaScript. For example, almost every .js file in V8 installs something on global; see e.g. promise.js or uri.js. And we install APIs onto the v8 object in a number of places. (The binding object is empty until manipulated by an extra or embedder, so the only relevant code in V8 itself is when the bootstrapper creates it.)

Finally, to tell V8 that we’ll be compiling in an extra, we add a line to our project’s gypfile:

'v8_extra_library_files': ['./Vec2.js']

(You can see a real-world example of this in V8’s gypfile.)

V8 Extras in Practice

V8 extras provide a new and lightweight way for embedders to implement features. JavaScript code can more easily manipulate JavaScript built-ins like arrays, maps, or promises; it can call other JavaScript functions without ceremony; and it can deal with exceptions in an idiomatic way. Unlike C++ implementations, features implemented in JavaScript via V8 extras can benefit from inlining, and calling them does not incur any boundary-crossing costs. These benefits are especially pronounced when compared to a traditional bindings system like Chromium’s Web IDL bindings.

V8 extras were introduced and refined over the last year, and Chromium is currently using them to implement streams. Chromium is also considering V8 extras for implementing scroll customization and efficient geometry APIs.

V8 extras are still a work in progress, and the interface has some rough edges and disadvantages we hope to address over time. The primary area with room for improvement is the debugging story: errors are not easy to track down, and runtime debugging is most often done with print statements. In the future, we hope to integrate V8 extras into Chromium’s developer tools and tracing framework, both for Chromium itself and for any embedders that speak the same protocol.

Another cause for caution when using V8 extras is the extra developer effort required to write secure and robust code. V8 extras code operates directly on the snapshot, just like the code for V8’s self-hosted built-ins. It accesses the same objects as userland JavaScript, with no binding layer or separate context to prevent such access. For example, something as seemingly-simple as global.Object.prototype.hasOwnProperty.call(obj, 5) has six potential ways in which it could fail due to user code modifying the built-ins (count them!). Embedders like Chromium need to be robust against any user code, no matter its behavior, and so in such environments more care is necessary when writing extras than when writing traditional C++-implemented features.

If you’d like to learn more about V8 extras, check out our design document which goes into much more detail. We look forward to improving V8 extras, and adding more features that allow developers and embedders to write expressive, high-performance additions to the V8 runtime.

Posted by Domenic Denicola, Streams Sorcerer