Xterm.js: Use a proper parser for link detection

Created on 3 Mar 2017  路  21Comments  路  Source: xtermjs/xterm.js

There will always be issues with URL matching using regex as some cases simply can't be caught with regex. An example where regex will fail is including brackets in URLs but only when the brackets are opened within the url:

  • Include brackets: http://<domain>.com/foo(bar)
  • Don't include wrapping brackets: (http://<domain>.com/foobar)
  • Include commas: http://<domain>.com/foo,bar
  • Don't include trailing commas: http://<domain>.com/foo, other text
  • Detect spaces (paths are ambiguous?) c:\Users\X\My Documents
arelinks help wanted typenhancement

Most helpful comment

Current state

The way links work right now is that after the viewport has stopped scrolling for 200ms, the links are computed for the current viewport. Embedders can provide a validation callback which allows the embedder to validate the link sometime after the links are computed, note that links are only shown once they are validated. The end result is pretty nice, we can present underlines for all links even when you need a modifier to execute the link, it does however have some fundamental problems:

  • Even though we're just processing when the viewport stops, that's still a lot of work that would go unused.
  • Since the entire viewport is processed, it's impractical/slow to create links for many words, for example files without extensions (https://github.com/microsoft/vscode/issues/22772).
  • Doing this much validation on any text that looks like kind of like a file is way too expensive on some environments, as any sequence of characters could be a file. This is particular bad on remote file systems (https://github.com/microsoft/vscode/issues/79336).
  • The current system is complex; I think there are still race conditions in the linkifier and links disappear if you hide and show the terminal (https://github.com/microsoft/vscode/issues/36072, closed but it's still an issue).
  • It's all regex-based and difficult to maintain and extend (at least for myself).
class Terminal {
    registerLinkMatcher(regex: RegExp, handler: (event: MouseEvent, uri: string) => void, options?: ILinkMatcherOptions): number;
    deregisterLinkMatcher(matcherId: number): void;
}
interface ILinkMatcherOptions {
    matchIndex?: number;
    validationCallback?: (uri: string, callback: (isValid: boolean) => void) => void;
    tooltipCallback?: (event: MouseEvent, uri: string, location: IViewportRange) => boolean | void;
    leaveCallback?: () => void;
    priority?: number;
    willLinkActivate?: (event: MouseEvent, uri: string) => boolean;
}

Proposal

The long standing hope was to move to a "parser-based" link system (https://github.com/xtermjs/xterm.js/issues/583) but it was never really clear how an addon would provide a parser exactly. Here's my very VS Code API-inspired proposal:

class Terminal {
    registerLinkProvider(linkProvider: ILinkProvider): IDisposable;
}

interface ILinkProvider {
    provideLink(position: IBufferCellPosition, callback: (link: ILink | undefined) => void): void;
}

interface ILink {
    range: IBufferRange;
    showTooltip(event: MouseEvent, link: string): void;
    hideTooltip(event: MouseEvent, link: string): void;
    handle(event: MouseEvent, link: string): void;
}

interface IBufferRange {
    start: IBufferCellPosition;
    end: IBufferCellPosition;
}

interface IBufferCellPosition {
    x: number;
    y: number;
}

The basic idea is that instead of evaluating links whenever scrolling has stopped, only evaluate links for the current cursor position when a hover occurs. While file access is slow en masse, single requests just for mouse movement should be reasonable. Some other things of interest:

  • The addon can still use the regex method and translate the results an IBufferRange or use a parser to check the whole line, or just expand outwards from the cursor.
  • It uses the buffer API to access the line, that's pretty cool 馃槑.
  • The link could be cached until a scroll happens for that entire range, meaning no need to recompute just for moving the mouse over the same cells.
  • It allows embedders to use their own link detection mechanism, VS Code has a shared implementation but we've been unable to use it in the terminal (https://github.com/microsoft/vscode/issues/83191).

Going this route would seemingly fix many of the problems VS Code has with links, namely:

  • No more perf issues validating on slow file systems.
  • We can linkify paths without separators since validation will only be triggered under the cursor.
  • By expanding left then right from the cursor we can fix many of the issues with the current link detection (https://github.com/microsoft/vscode/issues/21125), including spaces in Windows paths (this is still a tricky problem but seems more achievable).

The only downside for this is a slight delay in the link appearing as the computing and validation occurs at hover time.

Open questions

  • Do we need an ILinkMatcherOptions.priority equivalent?
  • This would be an ideal time to introduce Promise into the API, working with promises is lovely but we will always have callbacks in the parser for performance reasons. We could just stick to the callback everywhere for consistency?
  • We could use markers for the range instead of numbers, that would allow the link to be cached after a scroll occurs. It probably isn't worth caching like this imo but it does bring up another interesting question in the https://github.com/xtermjs/xterm.js/issues/2480 discussion.

All 21 comments

FYI: These particular cases are handled correctly by gnome-terminal using regexes (pretty complex ones though).

See big rewrite bug, balanced parentheses bug, current implementation, unittests.

When someone tackles this issue there are some more cases on the VS Code issue tracking this https://github.com/Microsoft/vscode/issues/21125

Merged some issues into this one to support chrome-devtools scheme https://github.com/xtermjs/xterm.js/issues/599, no TLD https://github.com/xtermjs/xterm.js/issues/653 and long TLDs https://github.com/xtermjs/xterm.js/issues/601.

Merging UNC path support into this, eg. \\localhost\c$\GitDevelopment\standup\server\db.js

wowi, looking at this thread this problem is so much harder than I thought. So many different formats.

Currently relative paths without a "./" don't work e.g.

echo "README.md"

That doesn't work. But if I put a "./README.md" it works. Python sometimes outputs paths without a "./" and those errors do not link. Thankfully python adds quotes so it should be doable to add a regex for it.

here is example:

  File "somefile.py", line 1, in <module>

tsc uses this format (no initial ./):

src/test/testUtils.ts:5:37 - error TS2322: Type 'Node<K, V> | null' is not assignable to type 'Node<K, V>'.
  Type 'null' is not assignable to type 'Node<K, V>'.

5     public get root(): Node<K, V> { return this._root; }
                                      ~~~~~~~~~~~~~~~~~~

Will this be fixed anytime soon? It's very common for Windows users to have a username Firstname and Lastname. Meaning it will include spaces. Breaking Xterm.js which in term breaks the terminal that VSCode implements. A project folder in the user folder C:\Users\John Doe\Documents\Github\com.example.app would break the "clickable" path's in the terminal.

See: https://github.com/microsoft/vscode/issues/21125

No ETA but it's open to PRs. I imagine the eventually implementation will look similar to this: https://github.com/microsoft/vscode/blob/master/src/vs/editor/common/modes/linkComputer.ts

Also needs a little thinking around how this affects the existing experimental "link matcher" api.

Current state

The way links work right now is that after the viewport has stopped scrolling for 200ms, the links are computed for the current viewport. Embedders can provide a validation callback which allows the embedder to validate the link sometime after the links are computed, note that links are only shown once they are validated. The end result is pretty nice, we can present underlines for all links even when you need a modifier to execute the link, it does however have some fundamental problems:

  • Even though we're just processing when the viewport stops, that's still a lot of work that would go unused.
  • Since the entire viewport is processed, it's impractical/slow to create links for many words, for example files without extensions (https://github.com/microsoft/vscode/issues/22772).
  • Doing this much validation on any text that looks like kind of like a file is way too expensive on some environments, as any sequence of characters could be a file. This is particular bad on remote file systems (https://github.com/microsoft/vscode/issues/79336).
  • The current system is complex; I think there are still race conditions in the linkifier and links disappear if you hide and show the terminal (https://github.com/microsoft/vscode/issues/36072, closed but it's still an issue).
  • It's all regex-based and difficult to maintain and extend (at least for myself).
class Terminal {
    registerLinkMatcher(regex: RegExp, handler: (event: MouseEvent, uri: string) => void, options?: ILinkMatcherOptions): number;
    deregisterLinkMatcher(matcherId: number): void;
}
interface ILinkMatcherOptions {
    matchIndex?: number;
    validationCallback?: (uri: string, callback: (isValid: boolean) => void) => void;
    tooltipCallback?: (event: MouseEvent, uri: string, location: IViewportRange) => boolean | void;
    leaveCallback?: () => void;
    priority?: number;
    willLinkActivate?: (event: MouseEvent, uri: string) => boolean;
}

Proposal

The long standing hope was to move to a "parser-based" link system (https://github.com/xtermjs/xterm.js/issues/583) but it was never really clear how an addon would provide a parser exactly. Here's my very VS Code API-inspired proposal:

class Terminal {
    registerLinkProvider(linkProvider: ILinkProvider): IDisposable;
}

interface ILinkProvider {
    provideLink(position: IBufferCellPosition, callback: (link: ILink | undefined) => void): void;
}

interface ILink {
    range: IBufferRange;
    showTooltip(event: MouseEvent, link: string): void;
    hideTooltip(event: MouseEvent, link: string): void;
    handle(event: MouseEvent, link: string): void;
}

interface IBufferRange {
    start: IBufferCellPosition;
    end: IBufferCellPosition;
}

interface IBufferCellPosition {
    x: number;
    y: number;
}

The basic idea is that instead of evaluating links whenever scrolling has stopped, only evaluate links for the current cursor position when a hover occurs. While file access is slow en masse, single requests just for mouse movement should be reasonable. Some other things of interest:

  • The addon can still use the regex method and translate the results an IBufferRange or use a parser to check the whole line, or just expand outwards from the cursor.
  • It uses the buffer API to access the line, that's pretty cool 馃槑.
  • The link could be cached until a scroll happens for that entire range, meaning no need to recompute just for moving the mouse over the same cells.
  • It allows embedders to use their own link detection mechanism, VS Code has a shared implementation but we've been unable to use it in the terminal (https://github.com/microsoft/vscode/issues/83191).

Going this route would seemingly fix many of the problems VS Code has with links, namely:

  • No more perf issues validating on slow file systems.
  • We can linkify paths without separators since validation will only be triggered under the cursor.
  • By expanding left then right from the cursor we can fix many of the issues with the current link detection (https://github.com/microsoft/vscode/issues/21125), including spaces in Windows paths (this is still a tricky problem but seems more achievable).

The only downside for this is a slight delay in the link appearing as the computing and validation occurs at hover time.

Open questions

  • Do we need an ILinkMatcherOptions.priority equivalent?
  • This would be an ideal time to introduce Promise into the API, working with promises is lovely but we will always have callbacks in the parser for performance reasons. We could just stick to the callback everywhere for consistency?
  • We could use markers for the range instead of numbers, that would allow the link to be cached after a scroll occurs. It probably isn't worth caching like this imo but it does bring up another interesting question in the https://github.com/xtermjs/xterm.js/issues/2480 discussion.

Would this be an addon? Or part of the existing Linkifier?

@Tyriar Good idea to move the expensive link matching to the hover state, will def. remove some "lagging" from the viewport pause state. Few ideas/remarks from my side:

  • Imho its important to note that with a change like this its not possible anymore to prestyle links (underline or some special link color right from the start). Imho not a big issue, a terminal is not meant to work like a web browser in the first place.
  • priority - Not sure what you want it to help with? Nested links?
  • I love promises for their syntactic sweetness, still not sure if we should introduce them here while everything else is callback based (or sync API). Maybe stick to the least common and more performant denominator here - callbacks. Any JS programmer should be able to transform those into lovely promises. Beside the worse runtime promises also screw up stack traces, thus ppl might have reasons to avoid them.
  • Caching is indeed a problem here - w'o using markers its only possible for a "stable viewport" (any line buffer shift would invalidate the cache). Markers alone can not avoid reeval here - a cell might have changed thus a parser should not mark it as a link anymore. Markers would still be helpful in conjunction with a viewport update range event - this way a parser could limit the link eval to that range on hover and just return cache entries otherwise. Also note, that later added cell content that forms a correct link string with older content by accident should not be marked as a valid link, here the viewport statefulness makes it hard to correctly deal with it, which makes me think that we might need not only a line range update event, but a more fine-grained position-range update event.
  • I'd love to see this more as a general API approach to mark things in the viewport/buffer based on some condition (which could even be used by the search addon, partly answers @jmbockhorst question). Not sure if thats feasible, might be hard to harmonize between the different needs.

@jmbockhorst it would be a new Linkifier class (Linkifier2?) that would deprecate the existing registerLinkMatcher API and change the implementation of web-links. For VS Code we would not use web-links any more and instead leave all link detection up to the existing mechanisms.

I'd love to see this more as a general API approach to mark things in the viewport/buffer based on some condition (which could even be used by the search addon

@jerch that's basically where we're headed with decorations https://github.com/xtermjs/xterm.js/issues/1852, https://github.com/xtermjs/xterm.js/issues/1653

priority - Not sure what you want it to help with? Nested links?

Can't remember exactly why it was added, maybe to have web links take precedence over local or vice versa?

I love promises for their syntactic sweetness, still not sure if we should introduce them here while everything else is callback based (or sync API). Maybe stick to the least common and more performant denominator here - callbacks.

Yeah guess we should just commit to this, we can use Promises internally but callbacks in the API for consistency.

Imho its important to note that with a change like this its not possible anymore to prestyle links

Pre-styling isn't desirable imo as underline is a perfectly valid state.

Not sure if I'll have time to get to this in the next couple of months so this is open to PRs if someone want to have a go at implementing a proof of concept. I think we'll want to support both links types until the next major version at which point registerLinkMatcher will get removed.

IMO this would be an ideal use-case to refine our Marker API, so detected links can be tracked by a sticky range marker and can get auto-disposed (invalidated) in the following events:

  • A marked buffer range get's modified (e.g. you have a link in your prompt line and add / delete some characters)
  • A marked buffer range gets cleared out (e.g. clear command)
  • A marked buffer range falls out of the scrollback

Note that this marker should track the whole buffer, not just the viewport.

Once we have that marker that marks a buffer range (sticky), we can add another layer of abstraction that annotates that marked range with additional meaning. For example a link annotation, that references the marker and holds the parsed link.

As a last step, a renderer can iterate annotations in the viewport and draw them accordingly.

Here is a rough implementation of the linkifier that would use such an API:

// listen for buffer changes, we will get an IBufferRange in that indicates
// which part of the buffer was changed
terminal.buffer.onChange((range, buffer) => {

  // do the magic, detect links in that range
  const links = detectLinks(range, buffer);

  // create a marker for those links
  for (let link of links) {
    const linkMarker = buffer.addMarker(link.range);
    const linkAnnotation = buffer.addAnnotation({ type: 'link', data: link.url, marker: marker  });
    // dispose marker and annotation if you don't want them anymore
    // linkMarker.dispose();
    // linkAnnotation.dispose();
  }

});

I think this approach is ideal for performance. The link detection can be throttled, and only modified ranges in the buffer have to be rescanned.

It also cleanly decouples that semantic annotation of buffer ranges from their representation by the renderer.

@mofux Yeah very good idea. I want to extend this:

A marked buffer range get's modified (e.g. you have a link in your prompt line and add / delete some characters)

to all cells in active viewport* - if a marker partly covers cells of that, its content can still be modified by cursor jumps + writes, thus the marker should break. A bonus would be to re-eval those positions (with extends to left and right), as the new content might form a valid link again.

[*] Viewport is not the correct name for that - I mean those bufferlines, that are still reachable by the cursor. Guess we dont even have a name for that. Maybe "editable buffer"?

Edit: Note that a buffer.onChange will be hard to implement for cell level (both code and perf-wise) - we would have to track any cell change during chunk processing and summing them up into such an event. We kinda have that for lines already (dirtyRowService), but not on individual cell level. Maybe we can stick with line level here for the sake of performance? (Ofc this would raise the burden on any consumer of such an event).

@jerch Yeah I guess line level changes would be sufficient. We could still fire an IBufferRange for the affected lines though - so we can potentially support cell level changes more easily in the future without breaking API.

to all cells in active viewport* - if a marker partly covers cells of that, its content can still be modified by > cursor jumps + writes, thus the marker should break. A bonus would be to re-eval those positions (with > extends to left and right), as the new content might form a valid link again.

I'd think that a marker only breaks if a write happens within a line that is inside the marked range. The linkify implementation can over or underscan additional lines for link detection.

Maybe "editable buffer"?

Maybe "mutable buffer", or "non-scrollback buffer"? 馃槄

I'd think that a marker only breaks if a write happens within a line that is inside the marked range.

Yes, if onChange returns a range of line2 - line20 a marker ending on line1 should not be bothered, same for a marker covering line21+. Also this automatically turns any marker living in the scrollbuffer into stable ones (beside the disposing when finally dropping off).

Maybe "mutable buffer", or "non-scrollback buffer"?

Mutable suggests that the scrollback buffer is static/immutable which is not the case for resize. "Cursor-Addressible-Buffer" === CAB? Lol just kidding, all are kinda word monsters...

I think markers are a little heavy for this when you consider how links will be typically be used; the user types a bunch of stuff then wants to click on a link, that link will likely never be clicked again when it leaves the viewport. Caching the viewport would be good but anything outside of that I think is overkill. I would also invalidate when all lines of the link's range leave the viewport (not the CAB/NSB/MB 馃槈).

I think markers are a little heavy ...

Yeah your prolly right, markers still have the update burden for ANY buffer scroll, thus we should use them with caution. (Its a big difference for term.write if there are only 10 markers vs. 1000+.)

Regarding link detection and decoupling from render state - I think its perfectly fine to announce onHover the current mouse position, a link parser would only need to parse one wrapped line to find links (a link cannot span several real lines). This seems rather cheap to me, if a parser really wants to get more involved it could cache the findings based on line content (to make sure to correctly spot content changes). The cache here could help to debounce parsing from every single onHover report, still I'd give the cache entries rather short lifecycles as re-parsing seems so cheap. (Have not yet looked at #2530 :blush:)

Let's call this done with the link provider work which is about to ship in VS Code

Was this page helpful?
0 / 5 - 0 ratings