Vscode: Support syntax highlighting with tree-sitter

Created on 19 May 2018 · 63 Comments · Source: microsoft/vscode

Please consider supporting tree-sitter grammars in addition to TextMate grammars. TextMate grammars are incredibly difficult to author and maintain and impossible to get right. The over 500 (!) issues reported against https://github.com/Microsoft/TypeScript-TmLanguage are living proof of this.

This presentation explains the motivation and goals for tree-sitter: https://www.youtube.com/watch?v=a1rC79DHpmY

tree-sitter already ships with Atom and is also used on github.com.

feature-request languages-basic

All 63 comments

tree-sitter is cool technology, and we have our eyes on it.
I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

If you already have experience with specific tree-sitter grammars, e.g. the TypeScript grammar or the C grammar, and think they are superior to the TextMate grammars, let us know. That would be the criterion for us to invest.

This may help in the future with the whole 'embedding one language in another', which is an enfant terrible when it comes to TextMate grammars.

There's also a request in #5408 for .sublime-syntax, open since Apr 2016, which would also be a step up from .tmLanguage.

While tree-sitter has an awesome concept I can't say the idea of writing grammar in JavaScript is all that appealing.

@omniomi tree-sitter also supports writing grammars in pure JSON if that's what you prefer. The main & dramatic advantage of tree-sitter is that it's a full parsing system and not an ad-hoc, underspecified, horrifyingly complex yet extremely limited regex contraption.

Integrating tree-sitter would help solve this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461

@aeschli Atom has switched to tree sitter for C++ and is no longer fixing issues with the TextMate grammar: https://github.com/atom/language-c/issues/232#issuecomment-426018195 . Please advise on how we should proceed for improving the C++ syntax highlighting/etc. experience.

:wave: Just to reiterate - the Atom team doesn't intend to disrupt other apps like VSCode that are using modules like language-c. We will definitely continue to accept good PRs that update the text-mate grammar.

The reason that we've been closing issues like that is just to be explicit about the fact that our team won't be prioritizing work on them in the future, since Atom is moving away from text-mate grammars.

@sean-mcmanus we already have our own syntax highlighting stuff (shared with Visual Studio), but haven't been able to use it because we are waiting on an API that lets us turn off tmLanguage and provide the coloring ourselves: #585. Moving to tree-sitter is only relevant to us so long as #585 is incomplete.

Tree-sitter is extensible for other programming languages, and in particular already supports Rust and Ruby as well. Are the Visual Studio APIs ready to be extended with new language support in those ways?

I'm wondering if tree-sitter can solve this https://github.com/Microsoft/vscode/issues/51157

@bobbrow is that a final decision? Would have been nice to share the code with Atom here.

No plans for this in 2018?

It's going to be 2019!!

Yeah, Atom 1.33 ships with tree sitter and most of the C/C++ colorization bugs have been fixed with it -- the Atom/language-c team is closing the non-tree sitter bugs.

I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

@aeschli I meanwhile re-implemented my TextMate grammar with tree-sitter because the former proved unmaintainable (templated regexes up to 400 characters long, etc.). Developing the tree-sitter grammar and highlighter from scratch took three days, compared to three weeks for the TextMate grammar. The new highlighter works better and is dramatically easier to maintain. I wish I could use it in VS Code as well.

Still no plan for this? At least an exploration?

@fcurts Could I see your re-implementation? I currently maintain the TextMate C/C++ grammar. It's actually easy to maintain now that it's written in Ruby with actual variables and functions instead of Regex, but it doesn't fix the inherent limitations of the TextMate engine.

I'd love to figure out how to convert it to a tree sitter syntax, but the Atom tree sitter C++ syntax still isn't complete. To be honest, even after spending a while looking at it, I have no idea how to use it or how it works. It's missing functionality like lambda support, compile attributes, templated function calls, and macro calls. But it's also missing basic stuff like

// a comment \
        still part of the comment

I went and looked at ~the source code here, but it's pretty unintuitive 😕. It's >6000 lines long, which is double the size of the TextMate grammar.~ Sadly it's also already got an unfixable issue posted on it :/

aClass thing = aClass(); // initializers can't be highlighted the same
aClass thing = aClass{};

I still think Tree sitters are awesome, but its definitely not going to be a quick win and I could use help with the upgrade process.

I went and looked at the source code here, but it's pretty unintuitive and it's >6000 lines long. Which is double the size of the TextMate grammar.

No, the source for the Tree-sitter C++ grammar is here. It's 669 lines of JavaScript (not counting the C grammar, which it inherits from).

It's missing functionality like lambda support, compile attributes, templated function calls, and macro calls.

It does fully support lambdas and template function calls, as far as I know. You can see the test suite for these features here. I do think that compile attributes are unimplemented.

That's great to hear, and thanks for correcting me! I'll try looking into the source to get a better understanding.

Sorry about the bad info, what misled me about the lambdas is that Atom still has them incorrectly marked (although they're not visually messed up). The -> is marked as a member-access. Templated function calls are also not colored like functions (they have no color / theme-scopes). After seeing your tests though, this is definitely a usage issue and not a tree sitter grammar issue. 👍

That test suite is nice. The nested template example right after the lambda is something I've wanted to solve in TextMate for months (but can't).

FWIW: To make this more annoying for folks to choose - I have patches to add incremental parsing/lexing support to ANTLR4 (both the main Java runtime and the optimized TypeScript runtime port).

Incremental parsing is submitted to both repositories (the main antlr4 one, and the optimized TS one) at this point, incremental lexing is not but will be in the next week or two.

I already use the incremental parser/lexers in my vscode extensions.

The lexing is the same set of algorithms tree-sitter uses, the incremental parsing is actually simpler because LL is top-down.

Speed-wise, it can also relex/reparse on every keystroke with no issue.
I mention it since ANTLR has a large collection of language grammars as well.

@aeschli another benefit of tree-sitter besides syntax highlighting is that it lends itself to a much better indentation logic (https://github.com/atom/atom/pull/18321). Just take a look at this example file and compare the indentation there with what VSCode will currently produce. For example:

With tree-sitter:

foo( 2,
  {
    sd,
    sdf
  });

foo( 2, {
  sd,
  sdf
});

vscode:

foo( 2,
  {
    sd,
    sdf
  });

  foo( 2, {
    sd,
    sdf
  });

The second call to foo is at the root level, so why is it indented? The answer is quite simple: an inductive indentation approach that just considers the previous line to determine the indentation for the current line cannot handle multiple scopes opening on the same line but closing on different ones, which is what these examples show. There are ways to deal with that, but they are not flexible enough to make the indentation look the way people expect.
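The difference described above can be reproduced with a toy model. The sketch below is only an illustration under simplifying assumptions: it is neither VS Code's actual indentation logic nor the implementation in the Atom pull request, it counts raw bracket characters instead of consulting a real syntax tree (so brackets inside strings or comments would confuse it), and the function names `indentInductive` and `indentStructural` are made up for this example.

```javascript
const OPEN = "([{";
const CLOSE = ")]}";

// Net bracket balance of one line (toy: ignores strings and comments).
function netOpens(line) {
  let n = 0;
  for (const ch of line) {
    if (OPEN.includes(ch)) n++;
    else if (CLOSE.includes(ch)) n--;
  }
  return n;
}

function startsWithCloser(text) {
  return text.length > 0 && CLOSE.includes(text[0]);
}

// Inductive: a line's level depends only on the previous non-blank line.
function indentInductive(lines) {
  const out = [];
  let prev = null; // { text, level } of the previous non-blank line
  for (const raw of lines) {
    const text = raw.trim();
    if (text === "") { out.push(""); continue; }
    let level = 0;
    if (prev !== null) level = prev.level + (netOpens(prev.text) > 0 ? 1 : 0);
    if (startsWithCloser(text)) level = Math.max(0, level - 1);
    out.push("  ".repeat(level) + text);
    prev = { text, level };
  }
  return out;
}

// Structural: every open bracket remembers the level of the line that
// opened it, so closing several scopes at once returns to the right level.
function indentStructural(lines) {
  const out = [];
  const stack = []; // level of the opening line, one entry per open bracket
  for (const raw of lines) {
    const text = raw.trim();
    if (text === "") { out.push(""); continue; }
    const top = stack[stack.length - 1];
    let level = 0;
    if (top !== undefined) level = startsWithCloser(text) ? top : top + 1;
    out.push("  ".repeat(level) + text);
    for (const ch of text) {
      if (OPEN.includes(ch)) stack.push(level);
      else if (CLOSE.includes(ch)) stack.pop();
    }
  }
  return out;
}

const sample = ["foo( 2,", "{", "sd,", "sdf", "});", "", "foo( 2, {", "sd,", "sdf", "});"];
console.log(indentInductive(sample).join("\n"));
console.log(indentStructural(sample).join("\n"));
```

On the sample input, the inductive version leaves the second call to foo indented because the `});` line closes two scopes but can only step back one level, while the structural version returns to the root level.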

If you add tree-sitter to vscode then I'm happy to try and port https://github.com/atom/atom/pull/18321 to vscode.

I recently made an extension that adds support for tree-sitter by replacing the builtin grammar with a simplified grammar that just colors literals and keywords, and then using the setDecorations API to apply tree-sitter based coloring to the tricky parts:

https://marketplace.visualstudio.com/items?itemName=georgewfraser.vscode-tree-sitter

For example, this is what Go looks like before and after installing the extension:

[Screenshot: Go code before and after installing the extension]

It currently supports

  • Go
  • C++
  • TypeScript

and it's straightforward to add any of the available tree-sitter languages.
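A rough sketch of the decoration-based approach described above, under stated assumptions: the nodes are plain objects shaped like tree-sitter syntax nodes (`type`, `startPosition`, `endPosition`, `children`), the node type names and the category mapping are illustrative, and `collectRanges` is a hypothetical helper rather than code from the extension. It only groups ranges by color category; it does not call any editor API itself.

```javascript
// Illustrative mapping from syntax-node types to color categories.
// The node type names are assumptions; real grammars define their own.
const CATEGORY_BY_NODE_TYPE = new Map([
  ["type_identifier", "type"],
  ["field_identifier", "field"],
]);

// Walk a tree of nodes shaped like tree-sitter syntax nodes in document
// order and collect one list of ranges per color category.
function collectRanges(root) {
  const rangesByCategory = new Map();
  const pending = [root];
  while (pending.length > 0) {
    const node = pending.pop();
    const category = CATEGORY_BY_NODE_TYPE.get(node.type);
    if (category !== undefined) {
      if (!rangesByCategory.has(category)) rangesByCategory.set(category, []);
      rangesByCategory.get(category).push({ start: node.startPosition, end: node.endPosition });
    }
    const children = node.children || [];
    // Push in reverse so the leftmost child is visited first.
    for (let i = children.length - 1; i >= 0; i--) pending.push(children[i]);
  }
  return rangesByCategory;
}

// Hand-built stand-in for the tree of the Go line `var p Point`.
const sampleTree = {
  type: "source_file",
  startPosition: { row: 0, column: 0 },
  endPosition: { row: 0, column: 11 },
  children: [
    { type: "identifier", startPosition: { row: 0, column: 4 }, endPosition: { row: 0, column: 5 }, children: [] },
    { type: "type_identifier", startPosition: { row: 0, column: 6 }, endPosition: { row: 0, column: 11 }, children: [] },
  ],
};

// In an extension, each list would be turned into vscode.Range objects and
// passed to editor.setDecorations(decorationType, ranges).
for (const [category, ranges] of collectRanges(sampleTree)) {
  console.log(category, JSON.stringify(ranges));
}
```

Keeping one list per category matches how decorations are applied: one decoration type per color, each given the full set of ranges it should cover.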

Thanks for doing this, I really like the idea. I looked at doing that for Elm and decided against it for the moment. I think it would be better to have it in my Elm plugin, and the best approach would be to actually have it in the language server, whenever this gets merged: https://github.com/Microsoft/language-server-protocol/issues/513

@georgewfraser does this approach also work for the embedded markdown code? E.g. when you get a completion and it shows some docs, which have some code embedded?

While tree-sitter has an awesome concept I can't say the idea of writing grammar in JavaScript is all that appealing.

I know this is an old comment, but I don't really understand this logic. I've written (or attempted to write) several personal-use grammars for VS Code, and everything about the tmLanguage.json system is hopelessly awkward and unintuitive.

  1. There's a steep learning curve to get even a basic grasp on how TextMate grammars are structured
  2. It uses Oniguruma for its regex library, making the learning curve even steeper
  3. Since the regexes need to be written as strings, extra escape characters are needed, making them more difficult to read and write
  4. JavaScript is the single most widely used language on GitHub, and writing a grammar for Tree Sitter is relatively intuitive, so not only is it a vastly more powerful solution, but the barrier to entry for potential contributors is much lower
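To make the escaping point in item 3 concrete, here is a small sketch. It assumes a made-up one-rule fragment in the style of a tmLanguage.json file; the scope name is illustrative, and JavaScript's built-in regex engine is used only to show the escaping, since it is not Oniguruma.

```javascript
// A TextMate-style rule as it would sit in a .tmLanguage.json file.
// String.raw keeps the backslashes exactly as they appear in the file.
const jsonText = String.raw`{ "match": "\\b(function|return)\\b", "name": "keyword.control.js" }`;
const rule = JSON.parse(jsonText);

// The same pattern written as a JavaScript regex literal needs no doubling.
const literal = /\b(function|return)\b/;

console.log(rule.match);     // the pattern after JSON unescaping
console.log(literal.source); // the pattern as written in the literal
console.log(rule.match === literal.source);
```

Every backslash has to be doubled in the JSON file, which is what makes long patterns hard to read there.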

I realize that semantic highlighting is on VS Code's roadmap so this may be a moot point entirely (although frankly it has been "on the roadmap" for years with no movement), but I am really not seeing any downsides to "writing grammar in JavaScript" vs. VS Code's current implementation, which is frankly a nightmare.

I think what @georgewfraser has made is great in the sense that we don't need to wait on the VS Code core team to start work on a VS Code tree parser. I mean, sure, it being merely an extension isn't the best. But extensions can do almost everything core does, which is more than enough to kickstart the work. The more support we build for it, the faster it will get merged into core, and I applaud @georgewfraser for effectively taking the first step 👏👏👏

Hello everyone. I've developed and published a syntax highlighting extension based on Tree-Sitter. It provides a universal syntax coloring engine for almost any programming language (currently, C and C++ are supported OOTB).
It's very easy to add support for a new language. I'm planning to write a how-to in the next couple of days, but you can figure it out from the source code; it's very simple and straightforward. Contributions are welcome.
I've been using the extension myself for a month, so I suppose it's ready for public use. At least it can be useful until VS Code core provides a stronger syntax parser.

You can install it from the VS Code Marketplace.
Or download the .vsix package from the GitHub page and install it manually.
Please note that the extension published in the VS Code Marketplace will only work on Windows x64.
For other operating systems, please download the pre-compiled .vsix package.
This will be fixed in the near future with one of the next updates.
Alternatively, you can build the extension from source.

I've noticed that @georgewfraser published his implementation a couple of days ago. I suppose we had the same thoughts. I'm very glad that quality alternatives to the limited TextMate grammars are beginning to appear. Thank you, @georgewfraser.

You guys should join forces :)

As an update to anyone interested in these extensions:
Manual platform-specific .vsix installation is no longer required. @georgewfraser, @EvgeniyPeshkov, and I got both of the extensions running with WebAssembly, so now they work out of the box on basically any platform.

Does that extension fix this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461 ?

It probably would, but the tree-sitter implementation for C# is incomplete and I think it doesn't know what a define is right now.
https://github.com/tree-sitter/tree-sitter-c-sharp

Even if these extensions provide better syntax highlighting, it's not ideal because we can't use them with other extensions that use the same Decoration API (see https://github.com/microsoft/vscode/issues/74692)

Well that's disappointing, VScode wouldn't be bad for C# development if MS would properly support the language features. Will continue to use Visual Studio for C# projects. Shame really as VScode would work much faster for Unity projects and if they improved the debugging features on C# would make it the ideal replacement for VS.

I'm just surprised it hasn't been more of an issue.

To continue making progress on this, we can use the repo here to track and discuss the separate issues blocking the tree sitter implementation. I've added issues/labels for all of the known problems including the one @Geobert just pointed out. Feel free to ask questions, make feature requests, and ask for time estimates, or subscribe if you want incremental updates. I'll keep the readme updated with the general progress/timeline of the different extension implementations.

That way the extension contributors can have a central place for tracking/fixing issues, and this thread won't become bloated with every possible topic/question related to the tree-sitter. VS Code contributors can also use it to see what VS Code issues are upstream/blocking the tree sitter.

If there are any major breakthroughs in any of the extensions (or core), we'll make sure to post on this thread. Right now the largest challenge is getting generic long-term support for themes, with the secondary issue of fast colorization. There will be lots of internal progress this month, but it will likely take a month before the next major announcement.

Incomplete list of features of a perfect syntax highlighter for me

Aside from the obvious like string, comment, keyword, there needs to be a level of granularity for me, and of course semantic recognition.

These are all in the context of JavaScript specifically.

  • It can distinguish types of function contexts: declaration, call, method-call

These are possible in TextMate grammar, but method-call isn't added yet

// declaration
function func() {}
// call
func();
// method-call
obj.func();

  • It scopes parameter variables within a function body

Requires semantic highlighting

let outer = 0;

function func(a, b) {
  let inner = 1;
  // `a` and `b` should have a special parameter scope, both in the def 
  // and when using them
  // `inner` and `outer` are plain variables
  return a + b + inner - outer;
}

  • You can specify particular words to have their own scope, e.g. built-ins

Possible in TextMate, but it was incomplete and was recently removed, I'd like to manually define these if possible

console;
window;
document;
this;
setTimeout;
requestAnimationFrame;
arr.slice();
document.querySelector();

  • It can distinguish data structure syntax

Requires semantic highlighting

const objectLiteral = {}; // {} should not have a `punctuation` scope
const arrayLiteral = []; // [] should not have a `punctuation` scope

// But they should have `punctuation` scope here
{ /*  inside a block */ }
array[0] = 2;

  • It can distinguish levels of object property access and objects

I think this is possible in TextMate, I believe this feature may exist already, but sometimes was buggy due to built-in DOM scopes/names

// `obj` should have a "top-level" object scope
// `one` and `two` should have a "sub-object" scope
// `three` should have a last-property scope
obj.one.two.three;

// Also object literal def properties have their own scope
obj = {
  prop: true // `prop` has its own scope
};

  • It can distinguish the type of a variable if it's static

Requires semantic highlighting

class Hello {}
Hello; // should be same as the class def

// Primitives constants can (optionally) have their related scope
const number = 0;
number; // should have `number` scope along with regular `variable / constant` scope

// Constants also scoped when declaring & using them
const CONSTANT = '';

  • Certain keywords can have their own scopes

Possible in TextMate, but not every keyword has its own scope and it gets grouped with other ones in the current language grammar

// `import` and `from` should have a particular module keyword scope
import {module} from 'pkg';

// `function` and `return` should have their own scopes if they want as well
function x() {
  return true;
}

The current TextMate implementation already has a lot of these nice features, and when Atom switched to their new tree-sitter syntax, it lost parts of these features. So while it gained nice features, it also lost some.

Many people don't realize these particular granular features can be important (or realize they should exist), but they are important to me. So whatever new implementation gets added, please make sure these are all possible in their respective languages 👌

As mentioned by @sean-mcmanus here and here, the dependency on the now unmaintained Atom TextMate grammars is preventing issues from being addressed in the display of C/C++. You can add Go to the list now, too. The new %w format verb is not highlighted, and even the simple change necessary to make this possible can't go in because neither side is willing to make the change. This is only going to get worse.

The C/C++ grammar is now being maintained at https://github.com/jeff-hykin/cpp-textmate-grammar/ . If you have an issue with the syntax highlighting, please file it there.

I am not talking about C/C++, I'm talking about Go and, presumably, any other language for which the highlight grammar is generated from an unmaintained, frozen source.

@flowchartsman clone the Atom repo, and add the fix to it. That's the beauty of open source code. You can even make an extension that instantly applies the fix so that you don't have to wait for official VS Code support to get the benefits. I created a repo here fully set up for publishing the Go syntax as an example. You're welcome to create issues on it if it's unclear how it works.

If there is a better maintained version of the TextMate syntax for Go, I'm confident the VS Code team will switch to it. Someone just needs to take the initiative to create or suggest that more-maintained repo. @matter123 and I did it for C++, which is a good example of it getting better instead of worse. Matter123 and I have also created, and are working on documenting/refining, tools that make it easier for others to do the same.

I don't think Atom or VS Code teams are being stubborn sticking to frozen code. Someone just needs to get their hands dirty and implement/publish the fix themself.

There's a pull request at the Go extension for adding tree sitter support, but it's been sitting around since June: https://github.com/microsoft/vscode-go/pull/2555 . @ramya-rao-a What's the plan for improving syntax colorization for Go?

I don't think Atom or VS Code teams are being stubborn sticking to frozen code. Someone just needs to get their hands dirty and implement/publish the fix themself.

@jeff-hykin that kind of kicks the can on the main thrust of the issue here (multi-language dependence on stale atom code and difficult-to-maintain grammars), but I'll take it for now. Seems downright neglectful if I don't fork/PR now. I'll leave this issue to proceed on its own for now.

Any updates?

We'd love to support VSCode again, but bringing back our TextMate grammar is not an option. Our tree-sitter grammar has proven to be far easier to maintain, and the highlighting is much better (in Atom).

If the VS Code team doesn't want to rely on Atom's original C-based tree-sitter implementation (it's easy to imagine many different reasons not to use native Node modules in the core), it might be worth considering Lezer, a JS-based implementation of the idea by Marijn Haverbeke. The authorship almost guarantees great code quality and maintainability. As a benefit, VS Code (Monaco Editor?) could share grammars with the upcoming CodeMirror version, for which grammars will, without any doubt, be written for any possible language.

There are also docs for the new (reworked) tree sitter highlighting now https://tree-sitter.github.io/tree-sitter/syntax-highlighting

Hey everyone, I am a newbie to all this VS Code extension stuff. I wanted to learn how syntax highlighting works in VS Code extensions, so I started a personal project that uses a TextMate grammar to highlight JavaScript. Over time, it grew complex. From this thread, I came across the new concept of tree-sitter. Can anybody help me or suggest good documentation on how to port my existing syntax highlighting extension from a TextMate grammar to tree-sitter? Or if anyone has implemented tree-sitter syntax highlighting for JavaScript, please reply to this. I could really use some help. Thanks.

Or if anyone has implemented tree-sitter syntax highlighting for JavaScript, please reply to this.

You seem like a nice guy, so I'll try not to make this too harsh: The homepage of Tree Sitter has a link to the implementation of JavaScript by Tree Sitter's original creator. Good luck, and maybe hunt a little harder next time you have a question like this one.

@adrijshikhar This isn't really the right thread for that kind of question; a better place would be the tree sitter repo here.

What is relevant here is that if you port your syntax from TextMate to the Tree Sitter, it won't be natively supported in VS Code, and the non-native support is still very rough. The Tree Sitter is awesome, so I highly recommend learning it. However, if you want something working in VS Code, another guy and I have been working on a library for making maintainable TextMate grammars for more than a year now. The library and documentation were finished recently, and I've been working on creating a tutorial. I'll likely publish the library and the tutorial this winter (2020) on Medium under "Make a TextMate grammar (without wanting to kill yourself)".

Has anyone on the VS Code team explored using the syntect engine?
Converting TextMate grammar into sublime syntax definition is pretty straightforward.

~The conversion is straightforward, but the Sublime engine still has the vast majority of the same problems.~

Actually, it is notably more powerful than TextMate as per https://github.com/sublimehq/sublime_text/issues/2241 (thanks @michaelblyons for pointing that out). It still has some limitations, notably:

  1. Being much slower than TreeSitter, especially for single line parsing
  2. It still uses the very-broken TextMate scope selectors instead of the more powerful scm queries

That said, it would be much easier to implement.

2 years later, is there any consensus on how and when to offer a better alternative to TextMate grammars?

2 years later, is there any consensus on how and when to offer a better alternative to TextMate grammars?

@fcurts "Semantic highlighting" has been available for a short while and it's basically the best syntax highlighting possible, especially when combined with the TM grammars.

It's only available for a few languages so far (it might still be in preview? Don't quote me on that), but basically, it leverages a language service to determine the actual semantic type of any identifiers in your code and highlights them according to your theme's rules for that type. For tokens that don't have or don't need semantic information from a language service (like keywords and punctuation), it falls back to the TM grammars.

For example, in JavaScript:

const onClick = (event) => {
  // do something
}

document
  .querySelector('.my-button')
  .addEventListener('click', onClick) // <-- `onClick` gets colored as a function thanks to semantic highlighting

You need to be using a theme that has semantic highlighting enabled (the default themes do), or manually set it to enabled via a user setting override. As far as I'm aware, the supported languages currently are JavaScript, TypeScript, C++ (via the C++ extension), and C# (via the C# extension). There may be more. It's up to the author of any given language extension to provide the semantic highlighting tokens.
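The layering described above (semantic information where a language service provides it, grammar scopes everywhere else) can be sketched as a simple merge. This is a toy model under stated assumptions, not VS Code's implementation: the token shape `{ start, end, scope }`, the scope and type names, and the `mergeTokens` helper are all hypothetical.

```javascript
// Toy model of layering semantic tokens over grammar tokens.
// Grammar tokens: { start, end, scope }. Semantic tokens: { start, end, type }.
function mergeTokens(grammarTokens, semanticTokens) {
  const semanticByRange = new Map();
  for (const t of semanticTokens) semanticByRange.set(`${t.start}:${t.end}`, t.type);
  return grammarTokens.map((t) => {
    const semanticType = semanticByRange.get(`${t.start}:${t.end}`);
    return semanticType !== undefined
      ? { ...t, scope: semanticType, source: "semantic" } // language service wins
      : { ...t, source: "grammar" };                      // fall back to the grammar
  });
}

const line = ".addEventListener('click', onClick)";
// Helper that computes offsets so the sample does not hard-code them.
const span = (text) => {
  const start = line.indexOf(text);
  return { start, end: start + text.length };
};

const grammarTokens = [
  { ...span("addEventListener"), scope: "entity.name.function" },
  { ...span("'click'"), scope: "string.quoted.single" },
  { ...span("onClick"), scope: "variable.other" },
];
// Only the language service knows that `onClick` refers to a function.
const semanticTokens = [{ ...span("onClick"), type: "function" }];

for (const token of mergeTokens(grammarTokens, semanticTokens)) {
  console.log(line.slice(token.start, token.end), token.scope, token.source);
}
```

In this model the string literal keeps its grammar scope, while `onClick` is reclassified by the semantic layer.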

F# also supports it.

But the language server spec for this is still in preview and subject to change, as far as I followed it.

Would be so happy to see VS Code use a tree-sitter colorizer.

Currently there's no available tree-sitter option unfortunately...

The current extension breaks randomly: https://github.com/georgewfraser/vscode-tree-sitter/issues/28

Btw, Atom has supported tree sitter for 2 years now: https://github.blog/2018-10-31-atoms-new-parsing-system/

Btw, Atom has supported tree sitter for 2 years now: https://github.blog/2018-10-31-atoms-new-parsing-system/

The Atom team developed Tree-Sitter for use with Atom, so it's not really surprising that it's the first (only?) editor to have adopted it. It literally says that in the second paragraph of your link.

VS Code's new Semantic Highlighting implementation is now more accurate than Tree-Sitter since it uses a language service that analyzes your entire project (Tree-Sitter AFAIK has no information about the syntax tree outside of the file it's currently highlighting.)

@dannymcgee semantic highlighting wouldn't/doesn't replace Textmate. The tree sitter replaces textmate, and semantic highlighting would stay as-is.

Having tree sitter and providing an api for it would open the door for many interesting extensions too. We can imagine a paredit that is semantically correct for example.

it's not really surprising that it's the first (only?) editor to have adopted it. It literally says that in the second paragraph of your link.

neovim has it too.

Has this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461 been fixed yet? Or is VS Code still inferior to Visual Studio for C# development?

This issue is finally on the second page of the issues backlog which is nice to see. Visibility on this issue is probably pretty low just because most end users don't know how painful Tm grammars are until they run into syntax highlighting issues. @jeff-hykin maybe it would help get more visibility if you advertised it on the "Better C/C++ Syntax" readme?

Hopefully fixed soon. Unity game dev is shit with VS Code for a number of reasons, one of the biggest being that so much of the code uses #if #else #endif wrappers for different plugins, dev builds, etc., and having all that unused code not visually greyed out is just rubbish to work with. Might as well use Notepad.

end users don't know how painful Tm grammars are until they run into syntax highlighting issues. @jeff-hykin maybe it would help get more visibility if you advertised it on the "Better C/C++ Syntax" readme?

@tristan957

With 330 issues (230 closed) on what is essentially a static json file, some issues of which are unsolvable, a general explanation in the readme wouldn't be a bad idea.

@jeff-hykin since you are familiar with at least enough of VSCode to put together syntax definitions, is it possible to have tree-sitter integration in an external extension? I think as Neovim 0.5 has been developed, the tree sitter integration actually lives in a separate extension, at least for now. https://github.com/nvim-treesitter

It seems that VS Code semantic highlighting completely takes over from TM grammars when enabled, but maybe they work side by side 🤷. If they don't work together, could tree sitter be another option, like semantic highlighting is an option?

@tristan957 sort of. You can certainly get the tree sitter engine running, I packaged up the WASM version of it into georgewfraser's extension. I worked with him on it for a bit
https://github.com/jeff-hykin/experimental-tree-sitter
https://github.com/georgewfraser/vscode-tree-sitter

  1. The problem is the extension uses decorations, which are slow. Really Really Really slow
  2. There is no good way of accessing the user's theme settings. George's extension has its own colors independent of whatever your theme is. (You can manually customize the colors)
  3. Assuming it did have access (there are hacky ways to gain access), there needs to be an implementation of a converter from the Textmate theme to the Tree sitter query language.

And there's smaller problems along with those. #1 is by far the most difficult issue, and the semantic highlighting might fix that, I'd have to learn more about the API.
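As a rough idea of what the converter in problem 3 would have to do, here is a minimal sketch. It assumes theme rules in the `tokenColors` shape used by VS Code color themes (a `scope` that is a string or an array, plus `settings.foreground`); the capture-to-scope mapping and the function names are hypothetical, and real TextMate scope selectors (comma-separated lists, descendant selectors, exclusions) are deliberately not handled.

```javascript
// Hypothetical mapping from tree-sitter capture names to TextMate scopes.
const SCOPE_BY_CAPTURE = new Map([
  ["keyword", "keyword.control"],
  ["string", "string.quoted"],
  ["function", "entity.name.function"],
]);

// A selector matches a scope if it equals it or is a dot-separated prefix.
function selectorMatches(selector, scope) {
  return scope === selector || scope.startsWith(selector + ".");
}

// Look up the foreground color for a capture; the longest (most specific)
// matching selector wins.
function colorForCapture(capture, tokenColors) {
  const scope = SCOPE_BY_CAPTURE.get(capture);
  if (scope === undefined) return undefined;
  let best;
  for (const rule of tokenColors) {
    if (!rule.scope || !rule.settings) continue; // skip rules without a scope
    const selectors = Array.isArray(rule.scope) ? rule.scope : [rule.scope];
    for (const selector of selectors) {
      const longer = best === undefined || selector.length > best.selector.length;
      if (selectorMatches(selector, scope) && longer) {
        best = { selector, color: rule.settings.foreground };
      }
    }
  }
  return best === undefined ? undefined : best.color;
}

// Made-up theme fragment for illustration.
const sampleTokenColors = [
  { scope: "keyword", settings: { foreground: "#C586C0" } },
  { scope: "entity.name", settings: { foreground: "#4EC9B0" } },
  { scope: ["string", "entity.name.function"], settings: { foreground: "#DCDCAA" } },
];

for (const capture of ["keyword", "function", "comment"]) {
  console.log(capture, colorForCapture(capture, sampleTokenColors));
}
```

The hard part in practice is the mapping table itself, since capture names and TextMate scopes do not line up one to one.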

I imagine Neovim gives extensions a lot more control than VS Code does with its extensions.
