Neovim: structured/"rich" text, text annotations/overlay

Created on 3 Jan 2015 · 50 Comments · Source: neovim/neovim

With the new plugin structure and things like the MessagePack API, it becomes much easier to have external plugins; one thing that is still awkward is the coloring of information.

I'm thinking about a plugin here (slimv, to be exact, although many others would have the same issue) that wants to have its own buffer to display arbitrary information. To get some highlighting (of "active" fields and other parts), there have to be syntax rules that operate via string matching - and that is

  • cumbersome (needs separators that don't exist in the text)
  • not nice re cursor movement (think concealcursor, conceallevel)
  • slow (for big buffers)
  • unnecessary.

How about being able to specify "classes" for (parts of) text, so that the syntax coloring rules can be applied directly without needing the RE engine in between? I imagine sending not a plain string for a line, but something like ["a string ", { class: "Error", text: "some error string"}, "more plain text"].

If that could be stored in Neovim directly it should make lots of things easier - especially for the plugins if they could store (some) arbitrary information in the dictionary as well! (Currently such things have to be put into the line, concealed in some way, and then matched and parsed out again, which is awful.)
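To make the proposed line format concrete, here is a minimal Python sketch of how a receiver could split such a structured line into plain text plus annotated spans. The name `flatten_line` and the extra `id` key are made up for illustration; nothing in this thread says Neovim offers such a call.

```python
def flatten_line(chunks):
    """Split a structured line into (plain_text, spans).

    Each chunk is either a plain string or a dict with "text", an optional
    "class", and any extra plugin data. Each span is
    (start, end, class_name, extra_data) with `end` exclusive.
    Offsets are Python character offsets; an editor would more likely
    use byte columns.
    """
    parts = []
    spans = []
    col = 0
    for chunk in chunks:
        if isinstance(chunk, str):
            piece = chunk
        else:
            piece = chunk["text"]
            # Everything besides "class" and "text" is plugin-private data.
            extra = {k: v for k, v in chunk.items() if k not in ("class", "text")}
            spans.append((col, col + len(piece), chunk.get("class"), extra))
        parts.append(piece)
        col += len(piece)
    return "".join(parts), spans


line = ["a string ", {"class": "Error", "text": "some error string", "id": 10}, "more plain text"]
text, spans = flatten_line(line)
```

The buffer would hold only `text`; the spans carry the highlight class and the plugin's own data, so nothing has to be concealed inside the line and parsed out again.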

To give a specific example - currently a line looks like this:

{[10] "types" []} = {[11] #<HASH-TABLE :TEST EQUALP :COUNT 3 {1009D95AD3}> []} {<3> [remove entry] <>}

while the visual display is

"types"  = #<HASH-TABLE :TEST EQUALP :COUNT 3 {1009D95AD3}>  [remove entry]

(and includes colors, of course).

enhancement extmarks syntax treesitter virtualtext

All 50 comments

> How about being able to specify "classes" for (parts of) text, so that the syntax coloring rules can be applied directly without needing the RE engine

https://github.com/neovim/neovim/issues/719#issuecomment-52933748 mentions scintillua, which looks like a very good alternative to regex for describing syntax.

It doesn't solve your more general request, which I interpret as "text properties". Text properties are very important but will require careful design, and won't be possible until we achieve more fundamental changes such as abstracting the buffer structure. Currently in the n/vim core, a buffer is basically a character array (memline.c); trying to bolt-on text properties will result in terrible performance.

vis's basic data structure (a piece chain) seems good for this kind of thing, but moving to that could require a major rewrite of the regexp engine.
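For readers unfamiliar with the idea: a piece chain (also called a piece table) keeps the original text immutable and describes the current buffer as a sequence of pieces that point either into the original or into an append-only "add" buffer. The following is a minimal, insert-only Python sketch of that general technique. It is not vis's actual implementation, and the class and field names are my own.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    source: str   # "orig" or "add"
    start: int    # offset into the source buffer
    length: int


class PieceTable:
    def __init__(self, text):
        self.orig = text          # never modified after loading
        self.add = ""             # append-only buffer for inserted text
        self.pieces = [Piece("orig", 0, len(text))] if text else []

    def insert(self, pos, text):
        """Insert `text` at character offset `pos` of the current content."""
        if not text:
            return
        if pos < 0:
            raise IndexError(pos)
        new = Piece("add", len(self.add), len(text))
        self.add += text
        offset = 0
        for i, p in enumerate(self.pieces):
            if pos <= offset + p.length:
                left = pos - offset
                parts = []
                if left > 0:
                    parts.append(Piece(p.source, p.start, left))
                parts.append(new)
                if left < p.length:
                    parts.append(Piece(p.source, p.start + left, p.length - left))
                self.pieces[i:i + 1] = parts   # split the piece around the insert
                return
            offset += p.length
        if pos != offset:
            raise IndexError(pos)
        self.pieces.append(new)

    def text(self):
        bufs = {"orig": self.orig, "add": self.add}
        return "".join(bufs[p.source][p.start:p.start + p.length] for p in self.pieces)


table = PieceTable("hello world")
table.insert(5, ",")
table.insert(0, ">> ")
```

Because each piece is a small record, extra fields (a highlight class, plugin data) could hang off it without touching the text itself, which is presumably why the structure looks attractive for text properties.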

> vis's basic data structure (a piece chain) seems good for this kind of thing, but moving to that could require a major rewrite of the regexp engine.

I saw vis appear on HN too, and it intrigues me a lot. Though at some points in the description I thought: we can't omit that (feature) as he has done, (n)vim needs it.

Being able to mmap files into the memory space is really really cool, but it already breaks down when input conversion needs to be done, as said by @martanne (plus the fact that vim currently scans over the entire file to determine the encoding, which is cool since it has to read it into allocated memory anyway).

@aktau If a buffer is opened with :e ++enc=utf8 we could avoid conversion and scanning. And we could provide a user option that says "always assume utf8".

> @aktau If a buffer is opened with :e ++enc=utf8 we could avoid conversion and scanning. And we could provide a user option that says "always assume utf8".

Yes, if the fenc (forced or not) is the same as enc, then a read-only mmap could be done. Otherwise not so much.
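As a rough illustration of that decision, here is a Python sketch: map the file read-only when the file encoding equals the internal encoding, otherwise read and convert it. The helper name `load_for_display` is hypothetical, and comparing encoding names as lowercase strings is a simplification.

```python
import mmap


def load_for_display(path, enc="utf-8", fenc="utf-8"):
    """Return (data, mapped).

    If fenc equals enc, no conversion is needed, so the file is mapped
    read-only and `data` is an mmap object. Otherwise the whole file is
    read and converted, and `data` is a bytes object in `enc`.
    """
    with open(path, "rb") as f:
        if fenc.lower() == enc.lower():
            try:
                # mmap keeps its own duplicate of the descriptor, so the
                # mapping stays valid after the file object is closed.
                return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ), True
            except ValueError:
                # An empty file cannot be mapped.
                return b"", False
        return f.read().decode(fenc).encode(enc), False
```

Note that the mapped branch never validates the bytes, which is exactly the binary-file concern raised in the next paragraph.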

Most likely this is the majority case. But for example this would break down when opening binary files which are likely to fail the utf-8 test, which is usually when people really need large file support.

I'm also not sure of the impact of such a split-up on the manageability of the code. As I understand it, the current mem{line,file} combo is actually quite clean.

@aktau as soon as there's a NUL byte in the first few kBytes, the encoding should be seen as "binary", so the UTF8 test shouldn't matter here.

Well, it's equally possible to keep that memory layout, and "just" have an additional map (array or hash or tree or whatever) that can deliver additional details for parts of a line... don't know whether that's the same kind of rewrite, though.

> @aktau as soon as there's a NUL byte in the first few kBytes, the encoding should be seen as "binary", so the UTF8 test shouldn't matter here.

As far as I can remember (got an unfinished blog post about this), if enc is utf-8 (which it most likely is), and fenc is something else, conversion will take place. I'm not even sure if binary is a vim encoding, I don't think so.

> Well, it's equally possible to keep that memory layout, and "just" have an additional map (array or hash or tree or whatever) that can deliver additional details for parts of a line... don't know whether that's the same kind of rewrite, though.

Like some sort of _conversion overlay_, you mean. This would work well for latin-X to utf-8 or the reverse, but more distinct encodings would probably suffer.

All that said, I think perhaps an on-the-fly _conversion overlay_ could work, generated only when a certain piece of the buffer is actually requested. I shudder at the thought of implementing this without any crazy bugs though. The thought-stuff sure is enticing.
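A minimal Python sketch of such an overlay, restricted to the easy case mentioned above (a single-byte source encoding such as latin-1, where character offsets equal byte offsets). The class name and chunk size are assumptions for illustration; multi-byte source encodings would need boundary handling this sketch does not attempt.

```python
class ConversionOverlay:
    """Decode fixed-size chunks of a single-byte-encoded buffer on demand."""

    def __init__(self, raw, fenc="latin-1", chunk_size=4096):
        self.raw = raw                # bytes or an mmap; never converted eagerly
        self.fenc = fenc
        self.chunk_size = chunk_size
        self._cache = {}              # chunk index -> decoded str

    def _chunk(self, index):
        if index not in self._cache:
            start = index * self.chunk_size
            self._cache[index] = self.raw[start:start + self.chunk_size].decode(self.fenc)
        return self._cache[index]

    def read(self, start, end):
        """Return the characters in [start, end), decoding only touched chunks."""
        if start >= end:
            return ""
        first = start // self.chunk_size
        last = (end - 1) // self.chunk_size
        text = "".join(self._chunk(i) for i in range(first, last + 1))
        base = first * self.chunk_size
        return text[start - base:end - base]


raw = ("caf\u00e9 " * 1000).encode("latin-1")   # 5000 bytes
overlay = ConversionOverlay(raw, chunk_size=1024)
head = overlay.read(0, 4)
```

Only the chunks that a read actually touches are ever decoded; the rest of `raw` stays as it was loaded.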

On the other hand, large files support seems like yet another concept that could be added on as a plugin. In the case of binary files, syntax/highlighting is obviously not needed, so a "view" of a file could be fed to nvim and the usual motions and non-syntax plugins could work on the partial buffer.

In the case of a large log file, for which syntax/highlighting would be needed, nothing is lost because vim already has maxlines and synmaxcol values which limit the lines evaluated by the syntax engine.

Some problems I can think of with this "view" approach:

  • in-file search (/) and :vimgrep won't work. We would need to fall back to an external search tool (which likely wouldn't support vim-style regex).
  • we would need to modify nvim core to understand the concept of "deferred content". E.g., we only send the current view, but let nvim know the _actual_ line/column count (and other parameters I haven't thought of)
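To illustrate the "deferred content" idea from the second point, here is a Python sketch of a provider that keeps only line offsets in memory, reports the real line count, and hands out one window of lines at a time. `DeferredFile` and its methods are hypothetical names; how nvim would consume such a view is exactly the open question.

```python
import io


class DeferredFile:
    """Serve windows of lines from a file while remembering only line offsets."""

    def __init__(self, fileobj):
        self.f = fileobj              # a binary, seekable file object
        self.offsets = []             # byte offset of the start of each line
        pos = 0
        self.f.seek(0)
        for line in self.f:           # streams through the file once
            self.offsets.append(pos)
            pos += len(line)

    @property
    def line_count(self):
        """The actual number of lines, to be reported alongside any view."""
        return len(self.offsets)

    def view(self, first, count):
        """Return up to `count` lines starting at 0-based line `first`."""
        if first >= self.line_count:
            return []
        self.f.seek(self.offsets[first])
        wanted = min(count, self.line_count - first)
        return [self.f.readline().rstrip(b"\n") for _ in range(wanted)]


data = b"".join(b"line %d\n" % i for i in range(1000))
deferred = DeferredFile(io.BytesIO(data))
window = deferred.view(500, 3)
```

Search across the whole file would still have to go through the provider (or an external tool), which is the first problem listed above.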

Personally I really prefer trying to leverage robust external solutions and only enhancing the core by adding hooks.

Another reason I like the plugin approach is that in the common case, loading the file in memory is not really a problem and avoids complication. When people load large files they are unhappy about one of two things:

  • too slow
  • not enough features

If you load a giant C# (10 MB) file in Visual Studio, it will buckle (I know this for a fact). Add ReSharper and you might as well get some coffee. So you must either choose fast or good, and that means it is reasonable to disable some features (vim regex, whole-file analysis) on very large files.

I am interested to hear other cases that I am missing which would break with the "partial view" approach.

> If you load a giant C# (10 MB) file in Visual Studio, it will buckle (I know this for a fact). Add ReSharper and you might as well get some coffee. So you must either choose fast or good, and that means it is reasonable to disable some features (vim regex, whole-file analysis) on very large files.

I wasn't actually thinking about 10MB files as large. It sounds like peanuts. I was thinking more of files in the 1-50GB range, which would have trouble fitting in main memory. Off the top of my head, I don't know how well (n)vim does with a 10MB source file (syntax highlighted and all), but I would consider it a failure if we don't solve that (in case it has issues).

n/vim (with syntax highlighting and neocomplete) has no problem at all on the same 10 MB C# file (obviously VS/ReSharper are doing a lot more work on that file, so I don't mean to compare the two). I only raised that example to point out that one cannot expect all features in all scenarios.

Migrating the buffer data structure of n/vim is pretty close to a total rewrite. I find it much more interesting to see how far we can get with alternative solutions.

Hmm, to get back to my original request ... how about providing some kind of rich text buffer with some restrictions?

  • read-only, i.e. only modifiable via replacing whole lines
  • highlighting only valid within a line, so it needs to be repeated for each line in a paragraph
  • cannot be saved or loaded

Is text properties not a correct interpretation of your original request? I don't understand what is new in the rich text buffer you describe.

Yeah, _text properties_ might be a good name for it, too.

I'd need not only the highlighting class name, though - storing arbitrary data as well would be nice.

Storing arbitrary data in association with a piece of text, and that association follows the text as edits are made. I believe the existing marks logic could be extended to do this, though it may not be scalable.
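A small Python sketch of what "the association follows the text" means for ranges on a single line. The class name and the boundary rules (an insert at a mark's start pushes it right, an insert at its end does not extend it) are my own assumptions. Every edit walks all marks, which also illustrates the scalability concern.

```python
class RangeMarks:
    """Ranges with attached data on one line, adjusted as the text is edited."""

    def __init__(self):
        self.marks = []   # each entry is [start, end, data], end exclusive

    def add(self, start, end, data):
        self.marks.append([start, end, data])

    def on_insert(self, pos, length):
        for mark in self.marks:      # O(number of marks) per edit
            if mark[0] >= pos:
                mark[0] += length
            if mark[1] > pos:
                mark[1] += length

    def on_delete(self, pos, length):
        end = pos + length

        def shift(x):
            if x <= pos:
                return x             # before the deleted text: unchanged
            if x < end:
                return pos           # inside the deleted text: clamp
            return x - length        # after the deleted text: move left

        for mark in self.marks:
            mark[0] = shift(mark[0])
            mark[1] = shift(mark[1])


marks = RangeMarks()
marks.add(9, 26, {"id": 10})
marks.on_insert(0, 4)     # text inserted before the range
```

A tree-based structure could avoid the linear walk; the list here is only the simplest thing that shows the behavior.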

> Storing arbitrary data in association with a piece of text, and that association follows the text as edits are made.

Sounds right, although for my use case no (user-)edits are needed. I'd just replace whole lines via RPC.

As for an easy example, think about netrw directory listings with coloring, like ls does. Perhaps with optional other highlighting, e.g. files newer than an hour, files bigger than X, or something like that, to get more colorized items in a line.

How about allowing arbitrary key/value pairs in the :highlight command (e.g. :highlight SomeGroup rgba=#e1e1e1cc) and simplifying the association of highlight groups with arbitrary ranges? These arbitrary key/value pairs would be consumed only by UIs that are interested in them.

The advantage is that we reuse the existing mechanism for decorating text.

@phmarek If you can compute the position of the text to highlight since it is static, shouldn't it be possible to use matchaddpos()? That said, it seems to me that would be even more cumbersome than what we currently have; I use the concealed tags method in vim-pad and I know what you mean about it being not as clean as one would want.

Perhaps introducing a virtual key to tag separators (let's say <HSep>, like <SNR>) would help that, so instead of

 {[10] "types" []} = {[11] #<HASH-TABLE :TEST EQUALP :COUNT 3 {1009D95AD3}> []} {<3> [remove entry] <>}

you could have

<Hsep>10 "types" 10<Hsep> = <Hsep>11 #<HASH-TABLE :TEST EQUALP :COUNT 3 {1009D95AD3}> 11<Hsep> <Hsep>3 [remove entry] 3<Hsep>

@tarruda That would be quite helpful, and not only for UIs.

@fmoralesc That might be an option, too, but not much cleaner IMO.

And, in the long run, I'd like to shoot for having a "text" property named img, to have inline images, and this here would be an "easy" first step ;P

matchaddpos works well for e.g. a read-only output buffer, but it's a little bit inconvenient since it only modifies the current window. In a plugin I want to dynamically highlight an output buffer that need not be the current window. Switching current window back and forth kind-of works, but is not entirely reliable. Also it would be more convenient if the added highlighting were associated with a buffer and not a window (if the window is closed and the buffer then reopened, the matches need not be re-added). A bufferwise matchaddpos is perhaps something to consider?
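The difference between window-local and buffer-local matches can be sketched as a registry keyed by buffer number. This is a Python illustration with made-up names, not an existing Vim or Neovim interface; positions follow the [line, col, length] shape that matchaddpos() accepts.

```python
from collections import defaultdict
from itertools import count


class BufferMatches:
    """Position-based highlights stored per buffer instead of per window."""

    def __init__(self):
        self._by_buffer = defaultdict(dict)   # bufnr -> {match_id: (group, positions)}
        self._ids = count(1)

    def add(self, bufnr, group, positions):
        match_id = next(self._ids)
        self._by_buffer[bufnr][match_id] = (group, list(positions))
        return match_id

    def delete(self, bufnr, match_id):
        del self._by_buffer[bufnr][match_id]

    def matches_for(self, bufnr):
        """What any window displaying `bufnr` should draw."""
        return list(self._by_buffer.get(bufnr, {}).values())


registry = BufferMatches()
error_id = registry.add(3, "Error", [(1, 5, 4)])
```

Since the lookup is by buffer, the plugin never has to switch the current window, and closing and reopening a window loses nothing.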

@phmarek Sure, I was only thinking of the issue of having to specify different separators depending on the contents of the region, which that would solve.

I think @tarruda's suggestion could help for implementing img. Actually, nothing _already_ stops a UI from interpreting text like

  ![image](path)

as an image to be displayed; it's just that all UIs currently assume a grid of text. Expanding the :hi command would allow providing hints to UIs about this:

 syn match cmImage /!\[.\+\](.\+)/
 hi cmImage type=img
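On the UI side, the first step would be locating such spans in a line. A Python sketch of that step follows; the pattern is my own and is stricter than the Vim one, since it stops at the first closing bracket and parenthesis.

```python
import re

# "![" label "](" path ")" with no nesting.
IMAGE = re.compile(r"!\[([^\]]+)\]\(([^)]+)\)")


def find_images(line):
    """Return (start, end, label, path) for each image reference in `line`."""
    return [(m.start(), m.end(), m.group(1), m.group(2)) for m in IMAGE.finditer(line)]


sample = "see ![logo](img/logo.png) here"
found = find_images(sample)
```

A UI could then replace the cells in `[start, end)` with the rendered image, provided it is told that the span has the (proposed, not existing) `type=img` attribute.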

I've thought a mechanism like this could allow plugins like NerdTree to be displayed as native lists, and special buffers to be displayed using non-fixed-width fonts.

@bfredl :+1:

Not sure highlight is the right mechanism, rather :syntax. But extending vimscript seems unnecessary to me in this case. We should only reuse the internal structures, but expose the functionality via the API only.

@justinmk Probably. Extending _both_ would be helpful.

(As to extending :syntax, I was thinking of adding a conceahhl attr to it, to allow highlighting different conceals differently, which has been a pain for me for a while at vim-pandoc-syntax).

Extending vimscript can also benefit vanilla vim, if the code could be proposed to vim_dev (I know...)

Sure if vim_dev accepts, but otherwise extending vimscript really complicates the burden of compatibility (or managing, documenting, and providing solutions for incompatibility) and always requires difficult, time-hungry decisions. Incompatibility is always an option (and sometimes we will choose it), but we can also work on making our API really nice to work with (from vimscript, too, using rpcnotify() and friends).

> But extending vimscript seems unnecessary to me in this case. We should only reuse the internal structures, but expose the functionality via the API only.

The vimscript changes are minimal: all I'm proposing is to allow arbitrary key/value pairs after the :highlight GROUP command (as opposed to allowing only fixed keys such as guifg, ctermfg, etc.). Highlight group data would be stored and passed as dictionaries, and UIs would only extract the information they support.

For example, if a TUI and a GUI are connected to the same instance and need to display a highlight group, the TUI would check for ctermfg/ctermbg while the GUI would check for guifg/guibg and even richer information such as alpha level or images as suggested by @fmoralesc.

For associating the highlight information with arbitrary positions, I vote for @bfredl's suggestion: a buffer-aware matchaddpos().

Besides being backwards compatible and allowing arbitrary formatting to be associated with text, this has the advantage of simplifying code (I estimate about 40% of syntax.c could be removed).
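The dictionary idea can be sketched in a few lines of Python. The parser below is deliberately naive (it splits on whitespace, so quoted values containing spaces are not handled), and the key sets per UI are assumptions chosen to mirror the example above.

```python
def parse_highlight(args):
    """Parse 'Group key=value key=value ...' into (group, attrs).

    Unknown keys are kept rather than rejected, which is the whole point
    of the proposal.
    """
    group, *pairs = args.split()
    attrs = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % pair)
        attrs[key] = value
    return group, attrs


# Hypothetical capability sets: each UI extracts only what it understands.
TUI_KEYS = ("ctermfg", "ctermbg")
GUI_KEYS = ("guifg", "guibg", "rgba")


def attrs_for_ui(attrs, supported):
    return {k: v for k, v in attrs.items() if k in supported}


group, attrs = parse_highlight("SomeGroup ctermfg=9 guifg=#ff0000 rgba=#e1e1e1cc")
tui_view = attrs_for_ui(attrs, TUI_KEYS)
gui_view = attrs_for_ui(attrs, GUI_KEYS)
```

Both UIs receive the same dictionary; the TUI simply never looks at `rgba`.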

> allow arbitrary key/value pairs after the :highlight GROUP command

Ok, but storing arbitrary information (including data that has nothing to do with highlighting, e.g. the value of a variable, or a contextual error message) via highlight sounds messy. I hope that we will also provide a proper API call to steer new plugins away from using :highlight for non-highlight purposes.

Just as a reference for what I'm talking about: the "inspect" buffer in the video at http://malisper.me/2015/07/14/debugging-lisp-part-2-inspecting/ is a good example.

Coming back to that, having _some_ data stored in the syntax definition might be okay; but for multiple "buttons" in a menu these buttons would need to store per-button arbitrary data.

Kinda like HTML, ya know ;)

Sorry, I didn't follow the recent changes... does Neovim already have "rich" text?

I retract my objection to storing text annotation/metadata in :highlight structures. :highlight is a reasonable place for metadata/annotations. But of course this should be API-driven.

> does Neovim already have "rich" text?

No.

tree-sitter (see also video) looks like a viable option.

  • C lib with minimal dependencies
  • more momentum (and more grammars) than scintillua just by association with Atom editor
  • advanced features like incremental parsing and error recovery (not sure if scintillua has those)

The recent version of the Atom editor has switched on tree-sitter by default. Do you think this might be a signal to start evaluating it for Neovim as well? Tree-sitter's feature set, as described in this talk from Strange Loop, looks very promising.

@narqo I'm in the process of evaluating it. The idea is to expose the tree to in-process lua plugins.

I've also been thinking for quite some time about how tree-sitter could be integrated in a way that is flexible enough to provide advanced features like the following:

  • multiple languages/ASTs in one file/buffer like described here: http://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents
  • (possibly multiple) ASTs via MessagePack API to provide a convenient/faster way for e.g. linting, semantic (code) completion, concealing, advanced code navigation, folding etc.

    • This might be quite difficult to do performantly (don't send the whole AST all the time over the API, possibly differential/incremental updates)

  • syntax highlighting/styling via MessagePack API?

    • I had in mind to implement a plugin(?) that uses a simple query based language like CSS to select nodes in the AST to allow advanced conditional styling (e.g. function parameter identifiers can look different than identifiers in a block statement)
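The CSS-like selection from the last point could look roughly like the following Python sketch. It supports only descendant selectors over node types, the tree is hand-built rather than produced by tree-sitter, and the node type names are illustrative since they differ per grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    text: str = ""
    children: list = field(default_factory=list)


def select(root, selector):
    """Return nodes matching a descendant selector like 'parameters identifier'.

    The last word must match the node's own type; the words before it must
    match ancestors, in order, from the outside in.
    """
    *ancestors, target = selector.split()
    results = []

    def walk(node, matched):
        # `matched` counts how many ancestor parts are satisfied above `node`.
        if matched == len(ancestors) and node.type == target:
            results.append(node)
        if matched < len(ancestors) and node.type == ancestors[matched]:
            matched += 1
        for child in node.children:
            walk(child, matched)

    walk(root, 0)
    return results


tree = Node("function_definition", children=[
    Node("identifier", "f"),
    Node("parameters", children=[Node("identifier", "x"), Node("identifier", "y")]),
    Node("block", children=[Node("identifier", "x")]),
])
params = select(tree, "parameters identifier")
```

With a style table such as `{"parameters identifier": "Special", "identifier": "Identifier"}`, parameter names can be styled differently from identifiers inside the block, which is the conditional styling described above.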

I'm not sure if it is a good idea to do most of the stuff via the MessagePack API, as it could be quite challenging to do it performantly. On the other hand it would provide a lot of freedom for developers to implement all kinds of plugins in their language.
In-process plugins would make it of course quite a lot simpler.

In general tree-sitter seems to be a well-implemented API that is pretty fast (it parsed the whole screen.c of this repo in about 25 ms on a somewhat older i7-4870HQ; changes should be pretty fast with incremental parsing).

@bfredl could you explain in more detail how you would do the in-process lua plugins?

@phmarek In my view tree-sitter should be for the things that are "fast enough, so it even doesn't need to be async", while LSP (and other RPC processes) covers the slower, but more semantic stuff.

There is no exact plan, but my aim is to do stuff in lua/luajit as much as possible and add core extension points where needed; #9170 is the first example. I would also like to experiment with callback-driven highlighting, like nvim invoking a lua function per line (or per viewport, for somewhat lower overhead), which returns some kind of display list (highlight of existing byte ranges + perhaps virtual text). If it turns out to be too slow, we can of course do everything in C, but adding these kinds of extension points is still useful.
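The callback idea, sketched in Python rather than Lua to keep one language across these examples. The shape of the display list (dicts with `kind`, byte offsets, group names) is purely an assumption; the comment above only says "some kind of display list".

```python
import re


def highlight_line(lnum, line):
    """Example callback: return a display list for one line of bytes."""
    items = []
    for m in re.finditer(rb"\bTODO\b", line):
        # Highlight of an existing byte range.
        items.append({"kind": "hl", "start": m.start(), "end": m.end(), "group": "Todo"})
    if len(line) > 80:
        # Virtual text: shown after the line, not part of the buffer.
        items.append({"kind": "virt_text", "col": len(line),
                      "text": " <- long line", "group": "WarningMsg"})
    return items


def redraw(lines, first, last, callback):
    """Invoke the callback only for the lines in the viewport [first, last)."""
    return {lnum: callback(lnum, lines[lnum]) for lnum in range(first, last)}


buffer_lines = [b"int x; // TODO fix", b"y = 1", b"z" * 100]
display = redraw(buffer_lines, 0, 2, highlight_line)
```

Lines outside the viewport (the third line here) never reach the callback, which is where the "per viewport" overhead saving comes from.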

@bfredl OK, the developer of Tree-sitter actually says similar things: https://news.ycombinator.com/item?id=18349488
So it's probably a little bit over the top to use tree-sitter for all the features I proposed.

As I think we should keep neovim as modular as possible, is it possible to write in-process plugins in C that are not compiled within neovim that may access the tree-sitter-tree(s) directly? I have searched quite a bit but couldn't find anything regarding it (maybe an issue worth itself)

I think especially with the "fast enough, so it even doesn't need to be async" strategy (which I support) we should keep in mind that Lua may not be fast enough for a Syntax API for different features, or we have to carefully design this Lua-tree-sitter API so that the main work will still be done by the neovim binary.
But I guess it's a good idea to start with Lua, and test how well it behaves.

> As I think we should keep neovim as modular as possible, is it possible to write in-process plugins in C that are not compiled within neovim that may access the tree-sitter-tree(s) directly? I have searched quite a bit but couldn't find anything regarding it (maybe an issue worth itself)

Tree-sitter has TSTree *ts_tree_copy(const TSTree *), which can be used to safely share an otherwise Lua-managed tree with some C code (even on a separate thread).

> I think especially with the "fast enough, so it even doesn't need to be async" strategy (which I support) we should keep in mind that Lua may not be fast enough for a Syntax API for different features, or we have to carefully design this Lua-tree-sitter API so that the main work will still be done by the neovim binary.
> But I guess it's a good idea to start with Lua, and test how well it behaves.

Don't worry, I will use detailed profiling and comparison with the existing regex syntax highlighting (which easily can take 40% of redraw time of nvim core+TUI). The main reason for using lua, is that luajit FFI + a basic lua REPL plugin effectively gives me a REPL for the TS API for quick (but unsafe) experimenting and prototyping. This of course needs to be replaced eventually by at least safe lua C wrappers with proper GC support.

Awesome to see this being worked on! 😄 🎉 Re performance and sync/async, I recalled this blog post by the author of YouCompleteMe: https://plus.google.com/u/1/+StrahinjaMarkovi%C4%87/posts/Zmr5uf2jCHm. His conclusion was that serializing 100 kB of text to JSON, shipping it back and forth with HTTP and deserializing it again took 3 ms => RPC was a non-issue. The author of Xi, which is probably the most performance-oriented editor ever made, has reached the same conclusion and uses JSON RPC for the API: https://youtu.be/4FbQre9VQLI?t=664.

tl;dr let's not count out MessagePack RPC as too slow before it's actually been tested and benchmarked, the results might be surprising! 🙂

The reason to not use RPC here is not that RPC is too slow; it is rather that tree-sitter is fast enough that we can do TS syntax highlighting in the main thread. There would be no reason not to _combine_ sync _syntax_ highlighting (say typenames and struct members, like in the TS demo video) with async _semantic_ highlighting from LSP (say to distinguish function-local, static and global names without syntactic difference), or another async source.
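One way to picture the combination: the synchronous syntax layer is drawn immediately, and a semantic layer, whenever it arrives, overrides it where the two overlap. A Python sketch under that assumption follows; the layer format and the group names are invented for the example.

```python
def merge_layers(length, *layers):
    """Merge highlight layers for one line; later layers win on overlap.

    Each layer is a list of (start, end, group) with `end` exclusive.
    Returns non-overlapping runs in column order.
    """
    cells = [None] * length
    for layer in layers:
        for start, end, group in layer:
            for col in range(max(start, 0), min(end, length)):
                cells[col] = group
    runs = []
    for col, group in enumerate(cells):
        if group is None:
            continue
        if runs and runs[-1][2] == group and runs[-1][1] == col:
            runs[-1] = (runs[-1][0], col + 1, group)   # extend the current run
        else:
            runs.append((col, col + 1, group))
    return runs


line = "count = total"
syntax = [(0, 5, "Identifier"), (8, 13, "Identifier")]    # available immediately
semantic = [(8, 13, "GlobalVariable")]                    # arrives later, e.g. from LSP
before = merge_layers(len(line), syntax, [])
after = merge_layers(len(line), syntax, semantic)
```

The first redraw uses `before`; when the async source answers, the line is redrawn with `after` and only the affected span changes.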

I think I've got incremental parsing basically working. Here is a small demo (follow node at cursor position, lookup parents recursively, best-guess recovery with syntax error):
[asciicast recording of the demo]

Code coming soon (just need to fix the build/link story to not be positively insane)...
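Until that code is available, the "node at cursor, then parents" lookup from the demo can be pictured with this Python sketch. It walks a hand-built tree with byte ranges; the node types resemble what a C grammar might produce but are assumptions, and no tree-sitter call is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    start: int                    # byte offset, inclusive
    end: int                      # byte offset, exclusive
    children: list = field(default_factory=list)


def node_path_at(root, offset):
    """Return [root, ..., innermost node] for the nodes containing `offset`."""
    if not (root.start <= offset < root.end):
        return []
    path = [root]
    node = root
    while True:
        for child in node.children:
            if child.start <= offset < child.end:
                path.append(child)
                node = child
                break
        else:
            return path           # no child contains the offset: innermost reached


source = "int f(int x) { return x; }"
root = Node("translation_unit", 0, 26, [
    Node("function_definition", 0, 26, [
        Node("parameter_list", 5, 12, [
            Node("parameter_declaration", 6, 11, [Node("identifier", 10, 11)]),
        ]),
        Node("compound_statement", 13, 26, [
            Node("return_statement", 15, 24, [Node("identifier", 22, 23)]),
        ]),
    ]),
])
path = node_path_at(root, 22)     # cursor on the `x` in `return x;`
```

The last element of `path` is the node under the cursor; reading the list backwards gives the parents.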

@bfredl That’s pretty exciting!

This may be a dumb question, but where does the knowledge behind 'use of undeclared identifier’ come from? Doesn’t that require looking into all the included header files?

That is actually LSP/clangd. I first thought I should turn it off, then I thought maybe not, to showcase the point I tried to make above: combine fast sync stuff with slow async stuff (notice the longer lag until clangd had time to emit the error).

@bfredl very cool! 😄 a question: is the syntax highlighting in the example also coming from TS?

No I'm not that far yet :sweat_smile: Only the matchparen-like limits and the node name in the message area.

@bfredl how is your tree-sitter work coming along? I had to open some multi-megabyte HTML files (terms and conditions) and had to switch highlighting off to move the cursor. That is how I stumbled here!

@felipesere It will be one of my priorities after 0.4 is released (very soon, hopefully).

I'm not sure if this is off-topic but https://github.com/neovim/neovim/issues/1767#issuecomment-435932649 got me thinking: Will it be possible, once the tree-sitter branch is merged, to use language-specific syntax objects like identifier, function_definition directly from VimScript? I'm thinking of cases where for instance the iskeyword option isn't precise enough (think e.g. Rust with its foo!() and foo! being a keyword if foo is a macro, but !foo not being a keyword in if !foo {}). I see a huge potential in this. It always struck me as odd that movements like ( don't make much sense in almost any programming language, only when writing prose. With tree-sitter merged in, there could be language-specific, precise text objects available under a single key in Vim! Another example is targets.vim implementing a function argument text object. It easily gets confused when there are commas within a function argument text object. I believe tree-sitter has the potential to fix that shortcoming.

> Will it be possible, once the tree-sitter branch is merged, to use language-specific syntax objects like identifier, function_definition directly from VimScript

Yes. Directly from Lua, which is accessible from Vimscript. Initially however, we will provide only a query API. We will document patterns for using the API to query syntax, which can be used to create mappings. Later, we will think about adding first-class Normal-mode commands for common cases like "around Function" (af, if), etc.
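To show why a syntax tree helps with the function-argument case raised in the question, here is a Python sketch of an "argument under cursor" lookup. The tree is written out by hand for `f(a, g(b, c), d)`, and the `arguments` and `call_expression` node types are assumed names; the query API mentioned above is not specified in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    start: int                    # inclusive
    end: int                      # exclusive
    children: list = field(default_factory=list)


def argument_at(node, offset):
    """Return (start, end) of the innermost argument containing `offset`.

    An argument is any direct child of an 'arguments' node. Returns None
    when the offset is not inside any argument.
    """
    for child in node.children:
        if child.start <= offset < child.end:
            here = (child.start, child.end) if node.type == "arguments" else None
            return argument_at(child, offset) or here
    return None


src = "f(a, g(b, c), d)"
inner_args = Node("arguments", 6, 12, [Node("identifier", 7, 8), Node("identifier", 10, 11)])
inner_call = Node("call_expression", 5, 12, [Node("identifier", 5, 6), inner_args])
outer_args = Node("arguments", 1, 16, [Node("identifier", 2, 3), inner_call, Node("identifier", 14, 15)])
call = Node("call_expression", 0, 16, [Node("identifier", 0, 1), outer_args])

on_c = argument_at(call, 10)      # cursor on `c`
on_comma = argument_at(call, 8)   # cursor on the comma inside g(...)
```

With the cursor on the inner comma the selection is the whole `g(b, c)`, because the tree knows that comma belongs to the nested call. A splitter that only looks for commas has no way to tell.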
