Powershell: Add Span based API to script extents

Created on 30 Jan 2019  ·  33Comments  ·  Source: PowerShell/PowerShell

Summary of the new feature/enhancement

Currently if you use the InternalScriptExtent.Text property it will allocate a string of the section of the script that the extent represents. It would be a nice quality of life for compiled editor related projects if we could get a ReadOnlySpan<char> instead.

Proposed technical implementation details (optional)

This could be surfaced with a new interface that implements IScriptExtent.

public interface ISpanEnabledScriptExtent : IScriptExtent
{
    ReadOnlySpan<char> GetTextSpan();
}

And an implementation in InternalScriptExtent

public ReadOnlySpan<char> GetTextSpan()
{
    // StartOffset can be > the length for the EOF token.
    if (StartOffset > PositionHelper.ScriptText.Length)
    {
        return ReadOnlySpan<char>.Empty;
    }

    return ((ReadOnlySpan<char>)PositionHelper.ScriptText).Slice(
        StartOffset,
        EndOffset - StartOffset);
}
Committee-Reviewed Issue-Enhancement

Most helpful comment

I still think there is value in the original proposal. Ultimately @lzybkr's suggestion is better long term, but I think it's a deceptively large undertaking.

I'll include a rough outline of how I picture the changes below, but before that a few things to consider.

  1. Every AST type would need new public constructors to handle the alternate way of providing extent information.

  2. Tokens would also likely new constructors (though there are no public constructors currently)

  3. A significant amount of logic in the parser (and at least some in the compiler) would need to be updated.

I've always wanted to see a separate parser package focused on supporting editors and other related tools. In something like that I think this change would be great because you could throw out IScriptExtent entirely. I could be missing an easy solution (@lzybkr?) but it looks like the amount of work would not likely be worth the performance gain.

Here's my idea of how it could be done, ignoring thread safety and error checking:

// Public version of PositionHelper that is easier to use and create, and stores script text
// as ReadOnlyMemory<char>.
public class ScriptDocumentInfo
{
    private PositionHelper _positionHelper;

    public ScriptDocumentInfo(string file, ReadOnlyMemory<char> scriptText, int[] lineStartMap)
    {
        File = file;
        ScriptText = scriptText;
        LineStartMap = lineStartMap;
    }

    public ReadOnlyMemory<char> ScriptText { get; }

    public int[] LineStartMap { get; }

    public string File { get; }

    // This could possibly be omitted and instead use a new IScriptExtent implementation
    // that would utilize ScriptDocumentInfo instead of PositionHelper.
    internal PositionHelper PositionHelper => _positionHelper ?? _positionHelper = CreatePositionHelper(this);

    public static ScriptDocumentInfo Create(string file, ReadOnlyMemory<char> input)
    {
        // Line map creation from Tokenizer.Initialize method.
        List<int> lineStartMap = new List<int>(100) { 0 };
        for (int i = 0; i < input.Length; ++i)
        {
            char c = input[i];

            if (c == '\r')
            {
                if ((i + 1) < input.Length && input[i + 1] == '\n')
                {
                    i += 1;
                }

                lineStartMap.Add(i + 1);
            }

            if (c == '\n')
            {
                lineStartMap.Add(i + 1);
            }
        }

        return new ScriptDocumentInfo(file, input, lineStartMap.ToArray());
    }

    private static PositionHelper CreatePositionHelper(ScriptDocumentInfo document)
    {
        var helper = new PositionHelper(document.File, document.ScriptText.ToString());
        helper.LineStartMap = document.LineStartMap;
        return helper;
    }
}

// Every AST would need new constructors that will take either ScriptSpan or
// both ScriptSpan and ScriptDocumentInfo (for top level ASTs)
public partial class Ast
{
    private IScriptExtent _scriptExtent;

    private ScriptDocumentInfo _document;

    protected Ast(ScriptSpan scriptSpan)
    {
        ScriptSpan = scriptSpan;
    }

    protected Ast(ScriptDocumentInfo document, ScriptSpan scriptSpan)
    {
        _document = document;
        ScriptSpan = scriptSpan;
    }

    // Original extent based constructor should still populate ScriptSpan and Document.
    protected Ast(IScriptExtent extent)
    {
        _scriptExtent = extent;
        _document = ScriptDocumentInfo.Create(
            extent.File,
            extent.Position.GetFullScript());
        ScriptSpan = new ScriptSpan(extent.StartOffset, extent.EndOffset);
    }

    public ScriptSpan ScriptSpan { get; }

    // Lazily create IScriptExtent from ScriptSpan.
    public IScriptExtent Extent => _scriptExtent ?? _scriptExtent = ScriptSpan.ToScriptExtent(Document);

    public ScriptDocumentInfo Document
    {
        get
        {
            if (_document != null)
            {
                return _document;
            }

            // Get document from a parent if it's not present on this node.
            for (Ast node = Parent; node != null; node = node.Parent)
            {
                if (node._document != null)
                {
                    return _document = node._document;
                }
            }
        }
    }
}

// I used the name ScriptSpan as sort of a reference to Microsoft.CodeAnalysis.Text.TextSpan but
// System.Span makes that a little confusing.
public readonly struct ScriptSpan
{
    public readonly int StartOffset;

    public readonly int EndOffset;

    public ScriptSpan(int startOffset, int endOffset)
    {
        StartOffset = startOffset;
        EndOffset = endOffset;
    }

    // DocumentInfo must be provided when context is needed to keep the struct size small.
    public IScriptExtent ToScriptExtent(ScriptDocumentInfo documentInfo)
    {
        return new InternalScriptExtent(
            documentInfo.PositionHelper,
            StartOffset,
            EndOffset);
    }

    public ReadOnlySpan<char> GetTextSpan(ScriptDocumentInfo documentInfo)
    {
        return documentInfo.ScriptText.Span.Slice(StartOffset, EndOffset);
    }
}

All 33 comments

@SeeminglyScience Please clarify how do you plan get benefits in an editor if InternalScriptExtent is internal class?

@iSazonov The new method would be accessible via the proposed public interface.

I mention InternalScriptExtent specifically because it's the default IScriptExtent created by the parser. If you're working with extents that you haven't created yourself, they will most commonly be (by a large margin) InternalScriptExtent.

There also isn't much value in adding it the public ScriptExtent because you have to pass a string of it's contents to the constructor. Ideally ScriptExtent would still implement it, but mainly just for consistency.

@SeeminglyScience Thanks for clarify. What specific editor project do you mean? I think it will be more convincing for the committee to approve the new public API.
Also I think it should be in IScriptExtent which is already public and I suggest AsSpan().

Thanks for clarify. What specific editor project do you mean? I think it will be more convincing for the committee to approve the new public API.

The suggestion came about while working on the overhaul for EditorServicesCommandSuite which utilizes Span heavily. That said, when Span is eventually added to netstandard there are a lot of places in PowerShellEditorServices where this would be useful.

Also I think it should be in IScriptExtent which is already public

Wouldn't that be a breaking change for current implementations of IScriptExtent? Not sure how many external implementations are out there, but at the very least PowerShellEditorServices and EditorServicesCommandSuite both have at least one.

I suggest AsSpan().

AsSpan is already used a lot as extension methods in instances where a cast would normally work. This is most often used targeting netstandard because the nuget package doesn't include the implicit operators for types like string and T[].

I think AsSpan would imply you are getting the extent itself as a span instead of just the string it represents. I guess GetSpan has a similar problem though now that I think about it. Maybe GetTextSpan would be better?

Wouldn't that be a breaking change for current implementations of IScriptExtent?

With C# 8.0 we get default interface implementations.

I think AsSpan would imply …

Your request is about Text property so TextAsSpan() will be that you want :-)

With C# 8.0 we get default interface implementations.

Okay now that's cool. I had heard about that but I had assumed it would be behind-the-scenes compiler magic. In case anyone else had the same assumption, here's a sharplab showing it's supported in the CLR. Neat 🙂

Since the EMCA spec says interface's cannot provide concrete instance methods, do you happen to know if full CLR already supports this? Core CLR is in general a bit more lax in it's runtime checks from what I've seen.

My understanding is that .Net Framework get new features slowly and only if it is not breaking backward compatibility. All new features is added to .Net Core. So it may not appear at all in Framework.

What I'm more curious about is if it just happens to work. In other words, the spec might say it's not supported but that doesn't mean they specifically coded against it outside of compilers. I'll do some testing with MetadataBuilder to see how/if it blows up. If it does then a new interface would be better, assembly load failures at runtime are no fun.

Edit: Both core and full CLR currently throw when loading an interface with a non-abstract instance method 😕

Seems we could benefit from the API internally too.

@SeeminglyScience Please update the issue description.

@SteveL-MSFT Could PowerShell Committee review the new proposed API?

As long as you're thinking about performance and extents - there are also some nice possible gains in revisiting how the parser stores extents.

Some background- I think I over-generalized with IScriptExtent - my thinking at the time was that there would be tools generating a PowerShell Ast that did not generate script and might want a custom way to represent the extent. I'm not aware of such tools.

If the Ast stored a concrete type (not IScriptExtent) - it could be struct saving some memory (assuming extents aren't shared (I don't think they are) and can't be shared effectively. Some asts will have the same extent, but is sharing a bigger win than a struct?

Also possible - the ScriptText property only needs to be stored in the root of the Ast - children could simply store the start/end pair, and a proper ScriptExtent instance would be synthesized on demand.

Some background- I think I over-generalized with IScriptExtent - my thinking at the time was that there would be tools generating a PowerShell Ast that did not generate script and might want a custom way to represent the extent. I'm not aware of such tools.

Sounds like something I'd do, got any fun ideas? 😉

(assuming extents aren't shared (I don't think they are))

Some AST's share extents with tokens, like string constant expressions:

using namespace System.Management.Automation.Language

$t = $null
$a = [Parser]::ParseInput("'test'", [ref] $tokens, [ref] $null)
[object]::ReferenceEquals(
    $a.EndBlock.Statements[0].PipelineElements[0].Expression.Extent,
    $t[0].Extent)
#  True

Also possible - the ScriptText property only needs to be stored in the root of the Ast - children could simply store the start/end pair, and a proper ScriptExtent instance would be synthesized on demand.

❤️

Related #1806 #7857 #7887 - Trivia

the suggestion from @lzybkr suggests that adding to IScriptExtent might not be the best way to go. Could we get an updated proposal along those lines?

I still think there is value in the original proposal. Ultimately @lzybkr's suggestion is better long term, but I think it's a deceptively large undertaking.

I'll include a rough outline of how I picture the changes below, but before that a few things to consider.

  1. Every AST type would need new public constructors to handle the alternate way of providing extent information.

  2. Tokens would also likely new constructors (though there are no public constructors currently)

  3. A significant amount of logic in the parser (and at least some in the compiler) would need to be updated.

I've always wanted to see a separate parser package focused on supporting editors and other related tools. In something like that I think this change would be great because you could throw out IScriptExtent entirely. I could be missing an easy solution (@lzybkr?) but it looks like the amount of work would not likely be worth the performance gain.

Here's my idea of how it could be done, ignoring thread safety and error checking:

// Public version of PositionHelper that is easier to use and create, and stores script text
// as ReadOnlyMemory<char>.
public class ScriptDocumentInfo
{
    private PositionHelper _positionHelper;

    public ScriptDocumentInfo(string file, ReadOnlyMemory<char> scriptText, int[] lineStartMap)
    {
        File = file;
        ScriptText = scriptText;
        LineStartMap = lineStartMap;
    }

    public ReadOnlyMemory<char> ScriptText { get; }

    public int[] LineStartMap { get; }

    public string File { get; }

    // This could possibly be omitted and instead use a new IScriptExtent implementation
    // that would utilize ScriptDocumentInfo instead of PositionHelper.
    internal PositionHelper PositionHelper => _positionHelper ?? _positionHelper = CreatePositionHelper(this);

    public static ScriptDocumentInfo Create(string file, ReadOnlyMemory<char> input)
    {
        // Line map creation from Tokenizer.Initialize method.
        List<int> lineStartMap = new List<int>(100) { 0 };
        for (int i = 0; i < input.Length; ++i)
        {
            char c = input[i];

            if (c == '\r')
            {
                if ((i + 1) < input.Length && input[i + 1] == '\n')
                {
                    i += 1;
                }

                lineStartMap.Add(i + 1);
            }

            if (c == '\n')
            {
                lineStartMap.Add(i + 1);
            }
        }

        return new ScriptDocumentInfo(file, input, lineStartMap.ToArray());
    }

    private static PositionHelper CreatePositionHelper(ScriptDocumentInfo document)
    {
        var helper = new PositionHelper(document.File, document.ScriptText.ToString());
        helper.LineStartMap = document.LineStartMap;
        return helper;
    }
}

// Every AST would need new constructors that will take either ScriptSpan or
// both ScriptSpan and ScriptDocumentInfo (for top level ASTs)
public partial class Ast
{
    private IScriptExtent _scriptExtent;

    private ScriptDocumentInfo _document;

    protected Ast(ScriptSpan scriptSpan)
    {
        ScriptSpan = scriptSpan;
    }

    protected Ast(ScriptDocumentInfo document, ScriptSpan scriptSpan)
    {
        _document = document;
        ScriptSpan = scriptSpan;
    }

    // Original extent based constructor should still populate ScriptSpan and Document.
    protected Ast(IScriptExtent extent)
    {
        _scriptExtent = extent;
        _document = ScriptDocumentInfo.Create(
            extent.File,
            extent.Position.GetFullScript());
        ScriptSpan = new ScriptSpan(extent.StartOffset, extent.EndOffset);
    }

    public ScriptSpan ScriptSpan { get; }

    // Lazily create IScriptExtent from ScriptSpan.
    public IScriptExtent Extent => _scriptExtent ?? _scriptExtent = ScriptSpan.ToScriptExtent(Document);

    public ScriptDocumentInfo Document
    {
        get
        {
            if (_document != null)
            {
                return _document;
            }

            // Get document from a parent if it's not present on this node.
            for (Ast node = Parent; node != null; node = node.Parent)
            {
                if (node._document != null)
                {
                    return _document = node._document;
                }
            }
        }
    }
}

// I used the name ScriptSpan as sort of a reference to Microsoft.CodeAnalysis.Text.TextSpan but
// System.Span makes that a little confusing.
public readonly struct ScriptSpan
{
    public readonly int StartOffset;

    public readonly int EndOffset;

    public ScriptSpan(int startOffset, int endOffset)
    {
        StartOffset = startOffset;
        EndOffset = endOffset;
    }

    // DocumentInfo must be provided when context is needed to keep the struct size small.
    public IScriptExtent ToScriptExtent(ScriptDocumentInfo documentInfo)
    {
        return new InternalScriptExtent(
            documentInfo.PositionHelper,
            StartOffset,
            EndOffset);
    }

    public ReadOnlySpan<char> GetTextSpan(ScriptDocumentInfo documentInfo)
    {
        return documentInfo.ScriptText.Span.Slice(StartOffset, EndOffset);
    }
}

I think we're thinking roughly the same thing, the primary difference would just be an optimization in not storing ScriptDocumentInfo redundantly in every Ast.

I would like to see _scriptExtent eventually go away. To start maybe it'd be important for performance, but if folks can move away from using the Extent property (maybe deprecate it?) then creating a new one every time wouldn't be too big a deal and it would be much better for gc perf - fewer fields to scan.

I would like to see _scriptExtent eventually go away. To start maybe it'd be important for performance, but if folks can move away from using the Extent property (maybe deprecate it?) then creating a new one every time wouldn't be too big a deal and it would be much better for gc perf - fewer fields to scan.

What about using a static ConditionalWeakTable<Ast, IScriptExtent> so reference equality is kept?

A ConditionalWeakTable could be worse depending on how often the Extent property is used.

I forgot to mention - I'm not excited about basically doubling the constructors. I know it would be a breaking change to remove the variants taking IScriptExtent, but that change might be low impact.

If I understand correctly the idea we keep source code (file) on top level and reference with (start, length) from ast tree. In the case if the file changed (by user in editor) whole ast tree (or all referenses) is needed to rebuild (recalculated). Does it cause a performance problem?

A ConditionalWeakTable could be worse depending on how often the Extent property is used.

Yeah, my thought process was if Extent is deprecated then the performance hit would be acceptable to avoid breaking changes while still optimizing the recommended method.

I forgot to mention - I'm not excited about basically doubling the constructors. I know it would be a breaking change to remove the variants taking IScriptExtent, but that change might be low impact.

That change would have a pretty high impact on quite a few things I'm doing, but that's probably not a great indicator.

@iSazonov

If I understand correctly the idea we keep source code (file) on top level and reference with (start, length) from ast tree. In the case if the file changed (by user in editor) whole ast tree (or all referenses) is needed to rebuild (recalculated). Does it cause a performance problem?

The AST is immutable, so everything is rebuilt with every change already.

The AST is immutable, so everything is rebuilt with every change already.

Interesting, what is Roslin behavior in such case?

Interesting, what is Roslin behavior in such case?

Also immutable so changes are done by recreating the whole tree there as well. They do have some support for incremental parsing though. I haven't looked into it too much, but I would assume they still need to recreate the whole tree with incremental parsing. It's probably used to avoid needing to tokenize the whole source again.

@SeeminglyScience Thanks! I look shortly the Roslyn sources - they uses immutable structs.
For reference https://stackoverflow.com/questions/25076035/roslyn-change-textspan-of-syntaxtoken-or-syntaxnode/25085585#25085585

The Roslyn tree is essentially immutable, but the do smart things to avoid rebuilding the entire tree as that wouldn't scale to large projects

PowerShell can't do things quite like Roslyn because of the Parent link. When I added that, it was a very conscious decision to make certain types of analysis much simpler given that the notion of a project does not exist in PowerShell and scripts tend to be reasonably small in comparison to C# code, e.g. you can do much more in fewer lines of code in PowerShell.

@lzybkr Thanks for sharing the design decision! I would add that PowerShell Parser is very fast so that we shouldn't worry even about modules consisting of thousands of lines that seems to be covering all reasonable scenarios.

I suggest that PowerShell Committee consider this idea (The ScriptText property only needs to be stored in the root of the Ast - children could simply store the start/end pair, and a proper ScriptExtent instance would be synthesized on demand.) so that we can continue to work in the right direction.

@iSazonov - committee involvement isn't necessary for implementation details like that unless there is a breaking change.

I think some complexity can be avoided with a breaking change, but it isn't strictly necessary to get some nice wins.

@lzybkr Initial request is to enhance IScriptExtent (new API). Opposite proposal is to deprecate this interface (a breaking change). The third sentence is to save the interface but implement a new API (additional complicity). It is PowerShell Committee competence.

@SeeminglyScience Is this issue still seeking for adding a new property to IScriptExtent? Or, has the goal been changed to pursue a more fundamental change? If it's the latter, then a RFC would be desired to keep various ideas and discussions well captured. If it's the former, then it will be great that you update the issue description to reflect the most up-to-date proposal of it.

Overall, I think adding TextAsSpan to IScriptExtent is feasible with the upcoming "interface default member" feature in C#. But I'm afraid it might cause you problems when the editor service code that take advantage the new member is running on an older version of pwsh (or maybe that won't happen?).

Is this issue still seeking for adding a new property to IScriptExtent?

It is to add a new interface with the GetTextSpan method (or whatever name is decided, but that is my vote). I believe it should be on a new interface to avoid breaking IScriptExtent implementations that target PowerShell Standard.

Or, has the goal been changed to pursue a more fundamental change?

I think the changes @lzybkr brought up are ultimately the right move at some point in the future. They do however come with a significant amount of risk and effort as IScriptExtent is used heavily in the parser. For those changes to be meaningful at all, that logic would all need to be changed to use the new API. It's also difficult (for me at least) to tell how exactly large of a benefit those changes would really be. In other words I think it's the right move but I can't commit that much spare time to it.

But I'm afraid it might cause you problems when the editor service code that take advantage the new member is running on an older version of pwsh (or maybe that won't happen?).

Yeah I did some tests with MetadataBuilder. Both core and full CLR currently throw on type gen if it comes cross an interface member with a concrete implementation.

Both core and full CLR currently throw on type gen if it comes cross an interface member with a concrete implementation.

My understanding is that it is for binary compatibility - you can load new dll in old code and run.
Also PowerShell Standard can have versions too.

My understanding is that it is for binary compatibility - you can load new dll in old code and run.
Also PowerShell Standard can have versions too.

Ah right I'm thinking about it backwards (loading the new IScriptExtent into an older runtime), maybe that would work. Though from the perspective of IScriptExtent it wouldn't have access to PositionHelper. It would have to look get the full script text with something like IScriptExtent.StartPosition.GetFullScript() before it could get the span.

I'd ask MSFT team weigh the paths

  • if we add TextAsSpan to IScriptExtent today and later deprecate it with new solution - it will then breaking change and extra work
  • there are some fundamental issues that could be considered together - standalone Parser, trivia, the new shared extent proposal - if we want to change something then it must be a balanced movement.

@PowerShell/powershell-committee reviewed this, the desired behavior makes sense and we approve

@PowerShell/powershell-committee reviewed this, the desired behavior makes sense and we approve

To clarify, "the desired behavior" means the proposed behavior: add Span based API to script extents to avoid SubString calls. The implementation detail was not in the committee discussion.

Was this page helpful?
0 / 5 - 0 ratings