I plan to work on this after #1650 is merged.
We are never going to cover everyone's Markdown Standards, so we should create a way to let the User define their own parsing Rules, Block, Inlines, Conditions, etc.
The current way I'm thinking of implementing this, is by having an event called ParsingBeginning, or something similar, in the Args for that event would be the Registered TripChars and ways of Identifying Blocks.
This way you can Add, Insert, and Remove registration for Block and Inline Creation. If, for some reason you want to disable List Block parsing, it could be done with this event.
You could then create your own Inlines to handle with your own custom logic and UI.
To implement UI for this, new methods would be added to the MarkdownRendererBase.
public override void RenderCustomBlock(MarkdownBlock block, IRenderContext context)
{
switch (block)
{
case CustomBlockType1 custom1:
RenderCustomBlockType1(custom1, context);
break;
default:
base.RenderCustomBlock(block, context);
break;
}
}
public override void RenderCustomInline(MarkdownInline inline, IRenderContext context)
{
switch (inline)
{
case CustomInlineType1 custom1:
RenderCustomInlineType1(custom1, context);
break;
default:
base.RenderCustomInline(inline, context);
break;
}
}
No response from the community. ping @nmetulev @Odonno @IbraheemOsama
This issue seems inactive. Do you need help to complete this issue?
Can we close it? The PR is merged and the related issue is closed.
@WilliamABradley can you provide an example using this?
From what I see, that this issue wasn't handled in the referenced PR.
It was just mentioned in PR in a discussion.
I need this and am happy to have a crack at writing it if no one is making progress on it?
This would be what I'm currently searching for.
An Extensible Markdown Parser that provides me with some kind of AST instead of HTML
@WilliamABradley seems like you haven't done much on this lately? Did you get started at all and have some code to share in a branch?
@knightmeister @LokiMidgard if either of you or both would want to take a poke at this, please feel free to jump in! The Markdown control is pretty popular, so I know this would be a welcome edition.
@michael-hawker I'm not sure if I'll come to this soon. However I've made some thoughts about this feature I would like to share.
What can be extended
Markdown has multiple ways to declare different kind of elements.
**, *, ...) for bold, italic and others#, -, 1., >, ...). This is used for lists, headings or quotesI only looked shortly at the source code, But if I understood it correctly Inline parsing is very modular designed. Every InlineElement seems to define characters that may start the inline element. And when a character matches the parse logic is in the inline element itself (which can fail/return null). Order seems to be important here. BoldItalic will be tried before Bold before italic.
If there was code that would dynamically compose those different InlineElements, new custom inline elements could added quite easy I think. With only the problem of ordering. This would allow 1. from the list above.
The only anomaly is parsing links. The ParseInlineChildren method has an parameter to suppress parsing links. This is only set by MarkdownLinkInline. And used by FindNextInlineElement to ignore different HyperlinkInline parse methods.
Would we need a more generic way, so when parsing inline's the caller can define which inline's should not be considered?
For blocks this is not the case. The Logic if something could be a Block is in the Document parse method. However this should be changeable.
Blocks and inline's have a enum property named Type. What will we do with this when we add custom types. Can we add a custom type to the enumeration? I think there could be concerns for background combability.
Do we need an additional Property to specify the Type if custom is used? Could the Type be enough?
Three types of blocks need special consideration since they result in little different behavior then the rest of the blocks. Quotes, underline style header and (YAML) header.
Can custom blocks styles have the same or similar behavior then those build in styles?
The underline style header will change the already parsed paragraph. The last line will be used as the Block and the text before that line will become a paragraph block.
Should we allow an underline style to manipulate more then the last line?
This style will be
Applied to the howl paragraph.
Should this be possible?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Block Quotes can be nested inside another and the Document parse logic will first determent the indentation of the quotes. And find out when the quote finishes. Some blocks also need the indention to work correctly.
Can different quote blocks be nested?
| This is a custom
| Quote block, thats
| used for sample text
> Can we quote this?
> | This is a custom
> | Quote block, thats
> | used for sample text
YAML header block is handled differently. And I must say I haven't figured out why. Normally we try to parse one block type after another. This is done in a loop
loop {
var block = null
if block is null then
block = ParseBlockOrNull1()
if block is null then
block = ParseBlockOrNull2()
if block is null then
block = ParseBlockOrNull3()
CleanUpCurrentLoop()
}
After we found a block we fall through to the end putting the block in the collection and possible emitting an additional paragraph block for those text we got before but is not part of the current block. And then we go back to the start of the loop.
The YAML header is the first block that is checked in the loop. If it matches we delete all previously text that is not yet part of a block. (Normally that would become a paragraph.) Then the YamlHeaderBlock is added and the loop continues, adding an additional block. With other blocks additional blocks are added in the next loop. I currently think this is "only" for performance. But I not completely sure. The fact that it is the first thing at all is checked by the YamlHeaderBlock parse method itself (start == 0).
Did I miss something with YAML? Otherwise we don't need a special handling.
Every Block that could not be parsed as any block will default to Paragraph. (I think that's the only way to get a paragraph block)
Should we allow to define what block will be used as default?
Should the out of the box provided blocks and inline's use the same mechanic as the custom ones?
Should those be removable?
@WilliamABradley proposed using events that are called when parsing happens. I think I would tend to use Interfaces or Abstract classes that have Methods to Parse and return an Inline or Block.
I think this would make it more straight forward to extend this Parser with Library's.
Looking at the MarkdownInline classes they all have static Parse and AddTripChars. Those Methods would be implemented in this Abstract class/interface.
```c#
public abstract AbstractInlineFactory {
public abstract InlineParseResult Parse(string markdown, int start, int maxEnd);
/// <summary>
/// Returns the chars that if found means we might have a match.
/// </summary>
protected abstract IEnumerable<char> GetTripChars();
public void AddTripChars(List<InlineTripCharHelper> tripCharHelpers){
foreach(var c in this.GetTripChars())
tripCharHelpers.Add(new InlineTripCharHelper() { FirstChar = c, Method = this });
}
public virtual IEnumerable<Type> DefaultBeforeFactorys => Array.Empty<Type>();
public virtual IEnumerable<Type> DefaultAfterFactorys => Array.Empty<Type>();
}
The two additional methods `DefaultBeforeFactorys` and `DefaultAfterFactorys` are used to order the factorys correctly. E.g. bold must be parsed before italic. If this is known by the factory it should override those methods.
For Blocks it looks similar.
```c#
public abstract AbstractBlockFactory<T>
where T : MarkdownBlock {
/// <summary>
/// Parses a block.
/// </summary>
/// <param name="markdown"> The markdown text. </param>
/// <param name="startOfLine"> The location of the start of the line. </param>
/// <param name="maxEnd"> The location to stop parsing. </param>
/// <param name="quoteDepth"> The current nesting of quotes. </param>
/// <param name="actualEnd"> Set to the end of the block when the return value is non-null. </param>
/// <returns> A parsed block. Or <c>null</c> if the text can't parsed to this <c>T</c></returns>
public abstract T Parse(string markdown, int startOfLine, int maxEnd, IEnumerable<char> quoteDepth, out int actualEnd);
public abstract bool Consider(bool lineStartsNewParagraph, char firstChar, int firstCharPos, int startOfLine);
/// <summary>
/// If <c>true</c> and Consider is also <c>true</c> Parse will get the previous line.
/// </summary>
public virtual bool IsUnderlineStyle => false;
public virtual IEnumerable<Type> DefaultBeforeFactorys => Array.Empty<Type>();
public virtual IEnumerable<Type> DefaultAfterFactorys => Array.Empty<Type>();
}
The logic if a block should be tried to parse, that is currently in the Document would go into the Consider method.
The quoteDepth parameter of Parse was changed from int to IEnumerable<char> so different characters in a specific order could be used.
I'm not sure if the TripChars are actually needed, we could consider returning null the same way BlockFactory does and continue until the first match.
The Parse could be configured like following
c#
var document = MarkdownDocument.BuildCustom()
.AddParser<AbbreviationInlineFactory>()
.Before<StrikethroughInlineFactory>()
.Before<CommentInlineFactory>()
.After<CodeInlineFactory>()
.AddParser<SampleBlockFactory>()
.RemoveParser<EmojiInlineFactory>()
.CreateDocument();
document.Parse(/*...*/);
Using the Factory's as Generic argument will not allow to configure the Factory using the constructor. The idea is that a user will not write a flexible factory that he want to use multiple times with different configurations. Not using instances but generic types allows to remove factory's without having the correct instance.
The Before and After Methods will influence the execution order. I'm not sure if those should replace the default ones defined by the factory or add to those.
There should also be a check if the ordering is valid. If A is configured to run before B and B is configured to run before A it should throw.
I also think the factory's should be supplied in the constructor or as in the sample with a builder pattern. Changing the Factory's after parsing seems strange. Instantiating the MarkdownDcoument with the parameter less constructor will result in the same behavior as today.
I haven't made much thoughts about rendering yet.
This is all thoughts until know. So there are probably some things that will not work as I currently thinking.
Thanks @LokiMidgard, appreciate the detailed breakdown and thought you've put into this. I especially like the custom parser factories being able to be injected within the specific order context.
I made some thoughts about those questions I wrote down. Here is my opinion on theses
Would we need a more generic way, so when parsing inline's the caller can define which inline's should not be considered?
I think yes. The parse method can get a enumeration of factorys that should be ignored. This would be used by the Link parsing to not parse links inside the brackets [], bold is ok. A problem I see is with custom modules that have compability problems with that and should also not be parsed inside links breakets. Or more general while parsing an inline there could be multiple regions that should be not parse specific inlines (each region could be different). Those inlines that should not be used needs to be editable when composing the factorys. After all there may be a compability issue between factorys that didn't know they existed. Not sure how to do this yet.
Do we need an additional Property to specify the Type if
customis used? Could the Type be enough?
I don't think so. Enum is nice, because you can check if your switch will handle all cases. But with having an string property that has the key I don't see any benefit over using the Type (class). I would actially Obsoleting the Type property
Can custom blocks styles have the same or similar behavior then those build in styles?
I think it would be best to actually write all the default parsing in factorys and only use factorys while parsing. Bothe parse methods (for inline and blocks) would be instance methods on MarkdownDocument
Should we allow an underline style to manipulate more then the last line?
Why not. This should not be hard. I think instead of the usage of the proposed Consider method and the property that returns true if it is an underline style, We could just pass the stringbuillder that holds the paragraph text. That way the blockparser can manipulate every parsed text that is not yet part of a block.
Can different quote blocks be nested?
It is propably a litle more complex (then the already complex) quote logic. But unless its to hard to write we should do this. From Api the API ussage perspactive I think it would make sense to support it.
Did I miss something with YAML? Otherwise we don't need a special handling.
I don't think header need special handling as long we always pass the complete markdown string with start and end index.
Should we allow to define what block will be used as default?
I don't think its nessesary. But would be interrested if someone had a usecase.
Should the out of the box provided blocks and inline's use the same mechanic as the custom ones?
yes! (I think I already said this earlier)
Should those be removable?
Yes. Build in ones may conflict with custom ones and could be replaced with an slight different implementation that does not make problems.
I'm not sure if the TripChars are actually needed, we could consider returning
nullthe same way BlockFactory does and continue until the first match.
I currently think returning null would be enough. I don't know how much performance we lose when using more method calls. And how performance criticle this libary is.
If performance is a concern maybe supporting incremental parsing would be a better Idea. But this would be a different issue.
Would we need a more generic way, so when parsing inline's the caller can define which inline's should not be considered?
I think yes. The parse method can get a enumeration of factorys that should be ignored. This would be used by the Link parsing to not parse links inside the brackets [], bold is ok. A problem I see is with custom modules that have compability problems with that and should also not be parsed inside links breakets. Or more general while parsing an inline there could be multiple regions that should be not parse specific inlines (each region could be different). Those inlines that should not be used needs to be editable when composing the factorys. After all there may be a compability issue between factorys that didn't know they existed. Not sure how to do this yet.
I'm getting the impression that there must be some kind of configuration per factory.
Eventually using an optional lambda for configuration:
c#
public DocumentBuilderConfigurator<TFactory> AddParser<TFactory>(Action<TFactory> configurationCallback = null)
where TFactory : AbstractBlockFactory, new()
I hope that this API design will still emphasize that every Type should only be used once. And it is not allowed to use the same factory multiple times with different configurations.
I tried out to implement some extensible parsing. I created a pull request (#3047) so you can give feedback if the direction I'm heading is OK.
I'll implemented only the Block Parsing for now.
I'm not happy with the quote depth or the proposed IEnumerable<char> of quote characters.
This could be a little unflexible.
-> This is some quite
| That uses different
| characters in different
| Lines
\ And this one
/ is alternating
\ Every
/ line
\ switches
/ the character
Bothe would not work with an array of character's, since those would could change every line.
Maybe having an Array of Delegates that will be called at the beginn of each line dynamicly checking the characters and returning the index where the sub text starts.
Alternetivly extracting the quoted text (over all lines) and parsing it seperatly. That way sub quotes would just parse normaly. we would no longer need the quote depth. Drawback. for every quote we would allocate a new string. Or we would need a special structure that behaves like a string and can skip characters from an original string.
I'm currently considering the last, and will try to implement that.
I will create a class called StringFilter that for now will just wrap a string.
Later I would like to try to replace it with an Array of ReadOnlyMemory<char> and an indexer like this, maybe with a litle caching so iterating does not start always from the beginning:
```c#
public char this[int index]
{
get
{
if (index < 0)
{
throw new IndexOutOfRangeException($"The provided {nameof(index)} ({index}) was < 0");
}
int position = 0;
for (int i = 0; i < spans.Length; i++)
{
if (index < position + spans[i].Length)
{
return spans[i].Span[position - index];
}
position += spans[i].Length;
}
throw new IndexOutOfRangeException($"The provided {nameof(index)} ({index}) was >= then the length ({position})");
}
}
```
Also with most of the string methods. However, I'm not sure if this will actually be more performend then allocating a new string once and then just using the string methods.
@michael-hawker Is it OK to add a reference to SystemMemory? Until now the project only has references to build tools.
I also refactored Inline elements now and removed quoteDepth.
There are currently 9 Tests in the UnitTests.Markdown.Parse failing. However They also failed before I made code changes. I haven't looked in those errors yet.
What's still missing:
The next thing I'll look into will be No 2 I think.
Is there any spec how this markdown should get parsed exactly?
Especially quotes.
A Quote
>> is also part of that quote> after a line with > is not considered a quote.> Single line
>Quote
with line break
> Spaces
> and line continuation
normal text
Results in
Single lineQuote\r\nwith line breakSpaces and line continuationnormal textMarkdownLinkInline. So it can be configured to ignore additional link parsers.Other to both that can be used if the inline or block does not fit in one of the other types.All Test but 4 (see pull request) passes.
I think, from the implementing side most is done now. I need to look at the PullRequest check list now, I guess.
While I started with the documentation, I thought it would be better if the Block and Inline parsing would be more simil盲r.
Currently inlines get the current position, and a position before that includes the text that was not yet parsed by any inline parser. It returns a class that has the inline element and the information of the position of this inline element. The start can be before the current parsing position.
Blocks get the current position and a StringBuffer that holds all text not yet parsed by previous blocks. The returned Block always starts at the current parsing position and returns an index untill it was parsed. If a Block want's to look back it can manipulated the StringBuffer.
Underline Styles use this for blocks. And the EmailAddress parser uses this for inline.
I think the way like Inline works is more elegant.
In addition I've seen that the logic to handle double spaces at the end of a line for a line break, is performed by block and document parsing logic. The document add line breaks to the StringBuffer mentioned above, if it sees double whitespaces at the end of a line. ListBlockParser is doing the same. Possible others too.
I think this logic should be part of the InlineTextRun parser instead of the Block parsers. So not everyone needs to reimplement this logic.
I first thought to deal with this after this feature is finished. But changing it will also change the API and behavior. This would be a breaking change providing the parse methods public.
So probably this should be taken care of while implementing the feature.
Thanks for all the updates @LokiMidgard, I'll try and go through them with a bit of a closer eye to get caught up next week after the release.
Will this change the MarkdownRenderer interface or is this work separate to move it up the pipeline?
For example, we use this in our Sample app to do some custom parsing/processing: https://github.com/windows-toolkit/WindowsCommunityToolkit/blob/master/Microsoft.Toolkit.Uwp.SampleApp/Controls/SampleAppMarkdownRenderer.cs
I'll have added two virtual methods to the renderer, these will be called if an unknown element is encountered. But I could revert this. I'm not happy with this solution.
They are virtual so they break no one.
Finally had time to finish most of the stuff to get this finished.
I have two things that need to be done
- [ ] Tested code with current supported SDKs
- [ ] Contains NO breaking changes
@michael-hawker can you point me in the right directions and what should be changed and what should stay.
How can I test against previous windwos version SDKs? I only have one machine to test on?
I'm not sure if I have breaking changes (I think I do). Some of the unit tests did failed before the refactoring. Now these "bug's" are fixed and text that previous resulted in one AST now results in another.
There are multiple tests that would fail without the breaking change.
Can I change/remove those tests? Or are the tests how it was intended and we want those breaking changes? I personally would like to keep the old behavior for those.
A third option would be supporting both, After all we can now configure the Parser.^^
Failing Tests if behavior does not change
The problem is that blank line without a quote character in the middle of a quote should not split the quote.
> Single line
>Quote
with line break
> Spaces
> and line continuation
normal text
The tests says this should parsed in ONE QuoteBlock containing 3 ParagraphBlocks and an additional ParagraphBlock after the QuoteBlock.
The old implementation parsed a sequence of 3 QuoteBlocks containing one ParagraphBlock followed by a ParagraphBlock.
The sample of MarkdownTextBlock suggests that the old behavior is correct:
>Quote1
>Quote2.1
>>Quote2.Nest1.1
>>
>>Quote2.Nest1.2
>
>Quote2.3
>Quote3.1
>Quote3.2
>Quote4.1
>
>Quote4.2
>Quote5.1
Quote5.2
>Quote6
Plain text.
Two tests make problems here:
Neither [text](ha) nor [text](www.reddit.com) may be parsed. Relative links are allowed (there are other tests for /blog and #abc) but not "invalid" links or domains. Currently to pass those tests I require relative links to start with a / or a #. The MarkdownTextBlock sample also have a ../ relative link that is currently not parsed correctly.
I think every link that does not start with a schema should just pass. after all there could be a file named www.reddit.com or ha lying next to the current document.
The Uri class can't handle spaces in domain's. And I think they are not allowed, but I didn't find the correct specification. However the old implementation could handle an domain with space and replaced it with %20 (which Uri will also not accept).
Currently two tests fail because the parser does not support spaces in domain.
In this case I wouldn't mind this breaking change. Using the Uri class to check if an absolute URL is correct and what parts it has seems better then manually parsing the string. The later is just another surface for bugs. And if spaces are actually invalid for domains there should be no existing document where this would make problems.
Of course if we would just allow anything to be parsed as an URL we wouldn't need to check using the Uri class.
@LokiMidgard See PR #1974 about the Quote Formatting
Thank you @WilliamABradley. That helped a lot.
I updated the Quote tests. There is now one difference compared to the previous behavior.
>Quote1.1
>>Quote1.Nest1.1
>
>>Quote1.Nest2.1
Parses currently:
Quote1.1
Quote1.Nest1.1
Quote1.Nest2.1
Old behavior was:
Quote1.1
Quote1.Nest1.1
Quote1.Nest1.2
The current behavior is like CommonMark does it. And I think it make sense that putting quote characters before some text does not change the AST of the now quoted text.
The Quote parser still has the code to switch the behavior, I just need to implement something that it knows it is currently inside another QuoteBlock so it will parse differently.
If we don't need that I could delete a bunch of code that makes QuoteBlock a little more complex.
@michael-hawker Any Update? Can I remove the "draft" from the pull request?
I created a package in my github nuget package source and started using this.
I hope I find some design flaws I've created when I refactored the code.
I already found some some problems with visibility so far. But I hope the general concept holds.
I head a new Idea how to prevent string operations that will allocate new strings.
The implementation up to now uses the markdown string and also has for every method call index positions for start and end (and some more). This works well for most things. But not for the Quoteblock. That will allocate a new string to parse its contents.
The original implementation also head a quote position to skip the first x characters of a line. But this was not flexible enough when we wan't to support custom parser. I also needed some time to understand how it did work. So at least for me it was a bit more complicated then I would like.
I like the Span<T> approach where I do not need to provide start and end/length but slice the span before I use it. The idea is to have something simular with lines.
A readonly ref struct that will expose lines of ReadonlySpan<char>. Each line can individually sliced, and so can complete lines.
I started with a prototype of such a struct and will try to get it to work with the parsers. I hope this will not take to long to get it working again. Maybe some days (or a few weeks).
By the way, unlike Span<T> this will still allocate for some operations. Generally when changing individuall lines. Slicing whole lines will not allocate. and I think it is possible to prevent allocation if we slice the text from start or end, only modifing the first or last line.
I made some performance test, comparing the original version to the current in the branch.
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Original : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
current : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Runtime=.NET Core 3.1
| Job | txt | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |-------------------- |---------:|----------:|----------:|----------:|------:|--------:|--------:|--------:|------:|----------:|
| Original | Thi(...)END [15718] | 1.024 ms | 0.0385 ms | 0.1104 ms | 0.9670 ms | 1.00 | 0.00 | 82.0313 | 27.3438 | - | 339.59 KB |
| current | Thi(...)END [15718] | 3.422 ms | 0.0678 ms | 0.0858 ms | 3.3921 ms | 2.89 | 0.21 | 39.0625 | 11.7188 | - | 172.80 KB |
When I started, I was 18 times slower then the original. Now it's "only" around 3 times slower. My goal would be 1.0 but I'm not sure if I'll get much faster then now.
The text I use is the sample text from the SampleApp.
Hey @LokiMidgard, sorry for the lack of response lately, there's been a lot going on since the holidays.
I know you've just put a ton of work into these updates; however, we were just discussing recently, because of all the fiddliness and performance implications of Markdown, if it made sense to leverage a more well-tested implementation of a Markdown parser compared to our own custom one. Then just connect it to the renderer we have in the Toolkit for the MarkdownTextBlock (see Issue #3200).
I'd love to hear your input on this new proposal. @MihaZupan from that project has also been helping us evaluate how our Toolkit scenarios can translate. I'm sure he understands the efforts you've gone through above to make sense of the Markdown specs.
@michael-hawker No problem.
I can totally understand that decision. And I think it makes sense. When I made the performance tests and couldn't get near the original performance, I already questioned if it is a good idea to integrate this. Having a drastically performance drop to add a feature that probably many that use this library would not need, don't seems to be the correct choice.
I will probably fork the current Parser, since what I need is a Markdown Parser that I can customize for my own Markdown Flavor. After all the reason I've done this wasn't altruism, but I needed such a feature ;)
Your Project helped me a lot in getting where I am now with this, so thank you all for that.
Thanks @LokiMidgard for your understanding. Appreciate your input on the other thread as well, and I'm glad you've still found use from the work you've done.
With everything in mind though, I think it makes sense to just close your PR #3047 for now until we know the full path forward? Do you have any problems with that?
Thanks again!
No I closed it.
Most helpful comment
@michael-hawker No problem.
I can totally understand that decision. And I think it makes sense. When I made the performance tests and couldn't get near the original performance, I already questioned if it is a good idea to integrate this. Having a drastically performance drop to add a feature that probably many that use this library would not need, don't seems to be the correct choice.
I will probably fork the current Parser, since what I need is a Markdown Parser that I can customize for my own Markdown Flavor. After all the reason I've done this wasn't altruism, but I needed such a feature ;)
Your Project helped me a lot in getting where I am now with this, so thank you all for that.