This is something I'm unsure of... would love feedback.
There are a few different ways a text node could be serialized, and they have different tradeoffs. I'm opening this issue because I'm not sure the structure we've currently chosen makes the right tradeoffs, and if it doesn't, I want to fix this sooner rather than later.
For example, given the text:
A line of rich text.
There are a handful of ways to represent it...
This is the current structure. If a text node has marks in it, the ranges array contains ranges that split up the text, according to the overlapping marks in each section. You'd end up with JSON of:
```js
{
  kind: 'text',
  ranges: [
    {
      text: 'A ',
      marks: [],
    },
    {
      text: 'line',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' of ',
      marks: [],
    },
    {
      text: 'rich',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' text.',
      marks: [],
    }
  ]
}
```
```js
// Concatenate the text of every range.
function toPlaintext(node) {
  return node.ranges.map(r => r.text).join('')
}
```

```js
// Wrap each range's text in one element per mark, then join the ranges.
function toXml(node) {
  return node.ranges.map((range) => {
    return range.marks.reduce((xml, mark) => {
      return `<${mark.type}>${xml}</${mark.type}>`
    }, range.text)
  }).join('')
}
```
This is similar to the approach Prosemirror uses, although its version has a text node for each range, instead of a text node comprising a list of ranges.
- It's easy to render marks (e.g. <bold>rich</bold>) because you can simply iterate through the ranges array and build them.
- Getting the full text string takes node.ranges.map(r => r.text).join(''), which gives it to you.
- The simple case of a text node with no marks still needs a ranges array that is populated with a single range of text, which is slightly more complex.

Another approach would be to keep the text as a single string, and have the marks accompanied by offsets in the string, like so:
```js
{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      start: 2,
      end: 6,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      start: 10,
      end: 14,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
  ]
}
```
```js
function toPlaintext(node) {
  return node.text
}
```

```js
function toXml(node) {
  return ????
}
```
This is the approach Draft.js uses. Although instead of start/end it uses offset/length, which would match our operations more, so that might be preferred. (They're probably pretty equivalent since either can be derived easily from the other.)
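To get a feel for what a toXml might look like for the offset-based structure above, here is a rough sketch. It is not taken from Slate or Draft.js. It assumes the ranges never overlap (each range carries the full list of marks for its span, as in the example) and, like the earlier toXml, it does no XML escaping.

```js
// Sketch only: assumes non-overlapping ranges with start/end offsets.
function toXml(node) {
  // Walk the ranges in document order without mutating the input.
  const sorted = [...node.ranges].sort((a, b) => a.start - b.start)
  let xml = ''
  let cursor = 0

  for (const range of sorted) {
    // Unmarked text between the previous range and this one.
    xml += node.text.slice(cursor, range.start)

    // Wrap the marked slice in one element per mark, innermost first.
    xml += range.marks.reduce(
      (inner, mark) => `<${mark.type}>${inner}</${mark.type}>`,
      node.text.slice(range.start, range.end)
    )
    cursor = range.end
  }

  // Trailing unmarked text after the last range.
  return xml + node.text.slice(cursor)
}

const example = {
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    { start: 2, end: 6, marks: [{ kind: 'mark', type: 'bold', data: {} }] },
    { start: 10, end: 14, marks: [{ kind: 'mark', type: 'bold', data: {} }] },
  ],
}

console.log(toXml(example))
// A <bold>line</bold> of <bold>rich</bold> text.
```

So it is doable, but every consumer has to track a cursor and slice the string, instead of just mapping over ranges.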
- It's harder to render marks (e.g. <bold>rich</bold>) because you can't just loop the ranges. (Unsure how hard this actually is?)

Another approach would be to treat the marks themselves as the primary grouping factor, resulting in the least possible duplication of the mark value, which is where the most size can be wasted.
```js
{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {}},
      indexes: [
        { start: 2, end: 6 },
        { start: 10, end: 14 },
      ]
    },
  ]
}
```
```js
function toPlaintext(node) {
  return node.text
}
```

```js
function toXml(node) {
  return ????
}
```
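One possible way to render from this mark-grouped structure is to first normalize it back into split-text ranges and then reuse the simple range-based toXml. The sketch below does that. toSplitRanges is a made-up helper name, not something Slate provides, and it assumes every start/end pair lies within the text.

```js
// Sketch only: converts mark-grouped indexes into split-text ranges.
function toSplitRanges(node) {
  // Collect every offset at which the set of active marks can change.
  const cuts = new Set([0, node.text.length])
  for (const group of node.ranges) {
    for (const { start, end } of group.indexes) {
      cuts.add(start)
      cuts.add(end)
    }
  }
  const offsets = [...cuts].sort((a, b) => a - b)

  const ranges = []
  for (let i = 0; i < offsets.length - 1; i++) {
    const start = offsets[i]
    const end = offsets[i + 1]
    // A mark applies to this slice if one of its index pairs covers it.
    const marks = node.ranges
      .filter(group =>
        group.indexes.some(ix => ix.start <= start && end <= ix.end)
      )
      .map(group => group.mark)
    ranges.push({ text: node.text.slice(start, end), marks })
  }
  return { kind: 'text', ranges }
}

const grouped = {
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {} },
      indexes: [{ start: 2, end: 6 }, { start: 10, end: 14 }],
    },
  ],
}

const split = toSplitRanges(grouped)
console.log(split.ranges.map(r => r.text))
```

The extra pass is the cost of the smaller payload: anyone rendering has to rebuild the per-range view before they can do anything with it.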
- It's harder to render marks (e.g. <bold>rich</bold>) because the indexes are further nested/complicated?

There's another approach that would have the marks defined outside of the text nodes themselves, at the top level of the document. This is actually the most efficient. However, I'm not going to consider this one because I think having nodes be self-contained is much more important here. Otherwise you'd need to carry that dictionary down the tree for each node you render, which is not fun.
This is something that Draft.js used to use for the "entities", but they've since migrated away I think, for the reasons discussed.
If anyone has thoughts (or even alternate structures I haven't considered) I'd love to hear them! Or if you'd had experience working with multiple structures and have preferences/ideas.
Thanks!
When we store the blocks in our DB we store them closer to how draft blocks were modeled except instead of having that painful way of storing entities we basically have
Decorators=[{
{ kind: 'mark', type: 'bold', startIndex: 0, endIndex: 10, data: {} },
{ kind: 'inline', type: 'link', startIndex: 5, endIndex: 10, data: {...} },
...
}]
And we're pretty happy with this storage because it means we have a pretty decent base from which to convert to a wide variety of formats.
As far as this discussion goes, I think all of the approaches have some merits, but what's important to us is that the Inline blocks do not go the way of Draft Entities where they're stored in an entity map external from the block, because that was a major pain point for us in migration (so much that I posted in terror ;))
Haha thanks @CameronAckermanSEL! Don't worry, we will not go the entity map route.
I'm even thinking that the current way might be the best, for a similar reason. The reason entity map was so horrible was because it makes the objects themselves not self-contained, so you have to keep weird state from elsewhere around as you recurse through the tree. Since Slate is tree-based, where Draft is not, I feel like this might be even more reason to keep the current structure in which ranges are completely self-contained.
From the four above I like «Split-text Ranges» best.
You could improve the common paragraph case by serializing like inline nodes, with no text ranges at all.
```js
[
  {
    kind: 'text',
    value: 'A '
  },
  {
    kind: 'mark', type: 'bold',
    nodes: [{kind: 'text', value: 'line'}]
  },
  {
    kind: 'text',
    value: ' of '
  },
  {
    kind: 'mark', type: 'bold',
    nodes: [{kind: 'text', value: 'rich'}]
  },
  {
    kind: 'text',
    value: ' text.'
  }
]
```
```js
function toPlaintext(node) {
  if (node.kind === 'text') {
    return node.value
  }
  return node.nodes.map(toPlaintext).join('')
}

function toXml(node) {
  if (node.kind === 'text') {
    return node.value
  }
  return `<${node.type}>${node.nodes.map(toXml).join('')}</${node.type}>`
}
```
_I cheated a bit here: you'd pass the text nodes parent to the helpers: toXml({kind: 'block', type: 'p', nodes: [...]}). Or add Array.isArray checks and accept node and nodes._
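The Array.isArray variant mentioned in the note could look roughly like the following. This is only a sketch of that idea under the node shapes used in this comment (kind, type, value, nodes); it is not how Slate serializes today.

```js
// Sketch only: helpers that accept either a single node or an array of nodes.
function toPlaintext(input) {
  if (Array.isArray(input)) return input.map(toPlaintext).join('')
  if (input.kind === 'text') return input.value
  return toPlaintext(input.nodes)
}

function toXml(input) {
  if (Array.isArray(input)) return input.map(toXml).join('')
  if (input.kind === 'text') return input.value
  return `<${input.type}>${toXml(input.nodes)}</${input.type}>`
}

const nodes = [
  { kind: 'text', value: 'A ' },
  { kind: 'mark', type: 'bold', nodes: [{ kind: 'text', value: 'line' }] },
  { kind: 'text', value: ' of ' },
  { kind: 'mark', type: 'bold', nodes: [{ kind: 'text', value: 'rich' }] },
  { kind: 'text', value: ' text.' },
]

console.log(toPlaintext(nodes))
console.log(toXml({ kind: 'block', type: 'p', nodes }))
```

Both the bare array and the wrapping block go through the same helpers, which sidesteps the question of what a single serialized text node returns.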
Getting the text string becomes harder, since it needs recursion. But transforming to XML becomes even easier. Size-wise I'd imagine it's similar or better with gzip.
This is also how mdast does it: https://astexplorer.net/#/gist/0170ef303c9159eba5594ab924f80a30/55c7675eac99592b2f9c31da8a593065c3cb5e0e
Would make slate-html-serializer simpler (and slate-mdast-serializer too).
_Update:_ Added example toPlaintext and toXml. The downside: it might be quite a surprise to get an array of nodes from serializing one node.
@tpreusse interesting, thanks for that link!
I think the array of nodes would simplify the individual case, but like you said it will be unexpected. I noticed that Prosemirror does something similar. Although it really turns the concept of "text" in JSON into "ranges", they're no longer discrete nodes like in the DOM, so I'd be a bit worried since it's a departure from the internal structure. But definitely good to see all the ideas. That AST Explorer is really useful!
What about simply this:
```js
[
  {
    text: 'A ',
    marks: [],
  },
  {
    text: 'line',
    marks: [{ kind: 'mark', type: 'bold', data: {}}],
  },
  {
    text: ' of ',
    marks: [],
  },
  {
    text: 'rich',
    marks: [{ kind: 'mark', type: 'bold', data: {}}],
  },
  {
    text: ' text.',
    marks: [],
  }
]
```
I think doing this would put A **line** of **rich** text. on the same footing as A **line** of **rich** text with an @ inside. I don't know whether, when we render a range, we really need to know if it contains text only or if it's some other kind of range (which I'm guessing is actually treated as an inline node). What I mean is that I don't know how much we gain from knowing that a range contains text only, and in any case it would be easy to check.
@ianstormtaylor slate could then introduce a fragment kind to avoid the confusion^^ I am also unsure if it's a good idea. And I don't have any issues with the current approach. But I do believe that differentiating between inline nodes and marks is not necessary outside of editing.
It would be nice to have an AST explorer for Slate values. Maybe with slate-hyperscript on one side and JSON on the other. But both sides would need to be modifiable (unlike astexplorer.net).
@tpreusse an AST explorer for Slate would be great!
@tuanmng that's essentially what we have now, except it turns text nodes into arrays instead of objects. Which definitely makes them more terse, although I think it's slightly more confusing for a record to be serialized into an array, especially when multiple records are concerned.
The issue with inlines vs. marks I think extends to outside of editing. (Although, it's a very nuanced distinction, which I often question myself haha.) But basically... inlines are nodes that have some semantic value as a distinct unit, for example a link.
The thing with marks is that they are order-independent: they're stored as a Set. Which is good, because for formatting this is how you want to think of them: either some text is bold or not, but it doesn't matter whether it's bold then italic, or italic then bold.
Since they are order-independent, you can render them as <bold><italic>text</italic></bold> or <italic><bold>text</bold></italic> and that should be equivalent. And since, unlike inlines, they are not a distinct unit, you can render overlapping ranges of marks in any way you please, as long as each character ends up receiving the marks it needs.
With marks, once two of the same mark become adjacent, the entire span of text has the mark.
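As a rough illustration of that adjacency property on the split-text structure: two neighboring ranges whose mark sets are equal, regardless of order, can be collapsed into one. The helper names below (markKey, mergeAdjacent) are invented for this sketch and are not Slate APIs. Comparing marks by type plus JSON-stringified data is also an assumption; it is sensitive to key order inside data.

```js
// Sketch only: build an order-independent key for a list of marks.
function markKey(marks) {
  return marks
    .map(m => `${m.type}:${JSON.stringify(m.data)}`)
    .sort()
    .join('|')
}

// Merge neighboring ranges that carry the same set of marks.
function mergeAdjacent(ranges) {
  const merged = []
  for (const range of ranges) {
    const last = merged[merged.length - 1]
    if (last && markKey(last.marks) === markKey(range.marks)) {
      // Same marks on both sides: the whole span simply has those marks.
      last.text += range.text
    } else {
      // Copy so the input ranges are left untouched.
      merged.push({ text: range.text, marks: range.marks })
    }
  }
  return merged
}

const bold = { kind: 'mark', type: 'bold', data: {} }
const italic = { kind: 'mark', type: 'italic', data: {} }

const mergedRanges = mergeAdjacent([
  { text: 'A ', marks: [] },
  { text: 'li', marks: [bold, italic] },
  { text: 'ne', marks: [italic, bold] },
  { text: ' of text.', marks: [] },
])

console.log(mergedRanges.map(r => r.text))
```

Nothing is lost by merging here because marks carry no identity of their own, which is exactly what does not hold for inlines.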
However, with inlines those properties are different. To break an inline into two parts is to change its meaning, or to have 2 inlines. If you model things that are expected to be inlines as marks, you can end up with unwanted behavior. Consider a bold and link mark interaction:
A line of text with <a href="https://google.com">an </a><strong><a href="https://google.com">important</a></strong><a href="https://google.com"> link</a> in it.
Here you actually end up with three links, each to the same place, because you could not guarantee the render ordering of the marks. Sometimes you'll get 3 links, sometimes 1. And if you style them with underlines for instance, that breakage will be apparent to end users.
In certain cases, if you know the schema of the content, you can use that knowledge to enforce your own rendering order to the marks, so that you could use link marks without this problem happening. But since Slate doesn't inherently know the schema, it doesn't do that.
Thank you all!
After writing this up, reading the comments, and thinking it through some more, I'm happy with the current Slate structure. I think it prioritizes being able to use the structure easily (for rendering, serializing, etc.) and it makes using it the "correct" way simple, which fits nicely with Slate's goal to prevent leaking unnecessary complexity into your codebases. Otherwise it seems like everyone is going to be re-inventing the same, more complex function to parse range indexes into a usable format to render things with.
It does that at a slight tradeoff in terms of efficiency and readability, but since efficiency is largely mitigated by GZIP, and the readability is only in the JSON form which people aren't reading