Slate: consider changing `Text` node's JSON serialization structure

Created on 21 Dec 2017 · 7 Comments · Source: ianstormtaylor/slate

This is something I'm unsure of... would love feedback.

There are a few different ways a text node could be serialized, and they have different tradeoffs. I'm opening this issue because I'm not sure the current structure makes the right tradeoffs, and if it doesn't, I want to fix it sooner rather than later.

For example, given the text:

A line of rich text.

There are a handful of ways to represent it...

Split-text Ranges

This is the current structure. If a text node has marks in it, the ranges array contains ranges that split up the text, according to the overlapping marks in each section. You'd end up with JSON of:

{
  kind: 'text',
  ranges: [
    {
      text: 'A ',
      marks: [],
    },
    {
      text: 'line',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' of ',
      marks: [],
    },
    {
      text: 'rich',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' text.',
      marks: [],
    }
  ]
}

```js
function toPlaintext(node) {
  return node.ranges.map(r => r.text).join('')
}
```

```js
function toXml(node) {
  return node.ranges.map((range) => {
    return range.marks.reduce((xml, mark) => {
      return `<${mark.type}>${xml}</${mark.type}>`
    }, range.text)
  }).join('')
}
```

This is similar to the approach Prosemirror uses, although its version has a text node for each range, instead of a text node comprising a list of ranges.

Pros

Very easy to construct nested serialized forms from it like XML (eg. <bold>rich</bold>) because you can simply iterate through the ranges array and build them.
Still somewhat easy to construct the entire string of text, because ranges.map(r => r.text).join('') gives it to you.

Cons

The full string of text is not readable; it's very hard to look at a definition for a text node with marks on it and recognize what the text is.
In the common case of a paragraph without marks, the definition still contains a ranges array that is populated with a single range of text, which is slightly more complex.
Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I'm not sure if this matters really when you factor in GZIP though?)

Index-based Ranges

Another approach would be to keep the text as a single string, and have the marks accompanied by offsets in the string, like so:

{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      start: 2,
      end: 6,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      start: 10,
      end: 14,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
  ]
}

```js
function toPlaintext(node) {
  return node.text
}
```

```js
function toXml(node) {
  // ???? (unclear; you can't just loop over the ranges)
}
```

This is the approach Draft.js uses, although instead of start/end it uses offset/length. That would match our operations more closely, so it might be preferred. (They're pretty much equivalent, since either can be derived easily from the other.)
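To illustrate how either pair can be derived from the other, here is a minimal sketch. The helper names `toOffsetLength` and `toStartEnd` are hypothetical (they are not part of Slate or Draft.js), and `end` is assumed to be exclusive, as in the example above.

```js
// Hypothetical helpers for illustration only; not a Slate or Draft.js API.
// Assumes `end` is exclusive, so length is simply end - start.
function toOffsetLength({ start, end, ...rest }) {
  return { offset: start, length: end - start, ...rest }
}

function toStartEnd({ offset, length, ...rest }) {
  return { start: offset, end: offset + length, ...rest }
}
```

Any other properties on the range (such as `marks`) are carried over unchanged, so the two forms round-trip.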

Pros

Very easy to read the entire string of text by itself. And easy to get a sense for which marks are applied to the string.

Cons

Although it's easy to see which marks are somewhere in the string, it's not easy to see exactly where they are applied, since you have to do the offset math in your head.
Harder to reason about what the logic would be to build up a nested serialized form like XML (eg. <bold>rich</bold>) because you can't just loop the ranges. (Unsure how hard this actually is?)
Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I'm not sure if this matters really when you factor in GZIP though?)
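As a rough gauge of how hard the XML case is, here is one possible sketch: cut the text at every range boundary, then wrap each piece with the marks of the ranges that cover it. The function names are hypothetical, `end` is assumed to be exclusive, all indexes are assumed to fall within the text, and (like the `toXml` above) nothing is escaped.

```js
// Sketch only: splits an index-based text node back into split-text pieces.
// Assumes exclusive `end` values that lie within node.text.
function splitByIndexes(node) {
  // Every range boundary, plus the ends of the string, is a cut point.
  const points = new Set([0, node.text.length])
  for (const { start, end } of node.ranges) {
    points.add(start)
    points.add(end)
  }
  const sorted = [...points].sort((a, b) => a - b)

  const pieces = []
  for (let i = 0; i < sorted.length - 1; i++) {
    const from = sorted[i]
    const to = sorted[i + 1]
    // Collect the marks of every range that fully covers this piece.
    const marks = node.ranges
      .filter(r => r.start <= from && to <= r.end)
      .reduce((all, r) => all.concat(r.marks), [])
    pieces.push({ text: node.text.slice(from, to), marks })
  }
  return pieces
}

function indexRangesToXml(node) {
  return splitByIndexes(node)
    .map(piece => piece.marks.reduce(
      (xml, mark) => `<${mark.type}>${xml}</${mark.type}>`,
      piece.text
    ))
    .join('')
}
```

So it is doable, but every consumer would need something like `splitByIndexes` before rendering. If two overlapping ranges carry the same mark, this sketch wraps the piece twice; deduplicating is left out for brevity.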

Mark-based Ranges

Another approach would be to treat the marks themselves as the primary grouping factor, resulting in the least possible duplication of the mark value, which is where most of the wasted size can come from.

{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {}},
      indexes: [
        { start: 2, end: 6 },
        { start: 10, end: 14 },
      ]
    },
  ]
}

```js
function toPlaintext(node) {
  return node.text
}
```

```js
function toXml(node) {
  // ???? (unclear; the indexes are nested one level deeper)
}
```

Pros

Still very easy to read the entire string of text.
Probably the absolute most space-efficient in terms of least unnecessary repetition of marks. (Although I'm not sure if this really matters when GZIP is considered.)

Cons

Potentially even harder to build up the nested serialized form like XML (eg. <bold>rich</bold>) because the indexes are further nested/complicated?
Very hard to reason about which marks are exactly where in the text.
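For comparison, here is one way the mark-based form could be turned into XML: record the mark types on each character, then group consecutive characters that share the same marks. This is only a sketch. `markRangesToXml` is a hypothetical name, `end` is assumed to be exclusive and within the text, mark `data` is ignored, and mark types are assumed not to contain commas.

```js
// Sketch only: renders a mark-based text node by tracking marks per character.
function markRangesToXml(node) {
  const length = node.text.length
  // marksAt[i] lists the mark types applied to character i.
  const marksAt = Array.from({ length }, () => [])
  for (const { mark, indexes } of node.ranges) {
    for (const { start, end } of indexes) {
      for (let i = start; i < end; i++) marksAt[i].push(mark.type)
    }
  }

  let xml = ''
  let i = 0
  while (i < length) {
    // Extend the run while the following characters carry the same marks.
    const key = marksAt[i].join(',')
    let j = i + 1
    while (j < length && marksAt[j].join(',') === key) j++
    xml += marksAt[i].reduce(
      (inner, type) => `<${type}>${inner}</${type}>`,
      node.text.slice(i, j)
    )
    i = j
  }
  return xml
}
```

The per-character pass is simple to follow but does work proportional to the text length times the number of marks, which is part of why this form feels harder to consume.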

External Mark Dictionary

There's another approach that would have the marks defined outside of the text nodes themselves, at the top-level of the document. This is actually the most efficient. However, I'm not going to consider this one because I think having nodes be self-contained is much more important here. Otherwise you'd need to carry that dictionary down the tree for each node you render, which is not fun.

This is something that Draft.js used to use for the "entities", but they've since migrated away I think, for the reasons discussed.

If anyone has thoughts (or even alternate structures I haven't considered) I'd love to hear them! Or if you've had experience working with multiple structures and have preferences/ideas.

Thanks!

discussion

ianstormtaylor

Most helpful comment

@tpreusse an AST explorer for Slate would be great!

@tuanmng that's essentially what we have now, except it turns text nodes into arrays instead of objects. Which definitely makes them more terse, although I think it's slightly more confusing for a record to be serialized into an array, especially when multiple records are concerned.

The issue with inlines vs. marks I think extends to outside of editing. (Although, it's a very nuanced distinction, which I often question myself haha.) But basically... inlines are nodes that have some semantic value as a distinct unit—for example a link.

The thing with marks is that they are order-independent—they're stored as a Set. Which is good, because for formatting this is how you want to think of them, either some text is bold or not, but it doesn't matter whether it's bold then italic, or italic then bold.

Since they are order-independent, you can render them as <bold><italic>text</italic></bold> or <italic><bold>text</bold></italic> and that should be equivalent. And since, unlike inlines, they are not a distinct unit, you can render overlapping ranges of marks in any way you please, as long as each character ends up receiving the marks it needs.

With marks, once two of the same mark become adjacent, the entire span of text has the mark.
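A small sketch of that property, using the split-text range shape from the issue: adjacent ranges whose marks are equal as sets collapse into one. `sameMarks` and `mergeAdjacent` are hypothetical helpers, not Slate functions, and comparing `data` through `JSON.stringify` is an assumption that is sensitive to key order.

```js
// Sketch only: treats marks as an unordered set keyed by type and data.
function sameMarks(a, b) {
  const key = m => `${m.type}:${JSON.stringify(m.data)}`
  const aKeys = new Set(a.map(key))
  const bKeys = new Set(b.map(key))
  return aKeys.size === bKeys.size && [...aKeys].every(k => bKeys.has(k))
}

// Merges neighbouring ranges that carry the same set of marks.
function mergeAdjacent(ranges) {
  const merged = []
  for (const range of ranges) {
    const last = merged[merged.length - 1]
    if (last && sameMarks(last.marks, range.marks)) {
      last.text += range.text
    } else {
      merged.push({ text: range.text, marks: range.marks })
    }
  }
  return merged
}
```

Because the comparison ignores order, a range marked bold then italic merges with a neighbour marked italic then bold.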

However, with inlines those properties are different. To break an inline into two parts is to change its meaning, or to have two inlines. If you model things that are expected to be inlines as marks, you can end up with unwanted behavior. Consider a bold and link mark interaction:

A line of text with <a href="https://google.com">an </a><strong><a href="https://google.com">important</a></strong><a href="https://google.com"> link</a> in it.

A line of text with an important link in it.

Here you actually end up with three links, each to the same place, because you could not guarantee the render ordering of the marks. Sometimes you'll get 3 links, sometimes 1. And if you style them with underlines for instance, that breakage will be apparent to end users.

In certain cases, if you know the schema of the content, you can use that knowledge to enforce your own rendering order on the marks, so that you could use link marks without this problem happening. But since Slate doesn't inherently know the schema, it doesn't do that.
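To make the schema-aware option concrete, here is a rough sketch that renders split-text ranges with a fixed mark order, so the outermost mark (a link here) wraps the whole run of adjacent ranges that share it. `renderOrdered` and its `order` parameter are hypothetical, not something Slate provides. Marks whose type is missing from `order` are dropped, and mark `data` (such as differing hrefs) is ignored.

```js
// Sketch only: `order` lists mark types from outermost to innermost.
function renderOrdered(ranges, order) {
  if (order.length === 0) return ranges.map(r => r.text).join('')
  const [type, ...rest] = order
  const hasType = r => r.marks.some(m => m.type === type)

  let out = ''
  let i = 0
  while (i < ranges.length) {
    // Take the longest run of adjacent ranges that agree on this mark.
    const has = hasType(ranges[i])
    let j = i
    while (j < ranges.length && hasType(ranges[j]) === has) j++
    const inner = renderOrdered(ranges.slice(i, j), rest)
    out += has ? `<${type}>${inner}</${type}>` : inner
    i = j
  }
  return out
}
```

With `order` set to `['link', 'bold']`, the three link fragments from the example above come out as a single link element with the bold nested inside it.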

Thank you all!

After writing this up, reading the comments, and thinking it through some more, I'm happy with the current Slate structure. I think it prioritizes being able to use the structure easily (for rendering, serializing, etc.) and it makes using it the "correct" way simple, which fits nicely with Slate's goal to prevent leaking unnecessary complexity into your codebases. Otherwise it seems like everyone is going to be re-inventing the same, more complex function to parse range indexes into a usable format to render things with.

It does that at a slight tradeoff in terms of efficiency and readability, but since efficiency is largely mitigated by GZIP, and the readability is only in the JSON form which people aren't reading

ianstormtaylor on 22 Dec 2017

👍4

All 7 comments

When we store the blocks in our DB we store them closer to how draft blocks were modeled except instead of having that painful way of storing entities we basically have

Decorators = [
  { kind: 'mark', type: 'bold', startIndex: 0, endIndex: 10, data: {} },
  { kind: 'inline', type: 'link', startIndex: 5, endIndex: 10, data: {...} },
  ...
]

And we're pretty happy with this storage because it means we have a pretty decent base from which to convert to a wide variety of formats.

As far as this discussion goes, I think all of the approaches have some merits, but what's important to us is that the Inline blocks do not go the way of Draft Entities where they're stored in an entity map external from the block, because that was a major pain point for us in migration (so much that I posted in terror ;))

CameronAckermanSEL on 21 Dec 2017

👍1

Haha thanks @CameronAckermanSEL! Don't worry, we will not go the entity map route.

I'm even thinking that the current way might be the best, for a similar reason. The reason entity map was so horrible was because it makes the objects themselves not self-contained, so you have to keep weird state from elsewhere around as you recurse through the tree. Since Slate is tree-based, where Draft is not, I feel like this might be even more reason to keep the current structure in which ranges are completely self-contained.

ianstormtaylor on 21 Dec 2017

🎉1

From the four above I like «Split-text Ranges» best.

You could improve the common paragraph case by serializing like inline nodes—no text ranges at all.

Directly split into nested array of nodes

[
  {
    kind: 'text',
    value: 'A '
  },
  {
    kind: 'mark', type: 'bold',
    nodes: [{kind: 'text', value: 'line'}]
  },
  {
    kind: 'text',
    value: ' of '
  },
  {
    kind: 'mark', type: 'bold',
    nodes: [{kind: 'text', value: 'rich'}]
  },
  {
    kind: 'text',
    value: ' text.'
  }
]

function toPlaintext(node) {
  if (node.kind === 'text') {
    return node.value
  }
  return node.nodes.map(toPlaintext).join('')
}

function toXml(node) {
  if (node.kind === 'text') {
    return node.value
  }
  return `<${node.type}>${node.nodes.map(toXml).join('')}</${node.type}>`
}

_I cheated a bit here: you'd pass the text node's parent to the helpers: toXml({kind: 'block', type: 'p', nodes: [...]}). Or add Array.isArray checks and accept node and nodes._

Getting the text string becomes harder—needs recursion. But transforming to xml becomes even easier. Size-wise I'd imagine it's similar or better with gzip.

This is also how mdast does it: https://astexplorer.net/#/gist/0170ef303c9159eba5594ab924f80a30/55c7675eac99592b2f9c31da8a593065c3cb5e0e

Would make slate-html-serializer simpler (and slate-mdast-serializer too).

_Update:_ Added example toPlaintext and toXml. The downside: it might be quite a surprise to get an array of nodes from serializing one node.

tpreusse on 21 Dec 2017

@tpreusse interesting, thanks for that link!

I think the array of nodes would simplify the individual case, but like you said it will be unexpected. I noticed that Prosemirror does something similar. Although it really turns the concept of "text" in JSON into "ranges", they're no longer discrete nodes like in the DOM, so I'd be a bit worried since it's a departure from the internal structure. But definitely good to see all the ideas. That AST Explorer is really useful!

ianstormtaylor on 22 Dec 2017

What about simply this:

[
    {
      text: 'A ',
      marks: [],
    },
    {
      text: 'line',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' of ',
      marks: [],
    },
    {
      text: 'rich',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' text.',
      marks: [],
    }
]

I think doing this would put A **line** of **rich** text. on the same footing with A **line** of **rich** text with an @ inside. I don't know if, when we render a range, we really need to know whether it's a range that contains text only or some other kind of range (which I'm guessing is actually being treated as an inline node). What I wanted to say is that I don't know how much there is to gain from knowing that a range contains text only, and in any case it would be easy to check.

tuanmng on 22 Dec 2017

@ianstormtaylor slate could then introduce a fragment kind to avoid the confusion^^ I am also unsure if it's a good idea. And I don't have any issues with the current approach. But I do believe that differentiating between inline nodes and marks is not necessary outside of editing.

It would be nice to have an AST explorer for Slate values. Maybe with slate-hyperscript on one side and JSON on the other. But both sides would need to be modifiable (unlike astexplorer.net).

tpreusse on 22 Dec 2017
