Mastodon: Provide plaintext/alternative versions of content for ActivityPub federation

Created on 23 Oct 2018  ·  10 Comments  ·  Source: tootsuite/mastodon

The specification defines the source property for that. Without it, I have to blindly trust HTML from federated servers.

suggestion

All 10 comments

Do not blindly trust HTML. Sanitize it. For example, Mastodon only allows p, br, span and a tags and a limited number of classes and attributes. What will you do with the source property if another piece of software has different syntax rules than yours? If you end up standardizing the syntax for mentions and links, you might as well stick to HTML.
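
As a rough illustration of the allowlist approach described above, here is a minimal sketch using only Python's standard-library html.parser. The tag list (p, br, span, a) comes from the comment; the attribute allowlist, the http/https-only rule for href, and the names AllowlistSanitizer and sanitize are assumptions made for illustration, not Mastodon's actual sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Tags named in the comment above. The attribute allowlist is an assumption.
ALLOWED_TAGS = {"p", "br", "span", "a"}
ALLOWED_ATTRS = {"a": {"href", "class"}, "span": {"class"}}
VOID_TAGS = {"br"}  # tags that never get a closing tag


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags/attributes; all text is re-escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep any text inside it
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Assumption: only http(s) links are acceptable.
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = '<p onclick="x()">Hi <b>there</b> <a href="javascript:alert(1)" class="mention">@u</a><br></p>'
print(sanitize(dirty))
```

This sketch does not balance unclosed tags or filter class values, and text inside dropped elements survives as escaped text, so a real deployment would more likely rely on a maintained sanitizer library.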

  1. OK, the main problem is the absence of entity metadata and entity indices.

  2. > different syntax rules

Isn't there a mediaType for that?

the source property is for lossless editing of derived content. It's not so you can avoid sanitizing your HTML

I mean, mediaType is probably useful for markdown vs RestructuredText or whatever, but it won't tell you the syntax for mentioning users or hashtags or other features. One server might have the @ and # convention but there's no guarantee.

Also, entities metadata is not missing. It's the tag property. It includes mentions, hashtags, emojis, anything that might be in the text.

tag objects have neither indices nor text properties, and the tag name includes the @ and # prefixes. Moreover, a Mention name has the full @nick@domain form while the content has only the @nick part, so I don't understand how this helps me map these "conventions" onto my implementation (I use a different tag prefix, and my users are non-prefixed).
Also, the content contains some unrelated spans with HTML class names.

How it should look:
{
  "content": "Hey, @user! Any #suggestions?",
  "tag": [
    {"type": "Mention", "indices": [5, 9], "text": "user"},
    {"type": "Hashtag", "indices": [16, 27], "text": "suggestions"}
  ]
}
I think this is a "lossless source" version, since the user does not type any markup.
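
To make the proposal concrete, here is a minimal sketch of how a consumer might render such an object with its own conventions. Note that the indices and text properties are the commenter's proposed extension, not something Mastodon emits. Reading the example, [5, 9] spans "@user", so the sketch assumes indices are inclusive and cover the prefix character; the function name render_plain and the renderer callbacks are hypothetical.

```python
def render_plain(content, tags, render_mention, render_hashtag):
    """Rebuild the text, replacing each tagged span with a local rendering.

    Assumes the proposed (hypothetical) format: each tag carries
    inclusive [start, end] indices into `content` plus a bare `text`.
    """
    renderers = {"Mention": render_mention, "Hashtag": render_hashtag}
    out = []
    pos = 0
    for tag in sorted(tags, key=lambda t: t["indices"][0]):
        start, end = tag["indices"]
        out.append(content[pos:start])  # untouched text before the entity
        render = renderers.get(tag["type"])
        if render is None:
            out.append(content[start:end + 1])  # unknown type: keep as sent
        else:
            out.append(render(tag["text"]))
        pos = end + 1
    out.append(content[pos:])  # trailing text after the last entity
    return "".join(out)


post = {
    "content": "Hey, @user! Any #suggestions?",
    "tag": [
        {"type": "Mention", "indices": [5, 9], "text": "user"},
        {"type": "Hashtag", "indices": [16, 27], "text": "suggestions"},
    ],
}

# A receiver with non-prefixed users and a "*" tag prefix:
print(render_plain(post["content"], post["tag"], lambda n: n, lambda n: "*" + n))
# prints: Hey, user! Any *suggestions?
```

The receiver never needs to know the sender's prefix conventions, which is the point the comment is making.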

@vitalyster can you share the details of your use case that drive these design decisions? What are you trying to do that you can't with the HTML?

The basic task would be "send IM notification to non HTML enabled clients". Or just 'render to a non HTML media'.

How do I find a mention in the HTML? "Find class=mention link in a 'h-card' span". Is 'h-card' class a universal class for 'vcard'? Most probably, the answer is 'no'. Is it convenient to parse every HTML document to get a full list of mentions, links, other special objects? Not really.

Twitter entities solve this issue if you have a plain text representation. Say you have a text of 500 characters, and characters 12 to 18 are a mention. What is the incoming username format? I don't know. But I know it's a username and I can render it the way I need (using the details from the 'mention' entity), the way my users are used to seeing those mentions. Also, I can send out notifications right away.
Do I have to validate plaintext? Most probably - "no".

You can object that I can parse everything from the HTML. But do you have a list of 'special' classes? (AFAIK - no) Can you guarantee the list won't change? Do I need an HTML representation? I don't know. How do I add ARIA attributes to external HTML? I don't really know. And most important of all: 'why should we reinvent a working thing'? Twitter integration is relatively simple, why can't we use the same thing over again?

Writing a custom HTML processor isn't a trivial task. First, HTML is not a strictly structured document like XHTML. For example, the <br> tag has no end tag in HTML, so you can't use an XML parser to parse HTML. Second, HTML is a weird mix of different tag types: structural tags (<h1>, <a> etc.) and styling tags (<b>, <font> etc.), so we need deep knowledge of the many tags just to recognise which ones are structural. Third, there is no dedicated tag for mentions, so we need to find <a> tags with an href attribute and then check that they have the "mention" class! To summarize, we need to define a limited subset of XHTML (well-structured, a limited set of tags, separate tags for mentions and other structural regions) that can be used for message bodies.
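
For comparison, the HTML route described above (find <a> tags with an href and the "mention" class) can be sketched with Python's standard-library html.parser, which tolerates unclosed tags like <br>. The sample markup shape, the exclusion of links that also carry a "hashtag" class, and the names MentionExtractor and extract are assumptions for illustration; other servers may mark mentions up differently, which is exactly the concern raised in this thread.

```python
from html.parser import HTMLParser


class MentionExtractor(HTMLParser):
    """Collects plain text plus (href, link text) for class="mention" links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.mentions = []
        self._href = None       # href of the mention link currently open
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            classes = (attrs.get("class") or "").split()
            # Assumption: hashtag links may also carry "mention"; skip those.
            if "mention" in classes and "hashtag" not in classes and attrs.get("href"):
                self._href = attrs["href"]
                self._link_text = []
        elif tag == "br":
            self.text.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.mentions.append((self._href, "".join(self._link_text)))
            self._href = None
        elif tag == "p":
            self.text.append("\n\n")  # paragraph break in the plain text

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._link_text.append(data)


def extract(html):
    parser = MentionExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip(), parser.mentions


sample = (
    '<p><span class="h-card"><a href="https://example.social/@user" '
    'class="u-url mention">@<span>user</span></a></span> hello<br>world</p>'
)
print(extract(sample))
```

It works for this one markup shape, but every assumption baked into it (class names, nesting, paragraph handling) is something the sender is free to change.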

Take a look at e-mail: it is common practice to include a plain-text alternative part alongside the styled HTML message. So I think that providing less stylish and more machine-readable alternative media types is the solution.

i would be amenable to saying "we should provide a structured text format that isn't html". But the original issue asks for plaintext, and the responses seem to make very basic mistakes about the federated content we do provide.

> How do I find a mention in the HTML? "Find class=mention link in a 'h-card' span". Is 'h-card' class a universal class for 'vcard'? Most probably, the answer is 'no'. Is it convenient to parse every HTML document to get a full list of mentions, links, other special objects? Not really.

No, you should look at the "tag" properties. Here's an example tag array from a recent post:

"tag": [
  {
    "type": "Mention",
    "href": "https://social.mecanis.me/users/er1n",
    "name": "@er1n@social.mecanis.me"
  }
]

This format is standardized by activitystreams2: https://www.w3.org/TR/activitystreams-core/
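
As a minimal sketch of that suggestion, the mentioned actors can be read straight from the tag property without touching the HTML, for example to decide who should get a notification. The function name mentioned_actors and the sample actor URL are hypothetical. The sketch also accepts a single tag object instead of an array, since JSON-LD documents may be serialized that way.

```python
def mentioned_actors(obj):
    """Return the href of every Mention in an ActivityStreams object's tag property."""
    tags = obj.get("tag", [])
    if isinstance(tags, dict):
        tags = [tags]  # a lone tag may be serialized without the array
    return [
        tag["href"]
        for tag in tags
        if isinstance(tag, dict) and tag.get("type") == "Mention" and "href" in tag
    ]


note = {
    "type": "Note",
    "content": "<p>...</p>",
    "tag": [
        # Hypothetical actor, shaped like the example above.
        {
            "type": "Mention",
            "href": "https://example.social/users/alice",
            "name": "@alice@example.social",
        },
        {
            "type": "Hashtag",
            "href": "https://example.social/tags/suggestions",
            "name": "#suggestions",
        },
    ],
}

print(mentioned_actors(note))
# prints: ['https://example.social/users/alice']
```

This answers "who was mentioned", but, as the thread goes on to note, it does not say where in the text each mention appears.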

@nightpool

> we should provide a structured text format that isn't html

We already have one: JSON Linked Data. We can extend it to include any required properties. ActivityStreams is just a predefined "vocabulary" for "social network"-like activities. Mastodon already has its own extensions (as:Hashtag, for example), so nothing prevents adding new ones.
As discussed above, the tag array does not say 1) how to represent entities in plain text or 2) how to find them in the text without HTML parsing.
A workflow like "the ActivityStreams vocabulary is limited, so let's just throw away some HTML markup instead and let other parties guess what we wanted to say" is bad.
