The specification defines the source property for that. Without it, I need to blindly trust HTML from federated servers.
Do not blindly trust HTML. Sanitize it. For example, Mastodon only allows p, br, span and a tags and a limited number of classes and attributes. What will you do with the source property if another piece of software has different syntax rules than yours? If you end up standardizing the syntax for mentioning and linking, you might as well stick to HTML.
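To make the "sanitize it" advice concrete, here is a minimal allowlist sketch using only Python's standard html.parser. The tag list comes from the comment above; the per-tag attribute list and the http/https-only rule for href are my assumptions, not Mastodon's actual rules. It does not rebalance unclosed tags, so treat it as an illustration rather than a replacement for a maintained sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Tag allowlist from the comment above. The attribute allowlist is an assumption.
ALLOWED_TAGS = {"p", "br", "span", "a"}
ALLOWED_ATTRS = {"a": {"href", "class"}, "span": {"class"}}
VOID_TAGS = {"br"}  # tags that never get a closing tag


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep the text inside it
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Assumption: only plain web links are acceptable in href.
            if name == "href" and not value.startswith(("http://", "https://")):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Disallowed tags are unwrapped rather than deleted together with their text, which is a design choice: it keeps the message readable, but it also means the text of a dropped element still shows up in the output.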
there is a mediaType for that?
the source property is for lossless editing of derived content. It's not so you can avoid sanitizing your HTML
I mean, mediaType is probably useful for Markdown vs reStructuredText or whatever, but it won't tell you the syntax for mentioning users or hashtags or other features. One server might have the @ and # convention but there's no guarantee.
Also, entities metadata is not missing. It's the tag property. It includes mentions, hashtags, emojis, anything that might be in the text.
tag objects do not have indices and do not have text properties, and the tag name includes these @ and # prefixes. Moreover, a Mention name has @[email protected] while content has only the @nick part, and I did not understand how this can help to correctly map "conventions" to my implementation (I have a different tag prefix and non-prefixed users).
Also, the content part contains some unrelated spans with some HTML class names.
How it should look:
{"content": "Hey, @user! Any #suggestions?", "tag": [{"type": "Mention", "indices": [5, 9], "text": "user"}, {"type": "Hashtag", "indices": [16,27], "text": "suggestions"}]}
I think this is a "lossless source" version, as the user does not input any markup.
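A rough sketch of how a receiver could use such a payload to re-render the text in its own conventions. The indices and text properties are the ones proposed in the example above and are not part of ActivityStreams today. I treat indices as an inclusive [start, end] pair, which is what the example numbers imply. The function name and the "%" hashtag prefix are hypothetical.

```python
def render_plain(content, entities, mention_prefix="", hashtag_prefix="%"):
    """Rewrite entity spans in `content` using local prefix conventions.

    `indices` is read as an inclusive [start, end] pair of character offsets,
    matching the proposed example (not an existing ActivityStreams property).
    """
    prefixes = {"Mention": mention_prefix, "Hashtag": hashtag_prefix}
    out = []
    pos = 0
    for entity in sorted(entities, key=lambda e: e["indices"][0]):
        start, end = entity["indices"]
        out.append(content[pos:start])  # untouched text before the entity
        out.append(prefixes.get(entity["type"], "") + entity["text"])
        pos = end + 1  # inclusive end, so resume one character later
    out.append(content[pos:])  # untouched text after the last entity
    return "".join(out)


note = {
    "content": "Hey, @user! Any #suggestions?",
    "tag": [
        {"type": "Mention", "indices": [5, 9], "text": "user"},
        {"type": "Hashtag", "indices": [16, 27], "text": "suggestions"},
    ],
}
print(render_plain(note["content"], note["tag"]))
# prints: Hey, user! Any %suggestions?
```

The receiver never has to know that the sender writes mentions with @ or hashtags with #; it only needs the offsets and the bare text.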
@vitalyster can you share the details of your use case that drive these design decisions? What are you trying to do that you can't with the HTML?
The basic task would be "send IM notification to non HTML enabled clients". Or just 'render to a non HTML media'.
How do I find a mention in the HTML? "Find class=mention link in a 'h-card' span". Is 'h-card' class a universal class for 'vcard' ? Most probably, the answer is 'no'. Is it convenient to parse every HTML document to get a full list of mentions, links, other special objects? Not really.
Twitter entities solve this issue if you have a plain text representation. Say you have a text of 500 characters, and characters 12 to 18 are a mention. What is the incoming username format? I don't know. But I know it's a username, and I can render it the way I need (using the details from the 'mention' entity), the way my users are used to seeing those mentions. Also, I can send out notifications right away.
Do I have to validate plaintext? Most probably - "no".
You can object that I can parse everything from HTML. But do you have a list of 'special' classes? (AFAIK, no.) Can you guarantee the list won't change? Do I need an HTML representation? I don't know. How do I add ARIA entities to external HTML? I don't really know. And most important of all: why should we reinvent a working thing? Twitter integration is relatively simple, so why can't we use the same approach again?
Writing a custom HTML processor isn't a trivial task. First, HTML is not a strictly structured document format like XHTML. For example, the <br> tag has no end tag in HTML, so you can't use an XML parser to parse HTML. Second, HTML is a weird mix of different tag types: structural tags (<h1>, <a>, etc.) and styling tags (<b>, <font>, etc.), so we need deep knowledge of the many tags just to recognise which ones are structural. Third, there is no dedicated tag for mentions, so we need to find <a> tags with an href attribute and then check that they have the "mention" class! In summary, we need to define a limited subset of XHTML (well-structured, a limited set of tags, separate tags for mentions and other structural regions) that can be used for message bodies.
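For reference, the "find <a> tags with href and the mention class" step looks roughly like this with Python's standard html.parser, which is lenient about <br> and unclosed tags. The class name "mention" is the convention mentioned in this thread, not something a standard guarantees, and the sample markup and function name are hypothetical.

```python
from html.parser import HTMLParser


class MentionLinkCollector(HTMLParser):
    """Collects href values of <a> tags that carry the "mention" class."""

    def __init__(self):
        super().__init__()
        self.mentions = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Relies on the sender using the "mention" class (a convention only).
        if attr.get("href") and "mention" in classes:
            self.mentions.append(attr["href"])


def find_mention_links(html_text):
    collector = MentionLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.mentions


sample = (
    '<p><span class="h-card">'
    '<a href="https://example.social/@nick" class="u-url mention">@<span>nick</span></a>'
    '</span><br>see <a href="https://example.org/">this</a></p>'
)
print(find_mention_links(sample))
# prints: ['https://example.social/@nick']
```

The parsing itself is short; the fragile part is exactly what the comment above points out, namely that the code depends on a class name that other implementations are free not to use.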
Take a look at e-mail: it is common practice to include a plain text alternative part alongside the stylish HTML message. So I think that providing less stylish and more machine-readable alternative media types is the solution.
i would be amenable to saying "we should provide a structured text format that isn't html". But the original issue asks for plaintext, and the responses seem to make very basic mistakes about the federated content we do provide.
How do I find a mention in the HTML? "Find class=mention link in a 'h-card' span". Is 'h-card' class a universal class for 'vcard' ? Most probably, the answer is 'no'. Is it convenient to parse every HTML document to get a full list of mentions, links, other special objects? Not really.
No, you should look at the "tag" properties. Here's an example tag array from a recent post:
"tag": [
  {
    "type": "Mention",
    "href": "https://social.mecanis.me/users/er1n",
    "name": "@[email protected]"
  }
]
This format is standardized by activitystreams2: https://www.w3.org/TR/activitystreams-core/
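A small sketch of reading mentions from the tag property instead of the HTML. The helper name and the sample values are hypothetical. The single-object branch is there because ActivityStreams JSON may carry one object where an array is otherwise expected.

```python
def mentions_from_tags(obj):
    """Return (name, href) pairs for every Mention in an object's tag property."""
    tags = obj.get("tag", [])
    if isinstance(tags, dict):
        tags = [tags]  # a lone tag may be sent as an object, not an array
    return [
        (tag.get("name"), tag.get("href"))
        for tag in tags
        if isinstance(tag, dict) and tag.get("type") == "Mention"
    ]


# Hypothetical payload, shaped like the tag array above.
post = {
    "content": "<p>hello</p>",
    "tag": [
        {"type": "Mention", "href": "https://example.social/users/nick", "name": "@nick@example.social"},
        {"type": "Hashtag", "href": "https://example.social/tags/fediverse", "name": "#fediverse"},
    ],
}
print(mentions_from_tags(post))
# prints: [('@nick@example.social', 'https://example.social/users/nick')]
```

This is enough to send notifications without touching the HTML, though it does not say where in content each mention occurs.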
@nightpool
we should provide a structured text format that isn't html
We already have it: it is JSON Linked Data. We can extend it to include any required properties. ActivityStreams is just a predefined "vocabulary" for "social network"-like activities. Mastodon already has its own extensions (as:Hashtag, for example), so nothing prevents adding new ones.
As discussed above, the tag array does not include information on 1) how to represent entities in plain text or 2) how to find them in the text without HTML parsing.
A workflow like "the ActivityStreams vocabulary is limited, so let's just throw away some HTML markup instead and other parties will guess what we want to say" is bad.