To help fix twitter embedding issues, we've just added support for oEmbed in capturing URL previews. We now use this for previewing twitter links (by default), and we can now receive tweet text without problems. However, twitter does not return any image data for a tweet in its oEmbed response:
{
"url": "https:\\/\\/twitter.com\\/arnaudmez7\\/status\\/1284848614062338053",
"author_name": "The Uncle Mez",
"author_url": "https:\\/\\/twitter.com\\/arnaudmez7",
"html": "\\u003Cblockquote class=\"twitter-tweet\"\\u003E\\u003Cp lang=\"en\" dir=\"ltr\"\\u003EI Absolutely like the new \\u003Ca href=\"https:\\/\\/twitter.com\\/element_hq?ref_src=twsrc%5Etfw\"\\u003E@element_hq\\u003C\\/a\\u003E \\u003Cbr\\u003EBeautiful work !\\u003Cbr\\u003ERun very well on \\u003Ca href=\"https:\\/\\/twitter.com\\/SolusProject?ref_src=twsrc%5Etfw\"\\u003E@SolusProject\\u003C\\/a\\u003E \\u003Ca href=\"https:\\/\\/t.co\\/bLzhmuoFdy\"\\u003Epic.twitter.com\\/bLzhmuoFdy\\u003C\\/a\\u003E\\u003C\\/p\\u003E— The Uncle Mez (@arnaudmez7) \\u003Ca href=\"https:\\/\\/twitter.com\\/arnaudmez7\\/status\\/1284848614062338053?ref_src=twsrc%5Etfw\"\\u003EJuly 19, 2020\\u003C\\/a\\u003E\\u003C\\/blockquote\\u003E\n\\u003Cscript async src=\"https:\\/\\/platform.twitter.com\\/widgets.js\" charset=\"utf-8\"\\u003E\\u003C\\/script\\u003E\n",
"width": 550,
"height": null,
"type": "rich",
"cache_age": "3153600000",
"provider_name": "Twitter",
"provider_url": "https:\\/\\/twitter.com",
"version": "1.0"
}
You'll notice that the html key has a pic.twitter.com URL in it. However, this just leads us to the tweet HTML, and extracting the image from that HTML is too twitter-specific anyway.
However, the HTML returned here is exactly the same (minus being encoded) as what's shown on publish.twitter.com for this tweet. You can see that this HTML renders into a nice little standardised preview of the tweet. Part of this HTML is a JS script that gets loaded (platform.twitter.com/widgets.js), which actually does most of the magic to render the tweet.
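To make the "minus being encoded" point concrete, here is a minimal sketch that decodes the html key and lists the links inside it. The `RAW` payload is a shortened, hypothetical stand-in for the response above, and `LinkCollector` and `links_in_oembed_html` are illustrative names, not anything that exists in Synapse.

```python
import json
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for the oEmbed response above.
# Assumes the JSON escapes (\u003C, \/) appear as they would on the wire.
RAW = (
    r'{"html": "\u003Cblockquote class=\"twitter-tweet\"\u003E'
    r'\u003Ca href=\"https:\/\/t.co\/bLzhmuoFdy\"\u003E'
    r'pic.twitter.com\/bLzhmuoFdy\u003C\/a\u003E'
    r'\u003C\/blockquote\u003E", "type": "rich"}'
)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_in_oembed_html(raw_json):
    # json.loads undoes the \u003C and \/ escapes, yielding plain HTML.
    html = json.loads(raw_json).get("html") or ""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(links_in_oembed_html(RAW))
```

This only finds the t.co link; it does not find an image, which is why the thread goes on to discuss rendering the HTML instead.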
Theoretically, after rendering this HTML output locally, we can just run our standard URL preview code over it and extract an image!
Thus my proposal for supporting Twitter image embeds with oEmbed, in a way that is still generic, is to:
1. Check whether a photo or video response type is used, or thumbnail* keys are provided.
2. Otherwise, check for an html key.
3. If an html key exists, render securely and run URL preview code over it.
At the moment this is all theory; I haven't tested it in code yet.
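The fallback order proposed above could be sketched roughly as follows. This is only an illustration under assumptions: `choose_preview_source` is a hypothetical helper, the field names (`type`, `url`, `thumbnail_url`, `html`) are taken from the oEmbed spec, and the "render securely" step is deliberately left out since nothing in this thread settles how it would work.

```python
from typing import Optional, Tuple


def choose_preview_source(oembed: dict) -> Tuple[str, Optional[str]]:
    """Decide where a preview image could come from (hypothetical helper).

    Returns a (kind, value) pair where kind is "image", "render_html"
    or "none".
    """
    # A photo response carries the image itself in "url".
    if oembed.get("type") == "photo" and oembed.get("url"):
        return ("image", oembed["url"])
    # Any response type may carry an optional thumbnail; for video
    # responses this is the only direct image the response offers.
    if oembed.get("thumbnail_url"):
        return ("image", oembed["thumbnail_url"])
    # Otherwise fall back to the html key, which would still need to be
    # rendered and run through the normal URL preview code.
    if oembed.get("html"):
        return ("render_html", oembed["html"])
    return ("none", None)


# The Twitter response above has type "rich" and no thumbnail keys,
# so it would fall through to the html branch.
print(choose_preview_source({"type": "rich", "html": "<blockquote></blockquote>"})[0])
```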
I'm not really in favour of this for two reasons:
The Twitter API suggests that clients include https://platform.twitter.com/widgets.js and run twttr.widgets.load() on new URL previews, but that is obviously twitter-specific.
I'm going to close this because I think we've agreed that this isn't the right approach :slightly_smiling_face:
It's actually easy to do and doesn't require all of that.
https://matrix.to/#/!XaqDhxuTIlvldquJaV:matrix.org/$bvBYxFl1vc1_FbDz-VxSb2Lqh1V0kFIPrgHD_KHMhog?via=sw1v.org&via=raim.ist&via=matrix.org
https://mau.dev/maunium/synapse/-/commit/fe01ce7cf786378f72f741c80b6183674aeada50
It seems that has been decided against for some reason but I'm just adding a comment here so at least it is mentioned somewhere on the repo.
For those coming here in the future, Synapse already sends a User-Agent string of Synapse/x.xx.x during its URL preview fetching: https://github.com/matrix-org/synapse/issues/1859
It seems that the solution from @aaronraimist works because twitter allows previews by programs with "bot" in their user-agent string. We're not sure whether we want to add this to the user-agent string, especially if it's not standard practice and is twitter-specific.
One may suggest allowing the URL preview UA to be configurable, but having to tell users to change this setting to get services like twitter working isn't a great situation to be in.
Given the above there's not an easy path forward here.
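For reference, the configurable-UA idea could look something like the sketch below. It is not Synapse's actual code or configuration: `DEFAULT_PREVIEW_UA`, `build_preview_request` and `looks_like_bot` are hypothetical names, the version in the default string is the placeholder quoted above, and no request is actually sent.

```python
from urllib.request import Request

# Placeholder default, as quoted in this thread (not a real version number).
DEFAULT_PREVIEW_UA = "Synapse/x.xx.x"


def build_preview_request(url, user_agent=DEFAULT_PREVIEW_UA):
    """Build (but do not send) a URL preview request with a chosen UA."""
    return Request(url, headers={"User-Agent": user_agent})


def looks_like_bot(user_agent):
    """True if the UA contains "bot", the convention discussed here."""
    return "bot" in user_agent.lower()


# An admin-supplied UA (hypothetical value) that includes "bot".
req = build_preview_request(
    "https://twitter.com/arnaudmez7/status/1284848614062338053",
    user_agent="Synapse/x.xx.x (bot)",
)
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"), looks_like_bot(req.get_header("User-agent")))
```

With the default string, `looks_like_bot` returns False, which matches the observation that twitter previews do not work out of the box.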
Bot in the user agent doesn't seem like that much of a hack to me. For example https://github.com/matrix-org/synapse/issues/1859 was asking to put bot in the UA string back in 2017 just to show that it was in fact a bot making the requests.
Right now the current situation will never work so even if it only worked temporarily after making this change that's still an improvement. You don't have to guarantee that Twitter previews are going to continue to work after making this change. It can just happily work, until maybe in the future they change something and it stops working.
We don't want to modify the UA header for a twitter-specific reason. However, if putting "bot" in the UA is something industry-wide, or as you say serves to indicate that a request originates from a bot, then that'd be a good reason to do so. What do other link-fetching services do?
After some discussion in #synapse-dev, I'm more favourable towards the configurable UA option, although I do realise that it wouldn't solve the problem for twitter by default.
I don't know if it is a standard but it doesn't seem uncommon. For example, most of Google's crawlers have the word bot in the user agent (https://support.google.com/webmasters/answer/1061943?hl=en), and as the Wikipedia article for user agents says:
Automated web crawling tools can use a simplified form, where an important field is contact information in case of problems. By convention the word "bot" is included in the name of the agent.
As a reference for that it is just linking to a blog post but it does seem like something that some people recommend.
https://en.wikipedia.org/wiki/User_agent#Format_for_automated_agents_(bots)