This is a bug and not a feature from the evil wallabag core team 😈 (wallabag everywhere)
When you search for "wallabag" in the search engine, the results include every article whose content is the placeholder "wallabag can't retrieve contents for this article. Please troubleshoot this issue."
On wallabag.it, you also get the articles whose picture URLs camo has proxied to static.wallabag.it.
What do you suggest?
Today I don't really know. I just wanted to open a new issue ;-)
Don't we have a boolean field saying whether the article was properly fetched? That should be a first step: tell the search engine to search only in the title for those articles.
As for the second issue, search looking into the HTML fields themselves, the only solution I can think of is using Elasticsearch and converting the HTML into plain text before tokenizing and saving it.
I second @tcitworld on the boolean to solve the first issue.
For the second one, ES is not the only solution. You could also save the text version without tags in the current datastore and continue searching in it as usual.
Another idea would be to replace hostname in img tag with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance.
> You could also save the text version without tags in the current datastore and continue searching in it as usual.
It would pretty much mean saving most of the data twice in your database, so quite inefficient. With ES you would save only the appropriate data once.
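For reference, the "convert the HTML into plain text" step that both proposals share could look roughly like the sketch below. It is Python for illustration only and is not wallabag code; `html_to_text` and `_TextExtractor` are hypothetical names. Whether the result goes into a second column or into an ES index is a separate decision.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes only; tags and attributes (e.g. img src) are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the searchable plain text of an HTML body."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


html = '<p>Hello <img src="https://static.wallabag.it/x.png"> world</p>'
plain = html_to_text(html)
```

Since attribute values never reach `handle_data`, a picture URL on static.wallabag.it would no longer make the article match a search for "wallabag".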
> Another idea would be to replace hostname in img tag with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance.
This idea sounds more interesting to me.
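A rough sketch of how the hostname idea might work, assuming a placeholder token is stored in place of the image host and swapped back at render time. This is illustrative Python, not wallabag code: `PLACEHOLDER`, `strip_host` and `restore_host` are made-up names, and the regex only handles double-quoted `src` attributes, so a real implementation would more likely use an HTML parser.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical token stored instead of the real image host.
PLACEHOLDER = "{{IMG_HOST}}"

# Simplification: matches <img ... src="..."> with a double-quoted src only.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def strip_host(html, host):
    """Before saving: replace scheme and host of matching img URLs with the token."""

    def repl(match):
        parts = urlsplit(match.group(2))
        if parts.netloc != host:
            return match.group(0)  # some other host: leave untouched
        path = urlunsplit(("", "", parts.path, parts.query, parts.fragment))
        return match.group(1) + PLACEHOLDER + path + match.group(3)

    return _IMG_SRC.sub(repl, html)


def restore_host(html, base_url):
    """At render time: put the currently configured base URL back."""
    return html.replace(PLACEHOLDER, base_url.rstrip("/"))


original = '<img src="https://static.wallabag.it/a.png?w=1">'
stored = strip_host(original, "static.wallabag.it")
rendered = restore_host(stored, "https://static.wallabag.it/")
```

With this, the stored HTML no longer contains the hostname, so a full-text search would not hit it, and changing the instance hostname would only mean changing the value passed to `restore_host`.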
And with the flag, you can have a filter to list all your "not parsed" articles, and a command to re-fetch them. I love this idea.
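To make the flag idea concrete, here is a minimal sketch in Python (illustration only, not wallabag code). The `Entry` class and the `is_fetched` field are assumptions; no such field exists according to the discussion above. Search looks at the content only when the flag is set, and the same flag drives the "not parsed" filter that a re-fetch command could iterate over.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    title: str
    content: str
    is_fetched: bool = True  # hypothetical flag: False when fetching failed


def matches(entry, term):
    """Title always counts; content counts only if it was properly fetched."""
    term = term.lower()
    if term in entry.title.lower():
        return True
    return entry.is_fetched and term in entry.content.lower()


def search(entries, term):
    return [e for e in entries if matches(e, term)]


def not_fetched(entries):
    """The "not parsed" filter: candidates for a re-fetch command."""
    return [e for e in entries if not e.is_fetched]


entries = [
    Entry("Rust news", "wallabag can't retrieve contents for this article. "
                       "Please troubleshoot this issue.", is_fetched=False),
    Entry("wallabag 2 released", "Release notes", is_fetched=True),
    Entry("My reading setup", "I use wallabag daily", is_fetched=True),
]
hits = [e.title for e in search(entries, "wallabag")]
broken = [e.title for e in not_fetched(entries)]
```

Here the failed article is left out of the "wallabag" results but still shows up in the "not parsed" list.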