Wallabag: Search "wallabag" gives too many results

Created on 23 Apr 2020 · 6 Comments · Source: wallabag/wallabag

This is a bug and not a feature from the evil wallabag core team 😈 (wallabag everywhere)

When you search for "wallabag" in the search engine, the results include every article whose content is the placeholder "wallabag can't retrieve contents for this article. Please troubleshoot this issue."
On wallabag.it, the results also include the articles where camo proxies the picture URLs through static.wallabag.it.

Bug

All 6 comments

What do you suggest?

Today I don't really know. I just wanted to open a new issue ;-)

Don't we have a boolean field saying whether the article was properly fetched? Adding one would be a first step: it would let us tell the search engine to search only in the title for those articles.
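To make the flag idea concrete, here is a minimal sketch in Python rather than wallabag's own code. The `Entry` class, the `is_fetched` field and the `search` function are all hypothetical names; the only point is that an article whose fetch failed is matched on its title alone, so the error placeholder never produces a hit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    title: str
    content: str
    # Hypothetical flag: False when the content is only the error placeholder.
    is_fetched: bool


def search(entries: List[Entry], term: str) -> List[Entry]:
    """Case-insensitive substring search that skips the body of unfetched entries."""
    needle = term.lower()
    results = []
    for entry in entries:
        haystack = entry.title
        if entry.is_fetched:
            # Only properly fetched articles contribute their body to the search.
            haystack += " " + entry.content
        if needle in haystack.lower():
            results.append(entry)
    return results
```

With this, an unfetched article titled "Some news" no longer matches "wallabag", while fetched articles still match on title or body.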

As for the second issue, the search looking into the HTML itself, the only solution I can think of is to use Elasticsearch and convert the HTML into plain text before tokenizing and saving it.
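Whatever search backend is used, the conversion step could look roughly like the sketch below. It relies only on Python's standard `html.parser` module; `html_to_text` and the helper class are hypothetical names, and this is an illustration of the idea, not how wallabag or Elasticsearch actually does it.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, attributes, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment, suitable for indexing."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because attribute values such as `src="https://static.wallabag.it/..."` are never passed to `handle_data`, the proxied image URLs drop out of the indexed text. One trade-off of joining chunks with a space is that a word split by inline markup ends up as two tokens.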

I second @tcitworld on the boolean as a way to solve the first issue.

For the second one, ES is not the only solution. You could also save the text version without tags in the current datastore and continue searching in it as usual.
Another idea would be to replace the hostname in img tags with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance.
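A rough sketch of the hostname idea, in Python for illustration only. The `{{IMAGE_HOST}}` token and both function names are made up for this example, and a regex is a simplification of real HTML handling: the host is swapped for a placeholder when the content is stored and put back when it is rendered.

```python
import re

# Hypothetical token stored in place of the real image host.
PLACEHOLDER = "{{IMAGE_HOST}}"


def _img_src_pattern(host: str) -> "re.Pattern[str]":
    # Matches everything from "<img" up to the scheme and host of its src attribute.
    return re.compile(
        r'(<img\b[^>]*?\bsrc=["\'])https?://' + re.escape(host),
        re.IGNORECASE,
    )


def strip_host(html: str, host: str) -> str:
    """Replace scheme and host in img src attributes with the placeholder before saving."""
    return _img_src_pattern(host).sub(lambda m: m.group(1) + PLACEHOLDER, html)


def restore_host(html: str, base_url: str) -> str:
    """Put the current base URL back when rendering the stored content."""
    return html.replace(PLACEHOLDER, base_url)
```

Stored this way, the content no longer contains the instance hostname, so a search for "wallabag" would not hit the image URLs, and changing the instance hostname only changes the value passed to `restore_host`.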

You could also save the text version without tags in the current datastore and continue searching in it as usual.

It would pretty much mean saving most of the data twice in your database, so quite inefficient. With ES you would save only the appropriate data once.

Another idea would be to replace the hostname in img tags with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance.

This idea sounds more interesting to me.

And with the flag, you could have a filter listing all your "not parsed" articles, plus a command to re-fetch them. I love this idea.
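The filter and the re-fetch command could be sketched like this, again in Python and with hypothetical names (`Entry`, `is_fetched`, `not_parsed`, `refetch`). The `fetch` callable stands in for whatever actually downloads and parses an article; it is assumed to return `None` on failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    url: str
    content: str = ""
    # Hypothetical flag from the discussion: True once the article was parsed.
    is_fetched: bool = False


def not_parsed(entries: List[Entry]) -> List[Entry]:
    """Filter: every entry whose content was never fetched properly."""
    return [entry for entry in entries if not entry.is_fetched]


def refetch(entries: List[Entry], fetch: Callable[[str], Optional[str]]) -> int:
    """Retry each unparsed entry and return how many were recovered."""
    recovered = 0
    for entry in not_parsed(entries):
        content = fetch(entry.url)
        if content is not None:
            entry.content = content
            entry.is_fetched = True
            recovered += 1
    return recovered
```

Entries that still fail keep `is_fetched` set to `False`, so they stay in the filter and can be retried on the next run.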
