Wallabag: Search "wallabag" gives too many results

Created on 23 Apr 2020 · 6 comments · Source: wallabag/wallabag

This is a bug and not a feature from the evil wallabag core team 😈 (wallabag everywhere 🤘)

When you search for "wallabag" in the search engine, the results include every article whose content is the placeholder "wallabag can't retrieve contents for this article. Please troubleshoot this issue."
On wallabag.it, the results also include the articles whose image URLs camo proxies through static.wallabag.it.

Bug

nicosomb

All 6 comments

What do you suggest?

j0k3r on 23 Apr 2020

Today I don't really know. I just wanted to open a new issue ;-)

nicosomb on 23 Apr 2020

We don't have any boolean field saying whether the article was properly fetched? Should be a first step to tell search engine to search only in the title for those articles.

As for the second issue, search matching inside the HTML markup itself, the only solution I can think of is to use Elasticsearch and convert the HTML to plain text before tokenizing and saving it.

tcitworld on 23 Apr 2020

❤1 👍1
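
To illustrate the "convert the HTML into plain text" step mentioned above, here is a minimal Python sketch. It is not wallabag's code (wallabag is a PHP application), and the names TextExtractor, html_to_text and matches are made up for this example. It only shows why stripping markup first keeps a hostname inside an img src attribute from matching a search. It uses the standard library html.parser module and does not skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def matches(term, html):
    """Case-insensitive search over the text only, not the markup."""
    return term.lower() in html_to_text(html).lower()
```

With this approach, an article whose only mention of "wallabag" is the proxied image hostname would no longer match, whichever store the plain text ends up in.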

I second that @tcitworld for the boolean in order to solve the first issue.

For the second one, ES is not the only solution. You could also save the text version without tags in the current datastore and continue searching in it as usual.
Another idea would be to replace hostname in img tag with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance.

Kdecherf on 23 Apr 2020
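
One possible reading of the "replace hostname in img tag" idea, sketched in Python under stated assumptions: the stored HTML keeps a placeholder token instead of the proxy hostname, and the real base URL is substituted when the article is displayed. The token {{IMAGE_HOST}} and the functions to_stored and to_rendered are hypothetical. A regular expression is used only to keep the sketch short, so it is fragile compared with a real HTML parser, and the plain string substitution would also rewrite the token if it ever appeared in article text.

```python
import re

# Hypothetical token stored in place of the scheme and hostname.
PLACEHOLDER = "{{IMAGE_HOST}}"

# Captures everything up to the opening quote of src, then the hostname.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")https?://([^/"]+)', re.IGNORECASE)


def to_stored(html, host):
    """Replace `host` in img src attributes with the placeholder."""

    def swap(match):
        if match.group(2).lower() == host.lower():
            return match.group(1) + PLACEHOLDER
        return match.group(0)  # images on other hosts are left untouched

    return _IMG_SRC.sub(swap, html)


def to_rendered(stored_html, base_url):
    """Substitute the current base URL (scheme included) at display time."""
    return stored_html.replace(PLACEHOLDER, base_url)
```

Stored this way, the content no longer contains the hostname, so a search for "wallabag" would not match it, and changing the instance hostname would only mean changing the value passed to to_rendered.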

"You could also save the text version without tags in the current datastore and continue searching in it as usual."

It would pretty much mean saving most of the data twice in your database, so quite inefficient. With ES you would save only the appropriate data once.

"Another idea would be to replace hostname in img tag with values stored elsewhere. It could let us fix another issue when you change the hostname of your wallabag instance."

This idea sounds more interesting to me.

tcitworld on 23 Apr 2020

And with the flag, you can have a filter to have all your "not parsed" articles. And a command to re-fetch them. I love this idea.

nicosomb on 23 Apr 2020
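
A rough Python sketch of how the flag discussed in this thread could drive both the search and the "not parsed" filter. The Entry class and its is_fetched field are assumptions made for this example and are not wallabag's actual schema. Entries that were not fetched are matched on their title only, so the placeholder error text stored as their content is ignored.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    title: str
    content: str
    is_fetched: bool  # hypothetical flag: False when fetching the content failed


def search(entries, term):
    """Match title and content, but title only for entries that failed to fetch."""
    needle = term.lower()
    results = []
    for entry in entries:
        haystack = entry.title
        if entry.is_fetched:
            haystack += "\n" + entry.content
        if needle in haystack.lower():
            results.append(entry)
    return results


def not_parsed(entries):
    """Entries a re-fetch command could retry."""
    return [entry for entry in entries if not entry.is_fetched]
```

The same flag then serves the two uses mentioned above: a filter listing the "not parsed" articles, and the input list for a command that re-fetches them.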
