News: Reimplement full text scraping

Created on 29 Mar 2019 · 27 Comments · Source: nextcloud/news

Full text scraping doesn't work anymore since the switch from picoFeed to feed-io.

With picoFeed, if the option "Enable full text" was activated for a specific feed, I could read the whole article even if the full content was not included in the feed. Since the switch to feed-io (I guess; I'm not sure exactly when I first noticed it), full-content scraping doesn't work anymore.

Maybe this is related: https://github.com/alexdebril/feed-io/issues/211

Labels: 1. to develop, Technical debt, help wanted, regression

biva

👍5

All 27 comments

The full text feature requires parsing the actual website, which is quite complicated.
Full text was only re-enabled with 13.1.5 for feeds that actually contain the whole text.

Grotax on 29 Mar 2019

Maybe the wallabag parser could help? Developed by @j0k3r
https://github.com/j0k3r/graby

biva on 29 Mar 2019

👍2

I once looked for ways to do this for the bookmarks app as well. I abandoned the idea because I thought the news app already does this quite well and it might only be needed for read-it-later style bookmarking; now I'm reconsidering. In any case, these were the libraries I was considering ;)
https://github.com/nextcloud/bookmarks/issues/438#issuecomment-364756929

marcelklehr on 29 Mar 2019

👍1

@marcelklehr Why do you think that graby is of questionable quality?

https://github.com/j0k3r/graby (do-it-all grab bag of questionable quality)

I use it with wallabag and I'm very satisfied :)

@Grotax Do you think it's doable with one of the libraries mentioned by @marcelklehr ? I'm available to test if needed (no coding skills unfortunately).

And by the way, I like your new feature of grouping the news by original website. I guess it's also related to the switch to feed-io?

biva on 4 Apr 2019

I haven't checked it yet. And even though I'm the official maintainer in the app store and have made some changes, I'm not a PHP expert, so I haven't decided whether I'm going to implement this.

But I'm definitely interested :)

Grotax on 5 Apr 2019

👍3

Showing full text requires site-specific parsing strings. Luckily, full-text-rss (AGPL v3) does precisely that, and @fivefilters even offers an abundance of site-specific config files: ftr-site-config (public domain).

danielrheinbay on 3 May 2019
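
As a rough illustration of what "site-specific parsing strings" means in the comment above: each site gets a small rule file naming the element that holds the article body and the elements to strip. The Python sketch below is not code from full-text-rss, graby, or the News app. The `key: XPath` layout is only modelled loosely on such config files, and `parse_site_config`, `extract_body`, the rules, and the sample page are all made up for the example. It uses the limited XPath subset in Python's standard library and assumes a well-formed page; real pages would need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical rule file for one site (assumed "key: XPath" layout).
CONFIG = """# made-up rules for example.com
title: //h1
body: //div[@class='article']
strip: //div[@class='ad']
"""

# Well-formed sample page, invented for the example.
PAGE = """<html><body>
<h1>Headline</h1>
<div class='article'><p>First paragraph.</p><div class='ad'>Buy now</div><p>Second paragraph.</p></div>
</body></html>"""


def parse_site_config(text: str) -> Dict[str, List[str]]:
    """Collect 'key: value' lines; blank lines and '#' comments are skipped."""
    rules: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        rules.setdefault(key.strip(), []).append(value.strip())
    return rules


def extract_body(page_xml: str, rules: Dict[str, List[str]]) -> Optional[str]:
    """Apply 'strip' rules, then return the text of the first matching 'body' rule."""
    root = ET.fromstring(page_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for xpath in rules.get("strip", []):
        # Prefix "." because ElementTree rejects absolute paths on elements.
        for node in root.findall("." + xpath):
            parents[node].remove(node)
    for xpath in rules.get("body", []):
        node = root.find("." + xpath)
        if node is not None:
            parts = (t.strip() for t in node.itertext())
            return " ".join(p for p in parts if p)
    return None


if __name__ == "__main__":
    print(extract_body(PAGE, parse_site_config(CONFIG)))
```

The point of the sketch is the maintenance cost discussed in this thread: the extraction code stays small, but every site needs its own rules, and those rules break whenever the site changes its markup.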

I don't want to implement something that's always outdated unless you pay though so that won't be the one we use.

SMillerDev on 4 May 2019

👍1

I actually did start working on this but wasn't satisfied with the current lib versions so I guess this will take more time.

Grotax on 4 May 2019

For the record, I am pretty satisfied with how most of the content makes it to the app. There are a few blogs that don't give you the full text without scraping, and then there are the likes of Reddit, where you always need to click on the link. Only for those would it be nice to have this feature, but it's not a dealbreaker, as we already get the full text for most sites.

nachoparker on 4 May 2019

👎1

To add my opinion too: of the 9 sites I follow using RSS, only 3 are okay without this functionality.
The others only show a short text :/

Kcchouette on 4 May 2019

👍1

You should complain to the authors then. This after all is a news reader, not a news scraper. We shouldn't work around restrictions provided by sources but convince them to be less restrictive for the benefit of all users.

SMillerDev on 5 May 2019

I mean, would it be awesome to have? Yes.
Is it a fundamental part of an RSS reader? No.

It would be great, but we have to be aware of the limitations of small open source products, and this is not trivial to implement.

nachoparker on 6 May 2019

I just realized that "Enable full text" doesn't do anything right now. What is it supposed to change? Should we switch from description and content sections of the feed?

nachoparker on 6 May 2019

Right now nothing changes. In the past this enabled the scraper but at this moment there's nothing to scrape so it won't. There's also nothing we can display differently.

SMillerDev on 6 May 2019

👍1
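
For context on the question above about the description and content sections: an RSS item can carry a short `<description>` and, optionally, the whole article in `<content:encoded>`. The Python sketch below shows a reader preferring the full body when the feed supplies one. It is an illustration only, not the News app's or feed-io's code; `item_body`, `bodies`, and the sample feed are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the RSS content module, which defines <content:encoded>.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Invented sample feed: one item with a full body, one with a teaser only.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>With full text</title>
      <description>Short teaser</description>
      <content:encoded><![CDATA[<p>Whole article</p>]]></content:encoded>
    </item>
    <item>
      <title>Teaser only</title>
      <description>Short teaser only</description>
    </item>
  </channel>
</rss>"""


def item_body(item: ET.Element) -> str:
    """Return <content:encoded> if present and non-empty, else <description>."""
    full = item.findtext("content:encoded", default="", namespaces=CONTENT_NS)
    if full.strip():
        return full
    return item.findtext("description", default="")


def bodies(feed_xml: str) -> List[str]:
    """Return the chosen body for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [item_body(item) for item in root.iter("item")]


if __name__ == "__main__":
    for body in bodies(SAMPLE):
        print(body)
```

For the second item there is nothing better than the teaser inside the feed itself, which is the situation where only scraping the linked page could produce the full text.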

IMO the "full text" option should be removed. If and when a scraper is available, I think it should just always use full text, and we simplify the UI.

nachoparker on 6 May 2019

I don't want to implement something that's always outdated unless you pay though so that won't be the one we use.

Wallabag uses https://github.com/j0k3r/graby, which is kept up to date. It has been around for years, and the results are very satisfying. It is based on the fivefilters project mentioned by @danielrheinbay and improves on it.

biva on 6 May 2019

I might check graby again after the 2.0 release.
The current version seems outdated.

Grotax on 6 May 2019

I tried the current state of the 2.0 version today and it worked perfectly, all feeds were added completely to the database.

I would go ahead and tidy up the code and make use of the full-text option so that it can be enabled per feed. Alternatively, I would suggest that it try to fetch the page whenever the feed itself supplies no text.

powerpaul17 on 13 May 2019

🎉2
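
The two behaviours suggested above (a per-feed full-text switch, and falling back to fetching when the feed supplies no text) could be combined roughly as follows. This is a hypothetical Python sketch, not the code in powerpaul17's branch: `Item`, `resolve_body`, and the injected `scrape` callable are invented names, and the real app is written in PHP.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Item:
    url: str
    body: str  # text supplied by the feed itself, possibly empty


def resolve_body(
    item: Item,
    full_text_enabled: bool,
    scrape: Callable[[str], Optional[str]],
) -> str:
    """Pick the text to store for an item.

    Scrape when the per-feed option is on, or when the feed supplied no
    text at all. Keep the feed's own text if scraping returns nothing.
    """
    if full_text_enabled or not item.body.strip():
        scraped = scrape(item.url)
        if scraped:
            return scraped
    return item.body


if __name__ == "__main__":
    # Stand-in scraper for demonstration; a real one would fetch and parse the page.
    fake_scrape = lambda url: "full text of " + url
    print(resolve_body(Item("https://example.com/a", "teaser"), False, fake_scrape))
    print(resolve_body(Item("https://example.com/a", "teaser"), True, fake_scrape))
```

Passing the scraper in as a parameter keeps the decision logic testable without network access, and keeping the feed's text as a fallback means a broken scraper degrades to the current behaviour rather than to an empty article.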

I'm currently testing the 2.0 release heavily to ensure it runs smoothly, like the 1.x releases.

j0k3r on 14 May 2019

❤2

Anybody who is interested can check out my branch: here.

I already use it in my setup because most of my feeds require me to click again and open something in a browser which is really annoying for me (apparently I'm not the only one).

Please let me know how to proceed to merge this at some point into the app.

powerpaul17 on 14 May 2019

Wouldn't it be better to tell the author of the feed that its current practice is annoying?

SMillerDev on 14 May 2019

Wouldn't it be better to tell the author of the feed that its current practice is annoying?

In an ideal world, yes, but realistically: good luck with that. ;)

powerpaul17 on 14 May 2019

👍1

The diff from @powerpaul17 seems quite reasonable: https://github.com/nextcloud/news/compare/nextcloud:master...powerpaul17:full_text_scraping

Full text scraping was a very nice feature; it would be sad to see it thrown away.

Wouldn't it be better to tell the author of the feed that its current practice is annoying?

Wouldn't it be annoying to contact every feed's author instead of having a piece of code that can do (even partially) the job? :)

brunob on 21 May 2019

It would be, but you'd be fixing it for everyone and not just yourself. So you get to feel good about it.

Either way, whoever implements it gets to fix it for everyone every time it's broken. I'm not maintaining it.

SMillerDev on 21 May 2019

Please don't take this the wrong way, but I really do not understand what the problem is with being able to activate or deactivate full text fetching on a per-feed basis. You just don't have to enable it if you don't want it.

I simply want to read my news feeds anywhere even without my internet connection and without having to open a browser every time I switch to the next article.

Either way, whoever implements it gets to fix it for everyone every time it's broken. I'm not maintaining it.

So finally I understand the real reason for your strong resistance to this feature. ;) Anyway, I just find it kind of sad. Devs are always saying: if you want that feature, why don't you go and implement it yourself, we don't have time. And then if someone does, it is not welcome, which leaves no other way than a fork (which I have running on my system), instead of joining forces to make an awesome project.

powerpaul17 on 21 May 2019

I'm fine with anyone contributing to the project. I'm fine with this feature being re-added by someone else into the project. Just a heads up that any issues that arise from that will not be picked up by me because I think it's the wrong thing to do.

SMillerDev on 22 May 2019

Aaand I'm going to lock this conversation. I think everyone was able to express their opinion.

You can always create a PR and it will be considered.

Keep in mind that maintaining this app is not a full-time job and therefore has a low priority.

Grotax on 22 May 2019
