FreshRSS: Clear duplicates

Created on 18 Aug 2015  ·  17 Comments  ·  Source: FreshRSS/FreshRSS

Here's the situation:

  • I subscribed to a few planets (like planet-libre.org)
  • I subscribed to a few blogs that publish some of their articles (but not all of them) on some of those planets

Once in a while, I get duplicates: the article from the planet subscription and the one from the blog subscription.

I'm wondering whether there could be a way to "clear" the displayed articles in those cases and show only one of them (ideally the original one, not the copies on the planets).

Most helpful comment

Example of duplicate from https://www.clubic.com/articles.rss

    <item>
      <title>Réseaux LoRa &amp; Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
      <description> [...]</description>
      <pubDate>Sun, 03 Jun 2018 18:45:00 +0200</pubDate>
      <link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
      <guid isPermaLink="false">843861</guid>
    </item>
    <item>
      <title>Réseaux LoRa &amp; Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
      <description>On estime que d'ici 2020, ce sont plus de 50 000 Go de données qui transiteront entre nos machines chaque seconde. Ces échanges massifs se font en partie via des réseaux sans fil, les plus répandus ét [...]</description>
      <pubDate>Sun, 03 Jun 2018 18:30:00 +0200</pubDate>
      <link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
      <guid isPermaLink="false">843857</guid>
    </item>
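
As a rough illustration of the example above: both items share the same `<link>` but carry different `<guid>` values, so a guid-based check would treat them as distinct. The sketch below is not FreshRSS code; `dedupe_by_link` is a hypothetical helper, and the feed text is a trimmed copy of the two items wrapped in a minimal RSS document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two items above (descriptions omitted), wrapped in a
# minimal RSS document so it can be parsed on its own.
LINK = ("http://www.clubic.com/reseau-informatique/"
        "article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html")
RSS = f"""<rss version="2.0"><channel>
<item>
  <title>Réseaux LoRa &amp; Sigfox</title>
  <pubDate>Sun, 03 Jun 2018 18:45:00 +0200</pubDate>
  <link>{LINK}</link>
  <guid isPermaLink="false">843861</guid>
</item>
<item>
  <title>Réseaux LoRa &amp; Sigfox</title>
  <pubDate>Sun, 03 Jun 2018 18:30:00 +0200</pubDate>
  <link>{LINK}</link>
  <guid isPermaLink="false">843857</guid>
</item>
</channel></rss>"""


def dedupe_by_link(rss_text):
    """Keep the first item seen for each <link>.

    Returns two lists of guids: (kept, dropped).
    """
    kept, dropped, seen = [], [], set()
    for item in ET.fromstring(rss_text).iter("item"):
        link = (item.findtext("link") or "").strip()
        guid = (item.findtext("guid") or "").strip()
        if link in seen:
            dropped.append(guid)  # same link as an earlier item
        else:
            seen.add(link)
            kept.append(guid)
    return kept, dropped


kept, dropped = dedupe_by_link(RSS)
print("kept:", kept, "dropped:", dropped)
```

Keeping the first item in document order is an arbitrary choice here; in this feed it happens to keep the copy with the empty description, so a real implementation would probably want a better tie-breaker.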

All 17 comments

Thanks for the suggestion. I have been thinking of introducing such an option for a while, since it is also something I need.
In your case, it is the same URL for both copies of the article, isn't it?

Any progress on this?
I get a lot of duplicate articles myself as well.
Case 1: duplicates within the same RSS feed.
Case 2: duplicates across multiple RSS feeds.
For case 1, remove the duplicates from that feed.
For case 2, check whether the article already exists in one of the other feeds. If it does, reject the article.

In case 2 it would be nice to compare the copies and keep the most complete one (for example by checking the number of words and whether there are images).
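
A minimal sketch of the "keep the most complete copy" idea, assuming copies are matched on a normalised title. The `Article` record, the scoring function and the matching rule are all hypothetical choices for illustration, not anything FreshRSS does.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record for one fetched entry."""
    feed: str
    title: str
    content: str  # HTML body as delivered by the feed


def completeness(article):
    """Score a copy as (word count of the visible text, number of <img> tags)."""
    text = re.sub(r"<[^>]+>", " ", article.content)  # crude tag stripping
    words = len(text.split())
    images = len(re.findall(r"<img\b", article.content, flags=re.IGNORECASE))
    return (words, images)


def pick_most_complete(articles):
    """For each normalised title, keep only the copy with the highest score."""
    best = {}
    for article in articles:
        key = " ".join(article.title.lower().split())
        if key not in best or completeness(article) > completeness(best[key]):
            best[key] = article
    return list(best.values())


articles = [
    Article("planet", "Hello World", "<p>Short teaser</p>"),
    Article("blog", "hello  world", "<p>The full text of the post</p><img src='a.png'>"),
    Article("blog", "Other", "<p>x</p>"),
]
result = pick_most_complete(articles)
print([(a.feed, a.title) for a in result])
```

Matching on title alone is only one option; a later comment in this thread points out that titles collide more often than URLs, so the key function is the part most likely to need changing.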

@Wanabo In your case, what exactly do your duplicates look like?
Do they have the exact same:

  • URL?
  • Title?
  • Content?

URL: may differ, because the same article is published by different sources (different websites).
URL: may also be identical, when the article comes from different feeds of the same website, for example a Sport feed and a Lifestyle feed that both contain it.
Title: same as for URL.
Content: same as for URL.

Image of problem

Hi,

My situation is similar. Here is my example.

Some sites offer the option to subscribe either to ALL articles or to a single category. Most of the time I am not interested in everything, but in more than one category. Let's say two.

So I set up two RSS URLs, one for each category. But some articles are assigned to both categories, resulting in 100% duplicate entries across the two RSS URLs.

My guess is that this is one of the easier duplicates to filter.

Bump on this. Any news on an anti-dupe feature?

For rivers/planets, a dupe is an entry where "guid", "title", "content_bin", "link" and finally "hash" are all the same: if "content_bin" differs, it is a different comment from a river.

dupes

Here, lines 2 and 3 are dupes; lines 1 and 4 have the same link but a different comment, and come from different sources.

Two options would be great:

  • delete "full" dupes, where "guid", "title", "content_bin", "link" and finally "hash" (didn't check where hash is calculated) are the same,
  • delete "simple" dupes, where only "title" and "link" are the same.

Possible?
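
To make the two options concrete, here is a minimal sketch that treats each entry as a plain dictionary. The field names come from the comment above; the `id` field, the sample rows and the `find_dupes` helper are made up for illustration, and the "hash" value is taken as given rather than computed.

```python
# Fields compared for each kind of duplicate, as listed in the two options above.
FULL_KEY = ("guid", "title", "content_bin", "link", "hash")
SIMPLE_KEY = ("title", "link")


def find_dupes(entries, fields):
    """Return the ids of entries whose `fields` all match an earlier entry."""
    seen = set()
    dupes = []
    for entry in entries:
        key = tuple(entry[f] for f in fields)
        if key in seen:
            dupes.append(entry["id"])
        else:
            seen.add(key)
    return dupes


# Four hypothetical rows: 2 and 3 are identical, 1 and 4 share title and link
# but carry different comments.
entries = [
    {"id": 1, "guid": "g1", "title": "T", "content_bin": "comment A",
     "link": "http://example.org/a", "hash": "h1"},
    {"id": 2, "guid": "g2", "title": "U", "content_bin": "body",
     "link": "http://example.org/b", "hash": "h2"},
    {"id": 3, "guid": "g2", "title": "U", "content_bin": "body",
     "link": "http://example.org/b", "hash": "h2"},
    {"id": 4, "guid": "g4", "title": "T", "content_bin": "comment B",
     "link": "http://example.org/a", "hash": "h4"},
]

full_dupes = find_dupes(entries, FULL_KEY)
simple_dupes = find_dupes(entries, SIMPLE_KEY)
print("full:", full_dupes, "simple:", simple_dupes)
```

With these rows the "full" rule flags only entry 3, while the "simple" rule also flags entry 4 and would therefore hide the second river comment.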

@Liandriz Yes, this is a very desirable feature, which I would like to implement. I am thinking of adding an option to automatically mark an article as read when it is a duplicate of one already in the database.

But there are several details to take into account, for instance which version(s) should be marked as read, in particular when feeds are refreshed in a random order. It would probably require specifying which feed is the reference.

One possible solution is to make the categories sortable so that there's a hierarchy. We can then use the hierarchy to determine which source is the reference. Works like a "first-come-first-served" concept.
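
One way to picture the "reference feed" idea is a sketch like the following, which assumes duplicates are matched on their link and that each feed has a numeric priority. The `feed_priority` mapping, the entry dictionaries and the `mark_duplicates_read` helper are hypothetical; the comment above talks about ordering categories, and a per-feed priority is used here only to keep the example short.

```python
def mark_duplicates_read(entries, feed_priority):
    """Mark every copy of an article as read except the one from the reference feed.

    entries: list of dicts with 'feed', 'link' and 'is_read' keys.
    feed_priority: dict mapping feed name to a rank, lower meaning preferred.
    Feeds missing from the mapping rank after all listed feeds.
    """
    best = {}  # link -> (rank, entry) of the preferred copy seen so far
    for entry in entries:
        rank = feed_priority.get(entry["feed"], len(feed_priority))
        current = best.get(entry["link"])
        if current is None or rank < current[0]:
            best[entry["link"]] = (rank, entry)
    for entry in entries:
        if best[entry["link"]][1] is not entry:
            entry["is_read"] = True
    return entries


priority = {"blog": 0, "planet": 1}
# The planet copy arrives first, as it might when feeds refresh in random order.
entries = [
    {"feed": "planet", "link": "http://example.org/post", "is_read": False},
    {"feed": "blog", "link": "http://example.org/post", "is_read": False},
]
mark_duplicates_read(entries, priority)
print([(e["feed"], e["is_read"]) for e in entries])
```

Because the preferred copy is chosen by priority rather than by arrival time, the result is the same whichever feed is refreshed first.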

Any updates on this? Maybe something like what Inoreader offers could be implemented.

image

Hello, same here, I would like this very useful feature too :) !

+1

Hi Alkarex,
I wrote a query to find the duplicated titles. I plan to set up a cron job that sets is_read = true on the duplicates. It seems that my MariaDB version does not have a rank function yet, which is why I use CASE WHEN.

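    -- Number the rows within each run of identical titles using user variables
    -- (a stand-in for a rank function); rows with rank > 1 are treated as repeats.
    SELECT *
    FROM   (SELECT ( CASE title
                       WHEN @curtype THEN @currow := @currow + 1
                       ELSE @currow := 1
                            AND @curtype := title
                     END ) + 1 AS `rank`,
                   freshrss_thomas_entry.*
            FROM   `freshrss_thomas_entry`,
                   (SELECT @currow := 0,
                           @curtype := '') r
            ORDER  BY title DESC) AS o
    WHERE  `rank` > 1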

If anyone is interested in a more complete solution, let me know. I cannot write PHP, but I am able to work with the database.
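
For readers who want to try the cron-job idea without touching a real FreshRSS database, here is a minimal sketch using an in-memory SQLite table. The `entry` table and its columns are a simplified stand-in, not the real FreshRSS schema, and the query keeps the lowest id per title instead of using the user-variable technique above. Some MySQL/MariaDB versions reject an UPDATE whose subquery reads the table being updated, so the statement would likely need adapting there.

```python
import sqlite3

# Simplified stand-in table; the real FreshRSS schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry ("
    " id INTEGER PRIMARY KEY, title TEXT, link TEXT, is_read INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO entry (id, title, link) VALUES (?, ?, ?)",
    [
        (1, "A", "http://example.org/a"),
        (2, "B", "http://example.org/b"),
        (3, "A", "http://planet.example.org/a"),
    ],
)


def mark_title_dupes_read(conn):
    """Mark as read every entry whose title already exists on a lower id.

    Returns the number of rows touched by the UPDATE.
    """
    cur = conn.execute(
        "UPDATE entry SET is_read = 1 "
        "WHERE id NOT IN (SELECT MIN(id) FROM entry GROUP BY title)"
    )
    conn.commit()
    return cur.rowcount


changed = mark_title_dupes_read(conn)
read_ids = [row[0] for row in conn.execute(
    "SELECT id FROM entry WHERE is_read = 1 ORDER BY id")]
print(changed, read_ids)
```

Matching on the title alone follows the query above; switching the `GROUP BY` to `link` would give the URL-based variant discussed elsewhere in this thread.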

I'm not really bothered by it regardless but titles overlap all the time. URLs tend not to. I'd do either just URL or possibly URL and title. Moreover, in the cases where there's the most overlap (like Planet Debian) only the URLs match. So I'd say title is simultaneously false positive heaven yet almost never matching when you want it to.

@Kaan88 How exactly does the Inoreader feature in the screenshot you posted work? Is it global? Per feed?

Hello, a quick bump, because this is an important feature that FreshRSS is missing. Thanks again for your work :)

@Jean-Mich-Much Yes, this is also something I would like to add fairly soon :-)
