Sonarr: Removing non-alphanumeric characters from all searches doesn't work for some indexers

Created on 7 Apr 2016 · 33 Comments · Source: Sonarr/Sonarr

When searching, especially for anime, cleanTitle is not what is needed.

It should search for the exact scene title (perhaps in addition to the clean title, so as not to break other search APIs). This would be especially helpful for anime.

Nyaa fixed their search API and it now properly returns results for titles containing an apostrophe ('). The apostrophe needs to be URL-encoded as %27, though.
Here is the now successful search for JoJo's Bizarre Adventure:
http://www.nyaa.se/?page=search&cats=1_37&filter=1&term=JoJo%27s+Bizarre+Adventure
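
As a minimal sketch of the encoding step, the search URL above can be rebuilt from the plain title with Python's standard library. The function name `build_search_url` is made up for illustration; the query parameters are copied from the URL above.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.nyaa.se/"


def build_search_url(term: str) -> str:
    """Build the Nyaa search URL from the issue for an arbitrary title."""
    # quote_plus percent-encodes the apostrophe as %27 and turns spaces into '+'.
    return f"{BASE_URL}?page=search&cats=1_37&filter=1&term={quote_plus(term)}"


print(build_search_url("JoJo's Bizarre Adventure"))
```

The point is only that the apostrophe survives as %27 instead of being stripped before the request is made.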

Here is the hastebin that Taloth made about how Sonarr currently searches:
http://hastebin.com/qotubuxeme.vhdl

enhancement medium

All 33 comments

I think this one just needed a better name. We're using a modified version of the scene names, which works for a lot of indexers, but some are special :smile: We just need a way to modify them later in the process to allow for customization for certain indexers.

We should move the cleanup logic out of the searchcriteria and into the RequestGenerator.

There is a similar issue with Knight's & Magic: usenet indexers have it as Knights & Magic and Nyaa.si has Knight's & Magic, but we replace & with "and" when searching.

Is it perhaps possible to allow users to change the name that's being searched?
Just a simple override that they can tailor to their indexer setup and the show that is not working?

I think it would be great if it was possible to manually modify how the search is done. I know for some torrent trackers you sometimes need to include double quotes around the search term to get a proper result if the show name contains spaces for example.

Knight's & Magic

Has anyone found a solution to this problem? Unfortunately, the release I use here is listed with "&", and only "AND" is found.

markus101, can there be an exception so that shows with a ' in the title are also searched without it, and titles with an ampersand (&) are searched both with and without it, as well as with it replaced by "and"?

For example: The Handmaid's Tale, Will & Grace >> The Handmaids Tale, Will Grace, Will and Grace?

RSS caught The Handmaid's Tale, but manual search doesn't.

Running into the same issue as kat with The Handmaid's Tale -- Jackett catches both, but manual search only catches versions without the apostrophe. Any way we could have it search for both, or perhaps specify which on a per-show basis?

This happens for me using Torznab through Jackett. Debug log:

18-5-20 03:25:33.0|Info|NzbSearchService|Searching 1 indexers for [The Handmaid's Tale : S02E05]
18-5-20 03:25:33.0|Debug|Torznab|Downloading Feed [MY_SERVER]/jackett/api/v2.0/indexers/privatehd/results/torznab/api?t=tvsearch&cat=2000,2010,2030,2040,3000,5000,5030,5040&extended=1&apikey=(removed)&offset=0&limit=100&q=Handmaids%20Tale&season=2&ep=5
18-5-20 03:25:33.5|Debug|NzbSearchService|Total of 0 reports were found for [The Handmaid's Tale : S02E05] from 1 indexers

The Handmaid's Tale gets converted to &q=Handmaids%20Tale

Manual search on Jackett for "Handmaid's Tale" works but not "Handmaids Tale", so the above causes the indexer to return no results.
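
To make the conversion concrete, here is a rough Python approximation of the cleanup as it is described in this thread (leading "the" dropped, & replaced with "and", apostrophes removed). The function `clean_query_title` and its rules are assumptions inferred from the log above and the comments; this is not Sonarr's actual code.

```python
import re
from urllib.parse import quote


def clean_query_title(title: str) -> str:
    """Hypothetical approximation of the cleanup seen in the log above."""
    cleaned = re.sub(r"^the\s+", "", title, flags=re.IGNORECASE)  # drop leading "The"
    cleaned = cleaned.replace("&", "and")              # '&' becomes the word "and"
    cleaned = re.sub(r"['`]", "", cleaned)             # apostrophes vanish, no space left behind
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", cleaned)   # remaining symbols act as separators
    return cleaned.strip()


print(quote(clean_query_title("The Handmaid's Tale")))  # Handmaids%20Tale
print(clean_query_title("Knight's & Magic"))            # Knights and Magic
```

Under these assumed rules the query no longer contains anything the indexer could match against "Handmaid's", which is why an indexer that requires the apostrophe returns zero results.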

roman-22 So Jackett is part of the problem... Didn't even think of that part.

I think each indexer's search handles it differently. TL recognises both "Handmaids Tale" and "Handmaid's Tale" whereas PHD (used in above log) needs the apostrophe.

Because Jackett is not given the apostrophe and is only passed "Handmaids Tale" I don't think there's a way for Jackett to solve the problem.

Either Sonarr needs to pass the apostrophe to Jackett, or the indexer needs to adapt their search engine to allow looser matches to be found.

I opened another issue (#2644) about a similar problem. Because mine was closed immediately and this one has more attention, I'll add my opinion here.

In my case the problem isn't only that Sonarr automatically removes the single quote; it also removes "the" from any series such as The Handmaid's Tale, leaving "handmaids tale", which is far from correct and complicates the way indexers work.

@markus101 said that they cannot leave sanitizing to the indexers because indexers don't always do it, but I don't think that is a reason to do things wrong.

If indexers are not sanitizing, that is not Sonarr's problem; it is the indexer's problem. I don't understand why one application should do things it shouldn't just because external applications don't work otherwise.

I wrote one indexer for Jackett and fixed another one to make it Sonarr compliant, and in my case the problem I face is that Sonarr is "making up" titles that don't match reality, so I cannot know the real one.

Removing apostrophes or "the" from titles before sending them to the indexer is outside Sonarr's scope.

This is NOT the fault or problem of indexers. Sonarr picks up results IN RSS but not in searches. The implementation could be added with tweaks to the search algorithm for titles by also searching for results with special characters stripped in Sonarr. In fact, releases are actually meant to be untouched (including their filenames) due to standards of release groups and the Scene which make sure files do not contain special characters for the sake of compatibility and consistency.

As for removing "the" from titles and adding them to search results, this occurs but is seemingly harmless in and of itself. It does not appear to be hurting RSS snatches.

@kat953162 I think you didn't understand my point. I'm saying that sanitizing the title is the indexer's problem, not Sonarr's. Obviously RSS works, because Sonarr doesn't send any query.

The origin of the problem is Sonarr: it is removing characters from titles that it shouldn't. They say they do it because indexers don't sanitize, so they have to. Wrong. If indexers are badly implemented, that is the indexer's problem, not Sonarr's. Sonarr should do things right, because otherwise it is far more complicated to implement an indexer that needs the removed parts.

I think there are two harmless solutions to this problem that would not affect any already implemented indexer:

  • For titles that get sanitized, run two queries, one with the original title and one with the sanitized title (Disadvantage: lower performance).
  • Add a per-indexer toggle to enable/disable Sonarr's sanitization (Disadvantage: harder setup and interface overload).
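
The first option can be sketched roughly as follows. Everything here is illustrative: `search_both`, `strip_apostrophes`, and the fake indexer are made-up names, and the real search call would be an HTTP request rather than a dictionary lookup.

```python
from typing import Callable, Iterable


def search_both(title: str,
                sanitize: Callable[[str], str],
                search: Callable[[str], Iterable[str]]) -> list[str]:
    """Query the original and the sanitized title, merging results in order."""
    # dict.fromkeys drops the second query when sanitizing changes nothing,
    # so titles without symbols still cost a single request.
    queries = list(dict.fromkeys([title, sanitize(title)]))
    merged: dict[str, None] = {}
    for query in queries:
        for release in search(query):
            merged.setdefault(release, None)  # de-duplicate, keep first-seen order
    return list(merged)


# Stand-in for an indexer that only matches the apostrophe form.
FAKE_INDEX = {"Handmaid's Tale": ["The Handmaid's Tale S02E05 720p"]}
requests_made: list[str] = []


def fake_search(query: str) -> list[str]:
    requests_made.append(query)
    return FAKE_INDEX.get(query, [])


def strip_apostrophes(title: str) -> str:
    return title.replace("'", "")


print(search_both("Handmaid's Tale", strip_apostrophes, fake_search))
print(requests_made)  # both variants were requested
```

The extra request is only issued when the sanitized title differs from the original, which limits the performance cost to titles that actually contain symbols.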

The indexers have protocols to follow. Higher-level indexers are not going to rename releases if a group uses a certain title. It's not "sanitization" if the indexer adds extra special characters to a title. The Scene will not suddenly start allowing characters other than A-Z, a-z, 0-9, periods, and dashes as per the rules, and other release groups generally follow these standards as well, though with some flexibility.

Sonarr needs to perform queries with special characters removed in order to capture a full set of results. It will then notice items that would have an apostrophe or other special symbols (ampersands) removed from the title. It will only have a slightly lower speed (not performance) for titles with symbols, which are not too common, and it is far better than not having the results at all from the query in the first place. Instead of making one API request, it would be making two for titles with symbols, which isn't a big deal.

The majority of indexers with newznab use sphinx indexing and are usually configured to strip special characters like that during the indexing process. It would be nice if they did the same for the api query, but they often don't.
It isn't realistic to demand that those indexers fix it when we're talking about 90%+ of our supported indexers; the world simply doesn't work that way. Otherwise the same could be said for 99% of the indexers in Jackett, since they have neither an api nor proper indexing capabilities.
Furthermore, most indexers are actually _true_ indexers in the sense that they accept tvdbid/tvmazeid query parameters. The q= parameter is a keyword search, not a title search, and is purely intended as a fallback in case the id-based search fails.
Regarding 'Expanse', we can leave out 'the' because we'd be getting both results on any decent indexer (in fact, most indexers will happily return all the relevant results based on tvdbid).

It doesn't make sense to demand Sonarr queries by unmodified titles simply because you desire it so and break it for all the sphinx indexers in the process.
To be frank, I much prefer to support indexers that have a proper api instead of breaking them in order to support a site that does not have an api and actually has a cryptominer on their site. I mean, wtf...

Currently there is no way for newznab/torznab indexers to convey their keyword format in the t=caps capabilities, otherwise we could use that, so it's not possible to implement different behavior depending on the site. At least, not at this time.
For Mejor, an alternative would be to index all the series titles in jackett in a cache (~26 site queries). That would arguably make the whole thing faster and more reliable in the process.

I didn't say break anything. None of my proposals will break even one indexer currently working.

Maybe since Jackett is a "known good" indexer, we could create a new indexer "type" (vs. "Torznab") for it that would support passing escaped queries?

Would having an option on specific indexers to not replace special characters (or just apostrophes) be a possibility?

Found this issue today when adding Jojo's Bizarre Adventure.
Removing the ' breaks both RSS and search on nyaa.si

What's being done about this?

I really like @curiositycasualty's approach. Maybe add a new category for Jackett-based indexers where you can toggle whether you want to search both cases (special characters omitted and 'full' search).

Hi. I would also love to have the option to disable clean titles for searches.

So, since this was opened 3 years ago and I'm running into this problem - I may work on a patch and then do a PR.

@lps-rocks 👍 Regarding implementation, the idea was to move SearchCriteriaBase.QueryTitles/GetQueryTitle to the *RequestGenerator instances, so there's more per-indexer control possible, but that means introducing a helper class since there are multiple RequestGenerators that will need the same functions.
Please note that there should _not_ be an extra configuration option for this, it has to deal with the situation automatically. For newznab/torznab it'd suffice to return both strings.

So in your first statement you want per-indexer control to be possible, but in the second - you don't want to expose this control in an advanced feature to the user? I'm confused as to why?

The only other way it could handle this automatically would be to execute two separate search queries against the Torznab/Newznab provider. I think that's a waste of resources unless absolutely necessary, since this is an edge case that affects a small number of releases on a small number of index sources. That being said, an advanced option would be the best approach IMHO. It would make Sonarr execute a second query for the indexer with the special characters intact.

An option (advanced or otherwise) inevitably means that the user is able to misconfigure it. And in this case it's easily overlooked because it only becomes apparent when the user notices releases missing from the manual results. So it's something to be avoided.
With per-indexer control I meant in the code, so that we can override the query title logic for specific indexers as required. Possibly even based on the _reported_ capabilities of specific indexers.

For normal Newznab indexers we use tvdbid for regular shows, so only anime and/or Jackett needs special title queries. (We also fall back to a title search if tvdbid searches yield absolutely zero results, but that's not common either.)
Anyway, doing a query for both variations of such names should not increase the load on the indexer in any significant manner. And has the advantage of 'simply working' regardless of user configuration.
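
A rough sketch of that id-first flow, with the title queries only used as a fallback. The names `search_episode` and `fake_search` and the tvdbid value are placeholders; the parameter names t, tvdbid, season, ep and q are the ones that appear in this thread.

```python
from typing import Callable

Search = Callable[[dict], list]


def search_episode(search: Search, tvdbid: int, titles: list[str],
                   season: int, episode: int) -> list:
    """Try an id-based search first; fall back to keyword searches per title."""
    base = {"t": "tvsearch", "season": season, "ep": episode}
    # Preferred path: identify the series by id, no title formatting involved.
    results = list(search({**base, "tvdbid": tvdbid}))
    if results:
        return results
    # Fallback: keyword search, once per candidate title.
    for title in titles:
        for item in search({**base, "q": title}):
            if item not in results:
                results.append(item)
    return results


def fake_search(params: dict) -> list:
    # Stand-in indexer that ignores ids and only matches the apostrophe form.
    if params.get("q") == "Handmaid's Tale":
        return ["The Handmaid's Tale S02E05"]
    return []


# 12345 is a placeholder id, not the real tvdbid of the series.
print(search_episode(fake_search, 12345, ["Handmaids Tale", "Handmaid's Tale"], 2, 5))
```

In this sketch the second title variant only costs a request when the id-based search has already come back empty.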

I disagree. I know I'm not the only one constantly getting rate limited on indexers when I add full seasons of various series due to the excessive amount of requests. Double the requests and you just add to that problem. An option with default setting on or off depending on indexer would be a much better option.

An option (advanced or otherwise) inevitably means that the user is able to misconfigure it. And in this case it's easily overlooked because it only becomes apparent when the user notices releases missing from the manual results. So it's something to be avoided. With per-indexer control I meant in the code, so that we can override the query title logic for specific indexers as required.

A hard coded list of sites / indexers in the code will require unnecessary administrative overhead on the code base.

Trying to idiot proof the program by removing user choice is infuriating to me and if I do write a patch it will include a toggle option and possibly an option to specify what characters can remain. Some trackers allow a few but not all special characters in a search.

I've been on the opposing end no less than a few dozen times and I hate when a developer tries to single handedly be smarter than the user. If it's an advanced option it's on the user to mess with it. If the user misconfigures it, that's their problem. With adequate documentation and UI design the user should be able to figure out exactly what it is they're tinkering with.

Anyway, doing a query for both variations of such names should not increase the load on the indexer in any significant manner. And has the advantage of 'simply working' regardless of user configuration.

I disagree. It's avoidable and for the same reasoning above, it should be up to the user since it's their account on the indexer the additional load will show up under.

@lps-rocks I think you did not understand what @Taloth tried to explain.

There's just no place where a user-editable setting would make sense. The indexers are already hardcoded in Sonarr. The only option you have to add other indexers is via custom newznab or custom torznab.

If someone has a newznab or torznab api, they can provide the capabilities of their api to Sonarr. Then Sonarr can adjust its queries accordingly.
So where exactly is the user supposed to meddle with this?

Once such capabilities are in Sonarr, it simply needs to be added to the few indexers that are integrated. And since Jackett is just another torznab indexer, if the user sets up custom indexers through Jackett, they can just set the api capabilities from there.

It would need to be an option under the custom torznab feed in Sonarr. Sonarr is the responsible party that's modifying the name to be 'safe'.

Newznab indexers mostly use sphinx as search engine, and thus the query titles were formatted for that.

Back in the day we proposed and succeeded in getting the supportedParams attribute added to the newznab-specification caps response, allowing indexers to specify which query parameters they supported. This was first introduced in torznab specifically for Jackett. And after tvrage disappeared, it was successfully proposed to the actual newznab specification and their codebase. It was an improvement because it allowed the indexers to _specify_ what they supported and clients to act accordingly. All without requiring the user to fiddle with configuration.
This is no different. I can imagine that a 'searchFormat' parameter could be added, defaulting to 'sphinx', which Jackett for example could set to 'raw' to prevent Sonarr from doing any cleanup whatsoever.
For non newznab/torznab indexers we'll need to find out what cleanup is required.
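
As an illustration of how a client might read this, the sketch below parses a caps response and extracts supportedParams together with the proposed searchFormat. Note that searchFormat does not exist in any specification; it is only the idea floated above, and placing it on the tv-search element is an assumption made for this example. The sample XML and the function name `tv_search_caps` are likewise made up.

```python
import xml.etree.ElementTree as ET

# Sample caps response; the searchFormat attribute is hypothetical.
CAPS_XML = """<caps>
  <searching>
    <search available="yes" supportedParams="q"/>
    <tv-search available="yes" supportedParams="q,season,ep,tvdbid" searchFormat="raw"/>
  </searching>
</caps>"""


def tv_search_caps(caps_xml: str) -> tuple[set[str], str]:
    """Return (supported query parameters, keyword format) for tv-search."""
    node = ET.fromstring(caps_xml).find("searching/tv-search")
    if node is None:
        return set(), "sphinx"
    raw_params = node.get("supportedParams", "q")
    params = {p.strip() for p in raw_params.split(",") if p.strip()}
    # Indexers that do not announce a format keep today's behaviour.
    search_format = node.get("searchFormat", "sphinx")
    return params, search_format


print(tv_search_caps(CAPS_XML))
```

Defaulting to 'sphinx' when the attribute is absent means existing indexers would see no change in the queries they receive.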

So the decision logic that determines which QueryTitle cleanups are needed has to be moved to RequestGenerator so that the indexer capabilities can be taken into account.
Then likely two new QueryTitle cleanup formats need to be added: a 'raw' format that does no cleanup, and an 'unknown' format, which queries for multiple formats.
Then the Jackett devs need to be contacted to see if they're interested in collaborating for a change in the torznab capabilities to include the appropriate value, so Sonarr can adjust accordingly. Given our history together I don't expect that to be an issue at all. However, if they use the 'raw' format, then they will need to do the entire cleanup themselves, which is likely the best approach given the variation of indexers they support.
For the nyaa.se indexer in Sonarr, we will need to try a few formats and see what format is required for their full text search. Useful here would be to come up with a test set of titles that usually go wrong.
It's also possible that sphinx already needs multiple titles to be queried, but I guess we'll only find that out by trying those titles.
AnimeTosho and NyaaPantsu are probably the interesting ones because they are not proxies like Jackett, but do not use the newznab codebase either. So we need to find out what formats apply to them. If we can come up with a sensible caps attribute then I again would expect them to be amenable to adding it to their site.

I discussed this with markus and he also does not want to add a user setting for this behavior. The correct format should be automatically determined, but if that is not possible or inconclusive then both titles should be queried instead.
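
The decision described here could look roughly like the following. The format names 'raw', 'sphinx' and 'unknown' come from the discussion above; the function `query_titles` and the toy cleanup passed to it are hypothetical stand-ins, not the planned Sonarr implementation.

```python
from typing import Callable


def query_titles(title: str, search_format: str,
                 sphinx_clean: Callable[[str], str]) -> list[str]:
    """Pick the title(s) to query based on the indexer's keyword format."""
    if search_format == "raw":
        return [title]                  # indexer does its own cleanup
    if search_format == "sphinx":
        return [sphinx_clean(title)]    # current behaviour
    # 'unknown' or unrecognised: query both, skipping an identical duplicate.
    return list(dict.fromkeys([sphinx_clean(title), title]))


def toy_clean(title: str) -> str:
    # Toy cleanup for the demo only: drop apostrophes.
    return title.replace("'", "")


for fmt in ("raw", "sphinx", "unknown"):
    print(fmt, query_titles("JoJo's Bizarre Adventure", fmt, toy_clean))
```

No user setting is involved: the format is an input determined per indexer, and the double query only happens when the format cannot be established.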

Feels like a short-term fix is to query both the sanitized and non-sanitized titles, and that makes a lot of sense. I also agree that it would be worth asking the Jackett devs to support a "sanitized" field. They could have that stored in the Jackett DB, which would solve the problem for all feeds everywhere and reduce the need to double-query from Sonarr.

This change would solve/fix a lot of manual matches/searches that I have to do.
