Apps-android-commons: Make category search non case-sensitive

Created on 17 Oct 2019 · 48 Comments · Source: commons-app/apps-android-commons

Received a report from a 2.11 user on our FB page that category search is case-sensitive for her, which means that sometimes she'll type the right category in the search field but nothing will show up.

AFAIK the MW API that we use is inherently case-sensitive, but the upload wizard seems to be able to find a way around that and produces the same category suggestions regardless of case.

categorization enhancement

All 48 comments

Maybe if we convert everything to lowercase, the server will perform a case-insensitive search? That's just a hypothesis; I have not tried it.

@nicolas-raoul Possible! We'll try it out with a direct query first.

Can I take this issue?

@ankit-kumar-dwivedi please feel free!

@misaochan Is this issue free to be worked on? If so, can I take it?

Hey! Yes, sure, you should start working on it, as I'm not working on it right now.

Thank you:)

I'm re-opening this as I believe there's a problem with how this issue was fixed.

@kbhardwaj123 Can you clarify a doubt that I have regarding your PR #3326? In the description you say:

> Tested from the MW api fuzzy search url that the category suggestions would deliver the desired results no matter what case you sent in the api call so to fix the issue api call has been converted to lower case.

Are you sure the API really doesn't care about the case of the category name given to it? I'm doubtful about that for several reasons. Here are a couple:

  1. The logical one: If the API really doesn't care about the case of the search text sent to it, this issue shouldn't exist to begin with. Right? IOW, if the API is returning us all categories that match a search text despite the case in which we send the query, then there's no point in just lower-casing the search text we send to the API. Got my point? But the mere existence of this issue indicates otherwise. Correct me if I'm missing something.
  2. The practical one: I just checked with a couple of API calls and I get different results based on the case of the search text I send to the query. Here are a couple of queries which return different results despite only the case of the search text differing:

In case you're wondering why the test case didn't fail, here's the catch:

> The page title is case-sensitive except the first character.

From Manual:Page title - MediaWiki

I think the quote speaks for itself. I'll share the actual problem w.r.t. the app in the next comment.
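
To make the quoted rule concrete, here is a minimal Python sketch of how two titles compare under it. It assumes the default behaviour in which only the first character is capitalised automatically, and it ignores every other kind of normalisation (underscores, namespaces). `normalize_title` and `same_title` are hypothetical helper names, not part of any API.

```python
def normalize_title(title: str) -> str:
    """Upper-case only the first character; leave the rest untouched."""
    return title[:1].upper() + title[1:]


def same_title(a: str, b: str) -> bool:
    """True when two titles name the same page under the first-letter rule."""
    return normalize_title(a) == normalize_title(b)


print(same_title("testCat", "TestCat"))  # True: only the first character differs in case
print(same_title("testcat", "TestCat"))  # False: a later character differs in case
```

This is why a test that only varies the first letter can pass even though the rest of the title is still compared case-sensitively.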

@sivaraam While working on this I went with what @nicolas-raoul suggested, so I ensured that all the category strings passed to the OkHttpClient are converted to lower case. I wrote new tests for that and they passed, but I guess I must have missed something. I will take a look at it again.

OK. Here's the issue with respect to the app: category search doesn't return any categories whose prefix has an upper-case character in it (other than the first one, of course). See #3582 for proof.

In case the issue is not clear to you from #3582, here's another example.

Here's what I get when I search for categories with "COVID" (mind the case) in the app (version: 2.12.3.629~a63a358):
(Screenshot: Screenshot_2020-03-28-21-27-39)

Now, consider the linked example query, which returns 25 categories that have "COVID" as their prefix. Here are the categories that the query returns:

Category:COVID-19 guidelines in Brazil
Category:COVID-19 guidelines in Argentina
Category:COVID-19 guidelines in Albania
Category:COVID-19
Category:COVID-19 guidelines by country
Category:COVID-19 guidelines in Czechia
Category:COVID-19 guidelines
Category:COVID-19 guidelines in Denmark
Category:COVID-19 guidelines in Esperanto
Category:COVID-19 Clinical Cohort Research Conference, March 18, 2019, National Medical Center, Republic of Korea
Category:COVID-19 coronavirus
Category:COVID-19 guidelines by language
Category:COVID-19 guidelines in Arabic
Category:COVID-19 guidelines in English
Category:COVID-19 guidelines in Basque
Category:COVID-19 guidelines in China
Category:COVID-19 guidelines in Estonian
Category:COVID-19 guidelines in Bengali
Category:COVID-19 guidelines in Bangladesh
Category:COVID-19 guideline cartoons by Anika Nawar Eeha and Abdullah Al Mamun in Bengali
Category:COVID-19 guidelines in Bengali by Anika Nawar Eeha and Abdullah Al Mamun
Category:COVID-19 guidelines in Catalan
Category:COVID-19 guidelines in East Timor
Category:COVID-19 DIY
Category:COVID-19 guidelines in Breton

As you can see, none of the above categories are shown in the category suggestions.
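
A rough way to see why lower-casing the query hides these categories: the sketch below simulates a case-sensitive prefix match locally (it does not call the API). The title list is a small subset of the results above, `prefix_match` is a hypothetical helper, and the first-letter capitalisation is an assumption that mirrors the page-title rule quoted earlier in the thread.

```python
# A few of the category names listed above (without the "Category:" prefix).
TITLES = [
    "COVID-19",
    "COVID-19 guidelines",
    "COVID-19 DIY",
]


def prefix_match(titles, query):
    """Case-sensitive prefix match, except that the first character is capitalised."""
    prefix = query[:1].upper() + query[1:]
    return [t for t in titles if t.startswith(prefix)]


print(prefix_match(TITLES, "COVID"))          # all three titles
print(prefix_match(TITLES, "COVID".lower()))  # []: "Covid" is not a prefix of "COVID-19"
```

So lower-casing the user's text before a case-sensitive prefix search can only remove matches; it never adds any.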

@misaochan Given that we've now accidentally reduced the category search space rather than increasing it, you might want to ensure we fix this before releasing the next version.

Added to the release list, thanks for the heads up!

Hi @kbhardwaj123 , are you currently still working on this? Please do keep us updated, thanks!

@misaochan sure I'm on it, will update ASAP

Thanks @kbhardwaj123 ! As we are planning to include this in v2.13, when you submit your PR could you please rebase and submit it on the 2.13-release branch?

I investigated the problem and here are my findings.

  • Thanks @sivaraam for such in-depth details. You are right, the API is indeed case-sensitive.
  • We have two search options at our disposal: we can search by prefix, or we can use the normal search (an in-between search, which is case-insensitive).

Suppose I want to find the category Temple of Ishtar at Mari by entering temple of ishtar. These are the results:

  • using generator=allcategories in the URL: API result, which apparently is a prefix search and is case-sensitive
  • using generator=search in the URL: API result, which is case-insensitive and gives us the required Temple of Ishtar at Mari category.

On reading the logs I realized that the method searchAll() in CategoriesModel was calling only the prefix search, and that is where the problem lies. When I fix that by calling both the prefix and the search API and combining the results, we finally get a case-insensitive search.
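
For reference, the two kinds of query compared above can be built as follows. This is only a sketch: the original links did not survive in this copy of the thread, so the URLs are reconstructions of the same kind of query rather than the exact ones used. `gacprefix`/`gaclimit` are the allcategories parameters with the generator prefix, and the `generator=search` parameters follow the query URLs quoted elsewhere in this thread. The function names are hypothetical.

```python
from urllib.parse import urlencode, quote

API = "https://commons.wikimedia.org/w/api.php"
COMMON = {"action": "query", "format": "json", "formatversion": 2}


def prefix_url(text, limit=25):
    """generator=allcategories: prefix match on category names (case-sensitive)."""
    params = {**COMMON, "generator": "allcategories",
              "gacprefix": text, "gaclimit": limit}
    return API + "?" + urlencode(params, quote_via=quote)


def search_url(text, limit=25):
    """generator=search restricted to the Category namespace (14)."""
    params = {**COMMON, "generator": "search", "gsrnamespace": 14,
              "gsrsearch": text, "gsrlimit": limit, "gsroffset": 0}
    return API + "?" + urlencode(params, quote_via=quote)


print(prefix_url("temple of ishtar"))
print(search_url("temple of ishtar"))
```

No request is sent here; the functions only produce the URLs so they can be pasted into a browser for comparison.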

But there's a catch
We are using the beta flavor of the APIs which give the following results

  • prefix result: API Link, which doesn't give the required category (this was expected)
  • search result: API result. Here we expected Temple of Ishtar at Mari, right? But it seems the beta flavor of the API is unable to give the required result.

Possible Solution
AFAIK there are two ways:

  • (efficient) Use the production version of the Commons API, which is capable of delivering case-insensitive results by itself
  • Or make multiple API calls to the existing API by manipulating the search string

@misaochan @sivaraam @nicolas-raoul @maskaravivek I need your opinions on my investigation in order to fix this for v2.13. In particular, are we going to use the production flavor of the APIs in v2.13?

@kbhardwaj123 Thanks for the analysis. I'll look into it and share my comments soon. I have a quick doubt about one particular thing:

> We are using the beta flavor of the APIs which give the following results

What do you mean by beta flavor of API? Do you mean the API hosted in the beta server (https://commons.wikimedia.beta.wmflabs.org/w/api.php) as opposed to the production server (https://commons.wikimedia.org/w/api.php)?

For category search (and really any testing that does not involve actually uploading), please use the prodDebug flavor of the app. The beta server is unusable for most testing.

@sivaraam Yes, that's exactly what I meant: the API hosted on the beta server https://commons.wikimedia.beta.wmflabs.org/w/api.php is unable to give the required result, whereas the production API https://commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment.
@nicolas-raoul Does the prodDebug flavor use the https://commons.wikimedia.org/w/api.php API?

@kbhardwaj123 Yes, prod* flavors use the production APIs, for instance https://commons.wikimedia.org/w/api.php . Sorry that our beta servers are not representative of production :'-(
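
The flavor-to-server relationship described above can be summarised in a few lines. This is an illustration only, not how the app is actually configured: the two URLs come from this thread, `api_endpoint` is a hypothetical helper, and `betaDebug` is an assumed flavor name (only prodDebug and "prod*" are named here).

```python
PROD_API = "https://commons.wikimedia.org/w/api.php"
BETA_API = "https://commons.wikimedia.beta.wmflabs.org/w/api.php"


def api_endpoint(flavor: str) -> str:
    """prod* flavors use production; anything else is treated as beta in this sketch."""
    return PROD_API if flavor.startswith("prod") else BETA_API


print(api_endpoint("prodDebug"))
print(api_endpoint("betaDebug"))  # "betaDebug" is an assumed flavor name
```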

@nicolas-raoul sure then the problem is solved already, I will create the pull request

> @sivaraam Yes, that's exactly what I meant

Thanks for the clarification.

> ... the API hosted on the beta server commons.wikimedia.beta.wmflabs.org/w/api.php is unable to give the required result, whereas the production API commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment.

Let me now clarify something. The API hosted in the beta servers and the production servers _would not_ differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server _only because_ not all the categories present in the prod server _are present_ in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki

> sure then the problem is solved already, I will create the pull request

Can you explain how you're going to fix this? I'm asking to ensure everyone's on the same page. Also, I would suggest that you not rush this, because making the category search case-insensitive seems to be a lot more complicated than it looks. It's better to know our options and choose the most appropriate one. If the release is coming up soon, we can always just revert the changes done in PR #3326 (which we would have to do anyway) and move on with the release. We can then make the change afterwards. @misaochan can comment better about the deadline.

> You see different results when you use the beta server only because not all the categories present in the prod server are present in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way.

OK. Here's proof that the beta server behaves just the same way as the production server.

https://commons.wikimedia.beta.wmflabs.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=testcat&gsrlimit=25&gsroffset=0

This returns the Category:TestCat despite the search term being testcat. So, the beta server's generator=search is case-insensitive too.
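
For anyone checking such a response by script rather than by eye, here is a small sketch that pulls the category titles out of it. The sample JSON is hand-written to resemble a `formatversion=2` generator response (where `query.pages` is a list); the `pageid` and `index` values are made up, and `category_titles` is a hypothetical helper.

```python
import json

# Hand-written sample shaped like a formatversion=2 generator response.
# The pageid and index values are invented for illustration.
SAMPLE = """
{"batchcomplete": true,
 "query": {"pages": [
   {"pageid": 1, "ns": 14, "title": "Category:TestCat", "index": 1}
 ]}}
"""


def category_titles(response_text):
    """Return titles of pages in the Category namespace (14); [] when nothing matched."""
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", [])
    return [p["title"] for p in pages if p.get("ns") == 14]


print(category_titles(SAMPLE))  # ['Category:TestCat']
```

A response with no matches has no `query` key at all, which is why the lookup falls back to an empty list.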

> Let me now clarify something. The API hosted in the beta servers and the production servers would not differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server only because not all the categories present in the prod server are present in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki

@sivaraam Initially what I meant was that the beta servers don't have all the categories (that both APIs work the same way was clear from their documentation). This is what I wanted to show:
Suppose I want the category Temple of Ishtar at Mari by typing only temple of ishtar.
If the prodDebug APIs are used, they give what one expects:

  • with generator=allcategories, the result doesn't include the required category because it is a prefix-only search

  • with generator=search, the result includes the required category thanks to its case insensitivity

But the beta servers have a problem: they contain the category Temple of Ishtar at Mari when using generator=allcategories (see here), but generator=search is incapable of returning the category when provided with temple of ishtar (see here).

In short, the beta server's case-sensitive API (generator=allcategories) delivers a category which the case-insensitive API is not able to return, and **no such problem exists with the prodDebug APIs**.

How I intend to solve this: the searchAll method is the one at fault here, as it only calls the prefixSearch API for searching categories. If we also make a call using generator=search and combine the prefix and normal search results, our problem would be solved.
And yes, we need to use the prodDebug APIs because of the point I just mentioned above.

@sivaraam Yes, I agree that the beta ones are case-insensitive, but they don't seem to return Category:Temple of Ishtar at Mari (using generator=search), while the case-sensitive beta API (generator=allcategories) returns the category (see result).

So I implemented the solution with the beta server's APIs, and this is how it looks, with screenshots.

Using the category suggested by @sivaraam, Category:TestCat:
(Screenshot: Screenshot_20200402_135943_fr free nrw commons beta)

Now with Category:Temple of Ishtar at Mari (here I am showing that it exists on the beta server):
(Screenshot: Screenshot_20200402_140016_fr free nrw commons beta)

But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive
(Screenshot: Screenshot_20200402_140022_fr free nrw commons beta)

And as soon as I change the flavor of the APIs to prodDebug, all these problems disappear.

@kbhardwaj123 Thanks for your explanations. I see your problem now.

> But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive

It's prudent to explore more before coming to conclusions. AFAIK, you can't just make some categories case sensitive and others case insensitive. It doesn't even make any sense, does it? Anyways, I'll try to clarify what's going on here. Here's the description of the search API from API:Search - MediaWiki [emphasis mine]:

> GET request to search for a title or text in a wiki.

Just assume for now that the search API does not search the titles; I'll come back to the reason for this assumption later. Note that the search API looks for the text in the wiki pages. So, any query you send to generator=search looks for the search text in the contents of the wiki page (the category pages are the wiki pages, in our case). So, the results you get in the beta and production servers depend not just on the presence of the categories but also on what content is present in the category pages. Let's take your case of the "Category:Temple of Ishtar at Mari".

  • Category:Temple of Ishtar at Mari - Wikimedia Commons
    This is the category page in the production server. As you can see, the category page has content such as info boxes which contain your search text "temple of Ishtar". This could have been the reason for that category to be included in the results of your API query to the production server: query link.

    Also, the fact that the search API searches the content is very clear from the results of the above query which include categories such as Category:Astarte (goddess), Category:Passing lion Babylon (Louvre, AO21118)

  • Category:Temple of Ishtar at Mari - Wikimedia Commons BETA
    This is the category page in the beta server. Don't get tricked on seeing the 'Media in category "Temple of Ishtar at Mari"' and presume it's text present in that category page. It's a standard section shown for all category pages showcasing the pictures that belong to that category. The category page itself doesn't exist yet. So, no wonder the search query you sent to the beta cluster didn't return this category as a result: query link.

I'm not very sure about how/why a page is included in the result, as the algorithm seems to be more involved. Relevant quote from the "Additional notes" section in the API:Search page:

> Depending on which search backend is in use, how srsearch is interpreted may vary. On Wikimedia wikis which use CirrusSearch, see Help:CirrusSearch for information about the search syntax.

Coming to why I assume the search API doesn't search the title, try the following query:

https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=temple%20of%20ishtar&gsrlimit=25&gsroffset=0&gsrwhat=title

I've just added gsrwhat=title to the query, which tells it to search just the title. As you can see, it clearly says: "title" search is disabled. Thus my assumption. See also: https://stackoverflow.com/q/14337219/5614968

I hope I've clarified your confusion about beta server not returning the results you expect, now. Let me know if I have not.

To conclude, the search API does more than what's needed (a category title search) and particularly doesn't seem to be searching the title at all. I don't think that would be a good choice. So, as I mentioned earlier we'll have to explore the proper way to achieve a case insensitive search. Here are a couple of related API pages:

Also, I believe we could ask the wikitech-l mailing list about this.

@sivaraam Thanks for such a comprehensive explanation :).
I agree with you that the search generator could be overkill: as you pointed out, searching temple of ishtar returns some completely unrelated categories because they contain that term in their wikitext body. So what I am thinking is that, in the Stack Overflow question you mentioned, one person gave a workaround using intitle, as in:
srsearch=intitle:temple%20of%20ishtar
This could solve our issue and return only those categories with the required search term.
Kindly give your opinions on this.

@sivaraam I tried it and it returns exactly what we want. Check out the following link:
https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=intitle:temple%20of%20ishtar&gsrlimit=25&gsroffset=0
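
The link above can be generated for any search text with a few lines of Python. This is a minimal sketch: `intitle_url` is a hypothetical helper, and the parameters are exactly those in the URL above, with the user's text placed after the `intitle:` keyword.

```python
from urllib.parse import urlencode, quote

API = "https://commons.wikimedia.org/w/api.php"


def intitle_url(text, limit=25, offset=0):
    """Build a generator=search query limited to category titles via intitle:."""
    params = {
        "action": "query", "format": "json", "formatversion": 2,
        "generator": "search", "gsrnamespace": 14,
        "gsrsearch": "intitle:" + text,
        "gsrlimit": limit, "gsroffset": offset,
    }
    # safe=":" keeps the colon of "intitle:" unescaped, as in the URL above.
    return API + "?" + urlencode(params, safe=":", quote_via=quote)


print(intitle_url("temple of ishtar"))
```

With the default arguments this reproduces the URL above character for character.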

I feel that there's a tradeoff. On one hand, if we search only the title of the category, it gives quite relevant results but might eliminate some (though less relevant) possibly better-suited categories. On the other hand, searching the content may also suggest some completely irrelevant results, as you pointed out:

> Also, the fact that the search API searches the content is very clear from the results of the above query which include categories such as Category:Astarte (goddess), Category:Passing lion Babylon (Louvre, AO21118)

I need opinions on: If the category search should be restricted to title (of its wiki) only

The search URL above looks better than what we currently have indeed, but still not perfect, I think Commons has a better one for us.

Users will type and should see results appear as they type.
For instance, let's say I take a picture of a supermarket in Tokyo. I start typing "supermarkets in to"

How about using the API that sits behind that website search box? Is there any reason why it is not good enough?

> I need opinions on: If the category search should be restricted to title (of its wiki) only

In a word: yes. It's best to keep the search title only to ensure that the results are predictable and straightforward. Also, searching more than just the title is out of scope for this issue which is about making category search case insensitive. We can discuss enhancing the category search separately and focus on just making the category title search case insensitive for now.

> How about using the API that sits behind that website search box?

Good idea. We would have to find how it works.

> Is there any reason why it is not good enough?

I think we can answer this only after knowing how that works :)

@sivaraam @nicolas-raoul so i will make it title only search and create a separate issue for improving our category search functionality in favour of something similar to that of website search

> so i will make it title only search ...

If you think of using the intitle: part of the search API, then here's a problem I noticed with that. It only seems to return categories for which a category page exists (just as expected). Examples:

This might not be a problem if we use the results of the generator=search API as a supplement to the results from the allcategories API (which does not have such a problem). But I just wonder if there is a better way to properly achieve this case insensitive category title search. That's why I was suggesting that we ask the wikitech-l mailing list. We could get a reliable answer of how to go about doing this.
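
The "supplement" idea above amounts to an order-preserving merge with de-duplication. A minimal sketch, assuming both calls have already been reduced to lists of title strings; `merge_suggestions` and the placeholder titles are hypothetical.

```python
def merge_suggestions(prefix_results, search_results):
    """Keep prefix matches first, then append search matches not already present."""
    seen = set()
    merged = []
    for title in list(prefix_results) + list(search_results):
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged


# Placeholder titles, only to show the ordering and de-duplication.
print(merge_suggestions(["Category:A", "Category:B"],
                        ["Category:B", "Category:C"]))
# ['Category:A', 'Category:B', 'Category:C']
```

Putting the prefix results first means the existing behaviour is unchanged whenever the prefix search already finds what the user typed.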

@sivaraam sure in that case I agree that we should ask on the mail list, I will hold my PR on this issue

Any luck with the mailing list? We are holding 2.13 for this at the moment. :)

> Any luck with the mailing list?

Apologies. I never got around to sending the e-mail to the mailing list. Got hung up with other things. I'll try to send it by tomorrow if no one else beats me to it :)

> We are holding 2.13 for this at the moment. :)

You don't have to hold it anymore ;). I've created #3636 that reverts the changes done in the PR #3326. You can merge that and move on with the release. We can handle the case insensitivity in the next release :)

@sivaraam I agree this would be our best option for now, and I am really keen to see what the mailing list suggests regarding this issue :)

I did some searching on Phabricator and came to know that case-insensitive category title search is a long-standing feature request that is yet to be addressed [[ref 1](https://phabricator.wikimedia.org/T59302)] [[ref 2](https://phabricator.wikimedia.org/T187342)]. The linked comment is a nice TL;DR of the status quo.

It seems we really can't use search API for the reason outlined in the comment I referred to previously and another comment in the same ticket which I'm quoting here:

> > Is it not possible to use the article search engine with (invisible) category: prefix instead ?
>
> Wouldn't that search for pages in the category namespace, rather than actual categories? Some categories don't have associated pages, and you can create pages in the category namespace for non-existent categories.

That's right. To add to that, here's another reason why the search API is not a proper fit. There's something called hidden categories [[ref 1](https://commons.wikimedia.org/wiki/Commons:Categories#Categories_marked_with_%22HIDDENCAT%22)] [[ref 2](https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories)] in MediaWiki (the wiki engine behind Commons). My understanding of them is that these hidden categories aren't meant to be added by users directly. An example of such a hidden category in Commons is Category:Uses of Wikidata Infobox - Wikimedia Commons. There's a way to identify such hidden categories using the allcategories API while the search API doesn't have such an option. [side note: we should think about filtering away hidden categories before showing category suggestions. That's for another issue though :)]
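
If hidden categories are filtered later, the client-side part could look roughly like this. The sketch assumes each result has already been reduced to a dict with a `title` and a boolean `hidden` flag; those field names, the `visible_categories` helper and the second sample entry are placeholders, and how the flag is obtained from the API is deliberately left out.

```python
def visible_categories(categories):
    """Drop entries flagged as hidden; entries without a flag are kept."""
    return [c["title"] for c in categories if not c.get("hidden", False)]


# Illustrative data: the first title is the hidden category named above,
# the second is a made-up placeholder.
SAMPLE = [
    {"title": "Category:Uses of Wikidata Infobox", "hidden": True},
    {"title": "Category:Example", "hidden": False},
]

print(visible_categories(SAMPLE))  # ['Category:Example']
```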

Despite all this, I sent an e-mail to the mailing list just to confirm if my understanding is correct.

To conclude, it seems we really can't provide a case-insensitive category title search for now :(

What we could do is mention our category title search use case on the following Phabricator ticket, to clarify that category search is not as "niche" a feature as they think it is.
https://phabricator.wikimedia.org/T187342
That might help us get an API that we can use soon.

@sivaraam thanks for such elaborate insights :). I have a doubt regarding the linked comment which you mentioned, it says:

> Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases

Is the "prefix index" referred to here the allcategories API's prefix search? If so, how is it case-insensitive in most cases? I am a little confused by this.

> You don't have to hold it anymore ;). I've created #3636 that reverts the changes done in the PR #3326.

Awesome, thank you!

> @sivaraam thanks for such elaborate insights :). I have a doubt regarding the linked comment which you mentioned, it says:
>
> > Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases

I would quote that in full, as it makes sense only when it is complete. Here it is for the sake of discussion:

> Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases. On Wikimedia wikis this comes from CirrusSearch. On other wikis (and on WMF until recently) this was provided by the TitleKey extension. The search feature has a namespace filter as well. Which would allow us to do case-insensitive search of page titles in the Category namespace.

Read that fully before reading further.

> Is the "prefix index" referred to here the allcategories API's prefix search? If so, how is it case-insensitive in most cases? I am a little confused by this.

I'm reasonably confident that the comment refers either to API:Prefixsearch or to some other API. It definitely is not referring to the allcategories API, as it mentions a namespace filter, which allcategories doesn't have (and doesn't have any need for).
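
For anyone who wants to try API:Prefixsearch against the Category namespace, a query could be built like this. It is a sketch only: `prefixsearch_url` is a hypothetical helper, `pssearch`, `psnamespace` and `pslimit` are the parameters of `list=prefixsearch`, and whether this is the API the quoted comment means, or whether its results suit the app, is not verified here.

```python
from urllib.parse import urlencode, quote

API = "https://commons.wikimedia.org/w/api.php"


def prefixsearch_url(text, limit=25):
    """Build a list=prefixsearch query restricted to the Category namespace (14)."""
    params = {"action": "query", "format": "json", "formatversion": 2,
              "list": "prefixsearch", "pssearch": text,
              "psnamespace": 14, "pslimit": limit}
    return API + "?" + urlencode(params, quote_via=quote)


print(prefixsearch_url("temple of ishtar"))
```

Note that this searches page titles in the Category namespace, so the earlier caveat about categories without a category page may apply here as well.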

Hope that clarifies your doubt.

> Despite all this, I sent an e-mail to the mailing list just to confirm if my understanding is correct.

And here's our confirmation of my observation:

https://lists.wikimedia.org/pipermail/wikitech-l/2020-April/093295.html

Re-opening as this issue seems to have been closed by mistake.

@sivaraam This has been fixed via #3913. Does that not fix the issue for you?

#3913 brings back the old case-sensitive behaviour. This issue is about making category search case-_insensitive_ which is still an open question, to my understanding.
