Guava: Add containsIgnoreCase to StringUtils

Created on 11 Dec 2017 · 10 Comments · Source: google/guava

I regularly use StringUtils.isEmptyOrNull, which inspired the idea for a "containsIgnoreCase" function in StringUtils. It is a function I find myself writing over and over again because it is so useful.

It compares two strings with the "contains" operation, but converts both strings to lowercase before doing so (hence the "ignoreCase" aspect of it).
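A minimal sketch of the behavior described above, using only the JDK. The class name NaiveContainsIgnoreCase is hypothetical and not part of Guava, and the use of Locale.ROOT is an assumption made here so that the result does not depend on the JVM's default locale:

```java
import java.util.Locale;

final class NaiveContainsIgnoreCase {
  private NaiveContainsIgnoreCase() {}

  /** Returns true if needle occurs in haystack once both are lowercased. */
  static boolean containsIgnoreCase(String haystack, String needle) {
    // Assumption: Locale.ROOT, so the lowercasing is locale-independent.
    return haystack.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(containsIgnoreCase("Guava StringUtils", "STRING")); // true
    System.out.println(containsIgnoreCase("Guava StringUtils", "ascii"));  // false
  }
}
```

Note that this allocates two temporary strings per call, which is part of what a dedicated library method could avoid.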

The main use case is search filters. When a search string (from a search field) has to be compared against items, this function is particularly useful. Case rarely matters there, and partial matches are often desired as well, which is why contains is the right operation.
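To illustrate the search-filter use case, here is a rough sketch of filtering a list of items by a query typed into a search field. SearchFilterSketch and its filter method are hypothetical names, and lowercasing with Locale.ROOT is an assumption:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class SearchFilterSketch {
  private SearchFilterSketch() {}

  /** Keeps the items that contain the query, ignoring case. */
  static List<String> filter(List<String> items, String query) {
    // Lowercase the query once instead of once per item.
    String loweredQuery = query.toLowerCase(Locale.ROOT);
    return items.stream()
        .filter(item -> item.toLowerCase(Locale.ROOT).contains(loweredQuery))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> items = Arrays.asList("Apple Pie", "pineapple", "Banana");
    System.out.println(filter(items, "APPLE")); // [Apple Pie, pineapple]
  }
}
```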

Upon acceptance I would happily implement this and submit a pull request.

P3 package=base status=triaged type=addition

martinfrancois

👍2

Most helpful comment

For string comparisons, we've taken the position that there are 2 reasonable ways to do them:

handle ASCII only
use ICU4J

Guava doesn't depend on ICU4J, so we'd be looking at the ASCII approach here. That's fine; I just want to define the scope.

We do already have the Ascii class. It contains equalsIgnoreCase but not containsIgnoreCase.

With containsIgnoreCase, unlike equalsIgnoreCase, parameter order matters. I think that people would naturally expect (haystack, needle), but it's a small potential problem. [Edited to add: Another case where we see something like this is Iterables.contains. But there the compiler can usually confirm that you got the types right. So maybe that method is best viewed as further evidence that (haystack, needle) is clearly correct.]
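For illustration, an ASCII-only containsIgnoreCase with the (haystack, needle) parameter order could look roughly like the sketch below. This is not the code from Guava or from any pull request; AsciiContainsSketch and its helper methods are hypothetical names, and only the letters A-Z are case-folded:

```java
final class AsciiContainsSketch {
  private AsciiContainsSketch() {}

  /** Lowercases A-Z only; every other char is returned unchanged. */
  private static char asciiLower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
  }

  /** Returns true if needle matches haystack at the given start offset. */
  private static boolean matchesAt(CharSequence haystack, int start, CharSequence needle) {
    for (int i = 0; i < needle.length(); i++) {
      if (asciiLower(haystack.charAt(start + i)) != asciiLower(needle.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  /** Parameter order is (haystack, needle): does haystack contain needle? */
  static boolean containsIgnoreCase(CharSequence haystack, CharSequence needle) {
    int lastStart = haystack.length() - needle.length();
    for (int start = 0; start <= lastStart; start++) {
      if (matchesAt(haystack, start, needle)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(containsIgnoreCase("Hello World", "WORLD")); // true
    System.out.println(containsIgnoreCase("WORLD", "Hello World")); // false: order matters
  }
}
```

Swapping the arguments compiles without complaint because both are strings, which is exactly the parameter-order risk mentioned above.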

How often are these needed?

toLowerCase(...).contains(....toLowerCase()) appears almost twice as often as toLowerCase(...).equals(....toLowerCase()). But Ascii.equalsIgnoreCase is called more than enough to make up the difference -- and in fact we have static analysis internally that suggests Ascii.equalsIgnoreCase in exactly this case :) Still, that's an argument that contains, while probably not more common than equals, is comparably common.
toLowerCase(...).contains("...") is called more than toLowerCase(...).equals("..."). (And the literal ("...") case is similarly common to the non-literal case, for whatever that's worth. Such users don't benefit quite as much from a library, since we're substituting containsIgnoreCase(...) for only toLowerCase().contains(...), not toLowerCase().contains(....toLowerCase()).)

One nice thing about equalsIgnoreCase is that it's more likely to be a building block for higher-level APIs like an Equivalence. But I can imagine using containsIgnoreCase in this way, too, like as part of a Truth Correspondence.

The main things I wonder are:

Is there a clear place to draw the line? startsWithIgnoreCase/endsWithIgnoreCase? (ASCII) compareToIgnoreCase (with Comparator)? indexOfIgnoreCase? replaceIgnoreCase? We could add them all if we wanted, but the value probably falls off quickly after startsWith/endsWith... and even those look to be used mostly with literals, where the method doesn't help as much. Then again, probably some of the contains hits I found above should really be using startsWith/endsWith.
Similar to Strings.isNullOrEmpty, I wonder how many people would be better off just normalizing their input from the beginning. Then normal methods Just Work, so users won't need to remember that certain strings require use of certain special methods.
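The normalize-up-front alternative from the second point can be sketched as follows. NormalizedCatalog and its methods are hypothetical names, and lowercasing with Locale.ROOT stands in for whatever normalization an application actually needs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NormalizedCatalog {
  private final List<String> names = new ArrayList<>();

  /** Normalizes once, at the boundary where input enters the system. */
  static String normalize(String raw) {
    return raw.toLowerCase(Locale.ROOT);
  }

  void add(String rawName) {
    names.add(normalize(rawName));
  }

  /** Plain String.contains suffices because both sides are already normalized. */
  boolean anyContains(String rawQuery) {
    String query = normalize(rawQuery);
    return names.stream().anyMatch(name -> name.contains(query));
  }

  public static void main(String[] args) {
    NormalizedCatalog catalog = new NormalizedCatalog();
    catalog.add("Guava Cache");
    System.out.println(catalog.anyContains("CACHE")); // true
  }
}
```

The trade-off is that the original casing is lost unless the raw value is stored alongside the normalized one.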

For my reference, searches:

toLowerCase[^{}|&;]*equals[^{}|&;]*toLowerCase pcre:yes lang:java case:yes
toLowerCase[^{}|&;]*contains[^{}|&;]*toLowerCase pcre:yes lang:java case:yes
toLowerCase[^{}|&;]*equals[(]\s*\042 pcre:yes lang:java case:yes
toLowerCase[^{}|&;]*contains[(]\s*\042 pcre:yes lang:java case:yes

cpovirk on 11 Dec 2017

👍3

All 10 comments

The OP mentions "search" as a use case; does the ASCII-only approach suffice in that case? Depending on the application it may be more correct to go for a locale-sensitive/ICU4J-approach.
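A small example of why locale matters for a lowercase-then-contains search: under Turkish casing rules, uppercase I lowercases to dotless i (U+0131), so a match that succeeds with locale-independent lowercasing fails. LocaleLowercaseDemo and its lower helper are hypothetical names used only for this illustration:

```java
import java.util.Locale;

final class LocaleLowercaseDemo {
  private LocaleLowercaseDemo() {}

  /** Thin wrapper around String.toLowerCase(Locale) for illustration. */
  static String lower(String s, Locale locale) {
    return s.toLowerCase(locale);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr");
    String title = "TITLE";

    // Locale-independent lowercasing: "title"
    System.out.println(lower(title, Locale.ROOT).contains("title")); // true

    // Turkish rules: 'I' becomes dotless i (U+0131), giving "t\u0131tle"
    System.out.println(lower(title, turkish).contains("title")); // false
  }
}
```

Neither result is wrong in itself; which one is correct depends on the language of the data being searched.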

Stephan202 on 11 Dec 2017

Hey @cpovirk, thanks for your quick and very thorough answer :)

"With containsIgnoreCase, unlike equalsIgnoreCase, parameter order matters. I think that people would naturally expect (haystack, needle), but it's a small potential problem."

I agree, that's also something I faced. But as you said, this problem exists with the standard implementation of contains as well, even though it would be nice if there was a way to circumvent these issues.

It seems common enough that I think it would justify the implementation.

"But I can imagine using containsIgnoreCase in this way, too, like as part of a Truth Correspondence."

Nice idea! Haven't thought of this use case, but sounds good!

"Is there a clear place to draw the line?"

Good question. I think the value falls off sharply starting with indexOfIgnoreCase, especially since I imagine (...).toLowerCase().indexOf(...) would be used more often than (...).toLowerCase().indexOf((...).toLowerCase()), which would reduce the value further. As you mentioned above, "Such users don't benefit quite as much from a library".

"Similar to Strings.isNullOrEmpty, I wonder how many people would be better off just normalizing their input from the beginning."

That's also true: normalizing the input beforehand would surely be the better solution here, but in some cases I feel it introduces too much overhead. I also agree that there are cases where this overhead is offset by the computation saved through fewer calls to toLowerCase(). I think adding the method doesn't hurt; at the very least it could improve readability for people who prefer not to normalize their input. Another possibility would be to include it and mention input normalization as a better alternative in the Javadoc comment. This could lead to better code, but I don't know if such a thing would be accepted.

@Stephan202 Good point. I think cases that require a locale-sensitive/ICU4J approach are mostly very specific anyway and justify writing a separate function. Also, as discussed before, I think in such cases it makes even more sense to properly normalize the input, which would not benefit from this function.

martinfrancois on 11 Dec 2017

@martinfrancois are you currently working on this or planning to in the near future? If not, I would very much like to pick the task up. I acknowledge, though, that the task may not have been fully fleshed out yet.

In terms of the implementation, indexOfIgnoreCase could be a good place to start (at least internally) as it creates a nice base for all xxxIgnoreCase methods and would streamline equalsIgnoreCase and containsIgnoreCase to 1 line each.
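A rough sketch of that idea, assuming ASCII-only case folding: a single indexOfIgnoreCase does the scanning, and the other methods become one-liners on top of it. IndexOfBasedSketch and all method names here are hypothetical and do not reflect Guava's actual implementation of Ascii.equalsIgnoreCase:

```java
final class IndexOfBasedSketch {
  private IndexOfBasedSketch() {}

  /** Lowercases A-Z only; every other char is returned unchanged. */
  private static char asciiLower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
  }

  /** Returns the first index at which needle occurs in haystack, or -1. */
  static int indexOfIgnoreCase(CharSequence haystack, CharSequence needle) {
    int lastStart = haystack.length() - needle.length();
    for (int start = 0; start <= lastStart; start++) {
      int i = 0;
      while (i < needle.length()
          && asciiLower(haystack.charAt(start + i)) == asciiLower(needle.charAt(i))) {
        i++;
      }
      if (i == needle.length()) {
        return start;
      }
    }
    return -1;
  }

  static boolean containsIgnoreCase(CharSequence haystack, CharSequence needle) {
    return indexOfIgnoreCase(haystack, needle) >= 0;
  }

  static boolean equalsIgnoreCase(CharSequence a, CharSequence b) {
    // With equal lengths, only start offset 0 is ever tried.
    return a.length() == b.length() && indexOfIgnoreCase(a, b) == 0;
  }

  public static void main(String[] args) {
    System.out.println(indexOfIgnoreCase("Hello World", "WORLD")); // 6
    System.out.println(equalsIgnoreCase("abc", "ABC"));            // true
  }
}
```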

@martinfrancois I feel a comment in the Javadoc about input normalization is troubling: it suggests that the method it is attached to is the wrong solution. These xxxIgnoreCase methods favor one-time usage per string; a second call on the same string would almost certainly be less 'optimal' than a normalized-input approach. That said, because there is a use case where usage would be optimal, I think you could justify the note in the Javadoc.

On a side note, @cpovirk, although 9 times out of 10 xxxIgnoreCase methods are not the better solution, I think replaceIgnoreCase is different. Since it is a manipulation function, it has the potential use case of preserving the original formatting of unaltered text, which normalization would destroy.
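To make the formatting point concrete, here is a sketch of a replaceIgnoreCase built on java.util.regex. ReplaceIgnoreCaseSketch is a hypothetical name. The JDK's Pattern.CASE_INSENSITIVE flag, used without UNICODE_CASE, matches case-insensitively for US-ASCII characters only, which lines up with the ASCII scope discussed earlier:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceIgnoreCaseSketch {
  private ReplaceIgnoreCaseSketch() {}

  /** Replaces every case-insensitive occurrence of target, leaving other text untouched. */
  static String replaceIgnoreCase(String text, String target, String replacement) {
    // Pattern.quote treats target as a literal rather than as a regex.
    Pattern pattern = Pattern.compile(Pattern.quote(target), Pattern.CASE_INSENSITIVE);
    // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
    return pattern.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
  }

  public static void main(String[] args) {
    String text = "Keep MY Case, drop foo and FOO";
    System.out.println(replaceIgnoreCase(text, "foo", "bar"));
    // Keep MY Case, drop bar and bar
  }
}
```

Lowercasing the whole text first and then calling String.replace would also turn "Keep MY Case" into "keep my case", which is the loss of formatting described above.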

eganjs on 21 Dec 2017

Hey @jegan, I started working on it but I'm not done yet. However, feel free to review the pull request once I'm done and share your thoughts as well; it may lead to an even better solution overall.

Sounds like a good idea, thanks! I think I'd have to look into the performance differences in this case. I haven't looked into what kind of implementation Java uses for contains, but I'll consider it for sure.

Thanks also for your comment on the Javadoc. That's what I was concerned about as well. Maybe a good compromise would be to include a note saying that code making multiple calls on the same string would benefit from input normalization?

martinfrancois on 24 Dec 2017

@jegan I've now opened a pull request, but I didn't refactor equalsIgnoreCase in it, since the implementation details are different. Feel free, however, to open an issue yourself to propose the change, as I think it would require some discussion to decide whether it's the right thing to do.
Thanks a lot for your suggestion!

martinfrancois on 3 Jan 2018

Implemented in PR #3023

martinfrancois on 29 Jan 2018

Let's keep the issue open until PR #3023 is merged.

avarzar on 17 May 2018

👍2

I think this will be an important update in Java. Hoping it finally gets implemented.

Willz01 on 2 Mar 2020

@Willz01 it's already implemented in PR #3023; however, the Guava team still hasn't gotten around to reviewing it (it has been more than 2 years by now). Let's hope it will get reviewed and merged eventually.

martinfrancois on 2 Mar 2020
