Julia: add `Base.isunique`

Created on 8 Apr 2016  ·  55Comments  ·  Source: JuliaLang/julia

A common idiom for testing if values in a collection C are unique is

length(unique(C))==length(C)

or variations, eg when the length n is known,

length(unique(C))==n

These show up in various tests, assertions, inner constructors. However, if the goal is only to test for uniqueness, not to obtain a list of unique elements, then something like

function isunique(C)
    seen = Set{eltype(C)}()
    for x in C
        if in(x, seen)
            return false
        else
            push!(seen, x)
        end
    end
    true
end

is much more efficient: it can terminate early and avoids the construction of the intermediate array. Also, programmer's the intent is more clear.

I am of course reluctant to suggest adding functions to Base, but it would good to have unique and isunique in the same module.

All 55 comments

I agree that this is a nice thing to have, but I'm maybe we can find a
better name? It's not really about whether the argument is unique, but
whether it has unique elements. I feel that this could lead to some
confusion.

I guess that you would also want to add a method that always returns true
for sets, and maybe for key iterators for dicts.

Another name I was thinking of is allunique, but maybe someone has a better suggestion.

And of course there should be specialized methods when they make sense (eg also for subtypes of Range).

If you agree on a name, I would be happy to submit a PR.

alldistinct?

+1 for alldistinct or isdistinct.

To bolster @tpapp's point, I did a quick search of registered Julia packages, and found at least 10 packages which use this idiom.

A couple of additional suggestions:

  • Might be nice to have an isdistinct(collections...) which loops over multiple collections, so that you don't need to concatenate them into a single collection first just to check uniqueness. Though technically this is redundant with isdistinct(flatten([collections....])), given flatten from #14805.
  • I found a number of packages with checks like length(unique(C)) >= M or length(unique(C)) == N for some small constant N, e.g. length(unique(C)) == 1 checks. Obviously, this can be done much more efficiently just by adding length(seen) >= M && return true to something like your implementation above. Maybe that could be combined into the same function via a keyword argument isdistinct(collection, numdistinct=N, mindistinct=M)

+1 This introduces simplification wlog and improves 'at a glance' understanding.
alldifferent

Is this going to cause confusion with functions (that don't currently exist, but might one day) to check if elements of an array alias each other? unique strikes me as referring to values but distinct sounds more like identity to me.

@stevengj: For the second use case (which actually uses the value of length(unique(C))) I find the keyword argument solution confusing. One could introduce countdistinct, or even better, a function distinctset, which is like unique but returns a set instead of a vector. Indeed, if unique did not need to maintain order and could return a set, the two functions would be the same.

However, in the PR for this issue I don't want to complicate things, so I will just do a PR for alldistinct. The other function can be added later on if there is a demonstrated need.

Regarding the first use case: I really like the idiom with flatten in #14805. Given that it will be available, I don't think it is necessary to provide alldistinct(collections...), even though alldistinct(C, s::Set=Set()) is tempting because I could just chain them together.

+1 for alldistinct, and for keeping it simple.

@tpapp, for the second use case, the point is you _don't_ need to know the value of length(unique(C)). You only need to know a lower bound on the value. In most cases this will allow you to exit the loop early, before examining all the elements.

Sorry for restarting the bikeshedding so late, but what's the reason to prefer alldistinct to allunique? Saying "distinct" instead of "unique" sounds like it relies on a different concept, which isn't the case.

Scott proposed another possible name: hasduplicates.

@nalimilan: I would be fine with either as I see no _a priori_ reason for picking one over the other (I guess this happens with names frequently); however, alldistinct is supported by multiple people above so I did the PR that way. If there is a compelling reason I can redo it, but if not, I would prefer to have #15914 merged so that the issue can be closed.

Choice of terms is indeed in large part arbitrary. But once a term is chosen, it's important IMHO to be consistent across the API. In the present case, I thought allunique would be better for consistency with unique. For example, ?unique would list both functions. (Again, sorry for the annoyance.)

@StefanKarpinski @stevengj Any reason why you preferred alldistinct rather than allunique?

FWIW, I like this argument for consistency.

On Sunday, April 24, 2016, Milan Bouchet-Valat [email protected]
wrote:

Choice of terms is indeed in large part arbitrary. But once a term is
chosen, it's important IMHO to be consistent across the API. In the present
case, I thought allunique would be better for consistency with unique.
For example, ?unique would list both functions. (Again, sorry for the
annoyance.)

@StefanKarpinski https://github.com/StefanKarpinski @stevengj
https://github.com/stevengj Any reason why you preferred alldistinct
rather than allunique?


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/JuliaLang/julia/issues/15803#issuecomment-213969515

I don't think that saying the elements are "all unique" is correct. Each element is in itself unique, but the fact that they are all individually unique doesn't mean anything – that's true of any collection. When you say unique(itr) you're asking for its unique values. It is meaningful to say that all of the values are distinct, on the other hand.

To argue against myself, the term "unique" means "unlike any other" which in this case could implicitly be taken to mean "within this collection" – and if each value is unlike any other in the collection then that is what we want. I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

Indeed no element is "unique" individually, but that also applies to "distinct". I really don't see the difference. :-)

That's why I think we should keep the "unique" terminology.

Yeah, I'm ok with allunique then.

I see the difference. A is distinct from B iff there is a distinction that distinguishes A from B (and this is the sense that alldistinct carries). A is unique iff there is exactly one occurrence of A (singleton types are allunique).

It seems that the only thing that's preventing a merge is a decision on naming. I have a weak preference for alldistinct for reasons mentioned by @StefanKarpinski and @jeffreysarnoff-dev, but given a signal from a Julia language member that this is all that is preventing the merge and a decision for an alternative name, I would be happy to update the PR to any other name.

While I understand that various arguments can be made for various names, IMO the benefit of having a function for this common idiom is the most important thing.

I was under the impression that I won the fight for allunique since nobody appears to be in opposition to it. :-) So why not update the PR to use that name? Then the merge won't need to wait.

+1 on allunique

I also +1 on allunique (See my comment, below.)

Thanks for the quick replies, made the change in the PR.

I missed the change in opinion here, but allunique to me has the connotation of all(unique(foo)) for collections of collections which this is not doing

Given that all(unique(x)) and all(x) do the same thing this confusion seems unlikely.

Not if x is a collection of collections. Elsewhere when we concatenate names like maxabs2 it's because combining the composition into a single pass is more efficient. This isn't consistent with that precedent.

I actually liked the hasduplicates suggestion, and that has no chance for confusion since there will likely never be a one argument has

On reflection, I think I may have been a little hasty with my +1...

I think that @StefanKarpinski's concession...

I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

...was a mistake.

Distinct-ness has a clear linguistic meaning, whereas unique is a little ambiguous in this case; _"[They are] readily distinguishable"_ vs _"the only one of its kind"._

I do see the merit in @tkelman's thought, though. I doubt there would be any confusion with that one.

Based on which, alldistinct or hasduplicates would both be better suited to this.

It is worth noting however, that if we were to call this function alldistinct, it might not make a lot of sense to have unique; it might want a rename to distinct. Something to think about.

If unique were renamed to distinct then alldistinct would have the same might-imply-composition downside as allunique has now. We might want to use the name distinct for something later, and you could argue that our use of unique is a little different than Matlab's and coreutils uniq since we don't output sorted results (or require sorted inputs).

Would anyone object to renaming to hasduplicates? If I don't hear anything I'll open a PR, though we should leave it open a little longer this time to get more opinions.

Late to the party here but hasduplicates is in my view a better name that is easier to mentally parse.

I don't see the problem with allunique.

It's unrelated to all, that name sounds like a composition to me but it's really a different collection property concept. Trying to see what's overall least objectionable.

The term duplicates is used e.g. in DataFrames.jl, but nowhere in Base, so I still prefer allunique. I don't think there's any possible confusion with all(unique, x). We don't ship any function starting with all which would be a shortcut for all(f, x).

I was actually quite surprised that the convention was so uniform, we have nothing else with an all prefix and only haskey with a has prefix, every other exported property checking function starts with is.

What is duplicates used for in DataFrames? It seems more self explanatory than unique which you'd have to be familiar with from another language to recognize what that concept is referring to.

duplicates finds rows that would violate a unique key constraint if all of the columns of the DataFrame were a unique key.

Perhaps the names at https://stat.ethz.ch/R-manual/R-devel/library/base/html/duplicated.html are better: duplicated and anyduplicated.

Judging by

anyDuplicated(.) is a “generalized” more efficient shortcut for any(duplicated(.)).

I'm guessing you mean "better" for DataFrames' usage rather than the function being discussed in this issue?

I was actually suggesting anyduplicated as the name being bikeshedded here.

That would indeed make sense if Base had a duplicated function (which would return true for elements which are equal to at least one element preceding them in the collection). Doesn't sound like an absurd addition to Base.

Wouldn't the more general version of this be a function which given a vector returns an Int vector of the same length where each value is the number of times that element has been seen at that point? I'm not entirely convinced that such a specific thing belongs in the standard library, however – it's pretty easy to write this function yourself, after all.

Ah. I didn't really follow what "violate a unique key constraint ..." meant.

The R precedent is good enough for me, final up/down on anyduplicated? #16152

Honestly, I think I'm going to think allunique and then remember, "oh wait, it's not called that, it's called something weird... oh yeah, anyduplicated... and then I need to negate it too."

I'm also just not concerned about ambiguity between allunique(x) and all(unique(x)) because the latter is a silly thing to do.

What does the "all" mean in the name then?

Given that we have unique(a), I think that allunique(a) is easier to remember and is reasonably unambiguous. English is not a very precise language, but I think "allunique = "all elements are unique" = "all elements are distinct" = "no elements are duplicated"` should be clear to most people.

(If you google "all unique" elements you'll find that this usage is pretty commonplace.)

How about allsame(). Doesn't it have the same argument for inclusion?

@ymer That's easy to write efficiently as all(x->x == X[1], X).

function allsame(X)
    isempty(X) && return true
    X1 = first(X)
    return all(x -> isequal(x, X1), X)
end

would be more correct (doesn't assume X is indexable), and uses isequal like allunique. The fact that it is easy to get this wrong is an argument for inclusion.

On the other hand Python doesn't seem to have an all_same function (though of the Python posted solutions on stackoverflow are buggy for the same reasons as above), nor does Ruby, nor does Haskell. If other languages don't include an allsame function in their standard libraries, that's an argument against including it in Julia.

But those languages don't have an allunique function either. So it seems to me that the argument for inclusion of allsame (or allidentical maybe) still stands, given that allunique is included. It's also assymmetrical to have one function and not the other, even if the implementation is easier for one of them.

I also prefer alldistinct to allunique semantically. (And prefer both to anyduplicated.) That's what I would say in normal speech "they are all distinct (from each other)". But I see the point that unique is already a function. Interestingly the Iterators package has a distinct function.

Maybe there is a common pattern here:

function allsatisfy(pred, itr)    # is there a nicer way to code this?
    state = start(itr)
    done(itr, state) && return true
    (prev_elt, state) = next(itr, state)
    while !done(itr, state)
        (elt, state) = next(itr, state)
        if pred(prev_elt, elt)
            prev_elt = elt
        else
            return false
        end
    end
    true
end

allsatisfy(<, 1:10)             # true
allsatisfy(>, 1:10)             # false
allsatisfy(==, 1:10)            # false
allsatisfy(<, [])               # true
allsatisfy(<, [1])              # true
allsatisfy(==, fill(1,10))      # true

Caveats:

  1. not tested extensively,
  2. I am terrible at naming things,
  3. perhaps this should go in a separate issue? it has little to do with allunique IMO.

I am not attending these:

| same | duplicate |
|---------|--------------|
| more often we apply our computational | too two-y |
| skill to ascertain and utilize nonsameness | less flexible |

:calendar:

| candidates | the contraquality | usefulness with its dual |
|-------------------------|----------------------------|:------:|
| distinct | indistict, absent discriminative factor | 2
| unique | alike, with discernable commonality | 3

Conceptually, anyunique is more crisp andclearly stated than anydistinct.
That is how I arrive here to suggest we use allunique.

Was this page helpful?
0 / 5 - 0 ratings