Julia: add `Base.isunique`

Created on 8 Apr 2016 · 55Comments · Source: JuliaLang/julia

A common idiom for testing if values in a collection C are unique is

length(unique(C))==length(C)

or variations, eg when the length n is known,

length(unique(C))==n

These show up in various tests, assertions, inner constructors. However, if the goal is only to test for uniqueness, not to obtain a list of unique elements, then something like

function isunique(C)
    seen = Set{eltype(C)}()
    for x in C
        if in(x, seen)
            return false
        else
            push!(seen, x)
        end
    end
    true
end

is much more efficient: it can terminate early and avoids the construction of the intermediate array. Also, programmer's the intent is more clear.

I am of course reluctant to suggest adding functions to Base, but it would good to have unique and isunique in the same module.

Source

tpapp

All 55 comments

I agree that this is a nice thing to have, but I'm maybe we can find a
better name? It's not really about whether the argument is unique, but
whether it has unique elements. I feel that this could lead to some
confusion.

I guess that you would also want to add a method that always returns true
for sets, and maybe for key iterators for dicts.

toivoh on 8 Apr 2016

Another name I was thinking of is allunique, but maybe someone has a better suggestion.

And of course there should be specialized methods when they make sense (eg also for subtypes of Range).

If you agree on a name, I would be happy to submit a PR.

tpapp on 8 Apr 2016

alldistinct?

StefanKarpinski on 8 Apr 2016

+1 for alldistinct or isdistinct.

To bolster @tpapp's point, I did a quick search of registered Julia packages, and found at least 10 packages which use this idiom.

A couple of additional suggestions:

Might be nice to have an isdistinct(collections...) which loops over multiple collections, so that you don't need to concatenate them into a single collection first just to check uniqueness. Though technically this is redundant with isdistinct(flatten([collections....])), given flatten from #14805.
I found a number of packages with checks like length(unique(C)) >= M or length(unique(C)) == N for some small constant N, e.g. length(unique(C)) == 1 checks. Obviously, this can be done much more efficiently just by adding length(seen) >= M && return true to something like your implementation above. Maybe that could be combined into the same function via a keyword argument isdistinct(collection, numdistinct=N, mindistinct=M)

stevengj on 8 Apr 2016

+1 This introduces simplification wlog and improves 'at a glance' understanding.
alldifferent

JeffreySarnoff on 9 Apr 2016

Is this going to cause confusion with functions (that don't currently exist, but might one day) to check if elements of an array alias each other? unique strikes me as referring to values but distinct sounds more like identity to me.

JaredCrean2 on 9 Apr 2016

@stevengj: For the second use case (which actually uses the value of length(unique(C))) I find the keyword argument solution confusing. One could introduce countdistinct, or even better, a function distinctset, which is like unique but returns a set instead of a vector. Indeed, if unique did not need to maintain order and could return a set, the two functions would be the same.

However, in the PR for this issue I don't want to complicate things, so I will just do a PR for alldistinct. The other function can be added later on if there is a demonstrated need.

Regarding the first use case: I really like the idiom with flatten in #14805. Given that it will be available, I don't think it is necessary to provide alldistinct(collections...), even though alldistinct(C, s::Set=Set()) is tempting because I could just chain them together.

tpapp on 9 Apr 2016

+1 for alldistinct, and for keeping it simple.

toivoh on 9 Apr 2016

@tpapp, for the second use case, the point is you _don't_ need to know the value of length(unique(C)). You only need to know a lower bound on the value. In most cases this will allow you to exit the loop early, before examining all the elements.

stevengj on 9 Apr 2016

Sorry for restarting the bikeshedding so late, but what's the reason to prefer alldistinct to allunique? Saying "distinct" instead of "unique" sounds like it relies on a different concept, which isn't the case.

nalimilan on 23 Apr 2016

👍1

Scott proposed another possible name: hasduplicates.

nalimilan on 24 Apr 2016

@nalimilan: I would be fine with either as I see no _a priori_ reason for picking one over the other (I guess this happens with names frequently); however, alldistinct is supported by multiple people above so I did the PR that way. If there is a compelling reason I can redo it, but if not, I would prefer to have #15914 merged so that the issue can be closed.

tpapp on 24 Apr 2016

Choice of terms is indeed in large part arbitrary. But once a term is chosen, it's important IMHO to be consistent across the API. In the present case, I thought allunique would be better for consistency with unique. For example, ?unique would list both functions. (Again, sorry for the annoyance.)

@StefanKarpinski @stevengj Any reason why you preferred alldistinct rather than allunique?

nalimilan on 24 Apr 2016

FWIW, I like this argument for consistency.

On Sunday, April 24, 2016, Milan Bouchet-Valat [email protected]
wrote:

Choice of terms is indeed in large part arbitrary. But once a term is
chosen, it's important IMHO to be consistent across the API. In the present
case, I thought allunique would be better for consistency with unique.
For example, ?unique would list both functions. (Again, sorry for the
annoyance.)

@StefanKarpinski https://github.com/StefanKarpinski @stevengj
https://github.com/stevengj Any reason why you preferred alldistinct
rather than allunique?

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
https://github.com/JuliaLang/julia/issues/15803#issuecomment-213969515

kmsquire on 25 Apr 2016

I don't think that saying the elements are "all unique" is correct. Each element is in itself unique, but the fact that they are all individually unique doesn't mean anything – that's true of any collection. When you say unique(itr) you're asking for its unique values. It is meaningful to say that all of the values are distinct, on the other hand.

StefanKarpinski on 25 Apr 2016

To argue against myself, the term "unique" means "unlike any other" which in this case could implicitly be taken to mean "within this collection" – and if each value is unlike any other in the collection then that is what we want. I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

StefanKarpinski on 25 Apr 2016

Indeed no element is "unique" individually, but that also applies to "distinct". I really don't see the difference. :-)

That's why I think we should keep the "unique" terminology.

nalimilan on 25 Apr 2016

👍1

Yeah, I'm ok with allunique then.

StefanKarpinski on 25 Apr 2016

I see the difference. A is distinct from B iff there is a distinction that distinguishes A from B (and this is the sense that alldistinct carries). A is unique iff there is exactly one occurrence of A (singleton types are allunique).

jeffreysarnoff-dev on 25 Apr 2016

It seems that the only thing that's preventing a merge is a decision on naming. I have a weak preference for alldistinct for reasons mentioned by @StefanKarpinski and @jeffreysarnoff-dev, but given a signal from a Julia language member that this is all that is preventing the merge and a decision for an alternative name, I would be happy to update the PR to any other name.

While I understand that various arguments can be made for various names, IMO the benefit of having a function for this common idiom is the most important thing.

tpapp on 27 Apr 2016

I was under the impression that I won the fight for allunique since nobody appears to be in opposition to it. :-) So why not update the PR to use that name? Then the merge won't need to wait.

nalimilan on 27 Apr 2016

👎1 👍1

+1 on allunique

StefanKarpinski on 27 Apr 2016

~~I also +1 on allunique~~ (See my comment, below.)

Tetralux on 27 Apr 2016

Thanks for the quick replies, made the change in the PR.

tpapp on 27 Apr 2016

👍1

Closed by https://github.com/JuliaLang/julia/commit/2cc803bd832eb51e22fb0cf843d3ef1bda921064

StefanKarpinski on 27 Apr 2016

I missed the change in opinion here, but allunique to me has the connotation of all(unique(foo)) for collections of collections which this is not doing

tkelman on 28 Apr 2016

Given that all(unique(x)) and all(x) do the same thing this confusion seems unlikely.

StefanKarpinski on 28 Apr 2016

Not if x is a collection of collections. Elsewhere when we concatenate names like maxabs2 it's because combining the composition into a single pass is more efficient. This isn't consistent with that precedent.

tkelman on 28 Apr 2016

I actually liked the hasduplicates suggestion, and that has no chance for confusion since there will likely never be a one argument has

tkelman on 28 Apr 2016

👍1

On reflection, I think I may have been a little hasty with my +1...

I think that @StefanKarpinski's concession...

I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

...was a mistake.

Distinct-ness has a clear linguistic meaning, whereas unique is a little ambiguous in this case; _"[They are] readily distinguishable"_ vs _"the only one of its kind"._

I do see the merit in @tkelman's thought, though. I doubt there would be any confusion with that one.

Based on which, alldistinct or hasduplicates would both be better suited to this.

It is worth noting however, that if we were to call this function alldistinct, it might not make a lot of sense to have unique; it might want a rename to distinct. Something to think about.

Tetralux on 28 Apr 2016

👍1

If unique were renamed to distinct then alldistinct would have the same might-imply-composition downside as allunique has now. We might want to use the name distinct for something later, and you could argue that our use of unique is a little different than Matlab's and coreutils uniq since we don't output sorted results (or require sorted inputs).

tkelman on 28 Apr 2016

Would anyone object to renaming to hasduplicates? If I don't hear anything I'll open a PR, though we should leave it open a little longer this time to get more opinions.

tkelman on 30 Apr 2016

Late to the party here but hasduplicates is in my view a better name that is easier to mentally parse.

KristofferC on 30 Apr 2016

I don't see the problem with allunique.

stevengj on 30 Apr 2016

It's unrelated to all, that name sounds like a composition to me but it's really a different collection property concept. Trying to see what's overall least objectionable.

tkelman on 30 Apr 2016

The term duplicates is used e.g. in DataFrames.jl, but nowhere in Base, so I still prefer allunique. I don't think there's any possible confusion with all(unique, x). We don't ship any function starting with all which would be a shortcut for all(f, x).

nalimilan on 30 Apr 2016

I was actually quite surprised that the convention was so uniform, we have nothing else with an all prefix and only haskey with a has prefix, every other exported property checking function starts with is.

What is duplicates used for in DataFrames? It seems more self explanatory than unique which you'd have to be familiar with from another language to recognize what that concept is referring to.

tkelman on 30 Apr 2016

duplicates finds rows that would violate a unique key constraint if all of the columns of the DataFrame were a unique key.

Perhaps the names at https://stat.ethz.ch/R-manual/R-devel/library/base/html/duplicated.html are better: duplicated and anyduplicated.

johnmyleswhite on 30 Apr 2016

Judging by

anyDuplicated(.) is a “generalized” more efficient shortcut for any(duplicated(.)).

I'm guessing you mean "better" for DataFrames' usage rather than the function being discussed in this issue?

tkelman on 30 Apr 2016

I was actually suggesting anyduplicated as the name being bikeshedded here.

johnmyleswhite on 30 Apr 2016

That would indeed make sense if Base had a duplicated function (which would return true for elements which are equal to at least one element preceding them in the collection). Doesn't sound like an absurd addition to Base.

nalimilan on 30 Apr 2016

Wouldn't the more general version of this be a function which given a vector returns an Int vector of the same length where each value is the number of times that element has been seen at that point? I'm not entirely convinced that such a specific thing belongs in the standard library, however – it's pretty easy to write this function yourself, after all.

StefanKarpinski on 30 Apr 2016

Ah. I didn't really follow what "violate a unique key constraint ..." meant.

The R precedent is good enough for me, final up/down on anyduplicated? #16152

tkelman on 3 May 2016

Honestly, I think I'm going to think allunique and then remember, "oh wait, it's not called that, it's called something weird... oh yeah, anyduplicated... and then I need to negate it too."

StefanKarpinski on 3 May 2016

👍1

I'm also just not concerned about ambiguity between allunique(x) and all(unique(x)) because the latter is a silly thing to do.

StefanKarpinski on 3 May 2016

What does the "all" mean in the name then?

tkelman on 3 May 2016

Given that we have unique(a), I think that allunique(a) is easier to remember and is reasonably unambiguous. English is not a very precise language, but I think "allunique = "all elements are unique" = "all elements are distinct" = "no elements are duplicated"` should be clear to most people.

stevengj on 3 May 2016

(If you google "all unique" elements you'll find that this usage is pretty commonplace.)

stevengj on 3 May 2016

How about allsame(). Doesn't it have the same argument for inclusion?

ymer on 5 Jan 2017

@ymer That's easy to write efficiently as all(x->x == X[1], X).

nalimilan on 5 Jan 2017

function allsame(X)
    isempty(X) && return true
    X1 = first(X)
    return all(x -> isequal(x, X1), X)
end

would be more correct (doesn't assume X is indexable), and uses isequal like allunique. The fact that it is easy to get this wrong is an argument for inclusion.

On the other hand Python doesn't seem to have an all_same function (though of the Python posted solutions on stackoverflow are buggy for the same reasons as above), nor does Ruby, nor does Haskell. If other languages don't include an allsame function in their standard libraries, that's an argument against including it in Julia.

stevengj on 5 Jan 2017

But those languages don't have an allunique function either. So it seems to me that the argument for inclusion of allsame (or allidentical maybe) still stands, given that allunique is included. It's also assymmetrical to have one function and not the other, even if the implementation is easier for one of them.

ymer on 5 Jan 2017

I also prefer alldistinct to allunique semantically. (And prefer both to anyduplicated.) That's what I would say in normal speech "they are all distinct (from each other)". But I see the point that unique is already a function. Interestingly the Iterators package has a distinct function.

ymer on 5 Jan 2017

Maybe there is a common pattern here:

function allsatisfy(pred, itr)    # is there a nicer way to code this?
    state = start(itr)
    done(itr, state) && return true
    (prev_elt, state) = next(itr, state)
    while !done(itr, state)
        (elt, state) = next(itr, state)
        if pred(prev_elt, elt)
            prev_elt = elt
        else
            return false
        end
    end
    true
end

allsatisfy(<, 1:10)             # true
allsatisfy(>, 1:10)             # false
allsatisfy(==, 1:10)            # false
allsatisfy(<, [])               # true
allsatisfy(<, [1])              # true
allsatisfy(==, fill(1,10))      # true

Caveats:

not tested extensively,
I am terrible at naming things,
perhaps this should go in a separate issue? it has little to do with allunique IMO.

tpapp on 5 Jan 2017

I am not attending these:

| same | duplicate |
|---------|--------------|
| more often we apply our computational | too two-y |
| skill to ascertain and utilize nonsameness | less flexible |

:calendar:

| candidates | the contraquality | usefulness with its dual |
|-------------------------|----------------------------|:------:|
| distinct | indistict, absent discriminative factor | 2
| unique | alike, with discernable commonality | 3

Conceptually, anyunique is more crisp andclearly stated than anydistinct.
That is how I arrive here to suggest we use allunique.

jeffreysarnoff-dev on 22 Feb 2017

Was this page helpful?

0 / 5 - 0 ratings