DataFrames.jl: How to get unique rows from a DataFrame based on a specific column

Created on 20 Sep 2013 · 23 Comments · Source: JuliaData/DataFrames.jl

It would be great to extend unique, or to add a uniquerows function, so that one can create a DataFrame from an existing one that keeps only unique rows, with uniqueness based on the values in a particular column. For example,

unique(df, 5)

would give a new DataFrame based on the unique values in column 5.

feature

ViralBShah

All 23 comments

Part of this functionality already exists, but needs to be renamed: it is currently called drop_duplicates!. Being able to select rows is a nice extension that shouldn't be too hard.

I am planning to do a massive slew of renaming in this package next week.

johnmyleswhite on 20 Sep 2013

Thanks @johnmyleswhite

ViralBShah on 21 Sep 2013

Is this issue still live? I would love to see this functionality. Here is my use case (copied from closed issue #908).
I very often need to extract the values of one or more columns from a DataFrame, keeping one row per distinct value of one or more other columns. So if I have a DataFrame X with:

| site | foo | bar |
|------|-----|-----|
| "a"  |  1  |  1  |
| "a"  |  1  |  2  |
| "b"  |  2  |  3  |

In this case, the bar values are redundant, and each site value is always associated with the same foo value. I want a DataFrame that matches site to foo values and discards bar values.
In R I would do:

ret <- X[!duplicated(X$site), c("site", "foo")]

which would give

| site | foo |
|------|-----|
| "a"  |  1  |
| "b"  |  2  |

But in Julia, the nonunique(df) function only accepts a full DataFrame, leading to ugly code like

ret = X[!nonunique(DataFrame(tmp = X[:site])), [:site, :foo]]
# or prettier, as pointed out by @alyst :
ret = X[!nonunique(X[:, [:site]]), [:site, :foo]]

It would be great to have a second optional argument to unique() and nonunique() that lets me specify the columns to compare.

mkborregaard on 5 Feb 2016

👍1
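For readers on a current release, the use case above can be sketched as follows. This assumes DataFrames.jl 1.x, where unique and nonunique accept a column selector as the second argument and negating a Boolean vector requires the broadcast form `.!`; it is an illustration, not code from the thread.

```julia
using DataFrames

# Example data from the comment above
X = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

# Keep the first row for each distinct `site`, then keep only two columns
ret = select(unique(X, :site), [:site, :foo])

# Same result, built from the Boolean mask returned by `nonunique`
ret2 = X[.!nonunique(X, :site), [:site, :foo]]
```

Both forms keep the first occurrence of each site value, which matches what the R expression with duplicated does.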

It shouldn't be hard to implement if you want to give it a try. Have a look at nonunique and unique! in abstractdataframe.jl. Passing a vector of symbols for column names, and using that to subset the DataFrameRow before comparing should work.

nalimilan on 5 Feb 2016

Constructing a DataFrame that has a subset of columns should be inexpensive, so it could be as simple as nonunique(df, colnames) = nonunique(df[colnames]).

alyst on 5 Feb 2016

> Constructing a DataFrame that has a subset of columns should be inexpensive, so it could be as simple as nonunique(df, colnames) = nonunique(df[colnames]).

Well, the problem with that solution is that one would lose all columns which were not retained for the comparison.

nalimilan on 5 Feb 2016

@nalimilan nonunique() just returns the row indices, so it should be fine.

alyst on 5 Feb 2016

Ah, right. So that should be easy.

nalimilan on 5 Feb 2016
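The point settled in the exchange above can be sketched as follows: nonunique yields one Boolean per row (true for a row that repeats an earlier one), so a mask computed from a subset of columns can still index the full table. The helper name nonunique_by is hypothetical, and the indexing syntax assumes DataFrames.jl 1.x rather than the 2016 version under discussion.

```julia
using DataFrames

# Hypothetical helper, named here only for illustration; it mirrors the
# one-liner proposed above and is not the code merged into DataFrames.jl.
nonunique_by(df, cols) = nonunique(df[:, cols])

df = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

# One Bool per row of df; true marks a row whose `site` was already seen
mask = nonunique_by(df, [:site])

# The mask came from one column but selects rows of the full table,
# so `foo` and `bar` are still present in the result.
firsts = df[.!mask, :]
```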

Happy to do the PR, but I am uncertain about the preferred style. I would prefer to have methods for a second argument of at least four types: Symbol, Vector{Symbol}, Int, and Vector{Int}.

mkborregaard on 5 Feb 2016

You can leave out type annotations: if the type of the new argument is not supported, an error will be raised at that point, and it should be clear enough for the caller.

nalimilan on 5 Feb 2016

IIRC the guidelines recommend using ::Any type qualifier in that case.

alyst on 5 Feb 2016
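The two styles under discussion can be combined, as in this minimal sketch: a generic method that accepts any column selector, plus a small method that wraps a single Symbol or integer in a vector. The function name first_rows_by is hypothetical and the code assumes DataFrames.jl 1.x indexing; it is not the implementation from the pull request.

```julia
using DataFrames

# Hypothetical function name, used only to illustrate the dispatch pattern.
# Generic method: `cols` is passed straight to column indexing, so an
# unsupported type raises an error there.
first_rows_by(df::AbstractDataFrame, cols::Any) = df[.!nonunique(df[:, cols]), :]

# Convenience method: a single column name or position is wrapped in a vector
first_rows_by(df::AbstractDataFrame, col::Union{Symbol, Integer}) = first_rows_by(df, [col])

df = DataFrame(a = [1, 2, 1], b = [10, 20, 30])

by_name  = first_rows_by(df, :a)
by_index = first_rows_by(df, 1)
by_many  = first_rows_by(df, [:a, :b])
```

Because the Union method is more specific than the ::Any method, Julia picks it for a lone Symbol or integer and falls back to the generic method for everything else.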

I have written the code, but github tells me I do not have permissions to do a PR on DataFrames?
I ended up implementing it with an ::Any qualifier, but also with methods for a single symbol or integer, so that @ViralBShah's example unique(df, 5) would also work. I also implemented this in unique() and unique!().

mkborregaard on 6 Feb 2016

@mkborregaard GitHub or git? I don't see any fork of DataFrames.jl on your account, which is the first step before doing a PR. At least, if you can point us to the branch on your fork, we can have a look.

nalimilan on 6 Feb 2016

Sorry, I did something wrong. Here it is: #909.

mkborregaard on 6 Feb 2016

Fixed the issues as requested and wrote some tests.

mkborregaard on 8 Feb 2016

Fixed in https://github.com/JuliaStats/DataFrames.jl/pull/909

cjprybol on 18 Aug 2017

❤2

How does one select unique rows?

ranjan1608 on 24 Aug 2017

unique

mkborregaard on 24 Aug 2017

df.colname.unique()

AmithGowda04 on 16 Dec 2019

😕1

I think you meant unique(df, colname)?

bkamins on 16 Dec 2019

No, I meant the dataset name followed by the name of the column from which you want to get the unique values:
datasetname.colname.unique()
Try it out and let me know.

AmithGowda04 on 16 Dec 2019

This is Julia; what you propose would work in a language like Python. The way I have shown is how you should do it in Julia, e.g.:

julia> df = DataFrame(a=[1,2,3,1,2,3], b=6:-1:1)
6×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │
│ 4   │ 1     │ 3     │
│ 5   │ 2     │ 2     │
│ 6   │ 3     │ 1     │

julia> unique(df, :a)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │

bkamins on 16 Dec 2019

😄1

Oh okay, sorry about that.

AmithGowda04 on 16 Dec 2019
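The two operations mixed up in the last few comments can be placed side by side. This is a minimal sketch assuming DataFrames.jl 1.x and the example table from the comment above: unique applied to a column returns the distinct values, while unique applied to the table with a column selector returns whole rows.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 1, 2, 3], b = 6:-1:1)

# Distinct values of one column (a plain Vector)
vals = unique(df.a)

# Whole rows, keeping the first row for each distinct value of `a`
rows = unique(df, :a)
```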
