DataFrames.jl: How to get unique rows from a DataFrame based on a specific column

Created on 20 Sep 2013 · 23 Comments · Source: JuliaData/DataFrames.jl

It would be great to extend unique, or to have a uniquerows function, so that one can create a DataFrame from an existing one that keeps only unique rows, where uniqueness can be based on the values in a particular column. For example:

unique(df, 5)

gives a new dataframe based on the unique values in column 5.

feature

All 23 comments

Part of this functionality already exists, but needs to be renamed: it is currently called drop_duplicates!. Being able to select rows is a nice extension that shouldn't be too hard.

I am planning to do a massive slew of renaming in this package next week.

Thanks @johnmyleswhite

Is this issue still live? I would love to see this functionality. Here is my use case (copied from closed issue #908).
I very often need to extract the values of one or more columns from a DataFrame based on whether the value of one or more columns is similar. So if I have a DataFrame X with:

| site | foo | bar |
| "a"  |  1  |  1  |
| "a"  |  1  |  2  |
| "b"  |  2  |  3  |

In this case, the bar values are redundant, and each site value is always associated with the same foo value. I want a DataFrame that matches site to foo values and discards bar values.
In R I would do:

ret <- X[!duplicated(X$site), c("site", "foo")]

which would give

| site | foo | 
| "a"  |  1  |  
| "b"  |  2  |  

But in Julia, the nonunique(df) function only accepts a full DataFrame, leading to ugly code like

ret = X[!nonunique(DataFrame(tmp = X[:site])), [:site, :foo]]
# or prettier, as pointed out by @alyst:
ret = X[!nonunique(X[:, [:site]]), [:site, :foo]]

It would be great to have a second optional argument to unique() and nonunique() that lets me specify the columns to compare.
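
For reference, a minimal sketch of this use case, assuming a recent DataFrames.jl release (1.x) in which nonunique and unique accept a column selector as a second argument. The table X mirrors the example above; the variable names dup, ret and ret2 are only illustrative.

```julia
using DataFrames

# The example table from the comment above.
X = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

# Boolean mask: true for rows whose :site value already appeared in an earlier row.
dup = nonunique(X, :site)

# Keep the first row for each site, and only the :site and :foo columns.
ret = X[.!dup, [:site, :foo]]

# Equivalent: deduplicate on :site first, then pick the columns.
ret2 = select(unique(X, :site), [:site, :foo])
```

Both forms keep the first occurrence of each site value, which matches what the R expression with duplicated() does.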

It shouldn't be hard to implement if you want to give it a try. Have a look at nonunique and unique! in abstractdataframe.jl. Passing a vector of symbols for column names, and using that to subset the DataFrameRow before comparing should work.

Constructing a DataFrame that has a subset of columns should be inexpensive, so it could be as simple as nonunique(df, colnames) = nonunique(df[colnames]).

Well, the problem with that solution is that one would lose all columns which were not retained for the comparison.

@nalimilan nonunique() just returns the row indices, so it should be fine.

Ah, right. So that should be easy.
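
The point of this exchange can be sketched as follows, assuming a recent DataFrames.jl release. The helper name nonunique_by is hypothetical and not part of the package; it is used here only to avoid redefining the library's own nonunique.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): compare rows on `cols` only.
# `copycols=false` reuses the existing column vectors, so building the
# narrower DataFrame does not copy the data.
nonunique_by(df, cols) = nonunique(select(df, cols; copycols=false))

df = DataFrame(a = [1, 2, 3, 1, 2, 3], b = 6:-1:1)

# One Bool per row of df: true where the :a value was seen in an earlier row.
mask = nonunique_by(df, :a)

# The mask indexes the full DataFrame, so every column of df is retained.
deduped = df[.!mask, :]
```

Because the result is a per-row mask rather than a DataFrame, the columns left out of the comparison are not lost.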

Happy to do the PR but uncertain about the preferred style. I would prefer to have methods for a second argument with at least 4 types: Symbol, Vector{Symbol}, Int and Vector{Int}.

You can leave out type annotations: if the type of the new argument is not supported, an error will be raised at that point, and it should be clear enough for the caller.

IIRC the guidelines recommend using ::Any type qualifier in that case.

I have written the code, but GitHub tells me I do not have permission to open a PR on DataFrames?
I ended up implementing it with an ::Any qualifier, but also with methods for a single symbol or integer, so that @ViralBShah's example unique(df, 5) would also work. I also implemented this in unique() and unique!().
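
As a rough illustration of the selector types discussed here, the sketch below assumes a recent DataFrames.jl release, where column selectors may be a Symbol, a vector of Symbols, an integer index or a vector of integers. The behaviour of the code in the PR at the time may have differed.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [1, 1, 3], c = [7, 8, 9])

by_symbol  = unique(df, :a)        # a single column name
by_symbols = unique(df, [:a, :b])  # several column names
by_index   = unique(df, 1)         # a single column index
by_indices = unique(df, [1, 2])    # several column indices

# unique! drops the duplicate rows in place, so work on a copy here.
df2 = copy(df)
unique!(df2, :a)
```

In each case the rows are compared on the selected columns only, and the first row of every group of duplicates is kept.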

@mkborregaard GitHub or git? I don't see any fork of DataFrames.jl on your account, which is the first step before doing a PR. At least, if you can point us to the branch on your fork, we can have a look.

Sorry, I did something wrong. Here it is: #909.

Fixed the issues as requested and wrote some tests.

How does one select unique rows?

unique

df.colname.unique()

I think you meant unique(df, colname)?

No, the dataset name followed by the column name from which you want to get the unique values:
datasetname.colname.unique()
Try it out and let me know.

This is Julia; what you propose would work in a language like Python. The way I have shown is how you should do it in Julia, e.g.:

julia> df = DataFrame(a=[1,2,3,1,2,3], b=6:-1:1)
6×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │
│ 4   │ 1     │ 3     │
│ 5   │ 2     │ 2     │
│ 6   │ 3     │ 1     │

julia> unique(df, :a)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │

Oh okay sorry about that
