It would be great to extend unique, or to add a uniquerows function, so that one can create a DataFrame of unique rows from an existing one, where uniqueness can be based on the values in a particular column. For example,
unique(df, 5)
gives a new dataframe based on the unique values in column 5.
Part of this functionality already exists, but needs to be renamed: it is currently called drop_duplicates!. Being able to select rows is a nice extension that shouldn't be too hard.
I am planning to do a massive slew of renaming in this package next week.
Thanks @johnmyleswhite
Is this issue still live? I would love to see this functionality. Here is my use case (I copied this from closed issue #908).
I very often need to extract the values of one or more columns from a DataFrame, keeping only one row for each distinct value of one or more other columns. So if I have a DataFrame X with:
| site | foo | bar |
| "a" | 1 | 1 |
| "a" | 1 | 2 |
| "b" | 2 | 3 |
In this case, the bar values are redundant, and each site value is always associated with the same foo value. I want a DataFrame that matches site to foo values and discards bar values.
In R I would do:
ret <- X[!duplicated(X$site), c("site", "foo")]
which would give
| site | foo |
| "a" | 1 |
| "b" | 2 |
But in Julia, the nonunique(df) function only accepts a full DataFrame, leading to ugly code like
ret = X[!nonunique(DataFrame(tmp = X[:site])), [:site, :foo]]
# or prettier, as pointed out by @alyst:
ret = X[!nonunique(X[:, [:site]]), [:site, :foo]]
It would be great to have a second optional argument to unique() and nonunique() that lets me specify the columns to compare.
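For readers on a recent DataFrames.jl release, where nonunique accepts a column selector as its second argument, the R idiom above can be sketched roughly as follows. The data is the three-row example table from this comment; the variable names dup and ret are only illustrative.

```julia
using DataFrames

# The example table from above
X = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

# Boolean mask: true for rows whose :site value already appeared in an earlier row
dup = nonunique(X, :site)

# Keep the first row for each site, and only the :site and :foo columns
ret = X[.!dup, [:site, :foo]]
```

Because the mask is computed on :site alone but applied to X itself, any columns of X can be selected in the final indexing step.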
It shouldn't be hard to implement if you want to give it a try. Have a look at nonunique and unique! in abstractdataframe.jl. Passing a vector of symbols for column names, and using that to subset the DataFrameRow before comparing should work.
Constructing a DataFrame that has a subset of columns should be inexpensive, so it could be as simple as nonunique(df, colnames) = nonunique(df[colnames]).
Well, the problem with that solution is that one would lose all columns which were not retained for the comparison.
@nalimilan nonunique() just returns a Boolean vector marking the duplicated rows, not a DataFrame, so it should be fine.
Ah, right. So that should be easy.
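The forwarding idea discussed above can be sketched like this. The name nonunique_by is hypothetical (chosen to avoid redefining the package's own nonunique), and select with copycols=false is used here as one way to build the column subset without copying the data; this is an illustration, not the code that was eventually merged.

```julia
using DataFrames

# Hypothetical helper: compute the duplicate mask on a subset of columns only.
# select(...; copycols=false) reuses the existing column vectors instead of copying them.
nonunique_by(df::AbstractDataFrame, cols) = nonunique(select(df, cols; copycols=false))

X = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

mask = nonunique_by(X, [:site])

# The mask indexes the original DataFrame, so no column is lost
firsts = X[.!mask, :]
```

Since only the mask comes from the column subset, the columns left out of the comparison are still present in the result.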
Happy to do the PR but uncertain about the preferred style. I would prefer to have methods for a second argument with at least 4 types: Symbol, Vector{Symbol}, Int and Vector{Int}.
You can leave out type annotations: if the type of the new argument is not supported, an error will be raised at that point, and it should be clear enough for the caller.
IIRC the guidelines recommend using ::Any type qualifier in that case.
I have written the code, but GitHub tells me I do not have permissions to do a PR on DataFrames?
I ended up implementing with an ::Any qualifier but then also with methods for a single symbol or integer, so that @ViralBShah's example unique(df, 5) would also work. I also implemented this in unique() and unique!().
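A rough sketch of that method layout, assuming a recent DataFrames.jl: an untyped generic method plus a convenience method for a single Symbol or integer. The name unique_by is hypothetical and this is not the code from the PR itself.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl: keep the first row for each
# distinct combination of values in `cols`, retaining all columns of `df`.
function unique_by(df::AbstractDataFrame, cols)
    mask = nonunique(select(df, cols; copycols=false))
    return df[.!mask, :]
end

# Convenience method: a single column name or position is wrapped in a vector
unique_by(df::AbstractDataFrame, col::Union{Symbol,Integer}) = unique_by(df, [col])

X = DataFrame(site = ["a", "a", "b"], foo = [1, 1, 2], bar = [1, 2, 3])

by_name = unique_by(X, :site)   # uniqueness based on the :site column
by_position = unique_by(X, 2)   # uniqueness based on the second column (:foo)
```

Leaving the generic method untyped means an unsupported selector fails inside select, which matches the suggestion above to let the error surface at that point.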
@mkborregaard GitHub or git? I don't see any fork of DataFrames.jl on your account, which is the first step before doing a PR. At least, if you can point us to the branch on your fork, we can have a look.
Sorry, I did something wrong. Here it is: #909.
Fixed the issues as requested and wrote some tests.
How does one select unique rows?
unique
df.colname.unique()
I think you meant unique(df, colname)?
No, the dataset name followed by the name of the column you want to get the unique values from:
datasetname.colname.unique()
Try it out and let me know
This is Julia, what you propose would work in a language like Python. The way I have shown is how you should do it in Julia, e.g.:
julia> df = DataFrame(a=[1,2,3,1,2,3], b=6:-1:1)
6×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │
│ 4   │ 1     │ 3     │
│ 5   │ 2     │ 2     │
│ 6   │ 3     │ 1     │
julia> unique(df, :a)
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 6     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 4     │
Oh okay sorry about that
Fixed in https://github.com/JuliaStats/DataFrames.jl/pull/909