Base.join has a clear docstring which is to join an array of strings with delimiters. This seems like a different operation than what the join on DataFrames provides. Since the operation is different, it shouldn't share the namespace (DataFrames.join shouldn't extend Base.join)
Indeed this is not very clean.
The problem we have is that the operation we do in DataFrames.jl is universally called join. For sure we do not want to force users to write DataFrames.join.
I think we could do two things:
jointables and make it live in DataAPI.jl or something similar?I am not sure what would be better. Also adding @piever to the discussion as JuliaDB.jl also extends Base.join for exactly the same purpose as DataFrames.jl does.
FWIW I don't see an issue with the current terminology/extension. join(df1, df2) is so natural that I doubt there would be any confusion, and even if there is, people can just check the docstring for join to read about the method for DataFrames.
(I'm also not a fan of the idea that if someone has used a word as a function name to mean a particular thing, that no one else can use that word to mean anything else. It's just a word, that's what documentation is for.)
(I'm also not a fan of the idea that if someone has used a word as a function name to mean a particular thing, that no one else can use that word to mean anything else. It's just a word, that's what documentation is for.)
You seem to be confused about how generic programming works. Someone creates a function in a namespace. That function has a certain meaning (which is indicated by the docstring) envisioned by the author who created the function. For example, Base.push!(a, b) means to add the element b to the collection a. It does not mean that b gives a a physical push (like in a game). Other types can extend (overload) this function by creating methods Base.push(::MyType, ...) =. The meaning of the function is still the same though. This is important so that when you e.g. see untyped code like
function func(x, a)
for i in foo(x)
push!(a, i)
end
end
you have some clue what it does and what the meaning of it is.
If a game want to use push! to actually mean that a player gives another player a push that is done by creating a new function push! (that does not extend Base.push!). This function has it's own docstring. People reading code can from looking at the namespace of push! know which function is used and code makes sense.
This is not unique to Julia, for example from the C++ code guidelines (https://google.github.io/styleguide/cppguide.html#Function_Overloading):
Use overloaded functions (including constructors) only if a reader looking at a call site can get a good idea of what is happening without having to first figure out exactly which overload is being called.
In other words, you shouldn't need to know which method of Base.push! that gets called to know sort of what the functionality of the code is. That is why you can give docstrings to function at all. To convey their meaning because they are more than "a name".
You seem to be confused about how generic programming works.
Perhaps I didn't convey my point accurately but this is pretty rude.
I don't disagree with most of the points you've made. My point is that there are some operations with names that are sufficiently generic that they mean different things in different contexts anyway, like join. No one is going to be reading tabular data manipulation code and see join(df1, df2) and assume that df1 is a vector of strings and df2 is a separator. The reuse of the name is of very little practical consequence, and requiring a module qualification is just annoying for the user.
My point is that there are some operations with names that are sufficiently generic that they mean different things in different contexts anyway, like join.
Sure, but those things should then not extend the same function.
However, I do think that perhaps
- leave the situation as is (knowing it is not ideal, but for years people used it and never complained);
might be the best thing from a pragmatic point of view.
We recently decided that a new flatten function for DataFrames would not be an extension of Iterators.flatten and instead a flatten should live in DataAPI.
So there is precedent for this. Making join in DataAPI seems natural in this case.
The difference is that flatten is not in Base, so in normal workflows you do not have to write DataAPI.flatten but just flatten in the code. While join is in Base, so you would always have to write DataFrames.join when you would want to do a join of two data frames.
While join is in Base, so you would always have to write DataFrames.join when you would want to do a join of two data frames.
Unless you du using DataFrames: join I guess?
If only there was a way to namespace things based on the object :P
This will not solve a problem:
julia> join(["1", "2"])
"12"
julia> module A
join(::Float64) = 0.0
end
Main.A
julia> using Main.A: join
WARNING: ignoring conflicting import of A.join into Main
and if you reverse the order of calls you get:
julia> module A
join(::Float64) = 0.0
end
Main.A
julia> using Main.A: join
julia> join(["1", "2"])
ERROR: MethodError: no method matching join(::Array{String,1})
requiring people to use DataFrames.join would be frustrating. I would bet most users (or, desired audience) have never encountered a namespace error before.
While I agree that from a purist's point of view Base.join should not be overloaded for some very different behavior, I would tend to agree that having to qualify it with the package name would be very annoying. OTOH I would find an exported function with a different name, like datajoin or jointables as proposed by @bkamins, less annoying.
The only thing I wanted to point out, though, is that eventually it'd be nice to be able to define fallbacks for data manipulations for all table structures (things that respect the Tables.jl interface). With that in mind, it is tricky that one of these operations is an extended Base function, because then it would be impossible to extend join to arbitrary tables (no common supertype). In the case of other Base functions (e.g. map or filter) that are also useful on tables, this is less of an issue as the Base fallback already does a sensible thing, e.g. map(f, Tables.rows(sometable)) would return the correct table.
On a related note, JuliaDB uses broadcast(merge, t1, t2) as a synonym (with some differences) of join, with the idea that primary keys are indices of the table, so broadcasting should make a join (and merge merges the rows, which are named tuples, so that the output has columns from both inputs), but this is very specific to keyed tables and may not make sense for DataFrames.
How about splitting join into separate functions like innerjoin and leftjoin? This solves the overloading problem and would be consistent with both SQL and dplyr.
We do many types of join: inner, outer, left, right, semi, anti, and cross.
We do many types of join: inner, outer, left, right, semi, anti, and cross.
Yes, I meant why not export all seven as separate functions in the public API? SQL defines separate clauses for each different kind of join. dplyr also defines separate functions for each kind of join. So the precedent is there. I also think the join syntax is slightly prettier without the kind keyword argument:
leftjoin(name, job, on = :ID)
# vs
join(name, job, on = :ID, kind = :left)
I would be OK with this if people would find it OK. I will ask on Slack to up/down vote your proposal and let us see the response.
Base.join and this join are almost disambiguated by being functions operating on one argument versus multiple arguments (Base.join would nowadays perhaps use keyword arguments instead of positional arguments.) I feel like with different arity different things can share the same name. Are there instances of that in Base?
Base.joinand thisjoinare almost disambiguated by being functions operating on one argument versus multiple arguments
According to the docs on Julia 1.3.1, this is the signature of Base.join:
join([io::IO,] strings [, delim [, last]])
so Base.join can have an arity of 1, 2, 3, or 4.
(Base.join would nowadays perhaps use keyword arguments instead of positional arguments.)
Base.join does not currently use keyword arguments...
I wrote βalmost disambiguatedβ for that reason. Also read the last sentence as
Base.join would perhaps use keyword arguments instead of positional arguments if introduced nowadays.
I like the proposal of using separate function names for the different kinds; one reason is this: I've seen a non-trivial number of cases where someone messes up a left vs. inner join and hurt themselves w/ bad data or frustrated because things don't seem to be working like they think. Encouraging good behavior by being more explicit would help, which I feel is the kind of design motivation we've used before (e.g. 3-value logic w/ missing)
I really like the explicit function name proposal: in most cases it saves characters (no more kind=), and it is a lot more explicit in the default situation. Other than it being a breaking change I really don't see any downside. I think even if Base.join didn't exist I would prefer that over the current API.
The main downside I can see (apart from breakage) is that there wouldn't be a single place to get the documentation about the possible join types (like ?join currently), and that people might still look at ?join and be confused by not seeing anything about tables there. Of course that's probably not the end of the world as dplyr does that.
We could easily at-ref the other join functions in each join functions' documentation. And even if they do ?join, it should show innerjoin, leftjoin, etc, as "related searches".
OK - I think that there is a general agreement on adding innerjoin etc.
I will make a PR introducing it.
See #2101 for an implementation (huge diff unfortunately).
people might still look at
?joinand be confused by not seeing anything about tables there
It looks like you can define docstring without defining a method. Perhaps it still makes sense to add docstring in Base.join even after this method is completely removed?
julia> module Demo
"hello"
Base.join
end
Main.Demo
help?> join
search: join joinpath adjoint typejoin isdisjoint
join([io::IO,] strings [, delim [, last]])
Join an array of strings into a single string, inserting the given delimiter (if any) between adjacent
strings. If last is given, it will be used instead of delim between the last two strings. If io is given, the
result is written to io rather than returned as as a String.
strings can be any iterable over elements x which are convertible to strings via print(io::IOBuffer, x).
strings will be printed to io.
Examples
β‘β‘β‘β‘β‘β‘β‘β‘β‘β‘
julia> join(["apples", "bananas", "pineapples"], ", ", " and ")
"apples, bananas and pineapples"
julia> join([1,2,3,4,5])
"12345"
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
hello
After #2101 the help for join looks like:
julia> using DataFrames
help?> join
search: join joinpath adjoint typejoin semijoin leftjoin antijoin rightjoin outerjoin innerjoin crossjoin
... some more stuff on join ...
and as it was commented earlier I think it is easy to spot the specific methods + we will be providing deprecation for join if someone uses it.
The biggest breaking change since getindex redesign got merged. Thank you all for discussing it.
Most helpful comment
OK - I think that there is a general agreement on adding
innerjoinetc.I will make a PR introducing it.