Given that rows in tables are essentially named tuples of a bunch of values, many or most of which are of type Union{Missing, T}
, I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science. I've been waiting a while for #32699 but it doesn't seemed to have helped much. For example
function test()
data = (a = if rand(Bool)
1
else
"1"
end,)
merge(data, (b = data.a,))
end
@code_warntype test() # NamedTuple{(:a, :b),_A} where _A<:Tuple
Also, #31909 means that you can't use union missing data as captures. Given that these types are kind of messy, it would be understandable if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues. Apologies if there is an issue for this open already.
I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science
if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues.
It's very unclear what this issue is about with all these fuzzy statements and claims that Julia should somehow announce that you shouldn't use unions in data science or something. Please reopen an issue written in such a way that it is clear what the issue is, (what code ran, what was the result, what was the expected result), leaving out irrelevant side comments and ultimatums.
Which comments are irrelevant? I have some code and the result above???
I quoted them.
Again, feel free to open a new issue where you leave out that stuff. A good format is:
Title: Imprecise inference when merging a union value into a NamedTuple of unions
Running this code:
the returned value from inference is ....
I expect it to be ....
This would be useful for ...
Ok, well, I still don't see why those comments aren't relevant, but see #36713
I agree this issue lacks clarity:
missing
values or potentially missing types like Union{T, Missing}
?missing
to represent missing data? I bring this up because as a data ecosystem developer, I'm not aware of the direness of this issue or how it threatens to change everything.As for the actual issue, it seems like the request is to just extend some of the inference improvements from #32699 to apply in more cases? These kinds of issues are definitely tricky to spell out exactly; #32699 originally came out at last year's JuliaCon when @JeffBezanson and I sat down for a half hour and walked through some code; it was only then that we were both able to fully communicate and understand what the issue was and what we could reasonably do in Base. I'm not saying that you need a private meeting with Jeff to get anything done, but I'm just trying to emphasize that #32699 was really hard for me to explain or write up in an issue, because there are lots of steps and factors that come into play. I think it'd be helpful in this case to provide as much context as possible: what is the code? what isn't inferring like you'd expect? how is the specific uninferrable case affecting larger code flow?
If you would like context, you can see the benchmarks.jl file of LightQuery on master. This code:
using LightQuery: @>, @_, By, Group, over, make_columns, @name_str, Rows, value
using CSV: File
using DataFrames: DataFrame, groupby, combine
using Tables: columntable
cd("/home/brandon/benchmark")
download("http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz", "mortgage_2000.tgz")
run(`tar --gzip --extract --file=mortgage_2000.tgz`)
cd("perf")
file = File(
"Performance_2000Q1.txt",
delim = '|',
header = Symbol.(string.("Column", 1:31)),
missingstrings = ["NULL", ""],
dateformat = "mm/dd/yyyy",
truestrings = ["Y"],
falsestrings = ["N"],
)
# just the type stable ones
columns = (name"Column1", name"Column2", name"Column4", name"Column6", name"Column10", name"Column11", name"Column12")(columntable(file))
function process_with_lightquery(columns)
@> Rows(; columns...) |>
Group(By(_, name"Column1")) |>
over(_, @_ (Count = length(value(_)),)) |>
make_columns
end
@time process_with_lightquery(columns)
This code runs in 0.4 seconds (beating DataFrames). As as I add column 3, which has an eltype of Union{Missing, T}
, the code runs...indefinitely? Longer than 2 minutes
Ok, now that's a concrete, actionable issue. In the future, please start with that.
Ok, will do
Most helpful comment
I agree this issue lacks clarity:
missing
values or potentially missing types likeUnion{T, Missing}
?missing
to represent missing data? I bring this up because as a data ecosystem developer, I'm not aware of the direness of this issue or how it threatens to change everything.As for the actual issue, it seems like the request is to just extend some of the inference improvements from #32699 to apply in more cases? These kinds of issues are definitely tricky to spell out exactly; #32699 originally came out at last year's JuliaCon when @JeffBezanson and I sat down for a half hour and walked through some code; it was only then that we were both able to fully communicate and understand what the issue was and what we could reasonably do in Base. I'm not saying that you need a private meeting with Jeff to get anything done, but I'm just trying to emphasize that #32699 was really hard for me to explain or write up in an issue, because there are lots of steps and factors that come into play. I think it'd be helpful in this case to provide as much context as possible: what is the code? what isn't inferring like you'd expect? how is the specific uninferrable case affecting larger code flow?