Julia: NamedTuples with lots of missing data

Created on 17 Jul 2020  路  8Comments  路  Source: JuliaLang/julia

Given that rows in tables are essentially named tuples of a bunch of values, many or most of which are of type Union{Missing, T}, I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science. I've been waiting a while for #32699 but it doesn't seemed to have helped much. For example

function test()
    data = (a = if rand(Bool)
        1
    else
        "1"
    end,)
    merge(data, (b = data.a,))
end
@code_warntype test() # NamedTuple{(:a, :b),_A} where _A<:Tuple

Also, #31909 means that you can't use union missing data as captures. Given that these types are kind of messy, it would be understandable if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues. Apologies if there is an issue for this open already.

Most helpful comment

I agree this issue lacks clarity:

  • "NamedTuples with lots of missing data" isn't a very clear issue title; it's just a statement; is there something wrong? is something not inferring? Are you referring to NamedTuples with missing values or potentially missing types like Union{T, Missing}?
  • "please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues"; the tone of this statement feels inappropriate, like you're giving an ultimatum to core developers or something. Is there a consensus among data ecosystem developers that this single issue (whatever it is?) is a make or break issue for using NamedTuples as rows? Or using missing to represent missing data? I bring this up because as a data ecosystem developer, I'm not aware of the direness of this issue or how it threatens to change everything.

As for the actual issue, it seems like the request is to just extend some of the inference improvements from #32699 to apply in more cases? These kinds of issues are definitely tricky to spell out exactly; #32699 originally came out at last year's JuliaCon when @JeffBezanson and I sat down for a half hour and walked through some code; it was only then that we were both able to fully communicate and understand what the issue was and what we could reasonably do in Base. I'm not saying that you need a private meeting with Jeff to get anything done, but I'm just trying to emphasize that #32699 was really hard for me to explain or write up in an issue, because there are lots of steps and factors that come into play. I think it'd be helpful in this case to provide as much context as possible: what is the code? what isn't inferring like you'd expect? how is the specific uninferrable case affecting larger code flow?

All 8 comments

I think it would be good to prioritize dealing with these kind of types in a reasonable way if Julia is aiming to be used for data science

if Base cannot support them, but if so, please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues.

It's very unclear what this issue is about with all these fuzzy statements and claims that Julia should somehow announce that you shouldn't use unions in data science or something. Please reopen an issue written in such a way that it is clear what the issue is, (what code ran, what was the result, what was the expected result), leaving out irrelevant side comments and ultimatums.

Which comments are irrelevant? I have some code and the result above???

I quoted them.

Again, feel free to open a new issue where you leave out that stuff. A good format is:

Title: Imprecise inference when merging a union value into a NamedTuple of unions
Running this code:


the returned value from inference is ....

I expect it to be ....

This would be useful for ...

Ok, well, I still don't see why those comments aren't relevant, but see #36713

I agree this issue lacks clarity:

  • "NamedTuples with lots of missing data" isn't a very clear issue title; it's just a statement; is there something wrong? is something not inferring? Are you referring to NamedTuples with missing values or potentially missing types like Union{T, Missing}?
  • "please make an announcement so that the data ecosystem can fully shift to alternatives such as DataValues"; the tone of this statement feels inappropriate, like you're giving an ultimatum to core developers or something. Is there a consensus among data ecosystem developers that this single issue (whatever it is?) is a make or break issue for using NamedTuples as rows? Or using missing to represent missing data? I bring this up because as a data ecosystem developer, I'm not aware of the direness of this issue or how it threatens to change everything.

As for the actual issue, it seems like the request is to just extend some of the inference improvements from #32699 to apply in more cases? These kinds of issues are definitely tricky to spell out exactly; #32699 originally came out at last year's JuliaCon when @JeffBezanson and I sat down for a half hour and walked through some code; it was only then that we were both able to fully communicate and understand what the issue was and what we could reasonably do in Base. I'm not saying that you need a private meeting with Jeff to get anything done, but I'm just trying to emphasize that #32699 was really hard for me to explain or write up in an issue, because there are lots of steps and factors that come into play. I think it'd be helpful in this case to provide as much context as possible: what is the code? what isn't inferring like you'd expect? how is the specific uninferrable case affecting larger code flow?

If you would like context, you can see the benchmarks.jl file of LightQuery on master. This code:

using LightQuery: @>, @_, By, Group, over, make_columns, @name_str, Rows, value

using CSV: File
using DataFrames: DataFrame, groupby, combine
using Tables: columntable

cd("/home/brandon/benchmark")
download("http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz", "mortgage_2000.tgz")
run(`tar --gzip --extract --file=mortgage_2000.tgz`)
cd("perf")

file = File(
    "Performance_2000Q1.txt",
    delim = '|',
    header = Symbol.(string.("Column", 1:31)),
    missingstrings = ["NULL", ""],
    dateformat = "mm/dd/yyyy",
    truestrings = ["Y"],
    falsestrings = ["N"],
)

# just the type stable ones
columns = (name"Column1", name"Column2", name"Column4", name"Column6", name"Column10", name"Column11", name"Column12")(columntable(file))

function process_with_lightquery(columns)
    @> Rows(; columns...) |>
    Group(By(_, name"Column1")) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns
end

@time process_with_lightquery(columns)

This code runs in 0.4 seconds (beating DataFrames). As as I add column 3, which has an eltype of Union{Missing, T}, the code runs...indefinitely? Longer than 2 minutes

Ok, now that's a concrete, actionable issue. In the future, please start with that.

Ok, will do

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sbromberger picture sbromberger  路  3Comments

omus picture omus  路  3Comments

Keno picture Keno  路  3Comments

yurivish picture yurivish  路  3Comments

dpsanders picture dpsanders  路  3Comments