Datascience: .group() with a list raises deprecation warning

Created on 21 Sep 2020 · 11 Comments · Source: data-8/datascience

If you call tbl.group(['a', 'b']), you get the following warning:

//lib/python3.7/site-packages/datascience/tables.py:630: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  values = np.array(tuple(values))

This is unsightly. Is there anything we should/can do in the datascience library to eliminate this?
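The warning can be reproduced with plain NumPy, outside the library. The sketch below is only an illustration, not the library's code: the `ragged` input and the helper name `describe_ragged_conversion` are made up. The outcome depends on the installed NumPy version. Versions 1.19 through 1.23 emit the warning, 1.24 and later raise `ValueError`, and earlier versions silently build an object array.

```python
import warnings
import numpy as np

def describe_ragged_conversion(values):
    """Report what np.array does with `values` on the installed NumPy."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        try:
            np.array(tuple(values))  # same call shape as the flagged line
        except ValueError:
            return "error"  # NumPy >= 1.24 rejects ragged input outright
    if any("ragged" in str(w.message) for w in caught):
        return "warning"  # NumPy 1.19 through 1.23
    return "silent"  # older NumPy, or the input was not ragged

ragged = [[1, 2, 3], [4, 5]]  # hypothetical rows of different lengths
print(np.__version__, describe_ragged_conversion(ragged))
print(describe_ragged_conversion([[1, 2], [3, 4]]))  # regular input
```

Regular (equal-length) input takes the "silent" path on every version, which may be why the warning appears only for some tables.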

All 11 comments

@davidwagner You are getting this warning from this line. I want to fix this, can I do that?

There are actually some more occurrences of this particular deprecation warning when you run the tests as well.

tests/test_tables.py::test_doctests
  /Users/hmstepanek/Documents/open-source/datascience/datascience/util.py:40: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
    return np.array(elements)

tests/test_tables.py: 14 warnings
  /Users/hmstepanek/Documents/open-source/datascience/datascience/tables.py:877: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
    values = np.array(tuple(values))

@hmstepanek All of these warnings are of a similar kind. Can I try to fix this and contribute to this open-source project?

Yes I think that would be much appreciated! (I've been contributing the last couple weeks and it's been a wonderful experience :) )

@davidwagner @hmstepanek I am taking up this issue. Thanks a lot!!

I made some changes locally and ran the tests. Some of them failed, and most of the failures were docstring tests. They fail because, with my changes, all the empty strings in a column are replaced by None, whereas the docstrings still expect empty strings. Should I update the docstrings as well?

@Ipsit1234 - were these tests failing prior to the changes you made locally? If so, then we should work with the tests available and try to craft code to make sure the tests don't fail (as they represent current behavior - which we're not looking to change).

As for this issue with None, I'm leaning towards keeping the current functionality of making the cell blank rather than None. This is strictly from a pedagogical point of view, so if there are convincing arguments otherwise, I'd be open to discussing them here :)

@adnanhemani No, those tests were not failing before the changes I made locally. The warning about creating an ndarray from ragged nested sequences comes from a change in NumPy v1.19.0, which deprecates automatically inferring the object dtype for such input. NEP 34 explains the rationale.
I therefore suggest using None instead of empty strings, since future NumPy versions will keep giving this warning.
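As the warning text itself says, ragged input has to be given an explicit object dtype. A minimal sketch of two ways to do that follows. The helper name `as_object_column` is hypothetical and is not part of the datascience library or of NumPy.

```python
import numpy as np

def as_object_column(values):
    """Build a 1-D object array with one element per item of `values`."""
    values = list(values)
    column = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        column[i] = value  # store each item as-is, with no shape inference
    return column

ragged = [[1, 2, 3], [4, 5]]  # hypothetical ragged input

# Option 1: state the dtype explicitly, as the warning message suggests.
explicit = np.array(ragged, dtype=object)
print(explicit.shape, explicit.dtype)

# Option 2: preallocate and fill, which stays 1-D even for regular input.
filled = as_object_column(ragged)
print(filled.shape, filled.dtype)
print(as_object_column([[1, 2], [3, 4]]).shape)
```

One design point to weigh: with `dtype=object`, `np.array` still turns equal-length rows into a 2-D array, while the preallocate-and-fill approach always yields one element per row.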

I confess I don't really understand what is happening here. It looks like in the example you gave, marbles.group("Shape", sum) is going to end up executing (effectively):

c = [sum(np.array(["Green", "Blue", "Green"])), sum(np.array(["Red", "Red", "Green"]))]
columns.append(c)

I would expect that to raise an exception. Does it somehow get converted to None, and then numpy converts the None somehow to "None"? How does that happen? I'm not immediately seeing how we get ragged lists.

Basically, I'm trying to understand the nature of the changed behavior, under what circumstances we'd get different behavior, and how easy/hard it is to preserve existing behavior. My bias is also towards preserving existing behavior. I didn't quite understand the justification for displaying or storing "None" instead of "". I don't understand the relevance of NEP 34 or why we'll continue to see warnings in the future.

Basically: @Ipsit1234, can you help me understand the root cause of what's going on? Sorry for being dense/slow.
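The expectation that `sum` fails on a string column can be checked directly. The sketch below also shows one way a blank cell could arise: a wrapper that catches the `TypeError` and returns the dtype's zero value. The wrapper `typed_zero_on_type_error` is hypothetical. Whether the datascience library does something like this is an assumption here, not something confirmed in this thread.

```python
import numpy as np

def typed_zero_on_type_error(fn):
    """Hypothetical wrapper: fall back to the column dtype's zero value."""
    def wrapped(column):
        try:
            return fn(column)
        except TypeError:
            return column.dtype.type()  # '' for strings, 0 for ints
    return wrapped

shapes = np.array(["Green", "Blue", "Green"])

# The built-in sum starts from the integer 0, so adding a string fails.
try:
    sum(shapes)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"
print(outcome)

safe_sum = typed_zero_on_type_error(sum)
print(repr(safe_sum(shapes)))         # blank string instead of an exception
print(safe_sum(np.array([1, 2, 3])))  # numeric columns still sum normally
```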

I get this warning only when running the tests, as @hmstepanek mentioned above. In normal usage, tbl.group(['a','b']) did not give me any warning. From looking at the tests, I assumed that ragged lists were somehow being formed during grouping, but when I checked the code, that was not the case.
Sorry for the confusion, @davidwagner. Could you also provide a code snippet that reproduces the warning while grouping?

I observed this in lec11.ipynb for Fall 2020's Data 8 demos, in the survey.group(['Handedness','Sleep position']).show() cell, but only on my local machine, not on datahub. Possibly because my local machine has numpy 1.19.1 whereas datahub has numpy 1.18.5. Let me see if I can construct a minimal reproducible testcase.
