Datascience: .group() with a list raises deprecation warning

Created on 21 Sep 2020 · 11 Comments · Source: data-8/datascience

If you call tbl.group(['a', 'b']), you get the following warning:

//lib/python3.7/site-packages/datascience/tables.py:630: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  values = np.array(tuple(values))

This is unsightly. Is there anything we should/can do in the datascience library to eliminate this?
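For context, here is a minimal sketch of the NumPy behaviour behind the message, independent of the datascience library. The helper name `array_creation_outcome` is hypothetical, and the example does not claim to show what `tables.py` actually passes to `np.array`. It only classifies what the installed NumPy does when asked to build an array without an explicit dtype.

```python
import warnings

import numpy as np


def array_creation_outcome(values):
    """Classify what np.array(values) does on the installed NumPy.

    Returns "ok", "warning", or "error". The helper name is hypothetical
    and exists only for this illustration.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of printing it or hiding repeats.
        warnings.simplefilter("always")
        try:
            np.array(values)
        except ValueError:
            # Newer NumPy releases reject ragged input outright.
            return "error"
    return "warning" if caught else "ok"


# Rows of equal length form a regular 2-D array.
uniform = ((1, 2), (3, 4))
# Rows of different lengths are "ragged".
ragged = ((1, 2), (3,))

uniform_outcome = array_creation_outcome(uniform)
ragged_outcome = array_creation_outcome(ragged)
```

On a NumPy release that emits the warning quoted above, `ragged_outcome` should be "warning", while `uniform_outcome` is "ok" on any release.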

davidwagner

All 11 comments

@davidwagner You are getting this warning from this line. I would like to fix this. May I?

Ipsit1234 on 26 Sep 2020

There are actually some more occurrences of this particular deprecation warning when you run the tests as well.

tests/test_tables.py::test_doctests
  /Users/hmstepanek/Documents/open-source/datascience/datascience/util.py:40: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
    return np.array(elements)

tests/test_tables.py: 14 warnings
  /Users/hmstepanek/Documents/open-source/datascience/datascience/tables.py:877: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
    values = np.array(tuple(values))

hmstepanek on 27 Sep 2020

@hmstepanek All of these warnings are of a similar kind. Can I try to fix them and contribute to this open-source project?

Ipsit1234 on 27 Sep 2020

Yes I think that would be much appreciated! (I've been contributing the last couple weeks and it's been a wonderful experience :) )

hmstepanek on 27 Sep 2020

@davidwagner @hmstepanek I am taking up this issue. Thanks a lot!!

Ipsit1234 on 27 Sep 2020

I made some changes locally and ran the tests. Some of the tests failed, most of them docstring tests. They fail because, with my changes, all the empty strings in a column are replaced by None, whereas the docstrings still expect empty strings. Should I update the docstrings as well?

Ipsit1234 on 30 Sep 2020

@Ipsit1234 - were these tests failing prior to the changes you made locally? If so, then we should work with the tests available and try to craft code to make sure the tests don't fail (as they represent current behavior - which we're not looking to change).

As for this issue with None, I'm leaning towards keeping the current functionality of making the cell blank rather than None. This is strictly from a pedagogical point of view, so if there are convincing arguments otherwise, I'd be open to discussing them here :)

adnanhemani on 1 Oct 2020

@adnanhemani No, the above tests were not failing prior to the changes that I made locally. The warning about creating an ndarray from ragged nested sequences comes from a change in NumPy v1.19.0, which deprecates the automatic inference of the object dtype for such sequences. The rationale can be found in NEP 34.
Hence I suggest using None instead of leaving empty strings, since future NumPy releases will keep emitting the same warning.
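For readers unfamiliar with NEP 34, the warning text itself points at the remedy: state `dtype=object` explicitly when a ragged array is intended. The following is a minimal sketch of that idea, not the change made in the datascience library. The helper `to_object_array` is a hypothetical name. It fills a preallocated object array so that each row stays a single element, even when all rows happen to have the same length.

```python
import numpy as np


def to_object_array(rows):
    """Build a 1-D object array holding each row as one element.

    Hypothetical helper for illustration. Preallocating with
    dtype=object means NumPy never has to infer a dtype from nested
    sequences, so ragged input produces no deprecation warning.
    """
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row  # stores the row object itself at index i
    return out


ragged = [(1, 2), (3,)]

# Explicit dtype=object is accepted for ragged input.
direct = np.array(ragged, dtype=object)

# The fill-based variant also keeps equal-length rows as a 1-D array.
filled = to_object_array([(1, 2), (3, 4)])
```

The design choice between the two forms matters only for rows of equal length: `np.array(..., dtype=object)` would turn those into a 2-D array, while the fill-based helper always returns one element per row.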

Ipsit1234 on 2 Oct 2020

I confess I don't really understand what is happening here. It looks like in the example you gave, marbles.group("Shape", sum) is going to end up executing (effectively):

c = [sum(np.array(["Green", "Blue", "Green"])), sum(np.array(["Red", "Red", "Green"]))]
columns.append(c)

I would expect that to raise an exception. Does it somehow get converted to None, and then numpy converts the None somehow to "None"? How does that happen? I'm not immediately seeing how we get ragged lists.

Basically, I'm trying to understand the nature of the changed behavior, under what circumstances we'd get different behavior, and how easy/hard it is to preserve existing behavior. My bias is also towards preserving existing behavior. I didn't quite understand the justification for displaying or storing "None" instead of "". I don't understand the relevance of NEP 34 or why we'll continue to see warnings in the future.

Basically: @Ipsit1234, can you help me understand the root cause of what's going on? Sorry for being dense/slow.

davidwagner on 3 Oct 2020

I get this warning only when I run the tests, as @hmstepanek mentioned above. In normal usage, I did not get any warning from tbl.group(['a','b']). Looking at the tests, I assumed that grouping was somehow producing ragged lists, but when I checked the code, that was not the case.
Sorry for the confusion, @davidwagner. Could you also provide a code snippet that reproduces the warning while grouping?

Ipsit1234 on 3 Oct 2020

I observed this in lec11.ipynb for Fall 2020's Data 8 demos, in the survey.group(['Handedness','Sleep position']).show() cell, but only on my local machine, not on datahub. Possibly because my local machine has numpy 1.19.1 whereas datahub has numpy 1.18.5. Let me see if I can construct a minimal reproducible testcase.
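The version difference mentioned here is consistent with the NumPy release history: the deprecation warning for ragged input was introduced in 1.19, and from 1.24 onward the same input raises a ValueError instead. A small sketch that classifies a version string accordingly is shown below. The function name `ragged_behavior` and its return labels are hypothetical, chosen only for this illustration.

```python
def ragged_behavior(numpy_version):
    """Describe how a NumPy version treats ragged input without dtype=object.

    Hypothetical helper: returns "silent", "warns", or "raises" based on
    the major and minor components of the version string.
    """
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    if (major, minor) < (1, 19):
        return "silent"  # object dtype is inferred without complaint
    if (major, minor) < (1, 24):
        return "warns"   # VisibleDeprecationWarning is emitted
    return "raises"      # ValueError for inhomogeneous shapes


local_machine = ragged_behavior("1.19.1")
datahub = ragged_behavior("1.18.5")
```

To check the environment at hand, the installed version string is available as `numpy.__version__` and can be passed to the function directly.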

davidwagner on 4 Oct 2020
