I hope that I am the one missing something here, because this feature request is going to sound like a very basic thing that I had assumed would have been implemented.
In pandas, when I create a DataFrame and evaluate it in a Jupyter notebook cell, I get an HTML rendering of it in the notebook, e.g.:
# This is a Jupyter notebook cell
df = pd.read_csv(...)
df # the HTML is rendered.
With cuDF, I get a textual representation instead:
gdf = cudf.read_csv(...)
gdf # the following text gets printed to screen: <cudf.DataFrame ncols=4 nrows=4208262 >
I had expected the gdf object to be rendered as HTML, as in pandas, but that is not what happens in my installation of cuDF. If I have not missed something, then I'm hoping this can be implemented, so that cuDF's UX matches pandas' UX and provides a seamless transition.
A near-term fix for this would be gdf.head().to_pandas(), but it would be great to have the proper method defined for Jupyter Notebook as well.
I suggest the following implementation:
import pandas as pd

def get_renderable_pandas_dataframe(gdf):
    # Only copy to the host as many rows as pandas will actually display.
    n = pd.get_option("display.max_rows")
    if n is None or len(gdf) <= n:
        return gdf.to_pandas()
    else:
        # enough head and tail to look the same, plus some extra
        return pd.concat([gdf.head(n + 1).to_pandas(), gdf.tail(n + 1).to_pandas()])

def __repr__(self):
    return get_renderable_pandas_dataframe(self).__repr__()

def _repr_html_(self):
    return get_renderable_pandas_dataframe(self)._repr_html_()
There are probably some other methods, like __str__ and to_html, that might benefit from this treatment as well.
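To make the suggestion above concrete without needing a GPU, here is a minimal sketch that wires the same idea into a class. FakeGpuFrame is a hypothetical stand-in backed by pandas, not the cuDF API; it only mimics the head, tail, to_pandas and __len__ calls the suggestion relies on. The _preview helper name is also made up for illustration.

```python
import pandas as pd


class FakeGpuFrame:
    """Hypothetical stand-in for cudf.DataFrame, backed by pandas so it runs without a GPU."""

    def __init__(self, pdf):
        self._pdf = pdf

    def __len__(self):
        return len(self._pdf)

    def head(self, n=5):
        return FakeGpuFrame(self._pdf.head(n))

    def tail(self, n=5):
        return FakeGpuFrame(self._pdf.tail(n))

    def to_pandas(self):
        return self._pdf.copy()

    def _preview(self):
        # Convert only the rows pandas would show; the +1 makes pandas truncate the output.
        n = pd.get_option("display.max_rows")
        if n is None or len(self) <= n:
            return self.to_pandas()
        return pd.concat([self.head(n + 1).to_pandas(), self.tail(n + 1).to_pandas()])

    def __repr__(self):
        return repr(self._preview())

    def _repr_html_(self):
        # Jupyter looks for this hook to obtain the rich HTML output of an object.
        return self._preview()._repr_html_()


frame = FakeGpuFrame(pd.DataFrame({"a": range(1000), "b": range(1000)}))
html = frame._repr_html_()
print(len(frame._preview()), "rows converted out of", len(frame))
```

One caveat of this approach: the row count that pandas prints under a truncated table describes the preview, not the full frame, so a real implementation would probably want to patch that footer.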
@ericmjl any interest?
@mrocklin I thought you were supposed to be on break or something :smile:.
I do have a question, hope you don't mind it - I think it stems from my mental model of pandas, and model of GPUs, being unclear. Is conversion to a pandas DataFrame necessary for HTML rendering? If so, is it because HTML rendering is done on the CPU and not on the GPU? I guess I'm mostly concerned about data transfer making HTML rendering slow, but perhaps this concern is unfounded, if data transfer is minuscule and hence the overhead is as well?
I'm back from break :)
I imagine it would be possible to render a dataframe to HTML on the GPU, but my guess is that bringing over the 20 or so rows necessary to render on the CPU side (reusing the pandas code) can be done in a millisecond or so. Given that this only happens during human interaction I think that we're probably in a regime where a millisecond is considered a short time.
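As a rough illustration of how little data that is, the sketch below compares the size of a full table with the size of a 20-row preview. It uses a plain pandas frame as a stand-in for the GPU-resident table, and the table shape (one million rows, four integer columns) is made up for the example.

```python
import pandas as pd

# pandas frame standing in for a large GPU-resident table (hypothetical data).
full = pd.DataFrame({f"col{i}": range(1_000_000) for i in range(4)})

n = 10  # 10 head rows + 10 tail rows = the "20 or so rows" needed for a preview
preview = pd.concat([full.head(n), full.tail(n)])

full_bytes = int(full.memory_usage(index=False).sum())
preview_bytes = int(preview.memory_usage(index=False).sum())
print(f"full table: {full_bytes:,} bytes; preview: {preview_bytes:,} bytes")
```

Only the preview would need to cross from device to host, so the cost of rendering stays roughly constant no matter how large the full table is.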
Ok! I'm in :smile:. Would you like me to do the PR? (Reminds me, I have a backlog for dask-jobqueue - need to do the docs PR fix.)
If you're interested, sure! No pressure though :)
@kkraus14 if you have time can you chime in here? What would need to happen in order for this to work? Alternatively, do you have any suggestions?
@mrocklin I made some improvements to your original submission at https://github.com/rapidsai/cudf/pull/624 and it is nearly ready to be merged. It handles nulls in integer columns now by substituting the new Integer datatype during printing, but I'm not sure what else is required. I'll spend more time on this if possible before 0.8 ships since it has made it into P2.
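For readers unfamiliar with the null-handling issue, here is a small pandas-only sketch. It assumes the "new Integer datatype" mentioned above is pandas' nullable Int64 extension dtype; that is my reading, not something the comment confirms, and the sketch does not show how the PR itself performs the substitution.

```python
import pandas as pd

values = [1, None, 3]

# Default inference: the missing value forces a float64 column, so 1 displays as 1.0.
as_float = pd.Series(values)

# Nullable extension dtype: the values stay integers and the gap is kept as missing.
as_nullable_int = pd.Series(values, dtype="Int64")

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int)
```

Printing through the nullable dtype keeps an integer column looking like integers, which is presumably why it is useful when displaying GPU integer columns that contain nulls.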
For the record, @ericmjl, if you simply call gdf.to_pandas() you can render the result in Jupyter notebooks or anywhere else pandas rendering is supported.
Great news! I'm curious, is there a PR somewhere where I can see your work?
The PR is your original PR, I've been pushing to mrocklin/repr-html :)
Woooo!!