DataFrames.jl: Stop printing row numbers in show(io, df)?

Created on 25 Aug 2015 · 26 Comments · Source: JuliaData/DataFrames.jl

I was confused about the column called "Row" that is printed in all DataFrames since it doesn't keep track of indexes after slicing. For example:

julia> df = DataFrame(x=1:100)

julia> df[20:50,:]
31x1 DataFrame
| Row | x  |
|-----|----|
| 1   | 20 |
| 2   | 21 |
| 3   | 22 |
| 4   | 23 |
| 5   | 24 |
| 6   | 25 |
| 7   | 26 |
| 8   | 27 |
⋮
| 23  | 42 |
| 24  | 43 |
| 25  | 44 |
| 26  | 45 |
| 27  | 46 |
| 28  | 47 |
| 29  | 48 |
| 30  | 49 |
| 31  | 50 |

Because this "Row" column is printed, the numbering starts again at 1 instead of 20. There was an issue (#187) a couple of years ago, but I think the idea was to not rely on indexes and use them only for speed. Since it has been a while, I'd like to know what the current consensus is regarding this issue.
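
The distinction at stake here, positional row numbers versus a stored identifier, can be sketched in plain Python without any data frame library. The names rows and id below are illustrative assumptions only and are not part of DataFrames.jl.

```python
# Hypothetical illustration: a table as a list of records with an explicit "id" column.
rows = [{"id": i, "x": i} for i in range(1, 101)]

# Take the equivalent of df[20:50, :] (1-based, inclusive on both ends).
subset = rows[19:50]

# Positional row numbers are recomputed from the subset, so they restart at 1.
row_numbers = [n for n, _ in enumerate(subset, start=1)]

# The stored "id" column travels with the data and still runs from 20 to 50.
stored_ids = [r["id"] for r in subset]

print(row_numbers[:3], stored_ids[:3])
```

The printed "Row" column corresponds to row_numbers; anything that should survive slicing has to be stored as data, like stored_ids.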

All 26 comments

I prefer printing the row numbers since you're allowed to index using them at the moment, but I could be convinced to change that if enough of the current committers agree that the row numbers are a problem.

If we ever remove the row numbers (which I'd like to do), this issue would just go away.

I've retitled this issue since the term "index" is misleading.

Well, the row numbers are useful for simple observations, but I think they become redundant when the user needs a real index to keep track of some parts of the data:

julia> DataFrame(index=1:100, y=2*(1:100))[50:70,:]
21x2 DataFrame
| Row | index | y   |
|-----|-------|-----|
| 1   | 50    | 100 |
| 2   | 51    | 102 |
| 3   | 52    | 104 |
| 4   | 53    | 106 |
| 5   | 54    | 108 |
| 6   | 55    | 110 |
| 7   | 56    | 112 |
⋮
| 14  | 63    | 126 |
| 15  | 64    | 128 |
| 16  | 65    | 130 |
| 17  | 66    | 132 |
| 18  | 67    | 134 |
| 19  | 68    | 136 |
| 20  | 69    | 138 |
| 21  | 70    | 140 |

For front-ends, the printed row numbers are very cumbersome. When the user defines a DataFrame with a single column, its HTML representation is actually a two-column table. Usually, this is not a big deal because most front-ends only display the data. However, I noticed the difference when I wanted to receive user-defined DataFrames using the datatables library with virtual scrolling and there was a mismatch in the number of columns.

Furthermore, I think people who use Python and R frequently want real row numbers. In any case, if we remove the row numbers, what would be offered as a replacement? A real column working as an index?

We'll deal with those kinds of user interface issues when the appropriate time comes. Right now, making progress on DataFrames is blocked on finalizing the NullableArrays package in time for Julia 0.4.

That's okay. I just wanted to know whether this decision had already been made or whether some of these issues were likely to get fixed. Thanks.

I think that, for printing, a lot could be gained by just not printing the vertical bars left of the row number, and the identifier 'row'. That would be like

julia> DataFrame(index=1:100, y=2*(1:100))[50:70,:]
21x2 DataFrame
       index | y   |
     |-------|-----|
 1   | 50    | 100 |
 2   | 51    | 102 |
 3   | 52    | 104 |
 4   | 53    | 106 |
 5   | 54    | 108 |
 6   | 55    | 110 |
⋮
 20  | 69    | 138 |
 21  | 70    | 140 |

Good idea, @mkborregaard.

+1 for @mkborregaard's suggestion.
In text mode, for wide data frames that do not fit on one screen, row numbers are essential for tracking the same record across multiple pages, whereas an index column would be printed only once.

+1 for @mkborregaard too

I also like @mkborregaard's idea but I'm not sure that addresses the use of a real row of numbers (instead of using a printed representation).

@alyst I'm not sure if this is what you mean, but pandas keeps the index column even if the number of columns is too large to fit on the screen:

In [6]: X = np.random.random((5, 10))

In [7]: pd.DataFrame(X)
Out[7]: 
          0         1         2         3         4         5         6  \
0  0.788095  0.200569  0.503817  0.951415  0.394964  0.574591  0.095610   
1  0.252333  0.233394  0.400834  0.763205  0.651176  0.308817  0.830079   
2  0.168796  0.637577  0.362691  0.751329  0.260100  0.336644  0.135710   
3  0.028374  0.417096  0.049947  0.969493  0.644621  0.992500  0.796625   
4  0.217272  0.996964  0.822133  0.961850  0.002511  0.327640  0.621592   

          7         8         9  
0  0.531817  0.250808  0.897373  
1  0.034938  0.312996  0.788211  
2  0.293733  0.383446  0.462809  
3  0.115683  0.577399  0.811903  
4  0.446433  0.519582  0.848727  

@rsmith31415 Yes, thank you, that's what I meant.
One potential solution could be to add a frozen_column= parameter to show(io, df). If specified, it would print the contents of this column (or columns?), e.g. a real record ID, on each page instead of the row numbers.
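
A minimal sketch of that frozen-column idea, written in plain Python rather than against DataFrames.jl. The function render_pages and its parameters frozen and cols_per_page are hypothetical names invented for illustration; no such parameter is claimed to exist in show(io, df).

```python
def render_pages(header, rows, frozen=0, cols_per_page=3):
    """Split a wide table into pages, repeating one 'frozen' column on each.

    header: list of column names; rows: list of lists of cell values.
    frozen: position of the column to repeat (hypothetical parameter).
    """
    others = [i for i in range(len(header)) if i != frozen]
    pages = []
    for start in range(0, len(others), cols_per_page):
        # Every page begins with the frozen column, followed by the next chunk.
        chunk = [frozen] + others[start:start + cols_per_page]
        lines = ["  ".join(str(header[i]) for i in chunk)]
        for row in rows:
            lines.append("  ".join(str(row[i]) for i in chunk))
        pages.append("\n".join(lines))
    return pages


header = ["id", "a", "b", "c", "d"]
rows = [["r1", 1, 2, 3, 4], ["r2", 5, 6, 7, 8]]
pages = render_pages(header, rows, frozen=0, cols_per_page=2)
print("\n\n".join(pages))
```

Each page starts with the id column, so a record can be followed from page to page without relying on positional row numbers.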

I think there are potentially two issues here: whether to print row numbers, and whether to automatically associate a row index with DataFrames that has special properties when printing. Note that R's behaviour here is not always intuitive. This output:

> iris[14:18, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
18          5.1         3.5          1.4         0.3  setosa

is intuitive, but the result is less intuitive when slicing a filtered data frame:

> new_data.frame <- iris[iris$Sepal.Width > 3.5, ]
> head(new_data.frame)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
11          5.4         3.7          1.5         0.2  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
> new_data.frame[14:18, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
45           5.1         3.8          1.9         0.4    setosa
47           5.1         3.8          1.6         0.2    setosa
49           5.3         3.7          1.5         0.2    setosa
110          7.2         3.6          6.1         2.5 virginica
118          7.7         3.8          6.7         2.2 virginica

In the last case, the row names only make sense if they actually mean something.

The way I read @alyst's comment, the suggestion is to allow DataFrames to have a custom key associated with them that has special properties when printing. I like that; to me it seems user-friendly and intuitive. Is some of that functionality already in the NamedArrays package?

@mkborregaard I meant only specifying the column(s) at print time. Column annotations within a data frame are a different story. It's quite some effort to implement keys, but even for simple annotations we would have to figure out how they should behave under data transformations (e.g. joining or grouping), which might not be so universally intuitive in the end.

OK, I get it now. I think that idea is nice!
Still, I actually also like the row numbers as they are. In R, I guess they are automatically generated row names and act as such, whereas in Julia they are indexes into the DataFrame being printed (even if it has been sliced before printing). OK, that was my 2 cents; I think I understand it a lot better now.

@mkborregaard's proposal looks a lot like what Hadley Wickham's tibble does: https://github.com/tidyverse/tibble/blob/master/README.md

@quinnj I think this is not a duplicate. The purpose was to propose a real index, not to hide the row numbers.

If this issue is about adding a concept similar to row names in R or Pandas, I think it can be closed as this probably won't happen. What we could envisage is marking a specific column as being an index like in SQL databases.

@nalimilan I think your suggestion is very reasonable, but let me point out that it is quite similar (or even equivalent) to using "row names". The main purpose is to have an index, so even if this index is not printed by default, it will still be useful.

It's quite similar, but the advantage is that it wouldn't force you to have a useless column of row names when you don't use it. The problem with row names is that they are often redundant with an ID column which already exists in the data, but since row names don't behave like a standard column they are annoying to work with.

Sure. I understand your point. It looks like this is a very subjective issue because I often work with datasets that don't have an ID column, so the additional column is very useful. In any case, I think we can agree that an optional index would be a nice feature.

For whatever it's worth, I actually disagree that an optional index would be beneficial. In my opinion, if an index or some other set of names is significant in your data, it should be stored as a column of the dataset.

I think that's because you don't use indexes. Regardless of their different behavior, indexes are also useful to increase speed.

DataFrames is now pretty agnostic to the columns under the hood, so it would be totally possible to create an IndexedColumn type that stored a btree index or whatever. It would take some work to make sure things like join or getindex took advantage, but totally doable and probably composes pretty well w/ the rest of the system now.
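
To make the idea of a column that carries its own lookup structure concrete, here is a rough Python sketch. The IndexedColumn class and its rows_equal method are hypothetical, and a sorted list searched with bisect stands in for the B-tree mentioned above; this is not how DataFrames.jl implements anything.

```python
import bisect


class IndexedColumn:
    """Hypothetical column wrapper that keeps a sorted index next to its values."""

    def __init__(self, values):
        self.values = list(values)
        # Sorted (value, row position) pairs stand in for a B-tree here.
        self._index = sorted((v, i) for i, v in enumerate(self.values))
        self._keys = [v for v, _ in self._index]

    def rows_equal(self, key):
        """Return the 0-based row positions whose value equals key, via binary search."""
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [i for _, i in self._index[lo:hi]]


col = IndexedColumn([30, 10, 20, 10])
print(col.rows_equal(10))
```

A lookup costs two binary searches instead of a full scan, which is the kind of speed-up a join or getindex could exploit if it knew the column was indexed.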

Could we have something like to_html() (in pandas)? I think show(io, "text/html", df) is pretty close at the moment, but it doesn't seem to dump everything into an HTML table per se.

What do you need that it doesn't do?

This is largely resolved with the PrettyTables.jl backend now.
