Openrefine: Include summary statistics for numeric columns in Facets

Created on 29 Mar 2019  路  7Comments  路  Source: OpenRefine/OpenRefine

Is your feature request related to a problem or area of OpenRefine? Please describe.
For numeric cells in a column I would like to be easily find basic statistical information such as:

  • max
  • min
  • mean
  • median
  • sum

Describe the solution you'd like
Make it possible to see this information in a facet

Additional context
See discussion in #1340

enhancement facets

Most helpful comment

I would suggest a separate "Statistical Facet" that allows the user to explore a column's quartiles, outliers, and sigmas in a more precise fashion than the current histogram. I would do a Gaussian (or Normal) mode and a quartile mode.

It would have the side effect of providing all the statistical measures as mentioned above.

Quartile mode would use the rules of the ggplot box plot to determine outliers (1.5*IQR).

Gaussian mode would use sigmas:
image

Give a "n choices"-style link that allows pops up a text box with all the statistics for easy copying. The blue "Stat:" link in the above mock-up.

Make it invertible as well.

To cleanse:

  1. Move sliders to the +/- 3 sigma marks
  2. Invert the selection
  3. Review the validity of the data
  4. Mass delete

Like a text facet, this facet would be responsive to the current facets and filters. The numeric facet, for example, only indicates the filtering through dimming of the bars. This facet would "zoom in" on the filtered data and the statistics would update.

All 7 comments

For the record, the statistics provided by the currently deprecated refine-stats extension are:

Count
Sum
Min
Max
Mean
Median
Mode (not very useful IMHO)
Standard deviation
Variance

The number of blank or null cells (NA) would also be very useful.

Is "Count" the number of cells with a numeric value?

Yep, just the number of cells in the (filtered) column.

OK - so I think we already get:

  • Count of Numeric
  • Count of blank (null or empty string)
  • Count of errors
  • Count of non-numeric (not including blanks or errors)

A range is shown, but this is usually slightly outside the Max/Min (i.e. the max shown can be slightly higher than the max value in the column and the min shown can be slightly lower than the min value in the column)

I see that a tool like Talend Data Preparation displays this information as a Boxplot.

screenshot-127 0 0 1-9090-2019 03 29-13-23-18

In the logic of OpenRefine, the equivalent could be an enriched histogram (maybe with the median and the mean as vertical lines)

screenshot-localhost-3333-2019 03 29-13-24-33

I would suggest a separate "Statistical Facet" that allows the user to explore a column's quartiles, outliers, and sigmas in a more precise fashion than the current histogram. I would do a Gaussian (or Normal) mode and a quartile mode.

It would have the side effect of providing all the statistical measures as mentioned above.

Quartile mode would use the rules of the ggplot box plot to determine outliers (1.5*IQR).

Gaussian mode would use sigmas:
image

Give a "n choices"-style link that allows pops up a text box with all the statistics for easy copying. The blue "Stat:" link in the above mock-up.

Make it invertible as well.

To cleanse:

  1. Move sliders to the +/- 3 sigma marks
  2. Invert the selection
  3. Review the validity of the data
  4. Mass delete

Like a text facet, this facet would be responsive to the current facets and filters. The numeric facet, for example, only indicates the filtering through dimming of the bars. This facet would "zoom in" on the filtered data and the statistics would update.

@nanobrad Bradley THE ARTIST ! Nicely done mockup ! Thanks for this !

Was this page helpful?
0 / 5 - 0 ratings