Sparklyr: `ft_index_to_string` fails to convert indices back to strings

Created on 6 Apr 2017  ·  4 Comments  ·  Source: sparklyr/sparklyr

I'm trying to use ft_index_to_string to convert the prediction column back to the original string labels, but I'm getting the error java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute.

I understand that Spark doesn't know how to convert a column of type double to string, but I'm not sure why exactly. The documentation is pretty clear about how this function is supposed to work: "Symmetrically to ft_string_indexer, ft_index_to_string maps a column of label indices back to a column containing the original labels as strings." But since there is no example, I'm not sure where I'm going wrong.

Reproducible example:

ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40), 
                           feature1 = runif(40),
                           feature2 = runif(40))

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "prediction", output.col = "pred_label") 
pred

Full error:

Error: java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
    at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:311)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sparklyr.Invoke$.invoke(invoke.scala:94)
    at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
    at sparklyr.StreamHandler$.read(stream.scala:55)
    at sparklyr.BackendHandler.channelRead0(handler.scala:49)
    at sparklyr.BackendHandler.channelRead0(handler.scala:14)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0    sparklyr_0.5.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9     withr_1.0.2     digest_0.6.12   rprojroot_1.2   assertthat_0.1  rappdirs_0.3.1  R6_2.2.0       
 [8] jsonlite_1.3    DBI_0.6         backports_1.0.5 magrittr_1.5    httr_1.2.1      lazyeval_0.2.0  config_0.2     
[15] tools_3.3.2     parallel_3.3.2  yaml_2.1.14     base64enc_0.1-3 tibble_1.2     


All 4 comments

At the same time, if I pass the label_ind column into ft_index_to_string, it works, so that column does have the metadata:

ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40), 
                           feature1 = runif(40),
                           feature2 = runif(40))

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label") 
pred

Output:

> pred
Source:   query [26 x 8]
Database: spark connection master=local[8] app=sparklyr local=TRUE

   label   feature1   feature2 label_ind rawPrediction probability prediction pred_label
   <chr>      <dbl>      <dbl>     <dbl>        <list>      <list>      <dbl>      <chr>
1      A 0.83615730 0.14315830        14    <dbl [11]>  <dbl [11]>          2          A
2      D 0.31364793 0.46122049        19    <dbl [11]>  <dbl [11]>          0          D
3      E 0.91040128 0.12862943        12    <dbl [11]>  <dbl [11]>          2          E
4      G 0.77647296 0.05422326        15    <dbl [11]>  <dbl [11]>          2          G
5      I 0.45718419 0.01273057         3    <dbl [11]>  <dbl [11]>          0          I
6      I 0.71337845 0.90460885         3    <dbl [11]>  <dbl [11]>          0          I
7      J 0.02232347 0.99550593         8    <dbl [11]>  <dbl [11]>          0          J
8      J 0.22278048 0.38470348         8    <dbl [11]>  <dbl [11]>          0          J
9      K 0.41756519 0.61650525        18    <dbl [11]>  <dbl [11]>          0          K
10     L 0.09293533 0.49025301         5    <dbl [11]>  <dbl [11]>          0          L
# ... with 16 more rows

So the question is how do I make sure that prediction column has metadata about original labels?

Good question! I'm not yet that familiar with how Spark stores attributes + metadata for these transformers, but this appears to be the relevant code:

https://github.com/apache/spark/blob/74aa0df8f7f132b62754e5159262e4a5b9b641ab/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L350-L372

Diving into the 'StructField' metadata, I see something like this:

[[4]]$metadata
<jobj[1358]>
  class org.apache.spark.sql.types.Metadata
  {"ml_attr":{"vals":["W","A","J","V","F","M","I","Q","P","C","K","R","O","S","X","T","U","H","D"],"type":"nominal","name":"label_ind"}}

We should definitely look at providing routines in sparklyr for extracting metadata fields from Spark data columns, and should likely also extract those in e.g. collect().

For reference, here is the code I used to poke at this (copied from your example):

ex_df <-
  dplyr::data_frame(
    label = sample(LETTERS, replace = TRUE, size = 40),
    feature1 = runif(40),
    feature2 = runif(40)
  )

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <-
  ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label")
pred

schema <- invoke(spark_dataframe(pred), "schema")
fields <- lapply(0:6, function(i) {
  invoke(schema, "apply", as.integer(i))
})

lapply(fields, function(field) {
  list(
    comment = invoke(field, "getComment"),
    metadata = invoke(field, "metadata"),
    name = invoke(field, "name")
  )
})
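The nominal metadata shown above can, in principle, be pulled back into R. Here is a minimal sketch of that idea (an illustration only, not verified against a live cluster) which assumes the `pred` table from the example and a working Spark connection; it reads the `label_ind` field's metadata as JSON and parses out the recorded label values:

```r
library(jsonlite)

# Grab the schema of the Spark DataFrame behind `pred`
schema <- invoke(spark_dataframe(pred), "schema")

# label_ind is the fourth column (0-based index 3)
field <- invoke(schema, "apply", 3L)

# Spark's Metadata objects expose their contents as a JSON string
meta_json <- invoke(invoke(field, "metadata"), "json")

# Extract the original label values recorded by ft_string_indexer
labels <- fromJSON(meta_json)$ml_attr$vals
labels
```

With the labels in hand, index i in the prediction column corresponds to labels[i + 1] in R's 1-based indexing.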

Working on https://github.com/rstudio/sparklyr/issues/937 which depends on this, will close and track there.

@mikhailBalyasin Here's the behavior in the new API

partition <- ex_tbl %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_nb <- ml_naive_bayes(
  train_tbl, label ~ ., label_col = "label_col")

pred <- ml_predict(ml_nb, test_tbl)
pred %>%
  glimpse()
Observations: 25
Variables: 17
$ label           <chr> "A", "D", "D", "D", "E", "E", "F", ...
$ feature1        <dbl> 0.11230824, 0.04834678, 0.25245844,...
$ feature2        <dbl> 0.86993327, 0.92504589, 0.49927288,...
$ prediction      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ probability     <list> [<0.17589514, 0.14304304, 0.099137...
$ rawPrediction   <list> [<-2.432049, -2.638792, -3.005435,...
$ predicted_label <chr> "U", "U", "U", "U", "U", "U", "U", ...
$ probability_U   <dbl> 0.1758951, 0.1766892, 0.1710263, 0....
$ probability_X   <dbl> 0.1430430, 0.1458818, 0.1308545, 0....
$ probability_I   <dbl> 0.09913703, 0.09577040, 0.11507541,...
$ probability_E   <dbl> 0.09723525, 0.09957837, 0.08753517,...
$ probability_N   <dbl> 0.09333505, 0.09477631, 0.08684602,...
$ probability_T   <dbl> 0.06872758, 0.06669677, 0.07802856,...
$ probability_M   <dbl> 0.09545482, 0.09736886, 0.08725261,...
$ probability_L   <dbl> 0.07282556, 0.07117707, 0.07993756,...
$ probability_K   <dbl> 0.07360868, 0.07204058, 0.08028413,...
$ probability_D   <dbl> 0.08073785, 0.08002064, 0.08315973,...

There's also a labels argument in ft_index_to_string(), so one can do this more "manually" while building pipelines from scratch, but getting the metadata might be slightly tricky; perhaps we'll implement functions to make that easier later.
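For illustration, a hedged sketch of that manual route, assuming `my_labels` is a character vector you have obtained elsewhere, holding the labels in the same order that ft_string_indexer assigned the indices (argument names follow the newer snake_case API):

```r
# Map numeric predictions back to strings by supplying the labels
# explicitly, rather than relying on column metadata
pred_labeled <- pred %>%
  ft_index_to_string(
    input_col  = "prediction",
    output_col = "pred_label",
    labels     = my_labels  # character vector in indexer order
  )
```

Because the labels are passed in directly, this works even on a column that lacks NominalAttribute metadata, which is exactly the case that triggers the ClassCastException above.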
