Sparklyr: `ft_index_to_string` fails to convert indices back to strings

Created on 6 Apr 2017  ·  4 Comments  ·  Source: sparklyr/sparklyr

I'm trying to use ft_index_to_string to convert the prediction column back to the original string labels, but I'm getting the error java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute.

I understand that Spark doesn't know how to convert a column of type double to string, but I'm not sure why exactly. The documentation is pretty clear about how this function is supposed to work: "Symmetrically to ft_string_indexer, ft_index_to_string maps a column of label indices back to a column containing the original labels as strings." But since there is no example, I'm not sure where I'm going wrong.

Reproducible example:

ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40), 
                           feature1 = runif(40),
                           feature2 = runif(40))

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "prediction", output.col = "pred_label") 
pred

Full error:

Error: java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
    at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:311)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sparklyr.Invoke$.invoke(invoke.scala:94)
    at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
    at sparklyr.StreamHandler$.read(stream.scala:55)
    at sparklyr.BackendHandler.channelRead0(handler.scala:49)
    at sparklyr.BackendHandler.channelRead0(handler.scala:14)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)

sessionInfo():

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0    sparklyr_0.5.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9     withr_1.0.2     digest_0.6.12   rprojroot_1.2   assertthat_0.1  rappdirs_0.3.1  R6_2.2.0       
 [8] jsonlite_1.3    DBI_0.6         backports_1.0.5 magrittr_1.5    httr_1.2.1      lazyeval_0.2.0  config_0.2     
[15] tools_3.3.2     parallel_3.3.2  yaml_2.1.14     base64enc_0.1-3 tibble_1.2     


All 4 comments

At the same time, if I pass the label_ind column into ft_index_to_string, it works, so that column does have the metadata:

ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40), 
                           feature1 = runif(40),
                           feature2 = runif(40))

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label") 
pred

Output:

> pred
Source:   query [26 x 8]
Database: spark connection master=local[8] app=sparklyr local=TRUE

   label   feature1   feature2 label_ind rawPrediction probability prediction pred_label
   <chr>      <dbl>      <dbl>     <dbl>        <list>      <list>      <dbl>      <chr>
1      A 0.83615730 0.14315830        14    <dbl [11]>  <dbl [11]>          2          A
2      D 0.31364793 0.46122049        19    <dbl [11]>  <dbl [11]>          0          D
3      E 0.91040128 0.12862943        12    <dbl [11]>  <dbl [11]>          2          E
4      G 0.77647296 0.05422326        15    <dbl [11]>  <dbl [11]>          2          G
5      I 0.45718419 0.01273057         3    <dbl [11]>  <dbl [11]>          0          I
6      I 0.71337845 0.90460885         3    <dbl [11]>  <dbl [11]>          0          I
7      J 0.02232347 0.99550593         8    <dbl [11]>  <dbl [11]>          0          J
8      J 0.22278048 0.38470348         8    <dbl [11]>  <dbl [11]>          0          J
9      K 0.41756519 0.61650525        18    <dbl [11]>  <dbl [11]>          0          K
10     L 0.09293533 0.49025301         5    <dbl [11]>  <dbl [11]>          0          L
# ... with 16 more rows

So the question is how do I make sure that prediction column has metadata about original labels?

Good question! I'm not yet that familiar with how Spark stores attributes + metadata for these transformers, but this appears to be the relevant code:

https://github.com/apache/spark/blob/74aa0df8f7f132b62754e5159262e4a5b9b641ab/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L350-L372

Diving into the 'StructField' metadata, I see something like this:

[[4]]$metadata
<jobj[1358]>
  class org.apache.spark.sql.types.Metadata
  {"ml_attr":{"vals":["W","A","J","V","F","M","I","Q","P","C","K","R","O","S","X","T","U","H","D"],"type":"nominal","name":"label_ind"}}

We should definitely look at providing routines in sparklyr for extracting metadata fields from Spark data columns, and should likely also extract those in e.g. collect().

For reference, here is the code I used to poke at this (copied from your example):

ex_df <-
  dplyr::data_frame(
    label = sample(LETTERS, replace = TRUE, size = 40),
    feature1 = runif(40),
    feature2 = runif(40)
  )

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <-
  ml_naive_bayes(train_tbl, ml_formula, type = "classification")

pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label")
pred

schema <- invoke(spark_dataframe(pred), "schema")
fields <- lapply(0:6, function(i) {
  invoke(schema, "apply", as.integer(i))
})

lapply(fields, function(field) {
  list(
    comment = invoke(field, "getComment"),
    metadata = invoke(field, "metadata"),
    name = invoke(field, "name")
  )
})
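The nominal metadata shown above can, in principle, be pulled back into R. Here is a minimal sketch of that idea (an illustration only, not verified against a live cluster) which assumes the `pred` table from the example and a working Spark connection; it reads the `label_ind` field's metadata as JSON and parses out the recorded label values:

```r
library(jsonlite)

# Grab the schema of the Spark DataFrame behind `pred`
schema <- invoke(spark_dataframe(pred), "schema")

# label_ind is the fourth column (0-based index 3)
field <- invoke(schema, "apply", 3L)

# Spark's Metadata objects expose their contents as a JSON string
meta_json <- invoke(invoke(field, "metadata"), "json")

# Extract the original label values recorded by ft_string_indexer
labels <- fromJSON(meta_json)$ml_attr$vals
labels
```

With the labels in hand, index i in the prediction column corresponds to labels[i + 1] in R's 1-based indexing.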

Working on https://github.com/rstudio/sparklyr/issues/937 which depends on this, will close and track there.

@mikhailBalyasin Here's the behavior in the new API

partition <- ex_tbl %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_nb <- ml_naive_bayes(
  train_tbl, label ~ ., label_col = "label_col")

pred <- ml_predict(ml_nb, test_tbl)
pred %>%
  glimpse()
Observations: 25
Variables: 17
$ label           <chr> "A", "D", "D", "D", "E", "E", "F", ...
$ feature1        <dbl> 0.11230824, 0.04834678, 0.25245844,...
$ feature2        <dbl> 0.86993327, 0.92504589, 0.49927288,...
$ prediction      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ probability     <list> [<0.17589514, 0.14304304, 0.099137...
$ rawPrediction   <list> [<-2.432049, -2.638792, -3.005435,...
$ predicted_label <chr> "U", "U", "U", "U", "U", "U", "U", ...
$ probability_U   <dbl> 0.1758951, 0.1766892, 0.1710263, 0....
$ probability_X   <dbl> 0.1430430, 0.1458818, 0.1308545, 0....
$ probability_I   <dbl> 0.09913703, 0.09577040, 0.11507541,...
$ probability_E   <dbl> 0.09723525, 0.09957837, 0.08753517,...
$ probability_N   <dbl> 0.09333505, 0.09477631, 0.08684602,...
$ probability_T   <dbl> 0.06872758, 0.06669677, 0.07802856,...
$ probability_M   <dbl> 0.09545482, 0.09736886, 0.08725261,...
$ probability_L   <dbl> 0.07282556, 0.07117707, 0.07993756,...
$ probability_K   <dbl> 0.07360868, 0.07204058, 0.08028413,...
$ probability_D   <dbl> 0.08073785, 0.08002064, 0.08315973,...

There's also a labels argument in ft_index_to_string(), so one can do this more "manually" while building pipelines from scratch, but getting the metadata might be slightly tricky; perhaps we'll implement functions to make that easier later.
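For illustration, a hedged sketch of that manual route, assuming `my_labels` is a character vector you have obtained elsewhere, holding the labels in the same order that ft_string_indexer assigned the indices (argument names follow the newer snake_case API):

```r
# Map numeric predictions back to strings by supplying the labels
# explicitly, rather than relying on column metadata
pred_labeled <- pred %>%
  ft_index_to_string(
    input_col  = "prediction",
    output_col = "pred_label",
    labels     = my_labels  # character vector in indexer order
  )
```

Because the labels are passed in directly, this works even on a column that lacks NominalAttribute metadata, which is exactly the case that triggers the ClassCastException above.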
