I'm trying to use `ft_index_to_string` to convert predictions back to the original strings, but I'm getting the error `java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute`.
I understand that Spark doesn't know how to convert a column of type double to string, but I'm not sure why exactly. The documentation is pretty clear about how this function is supposed to work: "Symmetrically to `ft_string_indexer`, `ft_index_to_string` maps a column of label indices back to a column containing the original labels as strings." But since there is no example, I'm not sure where I'm going wrong.
Reproducible example:
ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40),
                           feature1 = runif(40),
                           feature2 = runif(40))
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")
pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "prediction", output.col = "pred_label")
pred
Full error:
Error: java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:311)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sparklyr.Invoke$.invoke(invoke.scala:94)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
at sparklyr.StreamHandler$.read(stream.scala:55)
at sparklyr.BackendHandler.channelRead0(handler.scala:49)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
sessionInfo():
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 sparklyr_0.5.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 withr_1.0.2 digest_0.6.12 rprojroot_1.2 assertthat_0.1 rappdirs_0.3.1 R6_2.2.0
[8] jsonlite_1.3 DBI_0.6 backports_1.0.5 magrittr_1.5 httr_1.2.1 lazyeval_0.2.0 config_0.2
[15] tools_3.3.2 parallel_3.3.2 yaml_2.1.14 base64enc_0.1-3 tibble_1.2
At the same time, if I pass the `label_ind` column into `ft_index_to_string`, then it works, so that column does have metadata:
ex_df <- dplyr::data_frame(label = sample(LETTERS, replace = TRUE, size = 40),
                           feature1 = runif(40),
                           feature2 = runif(40))
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")
pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label")
pred
Output:
> pred
Source: query [26 x 8]
Database: spark connection master=local[8] app=sparklyr local=TRUE
label feature1 feature2 label_ind rawPrediction probability prediction pred_label
<chr> <dbl> <dbl> <dbl> <list> <list> <dbl> <chr>
1 A 0.83615730 0.14315830 14 <dbl [11]> <dbl [11]> 2 A
2 D 0.31364793 0.46122049 19 <dbl [11]> <dbl [11]> 0 D
3 E 0.91040128 0.12862943 12 <dbl [11]> <dbl [11]> 2 E
4 G 0.77647296 0.05422326 15 <dbl [11]> <dbl [11]> 2 G
5 I 0.45718419 0.01273057 3 <dbl [11]> <dbl [11]> 0 I
6 I 0.71337845 0.90460885 3 <dbl [11]> <dbl [11]> 0 I
7 J 0.02232347 0.99550593 8 <dbl [11]> <dbl [11]> 0 J
8 J 0.22278048 0.38470348 8 <dbl [11]> <dbl [11]> 0 J
9 K 0.41756519 0.61650525 18 <dbl [11]> <dbl [11]> 0 K
10 L 0.09293533 0.49025301 5 <dbl [11]> <dbl [11]> 0 L
# ... with 16 more rows
So the question is: how do I make sure that the `prediction` column has metadata about the original `label`s?
Good question! I'm not yet that familiar with how Spark stores attributes and metadata for these transformers, but this appears to be the relevant code:
Diving into the `StructField` metadata, I see something like this:
[[4]]$metadata
<jobj[1358]>
class org.apache.spark.sql.types.Metadata
{"ml_attr":{"vals":["W","A","J","V","F","M","I","Q","P","C","K","R","O","S","X","T","U","H","D"],"type":"nominal","name":"label_ind"}}
We should definitely look at providing routines in sparklyr for extracting metadata fields from Spark data columns, and should likely also extract those in e.g. `collect()` as well.
For reference, here is the code I used to poke at this (copied from your example):
ex_df <- dplyr::data_frame(
  label = sample(LETTERS, replace = TRUE, size = 40),
  feature1 = runif(40),
  feature2 = runif(40)
)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.0.0")
ex_tbl <- copy_to(sc, ex_df)
partition <- ex_tbl %>%
  ft_string_indexer(input.col = "label", output.col = "label_ind") %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_formula <- formula(label_ind ~ . - label)
ml_nb <- ml_naive_bayes(train_tbl, ml_formula, type = "classification")
pred <- sdf_predict(ml_nb, test_tbl) %>%
  ft_index_to_string(input.col = "label_ind", output.col = "pred_label")
pred
schema <- invoke(spark_dataframe(pred), "schema")
fields <- lapply(0:6, function(i) {
  invoke(schema, "apply", as.integer(i))
})
lapply(fields, function(field) {
  list(
    comment = invoke(field, "getComment"),
    metadata = invoke(field, "metadata"),
    name = invoke(field, "name")
  )
})
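As a sketch of what such a metadata helper might look like, the snippet below pulls the original string labels out of a column's `ml_attr` metadata (the JSON structure shown above). This is a hypothetical helper, not part of sparklyr; it assumes an active connection, the `pred` table from the example, and uses jsonlite to parse the metadata:

```r
library(jsonlite)

# Hypothetical helper: extract the string labels recorded in a column's
# "ml_attr" metadata (as written by ft_string_indexer).
sdf_column_labels <- function(tbl, column) {
  schema <- invoke(spark_dataframe(tbl), "schema")
  field  <- invoke(schema, "apply", invoke(schema, "fieldIndex", column))
  json   <- invoke(invoke(field, "metadata"), "json")
  fromJSON(json)$ml_attr$vals
}

labels <- sdf_column_labels(pred, "label_ind")
# labels should be a character vector like c("W", "A", "J", ...)
```

With the labels in hand, they could in principle be passed to `ft_index_to_string()` explicitly rather than relying on the `prediction` column carrying metadata.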
Working on https://github.com/rstudio/sparklyr/issues/937 which depends on this, will close and track there.
@mikhailBalyasin Here's the behavior in the new API
partition <- ex_tbl %>%
  sdf_partition(train = 0.5, test = 0.5, seed = 8585)
train_tbl <- partition$train
test_tbl <- partition$test
ml_nb <- ml_naive_bayes(
  train_tbl, label ~ ., label_col = "label_col")
pred <- ml_predict(ml_nb, test_tbl)
> pred %>%
+ glimpse()
Observations: 25
Variables: 17
$ label <chr> "A", "D", "D", "D", "E", "E", "F", ...
$ feature1 <dbl> 0.11230824, 0.04834678, 0.25245844,...
$ feature2 <dbl> 0.86993327, 0.92504589, 0.49927288,...
$ prediction <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ probability <list> [<0.17589514, 0.14304304, 0.099137...
$ rawPrediction <list> [<-2.432049, -2.638792, -3.005435,...
$ predicted_label <chr> "U", "U", "U", "U", "U", "U", "U", ...
$ probability_U <dbl> 0.1758951, 0.1766892, 0.1710263, 0....
$ probability_X <dbl> 0.1430430, 0.1458818, 0.1308545, 0....
$ probability_I <dbl> 0.09913703, 0.09577040, 0.11507541,...
$ probability_E <dbl> 0.09723525, 0.09957837, 0.08753517,...
$ probability_N <dbl> 0.09333505, 0.09477631, 0.08684602,...
$ probability_T <dbl> 0.06872758, 0.06669677, 0.07802856,...
$ probability_M <dbl> 0.09545482, 0.09736886, 0.08725261,...
$ probability_L <dbl> 0.07282556, 0.07117707, 0.07993756,...
$ probability_K <dbl> 0.07360868, 0.07204058, 0.08028413,...
$ probability_D <dbl> 0.08073785, 0.08002064, 0.08315973,...
There's also a `labels` argument in `ft_index_to_string()`, so one can do this more "manually" while building pipelines from scratch. Getting the metadata might be slightly tricky; perhaps we'll implement functions to make that easier later.
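As a sketch of that "manual" route (assuming the connection `sc` and tables from the earlier examples; `ml_fit()` and `ml_labels()` are used here to keep the fitted StringIndexerModel around so its labels can be handed to `ft_index_to_string()`):

```r
# Fit the string indexer explicitly so its labels can be recovered.
indexer <- ft_string_indexer(
  sc, input_col = "label", output_col = "label_ind") %>%
  ml_fit(ex_tbl)

labels <- ml_labels(indexer)  # character vector of the original labels

# Map the numeric predictions back to strings using those labels.
pred %>%
  ft_index_to_string(
    input_col = "prediction",
    output_col = "pred_label",
    labels = labels
  )
```

Note the index order here matches what the fitted StringIndexerModel assigned, which is why the labels must come from the same indexer used during training.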