Hi, I'm testing the new Spark SQL driver in Metabase v0.29 and it's not working. I filled in the "new database" form with the required info and got the error "Couldn't connect to the database. Please check the connection details."
When I check the logs the following info is displayed:
```
05-02 22:18:30 DEBUG metabase.middleware :: GET /api/user/current 200 (7 ms) (1 DB calls). Jetty threads: 8/50 (4 busy, 6 idle, 0 queued)
05-02 22:18:31 DEBUG metabase.middleware :: GET /api/setting 200 (2 ms) (0 DB calls). Jetty threads: 8/50 (4 busy, 6 idle, 0 queued)
05-02 22:18:50 ERROR metabase.driver :: Failed to connect to database: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
```
It seems that we're missing some hadoop-common files in the build (reference: https://github.com/metabase/metabase/issues/2157#issuecomment-386065796). Not sure if it's important to highlight, but I'm trying to connect to a remote Spark instance (I can access it using other tools, but I'm not able to do this from Metabase).
Discourse discussion is here http://discourse.metabase.com/t/connecting-to-local-spark/3444
@wjoel have you ever seen the `java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration` error before when trying to use the SparkSQL driver?
Not sure what dependency we're missing. Apparently it doesn't happen in the build you did a while back.
It might be that Drill stuff you ended up taking out?
I'm also trying to debug this ATM with my limited context... I'm attaching the jars with `java -cp "${spark_dir}/jars/*" metabase.jar`, and my Hive metastore jar seems to contain the class it's saying is missing:

```
jar tf hive-metastore-1.2.1.spark2.jar | grep "org/apache/hadoop/hive/metastore/api/MetaException"
org/apache/hadoop/hive/metastore/api/MetaException$1.class
org/apache/hadoop/hive/metastore/api/MetaException$_Fields.class
org/apache/hadoop/hive/metastore/api/MetaException$MetaExceptionStandardScheme.class
org/apache/hadoop/hive/metastore/api/MetaException$MetaExceptionStandardSchemeFactory.class
org/apache/hadoop/hive/metastore/api/MetaException$MetaExceptionTupleScheme.class
org/apache/hadoop/hive/metastore/api/MetaException$MetaExceptionTupleSchemeFactory.class
org/apache/hadoop/hive/metastore/api/MetaException.class
```
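As a side note, the grep above checks for `MetaException`, while the stack trace complains about `org/apache/hadoop/conf/Configuration`. A quick way to find which JAR (if any) in a directory actually provides a given class is to scan each JAR's entry list. Here's a minimal sketch; the function name and example paths are illustrative, not from this thread:

```python
import pathlib
import zipfile

def jars_containing(class_entry, jar_dir):
    """Return the names of the JARs under jar_dir whose entry list
    contains the given class path (e.g. the one from the stack trace)."""
    hits = []
    for jar in sorted(pathlib.Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if class_entry in zf.namelist():
                hits.append(jar.name)
    return hits

# Hypothetical usage:
# jars_containing("org/apache/hadoop/conf/Configuration.class", "/opt/spark/jars")
```

If the list comes back empty, the class really isn't on the classpath you passed with `-cp`, no matter what the metastore jar contains.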
@camsaul looks like hadoop-common is missing from project.clj, can you try including it?
FWIW I checked out the project and didn't get this error. I only got it from the uberjar I downloaded from the website.
@munro were you seeing the same `java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration` error? And when you ran locally from master, did you make any changes to project.clj?
@lucasloami can you try running Metabase from master and let me know if you still see this issue?
Hi @camsaul, I built Metabase from master, tried to connect to my Spark SQL again, and got the same error:

```
05-04 14:21:38 DEBUG metabase.middleware :: POST /api/util/password_check 200 (2 ms) (0 DB calls). Jetty threads: 8/50 (4 busy, 6 idle, 0 queued)
05-04 14:22:02 ERROR metabase.driver :: Failed to connect to database: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
05-04 14:22:02 DEBUG metabase.middleware :: POST /api/setup/validate 400 (43 ms) (0 DB calls).
{:errors {:dbname "java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration"}}
```
@lucasloami @munro what versions of spark are you running and how are you running it?
@salsakran I have a Cloudera Hadoop cluster on a remote machine that comes with tools such as Hive, Spark, HBase, Hue, Pig, and so on. We configured Hive in the Spark conf, with YARN as the resource manager. I'm using Spark 1.6 there, but I can change to a 2.x version if required.
I also have Spark 2.3 installed on my local machine (I followed this tutorial to install it: http://www.admintome.com/blog/installing-spark-on-ubuntu-17-10/) and I tried to connect to Spark SQL using localhost and spark://localhost; these options didn't work and gave me the same error displayed above.
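When a connection fails from one tool but works from others, it can help to first confirm that the HiveServer2/Thrift endpoint (port 10000 by default) is reachable from the machine running Metabase, to rule out plain network problems before blaming the driver. A small sketch; the helper name, host, and port below are placeholders, not from this thread:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage:
# port_open("localhost", 10000)  # 10000 is the default HiveServer2 / Spark Thrift server port
```

This only verifies TCP reachability, not that the Thrift protocol versions are compatible, but it cleanly separates "can't reach the host" from "driver/classpath problem".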
It works for me after adding hadoop-common as a dependency in project.clj, so please try that and let us know if it helps.
@wjoel, @camsaul I added hadoop-common as a dependency in my project.clj (as shown below), rebuilt the project, and now I'm getting the following error:

```
{:errors {:dbname "java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf"}}
```

It seems that a HiveConf class is now missing. When I installed Spark 2.3.0 on my local machine, I configured hive-site.xml, core-site.xml, and hdfs-site.xml in order to connect my Spark to my remote Hadoop cluster. When I test PySpark and spark-shell they are able to correctly read info from Hive and HDFS, so I believe my Spark is configured correctly.
Am I missing some step here to make it work? Sorry if it seems to be a dumb question, but I'm not a Clojure programmer, so I may have missed something.

```clojure
;; my project.clj file
[...]
[org.spark-project.hive/hive-jdbc "1.2.1.spark2" ; JDBC Driver for Apache Spark
 :exclusions [org.apache.curator/curator-framework
              org.apache.curator/curator-recipes
              org.apache.thrift/libfb303
              org.apache.zookeeper/zookeeper
              org.eclipse.jetty.aggregate/jetty-all
              org.spark-project.hive/hive-common
              org.spark-project.hive/hive-metastore
              org.spark-project.hive/hive-serde
              org.spark-project.hive/hive-shims]]
[org.apache.hadoop/hadoop-common "3.1.0"]
[...]
```
Those exclusions look far too aggressive, and quite different from what I had in my branch: https://github.com/wjoel/metabase/blob/spark-sql/project.clj#L97
Please try with something like this:

```clojure
[org.apache.hadoop/hadoop-common "2.7.3"]
[org.spark-project.hive/hive-jdbc "1.2.1.spark2" ; JDBC Driver for Apache Spark
 :exclusions [org.apache.curator/curator-framework
              org.apache.curator/curator-recipes
              org.apache.thrift/libfb303
              org.apache.zookeeper/zookeeper
              org.eclipse.jetty.aggregate/jetty-all]]
```
@lucasloami please try https://wjoel.com/files/metabase-0.29-spark-sql-2018-05-05.jar which is 0.29 with the following changes:
```diff
diff --git a/project.clj b/project.clj
index c86d62e82..fc168aabb 100644
--- a/project.clj
+++ b/project.clj
@@ -93,16 +93,13 @@
     [org.liquibase/liquibase-core "3.5.3"] ; migration management (Java lib)
     [org.postgresql/postgresql "42.1.4.jre7"] ; Postgres driver
     [org.slf4j/slf4j-log4j12 "1.7.25"] ; abstraction for logging frameworks -- allows end user to plug in desired logging framework at deployment time
+    [org.apache.hadoop/hadoop-common "2.7.3"]
     [org.spark-project.hive/hive-jdbc "1.2.1.spark2" ; JDBC Driver for Apache Spark
      :exclusions [org.apache.curator/curator-framework
                   org.apache.curator/curator-recipes
                   org.apache.thrift/libfb303
                   org.apache.zookeeper/zookeeper
-                  org.eclipse.jetty.aggregate/jetty-all
-                  org.spark-project.hive/hive-common
-                  org.spark-project.hive/hive-metastore
-                  org.spark-project.hive/hive-serde
-                  org.spark-project.hive/hive-shims]]
+                  org.eclipse.jetty.aggregate/jetty-all]]
     [org.tcrawley/dynapath "0.2.5"] ; Dynamically add Jars (e.g. Oracle or Vertica) to classpath
     [org.xerial/sqlite-jdbc "3.21.0.1"] ; SQLite driver
     [org.yaml/snakeyaml "1.18"] ; YAML parser (required by liquibase)
```
Hi @wjoel, thanks for your reply. I rebuilt the project using your specification and it worked properly.
@camsaul there are some points to note:
1. I had several problems with the JDBC driver version: we are using a Cloudera Hadoop cluster here, which has outdated versions of Hive, Spark, YARN, etc., so hive-jdbc "1.2.1.spark2" didn't work for me and I had to use v0.13.x. With v1.2.1.spark2 I received the error `java.sql.SQLException: Could not establish connection to jdbc:hive2://[MY_HOST]:10000/default: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null)`, which is caused by a version mismatch between the JDBC driver and HiveServer2 (please check this link).
2. Joel is right about the aggressive exclusions in project.clj: I built the project keeping the org.spark-project.hive/ exclusions and it didn't work; I received the error `{:errors {:dbname "java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf"}}`.
3. It's possible to use a newer version of hadoop-common: v2.7.3 is not a hard requirement.
In summary, my suggestions to solve the problem are to add the hadoop-common dependency to project.clj and to remove the org.spark-project.hive exclusions from the project dependencies. People who use older versions of Hive (such as me) should recompile the project with the proper dependencies. @mazameli would it be a good idea to create a FAQ about this connector, to document the points we are discovering in this debugging? Even if it's not a Metabase problem, I think Metabase users will benefit from it.
I put the aggressive exclusions in because without them the hive-jdbc dependency was adding something like 20,000 files to metabase.jar, and IIRC almost 25 MB to the JAR size. Older versions of Java 7 (which we still support) have a 64k file limit for JARs, so without the exclusions we went over the limit and broke Java 7 compatibility.
I'll have to play around with these exclusions, or see if I can clear some headroom somewhere else; otherwise it's going to be challenging to ship these fixes without breaking Java 7 compatibility.
@salsakran @senior Good news and bad news. The good news is that it sounds like we can fix SparkSQL support by adding Hadoop as a dependency and removing the Hive exclusions I put in (thanks @lucasloami @wjoel). The bad news is that doing so adds a whopping 47 MB to the size of metabase.jar and almost 30,000 files, putting us well over the Java 7 64k file limit.
Here's the JAR with and without the extra deps for comparison:
| JAR | size | number of files |
|---|---|---|
| Metabase 0.29.0 | 100 MB | 62,841 |
| Metabase 0.29.0 with Hadoop + Hive | 147 MB | 91,450 |
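The "64k file limit" discussed above is the classic (non-ZIP64) ZIP format's cap of 65,535 entries per archive, which the affected Java 7 releases can't read past. Since a JAR is just a ZIP, a quick way to check whether an uberjar is over the cap is to count its entries; this is a sketch with a helper name of my own:

```python
import zipfile

PRE_ZIP64_ENTRY_LIMIT = 65535  # max entries in a classic (non-ZIP64) zip/jar

def jar_entry_count(jar_path):
    """Return (entry_count, over_limit) for a JAR, where over_limit means
    the archive exceeds the classic 65,535-entry cap."""
    with zipfile.ZipFile(jar_path) as zf:
        n = len(zf.infolist())
    return n, n > PRE_ZIP64_ENTRY_LIMIT
```

By this measure the 0.29.0 jar (62,841 files) sits just under the cap, which is why adding ~30,000 Hadoop and Hive files pushes it well over.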
@lucasloami does it show up if you do as mentioned in https://github.com/metabase/metabase/issues/7528#issuecomment-388231703 above (it's not clear to me whether you did this or not):
edit: Oh wait, I just tried a fresh 0.29.2 jar download and startup (on Win 10, Java 8, H2), and @salsakran I can repro what @lucasloami reported: