Incubator-superset: Add support for Hive/Impala

Created on 13 Apr 2016  路  14Comments  路  Source: apache/incubator-superset

It would be really nice for Caravel to have support for Hive/Impala. There is a nice package called Impyla that provides a SQLAlchemy interface to Hive and Impala.

I am testing it out with Caravel, seems to be working so far with Hive. I can connect to my Hive database, there is a problem where SQLAlchemy tries to run a select version() as the first query, as this is not a valid Hive function, it causes an immediate failure. Once I resolve this I will open a pull request.

request

Most helpful comment

I added an extra in the Superset "create database connection" for passing some connection arguments:

{
    "metadata_params": {},
    "engine_params": {
        "connect_args": {
        "auth_mechanism": "GSSAPI",
        "user" : "put_your_user_here",
        "password" : "None" }
}
}

So I introduced a connection URL as following one:

impala://impala_host:21050/impala_database

And it worked

Note: I had run a kinit before to connect.

All 14 comments

Very nice! Thanks!

Found the problem with Impyla, raised ticket https://github.com/cloudera/impyla/issues/190

+1

Raised a pull request to fix this issue with Impala here: https://github.com/cloudera/impyla/pull/191

I have tested the Impyla fix, and verified that I now have a working Hive/Impala connection in Caravel. Once the Impyla fix is in, I will make a pull request against Caravel to add it.

Pull request to fix Impyla SQLAlchemy was merged: https://github.com/cloudera/impyla/pull/191

I will make a patch for Caravel to use the library with the next few days and update here.

We don't bring external database dependencies into Caravel, it's for people to set those up in their environment since not everyone needs every db. People should just pip install impyla --upgrade and they are good to go.

@mistercrunch Sounds good! In this case, users just have to install impyla and they are good to go.

The version with the fix is impyla 0.13.7.

I try to setup connection from caravel to Spark Thrift Server
Connection String is presto://localhost:10001/hive/default
But was not successful.

2016-07-11 11:10:16,057:INFO:pyhive.presto:SHOW TABLES
2016-07-11 11:10:16,057:DEBUG:pyhive.presto:Headers: {u'X-Presto-Source': u'pyhive', u'X-Presto-User': 'giaosudau', u'X-Presto-Catalog': u'hive', u'X-Presto-Schema': u'default'}
2016-07-11 11:10:16,066:INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): localhost

Spark Thrift Server

16/07/11 11:10:16 ERROR TThreadPoolServer: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status 80
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status 80
    at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 4 more

impala can't resolve username
impala://xxx:[email protected]:21050/default
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 97, in connect
return dialect.connect(_cargs, *_cparams)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 385, in connect
return self.dbapi.connect(_cargs, *_cparams)
TypeError: connect() got an unexpected keyword argument 'username'

I added an extra in the Superset "create database connection" for passing some connection arguments:

{
    "metadata_params": {},
    "engine_params": {
        "connect_args": {
        "auth_mechanism": "GSSAPI",
        "user" : "put_your_user_here",
        "password" : "None" }
}
}

So I introduced a connection URL as following one:

impala://impala_host:21050/impala_database

And it worked

Note: I had run a kinit before to connect.

@joshwalters do you know how to config socks5 param with impyla

@evinhas I am getting the below error, when trying to use your way to connect. Any ideas on this? To me it looks like it trying to look for the kerberos cache for the user 'root'.

But I can successfully connect from the shell from the user '212562234'.

image

have you requested a kerberos' ticket? I had run a kinit before to connect, for getting the kerberos ticket. I can see in your error log the following message

No kerberos credentials are availble
So it sounds like your user doesn't have any kerberos ticket
Try to run a kinit in the computer running the command is below (adapted to your user/environment):

kinit [email protected]

Yes, I had the keytab file and did kinit as well. But the problem was I using the user 'root' to do the installation and start the server. Once I did with a user who was able to authenticate itself against Kerberos, I could connect to hive.

https://github.com/apache/incubator-superset/issues/4951

Was this page helpful?
0 / 5 - 0 ratings

Related issues

john-bodley picture john-bodley  路  3Comments

gbrian picture gbrian  路  3Comments

amien90 picture amien90  路  3Comments

XiaodiKong picture XiaodiKong  路  3Comments

thoralf-gutierrez picture thoralf-gutierrez  路  3Comments