Presto: Presto Schema not including all the columns of Mongo collection

Created on 26 Jun 2018  Â·  8Comments  Â·  Source: prestodb/presto

I'm using PRESTO 0.198 and it connects to underlying Mongo database.
I found that some attributes/columns of Mongo collection are not reflected in Presto

Main collection record

db.vessel_port_visits.findOne({fleetId : "59ca77facaee7c2ccbd281a5"})
{
    "_id" : ObjectId("5b16b5279e7964763270f818"),
    "number" : "9456135",
    "exitTimeMillis" : 1326276054000,
    "entryTimeMillis" : 1325980799000,
    "groupIds" : [
        "59ca77facaee7c2ccbd281a5"
    ]
}

The resultant presto schema is

>> desc mongodb.databaseName.collection

         Column          |   Type    | Extra | Comment 
--------------------------+-----------+-------+---------
 number                | varchar   |       |         
 exittimemillis           | double    |       |         
 entrytimemillis          | double    |       |         
(3 rows)

I cannot see the column 'groupIds' which is present in Mongo collection.
1 more thing to note is that groupIds is mostly empty array ie [] in most of the mongo records.

This is causing the queries which include 'groupIds' attribute to fail.
Is there a way to refresh the schema in PRESTO

Most helpful comment

I am seeing an similar issue. Presto isn't reading all of the fields in my collection. In my case, it's not reading an embedded object or a field that only has null's. Is there a way to include these fields so presto can query against them?

All 8 comments

I am seeing an similar issue. Presto isn't reading all of the fields in my collection. In my case, it's not reading an embedded object or a field that only has null's. Is there a way to include these fields so presto can query against them?

Hi, did you solve the problem ? I am having similar problems with reading entities from json data type.

Any help?

Hi, from our POV after some investigation:

  1. _id of mongo documents has to be ObjectId data type (if it is string presto will not be able to read the data )
  2. _schema of mongo DB is created with insertion of first document and _schema definition is based on its content. It means that, if after some time the document content Has changed, it is required to add those added / changed variables into _schema manually. _schema will not be updated automatically.
  3. We observed that sometimes _schema is not created automatically or sometimes new collections are not added automatically also. It means that changes should be added manually by editing of existing mongo _schema.

Hope it will be helpful.

Hi,After reading carefully,i have some questions as below:

  1. Is that mean i should edit "mongodb.schema-collection" that in "etc/catalog/mongodb.properties" ?
  2. If it is must to restart presto server?
  3. Is not support mongodb-connector for alter table?
  4. If there are any other way to dynamic,like conf.d plugin?
    Thanks!!!

At 2019-07-05 16:55:55, "pomykalamichal" notifications@github.com wrote:

Hi, from our POV after some investigation:

_id of mongo documents has to be ObjectId data type (if it is string presto will not be able to read the data )
_schema of mongo DB is created based on first document inserted into collection. It means that if after some time the document content will change it is required to add those variables into _schema manually. _schemat will not be updated automatically,
sometimes _schema is not created automatically or sometimes new collections are not added automatically also. It means that changes should be added manually by editing of existing mongo _schema.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

Trying to answer of your question:

  1. Is that mean i should edit "mongodb.schema-collection" that in "etc/catalog/mongodb.properties" ?

In etc/catalog/mongodb.properties you can provide information which collection contains db schema definition i.e. mongodb.schema-collection parameter. But it is not required. Required is that _schema collection should be first on the list of all collections, if it is not provided in properties.
https://prestosql.io/docs/current/connector/mongodb.html
image

  1. If it is must to restart presto server?

From our POV we have not need to restart presto server after mongodb _schema collection content changes.

  1. Is not support mongodb-connector for alter table?
    Here is the information what kind of data types are supported : https://prestodb.github.io/docs/current/connector/mongodb.html#table-definition-label

  2. If there are any other way to dynamic,like conf.d plugin?
    I am not able to answer this question.

I am seeing an similar issue. Presto isn't reading all of the fields in my collection. In my case, it's not reading an embedded object or a field that only has null's. Is there a way to include these fields so presto can query against them?

Are you able to resolve the issue for embedded object or a field that only has null's. ??

Hi, one additional hint. We define almost each field as a varchar. Then we start to use presto json parse function to read the data. https://prestodb.io/docs/current/functions/json.html
json_parse(string) → json
json_extract_scalar(json, json_path) → varchar
json_extract(json, json_path) → json

Exemplary query:
json_extract_scalar(json_parse(testdb.steps), '$["step-xyz"].details.stepId') = '1033'

Was this page helpful?
0 / 5 - 0 ratings