Hey. I haven't reported bugs before, so I hope I'm doing things correctly here.
When creating a Glue table using aws_cdk.aws_glue.Table with data_format = _glue.DataFormat.JSON, the classification is set to Unknown, and querying the table fails.
glue_table = _glue.Table(
    self, 'GlueTable',
    database=_glue.Database.from_database_arn(
        self, 'GlueDatabase',
        'arn:aws:glue:region:{}:database/abc'.format(account_id),
    ),
    table_name='def_ghi',
    data_format=_glue.DataFormat.JSON,
    bucket=s3_bucket,
    s3_prefix='prefix/',
)
If I manually add "classification" with value "json" in the Table properties, after deploying with CDK, the query works fine.
Amazon Invalid operation: Invalid DataCatalog response for external table "abc"."def_ghi": Cannot deserialize table. Missing mandatory field: Parameters in response from external catalog. ;
This is :bug: Bug Report
After some more fiddling around, I discovered that it probably doesn't have to do with the classification=json parameter. I managed to make it work just by editing and pressing apply. I then looked at the difference and the only thing I could find was this:
SerdeInfo before:
'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}
SerdeInfo after:
'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe', 'Parameters': {}}
After some further thought, I see that this also correlates with the error message above.
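For anyone who wants to check whether their own table is affected, here is a minimal sketch that looks for the missing key in the dict returned under 'Table' by get_table. The helper name serde_parameters_missing is my own (hypothetical), and the two sample dicts only mirror the SerdeInfo before/after shown above.

```python
def serde_parameters_missing(table: dict) -> bool:
    """Return True if the table's SerdeInfo has no 'Parameters' key at all."""
    serde_info = table.get("StorageDescriptor", {}).get("SerdeInfo", {})
    return "Parameters" not in serde_info


# Sample data mirroring the SerdeInfo before and after the manual edit.
before = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}}}
after = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
    "Parameters": {}}}}

print(serde_parameters_missing(before))  # True
print(serde_parameters_missing(after))   # False
```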
To get around this I have added a post-deploy code snippet using boto3 to update the table, like this:
import boto3

glue_client = boto3.client('glue')

# Fetch the table definition that CDK deployed.
response = glue_client.get_table(
    DatabaseName=database_name,
    Name=table_name,
)
table = response['Table']

# The actual fix: SerdeInfo must carry a Parameters map, even an empty one.
table['StorageDescriptor']['SerdeInfo']['Parameters'] = {}
# Not necessary, but removes the "classification: Unknown".
table['Parameters']['classification'] = 'json'

# Write the table back, passing only fields that TableInput accepts.
glue_client.update_table(
    DatabaseName=table['DatabaseName'],
    TableInput={
        'Name': table['Name'],
        'Description': table['Description'],
        'Retention': table['Retention'],
        'StorageDescriptor': table['StorageDescriptor'],
        'TableType': table['TableType'],
        'Parameters': table['Parameters'],
    },
)
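A slightly more defensive variant of the same workaround, as a sketch only: it copies just the keys that are present, so a table without a Description does not raise a KeyError, and it also carries PartitionKeys over. The helper name build_table_input and the TABLE_INPUT_KEYS tuple are my own (hypothetical); the tuple is a subset of what TableInput accepts, not the full list.

```python
import copy

# Subset of keys that TableInput accepts. get_table also returns read-only
# fields (for example CreateTime, UpdateTime, DatabaseName) that are dropped here.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
)


def build_table_input(table: dict, classification: str = "json") -> dict:
    """Build a TableInput dict from a get_table 'Table' dict without mutating it."""
    table_input = {
        key: copy.deepcopy(table[key]) for key in TABLE_INPUT_KEYS if key in table
    }
    # Ensure SerdeInfo has a Parameters map, even if it is empty.
    storage = table_input.setdefault("StorageDescriptor", {})
    storage.setdefault("SerdeInfo", {}).setdefault("Parameters", {})
    # Optional: set the classification unless one is already present.
    table_input.setdefault("Parameters", {}).setdefault(
        "classification", classification
    )
    return table_input
```

With the table dict from get_table in hand, the call would then be glue_client.update_table(DatabaseName=table['DatabaseName'], TableInput=build_table_input(table)).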
Hi @jorgenfroland - Thanks for reporting this.
I believe this is rooted in either the Glue API or how CloudFormation invokes it. In any case, passing an empty map should be the same as not passing it at all, and CDK can probably mitigate this quirk.
Filing 👍
Thanks @jorgenfroland :)
Your comment helped me solve the same problem.
I can confirm this is happening in the TypeScript construct as well. Kudos to @jorgenfroland. Currently, the inability to add parameters such as classification and the S3 exclude path with the L2 construct is a real problem when using CDK to create Glue resources. I hope it becomes stable soon.