AWS Glue now supports running ETL jobs on Apache Spark 2.4.3 (with Python 3). Terraform support for this is needed.
aws_glue_job
resource "aws_glue_job" "aws_glue_job_foo" {
glue_version = "1"
name = "job-name"
description = "job-desc"
role_arn = data.aws_iam_role.aws_glue_iam_role.arn
max_capacity = 1
max_retries = 1
connections = [aws_glue_connection.connection.name]
timeout = 5
command {
name = "pythonshell"
script_location = "s3://bucket/script.py"
python_version = "3"
}
default_arguments = {
"--job-language" = "python"
"--ENV" = "env"
"--ROLE_ARN" = data.aws_iam_role.aws_glue_iam_role.arn
}
execution_property {
max_concurrent_runs = 1
}
}
It looks like this is enabled via the Glue version for a job, added in AWS SDK v1.21.4.
Requires:
Related: #9409
+1
Alternative way to set the Python and Glue versions:
resource "aws_glue_job" "etl" {
name = "${var.job_name}"
role_arn = "${var.iam_role_arn}"
command {
script_location = "s3://${var.bucket_name}/${aws_s3_bucket_object.script.key}"
}
default_arguments = {
"--enable-metrics" = ""
"--job-language" = "python"
"--TempDir" = "s3://${var.bucket_name}/TEMP"
}
# Manually set python 3 and glue 1.0
provisioner "local-exec" {
command = "aws glue update-job --job-name ${var.job_name} --job-update 'Command={ScriptLocation=s3://${var.bucket_name}/${aws_s3_bucket_object.script.key},PythonVersion=3,Name=glueetl},GlueVersion=1.0,Role=${var.iam_role_arn},DefaultArguments={--enable-metrics=\"\",--job-language=python,--TempDir=\"s3://${var.bucket_name}/TEMP\"}'"
}
}
Any idea when this change will get pushed?
The solution/workaround provided by @ezidio works exactly as expected.
But it would be good if this were supported natively in Terraform and released.
I will do the same as @ezidio suggests, but with job-language scala instead of python.
I also think it would be good if this worked without the workaround.
This may be more urgent given the announcement of Python 2's official sunsetting on Jan 1, 2020.
the "workaround" is a horrible PITA as all the arguments need to be flattened into one string... with nested escaping.
In fact there is no proper workaround, because any modification will reset the job back to Python 2 and the local-exec provisioner will not be rerun.
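One way to make the flattened string more manageable, purely as a sketch against the example above (variable names such as var.bucket_name and var.iam_role_arn are taken from that example): build the --job-update payload in locals first, so the escaping lives in one place and the provisioner command stays short.
locals {
  # Escaping lives here instead of inside the provisioner command line.
  glue_default_args = "{--enable-metrics=\"\",--job-language=python,--TempDir=\"s3://${var.bucket_name}/TEMP\"}"

  glue_job_update = "Command={ScriptLocation=s3://${var.bucket_name}/${aws_s3_bucket_object.script.key},PythonVersion=3,Name=glueetl},GlueVersion=1.0,Role=${var.iam_role_arn},DefaultArguments=${local.glue_default_args}"
}

# The provisioner from the example above then reduces to:
#   command = "aws glue update-job --job-name ${var.job_name} --job-update '${local.glue_job_update}'"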
I think the script logic below should do the job for you, using a null_resource triggered on a timestamp.
resource "aws_glue_job" "etl" {
name = "${local.name}"
role_arn = "${module.crawler_role.role_arn}"
command {
script_location = "s3://abc/abc.py"
}
default_arguments = {
"--job-language" = "python"
"--database" = "${local.name}"
"--s3bucket" = "${var.bucket_name}"
}
}
resource "null_resource" "cluster" {
depends_on = ["aws_glue_job.etl"]
triggers = {
time = "${timestamp()}"
}
provisioner "local-exec" {
command = "aws glue update-job --job-name ${local.name} --job-update 'Role=${module.crawler_role.role_arn}, Command={ScriptLocation=s3://abc/abc.py,PythonVersion=3,Name=glueetl}, DefaultArguments={--job-language=python,--database=${local.name},--s3bucket=<bucket-name>}, Connections={Connections=[${local.name}]}, GlueVersion=1.0'"
}
}
@Vedant-R now it will _always_ run, 40 times for the 40 jobs, even when nothing has changed... :F
Yes, but it serves the purpose of not getting reset to Python 2.
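If the always-run behaviour is a problem, a middle ground (sketched below against the same names as the example above) is to key the triggers on the values the CLI call depends on instead of timestamp(); the update then reruns only when those values change, though it will not catch an unrelated in-place edit to the job that also resets the Python version.
resource "null_resource" "glue_python3_fix" {
  depends_on = ["aws_glue_job.etl"]

  # Rerun the update-job call only when these inputs change,
  # instead of on every terraform apply.
  triggers = {
    role_arn        = "${module.crawler_role.role_arn}"
    script_location = "s3://abc/abc.py"
    database        = "${local.name}"
    s3bucket        = "${var.bucket_name}"
  }

  provisioner "local-exec" {
    command = "aws glue update-job --job-name ${local.name} --job-update 'Role=${module.crawler_role.role_arn}, Command={ScriptLocation=s3://abc/abc.py,PythonVersion=3,Name=glueetl}, GlueVersion=1.0'"
  }
}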
The python_version = "3" option is enabled in the latest provider, terraform-provider-aws_v2.29.0. However, it did not modify the Glue ("Spark") version on an existing job, so the job failed with the following error:
JobName:XXXXXXX and JobRunId:jr_XXXXXXXXX failed to execute with exception Unsupported pythonVersion 3 for given glueVersion 0.9
I am also getting the same issue. What am I missing?
For the time being I have deployed with python_version = "3" and then modified the job from the AWS console to set the Glue version to 1.0. This fixed it. However, it would be good to have a fix from the provider.
This issue can be worked around with a CloudFormation template. In a CloudFormation template we can declare the Glue version and Python version directly, which is easy and needs no update to the AWS provider.
{
  "Description": "AWS Glue Job",
  "Resources": {
    "GlueJob": {
      "Type": "AWS::Glue::Job",
      "Properties": {
        "Command": {
          "Name": "glueetl",
          "ScriptLocation": "${script_location}",
          "PythonVersion": "3"
        },
        "DefaultArguments": {
          "--job-language": "${job-language}",
          "--TempDir": "${TempDir}",
          "--extra-jars": "${extra-jars}"
        },
        "Name": "${Name}",
        "Role": "${role_arn}",
        "MaxCapacity": 10,
        "GlueVersion": "1.0"
      }
    }
  }
}
@g-sree all updates using Terraform will always reset the Python version...
This is my Terraform declaration:
resource "aws_glue_job" "test_glue_job" {
name = "name"
role_arn = "iam_role"
command {
script_location = "script"
python_version = 3
}
default_arguments = {
~~~~ truncated ~~~~
}
}
I'm using Python 3. However, if you modify an existing job, only the Python version changes, not the Glue version, which should be 1.0; in that case the job will fail on the next run. I manually updated the Glue version from 0.9 to 1.0 in the AWS console and never had a problem afterwards.
A CloudFormation-based workaround: https://github.com/terraform-providers/terraform-provider-aws/issues/8526#issuecomment-490161140
resource "aws_cloudformation_stack" "network" {
name = "${local.name}-glue-job"
template_body = <<STACK
{
"Resources" : {
"MyJob": {
"Type": "AWS::Glue::Job",
"Properties": {
"Command": {
"Name": "glueetl",
"ScriptLocation": "s3://${local.bucket_name}/jobs/${var.job}"
},
"ExecutionProperty": {
"MaxConcurrentRuns": 2
},
"MaxRetries": 0,
"Name": "${local.name}",
"Role": "${var.role}"
}
}
}
}
STACK
}
Support for the new glue_version argument in the aws_glue_job resource has been merged and will release with version 2.34.0 of the Terraform AWS Provider, on Thursday. 👍
This has been released in version 2.34.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
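For reference, a minimal sketch of what the resource can look like once you are on provider 2.34.0 or later (the job name, role ARN, and S3 path are placeholders):
provider "aws" {
  version = ">= 2.34.0"
}

resource "aws_glue_job" "etl" {
  name         = "job-name"
  role_arn     = data.aws_iam_role.aws_glue_iam_role.arn
  glue_version = "1.0" # Glue 1.0 = Spark 2.4.3 with Python 3 support

  command {
    name            = "glueetl"
    script_location = "s3://bucket/script.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language" = "python"
  }
}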
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!