Terraform-provider-aws: Support for recrawl-policy on aws_glue_crawler

Created on 4 Nov 2020  ·  1 Comment  ·  Source: hashicorp/terraform-provider-aws

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

The update-crawler and create-crawler APIs now support "recrawl-policy", a policy that specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run.
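
As a rough sketch, the two behaviors in the description correspond to the Glue API values CRAWL_EVERYTHING and CRAWL_NEW_FOLDERS_ONLY. The variable below is hypothetical (its name is an assumption for illustration, not part of the provider) and only shows how a module might constrain the setting before passing it to the crawler:

```hcl
# Hypothetical input variable; the name is illustrative only.
variable "recrawl_behavior" {
  type        = string
  default     = "CRAWL_EVERYTHING" # crawl the entire dataset on every run
  description = "Glue RecrawlBehavior value to pass to the crawler."

  validation {
    # Accept only the two behaviors described in this request.
    condition     = contains(["CRAWL_EVERYTHING", "CRAWL_NEW_FOLDERS_ONLY"], var.recrawl_behavior)
    error_message = "The recrawl_behavior value must be CRAWL_EVERYTHING or CRAWL_NEW_FOLDERS_ONLY."
  }
}
```

Variable validation blocks assume Terraform 0.13 or later.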

Reference:

New or Affected Resource(s)

  • aws_glue_crawler

Potential Terraform Configuration

resource "aws_glue_crawler" "example" {
  database_name = aws_athena_database.example.name
  name          = "example"
  role          = aws_iam_role.example_role.arn

  s3_target {
    path       = "s3://${EXAMPLE}/"
    exclusions = ["elasticsearch-failed/**"]
  }
  schedule = "cron(05 0/1 * * ? *)" # 5 min past the hour

  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY"
  }

  configuration = <<EOF
{
  "Version":1.0,
  "Grouping": {
     "TableGroupingPolicy": "CombineCompatibleSchemas"
  }
}
EOF
}
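
One thing that may be worth pairing with the proposed block: as far as I know, the Glue API only accepts CRAWL_NEW_FOLDERS_ONLY when the crawler's schema change policy uses LOG for both update and delete behavior. A minimal sketch under that assumption follows; the resource name, database name, bucket, and role ARN are placeholders, and recrawl_policy is the block proposed in this request rather than a confirmed provider argument:

```hcl
# Sketch only; all names and the ARN below are placeholders.
resource "aws_glue_crawler" "incremental" {
  database_name = "example_db"
  name          = "example-incremental"
  role          = "arn:aws:iam::123456789012:role/example-glue-role"

  s3_target {
    path = "s3://example-bucket/data/"
  }

  # Proposed block: crawl only folders added since the last run.
  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY"
  }

  # Assumed requirement for incremental crawls: log schema changes
  # instead of updating or deleting tables in the Data Catalog.
  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "LOG"
  }
}
```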

References

enhancement service/glue

Most helpful comment

A much-needed option. In my case it can reduce crawl time from 4 hours to 3 minutes :)

