The update-crawler and create-crawler APIs now support `recrawl-policy`: a policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run.
Reference:
```hcl
resource "aws_glue_crawler" "example" {
  database_name = aws_athena_database.example.name
  name          = "example"
  role          = aws_iam_role.example_role.arn

  s3_target {
    path       = "s3://${EXAMPLE}/"
    exclusions = ["elasticsearch-failed/**"]
  }

  schedule = "cron(05 0/1 * * ? *)" # 5 min past the hour

  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY"
  }

  configuration = <<EOF
{
  "Version": 1.0,
  "Grouping": {
    "TableGroupingPolicy": "CombineCompatibleSchemas"
  }
}
EOF
}
```
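For comparison, the same behavior can be set directly through the Glue API. A minimal sketch using the AWS CLI, assuming a crawler named `example` already exists and valid AWS credentials are configured:

```shell
# Sketch: switch an existing crawler to incremental crawls via the Glue API.
# The crawler name "example" is illustrative.
aws glue update-crawler \
  --name example \
  --recrawl-policy RecrawlBehavior=CRAWL_NEW_FOLDERS_ONLY
```

`RecrawlBehavior` accepts `CRAWL_EVERYTHING` (the default) or `CRAWL_NEW_FOLDERS_ONLY`; the same shorthand works with `create-crawler`.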
Much-needed option; in my case it reduced crawl time from 4 hours to 3 minutes :)