Terraform 0.9.8
An aws_db_instance (SQL Server) restored from a snapshot gets stuck in "Still creating..." even though the instance was fully created and available in the RDS console after 20 minutes.
Resource:
resource "aws_db_instance" "foo" {
count = "${var.flags["enable_foo"]}"
identifier = "foo-${var.environment}-${count.index}"
allocated_storage = "4000"
allow_major_version_upgrade = "false"
apply_immediately = "true"
auto_minor_version_upgrade = "false"
availability_zone = "eu-west-1a"
backup_retention_period = "7"
backup_window = "15:15-15:45"
copy_tags_to_snapshot = "true"
db_subnet_group_name = "${aws_db_subnet_group.db.id}"
engine = "sqlserver-ee"
engine_version = "13.00.4422.0.v1"
final_snapshot_identifier = "foo-${replace(var.environment, "_", "-")}-${count.index}-snapshot-tf"
instance_class = "${var.instance_sizes["db_foo"]}"
iops = "12000"
license_model = "license-included"
maintenance_window = "sun:11:00-sun:11:30"
multi_az = "false"
option_group_name = "${replace(var.environment, "_", "-")}-sqlserver-og"
password = "supersecretpassword"
publicly_accessible = "false"
skip_final_snapshot = "false"
snapshot_identifier = "arn:aws:rds:eu-west-1:111111111111:snapshot:foo-snapshot"
storage_type = "io1"
timezone = "UTC"
username = "sa"
vpc_security_group_ids = ["${aws_security_group.foo.id}"]
tags {
Name = "foo-${replace(var.environment, "_", "-")}-${count.index}"
environment = "${var.environment}"
component = "database"
service = "foo"
}
timeouts {
create = "6h"
}
}
Output:
aws_db_instance.foo: Still creating... (10s elapsed)
aws_db_instance.foo: Still creating... (20s elapsed)
....
aws_db_instance.foo: Still creating... (1h59m53s elapsed)
aws_db_instance.foo: Still creating... (2h0m3s elapsed)
...
etc...
However, the RDS restore actually completed in 20 minutes and the instance was marked as available in the AWS console, but Terraform doesn't see it :(
Further information: the timeout error eventually happened and read as follows:
Failed to save state: Failed to upload state: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Failed to persist state to backend.
The error shown above has prevented Terraform from writing the updated state
to the configured backend. To allow for recovery, the state has been written
to the file "errored.tfstate" in the current working directory.
Running "terraform apply" again at this point will create a forked state,
making it harder to recover.
To retry writing this state, use the following command:
terraform state push errored.tfstate
Also, there was no errored.tfstate file written anyway - that is another bug, which I have raised separately as https://github.com/hashicorp/terraform/issues/15688
I believe the above bug is due to Terraform not refreshing the assumed role it uses to store state in an S3 backend. My terraform init command was:
terraform init --backend-config=bucket=my-terraform-bucket --backend-config=key=terraform/terraform.tfstate --backend-config=role_arn=arn:aws:iam::111111111111:role/state_storing_role
when combined with:
terraform {
  backend "s3" {
    region  = "eu-west-1"
    encrypt = "true"
  }
}
And a terraform apply that takes >1h, I believe Terraform can then no longer push the state because the assumed-role session token has expired. Terraform should refresh the session token; details of how to do this are in the AWS docs.
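For reference, here is a sketch of what the backend ends up configured with once the --backend-config values from the init command above are merged in (values copied from that command; shown only to make the role assumption explicit):
terraform {
  backend "s3" {
    bucket   = "my-terraform-bucket"
    key      = "terraform/terraform.tfstate"
    region   = "eu-west-1"
    encrypt  = "true"
    role_arn = "arn:aws:iam::111111111111:role/state_storing_role"
  }
}
The role_arn is what makes the backend use short-lived STS credentials, which, per the hypothesis above, is why the state push fails once the apply outlives the session.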
I can confirm the same behaviour is happening on terraform 0.10. I'm specifying role_arn to the s3 backend, and it fails with:
Failed to save state: Failed to upload state: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
One approach to fixing this would be to implement a "keepalive" API call to AWS (e.g. fetching the MD5 checksum of the tfstate object in S3 every minute), which would trigger the Go aws-sdk logic to refresh the STS token and thus prevent the issue from happening.
I've also experienced this issue, especially when creating a high number of resources in one apply - for example an RDS database, CloudFront and ElastiCache, each of which takes about 10 minutes to provision.
I'm using aws-vault for credentials management, so I'm working with temporary credentials that assume a selected role - I think this is pretty much the same setup as yours.
Did somebody figure out how to solve this issue?
I think this is the same issue as #1351. One workaround is to use something like aws-vault (unrelated to HashiCorp's Vault), which serves tokens via the metadata API and refreshes them in the background for you; see the sketch below. (FWIW, that's what I'm telling @Latacora customers the answer is, together with "consider not having humans near terraform" :-) Keep in mind that there are some details, like binding on a privileged port; you probably still want to encapsulate that in a VM or container or whatever -- something with a separate networking namespace :))
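A minimal sketch of that workaround, assuming an aws-vault profile named my-profile (the --server flag starts aws-vault's local metadata-style credential server so the SDK inside Terraform can keep fetching fresh credentials during a long apply):
aws-vault exec --server my-profile -- terraform apply
Because the credentials come from the metadata endpoint rather than being handed to Terraform once at startup, the aws-sdk refreshes them automatically instead of holding a single expiring STS token.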
I came across the same issue. I refreshed the token under .aws/credentials and still hit the error.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.
If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!