A Debugging Story: A Bug in the S3 API

By Malcolm Matalka on Jan 8, 2023

In the blog post on how to import an AWS S3 bucket, I mentioned that importing a resource will sometimes fail with the error Error: Cannot import non-existent remote object even though the resource exists. This blog post details how I debugged that issue. While I don’t have a solution, I was at least able to determine that the error was coming from the AWS API and not from a bug in Terraform.

Close Encounters of the Buggy Kind

My plan when creating the blog post was that I would create a test bucket, modify it, try to import it, delete it, and repeat this many times trying to terraform import different settings. Because I planned to do this many times, I wasn’t paying too much attention to how I was configuring the bucket, I just wanted to try a bunch of permutations and see what happened.
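In shell terms, each round looked roughly like the sketch below. The bucket name is hypothetical, and I was driving most of this from the AWS console rather than a script, so treat it as an illustration of the loop, not the exact commands I ran:

# Create a test bucket (in us-east-1; other regions need --create-bucket-configuration).
aws s3api create-bucket --bucket my-test-bucket
# Tweak a setting, then try to import the corresponding resource.
aws s3api put-public-access-block --bucket my-test-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
terraform import 'aws_s3_bucket_public_access_block.bucket' my-test-bucket
# Forget the import, clean up, and repeat with a different permutation.
terraform state rm 'aws_s3_bucket_public_access_block.bucket'
aws s3api delete-bucket --bucket my-test-bucket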

After some time I got the error message: Error: Cannot import non-existent remote object. Because I had just been playing around, I didn’t know if this was the first time I’d tried to import that setting or how I had set it. I wasn’t sure whether the mistake was on my part or somewhere else.

At this point I decided I probably needed to dig deeper.

Let’s Get Systematic

Up to now, I’d just been fooling around and trying things, so I had not been very systematic in my approach, nor had I taken notes. I had an aws_s3_bucket_ownership_controls resource. I didn’t know how I had modified the bucket, but I knew I couldn’t import that resource.
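For reference, the resource looked roughly like the block below (the exact settings and file name are illustrative), and importing it uses the bucket name as the ID:

# Hypothetical resource block, written out from a shell heredoc:
cat > ownership.tf <<'EOF'
resource "aws_s3_bucket_ownership_controls" "bucket" {
  bucket = "terrateam-test-bucket"
  rule {
    object_ownership = "BucketOwnerEnforced"
  }
}
EOF
terraform import 'aws_s3_bucket_ownership_controls.bucket' terrateam-test-bucket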

Guess 1: Eventual Consistency?

Sometimes when you do something on AWS, it takes a little while for the change to propagate. My first guess was that this was one of those cases. I waited a few minutes and tried to import again, but the issue didn’t go away, so it wasn’t that.

Guess 2: Search Google?

With my first guess not panning out, I decided to search Google for the error message. I found an unresolved GitHub issue where a user had experienced the same error. Because of that issue, I thought this was a bug in the resource: for some reason, the aws_s3_bucket_ownership_controls resource was not importable. I decided to move on from the bug and make a note in the blog post saying that this resource could not be imported.

I deleted the bucket, created it again, modified it, and went back to importing attributes. After a few more rounds, I hit the error again, but this time with a different resource: aws_s3_bucket_public_access_block. I knew that I had successfully imported that one before.

I then ran the import for the aws_s3_bucket_ownership_controls resource again. To my surprise, it worked.

Guess 3: Something About Configuring the Bucket

At this point I decided that I needed to be able to consistently recreate the error I was seeing. I went back to deleting the bucket, creating it again, and modifying it. I finally hit upon the cause and was able to recreate the issue: if I modified a configuration while creating the bucket, I could not import that attribute. If I modified it after create, I could import it. Additionally, if I created a bucket with the default configuration, I could import it.
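To make the distinction concrete, here is a hypothetical way of expressing the two paths with the aws CLI, using the object-ownership setting, which happens to be settable at creation time. I did the equivalent through the console, and I haven’t confirmed that this exact CLI sequence triggers the bug:

# Path 1: setting chosen on create -- importing this attribute failed for me.
aws s3api create-bucket --bucket my-test-bucket --object-ownership BucketOwnerEnforced

# Path 2: default create, then modify -- importing worked.
# (Run each path against a fresh bucket.)
aws s3api create-bucket --bucket my-test-bucket
aws s3api put-bucket-ownership-controls --bucket my-test-bucket \
    --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerEnforced}]'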

Something weird was going on: when a setting was modified on create, Terraform could not import it.

I then tested that, for an attribute I could not import, modifying it in the AWS console and then reverting the modification made it importable.
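In CLI terms, that toggle-and-revert test looks roughly like the following (the values are hypothetical; I did this through the console):

# Toggle one value in the public access block configuration...
aws s3api put-public-access-block --bucket my-test-bucket \
    --public-access-block-configuration BlockPublicAcls=false,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# ...revert it to the original values...
aws s3api put-public-access-block --bucket my-test-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# ...and the import now succeeds.
terraform import 'aws_s3_bucket_public_access_block.bucket' my-test-bucket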

Guess 4: The AWS Provider Has a Bug

I thought that this must be a bug in the AWS provider. The AWS S3 resources had recently been significantly refactored, so a bug had probably slipped through. But how to prove it? Luckily, Terraform has very detailed debug logging, especially at the trace level. I performed an import with the log level set to trace:

env TF_LOG=trace terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket
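Trace output is extremely verbose, so if you’d rather capture it in a file than scroll through your terminal, Terraform also honors the TF_LOG_PATH environment variable:

env TF_LOG=trace TF_LOG_PATH=./trace.log terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket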

I saw the failed call in the output:

-----------------------------------------------------
[DEBUG] [aws-sdk-go] DEBUG: Response s3/GetPublicAccessBlock Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 404 Not Found
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Fri, 06 Jan 2023 21:50:30 GMT
Server: AmazonS3
X-Amz-Id-2: ...
X-Amz-Request-Id: ...
-----------------------------------------------------

Looking up the public access block with the AWS API was returning 404.

To verify this, I switched over to the aws CLI tool to perform the query directly:

$ aws s3api get-public-access-block --bucket terrateam-test-bucket
An error occurred (NoSuchPublicAccessBlockConfiguration) when calling the GetPublicAccessBlock operation: The public access block configuration was not found

I had verified that the underlying API was returning 404. But was that expected? Could there still be a provider bug, for instance if the provider was supposed to make a different API call in this situation?

To check, I recreated my bucket in a state where it could be imported and executed the API call. This time I got an actual response back, not a 404.
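For a bucket in the importable state, the successful output looks something like this (the exact values are illustrative):

$ aws s3api get-public-access-block --bucket terrateam-test-bucket
{
    "PublicAccessBlockConfiguration": {
        "BlockPublicAcls": true,
        "IgnorePublicAcls": true,
        "BlockPublicPolicy": true,
        "RestrictPublicBuckets": true
    }
}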

I did another experiment where I created a bucket with attributes I could not import, verified that the response was 404, then modified the configuration and verified that the API now gave a real response.
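Rendered as a hypothetical CLI sequence using the ownership setting (again, I drove the actual experiment through the console and Terraform, so treat this as a sketch rather than a confirmed reproduction):

# Set a configuration on create, observe the 404, modify it, observe success.
aws s3api create-bucket --bucket my-test-bucket --object-ownership BucketOwnerEnforced
aws s3api get-bucket-ownership-controls --bucket my-test-bucket
# expected here: OwnershipControlsNotFoundError (HTTP 404)
aws s3api put-bucket-ownership-controls --bucket my-test-bucket \
    --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerEnforced}]'
aws s3api get-bucket-ownership-controls --bucket my-test-bucket
# expected here: the ownership configuration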

Bingo.

Pulling It All Together

  • I determined how to consistently create buckets whose attributes either could or could not be imported.
  • I was able to use debug logging in Terraform to determine that an attribute could not be imported because the API was returning 404.
  • I was able to confirm this using the aws CLI to perform the API call directly, getting the same result as Terraform.
  • I was able to show that attributes that can be imported do not return 404 from the API.
  • Finally, I showed that if I could not import a resource, if I then modified it in the AWS console (for example, toggling the “public access block” setting), I could then import it.

Conclusion? The AWS S3 API has a bug: if a configuration is modified on create, the API responds with 404 for those attributes.

Debugging Tips

Debugging is more art than science. It requires a lot of creativity. You have this system that is not working as expected and you have to come up with ways to poke it such that it will reveal more information to you. That is very dependent on the situation and it is hard to give any generic advice. But there are some things you can do to improve your odds.

  1. Take notes. Once I decided to actually debug this, I started taking notes. For everything I did, I recorded what it was and what its output was. I wanted to try a lot of different things, and I couldn’t afford to get confused about what I had done or whether I had already tried something.
  2. Try different things. I got to the issue being bucket creation vs. bucket modification just by working through a list of things to try. I had no intuition that it would be that; I was just trying everything.
  3. Make a hypothesis and prove (or disprove) it. I could have just stopped at “the Terraform AWS provider has a bug”. But just because that is how I experienced the bug doesn’t mean that is the cause. There are a lot of layers of software between Terraform and AWS, and it could have been any of them. Luckily, Terraform has extensive debug logging. Without it, I would hopefully have eventually stumbled on performing the AWS API calls myself.

If you’re interested in debugging, the Oxide Computer Company has some great podcast episodes on the topic.
