A Debugging Story: A Bug in the S3 API
Malcolm Matalka
Introduction
In the blog post on how to import an AWS S3 bucket, I mentioned that sometimes importing a resource will fail with the error `Error: Cannot import non-existent remote object` despite the resource existing. This blog post details how I debugged that issue. While I don’t have a solution, I was at least able to determine that the error was coming from the AWS API and not from a bug in Terraform.
Close Encounters of the Buggy Kind
My plan when creating the blog post was to create a test bucket, modify it, try to `import` it, delete it, and then repeat the cycle, running `terraform import` against different settings each time. Because I planned to do this many times, I wasn’t paying much attention to how I was configuring the bucket; I just wanted to try a bunch of permutations and see what happened.
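One round of that loop looked roughly like this (a sketch, not my exact commands; it assumes a matching `aws_s3_bucket_public_access_block` resource named `bucket` in the Terraform configuration):

```sh
# One round of the experiment: create, modify, try to import, delete.
aws s3api create-bucket --bucket terrateam-test-bucket
aws s3api put-public-access-block --bucket terrateam-test-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket
aws s3api delete-bucket --bucket terrateam-test-bucket
```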
After some time I got the error message `Error: Cannot import non-existent remote object`. Because I had just been playing around, I didn’t know if this was the first time I’d tried to `import` that setting, or how I had set it. I wasn’t sure whether the mistake was on my part or somewhere else.
At this point I decided I probably needed to dig deeper.
Let’s Get Systematic
Up to now I’d just been fooling around and trying things, so I had not been very systematic in my approach or taken notes. I had an `aws_s3_bucket_ownership_controls` resource. I didn’t know how I had modified the bucket, but I knew I couldn’t `import` that resource.
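For reference, the failing command looked like this (the `bucket` address is just what the resource is named in my test configuration):

```sh
terraform import 'aws_s3_bucket_ownership_controls.bucket' terrateam-test-bucket
```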
Guess 1: Eventual Consistency?
Sometimes when you do something on AWS, it takes a little while for the change to propagate. My first guess was that this was the case here. I waited a few minutes and tried the `import` again, but the issue didn’t go away, so it wasn’t that.
Guess 2: Search Google?
With my first guess not panning out, I decided to search Google for the error message. I found an unresolved GitHub issue where a user had experienced the same error. Because of that issue, I thought this was a bug in the resource: for some reason, the `aws_s3_bucket_ownership_controls` resource was not importable. I decided to move on and make a note in the blog post saying that this resource could not be imported.
I deleted the bucket, created it again, modified it, and went back to importing attributes. After a few more rounds, I hit the error again, but this time with a different resource: `aws_s3_bucket_public_access_block`. I knew that I had successfully imported that one before.
I then ran the `import` for the `aws_s3_bucket_ownership_controls` resource again. To my surprise, it worked.
Guess 3: Something About Configuring the Bucket
At this moment I decided that I needed to be able to consistently recreate the error I was seeing. I went back to deleting the bucket, creating it again, and modifying it. I finally hit upon the cause and was able to recreate the issue: if I modified a configuration when creating the bucket, I could not `import` that attribute. If I modified it after creation, I could `import` it. Additionally, if I created a bucket with the default configuration, I could `import` it.
Something weird is going on where, if a setting is modified on create, Terraform cannot `import` it. The two paths are sketched below.
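Expressed with the `aws` CLI, the distinction looks roughly like this (an illustrative sketch using ownership controls; I haven’t verified that this exact CLI sequence reproduces the behavior):

```sh
# Path 1: set ownership controls at create time -- the import then fails.
aws s3api create-bucket --bucket terrateam-test-bucket \
  --object-ownership BucketOwnerPreferred
terraform import 'aws_s3_bucket_ownership_controls.bucket' terrateam-test-bucket  # fails

# Path 2: create with defaults, then modify -- the import succeeds.
aws s3api create-bucket --bucket terrateam-test-bucket
aws s3api put-bucket-ownership-controls --bucket terrateam-test-bucket \
  --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerPreferred}]'
terraform import 'aws_s3_bucket_ownership_controls.bucket' terrateam-test-bucket  # works
```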
I then tested what happened with an attribute I could not `import`: if I modified it in the AWS console and then reverted the modification, I could `import` it.
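The CLI equivalent of that toggle-and-revert would be something like the following (illustrative only; I did it through the console):

```sh
# Toggle the setting away from its current value, put it back, then import.
aws s3api put-public-access-block --bucket terrateam-test-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false
aws s3api put-public-access-block --bucket terrateam-test-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket  # now works
```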
Guess 4: The AWS Provider Has a Bug
I thought that this must be a bug in the AWS provider. The AWS S3 resources had recently been significantly refactored, so perhaps a bug had slipped through. But how to prove it? Luckily, Terraform has very detailed debug logging, especially at the `trace` level. I performed an `import` with the logging level set to `trace`:
```sh
env TF_LOG=trace terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket
```
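The `trace` output is enormous, so it helps to search it for the status code (this assumes, as turned out to be true here, that the HTTP status appears verbatim in the provider’s logs):

```sh
env TF_LOG=trace terraform import 'aws_s3_bucket_public_access_block.bucket' terrateam-test-bucket 2>&1 \
  | grep '404'
```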
In the trace output I found the failed call: looking up the public access block with the AWS API was returning a 404.
To verify this, I switched over to the `aws` CLI tool to perform the query directly:
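The provider’s lookup corresponds to S3’s `GetPublicAccessBlock` operation, which the CLI exposes as `get-public-access-block`:

```sh
aws s3api get-public-access-block --bucket terrateam-test-bucket
# On the broken bucket this fails with a NoSuchPublicAccessBlockConfiguration
# error, which is S3's 404 for this operation.
```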
So I had verified that the underlying API was returning a 404. But was that expected? Or was there still a provider bug, where the provider was supposed to be making a different API call in this situation?
To find out, I recreated my bucket in a way that allowed it to be imported and executed the same API call. I got an actual response back, not a 404.
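On the importable bucket, the response looks something like this (the exact values depend on how the bucket was configured):

```sh
aws s3api get-public-access-block --bucket terrateam-test-bucket
# {
#     "PublicAccessBlockConfiguration": {
#         "BlockPublicAcls": true,
#         "IgnorePublicAcls": true,
#         "BlockPublicPolicy": true,
#         "RestrictPublicBuckets": true
#     }
# }
```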
I did another experiment where I created a bucket on which I could not `import` some attributes, verified that the response was a 404, then modified the configuration and verified that the API gave a real response.
Bingo.
Pulling It All Together
- I determined how to consistently create bucket attributes that either can or cannot be successfully imported.
- I was able to use Terraform’s debug logging to determine that an attribute cannot be imported because the API was returning a 404.
- I was able to confirm this using the `aws` CLI to perform the API call directly, getting the same result as Terraform.
- I was able to show that attributes that can be imported do not return a 404 from the API.
- Finally, I showed that if I could not import a resource, modifying it in the AWS console (for example, toggling the “public access block” setting) made it importable.
Conclusion? The AWS S3 API has a bug where, if the configuration is modified on create, the API responds with a 404 for those attributes.
Debugging Tips
Debugging is more art than science. It requires a lot of creativity. You have a system that is not working as expected, and you have to come up with ways to poke it so that it reveals more information to you. That is very dependent on the situation, and it’s hard to give generic advice. But there are some things you can do to improve your odds.
- Take notes. Once I decided to start debugging this, I started taking notes. For everything I did, I recorded what it was and what its output was. I wanted to try a lot of different things, and it was essential that I didn’t get confused about what I had done or whether I had already tried something.
- Try different things. I got to the issue being bucket creation vs. bucket modification just by having a list of things to try. I had no intuition that it would be that; I was just trying everything.
- Make a hypothesis and prove (or disprove) it. I could have just stopped at “the Terraform AWS provider has a bug”. But just because that is how I experienced the bug doesn’t mean that is the cause. There are a lot of layers of software between Terraform and AWS, and the problem could have been in any of them. Luckily, Terraform has extensive debug logging. Without it, I hopefully would have eventually stumbled upon making the AWS API calls myself.
If you’re interested in debugging, the Oxide Computer Company has some great podcast episodes on debugging: