July 28, 2025 • mike-vanbuskirk

Using LLMs to Generate Terraform Code - 2025 Update

We originally tested ChatGPT 4, Claude 3 Opus, and Mistral Large over a year ago. The AI landscape has changed significantly since then, with current models far eclipsing the previous generation in capability. With that in mind, we've updated this post with evaluations of the current crop of models, along with a brief discussion of the most significant changes in AI as they relate to infrastructure as code.

šŸ’” Looking for help with Terraform? We provide expert consulting services for infrastructure automation.

TL;DR

We tested four leading LLMs (OpenAI o3, Claude Sonnet 4, Mistral Le Chat Pro, and Google Gemini 2.5 Pro) on generating Terraform configurations for a simple AWS application.

Winners: Claude Sonnet 4 and Gemini 2.5 Pro (tie)

  • Claude Sonnet 4: Excelled with a user-friendly deployment script, single-shot success
  • Gemini 2.5 Pro: Generated cleanest, best-practice Terraform code structure
  • OpenAI o3: Succeeded but required 7 iterations to fix various issues
  • Mistral Le Chat Pro: Disappointing; felt like a previous-generation model

Key takeaway: Modern LLMs are now capable of generating production-ready IaC with proper context and prompting.

Jump to full findings →

2025 Updates

Rise of Agents/Agentic Workflows

Since the original post was published, agents and agentic AI workflows have moved from the bleeding edge to practical application. Cursor and GitHub Copilot both now offer agent workflows, and there are multiple platforms and frameworks for orchestrating multiple agents across a variety of domains. Devin, which we mentioned last year, is now generally available, although it remains to be seen whether its fully autonomous approach to agents will gain traction given the current capabilities of LLMs. The more popular pattern looks to be semi-autonomous agents, with engineers acting as orchestrators or conductors. For DevOps and IaC specifically, Stakpak now offers a dedicated, AI-enabled DevOps IDE with agentic capabilities.

Google Gemini

Virtually a non-entity in the LLM space a year ago, Google has established itself with the launch of its Gemini models, which offer good performance and a staggeringly large context window (Gemini 2.5 Pro offers 2 million tokens). Due to the ubiquity of Gemini 2.5 usage, we've decided to include it in our updated evaluations.

Chain of Thought Reasoning

Chain of Thought (CoT) reasoning has become standard across most current-generation language models, allowing them to break down complex problems into logical steps rather than attempting to solve everything in a single response. For infrastructure code generation, this means (in theory) that models should now be able to work through these prompts more systematically and effectively, by breaking the problem down into smaller pieces. We should see some CoT output in the tests, but it will be interesting to see if it leads to better output.

2025 Testing Methodology

We'll be using the same testing methodology as before, with the same prompt and same hypothetical application infrastructure. The caveats are also the same: evaluating the output of LLMs is ultimately subjective; LLM output is not deterministic. We'll be doing manual assessment of each model, and will determine if the code generates a "working" application. Any time an error, failure, or other issue is encountered, the model is prompted to fix the issue, and a new iteration begins.

2025 Tests

Experiment #1: OpenAI - o3

Process

  • A new chat prompt was opened with the o3 model selected.
  • The prompt was entered into the model.
[Process screenshots: o3 output 1-10]

Iteration #1

  • The configurations were placed in their respective files as instructed: versions.tf, variables.tf, main.tf, outputs.tf.
  • We've already hit a snag: the instructions have the user build and push their image before the infrastructure that contains the ECR registry has been deployed. The model was prompted to correct this (the corrected ordering is sketched after this list).
  • Placeholder values were updated as instructed.
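
For reference, the ordering the later iterations eventually converged on looks roughly like the sequence below. This is a sketch assembled from commands that appear later in this walkthrough (the resource address comes from Iteration #2's targeted apply, the output from Iteration #4), not something the model emitted in one place:

terraform init
terraform apply -target=aws_ecr_repository.app_repo    # bootstrap just the registry
REPO_URL=$(terraform output -raw ecr_repo_url)          # output added in Iteration #4
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$REPO_URL"
docker build -t "${REPO_URL}:latest" .
docker push "${REPO_URL}:latest"
terraform apply                                          # then deploy the remaining infrastructure
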
[Iteration #1 screenshots: o3 fix 1-4]

Iteration #2

  • terraform init was run.
  • Option "A" of the model's suggestions was chosen, and terraform apply was run targeting the ECR registry only.
  • Terraform reported an error:
│ Error: Reference to undeclared resource
│
│   on main.tf line 106, in module "vpc":
│  106:   azs                  = slice(data.aws_availability_zones.available.names, 0, 2)
│
│ A data resource "aws_availability_zones" "available" has not been declared in the root module.

The error message was provided to the model, and the fix was applied.
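
The fix is to declare the missing data source in the root module. A minimal sketch of the block that resolves this error (filtering to available AZs is the common convention; the model's exact block may differ):

data "aws_availability_zones" "available" {
  state = "available"
}

With this declared, the module argument azs = slice(data.aws_availability_zones.available.names, 0, 2) resolves as intended.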

[Iteration #2 screenshot: o3 fix 5]

Iteration #3

  • terraform validate was used to confirm validity of the configuration, which succeeded.
  • The terraform apply -target=aws_ecr_repository.app_repo command was attempted again, and was successful, but issued warnings around the use of targeted terraform apply.
  • The build & push commands were run, however the following error was encountered when attempting to build the image:
ERROR: failed to build: invalid tag "\"************.dkr.ecr.us-east-1.amazonaws.com/hello-fargate-picked-sturgeon\":latest": invalid reference format
[Iteration #3 screenshots: o3 fix 6-8]

Iteration #4

  • Option 2 of the fix was chosen, with a specific output added to outputs.tf (sketched after this list).
  • terraform apply -refresh-only was run.
  • The new output was set as a Bash variable with: REPO_URL=$(terraform output -raw ecr_repo_url)
  • The build was run again: docker build -t ${REPO_URL}:latest ., and the image was built successfully.
  • The following error was encountered when attempting to authenticate to ECR with: aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REPO_URL:
Provided region_name '╷
│ Warning: No outputs found
│
│ The state file either has no outputs defined, or all the defined outputs are empty. Please define an output in your
│ configuration with the `output` keyword and run `terraform refresh` for it to become available. If you are using
│ interpolation, please verify the interpolated value is not empty. You can use the `terraform console` command to assist.
╵' doesn't match a supported format.
Error: Cannot perform an interactive login from a non TTY device
  • The model was provided the error, and the fixes were applied.
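
The invalid-tag error from the previous iteration is the classic symptom of reading a value with terraform output without the -raw flag, which wraps string values in quotes. The fix pairs a dedicated output with -raw. A minimal sketch of the output block (the output name comes from the REPO_URL command above, the resource address from the earlier targeted apply):

output "ecr_repo_url" {
  description = "URL of the application's ECR repository"
  value       = aws_ecr_repository.app_repo.repository_url
}

terraform output -raw ecr_repo_url then prints the bare URL, suitable for tagging the image, whereas terraform output ecr_repo_url would print it quoted.
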
[Iteration #4 screenshots: o3 fix 9-11]

Iteration #5

  • aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REPO_URL was run successfully.
  • docker push ${REPO_URL}:latest was run, and the image was pushed successfully.
  • terraform apply was run, and completed successfully.
  • The app did not run successfully, with a 504 error being generated when viewing the frontend.
  • Upon reviewing the task logs, it was discovered that the DynamoDB table name was not being injected into the container runtime:
raise ValueError("Please set the DYNAMODB_TABLE_NAME environment variable")
  • The model was provided the error, and the fix was applied (a sketch of the kind of change involved follows below).
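
Injecting the table name means adding an environment entry to the container definition in the ECS task definition. The block below is a hedged sketch rather than the model's actual output; the resource addresses and names are placeholders, and the environment entry is the relevant part:

# Hypothetical sketch of the task definition; names and sizes are placeholders.
resource "aws_ecs_task_definition" "app" {
  family                   = "hello-fargate"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn

  container_definitions = jsonencode([{
    name         = "app"
    image        = "${aws_ecr_repository.app_repo.repository_url}:latest"
    portMappings = [{ containerPort = 80, protocol = "tcp" }]
    # The fix: pass the table name the application reads at startup.
    environment = [{
      name  = "DYNAMODB_TABLE_NAME"
      value = aws_dynamodb_table.app_table.name
    }]
  }])
}
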
[Iteration #5 screenshots: o3 fix 12-15]

Iteration #6

  • terraform apply was run.
  • The task successfully started, however when visiting the ALB DNS URL, a Boto3 exception was thrown.
  • The error was provided, and the fix was applied. NOTE: although the model recommended changing the application code, we implemented option B instead, so that the application code differs as little as possible from the previous test runs (a sketch of this kind of infrastructure-side alignment follows below).
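
Based on the issues list further down (the DynamoDB primary key name not matching the application code), the infrastructure-side fix most likely amounted to renaming the table's key in Terraform to whatever the application expects. A hedged sketch, with placeholder names rather than the model's actual values:

# Hypothetical sketch: align the hash key with the attribute name the app writes.
resource "aws_dynamodb_table" "app_table" {
  name         = "hello-fargate-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"   # must match the key used in the application code

  attribute {
    name = "id"
    type = "S"
  }
}
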
[Iteration #6 screenshots: o3 fix 16-22]

Iteration #7

  • terraform apply was run successfully.
  • The endpoint returned "Hello World" successfully.
  • Timestamped entries were confirmed in the DynamoDB Table.

Notes

  • Compared to the previous generation model, o3 provided much clearer instructions around file structure and where to place configuration blocks. It even made use of a publicly available module.

Issues

  • The model did not identify the need to first bootstrap the ECR registry before docker images could be pushed and the remaining infrastructure deployed.
  • A data source for availability zones was interpolated before it was created.
  • The docker build command required changes.
  • The command to authenticate to the ECR registry required new outputs to be defined in outputs.tf
  • The environment variable name for the DynamoDB table did not match the application code*
  • The primary key name for the DynamoDB table did not match what was defined in the application code*

Results

We got a working application! Although it required several iterations, most of the issues stemmed from either the initial bootstrapping and Docker commands, or configuration mismatches with the application code. I've marked the last two items with a "*", as those were not a direct result of model error, but rather of the model not being supplied with the application code in its initial prompt and having to make assumptions.

My takeaway is that I am much more impressed with the overall IaC output and capabilities of the model. Compared to the previous OpenAI model, this configuration made much better use of value and variable interpolation, utilized public registry modules, and broke down the configuration into separate .tf files.

Experiment #2: Anthropic - Claude Sonnet 4

Process

  • A new chat prompt was opened with the Claude Sonnet 4 model selected.
  • The prompt was entered into the model.
[Process screenshots: Claude Sonnet 4 output 1-18]

Iteration #1

  • The model provided explicit setup instructions, including a deploy.sh script, which was run after the initial directory was created and the files copied over.
  • The deploy script was successfully run, and the app was verified as fully functional.

Notes

  • Compared to the previous-generation Claude model, this output had a much more perceptible focus on user experience; it provided a fully featured deployment script that handled bootstrapping the ECR registry and surfaced the relevant outputs to the user (a rough sketch of that kind of script follows below).
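
The script itself isn't reproduced here, but its shape was roughly the following. This is an illustrative skeleton only, not Claude's actual output; the resource address, output names, and region default are assumptions:

#!/usr/bin/env bash
# Illustrative deploy.sh skeleton (not the model's actual script).
set -euo pipefail

AWS_REGION="${AWS_REGION:-us-east-1}"   # assumed default

echo "==> Bootstrapping ECR repository..."
terraform init -input=false
terraform apply -auto-approve -target=aws_ecr_repository.app   # placeholder address

echo "==> Building and pushing the Docker image..."
REPO_URL=$(terraform output -raw ecr_repository_url)            # placeholder output name
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$REPO_URL"
docker build -t "${REPO_URL}:latest" .
docker push "${REPO_URL}:latest"

echo "==> Deploying the remaining infrastructure..."
terraform apply -auto-approve

echo "==> Application endpoint:"
terraform output -raw alb_dns_name                              # placeholder output name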

Issues

  • The Terraform configuration was generated entirely within a single file: main.tf. At minimum, most Terraform configurations start with separate files for the resources, variables, outputs, as well as provider and backend configuration.

Results

Very impressive. The model satisfied the requirements of the prompt in a single shot. The deployment script was a nice touch: it provided easy-to-read outputs and properly bootstrapped the resources needed to build and push a Docker image.

Experiment #3: Mistral - Le Chat Pro

Process

  • A new chat prompt was opened with the Le Chat Pro model selected.
  • The prompt was entered into the model.
[Process screenshots: Mistral Le Chat Pro output 1-10]

Iteration #1

  • The model provided some basic instructions for setup, including creating and pushing a Docker image.
  • The model did not account for the need to create the ECR registry before the image could be pushed.
  • The model was prompted with this information, and the fixes were applied.
[Iteration #1 screenshots: Mistral Le Chat Pro fix 1-2]

Iteration #2

  • terraform init, terraform plan, and terraform apply were run successfully.
  • The Docker image was built and pushed to the ECR registry.
  • aws ecs update-service --cluster flask-app-cluster --service flask-app-service --force-new-deployment was run successfully.
  • When visiting the endpoint, a 504 error was generated. The model was prompted with this information.
[Iteration #2 screenshots: Mistral Le Chat Pro fix 3-5]

Iteration #3

  • Since the model provided only generic troubleshooting advice, it was prompted to fix the issue again. At this point, I noticed that "Think" mode had to be enabled manually, which was done.
[Iteration #3 screenshots: Mistral Le Chat Pro fix 6-13]

Iteration #4

  • The new Terraform configuration was copied, and the terraform init, terraform plan, and terraform apply commands were run successfully.
  • When visiting the endpoint, an exception was thrown.
  • The model was prompted with the exception information and asked to fix it.
[Iteration #4 screenshots: Mistral Le Chat Pro fix 14-16]

Iteration #5

  • The model again asked the user to perform generic troubleshooting steps, and was prompted to fix the issue.
[Iteration #5 screenshots: Mistral Le Chat Pro fix 17-24]

Iteration #6

  • The new Terraform configuration was copied over, and terraform init, plan, apply commands were run.
  • An error was received:
│ Error: creating IAM Role (ecsTaskExecutionRole): EntityAlreadyExists: Role with name ecsTaskExecutionRole already exists.
│       status code: 409, request id: 69650f31-7e87-49d5-bddf-10926e57010c
│
│   with aws_iam_role.ecs_task_execution_role,
│   on main.tf line 52, in resource "aws_iam_role" "ecs_task_execution_role":
│   52: resource "aws_iam_role" "ecs_task_execution_role" {
│
  • On further investigation, it was also discovered that the Terraform configuration contained the same error that had plagued the previous-generation models: it did not create separate execution and task roles (the split-role pattern is sketched after this list).
  • The model was prompted with this information. (NOTE: after generating a new config and instructions, the model generated a large amount of repeating text.)
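
The canonical fix is two separate roles: an execution role that the ECS agent uses to pull the image and write logs, and a task role that the running application assumes (here, for DynamoDB access). A hedged sketch of the split-role pattern, with names and policies illustrative rather than copied from the model's output:

# Shared trust policy: both roles are assumed by ECS tasks.
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS itself to pull from ECR and write logs.
# A unique name also avoids the EntityAlreadyExists error seen above.
resource "aws_iam_role" "ecs_task_execution_role" {
  name               = "flask-app-task-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: assumed by the container at runtime; grants the app access to DynamoDB.
resource "aws_iam_role" "ecs_task_role" {
  name               = "flask-app-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

The two roles are then referenced from the task definition as execution_role_arn and task_role_arn respectively.
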
[Iteration #6 screenshots: Mistral Le Chat Pro fix 25-26]

Iteration #7

  • The new Terraform configuration was copied over, and terraform init, plan, apply commands were run successfully.
  • aws ecs update-service --cluster flask-app-cluster --service flask-app-service --force-new-deployment was run to ensure the latest task definition was being used.
  • The endpoint successfully returned "Hello World", with entries written to the DynamoDB table.

Notes

  • Unfortunately, this felt like interacting with a previous-generation LLM. It frequently asked the user to perform generic troubleshooting steps rather than fixing its own work, and it made several errors.

Issues

  • The model frequently defaulted to asking the user to perform generic troubleshooting steps.
  • The model did not correctly interpret that the ECR registry would need to be created before a Docker image could be pushed.
  • The need for separate task and execution roles was not properly addressed until specifically called out by the user.
  • A configuration was generated that would not even complete a terraform apply without failure.
  • A single Terraform file was generated, rather than multiple files.
  • The model asked the user to perform several manual commands, rather than attempting to fulfill the requirement that the user should have only minimal intervention.

Results

Even with "Think" mode turned on, the Le Chat model very much gave the impression that I was still interacting with a previous generation of LLM. The issues and rough edges I encountered were all very typical of the models we tested over a year ago. Eventually we did converge on a working application, but it took a lot of work to get there.

Experiment #4: Google - Gemini 2.5 Pro

Process

  • A new chat prompt was opened with the Gemini 2.5 Pro model selected.
  • The prompt was entered into the model.
[Process screenshots: Gemini 2.5 Pro output 1-15]

Iteration #1

  • The Terraform configurations were copied to their respective files.
  • terraform init and terraform apply -auto-approve were run per model instructions.
  • The commands to build and push the Docker image were run successfully.
  • The outputs were added to the outputs.tf file.
  • Per instructions, terraform apply -auto-approve was run successfully.
  • The ecs service was updated via aws ecs update-service --cluster $(terraform output -raw ecs_cluster_name) --service $(terraform output -raw ecs_service_name) --force-new-deployment --region
  • The app successfully returned "Hello World", and the DynamoDB table was populated with items.

Notes

  • This is the first time testing the Gemini 2.5 model.

Issues

  • This is more of a nit, but there was more manual work for the user compared to the Claude Sonnet 4 model, which bundled all commands in a deployment script.

Results

The Gemini 2.5 model jumps out of the gate with a great showing, managing to produce a working configuration in one pass. It also produced the cleanest output, breaking the Terraform configuration down into multiple files that were easy to access via the chat UI. The configuration made good use of variables and interpolation as well; very impressive.

2025 Findings

Compared to the 2024 results, these models have significantly improved when it comes to generating valid, working IaC configurations. Three out of the four models seem to have largely eliminated the problems that plagued the earlier generations and converged on a working solution with much less user intervention required.

Picking a winner is tough; it's basically a tie between Claude Sonnet 4 and Google Gemini 2.5 Pro. The Sonnet 4 model generated an end-to-end deployment script that required very little work on the part of the user to arrive at a working solution. However, the Gemini model generated Terraform configurations that were much closer to canonical best practice (multiple files). Compared to these two, the Mistral Le Chat Pro model was a disappointment. It eventually produced a working configuration, but it really felt like interacting with a previous generation model. It made the same mistakes as the previous models, and it required far more work on the part of the user to produce a working solution.

Compared to last year, I think the takeaway is that we are much closer to being able to depend on LLMs to generate full, possibly production-ready IaC configurations. While these tests were still based on a very simple infrastructure abstraction, we also must consider that our prompt provided a bare minimum of context, and the explicit goal was to arrive at a working solution with as little user input as possible, not necessarily to generate a scalable, enterprise-ready configuration. With what we now know about prompting and context, it's much more likely that an engineer could deploy a model within an engineering team, provide it with the necessary prompts and context (like existing code samples), and have it generate production-ready configurations.