Using LLMs to Generate Terraform Code
Mike Vanbuskirk
On this page
- Introduction
- What is a Large Language Model (LLM)?
- LLMs in the Development Workflow
- Chat Interface/Prompting
- Advantages
- Disadvantages
- IDE Extension
- Advantages
- Disadvantages
- Agent-Driven
- Advantages
- Disadvantages
- What about Infrastructure as Code?
- Testing Methodology
- Prompt
- Application Code and Dockerfile
- Generating Infrastructure-as-Code with LLMs
- Experiment #1: OpenAI - ChatGPT 4
- Experiment #2: Anthropic - Claude 3 Opus
- Experiment #3: Mistral - Large
- Initial Findings
- Conclusion
Introduction
Infrastructure as Code (IaC) has revolutionized the way technology organizations manage and provision software infrastructure resources, enabling higher standards for reproducibility and scalability. By treating infrastructure configurations as code, teams can version, test, and deploy infrastructure using the same practices and tools used for software development. This shift has greatly improved the efficiency and reliability of infrastructure management.
More recently, generative AI and large language models (LLMs) have also had a transformative impact across the technology industry. These powerful tools have been widely adopted in development workflows, assisting with tasks such as code generation, refactoring, and documentation. LLMs have demonstrated remarkable capabilities in understanding and generating human-readable text, including programming languages and domain-specific languages.
Given the success of LLMs in aiding software development, it’s natural to ask: can these models be effectively applied to generating IaC configurations? Terraform, and its open-source fork OpenTofu, are considered industry standards for implementing IaC. They provide a declarative language for defining and managing infrastructure resources across various cloud providers and services.
In this article, we will explore example workflows that showcase the use of LLMs in generating IaC configurations using Terraform. Primarily, we'll be looking to answer a simple question: can these tools reliably generate basic, working Infrastructure-as-Code configurations, and what implications might that have for broader usage in production environments?
What is a Large Language Model (LLM)?
A Large Language Model (LLM) is an artificial intelligence model or program that generates text by “predicting” the next token in a sequence. These models are trained on massive datasets, requiring substantial parallel computing resources, such as GPUs, to process and learn from the vast amounts of text data.
LLMs utilize a type of neural network architecture called a transformer model, which was introduced in the influential paper “Attention is All You Need”.
While LLMs are commonly associated with generating human-readable text, their capabilities extend to code generation as well. Code, at its core, is a form of text that follows specific syntactic and semantic rules. Several LLMs have been trained on large codebases, allowing them to understand and generate code in various programming languages.
Moreover, LLMs can be fine-tuned for specific code generation tasks. By training on domain-specific code datasets, such as Terraform configurations or Python scripts, LLMs can learn the common patterns and best practices associated with a particular programming language or framework. This fine-tuning process enhances the model’s ability to generate syntactically correct and semantically meaningful code snippets.
LLMs in the Development Workflow
From chat interfaces to IDE plugins and agent-driven systems, LLMs are being leveraged in various ways to enhance developer productivity and improve the coding process. In this section, we will explore some of the most common and emerging workflows for utilizing LLMs in the development lifecycle.
One of the most popular workflows for utilizing LLMs in the development process is through a chat interface. Developers can interact with the LLM by providing prompts or instructions, such as “ChatGPT, you are a developer…” followed by a specific code generation request. The LLM then generates code snippets or functions based on the given prompt.
This interactive approach allows developers to quickly prototype ideas, generate boilerplate code, or explore different implementations, and it has already saved developers non-trivial time and effort when writing repetitive or complex code structures.
However, it’s important to note that the current workflows and methods for integrating LLMs into the development process are not exhaustive. The field of AI-assisted development is rapidly evolving, with new techniques and approaches being developed and iterated upon every day. As we explore further, we’ll discover how LLMs are being seamlessly integrated into various aspects of the development workflow, from interactive coding sessions to intelligent code suggestions and even autonomous agent-driven development.
Chat Interface/Prompting
Many popular LLMs offer a multi-modal chat interface that supports text, images, and code. Examples include Claude (https://claude.ai), ChatGPT (https://chat.openai.com), and Mistral (https://mistral.ai/). These interfaces provide an intuitive and accessible way for developers to interact with LLMs and generate code.
The typical workflow involves the user “prompting” the model with a conversational text instruction, describing the desired code or functionality. The model then generates code based on the prompt, which the user can copy and paste into their development environment. This process allows developers to quickly prototype ideas and generate code snippets without writing everything from scratch.
However, the generated code may not always be perfect on the first attempt. If errors or issues arise, users can copy the problematic code back into the chat interface and ask the model for fixes or improvements. This iterative process continues until the desired outcome is achieved, with the model refining the code based on user feedback.
Advantages
The chat interface approach offers several advantages. It has a low barrier to entry, making it easy for developers to get started with AI-assisted coding. The models are generally powerful and capable of handling basic to intermediate level coding tasks. Additionally, the conversational nature of the interface makes it intuitive and user-friendly.
Disadvantages
On the other hand, there are some drawbacks to consider. The copy/paste workflow can be cumbersome and may break the context of the code, especially when dealing with larger codebases. Users are also limited by the training data cutoff of the specific model they are using, which may not have up-to-date information on a project's libraries and dependencies. Model selection is also restricted to whatever the chat interface provider offers, which is usually only a subset of its available models.
Furthermore, using a chat interface raises potential intellectual property (IP) and data privacy concerns, as the code and prompts are being shared with a third-party service. It can also be challenging to address complex, multi-system architectural issues through a chat interface, as the model may not have the full context of the entire codebase.
IDE Extension
LLMs trained specifically on code can provide suggestions and linting directly within an Integrated Development Environment (IDE). GitHub Copilot (https://github.com/features/copilot) is a prominent example of such an IDE extension. These extensions typically offer auto-complete suggestions based on the context of the code being written.
Another example is Cursor (https://cursor.sh/), a VSCode-based IDE that directly incorporates the OpenAI GPT-4 model. Cursor offers a chat interface similar to ChatGPT, allowing developers to interact with the model directly within the IDE. Cursor is unique in that it offers a couple of twists on the standard chat interface:
- Users can invoke the model in-context within an editing window, providing instructions to generate or modify code within the existing module.
- Users can also use this interface to query documentation and code from other libraries or files.
- Users can also open a separate chat pane, where they can engage with a more standard chat interface that is tailored to their development environment.
Advantages
The IDE extension approach offers several advantages. While it requires a slightly more involved technical setup compared to chat interfaces, with the need for a compatible IDE and extension, the benefits can be significant. Models like GitHub Copilot are trained exclusively on code, providing more targeted and relevant suggestions. Additionally, developers can receive assistance without leaving their IDE, maintaining the context of their work.
Cursor takes it a step further by providing the LLM with the full context of the codebase. This allows for more accurate and contextually relevant suggestions and fixes. The chat interface within the IDE enables developers to have a conversation with the model, asking for specific code snippets, explanations, or improvements.
Disadvantages
However, there are also some disadvantages to consider. GitHub Copilot, for example, is limited to providing suggestions and auto-complete functionality. It lacks the ability to generate larger code snippets or understand the broader context of the project. Additionally, the selection of models available as IDE extensions is currently limited, and developers have no control over fine-tuning the models to their specific needs. Both Cursor and Copilot also incur a subscription cost for use.
Agent-Driven
Agent-driven LLM models represent a newer approach to code generation, where autonomous models are given a high-level objective or directive and take multiple, intermediate steps to achieve the desired outcome. Unlike traditional chat interfaces or IDE extensions, agent-driven models have the potential to operate more independently and make decisions based on the given goal.
However, it’s important to note that agent-driven code generation is currently at the bleeding edge of AI research and development. As of now, there isn’t a fully productionized and widely available implementation of this approach. The concept is still in its early stages, with researchers and companies exploring different strategies and architectures.
One potential approach for agent-driven code generation is an autonomous agent that automatically submits pull requests. This agent would be responsible for understanding the project requirements, generating the necessary code changes, and proposing those changes through the standard pull request process. However, this is still a theoretical concept and has not been fully realized in practice.
There are some open-source initiatives and commercial offerings that are working towards agent-driven code generation. For example, AutoGPT (https://github.com/Significant-Gravitas/AutoGPT) is an open-source project that aims to create an autonomous agent for code generation. On the commercial side, Magic (https://magic.dev/) is also exploring the possibilities of agent-driven development.
A notable recent development in this field is the introduction of Devin by Cognition Labs (https://www.cognition-labs.com/introducing-devin). Devin has demonstrated impressive capabilities in agent-based development, showcasing the potential for autonomous code generation. However, it’s important to note that Devin is not yet generally available to the public.
Advantages
The potential advantages of agent-driven code generation are significant. It offers the possibility of highly autonomous and efficient code generation, as the agent can understand high-level objectives and make decisions accordingly. This approach reduces the need for manual intervention and enables developers to focus on higher-level tasks and strategic decisions. Additionally, agent-driven models open up possibilities for continuous integration and deployment, with agents automatically generating and proposing code changes.
Disadvantages
Despite the potential benefits, agent-driven code generation comes with several disadvantages. The technology is still in the experimental stage, with no widely available and production-ready implementations. Significant advancements in AI research and development are required to create reliable and safe agent-driven models. There are also potential security and trust concerns, as autonomous agents have more control over the codebase and development process. Adopting agent-driven code generation may require significant changes to existing development workflows and processes, which could be a challenge for organizations.
What about Infrastructure as Code?
While there are some IaC-specific AI tools in development, such as Klotho (in beta), these specialized tools are not yet widely available or fully commercialized. As a result, this article will primarily focus on the more accessible and prevalent approach of using generalized LLM models with chat interfaces for generating infrastructure as code (IaC).
Testing Methodology
Before testing LLMs for generating infrastructure as code, we need to establish some basic testing parameters and inputs. For this purpose, we will define a reasonably simple AWS application infrastructure:
- Application: Python/Flask running in a Docker container
- Compute: ECS Fargate
- Frontend/Proxy: Application Load Balancer (ALB)
- Database: DynamoDB
We’ll assume that foundational networking infrastructure is already in place, such as VPCs and subnets. We’ll be using three popular, commercially available LLM chat providers:
- OpenAI’s ChatGPT 4
- Anthropic’s Claude 3 Opus
- Mistral’s Large
For IaC, we’ll be using Hashicorp’s Terraform v1.5.6.
It’s important to note that evaluating the output of LLMs is often a subjective exercise, as the generated code is generally not deterministic. Therefore, we will not be using strict evaluation frameworks or scoring systems. Instead, we will provide each model with the same prompt and assess the quality and functionality of the generated code.
The configuration will be considered working if the application returns "Hello World" and writes a datetime string to the DynamoDB table when an HTTP GET request is made to the ALB's DNS name.
Prompt
The prompt given to each LLM will be as follows:
You are an expert Infrastructure as Code AI Engineer. Your task is to generate Terraform code for the following AWS application architecture:
- Use ECS Fargate to run the application container.
- Please create an ECR registry that the ECS service will have permission to pull images from.
- Configure an Application Load Balancer (ALB) to distribute traffic to the ECS tasks.
- Provision a DynamoDB table for data storage.
- The name of the DynamoDB table and the AWS region should be made available to the ECS task as environment variables.
- Set up the necessary IAM roles and permissions for the ECS tasks to access DynamoDB.
- The application will be a Dockerized Python app that runs Flask on port 5000, responds to GET requests with “Hello World”, and writes the datetime to the DynamoDB table.
- The user will provide the application code and Docker container.
- Please generate the commands needed to build the Docker image and push it to the ECR registry.
- The application frontend should be accessible from the public internet.
- Please output the ALB frontend address to enable the user to test the application.
The primary goal is to generate a minimum-viable Terraform configuration that requires little to no editing or intervention by a human user. The user should be able to generate valid Terraform plans and apply them assuming they have valid AWS credentials. Please call out any input or values that need changing by the user.
Application Code and Dockerfile
Since our focus is purely on the models' ability to generate IaC configurations, we'll provide the application code and Dockerfile.
Application Code:
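The exact application code used in the tests is not reproduced here; the following is a representative sketch of a Flask app that satisfies the prompt's requirements. The environment variable names and the table's partition key are assumptions for the example.

```python
# Representative sketch of the Flask application described in the prompt
# (not the exact code used in the tests). Assumes the DynamoDB table name
# and AWS region are provided to the ECS task as environment variables.
import os
from datetime import datetime, timezone

import boto3
from flask import Flask

app = Flask(__name__)

TABLE_NAME = os.environ["DYNAMODB_TABLE_NAME"]  # hypothetical variable name
AWS_REGION = os.environ["AWS_REGION"]

dynamodb = boto3.resource("dynamodb", region_name=AWS_REGION)
table = dynamodb.Table(TABLE_NAME)


@app.route("/", methods=["GET"])
def hello():
    # Record the time of the request, then respond.
    table.put_item(Item={
        "id": datetime.now(timezone.utc).isoformat(),  # assumes "id" is the partition key
    })
    return "Hello World"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```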
Dockerfile:
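Again, this is a representative sketch rather than the exact Dockerfile used in the tests; the base image and file names are assumptions.

```dockerfile
# Representative sketch of a Dockerfile for the Flask app above
# (not the exact file used in the tests).
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies.
RUN pip install --no-cache-dir flask boto3

COPY app.py .

EXPOSE 5000

CMD ["python", "app.py"]
```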
Generating Infrastructure-as-Code with LLMs
Experiment #1: OpenAI - ChatGPT 4
Process
- A new chat prompt was opened with the GPT-4 model selected.
- The prompt was entered into the model.
Iteration #1
- The provider and ECR Terraform configurations were placed into a single Terraform file: main.tf (a representative sketch follows this list).
- terraform init was run.
- terraform plan and terraform apply were run to create an ECR registry.
- The Docker application image was built, tagged, and pushed to the ECR registry.
- The remaining Terraform configurations were placed into main.tf.
- Placeholder values for vpc-id, subnet-ids, and the Docker image URL were updated with the correct values.
- The model was prompted to create the necessary security groups and network configuration.
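For illustration, a minimal sketch of what the provider and ECR portion of main.tf might look like is shown below. The repository name, region, and provider version are assumptions; the exact code each model generated differed.

```hcl
# Hypothetical sketch of the provider and ECR portion of main.tf; values are
# illustrative, not the exact output of any model tested here.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # replace with your target region
}

resource "aws_ecr_repository" "app" {
  name         = "flask-app" # illustrative repository name
  force_delete = true        # convenient for a disposable test environment
}
```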
Iteration #2
- A plan was generated using terraform plan.
- The plan was applied with terraform apply.
- Terraform reported back with an error saying the provided target group has a target type of instance, which is incompatible with the awsvpc network mode specified in the task definition (see the sketch after this list).
- The error message was provided to the model and the fix was applied.
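For context, Fargate tasks use the awsvpc network mode and must register with a target group whose target type is "ip". A minimal sketch of a compatible target group, with illustrative names and VPC reference, might look like this:

```hcl
# Hypothetical sketch: ALB target group compatible with Fargate tasks using
# the awsvpc network mode (targets are registered by IP, not instance ID).
resource "aws_lb_target_group" "app" {
  name        = "flask-app-tg"          # illustrative name
  port        = 5000
  protocol    = "HTTP"
  vpc_id      = "vpc-0123456789abcdef0" # replace with your VPC ID
  target_type = "ip"                    # required for awsvpc network mode
}
```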
Iteration #3
- terraform apply was run again.
- The ALB DNS name was used to test the application.
- Troubleshooting determined that the ECS task did not have the correct permissions to pull ECR images (see the sketch after this list).
- The model was prompted to fix the permissions.
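A common way to grant this permission, and roughly what the fix amounted to, is attaching the AWS-managed AmazonECSTaskExecutionRolePolicy to the task execution role. The sketch below uses illustrative names and is not the model's exact output.

```hcl
# Hypothetical sketch: an ECS task execution role with the AWS-managed policy
# that allows pulling images from ECR and writing container logs.
resource "aws_iam_role" "task_execution" {
  name = "flask-app-task-execution-role" # illustrative name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "task_execution_ecr" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```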
Iteration #4
- Configurations were updated and terraform apply was run again.
Notes
- The model did not specify module or folder structure in any way, and did not provide a complete file structure (expecting the user to copy/paste the generated snippets). It’s possible that it would provide this if prompted explicitly to do so.
Issues
- The generated configuration uses no variable definitions; all input values are hard-coded in specific resource declarations (a sketch of what parameterizing these values might look like follows this list).
- The model did not initially generate the security groups necessary for the configuration to work correctly, despite the prompt instructing the model to require minimum input or changes.
- Despite being asked only to correct the security group omission, the model revised the DynamoDB configuration, expanding the scope of the IAM policy for accessing the table.
- The target group was created with an incompatible target type for network mode in the task definition.
- The model did not create the correct IAM configuration to allow the ECS service to pull the container image.
- The security groups were not configured correctly to pass traffic between the ALB and ECS containers.
- The model did not understand the distinction between the ECS Task Execution Role and the ECS Task Role for assigning permissions.
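For comparison, the sketch below shows the kind of variable definitions the generated configuration lacked. The names and defaults are illustrative assumptions.

```hcl
# Hypothetical sketch: input variables that would replace the hard-coded
# values in the generated configuration.
variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1"
}

variable "vpc_id" {
  description = "ID of the existing VPC"
  type        = string
}

variable "subnet_ids" {
  description = "Existing subnets for the ALB and ECS tasks"
  type        = list(string)
}

variable "dynamodb_table_name" {
  description = "Name of the DynamoDB table"
  type        = string
  default     = "app-data"
}
```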
Results
It eventually worked. Even after correcting the initial errors and omissions within the configuration, the application still returned a 502 error when visiting the ALB endpoint. After further prompting for troubleshooting, the model suggested that the security groups may not be allowing traffic to pass between the ALB and the ECS tasks.
After updating the security group configuration, the ALB would pass traffic to the ECS task, but the application returned an error saying the task did not have the permissions necessary to perform PutItem on the DynamoDB table. After prompting the model for separate policies for the task and task execution roles and applying the changes, the application functioned correctly.
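To illustrate the distinction that tripped the model up: the task execution role is assumed by the ECS agent to pull the image and write logs, while the task role is assumed by the running application and should carry the DynamoDB permissions. A rough sketch of that split follows; the resource names, the table reference, and the execution role from the earlier sketch are assumptions rather than the model's output.

```hcl
# Hypothetical sketch: a dedicated task role scoped to PutItem on the app's
# DynamoDB table, referenced separately from the execution role in the task
# definition.
resource "aws_iam_role" "task" {
  name = "flask-app-task-role" # illustrative name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "task_dynamodb" {
  name = "dynamodb-put-item"
  role = aws_iam_role.task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["dynamodb:PutItem"]
      Resource = aws_dynamodb_table.app.arn # assumes a table resource named "app"
    }]
  })
}

resource "aws_ecs_task_definition" "app" {
  family                   = "flask-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.task_execution.arn # pulls the image, writes logs
  task_role_arn            = aws_iam_role.task.arn           # grants the app DynamoDB access
  container_definitions = jsonencode([{
    name         = "flask-app"
    image        = "123456789012.dkr.ecr.us-east-1.amazonaws.com/flask-app:latest" # replace
    portMappings = [{ containerPort = 5000 }]
  }])
}
```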
Experiment #2: Anthropic - Claude 3 Opus
Process
- A new chat prompt was opened with the Claude 3 Opus model selected.
- The prompt was entered into the model.
Iteration #1
- The provider and ECR Terraform configurations were placed into a single Terraform file: main.tf.
- terraform init was run.
- terraform plan and terraform apply were run to create an ECR registry.
- The Docker application image was built, tagged, and pushed to the ECR registry.
- The remaining Terraform configurations were placed into main.tf.
- Placeholder values for vpc-id, subnet-ids, the AWS region, and resource names were updated with the correct values.
- The model was prompted again to correct the lack of separate task and execution roles.
Iteration #2
- A plan was generated using terraform plan.
- The plan was applied with terraform apply.
- The ALB endpoint was tested.
Notes
- Like the ChatGPT output, the generated configuration uses no variable definitions; everything is hard-coded.
- As visible in the screenshots, Claude 3 makes use of syntax highlighting, which helps with readability.
- Claude also generated the entire configuration in one copyable block, as opposed to the multiple blocks generated by ChatGPT.
- The model added comments next to several resource name declarations, prompting the user to name them appropriately.
- The model correctly interpolated the ECR URL in the container definition.
- The model also anticipated how the ALB endpoint would be tested, prepending the output value with "http://" (see the sketch after this list).
- Unlike ChatGPT, the model assigned a default role for ECR access. While this may be broader than needed, it enabled the tasks to pull images.
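For reference, the two niceties noted above might look roughly like this in Terraform; the resource names are illustrative, not Claude's exact output.

```hcl
# Hypothetical sketch: referencing the generated ECR repository as the
# container image, and outputting the ALB address with an "http://" prefix
# so it can be tested directly.
locals {
  container_image = "${aws_ecr_repository.app.repository_url}:latest"
}

output "alb_url" {
  value = "http://${aws_lb.app.dns_name}"
}
```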
Issues
- No major difficulties or issues were encountered.
Results
After the updates to the IAM policies, the application worked successfully.
Experiment #3: Mistral - Large
Process
- A new chat prompt was opened with the Mistral Large model selected.
- The prompt was entered into the model.
Iteration #1
- The provider and ECR Terraform configurations were placed into a single Terraform file: main.tf.
- terraform init was run.
- terraform plan and terraform apply were run to create an ECR registry.
- The Docker application image was built, tagged, and pushed to the ECR registry.
- The remaining Terraform configurations were placed into main.tf.
- Placeholder values for vpc-id, subnet-ids, the AWS region, and resource names were updated with the correct values.
- A plan was generated using terraform plan.
- The plan was applied with terraform apply.
- The model was prompted to correct the task definition to include the Task Execution Role and Task Role ARNs.
Iteration #2
- A new plan was generated with terraform plan.
- The plan was applied with terraform apply.
- The target group was updated with a target type of ip.
- The model was prompted to fix an issue with security groups.
Notes
- Like the Claude 3 model, the output had syntax highlighting, and was provided as a single block.
- Unlike the other two models, Mistral defined an explicit health check (see the sketch after this list).
- The model correctly separated the functionality of the task and execution roles, placing the DynamoDB permissions on the task role and the ECR access on the execution role.
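An explicit health check of the kind Mistral produced might look roughly like the following, extending the target group sketch shown earlier; all values are illustrative assumptions.

```hcl
# Hypothetical sketch: target group with an explicit health check against the
# Flask app's root path.
resource "aws_lb_target_group" "app" {
  name        = "flask-app-tg"
  port        = 5000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id # assumes a vpc_id variable
  target_type = "ip"

  health_check {
    path                = "/"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```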
Issues
- As with the other two models, no variables were used; inputs such as resource names were hard-coded.
- The policy assigned to the Task Role was overly broad for the DynamoDB permissions needed.
- The environment variable for the AWS region referenced a Terraform variable that was never defined.
- The task definition was not configured with the ARN of the task execution role.
- Similar to the ChatGPT configuration, the target group had the wrong target type set.
- The configuration incorrectly assigned the same security group to both the ALB and the ECS service, preventing load balancing from working correctly (a corrected sketch follows this list).
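A corrected layout would use two security groups, with the ECS service accepting traffic only from the ALB. Here is a sketch under assumed names and an assumed vpc_id variable:

```hcl
# Hypothetical sketch: separate security groups for the ALB and the ECS
# service, allowing only the ALB to reach the tasks on the Flask port.
resource "aws_security_group" "alb" {
  name   = "flask-app-alb-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # public HTTP access to the ALB
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ecs_service" {
  name   = "flask-app-ecs-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 5000
    to_port         = 5000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id] # only the ALB may reach the tasks
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # outbound for ECR image pulls and DynamoDB calls
  }
}
```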
Results
Initially, the service responded with a 504 Gateway Timeout error. Troubleshooting and inspection of the generated configuration determined that the model had not assigned separate security groups to the ECS service and the ALB to allow traffic to pass between them. Once that was corrected, the application functioned correctly.
Initial Findings
LLMs certainly have the potential to be a game-changer for software development and infrastructure engineering; one clear observation is that each model could generate a significant amount of boilerplate code in a fraction of the time it would take a single engineer to write by hand.
If we had to pick a “winner”, it would be the Claude 3 Opus model from Anthropic. Although it still required additional prompting to generate the correct IAM configuration, it got us the closest to a working MVP with the least amount of additional prompting and troubleshooting.
However, one of the primary takeaways from these tests is that LLMs still have a long way to go before they are viable for generating full, production-ready infrastructure configurations.
The hypothetical stack we used to test each model was an extremely simple abstraction of what a production deployment would require, and even so, each model struggled with basic, critical implementation details needed to produce a working configuration. In most organizations, a far greater number of implementation details and integrations would be needed to make a software deployment functional in a multi-application architecture, and these models would not have access to the internal documentation or configuration data required to generate high-quality, functional Infrastructure-as-Code. Simply put: LLMs can provide a productivity boost, but engineering teams should not expect to depend on them fully for complex configurations or production-grade architecture.
Conclusion
Large Language Models (LLMs) have demonstrated their potential as powerful tools for generating infrastructure as code (IaC). The ability to generate IaC configurations using natural language prompts can significantly streamline the infrastructure provisioning process. However, it’s important to recognize that LLMs, in their current state, are not yet capable of fully handling production-ready IaC on their own.
While LLMs can generate functional IaC code snippets and templates, the generated code may require manual review, modification, and integration into existing infrastructure workflows. The complexity and variability of real-world infrastructure requirements often exceed the current capabilities of LLMs, necessitating human expertise and intervention.
Despite these limitations, the rapid progress in LLM development and the increasing adoption of IaC practices suggest a promising future for AI-assisted infrastructure provisioning. As LLMs continue to evolve and specialized IaC tools emerge, we can expect to see more robust and reliable solutions for generating production-ready IaC code.