May 18, 2025 · josh-pollara

Managing Terraform Modules at Scale

This is Part 2 of our Terraform Modules Series

If you're new to Terraform modules, start with Part 1: Terraform Modules - Organization and Scaling Best Practices, which covers the fundamentals of module creation, structure, and basic patterns.

Once teams start relying on Terraform modules, the next challenge is keeping them organized and maintainable at scale. Poorly structured modules lead to brittle infrastructure, painful updates, and duplication that defeats the whole purpose of modularization. A solid approach to module structure, versioning, and automation makes Terraform modules easier to use, safer to update, and more predictable across environments.

This post covers how to design modules for long-term maintainability, manage versioning, and introduce automation to keep infrastructure changes reliable and controlled.

Module Organization and Development

Key Principle

Building maintainable infrastructure modules requires more than just working code. Teams that succeed with modules think carefully about structure, interfaces, and maintenance patterns that serve them long-term.

Focus on Specific Problems

Successful modules solve specific infrastructure problems - whether that's deploying an application stack, setting up networking components, or managing database infrastructure. When modules try to handle too many concerns, they become brittle and prone to breaking changes. Rather than creating a monolithic module for application deployment, separate your concerns into focused modules for container infrastructure, database management, and monitoring components.
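
One possible repository layout for this split (directory names are illustrative):

modules/
├── container-infrastructure/   # Services, task definitions, load balancing
├── database/                   # Instances, parameter groups, backups
└── monitoring/                 # Alarms, dashboards, log groups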

Designing the Module Interface

Your module's interface shapes how "users" (other engineers) interact with your infrastructure components. A well-designed interface exposes enough flexibility for the use cases of multiple engineering teams while hiding unnecessary complexity. Consider this application deployment module interface:

variable "service_name" {  
  type        = string  
  description = "Name of the service being deployed"  
}

variable "container_image" {  
  type        = string  
  description = "Docker image to deploy (format: repository/image:tag)"  
}

variable "scaling_config" {  
  type = object({  
    min_instances     = number  
    max_instances     = number  
    target_cpu_util   = number  
    target_mem_util   = number  
  })  
  description = "Service autoscaling configuration"  
   
  validation {  
    condition     = var.scaling_config.min_instances <= var.scaling_config.max_instances  
    error_message = "Minimum instance count must not exceed maximum instance count"  
  }  
}  

The module exposes sensible inputs like service name, container image, and min/max instances, while abstracting away the implementation details of all the various AWS services needed to make the stack work.

Working with Complex Data Structures

As your environment matures and scales, you'll often need to handle more complex infrastructure deployments. The object type provides a simple data structure for grouping related variables, which helps with readability. This approach drives module maintainability and helps module consumers understand how different options relate to each other:

variable "database_config" {
  type = object({
    engine_version         = string
    instance_class         = string
    allocated_storage      = number
    backup_retention       = number
    multi_az              = bool
    parameter_group_settings = map(string)
  })
  description = "Database configuration settings"
}

Object grouping can also provide a level of modularity: combined with HCL's expressions and built-in functions, a single configuration structure can be reused across multiple environments.
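
As a minimal sketch of that reuse (the environment names and values here are illustrative), a map of same-shaped configuration objects can drive every environment through one lookup:

locals {
  # One configuration object per environment, all sharing the same shape
  database_configs = {
    dev = {
      instance_class    = "db.r5.large"
      allocated_storage = 100
      backup_retention  = 7
      multi_az          = false
    }
    prod = {
      instance_class    = "db.r5.xlarge"
      allocated_storage = 500
      backup_retention  = 30
      multi_az          = true
    }
  }

  # Select the active configuration based on the target environment
  database_config = local.database_configs[var.environment]
}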

Managing Dependencies

As the infrastructure grows, so too can the potential graph of module dependencies. While Terraform handles basic dependency ordering through its resource graph, more complex modules often require explicit dependency management. Consider a module that sets up both a database and its monitoring infrastructure:

resource "aws_cloudwatch_metric_alarm" "database_cpu" {
  alarm_name          = "${var.identifier}-cpu-utilization"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = "300"
  statistic           = "Average"
  threshold           = "80"
 
  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.id
  }
 
  depends_on = [aws_db_instance.main]
}

Conditional Resource Creation

Conditional resource creation allows modules to adapt to different use-cases without having to maintain separate codebases. Rather than creating environment-specific modules, use variables and conditions to handle these variations:

locals {
  is_production = var.environment == "prod"
  backup_retention = local.is_production ? 30 : 7
 
  monitoring_config = local.is_production ? {
    detailed_monitoring = true
    alarm_thresholds = {
      cpu    = 70
      memory = 80
    }
  } : {
    detailed_monitoring = false
    alarm_thresholds = {
      cpu    = 85
      memory = 90
    }
  }
}
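
The same locals can then control whether a resource exists at all. A minimal sketch using count (the dashboard resource is illustrative):

# Only create a detailed dashboard for production deployments
resource "aws_cloudwatch_dashboard" "detailed" {
  count = local.monitoring_config.detailed_monitoring ? 1 : 0

  dashboard_name = "${var.service_name}-detailed"
  dashboard_body = jsonencode({
    widgets = [] # Widget definitions omitted for brevity
  })
}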

Provider Configuration Strategy

A deliberate provider configuration strategy means anticipating how your module will be used across different environments and accounts. For most modules, inheriting provider configuration from the calling code provides the most flexibility:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0.0"
      configuration_aliases = [aws.primary, aws.replica]
    }
  }
}

This approach allows module consumers to use your module across different regions or with different authentication methods, while still enabling modules to specify version constraints and required provider features.
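
On the consumer side, the calling configuration maps its own provider configurations onto the module's declared aliases; a sketch assuming a database module that uses the aliases above:

provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "replica"
  region = "us-west-2"
}

module "database" {
  source = "./modules/database"

  # Satisfy the module's configuration_aliases
  providers = {
    aws.primary = aws.primary
    aws.replica = aws.replica
  }
}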

Testing Infrastructure Modules

Testing Considerations

While unit tests work well for variable validation and simple logic, infrastructure modules often require integration testing to verify they work as expected.

This example shows a basic testing fixture using Terratest:

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestApplicationModule(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../examples/complete",
        Vars: map[string]interface{}{
            "environment":  "test",
            "service_name": "test-service",
        },
    }

    // Tear down all test infrastructure when the test finishes
    defer terraform.Destroy(t, terraformOptions)

    // Deploy the example configuration and fail the test on any error
    terraform.InitAndApply(t, terraformOptions)
}

Implement a testing strategy that includes both quick-running validation tests and more comprehensive integration tests; expect the full test suite for a reasonably complex module to take real time, since it must deploy and tear down actual infrastructure. This layering helps catch issues early in development while still ensuring your modules work correctly in real-world scenarios.
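
For the quick-running validation layer, Terraform 1.6+ includes a native test framework that can exercise variable validation with plan-only runs. A minimal sketch targeting the scaling_config variable from earlier (file name and values are illustrative):

# tests/validation.tftest.hcl
run "rejects_inverted_scaling_bounds" {
  command = plan

  variables {
    service_name    = "test-service"
    container_image = "example/app:latest"
    scaling_config = {
      min_instances   = 5
      max_instances   = 2 # Invalid: less than min_instances
      target_cpu_util = 70
      target_mem_util = 75
    }
  }

  # Expect the plan to fail the variable's validation block
  expect_failures = [var.scaling_config]
}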

Module Lifecycle Management

Critical Insight

Deploying a module is just the beginning. Infrastructure modules often support production workloads for years, evolving with the applications that run on them, new features, and shifts in cloud APIs and services.

Version Control for Infrastructure

Version control works differently for infrastructure code than application code. While both use semantic versioning to signal compatibility changes, the impact of breaking changes differs significantly.

Application Code

Can be tested in isolation and rolled back quickly if issues arise

Infrastructure Code

Directly alters the state of live systems - rollback might mean reversing encryption, storage, or network changes

This makes semantic versioning particularly important for infrastructure modules. Teams need to thoroughly evaluate version updates before applying them, understanding that version 1.3.0 brings new optional features they can safely test, while 2.0.0 signals changes that will modify their underlying infrastructure.

Handling Breaking Changes

Major version changes need particular attention, especially for modules managing stateful resources like databases or persistent storage. Adding an optional variable with a sensible default preserves backward compatibility:

variable "encryption_config" {
  type = object({
    enabled = bool
    kms_key_id = optional(string)
  })
  default = {
    enabled = true
    kms_key_id = null
  }
  description = "Database encryption configuration"
}

However, removing variables or changing core resource configurations warrants a major version bump. Take a database module that starts enforcing encryption at rest - this change impacts existing databases and needs coordination with every team using the module.

Version Pinning Strategy

Non-trivial changes to core infrastructure modules are where git-hosted and private registry-hosted modules really shine: consumers can pin their configurations to a known-working version and integrate new releases at their own convenience.
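
A sketch of both pinning styles (sources and versions are illustrative):

# Git-hosted module pinned to a release tag
module "app_cluster" {
  source = "git::https://github.com/example-org/terraform-modules.git//app-cluster?ref=v1.3.0"
}

# Private registry module pinned with a version constraint
module "database" {
  source  = "app.terraform.io/example-org/database/aws"
  version = "~> 2.1"
}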

Validation and Guardrails

Validation becomes increasingly important as modules scale and the number of module users grows. While initial module development might focus on core functionality, production modules need guardrails to prevent misconfigurations. Adding validation rules helps catch issues before they result in a failed deployment:

variable "database_config" {
  type = object({
    instance_class     = string
    allocated_storage  = number
    backup_retention   = number
  })
  validation {
    condition     = contains(["db.r5.large", "db.r5.xlarge", "db.r5.2xlarge"], var.database_config.instance_class)
    error_message = "Instance class must be one of the approved r5 types for production workloads"
  }
  validation {
    condition     = var.database_config.allocated_storage >= 100
    error_message = "Production databases must have at least 100GB of allocated storage"
  }
  validation {
    condition     = var.database_config.backup_retention >= 14
    error_message = "Production databases require minimum 14-day backup retention"
  }
}

These validations enforce organizational standards and prevent common misconfigurations. Running the module with non-compliant values fails fast during the plan phase, before any infrastructure changes occur. By combining variable validation with resource-level constraints, modules can ensure infrastructure stays within defined operational bounds.

Provider Updates and Compatibility

Provider updates can introduce their own challenges. New versions might deprecate certain configurations or add required fields, which manifest as breaking changes. Regular testing against provider updates helps catch these changes early, giving engineering teams time to adapt. Document provider compatibility explicitly in your module's README. Some teams maintain separate module versions targeting different provider versions when breaking changes are unavoidable.

Documentation Best Practices

Documentation often falls behind as modules change. Beyond updating basic usage examples, maintain a detailed changelog that helps module users understand and adopt new features effectively. Include context about why changes were made:

## [2.0.0] - 2024-02-15
### Breaking Changes
- Removed support for db.t2 instance types
  - AWS is retiring these instances
  - Existing deployments should migrate to t3/t4g equivalents
  - Performance improvements justify the change
### Added
- Support for RDS Blue/Green deployments
  - Enables zero-downtime major version upgrades
  - Configurable cutover windows
  - Automatic rollback on failed upgrades

Automation Tip

Unfortunately, manual documentation updates inevitably happen on a "best-effort" basis, and as time goes on that effort approaches zero. Automated updates are the best way to ensure meaningful documentation ships alongside module configurations.

Depending on the version control provider, there are options available for automated release notes. For modules specifically, there is the excellent terraform-docs tool that has been a mainstay of the Terraform ecosystem for many years. Typically, the tool is run as part of a pre-commit workflow, ensuring that any module changes are registered in the README.
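
As a sketch, a typical invocation injects a generated inputs/outputs table directly into the module's README (check the terraform-docs documentation for the flags supported by your version):

# Generate and inject module documentation into README.md
terraform-docs markdown table --output-file README.md --output-mode inject .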

Secure Database Modules with Ephemeral Secrets

Security Challenge

Database credentials present a particular challenge for infrastructure modules. They're highly sensitive yet essential for both provisioning and application access. The traditional approach of storing these credentials in state files creates security risks that grow with your infrastructure.

Using Infisical's ephemeral resources pattern with Terraform v1.10+ provides a more secure approach:

# A database module that uses ephemeral secrets  
resource "aws_db_instance" "main" {  
  identifier      = var.name  
  engine          = var.engine  
  instance_class  = var.instance_type  
    
  # Use credentials from the module inputs  
  username        = var.credentials.username  
  password        = var.credentials.password  
    
  # Other configuration...  
}

# In the root module  
module "production_db" {  
  source = "./modules/database"  
    
  name          = "production-db"  
  engine        = "postgres"  
  instance_type = "db.r5.large"  
    
  credentials = {  
    username = local.db_credentials.username  
    password = local.db_credentials.password  
  }  
}

# Fetch the database credentials ephemerally  
ephemeral "infisical_secret" "db_creds" {  
  name         = "PROD_DB_CREDENTIALS"   
  env_slug     = "production"  
  workspace_id = var.infisical_workspace_id  
  folder_path  = "/database"  
}

# Decode the JSON secret into usable values  
locals {  
  db_credentials = jsondecode(ephemeral.infisical_secret.db_creds.value)  
}  
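
One caveat: ephemeral values can only be used in ephemeral or write-only contexts, so the module's credentials input must itself be declared ephemeral (Terraform 1.10+), and persisting the password into a managed resource generally requires a write-only argument where the provider offers one (for example, newer AWS provider versions expose password_wo on aws_db_instance). A sketch of the variable declaration:

variable "credentials" {
  type = object({
    username = string
    password = string
  })
  description = "Database credentials supplied from an ephemeral source"
  sensitive   = true
  ephemeral   = true # Requires Terraform 1.10 or later
}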

Benefits of This Approach

  1. Platform teams manage infrastructure without needing to see raw credentials
  2. Security teams can rotate credentials in Infisical without requiring infrastructure redeployments
  3. Application teams access only the credentials they need

Organizations can extend this pattern to handle other sensitive resources like API keys, certificates, and connection strings. In each case, the ephemeral approach keeps sensitive data out of your Terraform state while facilitating secure access when needed.

Building Secure and Maintainable Terraform Modules

Using Terraform modules makes infrastructure more reusable and scalable, but without a solid approach to versioning, security, and maintenance, they can become a source of tech debt. Managing changes across environments, handling module updates safely, and enforcing security best practices are critical for long-term stability.

The Power of Moved Blocks

The moved block in Terraform and OpenTofu has transformed how we handle module refactoring. Previously, reorganizing resources within a module forced users to either manually update state or risk resource recreation. Now, module authors can include moved blocks to guide the migration:

moved {
  from = aws_db_instance.database
  to   = aws_db_instance.primary
}

moved {
  from = aws_db_parameter_group.params
  to   = module.parameters.aws_db_parameter_group.custom
}

For Module Authors

Restructure modules, split large modules into smaller ones, or reorganize resources for clarity while providing a clean upgrade path.

For Module Consumers

Adopt changes without manually managing state or risking downtime from resource recreation.

Migrating to a Modular Terraform/OpenTofu Architecture

Signs You Need Modules

  • Copy-pasted Terraform configurations accumulating across repositories
  • Small differences between environments multiplying into significant maintenance overhead
  • Security policies and best practices becoming harder to enforce consistently

Planning Your Migration

Start by mapping your infrastructure's current state. Which resources are typically created together? What patterns repeat across services? Common patterns often emerge around networking (VPCs, subnets, security groups), application deployments (ECS services, task definitions, IAM roles), and data layers (RDS instances, parameter groups, monitoring). These patterns form natural boundaries for your initial modules.

State Management During Migration

State management forms the foundation of a safe migration. Terraform and OpenTofu track resources through unique identifiers in the state file. Moving these resources into modules means updating these references without disrupting your running infrastructure. Always start with a state backup:

# Backup state before migration  
terraform state pull > terraform.backup.tfstate

# Move resources into the new module structure  
terraform state mv 'aws_security_group.app' 'module.app_cluster.aws_security_group.app'  
terraform state mv 'aws_ecs_service.app' 'module.app_cluster.aws_ecs_service.app'  
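
After each batch of moves, run a plan to confirm the refactor was clean; -detailed-exitcode returns 0 only when no changes are planned:

# Verify the moves produced no planned infrastructure changes
terraform plan -detailed-exitcode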

Warning

Watch out for state locks during migrations, especially in situations where multiple engineers deploy infrastructure. Set up a communication channel to coordinate state operations and consider scheduling migrations during quieter periods.

Migration Best Practices

Proven Migration Strategy

Start Small: Pick one well-understood piece of infrastructure for your first module

Document Dependencies: Many teams discover hidden dependencies during migration

Match Exactly First: For stateful resources, match your existing configuration exactly in the new module

Run Hybrid: Keep original Terraform code alongside new modules during transition

Establish clear workflows for module development and testing. Who reviews module changes? How do you test modules across different environments? What's the process for rolling out module updates to dependent services? Having these workflows in place makes module adoption smoother and helps prevent deployment conflicts.

Advanced Module Usage and Optimization

Production infrastructure modules often need to handle complex scenarios beyond simple resource creation. While basic modules might work with static configurations, real-world infrastructure typically involves dynamic scaling, cross-provider orchestration, and careful performance tuning.

Dynamic Resource Creation

Dynamic inputs and nested data structures give modules the flexibility to adapt functionality to the user. Instead of hardcoding values, you can build modules that scale based on input parameters:

locals {  
  # Transform list of environments into map for easier lookup  
  environment_config = {  
    for env in var.environments : env.name => env  
  }  
}

resource "aws_security_group" "service" {  
  for_each = local.environment_config  
   
  name        = "${var.service_name}-${each.key}"  
  description = "Security group for ${var.service_name} in ${each.key}"  
  vpc_id      = each.value.vpc_id

  dynamic "ingress" {  
    for_each = each.value.allowed_ports  
    content {  
      from_port   = ingress.value  
      to_port     = ingress.value  
      protocol    = "tcp"  
      cidr_blocks = each.value.allowed_cidrs  
    }  
  }  
}  

This pattern lets you create environment-specific security groups from a single configuration, reducing code duplication while maintaining flexibility. The for_each expression creates distinct resources for each environment, while dynamic blocks handle variable port configurations.
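
The input this example assumes is a list of environment objects; a sketch of a matching variable declaration:

variable "environments" {
  type = list(object({
    name          = string
    vpc_id        = string
    allowed_ports = list(number)
    allowed_cidrs = list(string)
  }))
  description = "Per-environment network settings for the service"
}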

Performance Optimization

Performance Tip

Large modules with many resources need careful performance optimization. A module deploying dozens of microservices might hit API rate limits if all resources try to create simultaneously. Balance parallel creation with dependency management, use explicit `depends_on` only when necessary, and consider splitting large modules into focused components.
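
One concrete lever is Terraform's parallelism setting, which caps how many resource operations run concurrently (the default is 10):

# Reduce concurrency to stay under provider API rate limits
terraform apply -parallelism=5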

Multi-Region and Multi-Account Deployments

Managing infrastructure across multiple regions or accounts adds another layer of complexity. You might need different provider configurations for each region:

provider "aws" {  
  alias  = "us_east_1"  
  region = "us-east-1"  
}

provider "aws" {  
  alias  = "us_west_2"  
  region = "us-west-2"  
}

module "primary_cluster" {  
  source = "./modules/eks-cluster"  
  providers = {  
    aws = aws.us_east_1  
  }  
  # Primary region configuration  
}

module "dr_cluster" {  
  source = "./modules/eks-cluster"  
  providers = {  
    aws = aws.us_west_2  
  }  
  # DR region configuration  
}  

Cross-Provider Orchestration

Cross-provider modules can orchestrate resources across different services. A complete application deployment might involve AWS for compute, Cloudflare for DNS, and Datadog for monitoring:

terraform {  
  required_providers {  
    aws = {  
      source  = "hashicorp/aws"  
      version = "~> 4.0"  
    }  
    cloudflare = {  
      source  = "cloudflare/cloudflare"  
      version = "~> 3.0"  
    }  
    datadog = {  
      source  = "DataDog/datadog"  
      version = "~> 3.0"  
    }  
  }  
}  

Error Handling and Retry Logic

Error handling matters once modules reach a certain level of complexity. Resources might fail to create, API calls might time out, or external services might be temporarily unavailable. Building in retry logic helps handle these scenarios:

locals {
  max_retries    = 3
  retry_interval = 30
}

resource "null_resource" "retry_example" {
  provisioner "local-exec" {
    # Indented heredoc (<<-) lets the closing EOF marker be indented
    command = <<-EOF
      attempt=0
      until [ $attempt -ge ${local.max_retries} ]; do
        if aws command-here; then
          exit 0
        fi
        attempt=$((attempt+1))
        sleep ${local.retry_interval}
      done
      exit 1
    EOF
  }
}

Long-running operations or external resource changes need special handling. Using null_resource with triggers lets you respond to infrastructure changes:

resource "null_resource" "deployment_check" {  
  triggers = {  
    cluster_endpoint = aws_eks_cluster.main.endpoint  
    config_version   = var.config_version  
  }

  provisioner "local-exec" {  
    command = "scripts/validate-deployment.sh ${aws_eks_cluster.main.endpoint}"  
  }  
}  

Environment-Specific Secrets in Multi-Environment Modules

When modules deploy similar infrastructure across development, staging, and production environments, managing environment-specific secrets becomes another challenge. Different environments need different credentials, but you want to maintain a consistent module interface.

Clean Secret Management Pattern

Infisical solves this elegantly by allowing your module consumers to fetch environment-specific credentials based on deployment context. With Terraform v1.10+, you can use ephemeral resources to ensure these secrets never persist in state files.

module "api_service" {
  source = "./modules/api_service"
  # Module configuration...
  
  # Environment-specific secrets from Infisical
  database_url = local.api_credentials.database_url
  api_keys = {
    stripe = local.api_credentials.stripe_key
    sentry = local.api_credentials.sentry_dsn
  }
}

# Fetch secrets ephemerally (never persisted in state)
ephemeral "infisical_secret" "api_credentials" {
  name         = "API_CREDENTIALS"
  env_slug     = terraform.workspace  # or any environment identifier
  workspace_id = var.infisical_workspace_id
  folder_path  = "/api/credentials"
}

# Decode the JSON secret into usable values
locals {
  api_credentials = jsondecode(ephemeral.infisical_secret.api_credentials.value)
}

This pattern keeps your module interface clean - it doesn't need separate variables for dev, staging, and prod credentials. Instead, the root module fetches the appropriate secrets based on context, and your module simply consumes what it's given.

Access Control Benefits

Hidden Value

What's less obvious but equally valuable is how this approach simplifies access control. Security teams can restrict production credential access without affecting development teams' ability to deploy infrastructure. Developers can modify dev environment secrets in Infisical without needing production access, while the underlying Terraform code remains identical across environments.

As teams grow and security requirements become more stringent, this separation becomes increasingly important. Modules that follow this pattern scale better across organizational boundaries compared to approaches that embed environment-specific credential logic within the modules themselves.

Conclusion

Modules change how teams manage infrastructure, but they are not set-and-forget. Without versioning, testing, and security controls, they can create more risk than they solve. A solid module strategy ensures infrastructure remains predictable, secure, and easy to manage as it grows.

Terrateam makes it easier to enforce these best practices by integrating policy enforcement, automated plan and apply checks, and approval workflows directly into GitHub. Versioning and automated testing help teams roll out changes safely. Features like moved blocks make refactoring easier without breaking running infrastructure.

Not everything belongs in a module. Standardizing common infrastructure patterns improves maintainability, but forcing everything into a module adds unnecessary complexity. The goal is not just reuse. It is making sure teams can manage infrastructure safely and efficiently over time.