October 24, 2025 · josh-pllara

Scaling Terraform with Terragrunt | Multi-Environment Management

What you'll learn: How to scale Terraform to enterprise complexity using Terragrunt and build production-ready infrastructure across multiple AWS accounts with proper orchestration. Explore Terragrunt's hierarchical approach for DRY configurations, implement CI/CD automation with GitHub Actions that respect dependencies, and learn practical troubleshooting techniques for common issues.

Managing hundreds of Terraform modules across multiple AWS accounts starts as organized infrastructure-as-code. Six months later, it becomes a maze of configuration drift and copy-paste errors.

Your production backend configuration accidentally points to the staging state file because someone copied the wrong block. Development deploys with production's database password because variables were duplicated across thirty files. The VPC module fails because someone forgot to deploy the security groups first.

Terraform doesn't allow variables in backend configurations, meaning every module needs its own hardcoded version. Every environment requires you to copy the same variable definitions, hoping you update them all consistently. And because dependency management is manual, you have to run terraform apply in precisely the right order across dozens of directories.

Terragrunt solves these problems by adding the orchestration layer that Terraform lacks.

It eliminates repetition through configuration inheritance, automates dependency management, and enables infrastructure-as-code at scale. Instead of managing hundreds of similar configurations, you manage one template that propagates to every module.

What is Terragrunt?

Terragrunt is a thin wrapper around Terraform that adds the orchestration capabilities Terraform lacks for enterprise-scale deployments. While Terraform provides the building blocks for infrastructure as code, Terragrunt supplies the blueprint for assembling those blocks into manageable systems.

Terraform excels at managing individual pieces of infrastructure, but when you're juggling dozens of environments across multiple AWS accounts, you need something to coordinate the complexity.

The gap it fills

Native Terraform forces you to repeat yourself constantly. Every module needs its own backend configuration, and every environment requires you to copy and paste the same variable definitions. When your infrastructure grows to hundreds of modules, this repetition becomes a maintenance burden.

Here's what you face with pure Terraform at scale:

# In every module, you write this:
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/networking/vpc/terraform.tfstate"  # Hardcoded, error-prone
    region = "us-east-1"
  }
}

Terragrunt eliminates this repetition by letting you define configuration once and inherit it everywhere. Key features include:

  • Remote state configuration inheritance. Define your S3 backend once at the root, and every module automatically inherits it with dynamically generated state paths.
  • Variable management across environments. Set common variables at the root level, then override only what changes per environment.
  • Dependency orchestration. Tell Terragrunt that your app module depends on your database module, and it will handle the deployment order automatically.
  • Multi-module execution. Deploy your entire infrastructure with terragrunt run-all apply instead of running Terraform in 20 different directories.
  • Hooks for automation. Run scripts before or after Terraform commands, perfect for validation, notifications, or custom workflows.

When to consider Terragrunt

Start evaluating Terragrunt when you encounter these scenarios:

  • Managing 10+ similar environments that share most of their configuration
  • Orchestrating multi-account or multi-region AWS deployments
  • Finding configuration drift between environments
  • Needing standardized patterns across all your Terraform modules

If you're managing three VPCs that are 95% identical except for CIDR blocks and regions, you're ready for Terragrunt.

Terragrunt vs. Terraform: when to use them together

Pure Terraform works well for single environments or simple setups, but if you scale up to multiple environments, you'll hit limitations that Terraform alone can't overcome.

Pure Terraform limitations at scale

The most frustrating Terraform limitation is that backend configurations reject variables entirely. You must hardcode every value:

# Can't do this - Terraform will error:
terraform {
  backend "s3" {
    bucket = var.state_bucket
    key    = "${var.env}/terraform.tfstate"
  }
}

# Must do this instead:
terraform {
  backend "s3" {
    bucket = "my-terraform-state"  # Hardcoded
    key    = "prod/terraform.tfstate"  # Hardcoded
  }
}

This means copying backend blocks across every module, risking state file overwrites if you make a mistake.

Module sources face the same restriction. You can't deploy different module versions to different environments without separate code copies. Without native dependency management, you must manually run terraform apply in the correct order across dozens of modules. If you miss one dependency, your deployment fails partway through.

Where Terragrunt adds value

Terragrunt transforms these limitations into solved problems:

DRY backend configuration. Define your backend once in the root terragrunt.hcl:

remote_state {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "${path_relative_to_include()}/terraform.tfstate"  # Dynamic path
  }
}

Variable inheritance. Common variables live at the root, and environments specify only their differences:

# root terragrunt.hcl
inputs = {
  instance_type = "t3.medium"
  region        = "us-east-1"
}

# prod/terragrunt.hcl - override only what changes
include "root" {
  path = find_in_parent_folders()
}

inputs = {
  instance_type = "t3.xlarge"  # Bigger instances for prod
}

Module versioning. Each environment can pin a different version:

# prod/vpc/terragrunt.hcl
terraform {
  source = "git::https://github.com/myorg/modules.git//vpc?ref=v2.0.0"  # Prod on stable
}

# dev/vpc/terragrunt.hcl
terraform {
  source = "git::https://github.com/myorg/modules.git//vpc?ref=main"  # Dev tests latest
}

Dependency management. Declare dependencies explicitly:

dependency "vpc" {
  config_path = "../vpc"
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id  # Guaranteed correct order
}

Terragrunt vs. Terraform: A trade-off analysis and decision framework

Terragrunt adds another tool to learn and another layer to debug when things break. Your team will need training, and you're committing to Terragrunt's opinionated directory structure. But consider the alternative: untangling configuration drift, manually orchestrating deployments, and fixing copy-paste errors in backend configurations. The structure Terragrunt enforces helps prevent these issues.

You can decide how you want to implement your IaC setup by considering the following:

Terraform vs. Terragrunt: When to use each

Stay with pure Terraform when:

  • Managing single environments or proof-of-concepts
  • Your infrastructure rarely changes
  • The team is new to Infrastructure as Code

Adopt Terragrunt when:

  • Managing 3+ similar environments
  • Operating across multiple AWS accounts
  • Configuration management consumes significant team time
  • You need different module versions per environment

The tipping point is when you spend more time managing Terraform configurations than managing actual infrastructure.

How to use Terragrunt with Terraform

Let's look at a multi-account AWS infrastructure example using Terragrunt. We'll create a setup that manages networking, databases, and applications across development, staging, and production accounts.

Terragrunt uses a hierarchical structure that mirrors your infrastructure organization. Here's a production-ready layout:

infrastructure/
├── terragrunt.hcl                 # Root configuration
├── _envcommon/                    # Shared environment configs
│   ├── vpc.hcl
│   └── rds.hcl
├── dev/
│   ├── account.hcl               # Dev account settings
│   ├── vpc/
│   │   └── terragrunt.hcl
│   ├── rds/
│   │   └── terragrunt.hcl
│   └── eks/
│       └── terragrunt.hcl
├── staging/
│   ├── account.hcl
│   └── [same structure as dev]
└── prod/
    ├── account.hcl
    └── [same structure as dev]

Each terragrunt.hcl file contains only the unique configuration for that specific deployment. Everything else gets inherited. The _envcommon folder holds configurations that are shared across environments but vary by component type.
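
Adding an environment to this layout mostly means creating the same component folders with a thin terragrunt.hcl in each. Here is a minimal shell sketch, assuming the three components shown above (vpc, rds, eks). The function name scaffold_env and its arguments are hypothetical, and the generated stubs contain only the root include used throughout this article:

```shell
# Hypothetical helper: create <root>/<env>/{vpc,rds,eks}/terragrunt.hcl stubs.
# Existing files are left untouched, so the function is safe to re-run.
scaffold_env() {
  local root="$1" env="$2" component file
  for component in vpc rds eks; do
    file="$root/$env/$component/terragrunt.hcl"
    mkdir -p "$root/$env/$component"
    [ -e "$file" ] && continue
    cat > "$file" <<'EOF'
include "root" {
  path = find_in_parent_folders()
}
EOF
  done
}
```

Running scaffold_env infrastructure qa would create the qa folders. You would still need to add qa/account.hcl by hand, since the account ID and region are specific to each account.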

Creating the root configuration

Your root terragrunt.hcl is the single source of truth for backend configuration and shared variables:

# infrastructure/terragrunt.hcl
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  account_name = local.account_vars.locals.account_name
  aws_account_id = local.account_vars.locals.aws_account_id
  aws_region = local.account_vars.locals.aws_region
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket         = "terraform-state-${local.aws_account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-locks-${local.aws_account_id}"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"

  assume_role {
    role_arn = "arn:aws:iam::${local.aws_account_id}:role/TerraformExecutionRole"
  }

  default_tags {
    tags = {
      Environment = "${local.account_name}"
      ManagedBy   = "Terraform"
    }
  }
}
EOF
}

inputs = {
  project_name = "myapp"
  common_tags = {
    Project = "MyApp"
    Owner   = "Platform Team"
  }
}

This configuration automatically generates unique state paths for each module, creates backend configuration files, sets up provider configuration with account-specific roles, and applies consistent tagging.
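
To make the state-path behavior concrete, here is a shell sketch of what path_relative_to_include() contributes to the key: the module's path relative to the directory holding the root terragrunt.hcl. The function name state_key is hypothetical. It only mimics the string that ends up in the backend config and does not call Terragrunt:

```shell
# Hypothetical illustration: build the S3 state key from the root directory
# and a module directory, the way the root remote_state block does.
state_key() {
  local root="${1%/}" module="${2%/}"
  case "$module" in
    "$root"/*) printf '%s/terraform.tfstate\n' "${module#"$root"/}" ;;
    *) echo "module is not under root: $module" >&2; return 1 ;;
  esac
}

state_key infrastructure infrastructure/dev/vpc    # dev/vpc/terraform.tfstate
state_key infrastructure infrastructure/prod/rds   # prod/rds/terraform.tfstate
```

Because the key follows the directory layout, moving or renaming a module folder also changes where Terragrunt looks for its state.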

Environment-specific configurations

Each environment folder contains an account.hcl with account-specific settings:

# dev/account.hcl
locals {
  account_name   = "dev"
  aws_account_id = "123456789012"
  aws_region     = "us-east-1"
  instance_types = {
    web = "t3.small"
    db  = "db.t3.micro"
  }
}

Individual modules inherit and override as needed:

# dev/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  path = "${dirname(find_in_parent_folders())}/_envcommon/vpc.hcl"
}

inputs = {
  vpc_cidr = "10.0.0.0/16"  # Dev-specific CIDR
}

For production, you might use a different module version:

# prod/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=v5.1.0"
}

inputs = {
  vpc_cidr           = "10.10.0.0/16"
  enable_nat_gateway = true  # Production needs NAT
  enable_vpn_gateway = true  # Production needs VPN
}

Multi-account AWS pattern

Managing multiple AWS accounts requires careful role and permission setup. Each account has its own IAM role that Terragrunt assumes:

# prod/account.hcl
locals {
  account_name   = "prod"
  aws_account_id = "987654321098"
  aws_region     = "us-east-1"
}

# prod/rds/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"
}

inputs = {
  vpc_id     = dependency.vpc.outputs.vpc_id
  subnet_ids = dependency.vpc.outputs.database_subnets

  # Production-specific configuration
  instance_class          = "db.t3.large"
  allocated_storage       = 100
  backup_retention_period = 30
}

Dependency management in action

Dependencies ensure modules deploy in the correct order. Here's how to set up an EKS cluster that needs networking and database resources first:

# dev/eks/terragrunt.hcl
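include "root" {
  path = find_in_parent_folders()
}

# Locals from the root config are not inherited, so read account.hcl here
locals {
  account_vars   = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  account_name   = local.account_vars.locals.account_name
  instance_types = local.account_vars.locals.instance_types
}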

terraform {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git?ref=v19.0.0"
}

dependency "vpc" {
  config_path = "../vpc"

  # Default behavior: read real outputs, so this fails if the VPC isn't applied yet
  skip_outputs = false
}

dependency "rds" {
  config_path = "../rds"

  # Optional: Mock outputs for `terragrunt validate`
  mock_outputs = {
    db_endpoint = "mock-db.cluster-xyz.us-east-1.rds.amazonaws.com"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  cluster_name    = "${local.account_name}-eks-cluster"
  cluster_version = "1.28"

  vpc_id     = dependency.vpc.outputs.vpc_id
  subnet_ids = dependency.vpc.outputs.private_subnets

  # Attach the VPC's default security group to the cluster
  cluster_additional_security_group_ids = [dependency.vpc.outputs.default_security_group_id]

  node_groups = {
    main = {
      desired_size = local.account_name == "prod" ? 3 : 1
      max_size     = local.account_name == "prod" ? 10 : 3
      min_size     = local.account_name == "prod" ? 3 : 1

      instance_types = [local.instance_types.web]
    }
  }
}

When you run terragrunt run-all apply from the dev folder, Terragrunt deploys the VPC first, creates RDS in parallel with other independent resources, deploys EKS only after dependencies complete, and handles all state locking automatically. This orchestration happens without manual coordination.
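
The ordering Terragrunt derives from dependency blocks is a topological sort of the module graph. Here is a minimal sketch using the standard tsort utility, with the three dev modules written out by hand as "dependency dependent" pairs. It assumes dev/rds depends on the VPC the same way prod/rds does above. Terragrunt computes this graph itself, so the sketch only illustrates the resulting order:

```shell
# Each line is "<dependency> <dependent>", mirroring the dependency blocks
# shown for rds and eks. tsort prints the nodes in a valid apply order.
deploy_order() {
  tsort <<'EOF'
vpc rds
vpc eks
rds eks
EOF
}

deploy_order    # prints vpc, rds, eks (one per line)
```

To inspect the real graph rather than a hand-written one, terragrunt graph-dependencies prints it in DOT format.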

Running Terragrunt in a CI pipeline

Automating Terragrunt deployments through GitHub Actions eliminates manual errors and enforces consistent deployment patterns. Your pipeline handles authentication, dependency resolution, and multi-module deployments across different AWS accounts. Before implementing Terragrunt workflows, review these CI/CD best practices for foundational automation patterns.

GitHub Actions workflow setup

Structure your repository to match your deployment strategy. Each environment gets its own workflow trigger:

.github/
└── workflows/
    └── terragrunt.yml

infrastructure/
├── terragrunt.hcl
├── dev/
├── staging/
└── prod/

Branch protection rules enforce your deployment flow. Development deploys from feature branches, staging from the staging branch, and production only from main with required reviews.

Authentication and permissions

OIDC eliminates long-lived credentials. Each environment assumes a specific role with least-privilege permissions. Here's a complete pipeline:

name: Terragrunt Deploy

on:
  push:
    branches: ['main', 'dev']
    paths: ['infrastructure/**']
  pull_request:
    paths: ['infrastructure/**']

env:
  AWS_REGION: us-east-1
  TF_VERSION: 1.5.0
  TG_VERSION: 0.54.0

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Determine Environment
        id: env
        run: |
          if [[ "${{ github.ref_name }}" == "main" ]]; then
            echo "environment=prod" >> $GITHUB_OUTPUT
            echo "aws_account=${{ secrets.PROD_AWS_ACCOUNT }}" >> $GITHUB_OUTPUT
          else
            echo "environment=dev" >> $GITHUB_OUTPUT
            echo "aws_account=${{ secrets.DEV_AWS_ACCOUNT }}" >> $GITHUB_OUTPUT
          fi

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ steps.env.outputs.aws_account }}:role/GitHubActionsRole
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Setup Terragrunt
        run: |
          wget -q https://github.com/gruntwork-io/terragrunt/releases/download/v${TG_VERSION}/terragrunt_linux_amd64
          chmod +x terragrunt_linux_amd64
          sudo mv terragrunt_linux_amd64 /usr/local/bin/terragrunt

      - name: Terragrunt Plan
        id: plan
        run: |
          cd infrastructure/${{ steps.env.outputs.environment }}
          terragrunt run-all plan --terragrunt-non-interactive

      - name: Terragrunt Apply
        if: github.event_name == 'push'
        run: |
          cd infrastructure/${{ steps.env.outputs.environment }}
          terragrunt run-all apply --terragrunt-non-interactive --auto-approve

This workflow determines the environment based on the branch, assumes the correct AWS role using OIDC, runs terragrunt run-all plan for every pull request and push, and applies changes only on pushes to main or dev.
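
The branch-to-environment mapping is worth testing on its own, because it decides which AWS account a run touches. Here is a small sketch of the same logic as the "Determine Environment" step, pulled into a function (the name env_for_branch is hypothetical):

```shell
# Hypothetical helper mirroring the "Determine Environment" step:
# main maps to prod, every other ref name maps to dev.
env_for_branch() {
  case "$1" in
    main) echo "prod" ;;
    *)    echo "dev" ;;
  esac
}

env_for_branch main         # prod
env_for_branch dev          # dev
env_for_branch 123/merge    # dev (ref name format used for pull request runs)
```

Note the last case: on pull_request events, github.ref_name is the merge ref rather than the target branch, so a pull request into main is planned against dev with this mapping. If you want those plans to run against prod, one option is to key off github.base_ref for pull request events.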

For production environments, add a manual approval step:

deploy-prod:
  if: github.ref == 'refs/heads/main'
  environment: production  # Creates manual approval gate
  runs-on: ubuntu-latest
  steps:
    # ... previous setup steps ...

    - name: Terragrunt Apply Production
      run: |
        cd infrastructure/prod
        # Run with reduced parallelism for safety
        terragrunt run-all apply \
          --terragrunt-non-interactive \
          --terragrunt-parallelism 1 \
          --auto-approve

The environment: production setting requires manual approval before the job runs. Configure this in your repository settings under Environments.

Handling run-all commands

The run-all command deploys multiple modules while respecting dependencies. Control parallelism based on your AWS API limits:

# Fast but aggressive - good for dev
terragrunt run-all apply --terragrunt-parallelism 4

# Slow but safe - good for production
terragrunt run-all apply --terragrunt-parallelism 1

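If one module fails, Terragrunt skips the modules that depend on it and continues with independent ones. Leave --terragrunt-ignore-dependency-errors unset (the default) in production: passing it tells Terragrunt to keep processing dependents even after a dependency has failed.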

Advanced patterns

For selective deployments when only certain modules change:

- name: Deploy Changed Modules Only
  run: |
    ENVIRONMENT="${{ steps.env.outputs.environment }}"
    cd "infrastructure/${ENVIRONMENT}"

    # Get changed module directories for this environment only.
    # HEAD^ requires the checkout step to use fetch-depth: 2 or more.
    CHANGED_DIRS=$(git diff --name-only HEAD^ | grep "^infrastructure/${ENVIRONMENT}/" | cut -d/ -f3 | sort -u)

    for dir in $CHANGED_DIRS; do
      if [ -d "$dir" ]; then
        echo "Deploying $dir"
        terragrunt apply --terragrunt-working-dir "$dir" \
          --auto-approve --terragrunt-non-interactive
      fi
    done
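
The path filtering is the part of this step most likely to go wrong, so it helps to isolate it in a function you can test without a repository. This is a sketch, assuming the infrastructure/<env>/<module>/ layout from earlier. The name changed_modules is hypothetical:

```shell
# Hypothetical helper: read changed file paths on stdin and print the unique
# module directories under infrastructure/<env>/ that they belong to.
# Files directly under the environment folder (such as account.hcl) and
# files in other environments are ignored.
changed_modules() {
  local env="$1"
  grep "^infrastructure/$env/[^/]*/" | cut -d/ -f3 | sort -u
}

printf '%s\n' \
  infrastructure/dev/vpc/terragrunt.hcl \
  infrastructure/dev/vpc/extra.hcl \
  infrastructure/prod/rds/terragrunt.hcl \
  infrastructure/dev/account.hcl \
  README.md | changed_modules dev    # prints: vpc
```

In the workflow you would feed it from git diff --name-only. A change to account.hcl or the root terragrunt.hcl affects every module, so in that case falling back to run-all is the safer choice.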

Schedule drift detection to catch manual changes:

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  drift-detection:
    runs-on: ubuntu-latest
    steps:
      # ... setup steps ...

      - name: Check for Drift
        run: |
          cd infrastructure/prod
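          EXIT_CODE=0
          terragrunt run-all plan -detailed-exitcode --terragrunt-non-interactive || EXIT_CODE=$?

          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
          if [ "${EXIT_CODE}" = "2" ]; then
            echo "::error::Drift detected in production!"
            exit 1
          elif [ "${EXIT_CODE}" != "0" ]; then
            echo "::error::Plan failed with exit code ${EXIT_CODE}"
            exit 1
          fi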

Troubleshooting common issues

State lock conflicts occur when deployments overlap. Extend the lock timeout:

- name: Apply with Extended Lock Timeout
  run: |
    terragrunt run-all apply \
      --terragrunt-non-interactive \
      --auto-approve \
      --lock-timeout=30m

If locks get stuck, you may need to force unlock the state file.
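
Forcing an unlock goes through Terraform's force-unlock command, which Terragrunt passes through unchanged. Here is a small wrapper sketch. The function name force_unlock is hypothetical, and the lock ID is the one printed in the lock error message:

```shell
# Hypothetical wrapper around `terragrunt force-unlock`.
# Usage: force_unlock <module-dir> <lock-id>
force_unlock() {
  local module_dir="$1" lock_id="$2"
  if [ -z "$module_dir" ] || [ -z "$lock_id" ]; then
    echo "usage: force_unlock <module-dir> <lock-id>" >&2
    return 2
  fi
  # -force skips the interactive confirmation, so only run this once you
  # are sure no other apply is still holding the lock.
  (cd "$module_dir" && terragrunt force-unlock -force "$lock_id")
}
```

Unlock one module at a time rather than with run-all, since each module has its own state and its own lock.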

Module download failures happen with private repositories. Clear the cache and retry:

- name: Clean and Retry
  if: failure()
  run: |
    rm -rf .terragrunt-cache/
    terragrunt run-all init --terragrunt-non-interactive

IAM permission errors need debugging. Enable detailed logging:

- name: Apply with Debug Output
  if: failure()
  env:
    TF_LOG: DEBUG
    TERRAGRUNT_LOG_LEVEL: debug
  run: |
    terragrunt apply --terragrunt-working-dir problem-module

Best practices and troubleshooting

When you manage dozens of Terragrunt deployments, a few patterns separate smooth operations from constant firefighting. Before diving into Terragrunt-specific practices, ensure you understand how to organize your Terraform code effectively.

Directory structure

Keep your hierarchy shallow (three levels maximum). Deep nesting makes navigation harder and slows down run-all commands:

infrastructure/
├── terragrunt.hcl
├── dev/
│   ├── us-east-1/
│   │   ├── vpc/
│   │   └── eks/
│   └── us-west-2/
└── prod/

Name your modules consistently. For example, if it's vpc in dev, it's vpc in prod, not network or virtual-private-cloud. This consistency enables automation like terragrunt run-all apply --terragrunt-include-dir "*/vpc".
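
Naming consistency is easy to check mechanically. The sketch below assumes the flat <env>/<module> layout from the earlier example and compares every environment against a reference environment. The function name check_module_names is hypothetical, and the check runs in one direction only (it reports modules the reference has that another environment lacks):

```shell
# Hypothetical check: report environments that are missing a module
# directory present in the reference environment. Returns 1 if any is missing.
check_module_names() {
  local root="$1" reference="$2" env_dir env module name status=0
  for env_dir in "$root"/*/; do
    env="$(basename "$env_dir")"
    case "$env" in "$reference"|_envcommon) continue ;; esac
    for module in "$root/$reference"/*/; do
      name="$(basename "$module")"
      if [ ! -d "$root/$env/$name" ]; then
        echo "$env is missing $name"
        status=1
      fi
    done
  done
  return "$status"
}
```

Running check_module_names infrastructure dev in CI would flag a prod folder that uses network where dev uses vpc.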

Finally, pin module versions explicitly. Development can test newer versions while production stays stable:

# dev uses latest
source = ".../modules//vpc?ref=main"

# prod uses pinned version
source = ".../modules//vpc?ref=v2.1.0"

State management

Use one state file per component per environment, and never share state between environments. Terragrunt generates unique paths automatically if you use path_relative_to_include() in your backend configuration.

While S3 versioning helps with reverting to older versions of the state file, making explicit backups before risky operations is safer:

terragrunt state pull > backup-$(date +%Y%m%d).tfstate

When migrating existing Terraform to Terragrunt, carefully migrate the existing state. Run terragrunt init first to create the new state location, then use Terraform's built-in migration prompts.

Variable management

Use locals for computed values instead of duplicating logic:

locals {
  environment = split("/", path_relative_to_include())[0]
  region      = split("/", path_relative_to_include())[1]
  name_prefix = "${local.environment}-${local.region}"
}
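
To see what these locals evaluate to, here is the same split expressed in shell. The function name name_prefix is hypothetical. It takes the relative module path that path_relative_to_include() would return:

```shell
# Hypothetical illustration of the locals above: take the first two
# segments of a relative module path as environment and region.
name_prefix() {
  local path="$1" environment region rest
  environment="${path%%/*}"    # text before the first slash
  rest="${path#*/}"            # everything after the first slash
  region="${rest%%/*}"         # second path segment
  printf '%s-%s\n' "$environment" "$region"
}

name_prefix dev/us-east-1/vpc    # dev-us-east-1
name_prefix dev/vpc              # dev-vpc
```

The second call shows the catch: in a two-level layout without a region folder, the second segment is the module name, so these locals only make sense with the environment/region/module structure.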

Document variable inheritance in your root terragrunt.hcl. Future team members need to understand what comes from where. Add validation where possible to catch errors early.

Performance optimization

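Terraform runs up to 10 resource operations concurrently per module by default, and run-all multiplies that by the number of modules it processes at once, which can overwhelm AWS APIs. Production should use --terragrunt-parallelism 1 or 2 to avoid rate limits, while development can push to between 4 and 5.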

The download cache grows indefinitely. Add a weekly cleanup job to your CI pipeline. Use selective execution with --terragrunt-include-dir or --terragrunt-exclude-dir for targeted deployments.
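
The cleanup job itself can be a single find invocation. This is a sketch, assuming you want to drop every cache directory under the infrastructure root. The function name clean_terragrunt_cache is hypothetical:

```shell
# Hypothetical cleanup helper: delete every .terragrunt-cache directory
# below the given root (defaults to the current directory). -prune stops
# find from descending into a cache directory it is about to remove.
clean_terragrunt_cache() {
  local root="${1:-.}"
  find "$root" -type d -name '.terragrunt-cache' -prune -exec rm -rf {} +
}
```

The next terragrunt init re-downloads modules and providers, so expect the first run after a cleanup to be slower.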

Team collaboration

Document your Terragrunt patterns in a team runbook. Include directory structure conventions, how to add new environments, variable inheritance hierarchy, and troubleshooting steps.

Code reviews should check for circular dependencies, hardcoded values that belong in variables, and consistent naming.

New team members need Terragrunt training before touching production. The abstraction layer helps in the long run but can be confusing at first. Pair on the first few deployments to accelerate learning.

Common pitfalls and solutions

Even experienced teams encounter Terragrunt issues. Here's how to recognize and resolve them quickly.

Circular dependencies kill deployments instantly. Terragrunt detects them and fails with "Dependency cycle detected." This typically happens when Module A depends on Module B, which depends on Module C, which depends back on Module A. Security groups are frequent culprits.

Solution: restructure your modules so dependencies flow in one direction, or combine related security groups into a single module.

Cache corruption manifests as "module not found" errors or version mismatches, especially after switching branches. The fix is straightforward:

rm -rf .terragrunt-cache/
terragrunt run-all init

Add .terragrunt-cache/ to your .gitignore file – this cache should never be committed.

Version conflicts between Terraform and Terragrunt cause cryptic errors like "invalid character" or "unknown function." Terragrunt 0.54+ requires Terraform 1.0+. Always pin both versions in your CI pipeline and document them in your README.

Module source authentication with private repositories fails silently. For local development, configure Git to use SSH. For CI/CD, use GitHub Apps or deploy keys instead of personal access tokens.

Conclusion

Terragrunt transforms Terraform from a powerful tool into a scalable platform. Through inheritance, automated dependency management, and simplified multi-account deployments, you've eliminated configuration duplication. Your modules now follow standardized patterns that scale.

The investment pays off quickly when managing complex infrastructures. Teams juggling 100+ modules across multiple AWS accounts save more time on manual orchestration than they spend learning Terragrunt's patterns. The opinionated structure that seems restrictive at first becomes the foundation for reliable deployments.

Start by picking development as your proof of concept, where mistakes are cheap. Migrate existing modules gradually, establishing team standards for naming conventions and directory structures before they become implicit knowledge.

While Terragrunt solves orchestration, Terrateam adds enterprise automation. It understands Terragrunt's dependency graphs, provides policy enforcement across all modules, and adds pull request automation that respects your module relationships.

Ready to scale your infrastructure without the complexity? Sign up for Terrateam and transform your Terragrunt deployments with automated workflows.