Safety First


Introduction

In the blog post where we talked about transitioning from Atlantis to GitHub Actions, we mentioned that in some cases we chose to diverge from how Atlantis operates.

For us, the divergence almost always comes down to the same question: what is safe for our users?

What does safe mean?

We believe that your infrastructure should match what is defined in your default Terraform branch. Terrateam pushes our users to avoid problematic infrastructure drift.

It is very important to us that we provide an experience to help users make changes via Terraform while not getting in their way.

In some scenarios, our operations differ from other platforms like Atlantis.

While building the initial version of Terrateam, we had to walk through a lot of scenarios to ensure safety, reliability, and consistency. We even considered edge cases that are unlikely to happen, to make sure we always operate in a safe way.

For the most part, Terrateam users have a workflow with autoplan enabled. Changes swiftly go from planned to applied to merged. Many of the scenarios we’ll describe involve expert users manually performing various commands.

Escape Hatches

Even though we view safety as critical to the Terrateam experience, we realize that users might either want to or need to do something we consider unsafe.

We decided early on that we would provide what we call escape hatches, which allow a user to override the Terrateam software.

Locks

The most common escape hatch is to comment terrateam unlock in a GitHub Pull Request. When our backend service receives this command, it immediately releases any locks associated with Terraform resources against the Pull Request.

Both Terrateam and Atlantis use locks to enforce workflows. A lock prevents a change from being applied while another change holds that lock.

A Terrateam lock is connected to a GitHub Pull Request and that Pull Request owns a lock on a directory.

For example, let’s say we have Pull Request 42 with changes in the following directories:

  • aws/ec2
  • aws/iam

When applying those changes, Terrateam will lock both of these directories and associate the locks with Pull Request 42.

Any other Pull Request which has a change in those directories will not be able to plan or apply its changes.
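
To make the example concrete, here is a minimal Python sketch of directory locks owned by a Pull Request. It is an illustration only, not Terrateam's actual implementation: the LockTable class and its acquire, owner, and release methods are hypothetical names chosen for this post.

```python
from typing import Dict, Iterable, Optional


class LockTable:
    """Hypothetical in-memory model: each directory is owned by at most one Pull Request."""

    def __init__(self) -> None:
        self._owner: Dict[str, int] = {}

    def acquire(self, pr: int, dirs: Iterable[str]) -> bool:
        """Lock every directory for `pr`, or lock nothing if another PR holds any of them."""
        dirs = list(dirs)
        if any(self._owner.get(d, pr) != pr for d in dirs):
            return False
        for d in dirs:
            self._owner[d] = pr
        return True

    def owner(self, directory: str) -> Optional[int]:
        """Return the Pull Request that owns the lock, or None if the directory is free."""
        return self._owner.get(directory)

    def release(self, pr: int) -> None:
        """Drop every lock owned by `pr` (the effect described for terrateam unlock)."""
        self._owner = {d: p for d, p in self._owner.items() if p != pr}


locks = LockTable()
locks.acquire(42, ["aws/ec2", "aws/iam"])     # Pull Request 42 applies its changes
blocked = not locks.acquire(43, ["aws/iam"])  # a second Pull Request touching aws/iam
```

In this sketch, the second Pull Request stays blocked until the locks owned by Pull Request 42 are released.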

How we differ from Atlantis mostly comes down to when we create a lock or release a lock.

Planning

Atlantis

Running a plan acquires the lock on that directory for that Pull Request. Once acquired, no other Pull Request with changes to the same directory can perform a terraform plan.

Terrateam

Any number of Pull Requests may plan against the same directory. We feel that users should be able to review plans alongside existing Pull Requests.

When a change is applied, locks are acquired, which invalidates any plans held by other open Pull Requests against the same Terraform directory or set of resources.

The locks are released only when the Pull Request that was applied is merged into the default branch.

This ensures consistency between your Terraform repository and resources.

After a successful merge to the default branch, Terrateam will re-plan any invalidated plans based on the repo-level autoplan configuration.
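
The plan, apply, and merge sequence described above can be sketched in Python as follows. This is a simplified model under our own assumptions, not the real Terrateam backend: the plans and locks structures and the plan, apply, and merge functions are hypothetical, and the automatic re-plan after merge is shown as a manual call.

```python
from typing import Dict, Set, Tuple

# Hypothetical model: valid plans are tracked as (pull request, directory) pairs.
plans: Set[Tuple[int, str]] = set()
locks: Dict[str, int] = {}  # directory -> Pull Request holding the lock


def plan(pr: int, directory: str) -> bool:
    """Planning takes no lock, but is refused while another PR holds the directory."""
    if locks.get(directory, pr) != pr:
        return False
    plans.add((pr, directory))
    return True


def apply(pr: int, directory: str) -> bool:
    """Applying needs a valid plan, locks the directory, and invalidates other plans."""
    if (pr, directory) not in plans or locks.get(directory, pr) != pr:
        return False
    locks[directory] = pr
    for other in [p for p in plans if p[1] == directory and p[0] != pr]:
        plans.discard(other)
    return True


def merge(pr: int) -> None:
    """Merging the applied PR releases its locks, so other PRs can plan again."""
    for d in [d for d, owner in locks.items() if owner == pr]:
        del locks[d]


plan(1, "aws/ec2")
plan(2, "aws/ec2")                        # both Pull Requests can hold a plan
apply(1, "aws/ec2")                       # locks aws/ec2, drops Pull Request 2's plan
replan_while_locked = plan(2, "aws/ec2")  # refused while Pull Request 1 is unmerged
merge(1)
replan_after_merge = plan(2, "aws/ec2")   # allowed again after the merge
```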

Merged and Applied to Unlock

Two-Step Operation

To build on the previous difference, Atlantis releases a lock on apply.

However, as mentioned before, we want to make sure your default branch looks as close to what’s running in production as possible.

Applying a change is a two-step operation. The first operation (apply or merge) acquires the lock, and the other operation (merge or apply) must then be performed to release it.

We only lock when your infrastructure is being modified. An unlock only happens once that change has been successfully applied and successfully merged to the default branch.
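
One way to picture the two-step rule is as a small state machine: a change is locked after exactly one of the two operations and unlocked once both have happened. The ChangeState class below is a hypothetical Python sketch of that idea, not code taken from Terrateam.

```python
from dataclasses import dataclass


@dataclass
class ChangeState:
    """Hypothetical per-Pull-Request state for the two-step rule."""

    applied: bool = False
    merged: bool = False

    @property
    def locked(self) -> bool:
        # Locked after the first operation, unlocked only once both have happened.
        return self.applied != self.merged

    def apply(self) -> None:
        self.applied = True

    def merge(self) -> None:
        self.merged = True


apply_first = ChangeState()
apply_first.apply()   # first step: the lock is acquired
apply_first.merge()   # second step: the lock is released

merge_first = ChangeState()
merge_first.merge()   # merging before applying also leaves the change locked
```

The order of the two operations does not matter in this model; only completing both returns the change to the unlocked state.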

Barney Rubble (Trouble)

We can recognize two problematic scenarios.

Scenario 1

A user opens a Pull Request, applies a set of Terraform changes via terrateam apply, closes the Pull Request, and finally deletes the branch.

We can’t stop a user from performing these actions, so there isn’t much we can do beyond providing feedback. We post a comment on the Pull Request letting the user know that they are in an undesirable state.

We also hold onto the locks against the closed Pull Request until a user with the appropriate permission releases them by commenting terrateam unlock.

Scenario 2

Autoplan is disabled. A user opens a Pull Request with Terraform changes and merges it to the default branch without applying.

The likely result is drift.

Once a change is in the default branch and it has not been applied, it transitions into the locked state.

In this scenario, the workflow follows the same set of rules as a merge then apply workflow.

This is the only way for a lock to be acquired without performing a terraform plan.

The terrateam unlock command can be executed to exit out of this state.

Planning Directories Without Changes

Atlantis

Users are allowed to perform a terraform plan against a directory without modifications.

In this case, the resulting plan will be empty and Atlantis will now have this Pull Request own the lock on this directory.

Terrateam

Users are not allowed to perform a terrateam plan against a directory without modifications.

This is one of the few cases where we do not provide an escape hatch.

Modifying the Terrateam configuration is the one exception.

Updating Repository-Level Configuration Locks Repository

This is a special case. What happens depends on what was changed in the Terrateam config.yml.

Any directories and/or workspaces that are impacted by the change to the YAML are now considered a changed directory, even if their underlying code has not changed.

A Pull Request Must Have Plans for All Its Changes to Apply

Atlantis

As long as the Pull Request has a lock on the directory, a user can apply a change to that directory.

Terrateam

Once changes have started to apply against a Pull Request, Terrateam strives for all changes to be applied before releasing locks.

When applying, the Pull Request needs to have valid plans for all directories it will apply against.

Two Pull Requests

Imagine the following scenario:

  1. Pull Request 1 is a change against two directories: dir1 and dir2
  2. Pull Request 2 is a change against two directories: dir2 and dir3

Both Pull Requests can run plans for their respective directories, despite sharing dir2.

Let’s say Pull Request 1 is applied, but not merged, and then a user tries to apply Pull Request 2.

Pull Request 2 still has a plan for dir3. However, because Pull Request 1 has been applied but not merged, the plan that Pull Request 2 had for dir2 has been invalidated, and dir2 cannot be planned again while it is locked. Applying Pull Request 2 will therefore fail.

Even if the user only tries to apply dir3, it will not be allowed because the Pull Request needs to be able to lock all of the directories that have changed under it to apply anything.
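
The check in this scenario can be sketched as a single function that looks at every directory the Pull Request changes, not just the one being applied. This is a hedged Python illustration: can_apply and its parameters are hypothetical names, and the real checks may differ.

```python
from typing import Dict, List, Set


def can_apply(changed_dirs: Set[str], valid_plans: Set[str],
              locks: Dict[str, int], pr: int) -> List[str]:
    """Return the reasons an apply is refused; an empty list means it may proceed."""
    problems = []
    for d in sorted(changed_dirs):
        if locks.get(d, pr) != pr:
            # Another Pull Request holds the lock on this directory.
            problems.append(f"{d}: locked by Pull Request {locks[d]}")
        elif d not in valid_plans:
            problems.append(f"{d}: no valid plan")
    return problems


# Pull Request 1 (dir1, dir2) has been applied but not merged.
locks = {"dir1": 1, "dir2": 1}
# Pull Request 2 changes dir2 and dir3; only its dir3 plan is still valid.
reasons = can_apply({"dir2", "dir3"}, {"dir3"}, locks, pr=2)
```

Because the function inspects all changed directories, asking to apply only dir3 would be refused for the same reason.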

Single Pull Request

Imagine a new scenario:

  • A single Pull Request with changes in dir1 and dir2
  • dir1 has autoplan enabled
  • dir2 does not have autoplan enabled

A user cannot perform an apply, even if it is only against dir1, until dir2 has a plan. At that point, dir1 and dir2 would be locked until both are applied and merged.

User Impact

Most likely zero.

We had to go through all of these scenarios to make sure that Terrateam always operates in a safe way. If a user hits an edge case, we want to make sure we safely protect the user.

Again, these are edge cases. We think that most users want a straightforward and basic workflow:

  • create a pull request
  • run a terraform plan
  • run a terraform apply
  • merge

Or some variation of that workflow.

We don’t consider Atlantis to be unsafe. Most users simply will not use it in a way that gets them into a sticky situation.

One gotcha that users coming from Atlantis to Terrateam may experience is that planning does not create a lock.

As a user, feel safe knowing that the default workflow is probably all you need. If you do get caught in one of these edge cases, you can be sure that Terrateam will make the safest decision possible to keep your infrastructure secure.

The repository-level configuration and comment interface for Terrateam are also different from Atlantis in some important ways. We’ll cover those in an upcoming blog post.
