Terrateam has been a huge fan and user of Atlantis for a long time. The community is strong and the self-hosted solution brings Terraform users a solid foundation.
When we decided there was a niche to be filled in the Terraform collaboration space, our first iteration was building a hosted Atlantis solution.
We take our customers’ security seriously, so we implemented a service which ran each customer’s Atlantis in its own isolated environment.
Our service would securely route GitHub events to the appropriate customer environment and let Atlantis process the request.
This proved to work well. Generally speaking, our implementation was very similar to how someone hosting Atlantis on their own infrastructure would deploy the service.
We quickly learned that this was not a compelling enough service. While hosted Atlantis may be convenient, it wasn’t convenient enough to justify a service offering. As Terraform users ourselves, we wanted more.
We already had a large backlog of features that we wanted to implement but we had not decided how those features would work with the existing Atlantis codebase. There was concern this would be like forcing a square peg into a round hole.
One of the features that is important to Terraform users is Role-Based Access Control (RBAC). From small to large teams, it is important to be able to restrict access to shared resources. Atlantis does not have RBAC. Using custom workflows and Open Policy Agent (OPA), one can push Atlantis in the right direction.
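To make the idea concrete, here is a minimal sketch of what an RBAC check for Terraform operations might look like. It is not how Atlantis, OPA, or Terrateam implement access control; the policy layout, the user names, and the `is_allowed` function are all hypothetical and exist only for illustration.

```python
# Hypothetical policy: directory prefix -> operation -> users allowed to run it.
POLICY = {
    "prod/": {"plan": {"alice", "bob"}, "apply": {"alice"}},
    "staging/": {"plan": {"alice", "bob"}, "apply": {"alice", "bob"}},
}


def is_allowed(user, operation, directory, policy=POLICY):
    """Return True if `user` may run `operation` in `directory`."""
    for prefix, rules in policy.items():
        if directory.startswith(prefix):
            return user in rules.get(operation, set())
    # Directories not covered by any rule are denied by default.
    return False


if __name__ == "__main__":
    print(is_allowed("bob", "plan", "prod/network"))
    print(is_allowed("bob", "apply", "prod/network"))
```

In this sketch, anyone on the team can plan against production, but only a smaller group can apply, which is the kind of restriction on shared resources described above.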
To provide a better user experience, we would need to modify the Atlantis source code and, most importantly, the configuration quite a bit. We were becoming concerned that we would be taking on a lot by asking the Atlantis community to adopt these sorts of Terrateam-specific changes.
Another obstacle we ran into was that we needed to provide a mechanism for Terraform secrets. We had a sophisticated solution which involved secrets being encrypted with a public key coupled with a strict isolated IAM policy. This ensured each customer environment had access to their private key to securely decrypt secrets.
We decided to back away from this approach. Terrateam is a new company and we thought we’d get more traction if customers were able to leverage a familiar platform like GitHub.
Finally, our initial solution required having a long-running container per customer. This architecture is great for response time but it was an operational burden that we didn’t want. In an ideal world, we’d have isolated and safe on-demand customer environments.
We stepped back and asked if there was a solution that solved these problems or gave us a foundation to build on.
Given that we required a first-class GitHub integration, GitHub Actions was a very appealing platform to integrate against. It solved two of our biggest problems: users already trust GitHub with their secrets, and GitHub manages the Actions infrastructure.
Of course, there was a catch.
Atlantis is designed to execute as a long-running service. It is also designed to process GitHub webhooks, which we wouldn’t need if running inside GitHub Actions. We could save Atlantis’s state, restore it between runs, and fake the GitHub webhooks once it was running.
We quickly strayed too far from Atlantis.
We had already started to diverge from Atlantis with how we wanted RBAC to behave. Now we felt that Atlantis compatibility was really what was important to us, not emulation.
At its core, Atlantis is running Terraform with some configuration. A lot of the Atlantis code is managing its integrations (GitHub, Bitbucket, GitLab, etc.), which we didn’t need.
terrat-runner is a Python program that replaces the core functionality of Atlantis. We spent a lot of time testing different scenarios to see how Atlantis handles them, to make sure we did the same thing or, if we chose to, diverged in a way that made sense to us. And we did decide to diverge in some scenarios.
The end result is a service which receives GitHub webhooks, evaluates a pull request, and then executes a GitHub Action which then runs terrat-runner.
terrat-runner is given a specific set of operations to perform and reports the results back to our service.
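As a rough illustration of that flow, the sketch below shows a runner that receives a set of operations, runs Terraform in each directory, and collects results to report back. This is not terrat-runner's actual code: the manifest shape, `StepResult`, `run_command`, and `run_manifest` are hypothetical names, and reporting to the service is left out.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StepResult:
    directory: str
    operation: str
    success: bool
    output: str


def run_command(args, cwd):
    """Run a command in `cwd`; return (exit_code, combined stdout/stderr)."""
    proc = subprocess.run(
        args,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def run_manifest(manifest, execute=run_command):
    """Run the manifest's operation in every listed directory.

    `manifest` is a hypothetical structure, e.g.
    {"operation": "plan", "directories": ["prod/network", "staging/app"]}.
    """
    operation = manifest["operation"]
    args = ["terraform", operation, "-input=false"]
    if operation == "apply":
        # Non-interactive apply needs explicit approval.
        args.append("-auto-approve")

    results = []
    for directory in manifest["directories"]:
        code, output = execute(args, directory)
        results.append(StepResult(directory, operation, code == 0, output))
    return results
```

The `execute` parameter is injectable so the loop can be exercised without Terraform installed; a real runner would also need to handle initialization, workspaces, and sending the collected results back to the service.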
This new architecture is now up and running with production workloads.
If you’d like to become an early adopter, please sign up.
Even if we aren’t building on top of Atlantis, we are building on top of its concepts and we’re grateful for it.