Build vs Buy: A Case Study in Terraflow

Build vs Buy, the eternal question

As an engineer, you face the build vs buy question constantly. Being a developer is incredibly empowering. You can make a computer do anything that physics allows. So given a problem we are trying to solve, should we build a solution ourselves or buy one? There is no universal answer here. Like most decisions, it comes down to what you value when you lay out the pros and cons. Is getting a solution that exactly meets your needs more important than spending that time more directly furthering the business? This is a question only you can answer.

Terraform Weekly Issue #202 includes an article by David Calvert on the bespoke TACOS that he developed for his employer, Hivebrite, called Terraflow. David’s article provides insights into the thought process, goals, and results of building their own solution. Unless a company is open sourcing their “build” solution, it’s rare to get insight into a decision like this, so I appreciate that David was willing to share his experience.

Open Source

It looks like David was mostly interested in an open source solution, which makes sense. I’m kicking myself that we decided to open source only recently because I think we would have been a solid contender in the evaluation. Many of the comparison points that disqualified Atlantis and Digger are addressed by Terrateam. It’s a learning experience.

Purpose Built

The primary motivation for building their own solution at Hivebrite was that they wanted something that fit seamlessly into their existing workflow. This is a tall order for any vendor. Vendors build a generic solution in order to capture as much of the market as possible, and that has to be balanced against making a product that is easy to use. Terrateam, for example, is very expressive, but that comes at the cost of more complex documentation and requires users who are willing to read it. Only by luck can an existing solution fit a complex workflow like a glove.

While more requirements are listed, the interesting ones are:

  1. Handling the large repository: This is something that is simple to say but it’s genuinely difficult at scale. Users can step on each other, get blocked with locks, or get blocked due to a broken repository. It depends a lot on how the repository is organized.
  2. Be able to automatically detect and organize “stacks”: The word stack is overloaded in the Terraform world and while they are not explicitly using Terraform Stacks, the concept is the same: they have root modules that are dependencies of other root modules and want to ensure they are detected automatically and triggered in the correct order.
  3. Must support Terragrunt
  4. Custom plan and apply scope: This is not well defined in the blog post. I think this means that, for example, credentials can be scoped to a specific plan and apply.
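
To make requirement 2 concrete, here is a minimal sketch of ordering root modules so that dependencies run before the stacks that depend on them. It is not Terraflow’s implementation. The module names and the `dependencies` map are hypothetical, and I’m assuming the dependency map has already been detected somehow. It uses Python’s standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical root modules. Each key maps to the root modules it depends on.
dependencies = {
    "network": set(),
    "database": {"network"},
    "cluster": {"network"},
    "app": {"database", "cluster"},
}


def run_batches(deps):
    """Group root modules into batches that can run one after another.

    Every module in a batch has all of its dependencies in earlier batches,
    so modules within the same batch could be planned or applied in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = run_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

The hard part in a real repository is building the dependency map in the first place, which is exactly the part that depends on how the repository is organized.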

The reasons given for implementing their own solution are:

  1. Time: we believe it’s quicker to build a custom solution to solve this problem than what it would take to adopt, and contribute the missing part on an existing project. Also, this gives us the luxury to choose the interface.
  2. Reduced Risk: due to the custom nature of our repository, nothing guarantees that we would be able to merge our specific use-case in an upstream project.
  3. Quality: better integration with our existing code, we don’t need to adapt or change what we already have, just build an additional wrapper around it.
  4. Knowledge: baking a custom solution gives us a better understanding of how things work.

As David writes in the article, the results were positive. It fits their existing setup quite well. At the time of writing, they have been using it for three months. For David personally, this was also a great technical accomplishment: it is his largest Python project, and he designed it, architected it, and wrote most of the code. Two new team members are also contributing. In conclusion, “it has already significantly increased deployment speed and reduced toil, leading to improved team performance.”

Is Build Right For You?

As I said in the beginning: whether build or buy is right for you depends on what is important to you. The reality is that the question is going to come down to money: is it worth putting engineering effort into a solution that doesn’t directly contribute to the bottom line? Engineers building internal tooling could instead be spending their valuable time developing features that draw more customers. That trade-off is the “opportunity cost”.

If you’re not sure how to start considering the question, here are some elements I think are worth thinking about:

  1. Estimated cost of development: Make a range of guesses, from conservative to liberal, of what development could cost. Compare that to the existing vendors. If the numbers are close, then maybe building your own isn’t a bad idea. If they are dramatically different, it will require a deeper analysis.

    While we don’t know how much time it took to develop Terraflow, let’s assume that a developer costs $200k/year, or around $17k/month. If it took a month of full-time work to develop Terraflow, we’re looking at $17k just for development. But a month is probably a low estimate.

    The math is pretty simple:

    • 3 months (likely low): $51k
    • 6 months: about $100k

    Put in context:

    • Depending on exact requirements, $51k could be a year-long enterprise contract with some servers.
    • At the time of this writing, Spacelift’s starter plan is $400/month, so $51k is 10 years of Spacelift.
    • Terrateam’s equivalent plan is $150/month, or 28 years of service.
  2. Completeness of the solution: Even if you don’t use all of the features of possible vendors, how competitive is the solution that was estimated in point (1)? Does it include the “nice to haves”? If needs change in the future, does that cost estimate include updating the software? In the case of Terraflow, it lacks applying saved plans and does not support pull request locks (two features we consider a hard requirement for anyone doing IaC). I don’t know if it lacks other functionality around compliance or security, which tend to become very important as a company grows.

  3. Maintenance plan: If built, will the service align with natural development workflows, or will it become obsolete or fall into disuse? Is it important enough that training new employees on it will be prioritized? At Terrateam, we have built a lot of our own frameworks and libraries because we want the control. But all of those frameworks are in our everyday development path. They are used for all of our services.

  4. Vendor relationship: Do you feel that you can build a good, healthy relationship with the vendors? Broadcom’s practices can make people wary when it comes to vendor relationships, but the reality is most vendors genuinely want to have a positive relationship with their customers.

  5. Upskilling: Developing a service can be a great way to improve one’s skills. David talks about how Terraflow is his largest project in Python yet. He clearly learned a lot in the process of developing Terraflow. That value is hard to quantify. Even so, it’s worth considering if there are other projects that would allow upskilling but also contribute directly to the business.
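
The back-of-the-envelope math from point 1 can be sketched in a few lines of Python. The developer cost is the same assumption used in the post ($200k/year, rounded to $17k/month), the vendor prices are the ones quoted in point 1, and the function names are mine. Swap in your own numbers.

```python
# Assumed figure from the post: $200k/year, rounded to $17k/month.
MONTHLY_DEVELOPER_COST = 17_000


def build_cost(months, monthly_cost=MONTHLY_DEVELOPER_COST):
    """Cost of one full-time developer for the given number of months."""
    return months * monthly_cost


def years_of_vendor(budget, monthly_price):
    """How many years of a vendor plan the same budget would pay for."""
    return budget / monthly_price / 12


budget = build_cost(3)  # the "likely low" three-month estimate
spacelift_years = years_of_vendor(budget, 400)  # starter plan price quoted in point 1
terrateam_years = years_of_vendor(budget, 150)  # equivalent plan price quoted in point 1

print(f"3 months of development: ${budget:,}")
print(f"Spacelift at $400/month: {spacelift_years:.1f} years")
print(f"Terrateam at $150/month: {terrateam_years:.1f} years")
```

This only counts initial development. Ongoing maintenance, which points 2 and 3 cover, would push the build side higher.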

Conclusion

Build vs buy is not an easy question to answer. It depends on the context of the problem and your values. At the very least, it’s worth doing some high-level calculations to determine if a home-grown solution is in the same ballpark as the vendor options. What this analysis leaves out is the impact of an open source solution.

As a vendor, I am obviously biased towards buying a solution in this space. Where organizations see simplicity, I see hundreds of edge cases that took us months to work out. Where they see things that “work for us for now”, I see the risk of a mistake that will bring down your infrastructure and impact your business. But if I told you that buying is always the right choice, I would be doing you a disservice. Define what’s right for you, challenge it, and if the answer sincerely comes up that building it is the right option, then do it.

Shameless Plug

You’re reading a Terrateam blog, so I have to inject how we stack up given the Terraflow requirements. Going by the complete list in the blog post:

  • Parallel Execution: Yes. By default, Terrateam runs three operations in parallel but that can be configured. It also supports splitting runs in a pull request across different compute nodes via environments.

  • GitHub Integration: Absolutely. This is our specialty.

  • Terragrunt Support: Yes

  • Drift Detection: Yes

  • PR Stack Detection (the blast radius of a change): Depending on what this means it comes in a couple of flavors.

    • If the feature is the ability to automatically detect the relationship between root modules, then no, we do not support this by default. We call this functionality layered runs. Even though we do not have native support for it, we do have an escape hatch in the form of dynamic configuration, which allows writing a script which generates a Terrateam configuration file on the fly. We automatically perform a topological sort to execute plans and applies in the correct order. You can encode any logic you want there for how to define the relationship between root modules.

    • If the feature is the ability to detect module relationships, this is supported via our code indexing feature. This automatically determines the relationship between a module and its dependencies such that the correct things are run.

  • Custom plan/apply scope: Yes via workflows.

  • Custom Commands Support: No, we do not support this but it’s in our backlog. We have gone back and forth for a while on whether we want it. However, the longer we wait, the more operations that previously required custom commands are being brought into Terraform, making it less of an issue. And, at least for Terrateam, the ethos is that everything should go through a review process, so moving more of those operations into pull requests is the right thing to do.

  • Dependency tree: Depends on what this means, see PR Stack Detection above.

  • Slack integration: Yes via webhooks.

  • Ease of use (stack detection, rollout, logs, summary, etc.): This is in the eye of the beholder. Terrateam is pretty expressive but one also has to read the docs and understand it. We get a lot of positive feedback on how simple it is to get going for self-hosted options.
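
Since “blast radius” comes up in the PR Stack Detection item, here is a minimal sketch of what I take it to mean: given the root modules changed in a pull request, find everything downstream that also needs to run. This is an illustration only, not how Terrateam or Terraflow implements it. The `depends_on` map and module names are hypothetical.

```python
from collections import deque


def blast_radius(depends_on, changed):
    """Return the changed modules plus every module that transitively depends on them.

    depends_on maps each module to the set of modules it depends on.
    """
    # Invert the graph: for each module, which modules depend on it?
    dependents = {}
    for module, deps in depends_on.items():
        dependents.setdefault(module, set())
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk from the changed modules through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical layout: app depends on database and cluster, which depend on network.
depends_on = {
    "network": set(),
    "database": {"network"},
    "cluster": {"network"},
    "app": {"database", "cluster"},
}

print(sorted(blast_radius(depends_on, {"database"})))
```

A change to `database` pulls in `app` but leaves `network` and `cluster` alone. The affected set would then be run in dependency order.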

I’d like to think that the only reason Terrateam was not seriously considered was that it was not open source, and that if the evaluation were done again now, we’d score pretty well.
