Flying away from AWS

TL;DR

It was a pleasure migrating from AWS to Fly.io but it’s not all rainbows and unicorns.

Fly.io goes above and beyond to create an exceptional developer experience. It’s super easy to hit the ground running. However, there are some rough edges that you might encounter.

If you like managing your own infrastructure and can live without stellar support, Fly.io could be the right solution for you.

Overall, I would recommend it over the major cloud providers if you don’t need all of the bells and whistles.

Background

When Terrateam was just starting out, like any other startup, we wanted to move fast to see if it was even worth pursuing. AWS was the easiest win because we could stand something up in short order. Throw in some free credits and we didn’t have to worry about infrastructure for a year. Problem solved.

Time flies though and the free credits ran out. Our AWS bill was too high for a 100% bootstrapped startup. We started hunting around for cheaper alternatives without compromising stability, security, and scalability.

Why Migrate?

The main motivation to migrate was cost. AWS is not cheap and we didn’t need all of the bells and whistles that AWS provides. Maybe one day, but not today.

By design, Terrateam has a simple stack that doesn’t require a lot of infrastructure. Here are the pieces:

  • A Postgres database
  • Web service: A simple Docker container listening on a port. Horizontally scalable and lightweight.

The benefit of such a simple stack is that it gives us a lot of options in choosing a provider. We knew what we wanted: something cheaper, with more of a PaaS feel, that didn’t compromise on operational necessities. We wanted a modern-day Heroku-like service offering.

After some quick experimentation and a proof of concept, it became clear to us that Fly.io was worth pursuing. We decided to stand up a staging environment to flesh out the details.

Preparation

In preparation for standing up a staging environment, we decided to map out all of the pieces of our existing AWS infrastructure in an attempt to match them to Fly.io components. We wanted to make sure we weren’t missing anything.

  • AWS ALB → Fly.io Load Balancer
  • AWS ECS → Fly.io Nomad
  • AWS RDS → Fly.io Postgres (close enough)

Everything looked straightforward for a migration. The next step was to create the Fly.io applications and configuration files.

Fly.io has a strong CLI that allows you to specify a fly.toml configuration file for each application. With this configuration, you can create an application and configure it with all of your required settings. Overall very easy.

Our staging fly.toml:

app = "terrateam-app-staging"
kill_signal = "SIGINT"
kill_timeout = 60
[env]
DB_HOST = "terrateam-db-staging.internal"
DB_NAME = "terrateam"
DB_PORT = "5432"
DB_USER = "app"
TERRAT_PORT = "8180"
TERRAT_PYTHON_EXEC = "/usr/bin/python3"
[[services]]
internal_port = 8080
protocol = "tcp"
[services.concurrency]
hard_limit = 25
soft_limit = 20
type = "connections"
[[services.http_checks]]
grace_period = "10s"
interval = "10s"
method = "get"
path = "/health"
protocol = "http"
restart_limit = 0
timeout = "3s"
tls_skip_verify = false
[services.http_checks.headers]
[[services.ports]]
force_https = true
handlers = ["http"]
port = 80
[[services.ports]]
handlers = ["tls", "http"]
port = 443
[deploy]
strategy = "rolling"
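The `[[services.http_checks]]` section above expects the container to answer `GET /health` on the internal port. As a rough illustration of what that contract means for the application, here is a minimal sketch using only Python's standard library. The names `health_response`, `HealthHandler`, and `make_server` and the response bodies are hypothetical, not Terrateam's actual implementation.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def health_response(path):
    """Return (status, body) for a request path; only /health reports healthy."""
    if path == "/health":
        return 200, b"ok\n"
    return 404, b"not found\n"


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves the health check defined in fly.toml."""

    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # 8080 matches internal_port in the fly.toml above.
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` would start the listener. With `interval = "10s"` and `timeout = "3s"`, the endpoint should stay cheap and avoid slow dependencies.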

Testing

As we started to build out the Terrateam staging environment, we quickly encountered the following roadblocks. Luckily they all had easy fixes.

IPv6

Fly.io application endpoints resolve to an IPv6 address.

josh@elmer:~ $ fly ssh console --app terrateam-app-staging
Connecting to fdaa:0:c037:a7b:c6ef:47dd:247:2... complete
/ # dig terrateam-db-staging.internal A terrateam-db-staging.internal AAAA +short
fdaa:0:c037:a7b:c207:e395:9a80:2
/ #

Our application didn’t have support for IPv6. That’s on us, so we implemented the Happy Eyeballs algorithm. First migration problem solved.
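For readers unfamiliar with Happy Eyeballs, the idea is to resolve both IPv6 and IPv4 addresses, alternate between address families, and start connection attempts a short delay apart so that one broken family does not stall the client. The sketch below is in Python and is not the code Terrateam shipped. `interleave_families` and `connect_dual_stack` are hypothetical names; the `happy_eyeballs_delay` argument is part of asyncio in Python 3.8 and later.

```python
import asyncio
import socket


def interleave_families(addrinfos, first_family=socket.AF_INET6):
    """Order getaddrinfo() results so address families alternate, preferred family first."""
    preferred = [a for a in addrinfos if a[0] == first_family]
    others = [a for a in addrinfos if a[0] != first_family]
    ordered = []
    for i in range(max(len(preferred), len(others))):
        if i < len(preferred):
            ordered.append(preferred[i])
        if i < len(others):
            ordered.append(others[i])
    return ordered


async def connect_dual_stack(host, port, delay=0.25):
    # With happy_eyeballs_delay set, asyncio interleaves address families and
    # staggers connection attempts by `delay` seconds, keeping the first success.
    return await asyncio.open_connection(host, port, happy_eyeballs_delay=delay)
```

On a network like Fly.io's, where `.internal` names return only AAAA records, the IPv6 path is the only one that can succeed, so the application's resolver and socket code must handle `AF_INET6` at a minimum.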

Postgres with SSL

Database connectivity wasn’t working. While we could resolve and reach the database endpoint from our application, we couldn’t connect. The database library we were using enforces a secure connection with SSL. We could have easily created an override, but we felt like it was good practice to configure Postgres with SSL.

Searching the Fly.io documentation, there didn’t seem to be an easy way to configure SSL using the fly.toml configuration file. Upon further investigation, we learned that the Fly.io Postgres solution is not equivalent to AWS RDS. Their documentation says as much: “This Is Not Managed Postgres”.

There is exceptional support for creating new databases with the Fly.io CLI. But it ends there. Management, scaling, upgrading, failing over, configuring, etc. are all on you. This is not a negative against Fly.io as they are upfront about what the offering is and what it isn’t. Now we know.

To configure Postgres with SSL, we created certificates and deployed the proper configuration to postgresql.conf. Second migration problem solved.
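On the server side, the relevant postgresql.conf parameters are `ssl`, `ssl_cert_file`, and `ssl_key_file`. On the client side, the connection has to request TLS. The sketch below shows one way to build a libpq-style connection URI from the `DB_*` variables in the staging fly.toml. The `postgres_dsn` helper is hypothetical, and the password is deliberately left out on the assumption that it is supplied separately as a secret.

```python
import os
from urllib.parse import quote, urlencode


def postgres_dsn(env=os.environ, sslmode="require"):
    """Build a libpq connection URI from DB_* variables, requesting TLS via sslmode."""
    user = quote(env["DB_USER"])
    name = quote(env["DB_NAME"])
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")
    query = urlencode({"sslmode": sslmode})
    return f"postgresql://{user}@{host}:{port}/{name}?{query}"
```

`sslmode=require` encrypts the connection without verifying the server certificate. `verify-full` is stricter but needs the client to trust the CA that signed the certificate.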

Migration

After ironing out the kinks in the staging environment, it was time to create all of the Terrateam Fly.io production applications and pull the trigger on a migration.

Two approaches were discussed for the migration:

  • Live migration with zero downtime
  • Quick migration with minimal downtime

Live Migration

In my experience, every migration conversation starts with evaluating the difficulties of performing steps without application downtime. We discussed an approach that involved some complex Nginx configuration but ultimately decided against it.

The effort and complexity of building all of the moving parts required for a live migration was not worth it just to avoid a short amount of downtime. Keeping it simple was the more attractive approach. Being able to resend any missed GitHub webhooks further pushed us toward the not-so-fancy quick migration with minimal downtime.

Quick Migration

The quick migration was straightforward:

  1. Update app.terrateam.io with a low DNS TTL
  2. Block incoming connections on the AWS ALB
  3. Migrate the AWS RDS database to the new Fly.io database
  4. Update app.terrateam.io to point to the new Fly.io application endpoint
  5. Resend any missed GitHub webhooks (there weren’t any)
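The post does not say which tooling was used for step 3. One common approach, shown here purely as an assumption, is a `pg_dump` of the RDS database followed by a `pg_restore` into the new one while incoming traffic is blocked. The helper name, the dump file name, and the connection strings are hypothetical.

```python
import shlex


def dump_and_restore_commands(source_dsn, target_dsn, dump_path="terrateam.dump"):
    """Return (pg_dump argv, pg_restore argv) for a one-shot database copy."""
    dump = [
        "pg_dump",
        "--format=custom",   # compressed archive that pg_restore understands
        "--no-owner",        # role names rarely match across providers
        "--no-acl",
        f"--file={dump_path}",
        f"--dbname={source_dsn}",
    ]
    restore = [
        "pg_restore",
        "--no-owner",
        "--no-acl",
        f"--dbname={target_dsn}",
        dump_path,
    ]
    return dump, restore


def as_shell(argv):
    # Quote each argument so the command can be pasted into a terminal.
    return shlex.join(argv)
```

Each list can be passed to `subprocess.run(argv, check=True)` in order. Running the dump only after step 2 ensures no writes land on the old database mid-copy.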

Pros

We benefit in many ways by hosting Terrateam with Fly.io. All of the benefits come out of the box with a new Fly.io organization. These are the best kinds of extras. It really shows that Fly.io understands what typical engineering teams require to build infrastructure.

Observability

The observability you get for free in Fly.io is superb. When creating a new application, you automatically get a Grafana dashboard with all of the graphs you’d typically want. It’s very easy to create more graphs and dashboards on top of the pre-existing ones which makes visibility into your application a joy. This is a breath of fresh air when compared to AWS CloudWatch.

Additionally, if you configure your application to expose a Prometheus endpoint, those metrics will automatically show up on your Grafana dashboard.
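A Prometheus endpoint is just an HTTP route that returns metrics in a plain-text exposition format. The sketch below renders that format for a few counters using only the standard library. The function name and the `webhooks_received_total` metric are hypothetical examples, not metrics Terrateam is known to export.

```python
def render_metrics(counters):
    """Render {name: (help_text, value)} as Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical usage: serve this string from the metrics route of the web service.
example = render_metrics({"webhooks_received_total": ("GitHub webhooks received.", 3)})
```

In a real service, an established Prometheus client library is usually a better choice than hand-rendering, since it also handles labels, histograms, and escaping.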

Remote Access

Coming from the AWS world, this is another area where Fly.io shines. In my experience, you sometimes need to remote into a running container. This can be especially useful when building infrastructure for the first time or troubleshooting an ongoing issue. Sometimes you just need a shell.

The Fly.io CLI provides a very easy way to gain access:

josh@elmer:~ $ fly ssh console --app terrateam-app-staging
Connecting to fdaa:0:c037:a7b:c6ef:47dd:247:2... complete
/ #

I appreciate not having to be burdened by VPNs, SSH keys, bastion hosts, etc. and really enjoy the simplicity of the fly ssh console command.

IPv6 Private Networking

Each Fly.io organization automatically receives a secure private network with IPv6 endpoints. For a simple application like Terrateam, this is a huge benefit. Creating a separate private network is a responsibility I’m happy to offload. I don’t have to worry about CIDRs, subnets, routing, or anything else that comes along with the complexity of networking. It just works.

Multi-Region Scalability

Fly.io’s multi-region scalability feels too good to be true. By specifying more than one region in your fly.toml, your application automatically lives in more than one region. You can change these regions at any time without any downtime. With other cloud providers this would be a huge amount of work. This is a killer feature.

josh@elmer:~ $ fly regions list --app terrateam-app-production
Region Pool:
dfw
sjc
Backup Region:
josh@elmer:~ $

Clean UI

The Fly.io dashboard is clean, organized, and easy to navigate. It’s not cluttered like other cloud provider dashboards. I can easily find what I’m looking for in no time. Please never change.

Cons

Terraform Provider

While we initially wanted to create everything via Terraform, we quickly realized this wouldn’t be possible. The Fly.io Terraform provider is just not robust enough to create a complete environment. This was disappointing but we decided to move on. It’s worth noting that this is a big problem for Terrateam as we’re a company that stresses the use of Terraform. It’s on our roadmap to contribute to the Fly.io Terraform provider.

Postgres HA

Under the hood, Fly.io uses Stolon, a cluster manager that facilitates Postgres failover. In my experience, Stolon does not have a good story for failover reliability. This is a problem because failover is the whole purpose of Stolon.

I can personally attest to an instance where Stolon failed to facilitate a Postgres failover, leaving me in a situation where I had to manually create a new database and restore from a backup. Fly.io is working on replacing Stolon with a more robust solution.

Logging

The container logging solution provided by Fly.io is basic. It’s easy to view logs with the Fly.io CLI and via the Fly.io dashboard. However, there’s only a small window of logs that are kept, forcing you to create a remote logging solution either internal to your application or via a separate Fly.io application that ships logs to an external service. This is a piece of operational overhead that I’d like to see removed.
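Whichever shipping route you choose, it helps if the application writes one structured record per line to stdout, so a shipper can forward the records without parsing free-form text. Below is a minimal sketch using Python's standard `logging` module. The `JsonLineFormatter` class, the `make_logger` helper, and the field names are assumptions for illustration, not part of Fly.io or Terrateam.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_logger(name="terrateam", stream=sys.stdout):
    # Replace any existing handlers so each record is emitted exactly once.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger().info("webhook %s", "received")` then emits a single JSON line that an external log service can index by field.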

Support

Fly.io is not a support-first company. If this bothers you, then Fly.io is probably not for you.

Email is the only support channel, and it can take anywhere from hours to days to get a response. Sometimes they don’t follow up. It does not feel like their level of support is on par with other cloud providers.

This may go along with their attitude that you run your own infrastructure. Fly.io provides you with the building blocks. Support is not there as a safety net. It’s more of a nice-to-have. Managing your environment is ultimately up to you.

I’d love to see better support in the future. If I can’t get a specific answer from staff, I at least would like to see a nudge in the right direction. Room for improvement.

Closing Thoughts

No platform is all rainbows and unicorns. Having said that, Fly.io gets pretty close. We don’t have a lot of complaints about the offering. It does help to understand what exactly the offering is though. I think Fly.io does a great job of explaining what they are and, more importantly, what they are not.

By design, our stack is simple, which is probably why Fly.io works so well for us. We don’t require a lot of moving parts or infrastructure. We basically just need a place to run a database and a container.

Your mileage may vary. If your organization requires more comprehensive components from a cloud provider like messaging middleware, S3 buckets, IAM, etc., then Fly.io might not be for you.

In closing, we are overall very pleased with Fly.io and can recommend the platform to other teams that wish to move away from other complex and expensive cloud providers.
