OCaml to PostgreSQL: Building Dynamic Queries with Type Safety

By Malcolm Matalka on Dec 28, 2023

We recently released the Audit Trail feature of Terrateam. Audit Trail is part of the Terrateam Console, allowing users to view and query their Terrateam operations. Audit Trail is a new type of feature for us. Most of Terrateam is driven by reacting to GitHub Webhooks: receive an event, evaluate it, and initiate a GitHub Action. All of those database queries are under our control and optimized. Audit Trail gives users the ability to query the database with their own queries. This is a more challenging kind of feature to implement, to communicate, and to maintain quality of service for. The development and implementation of Audit Trail is a good view into how Terrateam, as a company, works.

Terrateam is different from a lot of other companies in this space in the technology choices we make. The biggest stand-out is that our backend is implemented entirely in OCaml. I’ve spoken at length about why we chose OCaml; see here and here.

We’ve also written a lot of custom frameworks, libraries, and tooling for OCaml. Our rule of thumb is: no to frameworks, yes to libraries. This is because we want control. OCaml is such a powerful and expressive language that we are able to deliver features quickly despite taking on so much development ourselves. A lot of the underlying protocols and technology we interact with are quite stable, so while we might have a higher initial development cost, our maintenance cost is essentially zero because we do not have to modify our code as frameworks we do not control change over time.

One of our core libraries is Pgsql_io, an implementation of the PostgreSQL wire protocol in OCaml. A major feature of Pgsql_io is that it supports adding type annotations to SQL queries. In Pgsql_io, one writes a SQL query string and then provides extra information to specify what kind of values it takes and what kind of values it returns. Here is an example of how we use it to create a work manifest row:

The SQL:

insert into github_work_manifests (
  base_sha,
  pull_number,
  repository,
  run_type,
  sha,
  tag_query,
  username
) values (
  $base_sha,
  $pull_number,
  $repository,
  $run_type,
  $sha,
  $tag_query,
  $username
)
returning
  id,
  state,
  to_char(created_at, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')

The annotated query in Pgsql_io:

let insert_work_manifest () =
  Pgsql_io.Typed_sql.(
    sql
    // (* id *) Ret.uuid
    // (* state *) Ret.text
    // (* created_at *) Ret.text
    /^ read "insert_github_work_manifest.sql"
    /% Var.text "base_sha"
    /% Var.(option (bigint "pull_number"))
    /% Var.bigint "repository"
    /% Var.text "run_type"
    /% Var.text "sha"
    /% Var.text "tag_query"
    /% Var.(option (text "username")))

And here is a usage of it:

let res =
  Pgsql_io.Prepared_stmt.fetch
    db
    (Sql.insert_work_manifest ())
    ~f:(fun id state created_at -> (id, state, created_at))
    base_sha
    pull_number_opt
    repo_id
    run_type
    sha
    tag_query
    user

To put it all together, we have:

  1. A string which is an SQL expression. This looks like any ol’ SQL, the only difference is it references variables with a $, such as $username.
  2. We have type annotations for that SQL string, which specify the types of the variables: $username is a text value and can be null; $repository is a bigint and cannot be null. We also specify the types of the returned rows. We use comments to note the names of the columns, which is just for developer convenience.
  3. We have the call to Pgsql_io.Prepared_stmt.fetch which takes a database connection, an SQL expression, a function to apply to each returned row, and the values of the variables referenced in the SQL expression.

Pgsql_io takes all of these, rewrites the SQL expression into a prepared statement, maps the variables to the prepared statement parameters, and executes it. For every row, it applies the function, collects the results, and returns a list of values.

What’s so powerful about this is that it all gets verified at compile time. That is to say, before we run the program, we know that all uses of the SQL expression match the type annotation. If we refactor an SQL expression, we change the type annotation to match, and the compiler will tell us every call site that we need to fix.

“Hey, wait a second!”, you say, “How do you verify the SQL you wrote matches the type annotation?” We don’t! Originally we had looked at building a DSL that compiles to SQL; then we would know that if the types match, so does the SQL. That was quickly abandoned. SQL is just too large a language, and we make use of a lot of its power, so we did not feel confident that building such a DSL would be feasible for us. So we made a compromise in the name of pragmatism: SQL statements tend not to change that much, and when they are wrong it’s usually pretty blatant. You’ve either specified the correct types or you haven’t, and when that SQL is executed you’ll know quickly if it’s right or wrong.

With that background, on to the interesting part: we had decided to implement a new feature called Audit Trail. While Terrateam is designed to be used within GitHub for the standard workflow, reporting operations live inside of the Terrateam Console. A common request is that users want to be able to see what operations Terrateam has performed. In a busy organization, it’s hard to do this just by scrolling through the list of runs. We wanted to give users a powerful and robust query language to filter results.

Terrateam already makes use of a simple query language we call the Tag Query Language. It’s used in many places in the product. The nice thing about it is that it has a syntax that is straightforward to learn and it can be used in different contexts. We just needed to define which attributes of an operation we wanted to expose to users.

The way Audit Trail works is that a user types a query into the UI, which makes an API call passing the query as a string. The backend parses the query, translates it to SQL, and executes it. For example, the query pr:123 and user:joe would translate to an SQL query like pull_number = 123 and username = 'joe'.
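
The details of the Tag Query Language parser aren’t important here; roughly speaking (with illustrative, hypothetical names rather than our actual code), a parsed query is just a small tree of key/value filters combined with and and or:

(* Hypothetical AST for a parsed tag query. *)
type query =
  | Filter of string * string   (* e.g. Filter ("pr", "123") *)
  | And of query * query
  | Or of query * query

(* "pr:123 and user:joe" would parse to something like: *)
let example = And (Filter ("pr", "123"), Filter ("user", "joe"))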

We don’t want to construct a raw SQL query, though. That leaves us vulnerable to all sorts of obvious injection threats. What we really want to do is construct a prepared statement that references the values the user has input. So something like: pull_number = $1 and username = $2. That means every query is dynamic. But given how Pgsql_io works, that would mean every query requires a different type annotation. The power of Pgsql_io is that it checks our queries at compile time, which conflicts with what we’re trying to accomplish: safe and dynamic queries at runtime.

There are two options:

  1. Bypass Pgsql_io and interact with a lower-level API, constructing the necessary values at runtime. This removes any help from the compiler.
  2. Find some way to maintain the static typing and construct the query dynamically.

Option (1) is definitely a fallback option. We lose all the value of the compile time type checks. We can do it if we have to but it would be great if we could find a solution for option (2).

At Terrateam, we generally approach software development problems with types. This is one reason we like our own frameworks: we are willing to pay the price of a less ergonomic API if it lets us use more expressive types. We look at developer productivity not just as how quickly we can develop the code, but also how difficult it is to use the code incorrectly, and how easy it is to refactor. Types are the tool that we’ve chosen to enforce these.

One could use other methodologies, such as TDD, but we find that we have to write far fewer tests because our types allow the compiler to check the code for us, and good types are less work than writing tests. Refactoring is really the killer app for the type system. As a newer company and product, implementing a feature often also comes with a refactoring.

An example is how we represent operations, such as a plan or apply, which we call work manifests. A work manifest has different representations depending on where it is in its life cycle. A new work manifest has no ID and receives one when it is saved to the database. When we construct a work manifest we don’t want to have to put in a fake ID; we don’t want an ID at all. Additionally, a work manifest is associated with a pull request, but sometimes when we read a work manifest from the database we aren’t interested in the details of the pull request, so there is no reason to load the pull request from the database. We want to represent all of these different types of work manifests.

One solution is that we could store the id and the pull_request in the work manifest as values that can be null. The downside is that every usage has to check if they are null. It would be nice if code paths that require an ID or a pull request could only receive work manifests where those fields are not null.

If we went with that initial solution of making those fields nullable, a work manifest might look like the following (in OCaml, option means that the value can be null). This defines a type, named work_manifest, that is a record (or a “struct” in other languages), with an optional id and src.

type pull_request = { (* .. *) }

type work_manifest = {
  id: string option;
  src: pull_request option;
  (* ... *)
}

Instead, we can make the types of the id and src fields type variables to the type work_manifest. This is called a “generic” in a lot of languages, and it means that the type work_manifest takes two parameters: the type of id and the type of src. People are more familiar with using type variables for collection types, like an array, or a hash map. But any type can take a type variable, including records and structs.

type ('id, 'src) work_manifest = {
  id: 'id;
  src: 'src;
  (* ... *)
}

Now we can specify different types for work manifests that match our use case. A new work manifest, one that does not have an id but does have a pull_request, would be an alias of work_manifest, like below. The unit type is like void in other languages; it means there is no id. And then we have a type called pull_request.

type new_work_manifest = (unit, pull_request) work_manifest

A “lite” work manifest, one that we load from a database, so it has an id, but we are not interested in the pull_request field, so we will use unit again to specify it is not present:

type lite_work_manifest = (string, unit) work_manifest

And then a full work manifest, that has both an id and a pull_request:

type full_work_manifest = (string, pull_request) work_manifest

Now, if a code path requires that the id and src are set, it will use the full_work_manifest type in its definition. A code path can also specify that it requires a src but does not care if there is an id set or not.
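
To make that concrete, here are a couple of hypothetical signatures (the names are illustrative, not our actual functions) building on the work_manifest type above. The first requires both an id and a pull request; the second needs a pull request but leaves the 'id type variable polymorphic, so it accepts a work manifest whether or not the id is set:

(* Requires a fully loaded work manifest: id and pull request present. *)
val report_status : full_work_manifest -> unit

(* Requires a pull request but is indifferent to the id. *)
val pull_request_of : ('id, pull_request) work_manifest -> pull_request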

This was very helpful when we decided to add support for drift to Terrateam. A drift work manifest has no pull_request; instead it has a reference to the drift schedule that initiated it. To support this, we make a new drift type and then create a drift_work_manifest type that uses it. We can also make a type that is either one.

type drift = { (* ... *) }
type drift_work_manifest = (string, drift) work_manifest
type drift_or_pr = Drift of drift | Pr of pull_request
type drift_or_pr_work_manifest = (string, drift_or_pr) work_manifest

Now that we had these types specified, we added type annotations to our existing code to get it to compile. For example, the function to save a work manifest to the database must be able to store a drift work manifest or a pull request work manifest, and it should not have been saved to the database already, so it takes a work manifest of type (unit, drift_or_pr) work_manifest. On the other hand, the RBAC code path only operates on work manifests with pull requests because drift work manifests are always initiated internally and do not need to go through the RBAC checks. So RBAC takes a (unit, pull_request) work_manifest.
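
Sketched as signatures (again with illustrative names, not our actual code), those two constraints might look like the following; the save function could even return the saved work manifest with its id filled in:

(* Only accepts work manifests that have not been saved yet (unit id),
   whether they came from drift or a pull request. *)
val save_work_manifest :
  (unit, drift_or_pr) work_manifest -> (string, drift_or_pr) work_manifest

(* RBAC only ever sees work manifests backed by a pull request. *)
val rbac_check : (unit, pull_request) work_manifest -> bool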

The nice part about all of this is that the compiler guides us through the process. We tell it what types we expect values to be at certain points in the program and it tells us if we need to change anything.

Now, back to our original problem: how can we implement Audit Trail queries without losing type safety?

Another way to look at this problem is that we want to generate a query that references values passed in by the user, for example the pull request number or the username (something like pr:123 and user:joe). When we turn a user’s query into SQL, we need to do it in such a way that it doesn’t change the type annotation that we gave the query with Pgsql_io.

SQL arrays to the rescue! SQL supports passing arrays. We actually use this in another context: bulk inserts. Rather than construct an insert statement with a bunch of entries in the values section, we pass in an array for each column of the row, turn the arrays into a set of rows, and pass that into the insert.

For example, we insert all of the repositories an installation contains with the following SQL:

insert into github_installation_repositories (
  id,
  installation_id,
  owner,
  name
)
select * from unnest($id, $installation_id, $owner, $name)
on conflict (id) do nothing

The type annotation for this looks like:

let insert_installation_repositories () =
  Pgsql_io.Typed_sql.(
    sql
    /^ query
    /% Var.(array (bigint "id"))
    /% Var.(array (bigint "installation_id"))
    /% Var.(array (text "owner"))
    /% Var.(array (text "name")))

How does this help us with creating dynamic queries? For the types of values users can pass in, there are only a few: strings and integers. So, what if, for each value a user passes in, we put it into an array of the correct type and then reference that array in the query? For a pull request, that would be an integer value, and for users it would be a string. In this case, our type annotation looks like the following:

let audit_trail_query () =
  Pgsql_io.Typed_sql.(
    sql
    // (* return ... *) Ret.not_important_for_this_example
    /^ generated_query
    /% Var.(array (text "strings"))
    /% Var.(array (integer "ints")))

But what do queries look like? Well, take the one we’ve been working with: pr:123 and user:joe. That becomes: (pull_number = ($ints)[1]) and (username = ($strings)[1]). As another example, a query that references two values of the same type, user:joe or user:moe, would become: (username = ($strings)[1]) or (username = ($strings)[2]). For every value of a given type (string or int), we add it to that type’s array and reference it by its index.
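
Here is a minimal sketch of what that translation could look like, reusing the hypothetical query type from earlier (the column mapping and error handling are simplified; this is not our actual implementation). Each user-supplied value is appended to the array for its type and referenced by its 1-based index, so the values never appear in the SQL text at all:

(* The hypothetical parsed-query type from earlier, repeated so this
   sketch stands on its own. *)
type query =
  | Filter of string * string
  | And of query * query
  | Or of query * query

(* Translate a query into a WHERE clause plus the $strings and $ints
   array parameters. *)
let to_where_clause q =
  let strings = ref [] in
  let ints = ref [] in
  (* Append a value to the given list and return its 1-based index. *)
  let push r v =
    r := !r @ [ v ];
    List.length !r
  in
  let rec go = function
    | Filter ("pr", v) ->
        Printf.sprintf "(pull_number = ($ints)[%d])" (push ints (int_of_string v))
    | Filter ("user", v) ->
        Printf.sprintf "(username = ($strings)[%d])" (push strings v)
    | Filter (key, _) -> failwith ("unknown key: " ^ key)
    | And (l, r) ->
        let ls = go l in
        let rs = go r in
        Printf.sprintf "(%s and %s)" ls rs
    | Or (l, r) ->
        let ls = go l in
        let rs = go r in
        Printf.sprintf "(%s or %s)" ls rs
  in
  let where = go q in
  (where, !strings, !ints)

(* user:joe or user:moe gives
   "((username = ($strings)[1]) or (username = ($strings)[2]))"
   with strings = ["joe"; "moe"] and ints = []. *)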

What’s nice about this is the type annotation stays the same for every query. We know escaping is not a problem, so no fear of injection. Testing becomes quite easy: we don’t even need a database; we just need to verify that the query string we generate matches what we expect.

To recap:

  1. We want to accept dynamic queries from the user.
  2. We want to translate those queries into SQL.
  3. We want to maximize the type safety, which means we need to check as much at compile time as possible.
  4. Dynamic and compile time don’t mix, so we have a problem.
  5. The solution is, when we translate the user query to SQL, to reference values the user passes into their query by array index. This way the type annotation of our SQL query is static even though the contents of the query are dynamic.

That’s great, problem solved!

Well, there are a few operational details. We have software that can translate a user query into a database query and execute it. However, we need to deploy it to users and maintain quality of service in the face of queries whose performance impact we cannot predict. While we have a restrictive query language, we have given a lot of power to our users.

We run a multi-tenant system, so first we have to make sure nobody can escape that and see other customers’ data. To do this, we use common table expressions to filter only the rows that the user can see, and then their user-generated query becomes the WHERE clause in querying those rows. They can never filter over more rows than they are able to see.
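
As a rough sketch of the shape this takes (the join and column names here are illustrative, not our exact schema), the tenant filter lives in the CTE and the generated WHERE clause only ever runs against it:

with visible_work_manifests as (
  select gwm.* from github_work_manifests as gwm
  inner join github_installation_repositories as gir
    on gir.id = gwm.repository
  where gir.installation_id = $installation_id
)
select id, state, username from visible_work_manifests
where (pull_number = ($ints)[1]) and (username = ($strings)[1])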

The other big concern is quality of service. We do not want a customer to monopolize database cycles. Our approaches to this are:

  1. We rate limit users.
  2. We add indices that match the common query patterns we expect.
  3. We make use of PostgreSQL’s statement_timeout configuration to limit how long a query can run (see the snippet after this list). This is such a great feature! On a per-query basis, PostgreSQL maintains a timeout, and if the query execution time exceeds the timeout, the server cancels it.
  4. We also have the possibility of adding more read replicas, as necessary.
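
For the statement timeout, the limit can be set per session or per role; as an illustration, any statement that runs longer than five seconds gets cancelled by the server:

-- Per session:
set statement_timeout = '5s';

-- Or per role (the role name here is illustrative):
alter role audit_trail_api set statement_timeout = '5s';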

With these, we have a lot of power in both providing a responsive service to users and protecting ourselves if a user, either accidentally or intentionally, creates expensive queries that could reduce the quality of the service.

But, while all the queries I tried in my testing environment worked, when we deployed this to a few customers for beta testing, we saw that queries they were executing were timing out. Luckily, the statement timeout feature of PostgreSQL logs the query so we can see the cause.

One of the requested features was the ability to query by the directories that an operation was executed in. We also want and and or operations to work on everything, including directories. PostgreSQL has great JSON support, which is a natural way to express data like this in a single row. The first solution assembled the "directory" column using a sub-select and then used the JSON operators in the query. The downside is that there is no indexing for this because the column is constructed on the fly. For a query such as dir:foo, the database has to do a sequential scan of the rows the user has available to them to construct the JSON column and then apply the WHERE clause. Obviously expensive.

To resolve this we knew we needed to index that column. That meant we would have to denormalize the database a little bit. Not great, but a sacrifice worth making. Then we could use a GIN index with the jsonb_path_ops operator class. This supports really efficient "contains" operations, such as the @> operator. The @> operator takes jsonb values on both sides and evaluates to true if the value on the left contains the value on the right. For example:

select '{"foo": {"bar": "baz", "zoom": "zaz"}}'::jsonb @> '{"foo": {"bar": "baz"}}'::jsonb;
?column?
----------
t
(1 row)

The jsonb_path_ops operator class is perfect for our use case. Effectively, what it does is create an index that hashes the path to every value in the JSON together with the value itself. In the example above, index entries would be created for (in some pseudo notation) hash('foo.bar', 'baz') and hash('foo.zoom', 'zaz'). To determine if our query is contained, the index can be checked for hash('foo.bar', 'baz'). The downside is that one cannot efficiently determine whether a given key exists on its own, but for our situation, that is fine. After making the necessary change and creating a migration script to generate the new column for all rows, we deployed. No more statement timeouts!
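
For illustration, the index and the kind of query it serves look roughly like this ("dirs" is a stand-in name for the denormalized JSON column, not our actual schema):

create index github_work_manifests_dirs_gin
  on github_work_manifests using gin (dirs jsonb_path_ops);

-- A query such as dir:foo becomes an index-backed containment check:
select id from github_work_manifests
where dirs @> '["foo"]'::jsonb;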

To wrap up, building dynamic queries has been a fascinating journey at Terrateam. By prioritizing type safety and leveraging PostgreSQL’s features like JSON indexing, we’ve equipped users with robust query capabilities while maintaining data integrity.
