We recently released the Audit Trail feature of Terrateam. Audit Trail is part of the Terrateam Console, allowing users to view and query their Terrateam operations. Audit Trail is a new type of feature for us. Most of Terrateam is driven by reacting to GitHub webhooks: receive an event, evaluate it, and initiate a GitHub Action. All of those database queries are under our control and optimized. Audit Trail gives users the ability to query the database with their own queries. This is a more challenging feature to implement, to communicate, and to maintain quality of service for. The development and implementation of Audit Trail is a good view into how Terrateam, as a company, works.
Terrateam is different from a lot of other companies in this space in the technology choices we make. The biggest stand-out is that our backend is implemented entirely in Ocaml. I’ve spoken at length about why we chose Ocaml, see here and here.
We’ve also written a lot of custom frameworks, libraries, and tooling for Ocaml. Our rule of thumb is: no to frameworks, yes to libraries. This is because we want control. Ocaml is such a powerful and expressive language that we are able to deliver features quickly despite taking on so much development. A lot of the underlying protocols and technology we interact with are quite stable, so while we might have a higher initial development cost, our maintenance cost is essentially zero as we do not have to modify our code as frameworks we do not control change over time.
One of our core libraries is `Pgsql_io`, which is an implementation of the PostgreSQL wire protocol in Ocaml. A major feature of `Pgsql_io` is that it supports adding type annotations to SQL queries. In `Pgsql_io`, one writes a SQL query string and then provides extra information to specify what kind of values it takes and what kind of values it returns. Here is an example of how we use it to create a work manifest row:
The SQL:
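Something along these lines, with hypothetical table and column names (note the `$`-prefixed variables):

```sql
insert into work_manifests (username, repository, run_type)
values ($username, $repository, $run_type)
returning id, created_at
```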
The annotated query in `Pgsql_io`:
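A sketch of the shape of such an annotation; the combinator names here are illustrative, not the exact `Pgsql_io` API:

```ocaml
let insert_work_manifest =
  Pgsql_io.Typed_sql.(
    sql
    (* Returned columns; the comments name them for the developer. *)
    // (* id *) Ret.uuid
    // (* created_at *) Ret.text
    /^ "insert into work_manifests (username, repository, run_type) \
        values ($username, $repository, $run_type) \
        returning id, created_at"
    (* $username is a text value and can be null. *)
    /% Var.(option (text "username"))
    (* $repository is a bigint and cannot be null. *)
    /% Var.bigint "repository"
    /% Var.text "run_type")
```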
And here is a usage of it:
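A sketch, with hypothetical module and function names:

```ocaml
let create db ~username ~repository ~run_type =
  (* The compiler checks that the arguments match the annotation above. *)
  Pgsql_io.Prepared_stmt.fetch
    db
    Sql.insert_work_manifest
    ~f:(fun id created_at -> (id, created_at))
    username
    repository
    run_type
```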
To put it all together, we have:
- A string which is an SQL expression. This looks like any ol’ SQL; the only difference is that it references variables with a `$`, such as `$username`.
- We have type annotations for that SQL string, which specify the types of the variables: `$username` is a `text` value and can be `null`; `$repository` is a `bigint` and cannot be `null`. We also specify the type of returned rows. We use comments to give the names of the columns, which is just for developer convenience.
- We have the call to `Pgsql_io.Prepared_stmt.fetch`, which takes a database connection, an SQL expression, a function to apply to each returned row, and the values of the variables referenced in the SQL expression.
`Pgsql_io` takes all of these, rewrites the SQL expression into a prepared statement, maps the variables to the prepared statement parameters, and executes it; for every row it applies the function, collects the results, and returns a list of values.
What’s so powerful about this is that it all gets verified at compile time. That is to say, before we run the program, we know that all uses of the SQL expression match the type annotation. If we refactor an SQL expression, we change the type annotation to match, and the compiler will tell us every call site that we need to fix.
“Hey, wait a second!”, you say, “How do you verify the SQL you wrote matches the type annotation?” We don’t! Originally we had looked at building a DSL that compiles to SQL; then we would know that if the types match, so does the SQL. That was quickly abandoned. SQL is just too large a language, and we make use of a lot of its power; we did not feel confident that building such a DSL would be feasible for us. So we made a compromise in the name of pragmatism: SQL statements tend not to change that much, and when they are wrong it’s usually pretty blatant. You’ve either specified the correct types or you haven’t, and when that SQL is executed you’ll know quickly if it’s right or wrong.
With that background, on to the interesting part: we had decided to implement a new feature called Audit Trail. While Terrateam is designed to be used within GitHub for the standard workflow, reporting operations live inside of the Terrateam Console. A common request is that users want to be able to see what operations Terrateam has performed. In a busy organization, it’s hard to do this just by scrolling through the list of runs. We wanted to give users a powerful and robust query language to filter results.
Terrateam already makes use of a simple query language we call the Tag Query Language. It’s used in many places in the product. The nice thing about it is that it has a syntax that is straightforward to learn and it can be used in different contexts. We just needed to define which attributes of an operation we wanted to expose to users.
The way Audit Trail works is a user types a query into the UI, which makes an API call passing the query as a string. The backend parses the query, translates it to SQL, and executes it. For example, the query `pr:123 and user:joe` would translate to an SQL query like `pull_number = 123 and username = 'joe'`.
We don’t want to construct a raw SQL query, though. That leaves us vulnerable to all sorts of obvious injection threats. What we really want to do is construct a prepared statement that references the values the user has input. So something like: `pull_number = $1 and username = $2`. That means every query is dynamic. But given how `Pgsql_io` works, every query would require a different type annotation. The power of `Pgsql_io` is that it checks our queries at compile time, and that conflicts with what we’re trying to accomplish: safe and dynamic queries at runtime.
There are two options:
- Bypass `Pgsql_io` and interact with a lower-level API, constructing the necessary values at runtime. This removes any help from the compiler.
- Find some way to maintain the static typing and construct the query dynamically.
Option (1) is definitely a fallback. We lose all the value of the compile-time type checks. We can do it if we have to, but it would be great if we could find a solution for option (2).
At Terrateam, we generally approach software development problems with types. This is one reason we like our own frameworks: we are willing to pay the price of a less ergonomic API if it lets us use more expressive types. We look at developer productivity not just as how quickly we can develop the code, but also how difficult it is to use the code incorrectly, and how easy it is to refactor. Types are the tool that we’ve chosen to enforce these.
One could use other methodologies, such as TDD, but we find that we have to write far fewer tests because our types allow the compiler to check the code for us, and good types are less work than writing tests. Refactoring is really the killer app for the type system. As a newer company and product, implementing features often also comes with a refactoring.
An example is how we represent operations, such as a plan or apply, which we call work manifests. A work manifest has different representations depending on where it is in its life cycle. A new work manifest has no ID and receives one when it is saved to the database. When we construct a work manifest we don’t want to have to put in a fake ID; we don’t want to have an ID at all. Additionally, a work manifest is associated with a pull request, but sometimes when we read a work manifest from the database we aren’t interested in the details of the pull request, so there is no reason to load the pull request from the database. We want to represent all of these different types of work manifests.
One solution is we could store the `id` and the `pull_request` in the work manifest as values that can be `null`. The downside is that every usage has to check if it is `null`. It would be nice if code paths that require an ID or a pull request could only receive work manifests where those fields are not `null`.
If we chose the initial solution, to solve this by making those fields nullable, a work manifest might look like below (in Ocaml, `option` means that the value can be `null`). This defines a type, named `work_manifest`, that is a record (or a “struct” in other languages), with an optional `id` and `src`.
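A sketch of that nullable shape, with hypothetical field and type names:

```ocaml
type pull_request = { pull_number : int (* ... other fields elided ... *) }

(* With nullable fields: every user of id and src must handle None. *)
type work_manifest = {
  id : string option;
  src : pull_request option;
  (* ... other fields elided ... *)
}
```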
Instead, we can make the types of the `id` and `src` fields type variables to the type `work_manifest`. This is called a “generic” in a lot of languages, and it means that the type `work_manifest` takes two parameters: the type of `id` and the type of `src`. People are more familiar with using type variables for collection types, like an array or a hash map. But any type can take a type variable, including records and structs.
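A sketch of the parameterized version, with illustrative field names:

```ocaml
(* id and src are now type variables: work_manifest takes two type
   parameters, one for each field. *)
type ('id, 'src) work_manifest = {
  id : 'id;
  src : 'src;
  (* ... other fields elided ... *)
}
```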
Now we can specify different types for work manifests that match our use case. A new work manifest, one that does not have an `id` but does have a `pull_request`, would be an alias to `work_manifest`, like below. The `unit` type is like `void` in other languages; it means there is no `id`. And then we have a type called `pull_request`.
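Sketched, with the hypothetical supporting types included for context:

```ocaml
type pull_request = { pull_number : int }
type ('id, 'src) work_manifest = { id : 'id; src : 'src }

(* A new work manifest: unit in the id position means "no id", and the
   src is a pull request. *)
type new_work_manifest = (unit, pull_request) work_manifest
```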
A “lite” work manifest, one that we load from a database, so it has an `id`, but we are not interested in the `pull_request` field, so we will use `unit` again to specify it is not present:
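A sketch, using a string for the ID purely for illustration:

```ocaml
type ('id, 'src) work_manifest = { id : 'id; src : 'src }

(* Loaded from the database, so it has an id, but unit says the pull
   request is not present. *)
type lite_work_manifest = (string, unit) work_manifest
```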
And then a full work manifest, that has both an `id` and a `pull_request`:
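Again as a sketch, with the hypothetical supporting types included:

```ocaml
type pull_request = { pull_number : int }
type ('id, 'src) work_manifest = { id : 'id; src : 'src }

(* Both the id and the pull request are present. *)
type full_work_manifest = (string, pull_request) work_manifest
```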
Now, if a code path requires that the `id` and `src` are set, it will use the `full_work_manifest` type in its definition. A code path can also specify that it requires a `src` but does not care whether an `id` is set or not.
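Sketched with the same illustrative types, one function of each kind:

```ocaml
type pull_request = { pull_number : int }
type ('id, 'src) work_manifest = { id : 'id; src : 'src }
type full_work_manifest = (string, pull_request) work_manifest

(* Requires both the id and the pull request to be present. *)
let describe (wm : full_work_manifest) =
  Printf.sprintf "%s (#%d)" wm.id wm.src.pull_number

(* Requires a pull request but stays polymorphic in the id: it accepts
   any work manifest that has a src, with or without an id. *)
let pull_number (wm : ('id, pull_request) work_manifest) = wm.src.pull_number
```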
This was very helpful when we decided to add support for drift to Terrateam. A drift work manifest has no `pull_request`; instead it has a reference to the drift schedule that initiated it. To support this, we make a new `drift` type and then create a `drift_work_manifest` type that uses it. We can also make a type that is either.
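A sketch; the `drift` fields here are hypothetical:

```ocaml
type pull_request = { pull_number : int }
type ('id, 'src) work_manifest = { id : 'id; src : 'src }

(* Hypothetical: a drift source referencing the schedule that started it. *)
type drift = { schedule : string }

type 'id drift_work_manifest = ('id, drift) work_manifest

(* A source that is either a drift schedule or a pull request. *)
type drift_or_pr =
  | Drift of drift
  | Pr of pull_request
```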
Now that we had these types specified, we added type annotations to our existing code to get it to compile. For example, the function to save a work manifest to the database must be able to store a drift work manifest or a pull request work manifest, and it should not have been saved to the database already, so it takes a work manifest of type `(unit, drift_or_pr) work_manifest`. On the other hand, the RBAC code path only operates on work manifests with pull requests because drift work manifests are always initiated internally and do not need to go through the RBAC checks. So RBAC takes a `(unit, pull_request) work_manifest`.
The nice part about all of this is that the compiler guides us through the process. We tell it what types we expect values to be at certain points in the program and it tells us if we need to change anything.
Now, back to our original problem: how can we implement Audit Trail queries without losing type safety?
Another way to look at this problem is that we want to generate a query where the query references values passed in by the user, for example the pull request number or the username (something like `pr:123 and user:joe`). When we turn a user’s query into SQL, we need to do it in such a way that it doesn’t change the type annotation that we gave the query with `Pgsql_io`.
SQL arrays to the rescue! SQL supports passing arrays. We actually use this in another context: bulk inserts. Rather than construct an insert statement with a bunch of entries in the `values` section, we pass in an array for each part of the row, turn the arrays into a set of rows, and pass that to the insert.
For example, we insert all of the repositories an installation contains with the following SQL:
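Something of this shape, with hypothetical table and column names; `unnest` over several arrays in parallel produces one row per array element:

```sql
insert into github_installation_repositories (installation_id, repository_id, name)
select $installation_id, repo_id, repo_name
from unnest($repository_ids, $repository_names) as t (repo_id, repo_name)
```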
The type annotation for this looks like:
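A sketch; the array combinator names are assumptions, not the exact `Pgsql_io` API:

```ocaml
let insert_installation_repos =
  Pgsql_io.Typed_sql.(
    sql
    /^ "insert into github_installation_repositories (installation_id, repository_id, name) \
        select $installation_id, repo_id, repo_name \
        from unnest($repository_ids, $repository_names) as t (repo_id, repo_name)"
    /% Var.bigint "installation_id"
    (* One array variable per column of the inserted rows. *)
    /% Var.(array (bigint "repository_ids"))
    /% Var.(array (text "repository_names")))
```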
How does this help us with creating dynamic queries? The types of values users can pass in are few: just strings and integers. So, what if, for each value a user passes in, we put it into an array of the correct type and then reference that array in the query? For a pull request, that would be an integer value, and for users it would be a string. In this case, our type annotation looks like the following:
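A sketch in the same illustrative style (the combinator names are assumptions, not the exact `Pgsql_io` API); note that only the `WHERE` clause varies, while the annotation is fixed:

```ocaml
let audit_trail_query where_clause =
  Pgsql_io.Typed_sql.(
    sql
    // (* id *) Ret.uuid
    // (* created_at *) Ret.text
    /^ ("select id, created_at from work_manifests where " ^ where_clause)
    (* One array per type of user-supplied value; the annotation stays
       the same no matter what query the user writes. *)
    /% Var.(array (text "strings"))
    /% Var.(array (bigint "ints")))
```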
But what do queries look like? Well, take the one we’ve been working with: `pr:123 and user:joe`. That becomes: `(pull_number = ($ints)[1]) and (username = ($strings)[1])`. As another example, translating a query that references two values of the same type, `user:joe or user:moe` would become: `(username = ($strings)[1]) or (username = ($strings)[2])`. For every variable of that type (`string` or `int`) we add it to the array and reference it by its index.
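The translation step can be sketched as follows, assuming a tiny AST for parsed queries; the names are illustrative, not Terrateam’s actual code, and this version wraps every `and`/`or` in its own parentheses:

```ocaml
type query =
  | Pr of int
  | User of string
  | And of query * query
  | Or of query * query

(* Translate a query to a WHERE clause, accumulating the string and int
   arrays that the prepared statement will receive. Each user value is
   referenced by its 1-based index into the array of its type. *)
let to_sql q =
  let strings = ref [] in
  let ints = ref [] in
  let add r v =
    r := !r @ [ v ];
    List.length !r
  in
  let rec go = function
    | Pr n -> Printf.sprintf "(pull_number = ($ints)[%d])" (add ints n)
    | User u -> Printf.sprintf "(username = ($strings)[%d])" (add strings u)
    | And (l, r) ->
        (* Evaluate left before right so indices are assigned in order. *)
        let ls = go l in
        let rs = go r in
        "(" ^ ls ^ " and " ^ rs ^ ")"
    | Or (l, r) ->
        let ls = go l in
        let rs = go r in
        "(" ^ ls ^ " or " ^ rs ^ ")"
  in
  let where = go q in
  (where, !strings, !ints)
```

For `pr:123 and user:joe` this produces the clause `((pull_number = ($ints)[1]) and (username = ($strings)[1]))` along with the arrays `["joe"]` and `[123]`.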
What’s nice about this is that the type annotation stays the same for every query. We know escaping is not a problem, so no fear of injection. Testing becomes quite easy: we don’t even need a database; we just need to verify that the query string we generate matches what we expect.
To recap:
- We want to accept dynamic queries from the user.
- We want to translate those queries into SQL.
- We want to maximize the type safety, which means we need to check as much at compile time as possible.
- Dynamic and compile time don’t mix, so we have a problem.
- The solution is, when we translate the user query to SQL, to reference values the user passes into their query by array index. This way the type annotation of our SQL query is static even though the contents of the query are dynamic.
That’s great, problem solved!
Well, there are a few operational details. We have software that can translate a user query into a database query and execute it. However, we need to deploy it to users and maintain quality of service in the face of queries whose performance impact we cannot predict. While we have a restrictive query language, we have given a lot of power to our users.
We run a multi-tenant system, so first we have to make sure nobody can escape that and see other customers’ data. To do this, we use common table expressions to filter only the rows that the user can see, and then their user-generated query becomes the `WHERE` clause in querying those rows. They can never filter over more rows than they are able to see.
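Sketched with hypothetical table and column names, the shape of the final query is something like:

```sql
with visible_work_manifests as (
  -- Tenant filter: fixed by us, not influenced by the user's query.
  select * from work_manifests
  where installation_id = $installation_id
)
select id, created_at
from visible_work_manifests
-- The translated user query only ever filters the visible rows.
where (pull_number = ($ints)[1]) and (username = ($strings)[1])
```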
The other big concern is quality of service. We do not want a customer to monopolize database cycles. Our approaches to this are:
- We rate limit users.
- We add indices that match common query patterns we expect.
- We make use of PostgreSQL’s `statement_timeout` configuration to limit how long a query can run. This is such a great feature! On a per-query basis, PostgreSQL maintains a timeout, and if the query execution time exceeds the timeout, the server cancels it.
- We also have the possibility of adding more read replicas, as necessary.
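For example, capping statements in a session at five seconds (an illustrative value; the setting is in milliseconds):

```sql
set statement_timeout = 5000;
```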
With these, we have a lot of power in both providing a responsive service to users and protecting ourselves if a user, either accidentally or intentionally, creates expensive queries that could reduce the quality of the service.
But, while all the queries I tried in my testing environment worked, when we deployed this to a few customers for beta testing, we saw that queries they were executing were timing out. Luckily, the statement timeout feature of PostgreSQL logs the query so we can see the cause.
One of the requested features was the ability to query by the directories that an operation was executed in. We also want to do `and` and `or` operations on everything, including directories. PostgreSQL has great JSON support, which is a great way to express data like this in a single row. The first solution to this would assemble the “directory” column using a sub-select, then use the JSON operators in the query. The downside is that there is no indexing for this because the column is constructed on the fly. In a query such as `dir:foo`, the database has to do a sequential scan of the rows the user has available to them to construct the JSON column and then apply the `WHERE` clause. Obviously expensive.
To resolve this we knew we needed to index that column. That meant we would have to denormalize the database a little bit. Not great, but a sacrifice worth making. Then we could use a `GIN` index with the `jsonb_path_ops` operator class. You can do really efficient “contains” queries with operators such as `@>`. The `@>` operator takes `jsonb` values on both sides and evaluates to true if the value on the left contains the value on the right. For example:
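A small illustration of containment:

```sql
select '{"foo": {"bar": "baz", "zoom": "zaz"}}'::jsonb
       @> '{"foo": {"bar": "baz"}}'::jsonb;
-- true: the left value contains the right value
```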
The `jsonb_path_ops` operator class is perfect for our use case. Effectively, what it does is create an index that hashes the path to every value in the JSON together with the value itself. In the example above, index entries would be created for (in some pseudo notation) `hash('foo.bar', 'baz')` and `hash('foo.zoom', 'zaz')`. To determine if our query is contained, the index can be checked for `hash('foo.bar', 'baz')`. The downside is that one cannot efficiently determine if the key `foo.baz` exists, but for our situation, that is fine. After making the necessary change and creating a migration script to generate the new column for all rows, we deployed. No more statement timeouts!
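The denormalization and index, with hypothetical table and column names, boil down to something like:

```sql
-- Denormalize the directories into their own jsonb column.
alter table work_manifests add column dirs jsonb;
-- (backfill of existing rows elided)
create index work_manifests_dirs_gin
  on work_manifests using gin (dirs jsonb_path_ops);
```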
To wrap up, dynamic queries have been a fascinating journey at Terrateam. By prioritizing type safety and leveraging PostgreSQL’s features like JSON indexing, we’ve equipped users with robust query capabilities while maintaining data integrity.