In the Terrateam product, users create a pull request and we run an operation on each directory that has changed in that pull request. Users can manually run the operation as many times as they want, after which if all directories have succeeded, the user can perform another operation to finalize the change.
Our data model stores each run across the directories individually and we track
each directory individually. That is to say, each operation is a
work_manifest
and each directory run is a work_manifest_dir
. To complicate
matters more: a user may specify a subset of directories to run in a
work_manifest
.
Data Model
The data model for this case looks like the following:
To create the example data in SQLite, paste the following INSERT
:
Example
A user creates a pull request with a change to three directories: dir1
,
dir2
, and dir3
.
- Run 1 - They run all directories and
dir1
succeeds butdir2
anddir3
fail. - Run 2 - After fixing
dir2
anddir3
they run again and nowdir3
fails. - Run 3 - The user resolves the issue in
dir3
and runs it. Success.
So we have three work_manifest
entries in our database with the following
work_manifest_dir
entries:
MW1
dir1
dir2
dir3
MW2
dir2
dir3
MW3
dir3
The Challenge
The user now wants to perform the finalize operation, how do we determine if the
latest run of each directory is a success? We can’t just find the latest
work_manifest
and make sure all work_manifest_dir
entries are successful
because a directory could have been run as part of an earlier work_manifest
.
Instead, we need to find the most recent run of each directory and verify it was successful.
The Solution (Window Functions)
There are a few ways to accomplish this. One is with sub-selects. But I like using window functions.
The gist of the solution is:
- Select all directories for a pull request.
- Use a window function to partition that output by directory and order them by
the
created_at
timestamp and assign a row number to that ordering. - Select all rows with a row number of
1
.
We will use common table expressions to make everything more readable:
This query will give the results for every pull request in the database. To
limit it to one pull request, add an and pull_request = <pr number>
to the
final select
.
The interesting bit is in the line with row_number()
. What this says if we
were to take all these results and group them by the pull_request
and the
dir
and order that by created_at
, in descending order, what would be the row
number in that output? This is like a nested loop in another language. We then
select from dirs
all those rows with a row number of 1
.
To visualize this, imagine the following data (this is the example data given above):
work_manifest
id | pull_request | created_at |
---|---|---|
WM1 | PR1 | 13:00 |
WM2 | PR1 | 14:00 |
WM3 | PR1 | 15:00 |
work_manifest_dir
work_manifest_id | dir | success |
---|---|---|
WM1 | dir1 | true |
WM1 | dir2 | false |
WM1 | dir3 | false |
WM2 | dir2 | true |
WM2 | dir3 | false |
WM3 | dir3 | true |
Now, if were to partition the data over pull_request
and dir
and order by
created_at
in descending order, we’d get the following three partitions:
PR1 and dir1
pull_request | dir | created_at |
---|---|---|
PR1 | dir1 | 13:00 |
PR1 and dir2
pull_request | dir | created_at |
---|---|---|
PR1 | dir2 | 14:00 |
PR1 | dir2 | 13:00 |
PR1 and dir3
pull_request | dir | created_at |
---|---|---|
PR1 | dir3 | 15:00 |
PR1 | dir3 | 14:00 |
PR1 | dir3 | 12:00 |
Finally, the output of dirs
would be:
pull_request | dir | created_at | success | rn |
---|---|---|---|---|
PR1 | dir1 | 13:00 | true | 1 |
PR1 | dir2 | 14:00 | true | 1 |
PR1 | dir2 | 13:00 | false | 2 |
PR1 | dir3 | 15:00 | true | 1 |
PR1 | dir3 | 14:00 | false | 2 |
PR1 | dir3 | 13:00 | false | 3 |
And by selecting all those rows with an rn
of 1
we see:
pull_request | dir | created_at | success | rn |
---|---|---|---|---|
PR1 | dir1 | 13:00 | true | 1 |
PR1 | dir2 | 14:00 | true | 1 |
PR1 | dir3 | 15:00 | true | 1 |
And they are all successful, so we can finalize.
Conclusion
Window functions are a really powerful way to effectively do an inner loop in a
query but in an efficient way. There are lots of other functions besides
row_number()
and are worth checking. Like most things in SQL, window
functions can get really complicated, they do a lot and I don’t even know half
of their functionality. But this little feature has served me well.