SQL Trick: Latest entry in a set of results

Introduction

In the Terrateam product, users create a pull request and we run an operation on each directory that has changed in that pull request. Users can manually run the operation as many times as they want, after which if all directories have succeeded, the user can perform another operation to finalize the change.

Our data model stores each run across the directories individually and we track each directory individually. That is to say, each operation is a work_manifest and each directory run is a work_manifest_dir. To complicate matters more: a user may specify a subset of directories to run in a work_manifest.

Data Model

The data model for this case looks like the following:

CREATE TABLE work_manifest (
       id text primary key,
       pull_request text not null,
       created_at timestamp not null default (now())
);

CREATE TABLE work_manifest_dir (
       work_manifest_id text not null,
       dir text not null,
       success boolean not null,
       primary key (work_manifest_id, dir),
       foreign key (work_manifest_id) references work_manifest (id)
);

To create the example data in SQLite, paste the following INSERT:

INSERT INTO work_manifest (id, pull_request, created_at) VALUES
('WM1', 'PR1', '13:00'),
('WM2', 'PR1', '14:00'),
('WM3', 'PR1', '15:00');

INSERT INTO work_manifest_dir (work_manifest_id, dir, success) VALUES
('WM1', 'dir1', true),
('WM1', 'dir2', false),
('WM1', 'dir3', false),
('WM2', 'dir2', true),
('WM2', 'dir3', false),
('WM3', 'dir3', true);

Example

A user creates a pull request with a change to three directories: dir1, dir2, and dir3.

Run 1 - They run all directories and dir1 succeeds but dir2 and dir3 fail.
Run 2 - After fixing dir2 and dir3 they run again and now dir3 fails.
Run 3 - The user resolves the issue in dir3 and runs it. Success.

So we have three work_manifest entries in our database with the following work_manifest_dir entries:

MW1
- dir1
- dir2
- dir3
MW2
- dir2
- dir3
MW3
- dir3

The Challenge

The user now wants to perform the finalize operation, how do we determine if the latest run of each directory is a success? We can't just find the latest work_manifest and make sure all work_manifest_dir entries are successful because a directory could have been run as part of an earlier work_manifest.

Instead, we need to find the most recent run of each directory and verify it was successful.

The Solution (Window Functions)

There are a few ways to accomplish this. One is with sub-selects. But I like using window functions.

The gist of the solution is:

Select all directories for a pull request.
Use a window function to partition that output by directory and order them by the created_at timestamp and assign a row number to that ordering.
Select all rows with a row number of 1.

We will use common table expressions to make everything more readable:

with
dirs as (
    select
        wm.pull_request as pull_request,
        wmd.dir as dir,
        wm.created_at as created_at,
        wmd.success as success,
        row_number() over
            (partition by
                wm.pull_request, wmd.dir
             order by wm.created_at desc) as rn
    from work_manifest as wm
    inner join work_manifest_dir as wmd on wmd.work_manifest_id = wm.id
)
select * from dirs where rn = 1

This query will give the results for every pull request in the database. To limit it to one pull request, add an and pull_request = <pr number> to the final select.

The interesting bit is in the line with row_number(). What this says if we were to take all these results and group them by the pull_request and the dir and order that by created_at, in descending order, what would be the row number in that output? This is like a nested loop in another language. We then select from dirs all those rows with a row number of 1.

To visualize this, imagine the following data (this is the example data given above):

work_manifest

id	pull_request	created_at
WM1	PR1	13:00
WM2	PR1	14:00
WM3	PR1	15:00

work_manifest_dir

work_manifest_id	dir	success
WM1	dir1	true
WM1	dir2	false
WM1	dir3	false
WM2	dir2	true
WM2	dir3	false
WM3	dir3	true

Now, if were to partition the data over pull_request and dir and order by created_at in descending order, we'd get the following three partitions:

PR1 and dir1

pull_request	dir	created_at
PR1	dir1	13:00

PR1 and dir2

pull_request	dir	created_at
PR1	dir2	14:00
PR1	dir2	13:00

PR1 and dir3

pull_request	dir	created_at
PR1	dir3	15:00
PR1	dir3	14:00
PR1	dir3	12:00

Finally, the output of dirs would be:

pull_request	dir	created_at	success	rn
PR1	dir1	13:00	true	1
PR1	dir2	14:00	true	1
PR1	dir2	13:00	false	2
PR1	dir3	15:00	true	1
PR1	dir3	14:00	false	2
PR1	dir3	13:00	false	3

And by selecting all those rows with an rn of 1 we see:

pull_request	dir	created_at	success	rn
PR1	dir1	13:00	true	1
PR1	dir2	14:00	true	1
PR1	dir3	15:00	true	1

And they are all successful, so we can finalize.

Conclusion

Window functions are a really powerful way to effectively do an inner loop in a query but in an efficient way. There are lots of other functions besides row_number() and are worth checking. Like most things in SQL, window functions can get really complicated, they do a lot and I don't even know half of their functionality. But this little feature has served me well.

Learn

Connect