LATERAL, a hidden gem in PostgreSQL

LATERAL is a very useful tool that every SQL user should have in his toolbelt. However, it is normally not explained in introductory courses and many get to miss it.

Suppose that we want to find the perfect match between two tables. For example, we could have a table with all students, a table with all schools, and we would like to find the school that is closest to each student. Another example, we have a table of users with some preferences and another table with products, and we want to find the product that is most similar to what they want. Then LATERAL is a lot of times the right tool for you.

Let’s try out to solve the problem with mock data.

First, we start by creating the tables:

CREATE TABLE test_table1 (id1 int PRIMARY KEY, property DOUBLE PRECISION);
CREATE TABLE test_table2 (id2 int PRIMARY KEY, property DOUBLE PRECISION);

INSERT INTO test_table1 (id1, property)
SELECT generate_series, RANDOM() from generate_series (1, 100000);

INSERT INTO test_table2 (id2, property)
SELECT generate_series, RANDOM() from generate_series (1, 100000);

CREATE INDEX ON test_table1(property);
CREATE INDEX ON test_table2(property);

Now, as we said, we want to find for each id1 of test_table1, the id2 of test_table2 which is closest, in our case that is going to be ABS(test_table1.property – test_table2.property).

One possible way to do it is using distinct on:

SELECT
      DISTINCT ON (t1.id1) 
      t1.id1,
      t2.id2
FROM 
      test_table1 t1,
      test_table2 t2 
ORDER BY 
      t1.id1,
      ABS(t1.property - t2.property)
LIMIT 10

But the performance is terrible, after 20min it hadn’t finished, if we do explain on the query:

EXPLAIN SELECT DISTINCT ON (t1.id1) t1.id1, t2.id2 FROM test_table1 t1, test_table2 t2 ORDER BY t1.id1, ABS(t1.property - t2.property) LIMIT 10
'Limit  (cost=2177764254.44..2177769254.44 rows=10 width=16)'
'  ->  Unique  (cost=2177764254.44..2227764254.44 rows=100000 width=16)'
'        ->  Sort  (cost=2177764254.44..2202764254.44 rows=10000000000 width=16)'
'              Sort Key: t1.id1, (abs((t1.property - t2.property)))'
'              ->  Nested Loop  (cost=0.00..175003332.00 rows=10000000000 width=16)'
'                    ->  Seq Scan on test_table1 t1  (cost=0.00..1541.00 rows=100000 width=12)'
'                    ->  Materialize  (cost=0.00..2041.00 rows=100000 width=12)'
'                          ->  Seq Scan on test_table2 t2  (cost=0.00..1541.00 rows=100000 width=12)'

We see that it has to sort a collection of size(test_table1) * size (test_table2) so that is no good.

Lateral is the perfect tool for this kind of problems, it allows to do a query for each row of the first table:

SELECT 
      t1.id1,
      t2.id2
FROM 
      test_table1 t1,
      LATERAL 
            (SELECT 
                   tt.id2 
             FROM 
                   test_table2 tt 
             ORDER BY 
                   ABS(tt.property - t1.property) 
             LIMIT 1
            )t2 
LIMIT 10

It takes 293ms!! More than 6000x faster!! Let’s examine a bit the query inside the lateral clause. We are doing a query for test_table2 where we can reference t1. Wait, that sounds a lot like subqueries, and in fact using one gives exactly the same time.

SELECT 
      t1.id1, 
      (SELECT t2.id2 FROM test_table2 t2 ORDER BY ABS(t1.property - t2.property) LIMIT 1)
FROM 
      test_table1 t1
LIMIT 10

Also 293ms. However, LATERAL is way more powerful. Suppose we want the two closest locations, we simply need to change the LIMIT 1 for LIMIT 2, a subquery will raise an error because we cannot return more than one row. Suppose that apart of the id2 we also want the value of the difference, we just need to add the value to the lateral clause.

SELECT 
      t1.id1,
      t2.id2,
      t2.diff
FROM 
      test_table1 t1,
      LATERAL 
            (SELECT 
                   tt.id2, ABS(tt.property - t1.property) diff 
             FROM 
                   test_table2 tt
             ORDER BY 
                   ABS(tt.property - t1.property) 
             LIMIT 2
             )t2 
LIMIT 10

That is impossible with a subquery, you can only return one column.

In summary, whenever you need to match two tables and try to optimize some criterion you should give it a try to LATERAL.

 

By Gabriel "Fusten" Furstenheim