Parallelizing queries in PostgreSQL with Python

At Geoblink we run more than 20,000 queries to generate just one of our several ~100 GB PostgreSQL databases from scratch from our raw data files. Running them sequentially would take far too long, so to reduce the generation time we parallelize many of them. Doing that by hand would be impractical, so we use a small script written in Python to generate and run the queries.

In this post I will show an example of how to do it in Python.

As an example, we will create a table and fill it with 20 million random numbers.

First we create the table:

DROP TABLE IF EXISTS public.test;
CREATE TABLE public.test(
    value numeric
);

And then we fill it:

INSERT INTO test (value)
SELECT
    random()
FROM
    generate_series(1, 20000000);

On my machine (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) this takes about 64 seconds.

Now let’s try to parallelize the query. We will use the Python multiprocessing package to run several processes in parallel and psycopg2 to connect to the database.

First we define the function to run a query:

import psycopg2

def runQuery(query):
    # dbname, user, host, port and password are assumed to be defined elsewhere
    connect_text = "dbname='%s' user='%s' host=%s port=%s password='%s'" % (
        dbname, user, host, port, password)
    con = psycopg2.connect(connect_text)
    cur = con.cursor()
    cur.execute(query)
    con.commit()
    cur.close()
    con.close()

With this function we can run any query that does not return results: creating tables, updating them, etc.
The next step is to create an array of queries.

queries = []
for i in range(N_jobs):
    query = """
        INSERT INTO test (value)
        SELECT random()
        FROM
            generate_series(1, %d);
    """ % (N_random // N_jobs)
    queries.append(query)

Here N_jobs is the number of queries we want to execute and N_random is the total number of rows to insert (20 million in our example). With this, we have created an array of independent queries that we can execute in parallel.
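One detail to watch: since the division is integer division, the queries above insert N_jobs * (N_random // N_jobs) rows, which falls short of N_random whenever N_jobs does not divide it exactly. A minimal sketch of one way to spread the remainder over the first jobs (same variable names as above):

```python
N_random = 20000000
N_jobs = 7  # deliberately not a divisor of N_random

base, remainder = divmod(N_random, N_jobs)
# The first `remainder` jobs get one extra row each
rows_per_job = [base + 1 if i < remainder else base for i in range(N_jobs)]

queries = [
    "INSERT INTO test (value) SELECT random() FROM generate_series(1, %d);" % n
    for n in rows_per_job
]

# Every row is accounted for
assert sum(rows_per_job) == N_random
```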
Now we need to create a pool with the number of CPUs that we want to use, 4 in our case.

import multiprocessing

CPUS = 4
pool = multiprocessing.Pool(CPUS)

Finally we execute the queries using the imap_unordered method. It takes a function and an iterable, applies the function to each element, and yields the results in whatever order they finish.

for _ in pool.imap_unordered(runQuery, queries):
    pass
pool.close()
pool.join()
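The pattern itself can be tried without a database. Here is a minimal sketch that uses a thread pool (so it runs anywhere without pickling concerns) and a toy function in place of runQuery to show how imap_unordered hands back results as workers finish:

```python
from multiprocessing.pool import ThreadPool
import time

def slow_square(n):
    # Simulate work that takes longer for larger inputs
    time.sleep(n * 0.05)
    return n * n

pool = ThreadPool(4)
# Results arrive roughly in completion order, not input order
results = list(pool.imap_unordered(slow_square, [3, 1, 2, 0]))
pool.close()
pool.join()

print(sorted(results))  # the same four squares, whatever the arrival order
```

For database work a thread pool is often enough, since psycopg2 releases the GIL while waiting on the server, but multiprocessing.Pool works the same way.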

With this method the calculation took only 30 seconds using 4 cores, an improvement of more than 50% over the single-query version. We probably did not see a bigger improvement because of the disk’s read/write speed.

The more cores available, the more advantage we can take from parallelizing queries. A good idea is to split the work along existing indexed columns, such as provinces, municipalities or distances, as we do when we create our databases at Geoblink.
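For instance, instead of splitting by row count, the query array can be built from the values of an indexed column. A sketch of the idea, where the table, column and province names are made up for illustration:

```python
# Hypothetical list of values of an indexed column
provinces = ["Madrid", "Barcelona", "Valencia", "Sevilla"]

queries = [
    """
    UPDATE stores
    SET score = score * 1.1
    WHERE province = '%s';
    """ % province
    for province in provinces
]

# Each query touches a disjoint, index-backed slice of the table,
# so the queries can safely run in parallel, e.g. with
# pool.imap_unordered(runQuery, queries)
print(len(queries))
```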

But there is a point where we have to play with the rows-per-job ratio: we want neither millions of rows per query nor thousands of queries.
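One way to keep that balance is to derive N_jobs from a target number of rows per query instead of fixing it by hand. A small sketch, where the target of one million rows per query is just an illustrative choice:

```python
N_random = 20000000               # total rows to insert
target_rows_per_query = 1000000   # illustrative target

# At least one job, and never more jobs than rows
N_jobs = max(1, N_random // target_rows_per_query)
rows_per_job = N_random // N_jobs

print(N_jobs, rows_per_job)  # 20 jobs of 1000000 rows each
```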

The final conclusion is that, although parallelizing queries can yield a great performance improvement, we always need to keep a balance between the number of queries we run and the number of rows per query.

By Vicente "Tito" Lacuesta