At Geoblink our database runs on PostgreSQL. This works great for us, as we get all the speed and flexibility of SQL, making it easy to adapt the backend to the changing requirements of the product.
This flexibility comes at a price, though: it is very easy to end up with a monolith-like set of schemas with cross-dependencies among the different parts. In our case this was especially true for ETL processes: introducing new demographic indicators was a very complicated task where we had to juggle several databases. In fact, making the data ready to be consumed by our application involved a lot of manual steps, making it difficult to automate the promotion and deployment of new data.
A few months ago we started planning our international expansion (UK here we come!) and it became obvious that our structure was not going to scale anymore. Looking into possible solutions we found Postgres Foreign Data Wrappers (FDW) and we instantly fell in love with them. We added them to our infrastructure and our ETL processes now run like a breeze!
How it used to work
On the one hand we had the demographic database. Our Data team would gather a huge amount of data from all kinds of demographic sources, ranging from the Spanish National Institute of Statistics and the equivalent British Office for National Statistics to economic studies by consulting firms. Using various models and heuristics the team would come up with a large demographic database. On the other hand we had the user database, containing the user information from production. In an ideal world these two databases would have been kept apart. However, that is not possible for us from a business perspective, as we offer several features that require combining data from both.
Our ETL worked as follows: we replicated the user database from production onto a server where we ran the computations. There we mixed user data with the demographic data in a process that could take several hours. The next step was to re-download all the user data that had changed in production in the meantime, before pushing the new data into production.
We wanted to improve this process, and we knew that including a new country would require a combined deployment of all these databases, which could not possibly scale.
Postgres Foreign Data Wrappers are a really cool feature that allows us to “connect” two databases and query data from one inside the other as if they were a single database.
In its simplest form it works as follows:
CREATE SCHEMA store_data;
IMPORT FOREIGN SCHEMA store_data FROM SERVER remote_server INTO store_data;
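The snippet above assumes the foreign server and user mapping already exist. A minimal sketch of that one-time setup, with placeholder hostname, database name, and credentials, would look like this:

```sql
-- One-time setup before IMPORT FOREIGN SCHEMA can be used.
-- All connection details below are illustrative placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'user-db.internal', port '5432', dbname 'user_data');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_server
    OPTIONS (user 'readonly_user', password 'secret');
```

After this, `IMPORT FOREIGN SCHEMA` creates local foreign tables pointing at the remote ones, and they can be queried like any other table.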
Now we can join tables with store data as if the table lived in our server:
SELECT client_stores.id, SUM(buildings.people) AS population
FROM buildings, store_data.client_stores
WHERE ST_Distance(client_stores.geometry, buildings.geometry, true) < 50
GROUP BY client_stores.id;
This query returns the population living within 50 meters of each client store.
FDW have been part of Postgres since 9.1, but it was the 9.6 release that made them really powerful by allowing joins to be pushed down to the remote server.
FDW in action
In our case the FDW structure was very natural: we have a central server where we store user data, and then a server for each country that reads from the user schema through a foreign data wrapper.
Whenever we need demographic (“static”) data we query the corresponding country database. If we only need user (“changing”) data, we query the User database. In those cases where we need to join user and demographic data (like in the example with the population around the stores), we query the country database and the FDW sorts out the join with the User data.
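To make the two access patterns concrete, here is a sketch of what queries against a country server might look like. Table and schema names (`buildings`, `user_data.client_stores`, the `region` column) are hypothetical, and the second query uses PostGIS's `ST_DWithin` for the proximity test:

```sql
-- Demographic-only query: runs entirely on the country server.
SELECT SUM(people)
FROM buildings
WHERE region = 'London';

-- Mixed query: user rows are pulled through the FDW transparently,
-- as if user_data.client_stores were a local table.
SELECT s.id, SUM(b.people) AS population
FROM buildings b, user_data.client_stores s
WHERE ST_DWithin(b.geometry::geography, s.geometry::geography, 50)
GROUP BY s.id;
```

From the application's point of view both queries go to the same server; only the second one touches the User database under the hood.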
This model is very flexible and has helped us streamline our ETL processes where we cook and push the new demographic data into production.
This is the process we follow (graphics powered by Mermaid):
We have this setup in production: for each country database holding demographic data there is an FDW so it can read from the user tables (UK and ESP represent the demographic/static data for those countries).
When the ETL transform stage begins, the server where we cook the data (we call it the computation server) connects to the User database to get the latest information from production and performs all the required calculations, while the current instance of the database with the “old” data keeps serving production.
Once all the computations are finished we can replicate the new database in production. Since it contains all the data for the new country, we can simply move the binary files and restore them, instead of performing a full dump and restore that would recompute the indexes, which is always a costly operation. At that point we have two databases for the same country on the server, so we can perform an instant switch so that users start accessing the new data, and then discard the old database.
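One way to sketch such an instant switch is with a pair of database renames. The names below are illustrative, and note that `ALTER DATABASE ... RENAME` requires that no sessions are connected to the databases being renamed:

```sql
-- Hypothetical database names: "uk" is live, "uk_new" holds the
-- freshly restored data. No sessions may be connected to either.
ALTER DATABASE uk RENAME TO uk_old;
ALTER DATABASE uk_new RENAME TO uk;

-- Once we are confident the new data is good:
DROP DATABASE uk_old;
```

Keeping the old database around until the new one is validated gives us a cheap rollback path: swapping the renames back restores the previous data.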
To wrap up, Postgres FDW has allowed us to simplify the dependencies between the different databases and streamline our generation and promotion workflows. We definitely recommend evaluating it if you have different databases that you want to keep separate because they represent different models, but your ETL processes or functionality require part of the data to be joined.