How to export a PostgreSQL table as CSV with Node.js Streams

Hi all! Today, in the “How to make a BI user happy” chapter, we will make it possible to download ‘big data’ without endangering our own server.

To do this we are going to use our favorite database, PostgreSQL, and Node.js with the help of two packages, pg and pg-copy-streams.

The main problem with downloading or processing all this data is its size. To avoid loading the whole table into memory at once (it could be pretty big), we use Node.js streams.

Node.js streams have tons of benefits: they have a low memory footprint, they are consumed and processed in buffered chunks, and they do not block the event loop, among other things.

Now let’s start with the interesting part.

First of all, imagine that we have a table with millions of rows called super_big_table, and our user wants to filter and download it. The fastest way to get data out of PostgreSQL is the COPY statement, but it has one problem: COPY statements do not accept query parameters. One solution is to create a temporary table and insert the desired data into it.

CREATE TEMPORARY TABLE temp_csv_table AS
  SELECT
    t.id, t.value
  FROM
    super_big_table t
  WHERE
    ${customFilters}
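
Since ${customFilters} ends up interpolated into the SQL, it is worth keeping the user-supplied values as query parameters wherever possible. Here is a minimal sketch of this step from Node.js (not from the original post): the column types and the filter on t.value are made-up assumptions, and the empty table is created first so that the filtering INSERT … SELECT can use bind parameters.

const pg = require('pg')

async function buildTempTable (client, minValue) {
  // Hypothetical column types; adapt them to super_big_table
  await client.query('CREATE TEMPORARY TABLE temp_csv_table (id bigint, value numeric)')
  // User input goes in as a bind parameter ($1) instead of being concatenated into the SQL
  await client.query(
    `INSERT INTO temp_csv_table (id, value)
     SELECT t.id, t.value
     FROM super_big_table t
     WHERE t.value >= $1`,
    [minValue]
  )
}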

Then we just need to execute the COPY query through pg-copy-streams. The code will look something like this:

const copyTo = require('pg-copy-streams').to
const pg = require('pg')
const client = new pg.Client()

await client.connect()
const q = `COPY temp_csv_table TO STDOUT WITH CSV DELIMITER ';'`
const dataStream = client.query(copyTo(q))

dataStream.on('error', async function (err) {
  // Here we can control stream errors (log them, abort the download, ...)
  await client.end()
})
dataStream.on('end', async function () {
  // The whole table has been streamed, so we can release the connection
  await client.end()
})

Now imagine that you need to include some headers in this stream. We can use a Transform stream: with it we could modify all the data or, as in this case, just add one row at the beginning.

We can create our own transformer by extending the Transform class.

const {Transform} = require('stream')

class PrefixedStream extends Transform {
  constructor (prefixRow) {
    super()
    this.prefixRow = prefixRow
    this.isFirstChunk = true
  }

  _transform (chunk, encoding, callback) {
    // Emit the header row once, right before the very first data chunk
    if (this.isFirstChunk) {
      this.isFirstChunk = false
      this.push(this.prefixRow)
      this.push('\n')
    }
    this.push(chunk)
    callback()
  }
}

Then we just pipe the two streams together.

const csvHeaders = ['Big table id', 'My value']
// Join the headers with the same delimiter used in the COPY statement
const csvWithHeadersStream = new PrefixedStream(csvHeaders.join(';'))
dataStream.pipe(csvWithHeadersStream)

Finally, we just need to pipe this stream one more time, this time into the response.

csvWithHeadersStream.pipe(res)
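
To show where res comes from, here is a rough sketch of a complete download endpoint. It assumes Express (the post does not show the web framework), reuses the PrefixedStream class from above, and relies on the hypothetical buildTempTable helper and filter parameter sketched earlier; the route and header values are illustrative.

const express = require('express')
const pg = require('pg')
const copyTo = require('pg-copy-streams').to

const app = express()

app.get('/super-big-table.csv', async (req, res) => {
  const client = new pg.Client()
  await client.connect()
  await buildTempTable(client, req.query.minValue) // hypothetical helper from above

  // Tell the browser it is receiving a downloadable CSV file
  res.setHeader('Content-Type', 'text/csv')
  res.setHeader('Content-Disposition', 'attachment; filename="super_big_table_filtered.csv"')

  const dataStream = client.query(copyTo(`COPY temp_csv_table TO STDOUT WITH CSV DELIMITER ';'`))
  const csvWithHeadersStream = new PrefixedStream(['Big table id', 'My value'].join(';'))

  dataStream.on('error', async () => {
    res.destroy() // abort the download if COPY fails mid-stream
    await client.end()
  })
  dataStream.on('end', async () => {
    await client.end()
  })

  dataStream.pipe(csvWithHeadersStream).pipe(res)
})

app.listen(3000)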

Finally, to take advantage of all this, we need to let the user actually download the file from the browser.

That only takes a little bit of code on the client side, with the help of file-saver.

const {saveAs} = require('file-saver')

const blob = new Blob([response.data], {type: 'application/octet-stream'})
saveAs(blob, 'super_big_table_filtered.csv')
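
The response object above comes from whichever HTTP client you use. With axios, for instance (an assumption, the post does not name the client), the request could look like this:

const axios = require('axios')

// Hypothetical request against the download endpoint;
// responseType 'blob' keeps the CSV as binary data in the browser
const response = await axios.get('/super-big-table.csv', {responseType: 'blob'})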

So, as we just saw, it is super easy to download custom data as CSV and make BI customers happy!


Now you aren’t afraid of streams anymore. Enjoy their power!

Jose Luis Pillado “Fofi” – Lead Software Engineer

2 days of fun and data at Big Data Spain 2017

On Thursday and Friday last week a few geoblinkers from the Tech team were fortunate enough to attend Big Data Spain in Madrid, “one of the three largest conferences in Europe about Big Data”.

The line-up of speakers this year was amazing and they certainly didn’t disappoint. Moreover, our VP of Technology Miguel Ángel Fajardo and our Lead Data Scientist Daniel Domínguez had the chance to participate as speakers with a thought-provoking talk titled “Relational is the new Big Data”, where they argued that relational databases can solve many of today’s use cases regardless of the size of your dataset, with lots of benefits compared to NoSQL options.

Relational is the new Big Data

Read more

LATERAL, a hidden gem in PostgreSQL

LATERAL is a very useful tool that every SQL user should have in their toolbelt. However, it is normally not covered in introductory courses, so many people miss it.

Suppose that we want to find the perfect match between two tables. For example, we could have a table with all students and a table with all schools, and we would like to find the school that is closest to each student. Or we could have a table of users with some preferences and another table with products, and we want to find the product that is most similar to what each user wants. In cases like these, LATERAL is very often the right tool for the job.

Let’s try to solve the problem with some mock data.

Read more

Automating data pipelines with Jenkins

One of the cool things about being a Data Scientist at Geoblink is that we get to work on all stages of the data science workflow and touch a very diverse stack of technologies. As part of our daily tasks we gather data from a range of sources, clean it and load it into our database; run and validate machine learning models; and work closely with our DevOps/Infrastructure team to maintain our databases.

As happens in other start-ups, as we grow rapidly it becomes more and more important to automate routine (and, frankly, boring) tasks that take precious development time away from our core developers, and from us data scientists too.

While automation tools have long been used in software development teams, the increasing complexity of data science cycles has made clear the need for workflow management tools that automate these processes. No surprise then that both Spotify and AirBnB have built (and even better, open-sourced!) internal tools with that aim: Luigi and Airflow.

As part of our effort to iterate faster and promptly deliver on our clients’ requests, over the last couple of weeks I’ve spent some time working with the great automation tool we use, Jenkins, and in this post I’d like to give you a taste of how we use it in Geoblink’s Data team.

Read more

PostgreSQL: Foreign keys with condition ON UPDATE CASCADE

Foreign keys are a key feature of relational databases, ensuring the integrity and coherence of data. They allow actions ON CASCADE, which means that changes to the primary key/unique constraint they reference are also applied to the referencing rows. This has many advantages as the complexity of the database grows.

However, there are cases where using ON CASCADE is risky, because you can lose track of what is actually being changed (especially when deleting). So it is generally a good practice for updates, but one must be careful with deletes. Maybe we created a demo account and we don’t want to allow non-expert users to delete it, because all of its related data would be lost.

In this post we are going to compare different alternatives to the ON CASCADE constraint and their performance.

Read more

Working with Graph Databases at Geoblink

Nowadays a lot of companies choose graph databases to store large amounts of information, but what kind of information?

Graph databases are great at storing relationships, and they are very fast at calculating how the different elements inside them are related. Good examples are social networks or a family tree: in these cases we have people as “nodes” and the ways those people are related as “relationships”. So storing this in a graph database is easy, right?

When we work with tons of information, the first step is to decide which graph database to use. There are many, but one of the most popular is Neo4j. With Neo4j we are able to build a big data system, because we can spread all our information, together with its relationship structure, across a cluster.

The skeleton of a graph database consists of nodes and relationships, so the most important thing is to be very clear about how the information has to be stored. We can have many types of nodes, and the same goes for relationships; these types of nodes and relationships are identified by “labels”.

Read more