LATERAL, a hidden gem in PostgreSQL

LATERAL is a very useful tool that every SQL user should have in their toolbelt. However, it is rarely covered in introductory courses, so many users never come across it.

Suppose we want to find the best match between rows of two tables. For example, given a table of students and a table of schools, we might want to find the school closest to each student. Or, given a table of users with preferences and a table of products, we might want to find the product most similar to what each user wants. In cases like these, LATERAL is very often the right tool for the job.

Let’s try to solve the problem with some mock data.

Read more

JS meets SQL. Say hi to AlaSQL!

At Geoblink, we often find ourselves moving a lot of data from our database to the frontend, to display in useful charts and graphs for our users. Restructuring data from the backend is usually necessary but not especially challenging, since the queries to the database are crafted to return the data in the shape we need for each purpose. For these cases, Lodash is more than enough to filter the data and fit it to the needs of the front end. But what happens when we do not query the data ourselves? When using third-party APIs, the data may not be structured the way we want it. We need a way to organize this data quickly and easily.

Enter AlaSQL. AlaSQL gives us the power of SQL queries in JavaScript, backed by a fast, in-memory SQL database. With this library we can store records in a relational table format and query them using SQL syntax. This is extremely useful for counting or grouping a large number of records in a flash, without having to overthink how to manipulate the JSON objects to achieve the required structure.

Read more

Automating data pipelines with Jenkins

One of the cool things about being a Data Scientist at Geoblink is that we get to work on all stages of the data science workflow and touch a very diverse stack of technologies. As part of our daily tasks we gather data from a range of sources, clean it and load it into our database; run and validate machine learning models; and work closely with our DevOps/Infrastructure team to maintain our databases.

As in other fast-growing start-ups, it becomes more and more important for us to automate routine (and, frankly, boring) tasks, which take precious development time away from our core developers, but also from us data scientists.

While automation tools have long been used in software development teams, the increasing complexity of data science cycles has made clear the need for workflow management tools that automate these processes. No surprise, then, that both Spotify and Airbnb have built (and, even better, open-sourced!) internal tools with that aim: Luigi and Airflow.

As part of our effort to iterate faster and deliver on client requests promptly, over the last couple of weeks I’ve spent some time working with Jenkins, the great automation tool we use, and in this post I’d like to give you a taste of how we use it in Geoblink’s Data team.

Read more

Using Deep Learning to heal people suffering from cancer

DL is cool

Sometimes we happily use Deep Learning for frivolous things like generating faces or turning horses into zebras. But most of the time it is a powerful tool that can help save lives.

At the INSA of Rouen, I worked in a team of students implementing a solution based on an article published by researchers, some of whom were my teachers. The article is called IODA: An input/output deep architecture for image labeling and was written by Julien Lerouge, Romain Herault, Clément Chatelain, Fabrice Jardin and Romain Modzelewski. Image labeling is the act of identifying zones in an image and saying: “this zone corresponds to the sky” or “this zone corresponds to a pedestrian”. What is fantastic about their work is that it also performs image segmentation: it detects where the boundaries of the zones are.

Example of image segmentation


Read more

Parallelizing queries in PostgreSQL with Python

At Geoblink we run more than 20,000 queries to generate just one of our several ~100 GB PostgreSQL databases from scratch from our raw data files. Running them sequentially would take far too long, so to reduce generation time we run many queries in parallel. Doing that by hand would be impossible, so we use a nice script written in Python to generate and run the queries.

In this post I will show an example of how to do it in Python. Read more
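As a minimal sketch of the pattern (not the post's actual script), independent SQL statements can be fanned out to a pool of workers with the standard library. Here `run_query` is a stand-in for a real database call, such as executing the statement over a psycopg2 connection, and the statements themselves are invented placeholders.

```python
# Sketch: fan a list of independent SQL statements out to a worker pool.
# run_query is a stand-in for a real PostgreSQL call (e.g. psycopg2 with
# one connection per worker); here it just records what it would execute.
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # In the real script this would open a connection and execute `sql`.
    return f"done: {sql}"

# Invented placeholder statements standing in for the generated queries.
queries = [f"INSERT INTO table_{i} SELECT ..." for i in range(100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_query, queries))

print(len(results))  # 100
```

For CPU-light work like waiting on the database server, threads are enough; the pool size caps how many queries hit PostgreSQL at once.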

PostgreSQL: Foreign keys with condition ON UPDATE CASCADE

Foreign keys are a core feature of relational databases, ensuring the integrity and coherence of data. They support ON CASCADE actions, meaning that changes to the primary key or unique constraint they reference are also applied to the referencing rows. This has many advantages as the complexity of the database grows.

However, there are cases where using ON CASCADE is risky, because you can lose track of what is actually being changed (especially when deleting). It is generally good practice for updates, but it must be used with care for deletes. For example, we might have created a demo account and not want non-expert users to be able to delete it, since all of its related data would be lost with it.

In this post we are going to compare different alternatives to the ON CASCADE constraint and their performance.

Read more

Working with Graph Databases at Geoblink

Nowadays a lot of companies choose graph databases to store large amounts of information, but what kind of information?

Graph databases are great at storing relationships, and they are very fast at computing how the different elements they contain are related. Good examples are social networks or family structures: people are the “nodes”, and the ways they are related are the “relationships”. So storing this in a graph database is easy, right?

When we work with tons of information, the first step is to decide which graph database to use. There are many, but one of the most popular is Neo4j. With Neo4j we can build a big data system, since we can build clusters holding all our information together with its relationship structure.

The skeleton of a graph database consists of nodes and relationships, so the most important thing is to be very clear about how the information should be stored. We can store many types of nodes, and the same goes for relationships; these types of nodes and relationships are called “labels”.
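As a toy illustration of the model just described (Neo4j itself stores this natively and is queried with Cypher), labeled nodes and typed relationships can be sketched in plain Python; all names here are invented.

```python
# Toy in-memory illustration of nodes, relationships and labels.
# In Neo4j the equivalent lookup would be a Cypher query, e.g.:
#   MATCH (a:Person {name: 'Alice'})-[:PARENT_OF]->(b:Person) RETURN b.name
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
}
relationships = [
    (1, "PARENT_OF", 2),  # Alice is Bob's parent
    (1, "FRIEND_OF", 3),  # Alice is Carol's friend
]

def related(node_id, rel_type):
    """Names of nodes reached from node_id via relationships of rel_type."""
    return [nodes[dst]["name"]
            for src, rel, dst in relationships
            if src == node_id and rel == rel_type]

print(related(1, "PARENT_OF"))  # ['Bob']
```

A real graph database indexes these relationships so that traversals stay fast even with millions of nodes, which is exactly what a linear scan like this toy cannot do.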

Read more

NoSQL Tech Talk: distributed databases for a distributed world

We recently gave a Tech Talk at Google Campus Madrid about NoSQL databases, covering different aspects at a general level, such as their history, the problems they solve and the different types they can be grouped into.

More specifically, we talked about:

  • The very origin of modern databases, the IMS system by IBM.
  • The first relational databases and how they shaped the Web 1.0.
  • The growing need for distributed databases that were fast and consistent in the Web 2.0.
  • The CAP theorem.
  • Categories of NoSQL databases:
    • Key-value stores (Redis, Riak)
    • Column stores (HBase, Cassandra)
    • Document stores (MongoDB, CouchDB)
    • Graph databases (Neo4j, FlockDB)
    • Search engines (not strictly NoSQL, but closely related, like Solr and Elasticsearch)

You can find the video of the talk (in Spanish) here and the slides (in English) here.

Using Machine Learning at Geoblink to improve the quality of our database

One of the major problems we deal with at Geoblink is spotting duplicated information (a point of interest or any other location) on our map. It turns out we have something in common with Google Maps…

Just like Google Maps, we want to keep our map up to date with the latest information. However, due to the variety of data sources we use, we sometimes find that the same physical point is represented on our map by two or more markers. One of the reasons is how creative people often get when it comes to formatting an address: addresses coming from different data sources may be formatted very differently, so it sometimes becomes extremely difficult to tell whether two different addresses represent the same physical point or not.
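The post doesn't disclose the actual model, but a simple fuzzy-string baseline (stdlib difflib, with an invented address pair) shows why exact comparison fails on differently formatted addresses and why a similarity signal is a reasonable first feature:

```python
# Toy baseline for the address-matching problem: normalize two formatted
# addresses and compare them with a fuzzy ratio. The real pipeline uses
# machine learning; this only illustrates why naive equality misses
# duplicates while a similarity score can flag them.
from difflib import SequenceMatcher

def normalize(address):
    # Lowercase, drop punctuation, collapse whitespace.
    return " ".join(address.lower().replace(",", " ").replace(".", " ").split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

a = "Calle Gran Vía, 28, Madrid"
b = "C. Gran Via 28, Madrid"
print(a == b)                   # False: exact comparison misses the duplicate
print(similarity(a, b) > 0.7)   # True: fuzzy match flags a likely duplicate
```

A threshold like 0.7 is illustrative; in practice such a score would be one feature among several (distance between coordinates, name similarity, …) fed to a classifier.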

Read more

Geoblink named one of the top 50 Startups worldwide 2017

We were recently included in a list put together by Bloomberg of the 50 most promising startups you’ve never heard of for this year, under the category Artificial Intelligence/Geospatial business intelligence.

https://www.bloomberg.com/graphics/2017-fifty-best-startups/

The list also features startups from other sectors that are getting a lot of attention these days, like augmented reality, drones or autonomous driving.

We are very proud to have been included in this list by such a prestigious publication! It strengthens our drive to create one of the best geospatial products in the industry.

 
