How to export a PostgreSQL table as CSV with Node.js streams

Hi all! Today, in the "How to make a BI user happy" chapter, we will make it possible to download 'big data' without endangering our own server.

To do this we are going to use our favorite database, PostgreSQL, and Node.js, with the help of two packages: pg and pg-copy-streams.

The main problem with downloading or processing all this data is its size. To avoid loading the whole table into memory at once (it could be pretty big), we use Node.js streams.

Node.js streams have plenty of benefits: among others, they have a low memory footprint, they are consumed and processed in buffered chunks, and they do not block the event loop.

Now let’s start with the interesting part.

First of all, imagine that we have a table with millions of rows, called super_big_table, that our user will want to filter and download. The best way to get fast output from PostgreSQL is the COPY statement, but it has one problem: COPY statements do not allow parameters. One solution is to create a temporary table containing only the desired data:

CREATE TEMPORARY TABLE temp_csv_table AS
  SELECT
    t.id, t.value
  FROM
    super_big_table t
  WHERE
    ${customFilters}
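A hedged sketch of how we could create that table from Node.js with pg (the helper name is ours, and customFilters is assumed to be a SQL fragment built from a whitelist of allowed conditions, never raw user input):

// Minimal sketch, assuming an already-connected pg `client` (see below)
async function createFilteredTempTable (client, customFilters) {
  const createTempTableQuery = `
    CREATE TEMPORARY TABLE temp_csv_table AS
      SELECT t.id, t.value
      FROM super_big_table t
      WHERE ${customFilters}
  `
  await client.query(createTempTableQuery)
}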

Then we just need to execute the COPY query through pg-copy-streams. The code will look something like this:

const copyTo = require('pg-copy-streams').to
const pg = require('pg')
const client = new pg.Client()

await client.connect()
const q = `COPY temp_csv_table TO STDOUT WITH CSV DELIMITER ';'`
const dataStream = client.query(copyTo(q))

dataStream.on('error', async function (err) {
  // Here we can control stream errors
  await client.end()
})
dataStream.on('end', async function () {
  await client.end()
})

Now imagine that you need to add a header row to this stream. We can use a Transform stream: with it we could modify all the data or, as in this case, just add a row at the beginning.

We can create our own transform stream by extending Transform with our own class.

const {Transform} = require('stream')

class PrefixedStream extends Transform {
  constructor (prefixRow) {
    super()
    this.prefixRow = prefixRow
    this.isFirstChunk = true
  }

  _transform (chunk, encoding, callback) {
    if (this.isFirstChunk) {
      this.isFirstChunk = false
      this.push(this.prefixRow)
      this.push('\n')
    }
    this.push(chunk)
    callback()
  }
}

Then we just pipe both streams together:

const csvHeaders = ['Big table id', 'My value']
const csvWithHeadersStream = new PrefixedStream(csvHeaders.join(';'))
dataStream.pipe(csvWithHeadersStream)

Finally, we just need to pipe this stream once more, this time into the HTTP response.

csvWithHeadersStream.pipe(res)
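If you want the whole chain to fail cleanly when any piece errors, a hedged alternative (assuming an Express-style res object and Node.js 10+ for stream.pipeline) is to wire everything together with pipeline and set the download headers explicitly:

const { pipeline } = require('stream')

// `res` is assumed to be an Express/HTTP response object
res.setHeader('Content-Type', 'text/csv')
res.setHeader('Content-Disposition', 'attachment; filename="super_big_table_filtered.csv"')

pipeline(dataStream, csvWithHeadersStream, res, (err) => {
  if (err) console.error('CSV export failed', err)
  // client.end() is already taken care of by the 'error'/'end' listeners above
})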

To round things off, we need to make it possible for the user to download the file.

We only need a little bit of code on the client side, thanks to file-saver:

const {saveAs} = require('file-saver')

const blob = new Blob([response.data], {type: 'application/octet-stream'})
saveAs(blob, 'super_big_table_filtered.csv')

As we just saw, it is super easy to download custom data as CSV and make BI customers happy!

 

Now that you aren't afraid of streams anymore, enjoy their power!

Jose Luis Pillado “Fofi” – Lead Software Engineer

Introduction to PostGIS

In this post I'm going to introduce what PostGIS is, how you can use it in PostgreSQL, and how to represent spatial data on a map using QGIS.
In future posts I will explain how to manipulate spatial data to be able to do different kinds of geometric analyses.

PostGIS

PostGIS is an extension for PostgreSQL. It adds support for geographic objects like points, linestrings, polygons and more.

It also includes a large number of spatial functions, such as getting the area of a polygon, calculating the length of a line or computing the distance between two objects. In addition, it has different operators to combine geometries, so we can get the union of two or more geometries, compute their intersection or even create a buffer around an object.
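As a small taste, here is a hedged example of a couple of those functions (the regions table, its geom column and the SRID 4326 point are made-up names for illustration):

SELECT
  name,
  ST_Area(geom) AS area,
  ST_Distance(geom, ST_SetSRID(ST_MakePoint(-3.7, 40.4), 4326)) AS distance_to_point
FROM regions
ORDER BY area DESC;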

Assuming that you already have PostgreSQL installed on your system, installing PostGIS on Ubuntu is as easy as:
sudo apt-get install postgresql-10-postgis-2.5

In my case I'm installing PostGIS 2.5 on PostgreSQL 10. The next step is to include it in our database as a new extension. Supposing that we have a database called postgres, we need to connect to it and run the query:
psql (10.4 (Ubuntu 10.4-2.pgdg16.04+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.
postgres=# CREATE EXTENSION postgis;
CREATE EXTENSION

This will install the extension in the first schema of your search_path.
At this point we have a PostgreSQL database with PostGIS ready to be used!

Shapefiles

A shapefile is a geospatial vector data format for geographic information system (GIS) software. It is commonly used in the GIS world to share geospatial data, but we will transform it and insert it into our own PostGIS database.

I'm going to use as an example a Game of Thrones shapefile with the different regions controlled by the corresponding Houses. I obtained the shapefiles from a forum post based on theMountainGoats map, both of them under a CC BY-NC-SA 3.0 license.

In my case I have the following files (some files can vary):

political.dbf
political.prj
political.qpj
political.shp
political.shx

Although we say 'shapefile', it is actually composed of several different files, three of them mandatory and the rest optional.
The mandatory files are:

  • .shp: the shapes themselves. It holds all the geospatial information.
  • .dbf: columnar attributes for each shape, like the names of the Houses owning each region.
  • .shx: positional index of the features.

And in my case I have 2 more that are optional:

  • .prj: projection format, it contains the coordinate system and projection information.
  • .qpj: another projection format file, used by QGIS, which can store more information.

The first step is to convert our shapefile into something that we can insert into our database.
To do that we will use the shp2pgsql command, which transforms a shapefile into a SQL script that can be run against our database.

shp2pgsql political > political.sql
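shp2pgsql also accepts some useful flags; for instance, -s sets the SRID and -I creates a GiST spatial index on the geometry column. A hedged variant (the SRID below is just a placeholder, use whatever your .prj file declares):

shp2pgsql -s 4326 -I political public.political > political.sql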

Now we run it to insert the data in our database:
psql -h localhost -U geoblink -d postgres -p 5432 -f political.sql

Now we have a table in the database with the contents of the shapefile. In the following picture I’m showing the contents of the table.

The last column is the polygon itself in a format that PostGIS understands.
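You can also peek at it from psql; a hedged example (shp2pgsql adds a gid primary key and names the geometry column geom by default, and claimed_by is one of the attribute columns in this particular shapefile):

SELECT gid, claimed_by, ST_AsText(geom) AS wkt
FROM political
LIMIT 1;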

 

Visualizing in QGIS

QGIS is a free and open source GIS application that supports viewing, editing and analysis of geospatial data.
We could have displayed the shapefile directly without inserting it into the database, but since we will do some geospatial analysis in the following post, I decided to load it into our database. We could also upload the shapefile to a table using different options within QGIS, but I preferred to do it in a way that we can automate and include in a pipeline.

Once inside QGIS we need to add our database as a source. The easiest way is to go to the Browser Panel, right-click on PostGIS and select New Connection…

A new window will open and we have to write our connection information there.

After that, we can navigate through our database in the Browser Panel and select the desired table. Double-click it and it will automatically appear on the map.

Now we can try to improve the visualization by changing the colors of the polygons.
To change the colors we have to follow these steps:

  • Right-click on our layer in the Layers Panel.
  • Click on Properties.
  • Go to the Style tab.
  • Click on Single symbol and change it to Categorized.
  • Choose the column we want to use to categorize, in our case claimed_by.
  • Click on Classify; it will list the features and the corresponding colors, which we can change if we want.
  • Finally, click OK.

At this point, each region is painted with a different color according to the House governing it, but we can't tell which House that is. One way would be to look at the legend; another is to add a text label to each region:

  • Right-click on our layer in the Layers Panel.
  • Click on Properties.
  • Go to the Labels tab.
  • Click on No labels and select Show labels for this layer.
  • In Label with, select the column to show (claimed_by in our case) and then click OK.

And that's it! We have a map with a different color for each region and the name of the owner on it. We can see that right now the Starks are the ones with the most land in all of Westeros!

Of course, we can keep polishing the map by adding different fill or line styles, shadows or backgrounds for the labels, etc.

In this post we learnt how to convert a shapefile into a PostGIS table and then show it in QGIS.

In the following post we will start to do a deeper analysis and get quantitative information.

 

Vicente ‘Tito’ Lacuesta – Senior Data Scientist

Aggregating frequent locations with MGRS and DBSCAN

In computer science in general, and in data science in particular, dealing with continuous values is sometimes tricky. There are plenty of ways to understand a given dataset and many more to extract value from it. In this entry, however, we will focus on one application: using discretization and clustering with spatial data. More precisely, we will use the DBSCAN algorithm and the MGRS representation with a small GPS dataset.

But wait! DBSCAN and what else?

Well, let me introduce some concepts first (otherwise, skip ahead to the next header). Discretization and clustering are well-known procedures exported from the data mining field to many others. A formal definition of discretization might be "the reduction of the number of values of a continuous variable by grouping them into intervals"; a further explanation may be found at [Web]. Regarding clustering, a formal definition might be "the detection of groups of records that share similar properties"; more details can be read at [Web]. Their purpose is to reduce the data into smaller subsets that contain more relevant information.

MGRS is a coordinate system based on UTM and used by the NATO militaries. Figure 1.2 shows the grid, with the code under the cursor in the bottom-right corner. As a grid, it divides the map into cells whose size depends on the chosen resolution; for instance, using 4 digits per axis we obtain a precision of 10 meters. A code example can be seen in the next figure:

Figure 1.2: Visualization of MGRS in the area of Spain.

 

The main reason for using MGRS is that it allows us to discretize data by defining a specific level of resolution. There is some loss in the process, but the noise and redundant information that get removed make it very advantageous to use.
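As a hedged illustration of that discretization (the open-source mgrs Python package used here is just one option, not necessarily what we run in production):

import mgrs
from collections import defaultdict

converter = mgrs.MGRS()

# Toy GPS events as (latitude, longitude) pairs; the values are made up.
events = [(40.4168, -3.7038), (40.41681, -3.70382), (40.4300, -3.6900)]

# MGRSPrecision=4 gives 10 m cells, 3 gives 100 m cells.
cells = defaultdict(list)
for lat, lon in events:
    code = converter.toMGRS(lat, lon, MGRSPrecision=4)
    cells[code].append((lat, lon))

# Each cell can now be summarized by a single point, e.g. its centroid.
for code, points in cells.items():
    lat_c = sum(p[0] for p in points) / len(points)
    lon_c = sum(p[1] for p in points) / len(points)
    print(code, len(points), (lat_c, lon_c))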

DBSCAN was presented back in 1996 at the University of Munich, Germany [Web]. It is a density-based clustering algorithm whose resulting clusters depend on the minimum number of points and the distance between them. In contrast with other clustering algorithms, DBSCAN only takes three parameters as input: the minimum number of points, the distance, and the distance function. Figure 1.3 shows the different outputs produced by some clustering algorithms compared to DBSCAN.

Its authors stated that the algorithm had an average run time of O(n log n), much better than the O(n²) of the spatial clustering algorithms of the time. However, this complexity can only be achieved under special conditions by using R-trees, which are hierarchical data structures used to organize geometric objects into sets of d-dimensional rectangles. As a result, there has been some discussion about its performance in later articles, where the lower bound was proved to be O(n^(4/3)).

The details of these processes are more involved, and the overall result can be expressed in terms of entropy. If you are curious about entropy, a deeper look into entropy-based discretization or the always exciting field of information theory might be worth considering [Web].

From this point on, we will take the benefits of these techniques for granted and focus on their impact.

Figure 1.3: Comparison of DBSCAN with K-Means family. Extracted from

 

Ok, let’s come back to our main objective…

As we said before, we are dealing with geolocated data, more precisely with GPS data. In this case, we have access to a small dataset of events containing position and time. (Please note that the data provided in this example was manually generated and any possible similarity is mere coincidence.)

The raw GPS data contains the user, the location and the time at which each event was recorded. Analyzing the raw data we might find out that it came from a young geoblinker who lives in Torrejón. A QGIS screenshot can be seen in figure 1.4.

Looking at the image we can observe two highlighted events that are very close to each other, as are some other events. So why don't we apply MGRS to reduce the number of very close points?

In order to appreciate the difference in precision, we will visualize the data at 10-meter and 100-meter resolutions. An MGRS cell is represented as a box on the map, although we will use its centroid as it is easier to understand. As we can observe in figure 1.5, the 10 m resolution is a balanced trade-off between error and reduction, while the 100 m resolution is too coarse to represent the locations properly without introducing a notable error. This small experiment contains just a few events; nevertheless, the reduction is about 42% at 10-meter resolution and 65% at 100-meter resolution. The reduction depends not only on the quality and quantity of the events but also on other factors such as the nature of the trip.

Figure 1.4: Visualization of raw GPS data. Projection performed with QGIS.

 

Figure 1.5: Different resolutions in the MGRS representation. Raw (left), 10 m (center), 100 m (right).

 

On the other hand, we can use DBSCAN to obtain the stay locations of the events. As we said previously, DBSCAN needs three parameters, which in our case are an epsilon of 20 meters (converted to the units of the distance function), a minimum of 8 points and the haversine distance function.
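A minimal sketch of that configuration with scikit-learn (illustrative only; coords is assumed to be an array of latitude/longitude pairs in degrees):

import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0

# coords: array of shape (n_events, 2) with [latitude, longitude] in degrees
coords = np.array([[40.4168, -3.7038], [40.41681, -3.70382]])

# Haversine works on radians and returns angular distances,
# so the 20 m epsilon has to be expressed as an angle.
db = DBSCAN(eps=20 / EARTH_RADIUS_M, min_samples=8, metric='haversine')
labels = db.fit_predict(np.radians(coords))

# Label -1 marks noise; every other label is a candidate stay location.
print(labels)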

Figure 1.6: Resulting clusters with 8 minimum points and 20 meters. Raw (left), DBSCAN (center), centroids (right).

 

Looking at the result, shown in figure 1.6, two clusters have been obtained. One logical next step might be to aggregate the sequential timestamps to obtain the time spent in each location, along with the rest of the information, in the centroids. Reviewing the new features of the centroids, we could state that the young geoblinker was in two places during the same day. Going into detail, in the morning our friend was in a residential area, which likely means he or she was at home; in the afternoon the geoblinker went to the shopping mall, most likely to the garage to check the car.

To conclude, we have reviewed some concepts and techniques for applying data mining to spatial data. In this case we faced an easy scenario, which was enough to show the strengths and weaknesses of each proposed method. DBSCAN is a powerful algorithm that lets us retrieve stay locations easily once the appropriate parameters are chosen; the downsides are the scarcity of good libraries and a performance penalty, since its run time is not linear like that of a solution based on MGRS.

The takeaway here is the golden rule of data science: there is no holy grail, and the choice of one method or another depends on each case… so we might even use several if required!

Marcos Bernal

Sharing logic between Vue components with mixins

It's not uncommon to have two flows that are quite similar in terms of business logic but different in terms of UI. At Geoblink that happens in several places, one of them being our filter picker.

Filter picker in Territory Manager


Filter picker popup opened in the legend


There are two very different places where we use the filter picker:

  • In the map legend, to allow filtering POIs displayed in the map.
  • In the header of some modules like Territory Manager.

As you can see, when we display the filter picker in the legend, it shows the set of active filters inside the popup. However, when we use the filter picker in a larger container, they are displayed next to the popup toggle button.

Managing filters is not a trivial flow as we support a complex set of actions and filters but even in a simple playground like the one below, the amount of logic common to all filter pickers is noticeable.

https://codepen.io/luzlurrun-geoblink/pen/gjKpXe

If we focus on compact-filters and regular-filters we can notice a lot of repeated stuff: the props inherited are the same and the computed properties are equal, too. Even some methods are repeated.

A possible way to prevent repeating ourselves could be refactoring those components into a single one but the template would get too complicated with extra logic to handle two completely different scenarios. It’s a clear no-no.

Fortunately there's a better solution: mixins. Mixins let us inherit props, computed properties, partial data objects or methods, giving us additional behaviour without repeating code.

In this specific case we could define a new object with all the common stuff of our two filter components:

const commonStuff = {
  props: {
    allFilters: {
      type: Array,
      required: true
    },
    activeFilters: {
      type: Array,
      required: true
    }
  },
  computed: {
    isFilterActive() {
      const isFilterActive = {}
      for (const filter of this.activeFilters) {
        isFilterActive[filter] = true
      }
      return isFilterActive
    },

    availableFilters() {
      return this.allFilters.filter(
        filter => !this.isFilterActive[filter]
      )
    }
  },
  methods: {
    addFilter(filter) {
      this.$emit('add', filter)
    },

    removeFilter(filter) {
      this.$emit('remove', filter)
    }
  }
}

Then we can add that common stuff to our components by passing it in the `mixins` array when declaring each component:

Vue.component('compact-filters', {
  mixins: [commonStuff],
  // ...
})

Note that Vue is smart enough to properly merge complex objects like methods: in our compact-filters component we still have access to the two methods that are specific to that component and not part of the mixin. The same applies to data, computed or any other option of the mixin.
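For instance, a hedged sketch of what the expanded compact-filters declaration might look like (the component-specific members shown here are illustrative, not the real ones):

Vue.component('compact-filters', {
  mixins: [commonStuff],
  computed: {
    // Merged with the computed properties coming from the mixin
    hasActiveFilters () {
      return this.activeFilters.length > 0
    }
  },
  methods: {
    // Component-specific method; addFilter/removeFilter still come from the mixin
    clearAllFilters () {
      for (const filter of this.activeFilters) {
        this.removeFilter(filter)
      }
    }
  },
  template: '<div><!-- compact UI goes here --></div>'
})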

You can check the resulting playground after the refactor below:

https://codepen.io/luzlurrun-geoblink/pen/mjKJqp

Now it’s time to embrace composition over inheritance (or repetition!).

 

Lluís Ulzurrun de Asanza i Sàez – Senior Software Engineer

Evolution of calling Python from Node

A bit of a disclaimer here: I only started learning Node.js with this project, so I still consider myself a very novice Node.js developer. There might be errors or things I'm not aware of.

A bit of background

At Geoblink, we have 3 different tech teams:

  • Infra, which I won’t talk about here
  • Core, the web team working on the front-end and back-end of the app. They control every part of the app itself: the looks, the interactions, the business logic, etc. They use Node.js everywhere and master it completely.
  • Data, doing data things like data-science, data-engineering, data-mining or data-cleaning… This is the team I’m in. A majority of us primarily use Python, even if some prefer Scala or R or Java…

How does data go from the brain of a data scientist to the screen of a Geoblink user? Usually, the data scientist thinks of a solution and writes some Python code for it; that Python code modifies a database, which the Core team then uses to render the data beautifully. So the usual flow is data scientist -> Python -> Postgres database -> Node.js -> end user.

But when I joined, we started needing a Node.js service and a Python process to talk to each other directly. A database in between would have been superfluous, added a lot of complexity and dragged down the speed of the whole service. That is because we needed some tools written by a Python developer to be called from within a service written by a JS developer.

In this post I will cover how the communication between Python and Node.js evolved over time, as we had to implement this communication in different projects.

The ‘brute’ solution: bash script

The first Python tool we connected to Node was a heavy statistical model. It applies a lot of pre-processing and data cleaning before producing its actual output, and rewriting it in Node.js would have cost a lot of time and effort.

The solution we came up with was to run it via the `spawn` function of `child_process` of Node.js.

We rewrote the input/output of the Python tool so that it would take its parameters as JSON and produce JSON on `stdout`. Capture `stdout` from Node.js and you have a working tool! But if you do that, you might run into problems with `PYTHONPATH` or with encoding (working with Spanish and French addresses means caring about accented letters; enforcing UTF-8 encoding can help here). So we changed the call from pointing at the Python file to pointing at a Bash script, where we export a valid `PYTHONPATH` and do a few other operations.

Good thing:
* You control everything

Bad thing:
* You have to control everything

From the Node side:

const spawn = require('child_process').spawn

// Inside an async function: wrap the child process in a Promise
const toolParams = JSON.stringify(params)
const pythonResult = await new Promise((resolve, reject) => {
  const pythonProcess = spawn('bash', [config.tool.path, toolParams], {})
  const stdout = []
  const stderr = []

  pythonProcess.stdout.on('data', data => stdout.push(data.toString()))
  pythonProcess.stderr.on('data', data => stderr.push(data.toString()))

  pythonProcess.on('close', (code) => {
    if (code !== 0) {
      return reject(new Error(stderr.join('')))
    }
    resolve(JSON.parse(stdout.join('')))
  })
})
// Exploit pythonResult

Intermediate Bash script:

#!/usr/bin/env bash
export PYTHONPATH=/some/location/:$PYTHONPATH

cd "$DEDUPER_WORKSPACE" && \
python3 run_tool.py "$@" || exit 1

From the Python side:

import sys
import json

tool_params = json.loads(sys.argv[1])
result = do_something(tool_params)
json_output = json.dumps(result)
sys.stdout.write(json_output)

The ‘let-the-experts-do’ solution: python-shell

Some time later another challenge arrived: our service had to execute a lot of small Python scripts, each one very different from the others, each with its own dependencies, some of them running at the same time, but all returning the same type of data.

This time I decided to fetch a package to do the job for us. In the Python universe most of your problems can be solved with an `import antigravity`, so I went to `npm` and found `python-shell` (https://github.com/extrabacon/python-shell), which seemed to do the job.

It works a bit like our previous solution: it sends the parameters to the Python script via `stdin` and gets the output via `stdout`. But, it can also support binary files, show tracebacks or execute the Python process in a child process. Most importantly, it is well tested and proven to work. With that part taken care of, the only thing left was to agree with the Python developers on a common way of writing the input/output of their scripts.

Good thing:
* Your job is reduced to understanding and using a library
* The library is done by people who probably spent more time
than you thinking about the problem

Bad thing:
* You are dependent on a library, its bugs and security failures included
* You are limited by the possibilities of the library (unless you expand it)

From the Node.js side:

const PythonShell = require('python-shell')

const pythonOptions = {
  mode: 'json',
  scriptPath: '/some/location', // directory containing the script
  pythonPath: '/usr/bin/python3'
}
const pyshell = new PythonShell('small_script.py', pythonOptions)
const pythonData = []

pyshell.on('message', function (message) {
  // In 'json' mode each message is an already-parsed JSON object
  pythonData.push(message)
})

pyshell.end(function (err) {
  if (err) {
    throw err
  }
  // Exploit pythonData
})

From the Python side:

import sys
import json

data = do_something()

print(json.dumps(data, ensure_ascii=False))

 

Our most recent solution: micro-services

Fast-forward a few months and a new project now required us to connect another tool to a Node.js service. This time, the Python tool we had to connect was a heavy application, another statistical model.
It has these characteristics:

  • When launched, it occupies a lot of memory
  • The launching is not instantaneous
  • It is (kind of) slow (takes seconds to produce a result)
  • It requires a lot of dependencies

It seemed like a dream opportunity to build a Python service!

To avoid complexity, we decided to run the service in our local network. We separated the Python code into the loading of resources and the actual use of the tool. Then we prepared it so that it would consume and produce JSON, and finally agreed on an endpoint and a JSON format.

There are tons of resources explaining how to deploy a statistical model as a service. Most of them explain very well how to do it with Flask (http://flask.pocoo.org/), but oftentimes they miss one important point about Flask:

Flask’s built-in server is not suitable for production (from Flask documentation)

You need to put your Flask service behind a WSGI HTTP server to scale correctly. If you go for a standalone solution, you can choose between Gunicorn (the one we chose), uWSGI, Twisted, etc.
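For reference, a minimal example of serving such an app with Gunicorn (assuming the Flask module shown below lives in a file called app.py and exposes the app object):

gunicorn --workers 4 --bind 0.0.0.0:8000 app:app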

Good thing:
* Scalable, fully reusable component
* Separated concerns

Bad thing:
* Writing a fully functional service can take some time
(don’t skip security, logging or deploying steps)
* Demands DevOps work
* HTTP request can be slow

From the Node.js side (classic HTTP request):

const request = require('request')

const options = {
  url,
  json: true,
  body: {variable: value}
}
request.post(options, function (err, response, body) {
  if (err) {
    throw err
  }
  if (response.statusCode === 200) {
    // Exploit body: it already contains the parsed JSON returned by the service
  }
})

From the Python side (no WSGI setup shown):

from flask import Flask, request, jsonify
from myproject.models import my_model

app = Flask(__name__)

@app.route(rule='/predict', methods=['POST'])
def do_prediction():
    input_sent = request.json
    result = my_model.predict(input_sent)
    return jsonify(result), 200

if __name__ == '__main__':
    app.run()

Key takeaways

Do you want to control exactly the context in which Python is called?
Write a Bash script.

Do you want an out-of-the-box solution?
Search for an already implemented library or package.

Do you want a scalable solution, something you can easily reuse in your app?
Take the time to create a micro-service.

I like the similarities between biological evolution and choosing a solution in software engineering. No solution is inherently better than another; rather, the best solution is the one that best fits your needs.
Building a micro-service is cool, but it may not be suited to the problem you want to solve (and it will be more complex and take longer to build).

Thanks for reading.

 

Denis Vivies – Data Science

How to scrape a website

If you came to this page because you feel like extracting content from a webpage for any good (and legal; be very careful with this, as there are several lawsuits currently in the courts) reason, you have landed in the right place: here is our Web Scraper tutorial! At Geoblink, culture is one of our key values… Victor Hugo, Pedro Almodóvar, Marco Asensio… Our geoblinkers like to scrape the list of upcoming cultural events so we don't miss any of them, even during the World Cup. Here we offer you the opportunity to be a Geoblink monkey and scrape some cultural data.

Ok, imagine that you want to know which cultural events are going to happen in Madrid during the next month. You can find every culture and entertainment event sorted by district on the webpage https://www.madridcultura.es/. At Geoblink we like to use the excellent Web Scraper extension to build our lists of cultural events.

Basically, Web Scraper is a Google Chrome extension. You can install it here: https://chrome.google.com/webstore/detail/web-scraper

Ok, let’s dive deep into the Madrid cultural experience:

1) Open Web Scraper


Open the Developer Tools of the Google Chrome browser from the customize menu, or just press Ctrl+Shift+I. Then click on the Web Scraper tab at the top of the panel.

 

2) Create a Sitemap


Click on the Create new sitemap option, fill in a Sitemap name and then set a Start URL… And let's scrape!

3) Add new selector



Web Scraper works as a tree, where you create nodes that contain more nodes with information. There are different types of selectors, each with its own characteristics, and depending on the object you want to extract you have to pick a specific type… which may be a bit tricky at times. The most common types of selectors are Text, Link and Element (the ones we are going to use in this tutorial), but sometimes it's necessary to use Element Click or Element Attribute.

We're going to start this selector tree with a Link selector that matches each district of Madrid. Click on the 'Add new selector' button and complete the following options:



  • Id: the name of the node (for this example, 'district_links').
  • Type: the kind of node (choose 'Link').
  • Selector: click on the Select button and choose the links you want to follow. If there is more than one, you can select them all by clicking on several of them; when you are done, click on Done selecting! 🙂 to save the selection.
  • Multiple: if there is more than one link, you have to tick the Multiple option.



To make sure you're on track, glance at the Data preview and the Element preview. If you can't see what you expect, well, first, that sucks, and then you should scroll back up and fix things 🙂

To save everything, click on the Save selector button.

4) Create an Element selector


Click on the saved selector: you'll see that you can create more nodes inside it. If you go inside one district link, you can see that there is more than one activity on the main page, so for this example we need an Element selector that gathers all the information about the different events.

Creating it works the same way as in the previous example, but now you have to choose Element as the type. To select all the elements on the page, click on the Select button, mark all the boxes and then tick the Multiple option.

5) Define the information of the event


Again, if you click inside the previous selector you can create more. At this point you want to define the different pieces of information about each event, like name, date and place. To do that, create one Text selector for each field. You can also create a Link selector to obtain more information about the event (such as the address and the event web page).

6) Pagination


If you have followed the previous steps, you have built a tree for the main page of every district. But if you look at the top of the page, you can see that there is more than one page per district; that's why pagination is needed.

To handle pagination, go back to the level of the tree where you created the Element selector, create a Link selector there and pick the links to the different pages.



After that you need to edit the event Element selector and modify its Parent Selectors, selecting both the previous node and the pagination selector that you have just created.

7) Let’s Scrape


You have finished building the sitemap, so let's scrape and see what happens. Click on the Scrape button and choose a request interval and a page load delay; the default values work, but sometimes it's necessary to adjust them. Then click on Start Scraping and a pop-up should appear.

It's recommended to click on the Browse button to see how it's going and check that everything pans out well.

8) Export the results


The pop-up window will close when the scrape finishes, and then you can download the data. Go to Export data as CSV and click on Download now! A CSV with all the data about cultural events in Madrid will be downloaded. After that it's recommended to clean the data a little, because some columns with links are added in the process. You can also export the sitemap that you have just created.



Eventually you'll get a nice CSV with almost 600 different events for the next month in Madrid!


Alejandro Cantera – Data Acquisition

Nicolas Planchon – Data Acquisition

Encrypting a directory in Linux

Security is something we developers care about. These days our most valuable information lives in our electronic devices, and if you are reading this you probably use Linux. Then yes, you have found the right post to keep your secrets safe from the bad guys.

Supposing you use Ubuntu I’ll explain how to encrypt the folder where you keep stuff you don’t want anyone to have access to – for example if someone steals your laptop.

For this task we'll use eCryptfs, a stacked cryptographic filesystem for Linux. It can be mounted on a single directory and does not require a separate partition.

The encryption mechanism is based on mounting the folder using eCryptfs. Once the directory has been mounted with the tool, you can manage it as if it were a standard folder. When you finish your work and want to keep the files inaccessible, you unmount the directory; when you want to use the files again, you mount the folder again.

Preparation steps:

Install eCryptfs

sudo apt-get install ecryptfs-utils

Create the required folders and change their permissions

mkdir ~/.private ~/private
chmod 0700 ~/.private ~/private

Initialize the folder mounting

Initialize eCryptfs (the answers 1, 1, n, n, yes, yes to the interactive prompts work for a default setup). Take note of the ecryptfs_sig and remember your passphrase:

mount -t ecryptfs ~/.private ~/private
...
WARNING: Based on the contents of [/root/.ecryptfs/sig-cache.txt],
it looks like you have never mounted with this key 
before. This could mean that you have typed your 
passphrase wrong.

Would you like to proceed with the mount (yes/no)? : yes


Would you like to append sig [2f5efa91218fe4d3] to
[/root/.ecryptfs/sig-cache.txt] 
in order to avoid this warning in the future (yes/no)? : yes
Successfully appended new sig to user sig cache file
Mounted eCryptfs

Once the folder has been mounted you can add your files.

The file /root/.ecryptfsrc, which saves your preferences, will be created automatically. It should look like the image shown below. Check that no passphrase location is stored in the file; if you see one, delete that line:

If you need the ecryptfs_sig again, it is stored in:

/root/.ecryptfs/sig-cache.txt

Unmount

Now when you want to unmount your folder, so that nobody can access it:

sudo umount ~/private

Get your UID

id -u

Append one entry to /etc/fstab (use your UID and SIG obtained in previous steps)

# eCryptfs $HOME/.private mounted to $HOME/private
 /home/foo/.private /home/foo/private ecryptfs rw,noauto,nofail,uid=1000,umask=0077,relatime,ecryptfs_sig=2f5efa91218fe4d3,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs,ecryptfs_passthrough=no,ecryptfs_enable_filename_crypto=no 0 0

Ready to use

Remount

And to remount, so that you can read the data again:

sudo mount -t ecryptfs ~/private

You’ll be asked to insert your passphrase every time you want to mount your folder. I hope you chose a safe one.

If you type the mount passphrase wrong, you need to unmount the folder before you can mount it correctly again:

sudo mount -t ecryptfs ~/private/ (wrong passphrase)
sudo umount ~/private/
sudo mount -t ecryptfs ~/private/ (right passphrase)

And this is it.

In conclusion, if you are like me and have villainous enemies all around the globe, it's worth the 10-minute setup. Your data will be in a safer place and you'll sleep better.

You’re welcome.

Useful links:

http://ecryptfs.org/about.html

https://wiki.archlinux.org/index.php/ECryptfs

By Diego Borchers – DevOps Engineer

From AngularJS to Vue without dying in the attempt

You should know by now that here at Geoblink we work with a humongous amount of data, most of which has to be beautifully displayed across the whole app. That means that, besides needing a good performance on our backend to serve that data, we need a good performance on the browser side to process it and render it to our users in the fastest way possible.

Until now, and still for months to come, we've been relying on AngularJS as our frontend framework of choice. Needless to say, it's been a 3-year relationship and, generally speaking, we're really happy with it. It's a robust framework, incredibly well supported even though there are three newer major versions ahead of it, and easy to work with.

However, with apps growing bigger and it becoming more and more common to do most of the calculations in the browser, we started to worry about how the app would perform if we stuck with AngularJS forever. So we started looking at other options. Considering the size of our application, we didn't even consider migrating to the newest versions of Angular: the breaking changes were so big (it is really a completely different framework) that we would have needed months to make the change. Of course we considered React; with its huge ecosystem and great state-management tools like Redux, it was something to bear in mind. Yet we were reluctant, since we didn't think combining both technologies would really work out. We needed something we could adopt progressively, instead of building a new application from the ground up. And then we 'viewed' VueJS (pun intended).

We had the chance to work with Vue in one of our so-called Geothons, and we fell in love IMMEDIATELY. The performance improvements were astonishing, and development speed increased incredibly fast even though none of us had worked with it before. Plus, the progressive nature of the framework was just what we needed: start developing with it while keeping the main parts of our app in AngularJS. It was a no-brainer.

Download Angular video

Download Vue video

The first test came quickly: we had to completely redo a whole section of the app, and we decided to start implementing Vue within Geoblink there. But enough with the mumbo jumbo, you'd say. How do you use two frameworks at the same time without them interfering with one another? Really, it couldn't be simpler, and it's all thanks to a directive that AngularJS already ships with.

<div id="main-vue-app" ng-non-bindable>
  <first-vue-component />
  <some-other-vue-component />
</div>

That's it? I told you it was simple. When you attach this directive to any HTML element, AngularJS stops watching whatever is inside it. Basically, it tells the framework not to treat anything inside as Angular code. What happens inside a non-bindable element stays within it.

Having this, you can start developing your new VueJS app as you’d normally do. Declare your main app, attach your components, your stores if you’re using Vuex (and you certainly should do it) and that’s it, you have a full working VueJS app within an AngularJS application.

// Require Vue component files
import firstVueComponent from 'path-to-component'
import someOtherComponent from 'path-to-other-component'

new Vue({
  el: '#main-vue-app',
  store: mainStore,
  components: {
    firstVueComponent,
    someOtherComponent
  }
})

As simple as it may seem, this is really all you need. Of course, things can become way more complicated, but if you are, as we were, thinking of migrating your own app, I can't recommend enough that you consider this framework. So far it's been great, and we've only just started! We can't wait to keep implementing new functionality with this amazing framework.

Happy coding!

What came and what’s to come at Geoblink Tech

Happy New Year! In these first days of 2018, here at Geoblink we have taken a quick look back at the technologies that got us most excited during 2017. This includes technologies that some of us had to learn to work with our existing systems, ones that we played around with just for fun, and others that were new to us and cool enough to end up in our production systems.

Not only that, but we also compared that list against the technologies that each of us is looking forward to learning or working with in 2018. We hope you find the list interesting, and if you want to comment on it, let us know on Twitter (@geoblinkTech).


2 days of fun and data: Big Data Spain 2017

On Thursday and Friday last week a few geoblinkers from the Tech team were fortunate enough to attend Big Data Spain in Madrid, “one of the three largest conferences in Europe about Big Data”.

The line-up of speakers this year was amazing and they certainly didn't disappoint. Moreover, our VP of Technology Miguel Ángel Fajardo and our Lead Data Scientist Daniel Domínguez had the chance to actively participate as speakers with a thought-provoking talk titled "Relational is the new Big Data", where we tried to highlight how relational databases can nowadays solve many use cases regardless of the size of your dataset, adding lots of benefits compared to other NoSQL options.

Relational is the new Big Data
