2 days of fun and data: Big Data Spain 2017

On Thursday and Friday last week a few geoblinkers from the Tech team were fortunate enough to attend Big Data Spain in Madrid, “one of the three largest conferences in Europe about Big Data”.

The line-up of speakers this year was amazing and they certainly didn’t disappoint. Moreover, our VP of Technology Miguel Ángel Fajardo and our Lead Data Scientist Daniel Domínguez had the chance to participate as speakers with a thought-provoking talk titled “Relational is the new Big Data”, where we tried to highlight how relational databases can nowadays solve many use cases regardless of the size of your dataset, with lots of benefits over NoSQL options.

Relational is the new Big Data

Overall it was a great event, with more than 60 speakers and 1200 attendees over 2 days. Apart from the talks, lots of stands crowded the halls, showing products, hosting parallel talks or giving demos of the state of the art in Big Data processing and understanding. A really inspiring atmosphere filled the Kinépolis Congress Centre in Madrid, with lots of networking and sharing of enriching experiences from our work and challenges.

For those of us who also attended last year, it really felt like the event had become much bigger and more professional, and the schedule was respected much better than last year (the food was very good too 🙂). If we had to mention something to improve, we would say that there were surprisingly few questions and little discussion after each talk, although sometimes these would happen later in the corridors.

Geoblink Data Team

As the number of talks was insane (4 simultaneous talks every 40 minutes), in this post we will just highlight those that we were able to attend and found most interesting or relevant, so that you get an overview of what all the fuss was about.

Given the technologies we currently use at Geoblink, the talk by Holden Karau on the upcoming performance improvements in PySpark (e.g. vectorized User-Defined Functions) enabled by Apache Arrow was really exciting.

We also attended a couple of interesting talks about graph databases, very relevant to our work, as we use them on a daily basis to store our road network and perform our routing algorithms.

The first one was about ArangoDB, a distributed graph database whose most relevant feature is that vertices are schema-free objects, meaning that they may be documents, JSON objects, etc. Another interesting feature is an index called the Vertex Centric Index, which indexes all edges based on their vertices and some other arbitrary attributes. The query speeds they claim seemed interesting. Finally, their SQL-like query language, AQL, makes the learning curve less steep.
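
To make the idea of a vertex-centric index more concrete, here is a minimal conceptual sketch in plain Python. It is not ArangoDB's implementation or API: the edge data, the `road_type` attribute and the `build_index` helper are all hypothetical, chosen only to show why keying edges by (vertex, attribute) avoids scanning every edge of a vertex.

```python
from collections import defaultdict

# Hypothetical road-network edges; names and attributes are made up for illustration.
edges = [
    {"from": "A", "to": "B", "road_type": "highway"},
    {"from": "A", "to": "C", "road_type": "street"},
    {"from": "A", "to": "D", "road_type": "highway"},
    {"from": "B", "to": "D", "road_type": "street"},
]


def build_index(edges, attribute):
    """Group edges by (source vertex, attribute value)."""
    index = defaultdict(list)
    for edge in edges:
        index[(edge["from"], edge[attribute])].append(edge)
    return index


index = build_index(edges, "road_type")

# A single lookup returns only the highway edges leaving A,
# without touching A's other edges.
highways_from_a = [edge["to"] for edge in index[("A", "highway")]]
print(highways_from_a)  # ['B', 'D']
```

The benefit grows with the number of edges per vertex: a lookup touches only the edges that match both the vertex and the attribute value.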

The second talk was about how Neo4j uses the Raft algorithm to achieve consistency in their graph database. Raft is a consensus algorithm in which multiple servers agree on a value. It makes it possible to preserve data consistency across the whole cluster.
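
The core idea of majority agreement can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Neo4j's code: it leaves out leader election, terms and log repair, which real Raft needs, and the `Server` class and `replicate` function are hypothetical names. It only shows that an entry counts as committed once a strict majority of servers has stored it.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def append(self, entry):
        """Store the entry and acknowledge it, unless the server is unreachable."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


def replicate(leader, followers, entry):
    """Return True if a strict majority of the cluster stored the entry."""
    leader.log.append(entry)
    acks = 1  # the leader counts as one acknowledgement
    for follower in followers:
        if follower.append(entry):
            acks += 1
    cluster_size = len(followers) + 1
    return acks > cluster_size // 2


leader = Server("a")
followers = [Server("b"), Server("c", reachable=False)]
committed = replicate(leader, followers, "road_segment=42")
print(committed)  # True: 2 of 3 servers stored the entry
```

With three servers the cluster keeps accepting writes while one of them is down, but not while two are.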

While the focus of the conference was on Big Data technologies, we must not forget that they are tools that help us data scientists do our daily work, which largely consists of using statistical models to gain insight into a variety of topics. Big Data Spain didn’t forget about statistics either, and there were a couple of quite technical but very interesting talks by Totte Harinen and Sean J. Taylor, data scientists at Uber and Facebook, respectively.


In his talk, Totte Harinen argued that causal inference, far from having been killed by Big Data as Chris Anderson suggested in his famous article https://www.wired.com/2008/06/pb-theory/, is actually more alive than ever, and he mentioned some real-life use cases from his work at Uber. Sean J. Taylor, for his part, gave a practical list of tips for avoiding common data errors and emphasized the importance of using models that estimate uncertainty when possible. He also stressed the need for testing statistical models and for reproducible data science workflows.
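
As a small illustration of what reporting uncertainty can look like in practice, here is a sketch of a percentile bootstrap interval for a mean, using only the Python standard library. It is not taken from the talk: the sample values, the `bootstrap_mean_interval` helper and its parameters are our own assumptions.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, level=0.9, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical measurements, e.g. daily visits to a store.
sample = [12, 15, 9, 20, 14, 11, 17, 13]
low, high = bootstrap_mean_interval(sample)
print(f"mean={statistics.fmean(sample):.2f}, 90% interval=({low:.2f}, {high:.2f})")
```

Reporting the interval next to the point estimate makes it obvious how much of a difference could simply be noise, and fixing the seed keeps the analysis reproducible.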

Also remarkable was the number of talks about streaming and real-time processing, many of them illustrating practical use cases for IIoT, NLP, etc. using combinations of Kafka, Spark Streaming and Apache Flink.

The presentation by Tyler Akidau (Google) on what is required to provide robust stream processing support in SQL left us with a much better understanding of the relationship between different types of data processing (batch vs streaming) and how the Apache Beam model is trying to unify them. While it is all still very theoretical, it did sound very promising.


Those who struggle with data every day know that normally it is not just about Big Data, it is much more about good data. Irene Gonzálvez came from Spotify to talk about the importance of data quality for making good decisions (or, better said, for avoiding bad ones) and how they work on monitoring and testing data. At Geoblink we work hard to set up quality assessment pipelines for all our data ETL, so we care a lot about this topic.
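
To give a flavour of what testing data can mean, here is a minimal sketch of a quality check in plain Python. It does not describe Spotify's tooling or our actual pipelines: the `check_quality` function, the column names and the sample rows are hypothetical.

```python
def check_quality(rows, required, ranges):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for index, row in enumerate(rows):
        # Completeness: every required column must be present and not None.
        for column in required:
            if row.get(column) is None:
                problems.append(f"row {index}: missing {column}")
        # Validity: numeric values must fall inside their allowed range.
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is not None and not low <= value <= high:
                problems.append(
                    f"row {index}: {column}={value} outside [{low}, {high}]"
                )
    return problems


# Hypothetical store locations: one missing latitude, one impossible latitude.
rows = [
    {"store_id": 1, "lat": 40.4, "lon": -3.7},
    {"store_id": 2, "lat": None, "lon": -3.7},
    {"store_id": 3, "lat": 95.0, "lon": 2.1},
]
problems = check_quality(
    rows,
    required=["store_id", "lat", "lon"],
    ranges={"lat": (-90, 90), "lon": (-180, 180)},
)
for problem in problems:
    print(problem)
```

Running a check like this as a step of an ETL job, and failing the job when the list is not empty, stops bad rows before they reach anyone making decisions.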

Last but not least, an issue that kept popping up throughout the conference was the European Union's General Data Protection Regulation (GDPR), which comes into force in 2018, and whether it will put European companies at a competitive disadvantage with respect to non-European ones. Paco Nathan (O’Reilly), Holden Karau (Google), Andy Petrella (Kensu), Jim Webber (Neo4j) and José Borja Tomé (Agencia Tributaria) were some of the speakers who had the chance to give their opinions on the matter.

All in all, Big Data Spain has been a great experience for us and we are looking forward to coming back next year!

 

By Jordi Giner and Vicente 'Tito' Lacuesta