MICROSOFT BOT FRAMEWORK AND OUR FRIEND LUIS

In a previous blog post, we talked about our experience at the 2017 Hackatrips Hackathon at Fitur. In this post, we take a more in-depth look at the technologies we used and how we put it all together, with a special emphasis on Natural Language Processing (NLP) with LUIS.

To refresh your memory: we developed a bot on the Microsoft Bot Framework that lets users share Cabify rides.

Getting started with the Bot Framework was quite simple thanks to the Node.js SDK. Another very cool thing about the Microsoft Bot Framework is the Bot Framework Emulator, a little piece of software that lets you test your bot very easily on localhost. While it is missing some useful features, such as handling more than one conversation at a time, it did wonders for developing the bot quickly and efficiently, which was key at the hackathon.
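
For reference, the kind of skeleton the Node.js SDK (Bot Builder v3 at the time) gives you looks roughly like this. It is a minimal sketch rather than our actual hackathon code, and the echo behaviour and environment variable names are just illustrative:

var restify = require('restify')
var builder = require('botbuilder')

// Connector that the Bot Framework Emulator attaches to on localhost
var connector = new builder.ChatConnector({
  appId: process.env.MICROSOFT_APP_ID,            // can be left empty for local testing
  appPassword: process.env.MICROSOFT_APP_PASSWORD
})

// A bot that simply echoes back whatever the user types
var bot = new builder.UniversalBot(connector, function (session) {
  session.send('You said: %s', session.message.text)
})

// Expose the messaging endpoint the emulator points at
var server = restify.createServer()
server.post('/api/messages', connector.listen())
server.listen(process.env.PORT || 3978, function () {
  console.log('Bot listening on %s', server.url)
})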

The dialog tree for our app was very straightforward. After the initial greeting, the user can ask for a cab. The app asks questions until all necessary info (pickup time and destination) is gathered, and then sends a quote with the price of the ride. If during the waiting period another user in the vicinity is going to a nearby destination, both users are notified that they will share a ride, and the new, cheaper price is displayed.

Now, of course, there are infinitely many ways to ask for a cab, and infinitely many ways to ask for a time and destination. We need our bot to understand what it is being told. To tackle this problem, we used LUIS, Microsoft’s natural language classifier. A natural language classifier is a machine learning tool that takes a sentence, classifies it according to its intent, and recognizes the entities it contains.
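
Wiring LUIS into the bot is straightforward with the SDK's LuisRecognizer. Here is a rough sketch; the model URL, intent names and dialog names are placeholders, not necessarily the ones we used:

var builder = require('botbuilder')

// LUIS publishes each trained model behind an endpoint URL (placeholder here)
var recognizer = new builder.LuisRecognizer(process.env.LUIS_MODEL_URL)

// Route incoming utterances to dialogs based on the intent LUIS returns.
// `bot` is the UniversalBot from the previous snippet.
var intents = new builder.IntentDialog({ recognizers: [recognizer] })
  .matches('Greeting', function (session) {
    session.send('Hi! I can get you a shared Cabify ride.')
  })
  .matches('OrderCab', '/orderCab') // hand over to the slot-filling dialog
  .onDefault(function (session) {
    session.send("Sorry, I didn't get that.")
  })

bot.dialog('/', intents)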

The concepts of intent and entity are crucial for this exercise. An intent captures the will of the user: what they want to get from an action. Entities are the relevant pieces of information that can be extracted from an utterance. For example, our app had only two intents: to greet and to order a cab. Within the ‘order a cab’ intent there were two entities: pickup time and destination. We designed this bot around a so-called ‘slot-filling’ model: the user invokes the intent, and the bot keeps asking questions until all the slots are filled and all the info is acquired (there is a code sketch of this right after the list below). In our case, we had three scenarios:

  1. The user just asks for a cab: In this case we are missing both the destination and pickup time slots, so the bot asks for one, then the other, and then it’s ready to go.
  2. The user gives one of the entities: The user says they want to go to Hotel Marigold. We know they want a cab and we know where they want to go; we are just missing the time, so the bot asks for it. Vice versa if the user gives just a pickup time. The bot has to understand which entity has been gathered and ask for the other one.
  3. The user says everything in one sentence: In this case, it is important that the model is well trained to perfectly separate the time and destination entities.
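
For illustration, here is a simplified sketch of how that slot-filling can look with the SDK's waterfall dialogs and prompts. It is a reconstruction, not our exact hackathon code: the entity names ('PickupTime', 'Destination') and the prompts are placeholders for whatever the LUIS model defines.

// Slot-filling dialog: keep prompting until both slots are filled
bot.dialog('/orderCab', [
  function (session, args, next) {
    // Entities LUIS may already have extracted from the first utterance
    var entities = (args && (args.entities || (args.intent && args.intent.entities))) || []
    var time = builder.EntityRecognizer.findEntity(entities, 'PickupTime')
    var destination = builder.EntityRecognizer.findEntity(entities, 'Destination')
    session.dialogData.pickupTime = time ? time.entity : null
    session.dialogData.destination = destination ? destination.entity : null

    if (!session.dialogData.destination) {
      builder.Prompts.text(session, 'Where would you like to go?')
    } else {
      next()
    }
  },
  function (session, results, next) {
    if (results && results.response) {
      session.dialogData.destination = results.response
    }
    if (!session.dialogData.pickupTime) {
      builder.Prompts.text(session, 'When should we pick you up?')
    } else {
      next()
    }
  },
  function (session, results) {
    if (results && results.response) {
      session.dialogData.pickupTime = results.response
    }
    // Both slots filled: this is where the Cabify quote logic would take over
    session.send('Looking for a cab to %s at %s…',
      session.dialogData.destination, session.dialogData.pickupTime)
  }
])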

LUIS was a great fit for our bot because it is relatively simple to use, quick to train, integrates seamlessly with the Bot Framework, and (let’s face it) we really wanted that Xbox One. Once our intents and entities were defined, we had to train the natural language model. We fed the model a few dozen different ways of asking for a cab, covering all possible combinations of slot-filling. After a few examples, the model could recognize the entities by itself, needing only validation from us.


In addition to building the guts of the bot, we had to work with the Cabify API. We devised a system for ordering and cancelling rides so that cabs could be shared: user requests were stored, and when a second user ordered a cab from a nearby location, both users could be unified into a single cab request. We also used the Google Places API to geocode the destination request and get a latitude and longitude for the API to set as the destination of the taxi. Once the entire Cabify logic was finished, it was incorporated into the conversation, giving the user the price and definitive pickup time.
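
As a rough illustration of the geocoding step, here is a hypothetical helper (not our actual code), shown against Google's Geocoding endpoint as a stand-in for the Places-based lookup we used:

var https = require('https')

// Hypothetical helper: turn free text such as 'Hotel Marigold' into coordinates
// that can be set as the taxi's destination
function geocodeDestination (address, apiKey, callback) {
  var url = 'https://maps.googleapis.com/maps/api/geocode/json' +
    '?address=' + encodeURIComponent(address) +
    '&key=' + apiKey

  https.get(url, function (res) {
    var body = ''
    res.on('data', function (chunk) { body += chunk })
    res.on('end', function () {
      var parsed = JSON.parse(body)
      if (!parsed.results || parsed.results.length === 0) {
        return callback(new Error('No geocoding result for: ' + address))
      }
      var location = parsed.results[0].geometry.location
      callback(null, { latitude: location.lat, longitude: location.lng })
    })
  }).on('error', callback)
}

// Usage: geocodeDestination('Hotel Marigold, Madrid', process.env.GOOGLE_API_KEY, function (err, coords) { ... })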

The end result was something like this:

[Screenshot: a conversation with the bot in the Bot Framework Emulator]

Our initial objective was to deploy the app to Azure and then connect the bot to Facebook Messenger through a Facebook app we had created for this purpose. We ran into a bit of trouble on this front, and time was certainly a constraint, so we settled for demoing the project on the emulator.

You can have a quick look at the code here: https://github.com/hackatrips-team-3/hackatrips

By Ignacio "Guli" Moreno

Geoblinkers @ Hackatrips: A Hackathon Experience

This past January, Fitur (Spain’s largest yearly tourism event) and minube (an online travel platform) joined forces to organize the first ever Hackatrips Hackathon. The objective? To engage a group of developers, tourism specialists and designers to find and build upon great ideas on sustainable travel. After all, 2017 is the Year of Sustainable Tourism for Development.

The logistics of the hackathon itself were in the hands of Hackathon Lovers, a team dedicated to organizing hackathons and spreading a passion for development. In my opinion, they did a fantastic job putting this hackathon together: they secured a healthy turnout, built evenly balanced teams, and handled technical issues and communication through Slack.

Plenty of sponsors co-produced the event, providing APIs, awards, or both. The main sponsors were of course minube, as well as Microsoft, HotelBeds, Cabify, Porsche, Alcalá de Henares, Carto, Goldcar, Hoteles Lopesan and Tryp Hoteles.

The Hackathon

The event was very well put together. I was actually quite surprised at the turnout, a total of 60 people from different sectors and fields of expertise. We were organized in 12 teams of 5, each consisting of 3 developers, a tourism expert, and a designer. Representing Team Geoblink were Gabriel Furstenheim and myself, both of us Software Engineers in the Web team at Geoblink. The team also included Valentín Berlin, a freelance JS developer, and Fanny Fernández, a UI/UX designer.

We had a nice breakfast upon arrival, met our new teammates, and listened to the intros. Each tech partner pitched why we should use their API (and backed up their pitch with the great awards each offered), and we had a quick 10 seconds to introduce ourselves. The instructions were very simple: we had until 1:30pm the next day to create an app related to sustainable tourism using only the provided APIs or other public APIs, record a demo video showing how it works, and prepare a nice pitch to show off our idea. There were no other rules or limitations. And so we got to work.

After a brainstorming session, pondering the pros and cons of each API, evaluating our know-how, and taking time into consideration, we settled on exploring the use of the Microsoft and Cabify APIs. We were quick to organize the team into a parallel task machine, and made sure our expectations and objectives were clear. It is very common for people to want to do too much and end up achieving too little. In a time-critical environment such as a hackathon, it is better to do one thing very well than to try to do a million things and end up explaining to the judges how close you were to actually doing it.

Our objective was to create a conversational interface to share rides using Cabify. We decided to use the Microsoft Bot Framework to handle the dialog, LUIS to understand language, and naturally Cabify’s API to ask for rides.

The Work

We chose to develop the bot using the Node.js SDK, both for familiarity (Node.js powers Geoblink, and the other team members knew JavaScript) and for scalability. Fortunately, the Microsoft Bot Framework has plenty of documentation for developing bots in Node. After everyone was set up, we established the different tasks to work on in parallel. Part of the team focused on understanding the intricacies of the Cabify API and figuring out a way to share rides, something the API wasn’t quite prepared to do, while the others focused on getting our bot talkin’. If you want a more in-depth look at how the bot was developed, you can check out the blog post here.

While the developers worked on the code, the designers worked on communicating the idea. Fanny, a designer by trade, created a nice set of slides and a story to explain what we were doing and how we did it. We set up a good dynamic and communication channel with Fanny, giving her a clear vision of what the app could do and how it could be improved in the future, which helped her establish the main selling points and create a solid presentation. The constant feedback loop between the two sub-teams worked great.

The Pitch

We had 5 minutes to present our pitch to the crowd and the panel of judges, followed by a 2-minute round of questions. Getting the timing right was the biggest challenge, and we came close to running out of time.

You can check out the slides here: https://www.slideshare.net/secret/EmkifUajFAu2ES


The Projects

One of the nicest things about this hackathon was the availability and diversity of the APIs that the teams could use. The presence of plenty of sponsors, along with teams of people from each one, made it easy to understand each of the APIs, their capabilities, limitations, and implementations. Both the Microsoft and Cabify teams were very helpful and responsive. Having such a broad range of possibilities made the competition a lot more interesting, with an equally diverse idea line-up. No two projects were quite the same.

Among my personal favorites, I’d like to highlight two. The first, called Envify, used Microsoft’s image recognition API to read images of the paradise-like landscapes of Tahiti or the Caribbean…and then showed you the closest matches right here in Spain. Very creative, and definitely in the sustainable tourism theme. It also looked beautiful. The second, Hidden Gems, suggested installing touch screen devices in small towns and rural areas, to engage the local community with rural tourism. Again, it was beautifully designed, with a nice presentation.

The Conclusion

We walked out of there with an Xbox One for the best use of Microsoft APIs, €100 in Cabify credit for the best use of the Cabify API, and the pride of positively representing the Geoblink engineering team.


All in all, it was a fantastic experience, a great team-building exercise, and an amazing learning opportunity. I’d highly encourage not just taking part in open hackathons such as this one, but also organizing internal hackathons to get the creative juices flowing in your company. I can’t wait for the first 2017 Geothon!

P.S. Lunch both days was amazing! Just another touch to a great weekend.

By Ignacio "Guli" Moreno

From CERN to Geoblink: A transition from the largest lab in the world to a not-so-large startup

Like some of you, I am one of the 28% of data scientists holding a PhD. In my case, I did my PhD in High Energy Physics at the CIEMAT institute, and thanks to that I was able to spend 4 unforgettable years at CERN looking for Higgs boson particles in the CMS experiment. My role, basically, was the statistical analysis and interpretation of the data. When I finished, I realised I had not become that-big-bang-theory-guy, and decided I wanted to work in and learn from fields other than Physics. So I decided to transition to a Data Science career path (c’mon, it’s the “sexiest job of the 21st century”, this must be good).

To make the transition, I started reading a lot about Data Science related topics like AI, Machine Learning and others. I also attended several online and offline courses. All of that helped me land several interviews, and I got a position at Geoblink over a year ago.

Looking back, I couldn’t be more thrilled about the new path I chose. The world of data science and analytics couldn’t be more exciting right now, and the positions on offer involve cool projects that require creativity and have a big impact, not only within specific companies but on the way every branch of the data realm is changing the world.

Throughout this transition, I learned a few things about moving out of academia, which I would like to share:

  • Data Scientists spend a big chunk of time being “Data Cleaners” (in some cases, “Data Kleenex” even). You have probably heard about it. You’ll spend most of your time cleaning/mining/formatting/structuring the raw data you get your hands dirty with. At Geoblink, we gather data from dozens of public/private sources for each country we offer services in, plus all the internal information from our clients, coming in all types and structures. This makes data mining complex, but on the other hand it has forced us to create powerful new tools to deal with all sorts of problems with the data, which is pretty cool. In the end, fancy Machine Learning is just the tip of the iceberg; the essential work is done well before that.
  • Statistics is your greatest weapon. In my opinion this is the core part of our role: the correct manipulation and interpretation of the data. You will have to defend your results in front of your managers and stakeholders. Sometimes they will have a different vision or intuitive expectation, and Statistics will be the way to explain your point and make it prevail. At Geoblink we are quite lucky, as everyone in the Product team and most members of the Marketing & Sales team are former engineers, so it is very easy for the Data team to explain the direction we want to take.
  • The devil is in the details. During my years at CERN, I lost count of the times it looked like I had discovered a new fundamental particle, only to realise it was just a statistical artifact from bad histogram binning, or some other silly thing. When you are treating correlated data as we do here, you must be very careful about what you are doing and how you interpret it. Again, that’s what Statistics was invented for (although sometimes it is really tricky).
  • Coding matters. Based on my experience, good coding skills are a great asset, even if they might not be essential, depending on your role in the company. On the one hand, they will greatly increase your productivity, as you won’t spend as much time rewriting crappy and buggy code (as happened to me when I started my PhD). Also, in a SaaS product such as the one we build at Geoblink, we intend to put all the analysis in the production pipeline, where you might have to interact with different languages, so we need more than just math skills. Finally, you will need to optimize the algorithms you use to improve analysis performance. In the Data team we have heterogeneous profiles, ranging from GIS developers to engineers, mathematicians and physicists, to cover all the aspects, but we are open to developing in the language that is most suitable for the task (R, Python, C++, Node, …). In that sense, we haven’t taken a side in the R vs Python battle.


  • Simple is better. Avoid the hype when you don’t see a clear benefit. Yes, TensorFlow is very cool, and so is the latest R library that includes ***insert here the newest machine learning algorithm that won the latest whatever Kaggle competition***. But most of the time you won’t need it. In fact, you might not be able to use it given the data. And other times, it won’t add much value over a well understood, more classic, faster-to-develop approach. Of course, sometimes using it does make a difference. Just learn about it, understand how it works and the value it provides, and make the right choice for your project.
  • Unicorns don’t exist. Personally, when I started looking for a job, I was overwhelmed by the number of frameworks/tools/languages which (they said) you were required to know in order to do the job. Don’t feel like you lack the skills; au contraire, you just need to learn the jargon and a few important tools, depending on your position. At Geoblink, I’ve learnt tons of SQL, Python, R, Machine Learning techniques, spatial analysis, GIS tools… and learning enough about them to use them effectively wasn’t that difficult. The only thing that I would really recommend is some database knowledge, in particular SQL, as I think that sooner or later you will need it, whatever your role is.

In conclusion, I think this is a great moment to work in Data Science. The massive amount of data being generated these days opens up a vast universe of possibilities, new ideas and knowledge to play with. And a tech profile from academia, while not mandatory, gives you a lot of the skills the job requires.


If this article caught your attention and you are interested in solving some awesome spatial-data challenges, visit Geoblink open positions!

PS: Sorry, fellows, but almost nobody has ever heard of ROOT and you probably won’t ever use it again (unfortunately).

By Daniel Domínguez

Postgres Foreign Data Wrappers (FDW)

Breaking the monolith

At Geoblink our database runs on PostgreSQL. This works great for us, as we get all the speed and flexibility of SQL, making it easy to adapt the backend to the changing requirements of the product.

This flexibility comes at a price though, as it is very easy to end up with a monolith-like set of schemas with cross-dependencies among the different parts. In our case this was especially true for ETL processes: introducing new demographic indicators was a very complicated task where we had to juggle several databases. In fact, making the data ready to be consumed by our application involved a lot of manual steps, making it difficult to automate the promotion and deployment of new data.

A few months ago we started planning our international expansion (UK here we come!) and it became obvious that our structure was not going to scale. Looking into possible solutions we found Postgres Foreign Data Wrappers (FDW), and we instantly fell in love with them. We added them to our infrastructure and our ETL processes are now a breeze!

How it used to work

On one hand we had the demographic database. Our Data team would gather a huge amount of data from all different kinds of demographic sources, ranging from the Spanish National Institute of Statistics and the equivalent British Office for National Statistics to economic studies by consulting firms. Using various models and heuristics the team would come up with a large demographic database. On the other hand we had the user database, containing the user information from production. In an ideal world these two databases would have been kept completely separate. However, that is not possible for us from a business perspective, as we offer several features that require combining data from both.

Our ETL worked as follows: we would replicate the user database from production into a server where we would run the computations. There we would mix user data with the demographic data in a process that could take several hours. The next step was to re-download all user data that had changed in production in the meantime, before pushing the new data into production.

We wanted to improve this process, and we knew that adding a new country would require a combined deployment, which could not possibly scale.

Postgres FDW

Postgres Foreign Data Wrappers are a really cool feature that allows us to “connect” two databases and query data from one inside the other as if they were a single database.

In its simplest form it works as follows:

-- Assumes a foreign server called remote_server has already been defined
-- (CREATE SERVER ... FOREIGN DATA WRAPPER postgres_fdw) along with a user mapping
CREATE SCHEMA store_data;
IMPORT FOREIGN SCHEMA store_data FROM SERVER remote_server INTO store_data;

Now we can join against the store data tables as if they lived in our own server:

SELECT client_stores.id, SUM(buildings.people) AS population
FROM buildings, store_data.client_stores
WHERE ST_DISTANCE(client_stores.geometry, buildings.geometry, true) < 50
GROUP BY client_stores.id;

This gives us, for each client store, the number of people living within 50 meters of it.

FDWs have been a part of Postgres since 9.1, but it was the 9.6 release that made them really powerful by allowing joins to be pushed down to the remote server.

FDW in action

In our case the FDW structure was very natural: we have a central server where we store user data, and then a server for each country that reads from the user schema through a foreign data wrapper.

Whenever we need demographic (“static”) data, we can query the corresponding country database. If we only need user (“changing”) data, then we query the User database. In those cases where we need to join user and demographic data (as in the example with the population around each store), we query the country database and the FDW sorts out the join with the user data.

This model is very flexible and has helped us streamline our ETL processes where we cook and push the new demographic data into production.

This is the process we follow (graphics powered by Mermaid):

  • We have this setup in production: for each country database holding demographic data there is a FDW so it can read from the user tables (UK and ESP represent the demographic/static data for those countries).
  • When the ETL transform stage begins, the server where we cook the data (we call it the computation server) connects to the User database, gets the latest information from production, and makes all the required calculations, while the current instance of the database with the “old” data keeps serving production.
  • Once all the computations are finished, we replicate this new database in production. Since the database contains the data for the new country, we can just move the binaries and restore them (instead of having to perform a full backup including recomputing the indices, which is always a costly operation).
  • At this point we have two databases for the same country in the server, so we can perform an instant switch so that users start accessing the new data, and then discard the old one.

To wrap up, Postgres FDW has allowed us to simplify the dependencies between the different databases and streamline our generation and promotion workflows. We definitely recommend evaluating it if you have databases that you want to keep separate because they represent different models, but whose data your ETL processes or features need to join.

You can find more information in the Postgres documentation. Enjoy!

By Gabriel Furstenheim

FIRST EDITION OF THE GEOBLINK HACKATHON, THE GEOTHON

This week is a special week at Geoblink! We have organized our first-ever, company-wide hackathon, which we call the GEOTHON.

The whole team behind Geoblink has come together and put aside our regular tasks for 2 days to focus on building something cool and different. Our objective is to come up with new ideas that bring innovation to the features we offer our clients and improvements to our internal processes. For us this is a very important event, as innovation and creativity are embedded in our DNA and we think they are some of the things that set us apart from other companies and products.

By the way, the GEOTHON is not a tech-only event! It’s a company-wide hackathon where we expect everyone to spark some creativity and come up with cool ideas, whether in the shape of new features, new internal tools, or presentations where we imagine what the next cutting-edge feature of our solution could be.


Hello world!

// A minimal Node.js HTTP server that replies 'Hello World!' to every request
var http = require('http')

var server = http.createServer(function (request, response) {
  response.writeHead(200, {'Content-Type': 'text/plain'})
  response.end('Hello World!')
})

server.listen(8000)