From CERN to Geoblink: A transition from the largest lab in the world to a not-so-large startup

Like some of you, I am one of the 28% of data scientists who hold a PhD. In my case, I did my PhD in High Energy Physics at the CIEMAT institute, and thanks to that I was able to spend four unforgettable years at CERN searching for the Higgs boson in the CMS experiment. My role, basically, was the statistical analysis and interpretation of the data. When I finished, I realised I had not become that-big-bang-theory-guy, and decided I wanted to work in and learn from fields other than Physics. So I decided to transition to a Data Science career path (c’mon, it’s the “sexiest job of the 21st century”, this must be good).

To make the transition, I started reading a lot about Data Science related topics like AI, Machine Learning and others. I also attended several online and offline courses. All of that helped me land several interviews, and I got a position at Geoblink over a year ago.

Looking back, I couldn’t be more thrilled about the new path I chose. The world of Data Science and analytics couldn’t be more exciting right now, and the positions on offer involve cool projects that require creativity and have a big impact, not only within specific companies but on the way every branch of the data realm is changing the world.

Throughout this transition, I learned a few things about moving out of Academia that I would like to share:

  • Data Scientists spend a big chunk of time being “Data Cleaners” (in some cases, even “Data Kleenex”). You have probably heard about it: you’ll spend most of your time cleaning/mining/formatting/structuring the raw data you get your hands dirty with. At Geoblink, we gather data from dozens of public/private sources for each country we offer services in, plus all the internal information from our clients, coming in all types and structures. This makes data mining complex, but on the other hand it has forced us to create powerful new tools to deal with all sorts of problems in the data, which is pretty cool. In the end, fancy Machine Learning is just the tip of the iceberg; the essential work is done well before that.
  • Statistics is your greatest weapon. In my opinion this is the core of our role: the correct manipulation and interpretation of the data. You will have to defend your results in front of your managers and stakeholders. Sometimes they will have a different vision or intuitive expectation, and Statistics will be the way to explain your point and make it prevail. At Geoblink we are quite lucky, as everyone in the Product team and most members of the Marketing & Sales team are former engineers, so it is very easy for the Data team to explain the direction we want to take.
  • The devil is in the details. During my years at CERN, I lost count of the times it looked like I had discovered a new fundamental particle, only to realise it was just a statistical artifact from a bad histogram binning, or some other silly thing (there is a small illustration of this right after these points). When you are treating correlated data as we do here, you must be very careful about what you are doing and how you interpret it. Again, that’s what Statistics was invented for (although sometimes it is really tricky).
  • Coding matters. Based on my experience, good coding skills are a great asset, even if they might not be essential, depending on your role in the company. For one thing, they will greatly increase your productivity, as you won’t spend as much time rewriting crappy and buggy code (as happened to me when I started my PhD). Also, in a SaaS product like the one we build at Geoblink, we intend to put all the analysis in the production pipeline, where you might have to interact with different languages, so we need more than just math skills. Finally, you will need to optimize the algorithms you use to improve analysis performance. In the Data team we have heterogeneous profiles, ranging from GIS developers to engineers, mathematicians and physicists, to cover all the aspects, but we are open to developing in the language that is most suitable for the task (R, Python, C++, Node, …). In that sense, we haven’t taken a side in the R vs Python battle.
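
To make the binning point concrete, here is a minimal, self-contained sketch (plain Python with NumPy; it is not any code we actually use at Geoblink, and the exponential background and naive per-bin significance are just assumptions for illustration). It histograms a smooth, signal-free spectrum twice and shows how finely binned, low-statistics data can throw up an apparently significant local excess out of pure Poisson noise:

```python
# A minimal sketch (not real analysis code): histogram a smooth, signal-free
# exponential spectrum and look for the most "significant" local excess.
# With fine bins and limited statistics, pure noise can look like a bump.
import numpy as np

rng = np.random.default_rng(42)
scale = 50.0
background = rng.exponential(scale=scale, size=2000)  # no signal anywhere

def expected_counts(edges, n_events, scale):
    """Expected events per bin for a pure exponential background."""
    cdf = 1.0 - np.exp(-edges / scale)
    return n_events * np.diff(cdf)

for n_bins in (20, 200):
    counts, edges = np.histogram(background, bins=n_bins, range=(0.0, 200.0))
    expected = expected_counts(edges, background.size, scale)
    # Naive per-bin significance, pretending Poisson fluctuations are Gaussian
    z = (counts - expected) / np.sqrt(np.maximum(expected, 1e-9))
    print(f"{n_bins:4d} bins -> largest local excess: {z.max():.1f} sigma")
```

Coarse bins typically wash the fluctuations out, while very fine bins make one of them look like a bump. The cure is not a magic bin width but knowing how big a fluctuation to expect before you get excited.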

  • Simple is better. Avoid the hype when you don’t see a clear benefit. Yes, TensorFlow is very cool, and so is the latest R library that implements ***insert here the newest machine learning algorithm that won whatever the latest Kaggle competition was***. But most of the time you won’t need it. In fact, you might not even be able to use it, given the data. And other times it won’t add much value over a well-understood, more classical, faster-to-develop approach. Of course, sometimes using it does make a difference. Just learn about it, understand how it works and the value it provides, and make the right choice for your project.
  • Unicorns don’t exist. Personally, when I started looking for a job, I was overwhelmed by the number of frameworks/tools/languages that (so they said) you were required to know in order to do the job. Don’t feel like you lack the skills; au contraire, you just need to learn the jargon and a few important tools, depending on your position. At Geoblink, I’ve learnt tons of SQL, Python, R, machine learning techniques, spatial analysis, GIS tools… and learning enough about them to use them effectively wasn’t that difficult. The only thing I would really recommend is some database knowledge, and in particular SQL, as I think that sooner or later you will need it, whatever your role is (there is a tiny example right below).
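
For those who have never touched SQL, this is the kind of everyday query I mean; a toy example using Python’s built-in sqlite3, with an invented schema that has nothing to do with Geoblink’s actual data:

```python
# A toy illustration (invented schema): aggregate sales per city and rank
# them. Uses Python's built-in sqlite3, so it runs anywhere.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id INTEGER, city TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'Madrid', 120.0), (1, 'Madrid', 80.0),
        (2, 'Barcelona', 200.0), (3, 'Madrid', 50.0);
""")

query = """
    SELECT city,
           COUNT(DISTINCT store_id) AS stores,
           SUM(amount)              AS revenue
    FROM sales
    GROUP BY city
    ORDER BY revenue DESC;
"""
for row in conn.execute(query):
    print(row)  # ('Madrid', 2, 250.0), then ('Barcelona', 1, 200.0)
```

Nothing exotic: grouping, aggregating and filtering tables like this covers a surprising share of the day-to-day work.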

In conclusion, I think this is a great moment to work in Data Science. The massive amount of data being generated these days opens up a vast universe of possibilities, new ideas and knowledge to play with. And a tech profile from Academia, while not mandatory, already has a lot of the skills the job requires.

If this article caught your attention and you are interested in solving some awesome spatial-data challenges, check out Geoblink’s open positions!

PS: Sorry, fellows, but almost nobody has ever heard of ROOT, and you probably won’t ever use it again (unfortunately).

By Daniel Domínguez