It is often said that 70% of data science is about manipulating, cleaning and transforming data. Talk to any data scientist, though, and they will tell you that even 70% is optimistic. On a serious note, the science part of a data scientist's job comes into the picture only after the data is in a clean vector-space format for the celebrated machine learning libraries to consume. I've always found that clever, intuitive features beat black-box libraries in the long run.
Although the team is very open and adept at trying out new technologies, it is also very practical and hard to please. Geeks at Sokrati rarely rely on hypothetical quadrants, graphs or other baseless comparisons when picking technologies. It was thus a little surprising when the team was sold on the Hadoop (and subsequently Apache Spark) bandwagon pretty early. Given the sheer, mind-numbing scale of data and the number of advertisers we process, though, the choice was inevitable.
Working with Hadoop in the early days, however, was not fun. The big picture of rapidly creating features for machine learning pipelines was getting obscured by the low-level plumbing needed to work with the Hadoop Java APIs directly. We were always looking for a better abstraction on top of a wonderful tool like Hadoop, when we stumbled upon functional programming.
Although we never believe that “language/paradigm X is absolutely better than Y”, we did find tools like Cascading (an abstraction on top of the Hadoop APIs), Cascalog (a Clojure abstraction over Cascading) and Spark to be real additions to our toolkit. Most of the work done in building machine learning pipelines, such as feature engineering and aggregations, fits the functional paradigm well, i.e.:
a) the computations are stateless
b) they can be easily modelled as one tuple being transformed into another through a (mathematical) function.
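A minimal sketch of what point (b) looks like in practice; the record layout and function name here are hypothetical illustrations, not taken from our actual pipeline:

```python
def click_through_rate(record):
    """Pure, stateless transform: one (advertiser, clicks, impressions)
    tuple in, one (advertiser, ctr) tuple out. Because there is no shared
    state, the function is trivial to unit test in isolation."""
    advertiser, clicks, impressions = record
    ctr = clicks / impressions if impressions else 0.0
    return (advertiser, ctr)
```

Because the function depends only on its input, a unit test is a single assertion, and a framework like Cascading or Spark can apply the same function to millions of tuples in parallel without any change to it.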
Functional programming tools have allowed us to code and unit-test the business logic succinctly and easily. The result has been a faster, more productive way of working with big data. Many teams at Sokrati have successfully built a variety of jobs using Cascading, Cascalog and Spark for various stages of our big data processing pipeline. Tapomay, a machine learning geek on the team who has been pleasantly surprised by Spark, says, “Just a good understanding of the concept of resilient distributed datasets allowed me to express very complex computations in a functional way. I didn’t have to worry about the distributed processing aspect at all; it was completely transparent.” The team is now quite excited about leveraging Apache Spark further.
If you are excited to work in a team that plays around with data to generate business insights, simply write to us at firstname.lastname@example.org.