Data Science: Beyond the Kaggle
15 May 2016
A few weekends ago, on a snowy Saturday in April (not uncommon in Denver), I signed into Kaggle for the first time in several months, looking to play around with some competition data in order to while away the chilly day. My kids’ endless chatter and my wife’s disapproving looks faded into the background, and I blissfully wrangled data from the Expedia Hotel Recommendation competition for several hours. I submitted a few entries, slowly climbing the leaderboard, and then finally I got up to help with my family duties.
That night in bed, my mind whirled with possibilities for what I could do with the data to improve my score – different variables I could use, several time-related features I could engineer, and thoughts about how to ensemble a couple dissimilar models together.
I woke up early Sunday and fired up my project in RStudio. Between breakfast and reading a few news articles, I submitted a few more entries. None were improvements on my previous entries. By noon, I lost interest. I quit. I shut my computer, said sorry to my wife, got on the floor, and started playing with my kids. My mind slowly spun down, and I focused on the moment. And I became happy again.
I have a love-hate relationship with Kaggle, the self-proclaimed “Home for Data Science.” There are a lot of reasons to love Kaggle. The competitions can be interesting, the forum participation is fascinating and educational, it’s instructive to see how companies are using predictive analytics, and the whole process is a fun, hands-on way to learn about Machine Learning techniques.
But it’s not Data Science. Kaggle’s bread-and-butter is its Machine Learning competitions, or Predictive Analytics competitions. The processes used in Machine Learning competitions only encompass a very small fraction of the Data Science process. I suppose “Your Home for Data Science” sounds better from a marketing standpoint than “Your Home for Machine Learning Competitions” or “Your Home for Cross-Validation, XGBoost, and Overfitting Techniques.” But I think people, especially those new to the craft and/or those companies looking to hire Data Scientists, should be aware that the Kaggle competitions only encompass a small part of the Data Science process.
As I have made my own transition from long-time data analyst to modern Data Scientist, I have read several articles about “How to be a Data Scientist.” Many, perhaps most, of the current articles suggest participating in Kaggle, or building a Kaggle resume/portfolio and getting a good ranking as a ticket to a job. Some authors stress this more than others. One person suggested that participating in Kaggle competitions could be the equivalent of a Master’s degree. Another comment for a different article said “all you need to do is Kaggle.”
Really? That’s all you need to do? Sure, the skills learned and practiced in Kaggle competitions are important to Data Science. But, based on my experience and the range of articles I have read on the subject, the skills to be good at Kaggle are maybe only 5-10% of the skills that you need to be a useful Data Scientist. Here are some of the reasons why Data Science > Kaggle:
- Defining the Question: When working Data Science for a company, one of the first steps, and perhaps the most important step is defining the question(s). What is the problem you are trying to solve? How will it help your business? What data do you currently have to support your analysis? What data do you need to find or create? If, as a Data Scientist, you can’t ask good questions, you are sunk before your ship even sails. In Kaggle competitions, you are just given the question – it is served up to you on a silver platter. Get the data and go!
- Data Acquisition: Working Data Science for a company, one of your challenges will likely be just finding the data. Where do you get it? Maybe you scrape a web page. Or you hit a public API, which might be an ugly municipal government SOAP API. Perhaps you use SQL to pull from a company relational database. Or you go harass your friend in engineering to deliver you a data set. For a Kaggle competition, you are given a data set or data sets, likely in CSV format. Many competitions forbid finding and using external data for your solution.
- Data Cleaning and Data Set Creation: In the real world, after you’ve found a data set, or several data sets that you have to figure out how to merge, there is usually a significant amount of data janitor work. This is a big hurdle, and arguably takes more effort than any other step for a Data Scientist. With the Kaggle competition data sets, sure, there might be some cleaning you have to do, but for the most part the data are fairly clean already and you just have to figure out what to do with NA values or outliers.
- Anonymized Data: Companies that work with Kaggle to create competitions are understandably conscious about competitors using their data to gain a competitive advantage. To prevent this, Kaggle competitions usually use anonymized data. While understandable, I think this gets in the way of really understanding the data. If all I have is some arbitrary feature names, like V1, V2, V3, etc, and some integer values for those features, it is hard to use intuition to help solve the problems. Sure, we can do feature engineering to “combine” features in creative ways, but it’s not the same as intuiting that you can multiply V1 and V2 and divide by V3 to get a reasonable engineered feature.
- Evaluation Metrics: Another difficult part of real-world Data Science is coming up with a metric to evaluate the effectiveness of your machine learning model. Kaggle competitions, in order to be fair, must hand you the evaluation metrics. The flip side of this is that I’ve learned about several metrics that I otherwise might not have.
- Scoring Against the Evaluation Metric: In the real world, accuracy and having a good algorithm are important. You want your machine learning model to be accurate. However, more important than having the most accurate algorithm is that your model is useful. Ensembling a dozen or more different machine learning models to win a Kaggle competition is cool, and it displays a fair amount of computer science chops (running and combining that many models isn’t trivial!), but it’s not very useful in the real world (the winning Netflix Prize algorithm is a famous example, though that wasn’t a Kaggle competition). In real-world Data Science, it’s more important to have a useful model that you can explain, and that can run in a short enough amount of time to be usable in a production setting. In my own job, I’ve suggested to higher-up managers that I could use ensembling to improve our models, maybe gaining 5-10% in accuracy over where we are now. They said, effectively, “that’s cool, but will our customers notice an incremental improvement?” I said probably not. “So let’s focus on answering some other questions first.” Fair enough!
- Scripts and Scoring: Kaggle scripts are an ingenious way for competitors to share their work, and for newcomers especially to learn about coding techniques that the experts are using. Unfortunately, the scripts can lead to inflated user rankings for unscrupulous and lazy users who don’t want to do the work themselves. I believe it is too easy to fork a nice script that someone has written, click a button to run the script, and submit the results (you could do it all from your iPhone!). This takes a few seconds to minutes at most, and doesn’t require you to download data, import it into R or Python, and really figure out how the algorithm works. And yet, you can get full points for the leaderboard position your copied script takes! I found one such script-button-pusher who had gotten enough points to be in the top 0.75% of Kaggle user rankings! It was completely evident from a few of his results: 2 or 3 submissions for each of about half a dozen competitions, with the “script” icon next to his results, finishing in a knot of others with the exact same score using the exact same script. His LinkedIn profile headline says “Machine Learning Enthusiast.” Apparently not enthusiastic enough to do his own work. So I say: Recruiters beware of a person’s Kaggle ranking! That’s only one data point that could be easily fudged to say “I’m top 1%!” For more good bullet points on the pros/cons of Kaggle scripts, see this Quora discussion.
- Substantive Expertise: One of the coolest parts about Kaggle is that you get to learn about how different companies are using Data Science in their operations. As a budding Data Scientist, it helps you realize that Data Science is becoming more important in virtually every industry and scientific endeavor. That’s exciting! But in a 2- to 3-month competition, unless you have virtually no family or work commitments, it seems as if it would be almost impossible to become a substantive expert in your competition of choice. And, of course, the famous Venn Diagram of Data Science shows that substantive expertise is just as important as Hacking Skills (programming) and Math/Stats knowledge. No doubt competition winners, maybe top five percenters, get a fair amount of substantive expertise during the course of the competition (or already had it coming in), but for the rest of the field you are likely more in the Machine Learning slice than the Data Science slice.
- Telling the Story: According to an HBR article from 2013, a Data Scientist’s real job is storytelling. You need to tell the story of your data, from beginning to end, what it means, and how it affects the business. When telling the story, effective visualizations are paramount! For the overwhelming majority of Kaggle participants, this step is completely skipped. Sure, winners might be obligated to write up results as part of the condition of accepting prizes, and many winners and others blog about their results, but those blogs are more about the machine learning process, what features they engineered, and how they ensembled their models. Interesting from a technical perspective? Yes! Useful from a business perspective? Not so much. With anonymous features and a lack of substantive expertise, it becomes extremely hard to present a compelling story about the data, which is what you would be required to do in a business setting. So, for me, more admirable than being a consistent top-percentage finisher in Kaggle competitions would be to have a Data Science blog where you compile and display your own projects and your own stories (one such excellent example is Julia Silge’s data science ish blog).
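To make the Data Cleaning and Anonymized Data bullets a little more concrete, here is a minimal Python sketch of the kind of work they describe: filling in NA values and then building the V1 * V2 / V3 feature mentioned above. Everything in it is an assumption for illustration only. The rows are made-up toy data, the helper names `impute_median` and `add_ratio_feature` are hypothetical, and median imputation is just one of many reasonable ways to handle missing values, not a recommendation for any particular competition.

```python
from statistics import median

# Hypothetical anonymized rows (toy data); None marks a missing (NA) value.
rows = [
    {"V1": 2.0, "V2": 3.0, "V3": 4.0},
    {"V1": None, "V2": 5.0, "V3": 2.0},
    {"V1": 4.0, "V2": 1.0, "V3": 0.0},
    {"V1": 6.0, "V2": 2.0, "V3": 3.0},
]


def impute_median(rows, column):
    """Return new rows with missing values in `column` replaced by the
    median of the values that were actually observed."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_ratio_feature(rows, name="V1xV2_over_V3"):
    """Return new rows with an engineered feature V1 * V2 / V3.
    The feature is None where V3 is zero, to avoid dividing by zero."""
    out = []
    for r in rows:
        value = r["V1"] * r["V2"] / r["V3"] if r["V3"] else None
        out.append({**r, name: value})
    return out


cleaned = impute_median(rows, "V1")
features = add_ratio_feature(cleaned)

for row in features:
    print(row)
```

Note that the code has no idea whether multiplying V1 by V2 and dividing by V3 means anything. With anonymized columns, neither do you, which is the point of that bullet.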
My point here isn’t to complain that Kaggle is a bad company, or has a bad service. I still think Kaggle is, overall, an awesome company with a great vision about crowd-sourcing machine learning. I am excited about Kaggle Datasets, and seeing what people can do with that new service. They have an amazing jobs listing service, which is another great place to learn what companies are hiring Data Scientists. Their forum is full of amazing contributions from brilliant machine learning experts from across the world. It’s a fun place to while away a weekend or a few evenings if you have the time to do so, and a great place to learn about new machine learning techniques. My main point is to warn aspiring Data Scientists that there is a lot more to Data Science than just Kaggle competitions, or your Kaggle scripts profile. Take courses (MOOCs or formal education); find your own data sets; do full-stack analytics, including exploratory analysis, machine learning, and visualization; engage on social media; start a blog. Show you know how to ask questions, source your own data, build your own project from scratch, and write up the results. Kaggle can be an enormous time-suck, so don’t allot it more time than its “weight” if you want to be a well-rounded Data Scientist (maybe 5-10%). And then use your skills to make the world a better place.
What do you think? I’d love to hear your opinions, whether supportive or antagonistic. I’m ready for the “sour grapes” comments because my Kaggle ranking isn’t all that high (5195 out of 545,284 as I publish – see my Kaggle profile), so I’d prefer that those who disagree use a more thought-out argument (my counterpoint to the sour grapes argument is that I usually lose interest in the competitions after a few hours due to some of the reasons bulleted above). Otherwise, I’m interested to hear what others in the Data Science profession think.