
Competing with the best data scientists can be challenging, especially when some of them have
been doing so for years. I know a few people who have well-automated scripts that perform
most of the data exploration! These people are already deciding on the best algorithms while
the rest of the world is still figuring out the nuances of the data.

Here are a few things you need to keep in mind before starting a problem on Kaggle:

1. Like all good things in life, winning a Kaggle competition is all about hard work. Get
ready to devote long hours to pondering the same problem for days, weeks or months.
2. Team up with a good teammate for your initial competitions. A good teammate
is someone with a similar bent of mind and thought process, but with
complementary skills in tools, domain or work experience.
3. Be ready to do a lot of feature engineering; that is what differentiates the best from
the rest.
4. Do some preliminary research on the domain and the problem. There might be good
research papers with unconventional but effective solutions available on the internet.
5. Make simple initial solutions and submit them to get a sense of how much of a gap you
need to cover.
6. Always be open to starting from scratch.
7. Experiment with different algorithms and be prepared to build ensembles.

This list is not exhaustive, but it covers a significant portion of what matters. Now let’s look
at a simple framework to approach a Kaggle problem. Kaggle challenges participants at each step
of this framework.

Framework to approach a Kaggle Problem


Next, we will take you through a step-by-step process for taking a simple shot at a Kaggle
problem statement. The process generally involves the following pieces:

1. Importing the training / test population: Kaggle challenges you to import the training /
test dataset. In general, this is not very straightforward. For example, in the following
problems, the training data needs to be massaged well before we start working on the model.

Here are two problem statements where you need to extract data from multiple Excel files:

a. Driver Telematics Analysis

b. BCI Challenge @ NER 2015
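
As a rough sketch of this kind of import, the snippet below stacks every file in a folder into
one list of rows. It assumes the files have been saved as CSVs that share a header row; the
function name load_csv_folder and the extra source_file column are illustrative choices of
ours, not something either competition prescribes.

```python
import csv
from pathlib import Path


def load_csv_folder(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` into one list of dicts.

    Each row gains a 'source_file' key (a hypothetical name) so that it can
    be traced back to the file it came from.
    """
    rows = []
    # Sort the paths so the combined row order is reproducible.
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name
                rows.append(row)
    return rows
```

Keeping the file name on each row matters when the file itself carries information, for
example when each file corresponds to one driver or one recording session.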

2. Sampling the population: In general, the population is huge, and training on the entire
population might not be the best idea. For example, in “Sentiment Analysis on Movie Reviews”,
building an initial dictionary from the enormous number of phrases might be a bad idea.
The sample can be chosen randomly or in a stratified way.

3. Choosing the right attributes: This is the most critical step, and it is what distinguishes
different submissions on Kaggle. In general, we use principal component analysis, factor
analysis, Information Value or Weight of Evidence for this part, but there is no set procedure.
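
To make Weight of Evidence and Information Value concrete, here is a minimal sketch for one
categorical (or already binned) attribute against a binary target. It assumes the convention
WoE = ln(share of non-events / share of events); some texts flip the ratio. The smoothing
term is our own addition to avoid dividing by zero and is not part of the definition.

```python
import math
from collections import Counter


def weight_of_evidence(values, targets, smoothing=0.5):
    """Return ({category: WoE}, information_value) for a binary target.

    WoE = ln(share of non-events / share of events) within each category.
    `smoothing` is added to every count; it is an illustrative choice.
    """
    events = Counter()
    non_events = Counter()
    for value, target in zip(values, targets):
        if target == 1:
            events[value] += 1
        else:
            non_events[value] += 1

    categories = set(events) | set(non_events)
    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    woe = {}
    information_value = 0.0
    for category in categories:
        event_share = (events[category] + smoothing) / total_events
        non_event_share = (non_events[category] + smoothing) / total_non_events
        woe[category] = math.log(non_event_share / event_share)
        # Each category contributes (difference in shares) * WoE, which is never negative.
        information_value += (non_event_share - event_share) * woe[category]
    return woe, information_value
```

An attribute whose categories all have the same event rate gets an Information Value of zero,
so ranking attributes by this number is one simple way to shortlist candidates.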

4. Comparing different ensemble / simple models: Once we have the input and the target
variables, we start building different models. The choice of model depends on the evaluation
metric, the type of input / target variables, the distribution of the population across
target values, etc.
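
The steps above can be tied together in a small sketch: hold out a stratified sample, then
score two candidate models on it. Everything here is an assumption made for illustration: rows
are (feature, label) pairs with a single numeric feature and 0/1 labels, accuracy stands in
for whatever metric the competition uses, and the two "models" are deliberately trivial.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, holdout_fraction=0.25, seed=0):
    """Split (feature, label) rows so each label keeps roughly its share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, holdout = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Rounds down, so very small groups may contribute nothing to the holdout.
        cut = int(len(group) * holdout_fraction)
        holdout.extend(group[:cut])
        train.extend(group[cut:])
    return train, holdout


def fit_majority(train):
    """Baseline: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda feature: majority


def fit_threshold(train):
    """One-feature rule: predict 1 when the feature exceeds the best cut-off."""
    candidates = sorted({feature for feature, _ in train})

    def correct_on_train(cut):
        return sum((feature > cut) == (label == 1) for feature, label in train)

    best = max(candidates, key=correct_on_train)
    return lambda feature: int(feature > best)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(feature) == label for feature, label in rows) / len(rows)
```

Calling accuracy(fit_majority(train), holdout) and accuracy(fit_threshold(train), holdout) on
the same split gives a like-for-like comparison; a model that cannot beat the majority
baseline is usually not worth submitting.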

In this article, we will start with the first step, using the BCI Challenge. We will begin with
the problem statement and then define the scope of this article. After reading this article, I
believe you can start competing on Kaggle and begin your journey to discover the new era of
Analytics & Machine Learning.
