This is one of the most useful resources on practical AI you can find today! Have you ever wanted to play around with the amazing Kaggle datasets? If this interests you, buckle in!
GETTING STARTED!
If you are new to Kaggle, you can create your account with:
- Google
- Facebook
- Kaggle
YOU'RE ALREADY LOGGED IN TO KAGGLE
Ok, now that you have logged into Kaggle, to start playing around you can go to the competitions. Have a look at the most recent ones and the prizes they offer.
HUMPBACK WHALE IDENTIFICATION
CHALLENGE
In this tutorial we're going to be looking at the recent (late 2018) Humpback Whale Identification Challenge. MNIST is also a good place to start; have a look at how simple it is.
CREATING YOUR FIRST KERNEL
Now that we are on the competition page, you can see a blue button called "New Kernel". Just press it.
USING JUPYTER NOTEBOOKS
If you have never heard of Jupyter Notebooks (do you live in a cave? haha), they are an amazing resource for sharing replicable code; they were even used in the gravitational-wave research! Pretty amazing, no? We are going to be using them in our Kernels at Kaggle.
GAME ON!
Now the game is on! We can build amazing Kernels and share them with the community!
COMMITTING YOUR CHANGES
If you aren't familiar with the term commit, have a look here (this channel also has amazing resources in the data science field!). You can commit your changes; don't worry, this won't make your Kernel public yet, so you can go nuts!
YOUR PROFILE HAS YOUR
KERNELS!
You've finished all your code and tests and now you are ready to move on! Make the final commit and go to your profile. There lie your precious Kernels!
ALMOST THERE!
You can open your Kernel by clicking on the title.
GO PUBLIC
To make your now-private Kernel public, you need to click on the "Access" button.
BE PROUD
Change the privacy option from Private to Public and that's it! But wait, my kernel isn't showing up? I know, I know, I've been where you are. If you went to the public kernels and didn't find your own, don't panic; the Kaggle website takes some time to update the kernel list. Now be PROUD: you've made your very first public Kernel!
10 MAJOR STEPS IN YOUR KERNEL
In this video I'll show you the 10 major steps in creating your very first simple model for this whale competition! If you enjoy the content, consider subscribing and turning on notifications; I upload videos every week on data science topics!
If you want more of this content, or any other content, let me know in the comments below. We post weekly on topics such as data science; if you don't want to miss out, just subscribe to our newsletter to receive weekly news in your email!
KAGGLE IMAGE COMPETITION,
HOW TO DEAL WITH LARGE
DATASETS
When I have to deal with huge image datasets, this is what I do. Working with image datasets in Kaggle competitions can be quite problematic: your computer could just freeze and stop caring about you. To keep these things from happening, I'm going to share with you here the 5 major steps for working with image datasets.
THE 5 MAJOR STEPS
I'm posting videos every week, and if you don't want to miss out, subscribe to the channel!
KAGGLE TUTORIAL :
COMPETITIONS – PART I
This Kaggle competition is a great way to get your
hands on real data science and data analysis
problems.
HUMPBACK WHALE IDENTIFICATION
One of the major problems when learning data science is getting your hands on real problems. If you want to become a real data scientist, Kaggle is one of the best places to practice.
ABOUT THIS TUTORIAL
Here I'm going to be doing this Kaggle tutorial on how to get started in one of the current competitions on the website. If you want to follow along, just go to the competitions and scroll down to the Humpback Whale Identification Challenge.
I've been playing around with this challenge for about a month now. You can check out the prizes for this competition; they go up to 10k dollars.
BREAKING DOWN
This Kaggle competition is a great way to get your hands on real data science and data analysis, and I'm going to be breaking down this competition from the very start.
We are going to go from zero to creating a model and making our submissions.
In this first video I'll be showing you the kernel I've made so that you can follow along with the videos. For those who aren't familiar with Kaggle, these kernels are like Jupyter notebooks that you can run in the cloud.
You can check out the specifications of the machine running your scripts here, and you can also check out the commits made to the kernel. The specifications are quite reasonable for running your first models.
LET'S GET CODING
Ok, now that I've given you an introduction to the kernels at Kaggle, we can move on to the coding part. To make our model we'll use PyTorch. I was quite surprised when I asked whether you wanted more videos on Keras or PyTorch and you chose PyTorch, but this is great; I've enjoyed PyTorch much more than Keras and TensorFlow so far.
Apart from PyTorch, which you can see being imported here, we'll use the os library to work with the files, and we're also going to be using pandas; we can't miss that in a data science project.
For the matrix and vector calculations we'll be using our old friend NumPy. To understand our dataset a bit better and visualize it, we'll use matplotlib.
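As a rough sketch of this step: in the kernel itself you'd have `import torch`, `import pandas as pd`, `import numpy as np`, and `import matplotlib.pyplot as plt` alongside `os`. Only the `os` part, "working with the files", is shown running below, against a throwaway folder with made-up file names:

```python
import os
import tempfile

# Throwaway folder standing in for Kaggle's input directory;
# the file names below are invented for illustration.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
os.makedirs(train_dir)

# Fake a few files so we have something to list.
for name in ("whale_001.jpg", "whale_002.jpg", "notes.txt"):
    open(os.path.join(train_dir, name), "w").close()

# os.listdir plus a filter is a common way to collect the image names.
images = sorted(f for f in os.listdir(train_dir) if f.endswith(".jpg"))
print(images)  # ['whale_001.jpg', 'whale_002.jpg']
```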
KAGGLE TUTORIAL :
COMPETITIONS – PART II
HUMPBACK WHALE IDENTIFICATION
We are going to take the first steps in the Kaggle competition today! YEAH! To participate in Kaggle, one of the major choices you have to make today is which deep learning framework to use, because, well, there are lots of frameworks out there.
PYTORCH
I've asked around and you've chosen PyTorch, and this is great, because I'm loving PyTorch so far. If you haven't seen the first video, it's fine, I know your time is precious; I'll just lay out a quick review for you. I introduced the Kaggle website, the competition, the prizes, and what this series of videos is all about. If after finishing this video you still want to watch the first one, great, I'll see you there.
THE FUN PART
Now for the fun part. We already went through the libraries here; the next step is to create a class for our dataset.
But why do we need a class for our dataset? I understand you; the first time I tried to play around with PyTorch, I got a little frustrated that there wasn't a simple way to load the dataset.
I'm not talking about MNIST- and CIFAR10-like datasets here; there are simple ways to load those into memory.
I'm talking about a custom dataset, just like the ones you'll encounter if you get the chance to work as a data scientist. But I'm glad I got around to creating the dataset class, because it gets pretty handy for dealing with more complex situations.
And once you've created it the first time for a dataset, you pretty much copy and paste the class and make the adjustments for your specific dataset.
I myself followed the tutorial in the PyTorch documentation; if you want to have a look, it's a great reading addition to this tutorial. Let me know in the comments if you found the reference useful, so I make more of these in the videos.
THE CLASS
The first thing we create here is the __init__ method. If you want to share your code, it's a good idea to create a docstring in your functions. I've explained here the parameters of this function: we need to pass the path of the CSV file containing the data and the root directory of our project; then we can pass a transform (we'll come back to this later), and we can also pass whether this is the testing dataset.
You can see here that if we have a test dataset, I pass the dataset to the class; you could also change this to receive the CSV filename of the test dataset and read it with pandas inside here.
If we are not passing the test dataset, we call the one-hot encoding function. Here we read the training dataset with pandas; you can use df.head() to check out the dataset. We have the names of the images and the classes.
Now that we have created our dataframe, we can also create a variable for our labels. To transform our labels into one-hot encoded vectors, we can use sklearn. We can see here that it transformed the classes into a one-dimensional vector.
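The kernel uses sklearn for this step. As a dependency-free sketch of what the encoding does (the whale Ids below are invented): the one-dimensional vector mentioned above corresponds to the integer codes that sklearn's LabelEncoder produces, while the one-hot form expands each code into a row:

```python
# Invented whale Ids, like the Id column of train.csv.
labels = ["new_whale", "w_123", "new_whale", "w_456"]

# Step 1: map each distinct class to an integer code; this 1-D
# vector of codes is what sklearn's LabelEncoder gives you.
classes = sorted(set(labels))
codes = [classes.index(lab) for lab in labels]
print(codes)  # [0, 1, 0, 2]

# Step 2: expand each code into a one-hot row, as sklearn's
# OneHotEncoder (or pandas' get_dummies) would.
one_hot = [[1 if i == c else 0 for i in range(len(classes))] for c in codes]
print(one_hot)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```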
Continuing here, we just store the root directory and the transform; we'll get back to this transform later. Now we have two more methods, __len__ and __getitem__. The __len__ method will only return the length of our dataset; __getitem__ is more interesting.
This function is the one you need to implement to get one record from your dataset. We get the img_name by joining the root directory of our project and the name of the image; we use pandas' iloc function here.
We can use iloc to get a record from our dataset: if we just put the index 0 here, it'll return the first record, but we want the image name, so we add another argument to tell the function we want the first column.
After this we get the label associated with that image, load the image into memory, and return everything as a dict.
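PyTorch's Dataset protocol only requires `__len__` and `__getitem__`, so the class described above can be sketched without torch or pandas installed. The class name, CSV columns, and placeholder image loading below are assumptions; in the real kernel you would subclass `torch.utils.data.Dataset`, read the CSV with pandas, and open the image with PIL:

```python
import csv
import io

# Stand-in for train.csv: image filename + whale Id columns,
# with invented values for illustration.
CSV_TEXT = "Image,Id\nwhale_001.jpg,new_whale\nwhale_002.jpg,w_123\n"

class WhaleDataset:
    """Torch-free sketch of the tutorial's dataset class.

    Parameters:
        csv_file: file-like object with Image,Id columns
        root_dir: directory that holds the image files
        transform: optional callable applied to each sample
    """

    def __init__(self, csv_file, root_dir, transform=None):
        self.records = list(csv.DictReader(csv_file))
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        # Length of the dataset = number of CSV rows.
        return len(self.records)

    def __getitem__(self, idx):
        row = self.records[idx]
        img_name = self.root_dir + "/" + row["Image"]
        # In the real kernel you'd load the image here (e.g. PIL.Image.open)
        # instead of just passing the path along.
        sample = {"image": img_name, "label": row["Id"]}
        if self.transform:
            sample = self.transform(sample)
        return sample

ds = WhaleDataset(io.StringIO(CSV_TEXT), root_dir="train")
print(len(ds))          # 2
print(ds[0]["label"])   # new_whale
```

Because only the two dunder methods matter, the same object would drop straight into a `torch.utils.data.DataLoader` once the placeholder loading is replaced with real image reads.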
INSTANTIATE OUR CLASS
We can instantiate our dataset now. You can call the dataset and pass an index; this is the index used in the __getitem__ function we just saw.
We have the image and the label, and we can use matplotlib to plot the image if we want to check that it's ok. In the next tutorials we'll build on our dataset class, doing some basic preprocessing so we can create our convolutional neural net with PyTorch.