
Datasets

(most famous)

1
MNIST – Handwritten digits
▪ The dataset was constructed from a number of scanned document datasets available from
the National Institute of Standards and Technology (NIST). This is where the name of the
dataset comes from: Modified NIST, or MNIST.

▪ Link to original database: http://yann.lecun.com/exdb/mnist/


▪ Link to best results:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
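The files on the original database page are distributed in a binary format called IDX. As a minimal sketch, assuming the image-file layout described there (a big-endian header holding the magic number 2051, the image count, the row count and the column count, followed by one unsigned byte per pixel), the header can be parsed with the Python standard library. The function name `read_idx_images` is hypothetical, and the bytes used below are synthetic, not real MNIST data.

```python
import struct


def read_idx_images(raw: bytes):
    """Parse an uncompressed IDX image file into flat per-image pixel lists."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    size = rows * cols
    pixels = raw[16:]
    if len(pixels) != count * size:
        raise ValueError("file length does not match the header")
    # Slice the pixel bytes into one flat list per image.
    images = [list(pixels[i * size:(i + 1) * size]) for i in range(count)]
    return images, rows, cols


# Tiny synthetic file: two 2x2 "images", used only to exercise the parser.
demo = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 10, 20, 1, 2, 3, 4])
demo_images, demo_rows, demo_cols = read_idx_images(demo)
print(demo_rows, demo_cols, demo_images)
```

The downloadable files are gzip-compressed, so in practice the bytes would first be read with something like `gzip.open(path, "rb").read()`.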

2
CIFAR-10 dataset
▪ CIFAR = Canadian Institute for Advanced Research

▪ The CIFAR-10 dataset consists of 60,000 photos divided into 10 classes (hence the name CIFAR-10). Classes include common
objects such as airplanes, automobiles, birds, cats and so on.

▪ The dataset is split in a standard way, where 50,000 images are used for training a model and the remaining 10,000 for
evaluating its performance.

▪ The photos are in colour, with red, green and blue channels, but are small: each measures 32x32 pixels.

▪ State-of-the-art results can be checked here:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

▪ Official website: https://www.cs.toronto.edu/~kriz/cifar.html
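To make the image size concrete: a 32x32 colour photo holds 32 × 32 × 3 = 3,072 values. The sketch below assumes the row layout described on the official website for the Python version of the dataset (1,024 red values, then 1,024 green, then 1,024 blue, each in row-major order) and reshapes one such row into a height × width × channel array with NumPy. The function name `row_to_image` is hypothetical, and the row used here is synthetic, not real CIFAR-10 data.

```python
import numpy as np


def row_to_image(row: np.ndarray) -> np.ndarray:
    """Turn one flat 3,072-value row into a (32, 32, 3) image array."""
    if row.shape != (3072,):
        raise ValueError(f"expected 3072 values, got shape {row.shape}")
    # (channel, height, width) -> (height, width, channel)
    return row.reshape(3, 32, 32).transpose(1, 2, 0)


# Synthetic row: every red value is 10, every green 20, every blue 30.
demo_row = np.concatenate([
    np.full(1024, 10, dtype=np.uint8),
    np.full(1024, 20, dtype=np.uint8),
    np.full(1024, 30, dtype=np.uint8),
])
demo_image = row_to_image(demo_row)
print(demo_image.shape)   # (32, 32, 3)
print(demo_image[0, 0])   # the red, green and blue values of the top-left pixel
```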

3
IMDB dataset
▪ The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000
highly polar movie reviews (good or bad) for training and the same number again for
testing. The problem is to determine whether a given movie review has a positive or
negative sentiment.

▪ The data was collected by Stanford researchers and used in a 2011 paper, in which the
data was split 50-50 between training and testing. An accuracy of 88.89% was achieved.

▪ Official website: http://ai.stanford.edu/~amaas/data/sentiment/
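As a minimal sketch of how the reviews could be read into memory, the code below assumes the archive from the official website has already been downloaded and extracted, and that each split directory (`train` or `test`) contains a `pos` and a `neg` folder with one review per `.txt` file. The function name `load_reviews` and the path in the usage note are hypothetical.

```python
from pathlib import Path


def load_reviews(split_dir):
    """Return (texts, labels) for one split; label 1 = positive, 0 = negative."""
    texts, labels = [], []
    for label, folder in ((1, "pos"), (0, "neg")):
        # Sort the file names so the order is reproducible across runs.
        for path in sorted(Path(split_dir, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

Usage, assuming the archive was extracted to a folder named `aclImdb`: `train_texts, train_labels = load_reviews("aclImdb/train")`.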

4
IRIS dataset
▪ The Iris dataset, a staple of the machine learning community, was introduced by statistician
Ronald Fisher in 1936.
▪ Its easy accessibility, small size, clean data, and symmetry of values have made it a popular
choice for testing classification algorithms.
▪ The Iris dataset represents 3 kinds of Iris flowers (Setosa, Versicolour and Virginica) with 4
attributes: sepal length, sepal width, petal length and petal width.
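The three classes and four attributes can be inspected directly. This is a minimal sketch that assumes scikit-learn is installed and uses its bundled copy of the dataset through `sklearn.datasets.load_iris`; note that scikit-learn spells the second class "versicolor".

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 attributes
print(iris.feature_names)  # names of the four sepal/petal measurements
print(iris.target_names)   # names of the three Iris classes
print(np.bincount(y))      # number of samples in each class
```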

5
Diabetes
▪ The Diabetes dataset is a regression dataset of 442 diabetes patients. The predictor
columns are age, sex, BMI (body mass index), BP (blood pressure), and six blood serum
measurements. The target column is the progression of the disease one year after baseline.
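Because the target is a continuous value, this is a regression task rather than a classification task. The sketch below assumes the slide refers to the copy bundled with scikit-learn (`sklearn.datasets.load_diabetes`) and fits an ordinary linear regression as a baseline; the 80/20 split and `random_state=0` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(X.shape)                 # (442, 10): 442 patients, 10 predictor columns
print(diabetes.feature_names)  # column names of the predictors

# Hold out 20% of the patients to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"R^2 on held-out patients: {r2:.3f}")
```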

6
Breast Cancer dataset
▪ The Wisconsin Breast Cancer dataset, used to predict whether a breast tumour is malignant
or benign, has 569 rows (samples) and 30 feature columns.

▪ Some extra info can be obtained here:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html?highlight=load_breast_cancer
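The row and column counts can be checked with the loader documented at the link above. This is a minimal sketch that assumes scikit-learn is installed; it also prints how many samples fall into each of the two classes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(X.shape)              # (569, 30): 569 samples, 30 feature columns
print(cancer.target_names)  # class names, in label order 0 then 1
print(np.bincount(y))       # number of samples with label 0 and with label 1
```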

7
100+ datasets to train your AI model

▪ See the link below:
https://www.kdnuggets.com/2021/05/awesome-list-datasets.html
