
Datasets for Machine Learning


scikit-learn
Machine Learning in Python
● Simple and efficient tools for data mining and
data analysis
● Accessible to everybody, and reusable in
various contexts
● Built on NumPy, SciPy, and matplotlib
● Open source, commercially usable - BSD
license

Scikit-learn Dataset loading utilities
http://scikit-learn.org/stable/user_guide.html

5.1. General dataset API
5.2. Toy datasets
5.3. Sample images
5.4. Sample generators
5.5. Datasets in svmlight / libsvm format
5.6. Loading from external datasets
5.7. The Olivetti faces dataset
5.8. The 20 newsgroups text dataset
5.9. Downloading datasets from the mldata.org repository
5.10. The Labeled Faces in the Wild face recognition dataset
5.11. Forest covertypes
5.12. RCV1 dataset
5.13. Boston House Prices dataset
5.14. Breast Cancer Wisconsin (Diagnostic) Database
5.15. Diabetes dataset
5.16. Optical Recognition of Handwritten Digits Data Set
5.17. Iris Plants Database
5.18. Linnerrud dataset
Toy datasets (from SciKit-Learn)
scikit-learn comes with a few small standard datasets that do not require
downloading any files from an external website.

load_boston([return_X_y])
Load and return the boston house-prices dataset (regression).
load_iris([return_X_y])
Load and return the iris dataset (classification).
load_diabetes([return_X_y])
Load and return the diabetes dataset (regression).
load_digits([n_class, return_X_y])
Load and return the digits dataset (classification).
load_linnerud([return_X_y])
Load and return the linnerud dataset (multivariate regression).
Loading Data from SK-learn
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
Sample images
scikit-learn also embeds a couple of sample JPEG images.
Those images can be useful for testing algorithms and
pipelines on 2D data.
from sklearn.datasets import load_sample_image
load_sample_images()
Load sample images for image manipulation.
load_sample_image(image_name)
Load the numpy array of a single sample image
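For example (a small sketch; "china.jpg" is one of the two images bundled with scikit-learn, and the printed values are indicative):

from sklearn.datasets import load_sample_image

china = load_sample_image("china.jpg")  # returns the image as a NumPy array
print(china.shape)   # e.g. (427, 640, 3): height, width, RGB channels
print(china.dtype)   # uint8 pixel values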
Hand-written Digits Dataset
Number of Instances: 5620
Number of Attributes: 64
Attribute Information:
8x8 images of integer pixels
in the range 0..16 (i.e., only 17 gray levels)
Hand-written Digits Dataset
Loading Input data and target labels
digits.data gives access to the features that can be
used to classify the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 10. 0. 0.]
[ 0. 0. 0. ..., 16. 9. 0.]
...,
[ 0. 0. 1. ..., 6. 0. 0.]
[ 0. 0. 2. ..., 12. 0. 0.]
[ 0. 0. 10. ..., 12. 1. 0.]]
Loading Input data and target labels

digits.target gives the ground truth for the digits
dataset, that is, the number corresponding to each
digit image that we are trying to learn:

>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is a 2D array, shape (n_samples, n_features).
In the case of the digits, each original sample is an
image of shape (8, 8) and can be accessed using:

>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
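In other words, digits.data is just the flattened version of digits.images. A small sketch to verify this relationship (illustrative, not part of the original slides):

from sklearn.datasets import load_digits
import numpy as np

digits = load_digits()
print(digits.images.shape)                  # (n_samples, 8, 8)
print(digits.data.shape)                    # (n_samples, 64)
flat = digits.images.reshape((len(digits.images), -1))
print(np.array_equal(flat, digits.data))    # True: data is the flattened images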
Iris Plants Database
Data Set Characteristics:
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
Iris-Setosa
Iris-Versicolour
Iris-Virginica
Class Distribution: 33.3% for each of 3 classes.
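These characteristics can be checked directly on the bundled copy, e.g. (a short sketch; the comments show the expected output):

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
print(iris.data.shape)           # (150, 4): 150 samples, 4 numeric attributes
print(np.bincount(iris.target))  # [50 50 50]: 50 samples in each of the 3 classes
print(iris.target_names)         # the three iris class names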
Iris Plants Database
sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), class
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
Important Notes
Before Starting to Program
Variables initialization with
Random Values
tf.random_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32,
seed=None, name=None)
Outputs random values from a normal distribution.

tf.truncated_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32,
seed=None, name=None)
The generated values follow a normal distribution with the specified
mean and standard deviation, except that values whose magnitude is
more than two standard deviations from the mean are dropped
and re-picked.
Variables initialization with
Random Values
tf.random_uniform(shape, minval=0, maxval=None,
dtype=tf.float32, seed=None, name=None)

dtype: The type of the output: float32, float64, int32, or int64.
For floats, the default range is [0, 1).
For ints, at least maxval must be specified explicitly.

The generated values follow a uniform distribution in the range
[minval, maxval). The lower bound minval is included in the
range, while the upper bound maxval is excluded.
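As a quick illustration (a minimal TensorFlow 1.x sketch; the shapes and ranges are arbitrary examples):

import tensorflow as tf

# Normal distribution with mean 0.0 and stddev 1.0
n = tf.random_normal([2, 3], mean=0.0, stddev=1.0, seed=1)
# Truncated normal: values more than 2 stddev from the mean are dropped and re-picked
t = tf.truncated_normal([2, 3], mean=0.0, stddev=1.0, seed=1)
# Uniform distribution over [0, 10)
u = tf.random_uniform([2, 3], minval=0, maxval=10, seed=1)

with tf.Session() as sess:
    print(sess.run(n))
    print(sess.run(t))
    print(sess.run(u))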
Generation of same repeatable
sequence
To generate the same repeatable sequence for an op across sessions, set the
seed for the op:

a = tf.random_uniform([1], seed=1)
b = tf.random_normal([1])
# Repeatedly running this block with the same graph will generate the same
# sequence of values for 'a', but different sequences of values for 'b'.
print("Session 1")
with tf.Session() as sess1: Output
print(sess1.run(a)) # generates 'A1' Session 1
print(sess1.run(a)) # generates 'A2'
[ 0.23903739]
print(sess1.run(b)) # generates 'B1'
[ 0.22267115]
[-0.56301004]
print(sess1.run(b)) # generates 'B2'
[-0.97901398]
print("Session 2")
with tf.Session() as sess2: Session 2
print(sess2.run(a)) # generates 'A1' [ 0.23903739]
print(sess2.run(a)) # generates 'A2' [ 0.22267115]
print(sess2.run(b)) # generates 'B3' [ 1.26448703]
print(sess2.run(b)) # generates 'B4' [-0.76988888]
Generation of same repeatable
sequence
To make the random sequences generated by all ops be repeatable across sessions, set a graph-level seed:

tf.set_random_seed(1234)
a = tf.random_uniform([1])
b = tf.random_normal([1])
# Repeatedly running this block with the same graph will generate the same
# sequences of values for 'a' and 'b'.
print("Session 1")
with tf.Session() as sess1:
    print(sess1.run(a))  # generates 'A1'
    print(sess1.run(a))  # generates 'A2'
    print(sess1.run(b))  # generates 'B1'
    print(sess1.run(b))  # generates 'B2'

print("Session 2")
with tf.Session() as sess2:
    print(sess2.run(a))  # generates 'A1'
    print(sess2.run(a))  # generates 'A2'
    print(sess2.run(b))  # generates 'B1'
    print(sess2.run(b))  # generates 'B2'

Output:
Session 1
[ 0.93559742]
[ 0.87699151]
[ 2.46717691]
[ 1.58331776]
Session 2
[ 0.93559742]
[ 0.87699151]
[ 2.46717691]
[ 1.58331776]
Randomly shuffles a tensor along its
first dimension.
tf.random_shuffle(value, seed=None, name=None)

>>> import numpy as np
>>> import tensorflow as tf
>>> sess = tf.Session()
>>> c = tf.constant([[1, 2], [3, 4], [5, 6]])
>>> shuff = tf.random_shuffle(c)
>>> sess.run(shuff)
>>> sess.run(shuff)
>>> sess.run(shuff)

Output:
Shuffle 1
array([[1, 2],
       [5, 6],
       [3, 4]], dtype=int32)
Shuffle 2
array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int32)
Shuffle 3
array([[5, 6],
       [3, 4],
       [1, 2]], dtype=int32)
Split Data into
Training and Testing
train_test_split(*arrays, **options)

Arrays
lists, NumPy arrays, and SciPy sparse matrices are allowed

Options
test_size : float, int, or None (default is None)
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test/train split.
If int, represents the absolute number of test/train samples.
If None, the value is automatically set to the complement of the other size.
If both train_size and test_size are None, test_size is set to 0.25.

random_state : int (the seed of the random number generator).
If None, the split will differ on each run.
Split Data into
Training and Testing
from sklearn.model_selection import train_test_split
import numpy as np
X, y = np.arange(24).reshape((8, 3)), range(8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Input
>>> X
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]])
>>> y
[0, 1, 2, 3, 4, 5, 6, 7]

Output
Training Data
>>> X_train
array([[21, 22, 23],
       [ 6,  7,  8],
       [12, 13, 14],
       [ 9, 10, 11],
       [18, 19, 20]])
>>> y_train
[7, 2, 4, 3, 6]

Test Data
>>> X_test
array([[ 3,  4,  5],
       [15, 16, 17],
       [ 0,  1,  2]])
>>> y_test
[1, 5, 0]

Note: if repeated, you will get the same X_train and X_test samples in the same
order (as the random state is fixed). To get a different random split on every
run, do not set random_state.
Conversion from 1-D Target vector to
ONE-HOT vector
Traditional target vector (labels), assuming three classes 0, 1, 2:

Target = [0, 0, 0, 1, 0, 2, 2]

ONE-HOT target =
[[1 0 0]
 [1 0 0]
 [1 0 0]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 0 1]]
Conversion from 1-D Target vector to
ONE-HOT vector
To convert from target to one-hot:

p = np.eye(3)
one_hot = p[target]

Input (Target): [0, 0, 0, 1, 0, 2, 2]

Output (one_hot):
[[1 0 0]
 [1 0 0]
 [1 0 0]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 0 1]]

Note:
p[0] = [1 0 0]
p[1] = [0 1 0]
p[2] = [0 0 1]
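As a runnable illustration of this trick (a minimal NumPy sketch; the variable names are illustrative):

import numpy as np

target = np.array([0, 0, 0, 1, 0, 2, 2])  # traditional labels
num_labels = len(np.unique(target))       # 3 classes
p = np.eye(num_labels)                    # p[k] is the one-hot row for class k
one_hot = p[target]                       # fancy indexing picks one row per label
print(one_hot)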
Build Neural Network
Steps to Build a Neural Network (Example)

1) Get the data set.
2) Read the data set as inputs and targets.
3) Shape the inputs as an m x n array, where m is the number of samples and n is the
length of the feature vector.
4) Shape the targets as one-HOT vectors. Each target is Q x 1, where Q is the
number of classes; all values are "zero" except the value corresponding
to the proper class, which equals "one".
5) Select RANDOMLY a percentage of the data set samples to be training
data and the rest as test data.
6) Define the architecture of the neural network.
7) Initialize random weights for the NN (Uniform / Normal: define min & max for
Uniform, mean & SD for Normal) and define the randomness seed
(operation-level or graph-level).
Neural Network
Define Import Libraries
import tensorflow as tf

import numpy as np

from sklearn import datasets

from sklearn.model_selection import train_test_split


Weight Initialization Function
def init_weights(shape):
    """ Weight initialization """
    weights = tf.random_normal(shape, stddev=0.1)
    return tf.Variable(weights)

shape defines the dimensions of the weight matrix.

This function returns a TensorFlow variable
initialized from a normal distribution.

tf.set_random_seed(1234)
Remember to call this function (outside the function definition) to make the initialization repeatable.
Read Data from Dataset
def get_iris_data():
    """ Read the iris data set and split it into training and test sets """
    iris = datasets.load_iris()
    data = iris["data"]
    target = iris["target"]

Remember:
● data and target still need to be split into training
and testing samples
● target is NOT one-HOT (it requires conversion)
Add Bias to Input Data
Convert target to one-hot
    # Prepend the column of 1s for bias
    N, M = data.shape
    all_X = np.ones((N, M + 1))
    all_X[:, 1:] = data

    # Convert the targets into one-hot vectors
    num_labels = len(np.unique(target))
    all_Y = np.eye(num_labels)[target]
    return train_test_split(all_X, all_Y, test_size=0.33, random_state=1234)

Notes:
● Use a fixed random_state in order to get the same
training and testing samples on each run of the program.
● 2/3 of the data is used for training and 1/3 for testing.
● all_X and all_Y are returned as NumPy arrays (not
TensorFlow variables), so placeholders will be required to feed them in.
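The later slides use train_X, test_X, train_y, and test_y without showing the call; a one-line sketch of how they would be obtained (the unpacking order follows train_test_split):

train_X, test_X, train_y, test_y = get_iris_data()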


Feed forward
to Calculate Outputs
def forwardprop(X, w_1, w_2):
    h = tf.nn.sigmoid(tf.matmul(X, w_1))  # The \sigma function
    yhat = tf.matmul(h, w_2)
    return yhat

Notes:
● yhat is not passed through softmax here, since TensorFlow's
softmax_cross_entropy_with_logits() does that internally.
● "h" is the output of the hidden layer (calculated as X · W_1, then
squashed using the sigmoid activation function).
● "yhat" is the output of the output layer (calculated as h · W_2).
NN Architecture
x_size = train_X.shape[1] # Number of input nodes: 4 features and 1 bias
h_size = 256 # Number of hidden nodes
y_size = train_y.shape[1] # Number of outcomes (3 iris flowers)

# Symbols
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])

Notes:
● As train_X, train_y, test_X, and test_y are NumPy variables, placeholders are
needed to receive their values and pass them to TensorFlow operations.
● W_1 should have size x_size * h_size.
● W_2 should have size h_size * y_size.
(A short sketch wiring these together follows below.)
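Putting the notes above together, a minimal sketch (reusing init_weights and forwardprop from the earlier slides) for creating the weights and the forward pass:

# Weight initializations (sizes as in the notes above)
w_1 = init_weights((x_size, h_size))  # input layer  -> hidden layer
w_2 = init_weights((h_size, y_size))  # hidden layer -> output layer

# Forward propagation: per-class scores for each input row
yhat = forwardprop(X, w_1, w_2)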
Backward Propagation
(Training)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=yhat, labels=y))
updates = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

Notes:
● The cost is calculated by performing "Softmax" on "yhat" before calculating the
LOSS, so there was no need to define "Softmax" in the output layer.
● There are different ways to calculate the loss (other than cross-entropy).
● There are different ways to minimize the loss other than Gradient Descent.
● Refer to the TensorFlow documentation for other methods.
Calculating Output Accuracy
predict = tf.argmax(yhat, dimension=1)

train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                         sess.run(predict, feed_dict={X: train_X, y: train_y}))
test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                        sess.run(predict, feed_dict={X: test_X, y: test_y}))

Remember:
● "yhat" holds one score per class, and the train_y / test_y labels are one-HOT
vectors; argmax converts both back to traditional class labels before comparing.
● "predict" converts "yhat" into traditional labels (the index of the largest score).
● "predict" requires "yhat", which requires train_X and train_y (or test_X and
test_y). None of these are TensorFlow variables, so it is required to use feed_dict
with the placeholders.
Epochs

Remember:
● Each epoch requires UPDATING the weights once per training sample, then
calculating the accuracy ONCE (see the training-loop sketch below).
● For better results the system should iterate a suitable number of epochs: not too
few (bad training), not too many (memorizing instead of learning).
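A minimal sketch of such a training loop, assuming the placeholders, updates, predict, and data splits defined on the earlier slides (the epoch count and per-sample update scheme are illustrative, not prescribed by the slides):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):
        # Update the weights once per training sample
        for i in range(len(train_X)):
            sess.run(updates, feed_dict={X: train_X[i: i + 1], y: train_y[i: i + 1]})
        # Calculate the accuracy ONCE per epoch
        train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                 sess.run(predict, feed_dict={X: train_X, y: train_y}))
        test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                sess.run(predict, feed_dict={X: test_X, y: test_y}))
        print("Epoch = %d, train accuracy = %.2f%%, test accuracy = %.2f%%"
              % (epoch + 1, 100. * train_accuracy, 100. * test_accuracy))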
