
MACHINE LEARNING LAB

ETCS-454

Faculty Name: Dr. Pooja Gupta Name: Harshit Garg


Roll No: 04414802716
Semester: 8th
Group: 8-C-3

Maharaja Agrasen Institute of Technology, PSP Area,
Sector – 22, Rohini, New Delhi – 110085
MACHINE LEARNING LAB

PRACTICAL RECORD

PAPER CODE : ETCS-454

Name of the student : Harshit Garg

University Roll No. : 04114802716

Branch : CSE

Section/ Group : 8-C-3

PRACTICAL DETAILS

a) Experiments according to the list provided by GGSIPU

Columns: Exp. no, Experiment Name, Date of performance, Date of checking, Remarks, Marks (10)

1. Introduction to Machine Learning Lab with Python (3.x)
2. Understanding of Machine learning algorithms.
3. Understand clustering approaches and implement K means Algorithm using Sci-Kit Learn
4. Study of datasets and understanding attributes evaluation in regard to problem description.
5. Working of Major Classifiers
   a) Naïve Bayes
   b) Decision Tree
   c) CART
   d) ARIMA
   e) Linear and logistic regression
   f) Support vector machine
   g) KNN
6. Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold cross-validation.
7. Introduction to R. Be aware of the basics of machine learning methods in R.
8. Develop a machine learning method using Neural Networks in Python to predict stock prices based on past price variation.
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

VISION

To nurture young minds in a learning environment of high academic value and imbibe spiritual and
ethical values with technological and management competence.

MISSION

The Institute shall endeavour to incorporate the following basic missions in the teaching methodology:

❖ Engineering Hardware – Software Symbiosis: Practical exercises in all Engineering and
Management disciplines shall be carried out by Hardware equipment as well as the related
software enabling a deeper understanding of basic concepts and encouraging inquisitive nature.

❖ Life-Long Learning: The Institute strives to match technological advancements and encourage
students to keep updating their knowledge for enhancing their skills and inculcating their habit
of continuous learning.

❖ Liberalization and Globalization: The Institute endeavors to enhance technical and
management skills of students so that they are intellectually capable and competent
professionals with Industrial Aptitude to face the challenges of globalization.

❖ Diversification: The Engineering, Technology and Management disciplines have diverse
fields of studies with different attributes. The aim is to create a synergy of the above attributes
by encouraging analytical thinking.

❖ Digitization of Learning Processes: The Institute provides seamless opportunities for innovative
learning in all Engineering and Management disciplines through digitization of learning processes
using analysis, synthesis, simulation, graphics, tutorials and related tools to create a platform for
multi-disciplinary approach.

❖ Entrepreneurship: The Institute strives to develop potential Engineers and Managers by enhancing their
skills and research capabilities so that they emerge as successful entrepreneurs and responsible citizens.
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

COMPUTER SCIENCE & ENGINEERING DEPARTMENT

VISION

To Produce “Critical thinkers of Innovative Technology”

MISSION

To provide an excellent learning environment across the computer science
discipline to inculcate professional behaviour, strong ethical values, innovative
research capabilities and leadership abilities which enable them to become
successful entrepreneurs in this globalized world.

❖ To nurture an excellent learning environment that helps students to enhance
their problem solving skills and to prepare students to be lifelong learners by
offering a solid theoretical foundation with applied computing experiences
and educating them about their professional and ethical responsibilities.

❖ To establish Industry-Institute Interaction, making students ready for the
industrial environment and be successful in their professional lives.

❖ To promote research activities in the emerging areas of technology
convergence.

❖ To build engineers who can look into technical aspects of an engineering
solution thereby setting a ground for producing successful entrepreneurs.
EXPERIMENT 1

Aim : Introduction to Machine Learning concepts.

An Introduction to Python
Python is a popular object-oriented programming language with the
capabilities of a high-level programming language. Its easy-to-learn
syntax and portability make it popular these days. The following
facts give an introduction to Python −
● Python was developed by Guido van Rossum at Stichting Mathematisch Centrum in the
Netherlands.
● It was written as the successor to a programming language named ‘ABC’.
● Its first version was released in 1991.
● The name Python was picked by Guido van Rossum from the TV show Monty
Python’s Flying Circus.
● It is an open source programming language, which means that we can freely download it
and use it to develop programs. It can be downloaded from www.python.org.
● Python combines features of both C and Java: it offers concise, C-like code and, like
Java, classes and objects for object-oriented programming.
● It is an interpreted language, which means the source code of a Python program is
first converted into bytecode and then executed by the Python virtual machine.
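
The last point (source code is compiled to bytecode, which the virtual machine then executes) can be observed from inside Python itself. The following is a minimal sketch using the standard-library `compile`, `exec` and `dis` tools; the one-line program and the names `source`, `code_obj` and `namespace` are made up for illustration, and the exact bytecode printed differs between Python versions.

```python
import dis

# A tiny, made-up program held as plain source text.
source = "total = 2 + 3"

# Step 1: the source text is compiled into a code object holding bytecode.
code_obj = compile(source, "<example>", "exec")

# Step 2: the Python virtual machine executes that bytecode.
namespace = {}
exec(code_obj, namespace)
print(namespace["total"])

# Show the bytecode instructions (output varies by Python version).
dis.dis(code_obj)
```

The `.pyc` files that Python writes into `__pycache__` folders are cached copies of this same bytecode.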

Strengths and Weaknesses of Python


Every programming language has some strengths as well as weaknesses, and so does Python.

Strengths
According to studies and surveys, Python is the fifth most
important language as well as the most popular language for
machine learning and data science. This is because of the
following strengths −
● Easy to learn and understand − The syntax of Python is simple;
hence it is relatively easy, even for beginners, to learn
and understand the language.
● Multi-purpose language − Python is a multi-purpose
programming language because it supports structured
programming, object-oriented programming as well as
functional programming.
● Huge number of modules − Python has a huge number of modules
covering almost every aspect of programming. These modules are
easily available for use, which makes Python an extensible
language.
● Support of open source community − Being an open source
programming language, Python is supported by a very large
developer community. Because of this, bugs are fixed quickly by
the Python community. This characteristic makes Python very
robust and adaptive.
● Scalability − Python is a scalable programming language
because it provides a better structure for supporting large
programs than shell scripts do.

Weakness
● Although Python is a popular and powerful programming language, it has the
weakness of slow execution speed.
● The execution speed of Python is slow compared to compiled languages because
Python is an interpreted language. This is a major area of improvement for the
Python community.

Installing Python
For working in Python, we must first install it. You can
install Python in either of the following two ways −

● Installing Python individually
● Using Pre-packaged Python distribution: Anaconda

Let us discuss each of these in detail.

Installing Python Individually

If you want to install Python on your computer, then you need to download only the binary
code applicable for your platform. Python distributions are available for Windows, Linux and Mac
platforms.
The following is a quick overview of installing Python on the above-mentioned
platforms −
On Unix and Linux platform
With the help of the following steps, we can install Python on Unix and
Linux platforms −
● First, go to www.python.org/downloads/.
● Next, click on the link to download the zipped source code available for Unix/Linux.
● Download and extract the files.
● Optionally, edit the Modules/Setup file to customize some options.
● Finally, run the following commands in order:
○ ./configure
○ make
○ make install

Using Pre-packaged Python Distribution: Anaconda


Anaconda is a packaged distribution of Python that bundles the
libraries widely used in data science. We can follow these
steps to set up a Python environment using Anaconda −
● Step 1 − First, download the required installation
package from the Anaconda distribution page,
www.anaconda.com/distribution/. You can choose from Windows, Mac and Linux as
per your requirement.
● Step 2 − Next, select the Python version you want to install on
your machine. At the time of writing, the latest Python version is 3.7.
Both 64-bit and 32-bit graphical installers are offered.
● Step 3 − After you select the OS and Python version, the
Anaconda installer is downloaded to your computer. Double
click the file and the installer will install the Anaconda
package.
● Step 4 − To check whether it is installed, open a
command prompt and type python. The interpreter should start and print its version.
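
Once the interpreter starts, the installation can also be checked from Python itself. The following is a small sketch, not part of the Anaconda instructions: the helper name `check_packages` is hypothetical, and the package list is simply the four libraries this record mentions (note that scikit-learn is imported under the name `sklearn`).

```python
import importlib.util
import sys


def check_packages(names):
    """Return {package_name: True/False} depending on whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the interpreter version and which data science libraries are present.
print(sys.version)
for name, found in check_packages(["numpy", "scipy", "pandas", "sklearn"]).items():
    print(name, "installed" if found else "missing")
```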
Why Python for Machine Learning ?
Python is the fifth most important language as well as the most
popular language for machine learning and data science. The
following features make Python the
preferred choice of language for data science −
● Extensive set of packages
Python has an extensive and powerful set of packages which are ready to be used in
various domains. It also has packages like numpy, scipy, pandas, scikit-learn etc. which
are required for machine learning and data science.

● Easy prototyping
Another important feature of Python that makes it the language of choice for data science
is easy and fast prototyping. This is useful when developing a new algorithm.

● Collaboration feature
Data science needs good collaboration, and Python provides many
useful tools that make this easy.
● One language for many domains
A typical data science project includes various domains like data extraction, data
manipulation, data analysis, feature extraction, modelling, evaluation, deployment and
updating the solution. As Python is a multi-purpose language, it allows the data scientist
to address all these domains from a common platform.

Components of Python ML Ecosystem


In this section, let us discuss some core Data Science libraries that
form the components of the Python Machine learning ecosystem. These
useful components make Python an important language for Data
Science. Though there are many such components, let us discuss some
of the important components of the Python ecosystem here −
● Jupyter Notebook − Jupyter notebooks provide an
interactive computational environment for developing Python
based Data Science applications.

EXPERIMENT 2
Aim : Understanding of Machine learning algorithms.

Algorithms Grouped by Learning Methodology

This taxonomy or way of organizing machine learning algorithms is useful because it forces us
to think about the roles of the input data and the model preparation process and select one that is
the most appropriate for the problem in order to get the best result.

1. Supervised Learning

Input data is called training data and has a known label or result
such as spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make predictions and is
corrected when those predictions are wrong. The training process continues until the model
achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
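
The "predict, then correct when wrong" training process described above can be sketched with a perceptron, one of the simplest supervised learners. This is an illustrative example, not code from the experiment: the function names and the toy task (learning logical AND from four labelled points) are assumptions chosen to keep it short.

```python
def predict(w, b, x):
    """Predict label 1 if the weighted sum exceeds 0, else label 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Adjust weights only when a prediction on the training data is wrong."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(w, b, x)
            if error != 0:
                # Correct the model in the direction of the true label.
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return w, b


# Labelled training data: inputs and their known results (logical AND).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])
```

Training stops as soon as a full pass over the training data produces no mistakes, which mirrors the stopping rule "until the model achieves a desired level of accuracy on the training data".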

2. Unsupervised Learning

Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract
general rules. It may be through a mathematical process to systematically reduce redundancy, or
it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and K-Means.

3. Semi-Supervised Learning

Input data is a mixture of labeled and unlabelled examples.

There is a desired prediction problem but the model must learn the structures to organize the data
as well as make predictions.

Example problems are classification and regression.


Example algorithms are extensions to other flexible methods that make assumptions about how
to model the unlabeled data.

Algorithms Grouped By Similarity

Regression Algorithms

Regression is concerned with modeling the relationship between variables
that is iteratively refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical machine
learning. This may be confusing because we can use regression to refer to the class of problem
and the class of algorithms. Really, regression is a process.

The most popular regression algorithms are:

● Ordinary Least Squares Regression (OLSR)


● Linear Regression
● Logistic Regression
● Stepwise Regression
● Multivariate Adaptive Regression Splines (MARS)
● Locally Estimated Scatterplot Smoothing (LOESS)

Instance-based Algorithms

Such methods typically build up a database of example data and compare new data to the
database using a similarity measure in order to find the best match and make a prediction. For
this reason, instance-based methods are also called winner-take-all methods and memory-based
learning. Focus is put on the representation of the stored instances and similarity measures used
between instances.

The most popular instance-based algorithms are:

● k-Nearest Neighbor (kNN)


● Learning Vector Quantization (LVQ)
● Self-Organizing Map (SOM)
● Locally Weighted Learning (LWL)
● Support Vector Machines (SVM)
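
As an illustration of the instance-based idea (store the examples, then compare new data to them with a similarity measure), the following is a minimal k-nearest-neighbour sketch in plain Python. The function name `knn_predict`, the use of Euclidean distance and the toy points are assumptions made for this example; `math.dist` needs Python 3.8 or newer.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest stored examples."""
    # Similarity measure: Euclidean distance to every stored instance.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# The "database" of stored examples: two well-separated toy groups.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.5, 1.5)))
print(knn_predict(points, labels, (6.5, 6.5)))
```

No model is fitted in advance; all the work happens at prediction time, which is why such methods are also called memory-based.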
Regularization Algorithms
An extension made to another method (typically regression methods) that penalizes models
based on their complexity, favoring simpler models that are also better at generalizing.
The most popular regularization algorithms are:

● Ridge Regression
● Least Absolute Shrinkage and Selection Operator (LASSO)
● Elastic Net
● Least-Angle Regression (LARS)
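
To make the complexity penalty concrete, here is a small sketch of Ridge Regression using its closed-form solution w = (XᵀX + αI)⁻¹Xᵀy. It is an illustration under stated assumptions: the data is synthetic, no intercept is fitted, and the helper name `ridge_fit` is made up for this example.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge weights; alpha=0 reduces to ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: 50 samples, 3 features, known weights plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, alpha=0.0)     # no penalty
w_ridge = ridge_fit(X, y, alpha=10.0)  # penalised

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The penalised weights have a smaller norm than the unpenalised ones, which is what "favoring simpler models" means for ridge regression.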

Decision Tree Algorithms


Decision tree methods construct a model of decisions made based on actual values of attributes
in the data.
Decisions fork in tree structures until a prediction decision is made for a given record. Decision
trees are trained on data for classification and regression problems. Decision trees are often fast
and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

● Classification and Regression Tree (CART)


● Iterative Dichotomiser 3 (ID3)
● C4.5 and C5.0 (different versions of a powerful approach)
● Chi-squared Automatic Interaction Detection (CHAID)
● Decision Stump
● M5
● Conditional Decision Trees
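
A single decision in such a tree can be sketched as follows. The example picks the threshold on one attribute that gives the lowest weighted Gini impurity, which is the splitting criterion commonly associated with CART; the function names and the toy data are assumptions for illustration, and a full tree would apply this step recursively to each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split on one attribute."""
    best_threshold, best_impurity = None, float("inf")
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive attribute values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy attribute values and their class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```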

Bayesian Algorithms
Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as
classification and regression.

The most popular Bayesian algorithms are:

● Naive Bayes
● Gaussian Naive Bayes
● Multinomial Naive Bayes
● Averaged One-Dependence Estimators (AODE)
● Bayesian Belief Network (BBN)
● Bayesian Network (BN)
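
Bayes' Theorem itself, P(A|B) = P(B|A) · P(A) / P(B), can be checked with a few lines of arithmetic. The probabilities below are hypothetical numbers chosen for a spam-filter illustration; they do not come from any dataset in this record.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical probabilities for a toy spam filter.
p_spam = 0.2              # P(spam)
p_word_given_spam = 0.6   # P(word | spam)
p_word_given_ham = 0.05   # P(word | not spam)

# Total probability of seeing the word in any message.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Probability that a message containing the word is spam.
p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(p_spam_given_word)
```

Naive Bayes classifiers apply this same update once per feature, under the assumption that features are independent given the class.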
Clustering Algorithms
Clustering, like regression, describes the class of problems and the class of methods.

Clustering methods are typically organized by the modeling approaches such as centroid-based
and hierarchical. All methods are concerned with using the inherent structures in the data to best
organize the data into groups of maximum commonality.

The most popular clustering algorithms are:

● k-Means
● k-Medians
● Expectation Maximisation (EM)
● Hierarchical Clustering

Association Rule Learning Algorithms


Association rule learning methods extract rules that best explain observed relationships between
variables in data.
These rules can discover important and commercially useful associations in large
multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:

● Apriori algorithm
● Eclat algorithm

Artificial Neural Network Algorithms


Artificial Neural Networks are models that are inspired by the structure and/or function of
biological neural networks.

They are a class of pattern-matching methods that are commonly used for regression and classification
problems but are really an enormous subfield composed of hundreds of algorithms and variations
for all manner of problem types.
The most popular artificial neural network algorithms are:
● Perceptron
● Multilayer Perceptrons (MLP)
● Back-Propagation
● Stochastic Gradient Descent
● Hopfield Network
● Radial Basis Function Network (RBFN)

Deep Learning Algorithms


Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant
cheap computation.

They are concerned with building much larger and more complex neural networks and, as
commented on above, many methods are concerned with very large datasets of labelled analog
data, such as image, text, audio, and video.

The most popular deep learning algorithms are:

● Convolutional Neural Network (CNN)


● Recurrent Neural Networks (RNNs)
● Long Short-Term Memory Networks (LSTMs)
● Stacked Auto-Encoders
● Deep Boltzmann Machine (DBM)
● Deep Belief Networks (DBN)

Dimensionality Reduction Algorithms


Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the
data, but in this case in an unsupervised manner in order to summarize or describe data using less
information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a
supervised learning method. Many of these methods can be adapted for use in classification and
regression.

● Principal Component Analysis (PCA)


● Principal Component Regression (PCR)
● Partial Least Squares Regression (PLSR)
● Sammon Mapping
● Multidimensional Scaling (MDS)
● Projection Pursuit
● Linear Discriminant Analysis (LDA)
● Mixture Discriminant Analysis (MDA)
● Quadratic Discriminant Analysis (QDA)
● Flexible Discriminant Analysis (FDA)

Ensemble Algorithms
Ensemble methods are models composed of multiple weaker models that are independently
trained and whose predictions are combined in some way to make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to
combine them. This is a very powerful class of techniques and as such is very popular.

● Boosting
● Bootstrapped Aggregation (Bagging)
● AdaBoost
● Weighted Average (Blending)
● Stacked Generalization (Stacking)
● Gradient Boosting Machines (GBM)
● Gradient Boosted Regression Trees (GBRT)
● Random Forest
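
The idea of combining several weak models into one prediction can be sketched with simple majority voting. The three "weak learners" below are hand-written threshold rules standing in for independently trained models; they and the helper names are assumptions made purely for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most of the individual models."""
    return Counter(predictions).most_common(1)[0][0]


# Stand-ins for three independently trained weak models (hypothetical rules).
weak_learners = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]


def ensemble_predict(x):
    """Combine the individual predictions into the overall prediction."""
    return majority_vote([learner(x) for learner in weak_learners])


print(ensemble_predict((0.9, 0.2)))
print(ensemble_predict((0.1, 0.2)))
```

Bagging and Random Forest combine their models in essentially this way; boosting and stacking instead weight the models or learn the combination.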
EXPERIMENT 3

Aim : Understand clustering approaches and implement K means Algorithm using Sci-Kit Learn

Clustering
Clustering is a type of unsupervised learning method. An unsupervised learning method is a
method in which we draw inferences from datasets consisting of input data without labelled
responses. Generally, it is used as a process to find meaningful structure, explanatory underlying
processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to each other than to the data points in other
groups. It is basically a grouping of objects on the basis of the
similarity and dissimilarity between them.

For example, data points that lie close together in a scatter plot can be classified into one single
group. We can distinguish the clusters, and we can identify that there are 3 clusters in the
picture below.

K-Means Algorithm
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance,
minimizing a criterion known as the inertia or within-cluster sum-of-squares. This
algorithm requires the number of clusters to be specified. It scales well to a large number of
samples and has been used across a large range of application areas in many different fields.
The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the
mean μ of the samples in the cluster. The means are commonly called the cluster “centroids”.
The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster
sum-of-squares criterion: the sum, over all samples, of the squared distance to the nearest centroid.
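
As a concrete illustration of the inertia criterion, the sketch below computes it with NumPy for a hand-made dataset. The helper name `inertia` and the four toy points are assumptions for this example; after fitting, scikit-learn exposes the same quantity as the `inertia_` attribute of `KMeans`.

```python
import numpy as np


def inertia(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    # Squared distance from every sample to every centroid,
    # shape (n_samples, n_clusters).
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()


# Four toy samples forming two obvious groups, and one centroid per group.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(X, centroids))
```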
K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps.
The first step chooses the initial centroids, with the most basic method being to choose k samples
from the dataset. After initialization, K-means consists of looping between the two other steps.
The first of these assigns each sample to its nearest centroid. The second creates new centroids
by taking the mean value of all of the samples assigned to each previous centroid. The difference
between the old and the new centroids is computed and the algorithm repeats these last two
steps until this value is less than a threshold. In other words, it repeats until the centroids do not
move significantly.
K-means is equivalent to the expectation-maximization algorithm with a small, all-equal,
diagonal covariance matrix.
The algorithm can also be understood through the concept of Voronoi diagrams. First the
Voronoi diagram of the points is calculated using the current centroids. Each segment in the
Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of
each segment. The algorithm then repeats this until a stopping criterion is fulfilled. Usually, the
algorithm stops when the relative decrease in the objective function between iterations is less
than the given tolerance value. This is not the case in the scikit-learn implementation, where iteration stops when
centroids move less than the tolerance.
Given enough time, K-means will always converge; however, this may be to a local minimum.
This is highly dependent on the initialization of the centroids. As a result, the computation is
often done several times, with different initializations of the centroids.

Implementation of K-Means using Linear Algebra Libraries only.

import numpy as np
from numpy.linalg import norm


class Kmeans:
    '''Implementing Kmeans algorithm.'''

    def __init__(self, n_clusters, max_iter=100, random_state=123):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def initializ_centroids(self, X):
        # Pick n_clusters distinct samples at random as the initial centroids.
        # The seeded generator must be kept and used, otherwise the seed has no effect.
        rng = np.random.RandomState(self.random_state)
        random_idx = rng.permutation(X.shape[0])
        centroids = X[random_idx[:self.n_clusters]]
        return centroids

    def compute_centroids(self, X, labels):
        # New centroid of each cluster = mean of the samples assigned to it.
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            centroids[k, :] = np.mean(X[labels == k, :], axis=0)
        return centroids

    def compute_distance(self, X, centroids):
        # Squared Euclidean distance from every sample to every centroid.
        distance = np.zeros((X.shape[0], self.n_clusters))
        for k in range(self.n_clusters):
            row_norm = norm(X - centroids[k, :], axis=1)
            distance[:, k] = np.square(row_norm)
        return distance

    def find_closest_cluster(self, distance):
        # Index of the nearest centroid for each sample.
        return np.argmin(distance, axis=1)

    def compute_sse(self, X, labels, centroids):
        # Within-cluster sum of squared errors (the inertia).
        distance = np.zeros(X.shape[0])
        for k in range(self.n_clusters):
            distance[labels == k] = norm(X[labels == k] - centroids[k], axis=1)
        return np.sum(np.square(distance))

    def fit(self, X):
        self.centroids = self.initializ_centroids(X)
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroids(X, self.labels)
            # Stop once the centroids no longer move.
            if np.all(old_centroids == self.centroids):
                break
        self.error = self.compute_sse(X, self.labels, self.centroids)

    def predict(self, X):
        # Assign samples to the nearest fitted centroid.
        distance = self.compute_distance(X, self.centroids)
        return self.find_closest_cluster(distance)

K-Means on Geyser’s Eruptions Segmentation

DATA:
The dataset has 272 observations and 2 features. The data covers the waiting time between
eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National
Park, Wyoming, USA. We will try to find K subgroups within the data points and group them
accordingly. Below is the description of the features:
● eruptions (float): Eruption time in minutes.
● waiting (int): Waiting time for the next eruption.

As a test, we work with 2 clusters (i.e., k = 2).

# Modules
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score

# Import the data
df = pd.read_csv('../data/old_faithful.csv')

# Standardize the data
X_std = StandardScaler().fit_transform(df)

# Run local implementation of kmeans
km = Kmeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.centroids

# Plot the clustered data
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[km.labels == 0, 0], X_std[km.labels == 0, 1],
            c='green', label='cluster 1')
plt.scatter(X_std[km.labels == 1, 0], X_std[km.labels == 1, 1],
            c='blue', label='cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300,
            c='r', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of clustered data', fontweight='bold')
ax.set_aspect('equal');

Using ELBOW Method for Choosing optimum K


# Run scikit-learn's KMeans for each k and record the inertia (SSE)
sse = []
list_k = list(range(1, 10))

for k in list_k:
    km = KMeans(n_clusters=k)
    km.fit(X_std)
    sse.append(km.inertia_)

# Plot sse against k
plt.figure(figsize=(6, 6))
plt.plot(list_k, sse, '-o')
plt.xlabel(r'Number of clusters *k*')
plt.ylabel('Sum of squared distance');
EXPERIMENT 4

Aim : Study of datasets and understanding attributes evaluation in regard to problem description.

Many kinds of data files store the attribute details of a problem description. They store data
in one of the following formats:

1. CSV (comma separated values)
2. ARFF
3. Excel (xls)
4. Sqlite
5. XML

For our use case, we’ll consider CSVs and understand attribute evaluation.

For that we’ll take the white variant of the Wine Quality data set, which is available on the
UCI Machine Learning Repository, and try to extract as many insights from the data set as possible
using exploratory data analysis (EDA).

To start with, import the necessary libraries (for this example pandas, numpy, matplotlib and seaborn)
and load the data set.

● The original data is separated by the delimiter “;” in the given data set.
● To take a closer look at the data, we use the “.head()” function of the pandas library, which
returns the first five observations of the data set. Similarly, “.tail()” returns the last five
observations of the data set.
● The dataset comprises 4898 observations and 12 characteristics.
● Of these, one is the dependent variable and the remaining 11 are independent variables
(physico-chemical characteristics).

● The data has only float and integer values.
● No variable column has null/missing values.
● Here we see that the mean value is less than the median value of each column; the median is
represented by 50% (50th percentile) in the index column.
● There is a notably large difference between the 75th percentile and the max values of the predictors
“residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”.
● Thus the two observations above suggest that there are extreme values (outliers) in our data set.

● The target variable (dependent variable) is discrete and categorical in nature.
● The “quality” score scale ranges from 1 to 10, where 1 is poor and 10 is the best.
● The quality ratings 1, 2 and 10 are not given by any observation. The only scores obtained are
between 3 and 9.

● This tells us the vote count of each quality score in descending order.
● “quality” has most values concentrated in the categories 5, 6 and 7.
● Only a few observations are made for the categories 3 and 9.
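
The checks listed above map onto a handful of pandas calls. The following is a minimal sketch, not the original notebook: the five rows are made up in the same “;”-separated layout and are NOT real Wine Quality data, and only four of the twelve columns are shown. In practice, pass the path of the downloaded CSV file to `read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Made-up rows for illustration only; NOT real Wine Quality data.
raw = io.StringIO(
    "fixed acidity;residual sugar;alcohol;quality\n"
    "7.0;20.7;8.8;6\n"
    "6.3;1.6;9.5;6\n"
    "8.1;6.9;10.1;6\n"
    "7.2;8.5;9.9;5\n"
    "6.2;7.0;12.8;7\n"
)

df = pd.read_csv(raw, sep=";")        # the delimiter is ';', not ','

print(df.head())                      # first five observations
print(df.shape)                       # (observations, characteristics)
df.info()                             # column dtypes and non-null counts
print(df.isnull().sum())              # missing values per column
print(df.describe())                  # mean, 50% (median), 75%, max, ...
print(df["quality"].value_counts())   # vote count per score, descending
```

Comparing the 75% and max rows of `describe()` is how the outlier observation is made, and `value_counts()` gives the vote count of each quality score.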
EXPERIMENT 5

Aim : Working of Major Classifiers


a) Naïve Bayes
b) Decision Tree
c) Logistic regression
d) Support vector machine
e) KNN.
f) ARIMA

a) Naïve Bayes

Loading Data
Let's first load the required wine dataset from scikit-learn datasets.

#Import scikit-learn dataset library


from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

Exploring Data
You can print the target and feature names, to make sure you have the right dataset, as such:

# print the names of the 13 features
print("Features: ", wine.feature_names)

# print the label type of wine (class_0, class_1, class_2)
print("Labels: ", wine.target_names)
Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
'proanthocyanins', 'color_intensity', 'hue',
'od280/od315_of_diluted_wines', 'proline']
Labels: ['class_0' 'class_1' 'class_2']
# print data (feature) shape
wine.data.shape
(178, 13)
# print the wine data features (top 5 records)
print(wine.data[0:5])

[[ 1.42300000e+01 1.71000000e+00 2.43000000e+00 1.56000000e+01


1.27000000e+02 2.80000000e+00 3.06000000e+00 2.80000000e-01
2.29000000e+00 5.64000000e+00 1.04000000e+00 3.92000000e+00
1.06500000e+03]
[ 1.32000000e+01 1.78000000e+00 2.14000000e+00 1.12000000e+01
1.00000000e+02 2.65000000e+00 2.76000000e+00 2.60000000e-01
1.28000000e+00 4.38000000e+00 1.05000000e+00 3.40000000e+00
1.05000000e+03]
[ 1.31600000e+01 2.36000000e+00 2.67000000e+00 1.86000000e+01
1.01000000e+02 2.80000000e+00 3.24000000e+00 3.00000000e-01
2.81000000e+00 5.68000000e+00 1.03000000e+00 3.17000000e+00
1.18500000e+03]
[ 1.43700000e+01 1.95000000e+00 2.50000000e+00 1.68000000e+01
1.13000000e+02 3.85000000e+00 3.49000000e+00 2.40000000e-01
2.18000000e+00 7.80000000e+00 8.60000000e-01 3.45000000e+00
1.48000000e+03]
[ 1.32400000e+01 2.59000000e+00 2.87000000e+00 2.10000000e+01
1.18000000e+02 2.80000000e+00 2.69000000e+00 3.90000000e-01
1.82000000e+00 4.32000000e+00 1.04000000e+00 2.93000000e+00
7.35000000e+02]]
# print the wine labels (0:class_0, 1:class_1, 2:class_2)
print(wine.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Splitting Data
First, you separate the columns into dependent and independent variables(or features and label).
Then you split those variables into train and test set.

# Import train_test_split function
# (sklearn.cross_validation was removed; the function now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109)

Model Generation

After splitting, we will generate a Naive Bayes model on the training set and perform prediction
on test set features.

#Import Gaussian Naive Bayes model


from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier


gnb = GaussianNB()

#Train the model using the training sets


gnb.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = gnb.predict(X_test)

Evaluating Model
After model generation, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation


from sklearn import metrics

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
>>> ('Accuracy:', 0.90740740740740744)

b) Decision Tree
Also on the Wine Dataset used above.

Model Generation

After splitting, we will generate a Decision Tree model on the training set and perform
prediction on test set features.

#Import Decision Tree model

from sklearn.tree import DecisionTreeClassifier

#Create a Decision Tree Classifier


clf = DecisionTreeClassifier(random_state=0)

#Train the model using the training sets


clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

Evaluating Model
After model generation, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation


from sklearn import metrics

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

>>> ('Accuracy:', 0.84)

c) Logistic Regression
Also on the Wine Dataset used above.
Model Generation

After splitting, we will generate a Logistic Regression model on the training set and perform
prediction on test set features.

#Import Logistic Regression model


from sklearn.linear_model import LogisticRegression

#Create a Logistic Regression Classifier


clf = LogisticRegression()

#Train the model using the training sets


clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

Evaluating Model
After model generation, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation


from sklearn import metrics

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

>>> ('Accuracy:', 0.78)

d) Support Vector Machine

Also on the Wine Dataset used above.

Model Generation

After splitting, we will generate a Support Vector Machine model on the training set and
perform prediction on test set features.

#Import Support Vector Machine model


from sklearn import svm
#Create a Support Vector Classifier
clf = svm.SVC()

#Train the model using the training sets


clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

Evaluating Model
After model generation, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation


from sklearn import metrics

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

>>> ('Accuracy:', 0.91)

e) k Nearest Neighbour

Also on the Wine Dataset used above.

Model Generation

After splitting, we will generate a k Nearest Neighbour model on the training set and perform
prediction on test set features.

#Import k Nearest Neighbour classifier
from sklearn.neighbors import KNeighborsClassifier

#Create a k Nearest Neighbour Classifier
clf = KNeighborsClassifier(n_neighbors=2, algorithm='ball_tree')

#Train the model using the training sets


clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)

Evaluating Model
After model generation, check the accuracy using actual and predicted values.

#Import scikit-learn metrics module for accuracy calculation


from sklearn import metrics

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

>>> ('Accuracy:', 0.887)

f) ARIMA

We are considering the shampoo sales data.

from pandas import read_csv
from datetime import datetime
from matplotlib import pyplot

# Parse dates such as '1-01' as year 1901, month 01
def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
                  index_col=0, squeeze=True, date_parser=parser)
print(series.head())
series.plot()
pyplot.show()

Month
1901-01-01 266.0
1901-02-01 145.9
1901-03-01 183.1
1901-04-01 119.3
1901-05-01 180.3
Name: Sales, dtype: float64

Sales Data Plot

Model Training

from pandas import read_csv
from datetime import datetime
from pandas import DataFrame
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
                  index_col=0, squeeze=True, date_parser=parser)
# fit model
model = ARIMA(series, order=(5,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())
# plot residual errors
residuals = DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())

OUTPUT
                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.Sales   No. Observations:                   35
Model:                 ARIMA(5, 1, 0)   Log Likelihood                -196.170
Method:                       css-mle   S.D. of innovations             64.241
Date:                Mon, 12 Dec 2016   AIC                            406.340
Time:                        11:09:13   BIC                            417.227
Sample:                    02-01-1901   HQIC                           410.098
                         - 12-01-1903
=================================================================================
                    coef    std err          z      P>|z|     [ 95.0% Conf. Int.]
---------------------------------------------------------------------------------
const            12.0649      3.652      3.304      0.003         4.908    19.222
ar.L1.D.Sales    -1.1082      0.183     -6.063      0.000        -1.466    -0.750
ar.L2.D.Sales    -0.6203      0.282     -2.203      0.036        -1.172    -0.068
ar.L3.D.Sales    -0.3606      0.295     -1.222      0.231        -0.939     0.218
ar.L4.D.Sales    -0.1252      0.280     -0.447      0.658        -0.674     0.424
ar.L5.D.Sales     0.1289      0.191      0.673      0.506        -0.246     0.504
                                    Roots
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            -1.0617           -0.5064j            1.1763           -0.4292
AR.2            -1.0617           +0.5064j            1.1763            0.4292
AR.3             0.0816           -1.3804j            1.3828           -0.2406
AR.4             0.0816           +1.3804j            1.3828            0.2406
AR.5             2.9315           -0.0000j            2.9315           -0.0000
-----------------------------------------------------------------------------

Inference
from sklearn.metrics import mean_squared_error

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
                  index_col=0, squeeze=True, date_parser=parser)
X = series.values
# first 66% of the observations for training, the rest for testing
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation: refit on everything seen so far, forecast one step ahead
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

OUTPUT
predicted=349.117688, expected=342.300000
predicted=306.512968, expected=339.700000
predicted=387.376422, expected=440.400000
predicted=348.154111, expected=315.900000
predicted=386.308808, expected=439.300000
predicted=356.081996, expected=401.300000
predicted=446.379501, expected=437.400000
predicted=394.737286, expected=575.500000
predicted=434.915566, expected=407.600000
predicted=507.923407, expected=682.000000
predicted=435.483082, expected=475.300000
predicted=652.743772, expected=581.300000
predicted=546.343485, expected=646.900000
Test MSE: 6958.325
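
The evaluation scheme used above (forecast one step, then add the true observation to the history before the next forecast) can be sketched independently of ARIMA. The example below is a minimal illustration with a naive persistence forecast in place of the ARIMA model; the function names and the toy series are assumptions, not part of the experiment.

```python
def walk_forward_validation(train, test, forecast_fn):
    """Forecast one step at a time, adding each true observation to the history."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(forecast_fn(history))  # forecast uses only past values
        history.append(observed)                  # reveal the true value afterwards
    return predictions

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def persistence_forecast(history):
    """Naive baseline: the next value equals the last observed value."""
    return history[-1]

# Toy series (hypothetical values)
wf_train = [10.0, 12.0, 11.0]
wf_test = [13.0, 12.0, 15.0]
wf_preds = walk_forward_validation(wf_train, wf_test, persistence_forecast)
print(wf_preds, mse(wf_test, wf_preds))
```

A persistence baseline like this is a common reference point: a fitted model is only useful if its test MSE is lower than that of simply repeating the last value.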
EXPERIMENT 6

Aim : Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold
cross-validation.

k-Nearest Neighbors
The k-Nearest Neighbors algorithm or KNN for short is a very simple technique.

The entire training dataset is stored. When a prediction is required, the k-most similar records to
a new record from the training dataset are then located. From these neighbors, a summarized
prediction is made.

Similarity between records can be measured in many different ways. A problem-specific or
data-specific method can be used. Generally, with tabular data, a good starting point is the
Euclidean distance.

Once the neighbors are discovered, the summary prediction can be made by returning the most
common outcome or taking the average. As such, KNN can be used for classification or
regression problems.

# k-nearest neighbors on the Iris Flowers Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        # train on all folds except the current one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        # hide the class label of each test row
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Calculate the Euclidean distance between two vectors
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

# Make a prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# kNN Algorithm
def k_nearest_neighbors(train, test, num_neighbors):
    predictions = list()
    for row in test:
        output = predict_classification(train, row, num_neighbors)
        predictions.append(output)
    return predictions

# Test the kNN on the Iris Flowers dataset
seed(1)
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
num_neighbors = 5
scores = evaluate_algorithm(dataset, k_nearest_neighbors, n_folds, num_neighbors)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

OUTPUT
Scores: [96.66666666666667, 96.66666666666667, 100.0, 90.0, 100.0]
Mean Accuracy: 96.667%
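
The description of KNN at the start of this experiment notes that the summary prediction can also be an average, which turns the method into a regressor. The sketch below illustrates that variant on a tiny made-up dataset; the function name knn_regress and the data are assumptions for illustration and are not part of the program above.

```python
from math import sqrt

def knn_regress(train_rows, query, k):
    """Average the targets (last column) of the k training rows nearest to query."""
    def distance(row):
        # Euclidean distance over the feature columns only
        return sqrt(sum((a - b) ** 2 for a, b in zip(row[:-1], query)))
    nearest = sorted(train_rows, key=distance)[:k]
    return sum(row[-1] for row in nearest) / k

# Toy rows (hypothetical): one feature followed by a numeric target
reg_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]]
reg_pred = knn_regress(reg_rows, [2.1], k=3)
print(reg_pred)  # 20.0
```

The three rows closest to 2.1 have targets 20, 30 and 10, so the far-away row with target 100 does not influence the prediction.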
EXPERIMENT 7

Aim : Introduction to R. Be aware of the basics of machine learning methods in R.

Libraries

The base R installation does not include every library; additional packages are installed with
install.packages :

install.packages("tidyverse") # For installing tidyverse

Once a package is installed, here are some operations for libraries :

library(readr) # Load library

# Load multiple libraries
p.names <- c('xgboost', 'caret', 'dplyr', 'e1071')
lapply(p.names, library, character.only = TRUE)

installed.packages() # List installed packages

remove.packages("tidyverse") # Uninstall a package

# Getting help and documentation
?functionName
help(functionName)
example(functionName)

Writing some code


Variables
Defining variables is straightforward: either the “=” or the “<-” operator can be used. One
unusual thing, if you come from Python, is that variable names may contain dots “.”, so you
get variables like “my.data.vector”. This is very common in code snippets found online.

# Create new variables
my_var <- 54
my.data.vector = c(34, 54, 65)
# Clear a variable
my_var <- NULL

Functions

Functions in R are similar to Python functions :

● Assign the function like you would assign a variable.
● Use the function keyword with parameters inside parentheses.
● Use return as the exit point.

The following small function, named prod_age, takes creation_date as an argument. With an if
statement, we handle the missing (NA) case; otherwise we cast the value to a date.

prod_age <- function(creation_date) {
  if (is.na(creation_date)) {return(as.numeric(-1))}
  else {return(as.Date(creation_date))}
}

Working with dataframes

Load data, read files

The read_delim function, from the readr library, can read most delimited file types.
In the example below, we specify the data type of each column. The file has 56 columns, and we
want all of them to be read as characters, so we pass the col_types argument a string of 56 “c”
characters, each character corresponding to a column.

# Load the library
library(readr)
# Create dataframe from CSV file
my_dataframe <- read_delim("C:/path/to/file.csv",
                           delim = "|",
                           escape_double = FALSE,
                           col_types = paste(rep("c", 56), collapse = ''))

Subsetting a dataframe
Dataframes are not only created by importing a dataset; some functions also return
dataframes. The main tool for subsetting is the bracket operator.
To access a specific column, use the $ operator, which is very convenient.

y_train <- my.dataframe$label

To access specific rows, we use the [] operator. You might be
familiar with this syntax : [rows, columns]

# Works with numeric indices (R indices start at 1)
y_train <- my.dataframe[1:100, 8]
# Works with negative indices to exclude
y_test <- my.dataframe[-(1:100), 8]
# Here is another technique still using the bracket syntax. The which
# and names functions are used to subset rows and columns.
filtered.dataframe <- my.dataframe[
  which(my.dataframe$col1 == 2),                     # Filter rows on condition
  names(my.dataframe) %in% c("col1","col2","col3")]  # Subset cols

The subset function : the first argument is the dataframe, then the filter condition on rows, then
the columns to select.

filtered.dataframe <- subset(
  my.dataframe,
  col1 == 2,
  select = c("col1","col2","col3"))

The dplyr library


plyr is a popular library for data manipulation. From this came the dplyr package, which
introduces a grammar for the most common data manipulation challenges.
If you come from Python, you may be familiar with chaining commands with a dot. Here with
dplyr you can do just the same with a special pipe : %>%.

starwars %>%
  filter(species == "Droid")

starwars %>%
  select(name, ends_with("color"))

starwars %>%
  mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)

starwars %>%
  arrange(desc(mass))

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)

dplyr is actually very convenient for data filtering and exploration, and the grammar is
straightforward.

Modify column values


When a dataframe object is created, we access specific columns with the $ operator.

# Filtering rows based on a specific column value
my_dataframe <- subset(my_dataframe, COLNAME != 'str_value')
# Assign 0 where column values match condition
non_conformites$REGUL_DAYS[non_conformites$REGUL_DAYS_NUM < 0] <- 0
# Create new column from existing columns
table$AMOUNT <- table$Q_LITIG * table$PRICE
# Delete a column
my_dataframe$COLNAME <- NULL

Apply a function to a column


Once we have a dataframe and functions ready, we often need to apply the functions to columns in
order to transform them.
Here we use the apply function. It applies an operation to a block of structured data, so
it is not limited to dataframes. Of course, every element must have the same type.

# Product age function
prod_age <- function(creation_date) {
  if (is.na(creation_date)) {return(as.numeric(-1))}
  else {return(as.Date(creation_date))}
}
# Apply function on column
mytable$PRODUCT_AGE <-
  apply(mytable[,c('DATE_CREA'), drop=F], 1, function(x) prod_age(x))

Plotting

R comes with several libraries for plotting data. The plot function is similar to plt.plot
in Python.
RStudio is very convenient for plotting: it has a dedicated plotting window, with the option to
go back to previous plots.

Line charts

plot(
ref_sales$Date, ref_sales$Sales,
type = 'l',
xlab = "Date", ylab = "Sales",
main = paste('Sales evolution over time for : ', article_ref)
)

Various charts

R being the language of statisticians, it comes with various charts for plotting data distributions.
values <- c(1, 4, 8, 2, 4)
barplot(values)
hist(values)
pie(values)
boxplot(values)
Machine learning : XGBoost library

The xgboost package is a good starting point, as it is well documented. It gives quick
insights into a dataset, such as feature importance, as we will see below.
For this part, we need these specific libraries :
- xgboost : the well-known XGBoost gradient boosting algorithm.
- caret : Classification And REgression Training, includes lots of data processing functions.
- dplyr : A fast, consistent tool for working with data frame like objects, both in memory and out
of memory.

Train-Test split
Once the dataframe is prepared, we split it into train and test sets, using an index (inTrain) :

set.seed(1337)
inTrain <- createDataPartition(y = my.dataframe$label, p = 0.85, list = FALSE)
X_train = xgb.DMatrix(as.matrix(my.dataframe[inTrain, ] %>% select(-label)))
y_train = my.dataframe[inTrain, ]$label
X_test = xgb.DMatrix(as.matrix(my.dataframe[-inTrain, ] %>% select(-label)))
y_test = my.dataframe[-inTrain, ]$label

Parameter search for XGBoost

What the following function does :

- Take our train/test sets as input.
- Define a trainControl for cross validation.
- Define a grid for parameters.
- Set up an XGB model including the parameter search.
- Evaluate the model’s accuracy.
- Return the fitted model, which holds the set of best parameters.

param_search <- function(xtrain, ytrain, xtest, ytest) {
  # Cross validation init
  xgb_trcontrol = trainControl(method = "cv", number = 5,
                               allowParallel = TRUE,
                               verboseIter = T, returnData = FALSE)
  # Param grid
  xgbGrid <- expand.grid(
    nrounds = 60,            # nrounds = c(10,20,30,40),
    max_depth = 20,          # max_depth = c(3, 5, 10, 15, 20, 30),
    colsample_bytree = 0.6,  # colsample_bytree = seq(0.5, 0.9, length.out = 5),
    eta = 0.005,             # eta = c(0.001, 0.0015, 0.005, 0.1),
    gamma = 0, min_child_weight = 1, subsample = 1
  )
  # Model and parameter search
  xgb_model = train(xtrain, ytrain, trControl = xgb_trcontrol,
                    tuneGrid = xgbGrid, method = "xgbTree",
                    verbose = 2,
                    #objective="multi:softprob",
                    eval_metric = "mlogloss")
                    #num_class=3)
  # Evaluate the model
  xgb.pred = predict(xgb_model, xtest, reshape=T)
  xgb.pred = as.data.frame(xgb.pred, col.names=c("pred"))
  result = sum(xgb.pred$xgb.pred==ytest) / nrow(xgb.pred)
  print(paste("Final Accuracy =",sprintf("%1.2f%%", 100*result)))
  return(xgb_model)
}

Once the parameter search is done, we can use its result directly to define our working model.
We access each element with the $ operator :

best.model <- xgboost(
  data = as.matrix(my.dataframe[inTrain, ] %>% select(-IMPORTANCE)),
  label = as.matrix(as.numeric(my.dataframe[inTrain,]$IMPORTANCE)-1),
  nrounds = xgb_model$bestTune$nrounds,
  max_depth = xgb_model$bestTune$max_depth,
  eta = xgb_model$bestTune$eta,
  gamma = xgb_model$bestTune$gamma,
  colsample_bytree = xgb_model$bestTune$colsample_bytree,
  min_child_weight = xgb_model$bestTune$min_child_weight,
  subsample = xgb_model$bestTune$subsample,
  objective = "multi:softprob", num_class=3)

Compute and plot feature importance
Here again, a lot of functions are available in the xgboost package.
The documentation presents most of them.

xgb_feature_imp <- xgb.importance(
  colnames(donnees[inTrain, ] %>% select(-label)),
  model = best.model
)
gg <- xgb.ggplot.importance(xgb_feature_imp, 40); gg

Below is an example of a feature importance plot, as displayed in RStudio. The clusters made by
xgboost simply group features with similar scores; they have no other specific meaning.
EXPERIMENT 8

Aim : Develop a machine learning method using Neural Networks in Python to predict stock prices
based on past price variation.

What’s a Neural Network?


Most introductory texts to Neural Networks bring up brain analogies when describing them.
Without delving into brain analogies, I find it easier to simply describe Neural Networks as a
mathematical function that maps a given input to a desired output.
Neural Networks consist of the following components :
● An input layer, x
● An arbitrary number of hidden layers
● An output layer, ŷ
● A set of weights and biases between each layer, W and b
● A choice of activation function for each hidden layer, σ. In this tutorial, we’ll use a
Sigmoid activation function.

Implementing a 2-Layer Neural Net in Python


import numpy as np

# Helper functions assumed by the class below (not shown in the original listing)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to already be a sigmoid output
    return x * (1.0 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(self.y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # application of the chain rule to find derivative of the
        # loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output)
                            * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output)
                            * sigmoid_derivative(self.output), self.weights2.T)
                            * sigmoid_derivative(self.layer1)))
        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2
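
The backprop step above passes layer outputs, not raw inputs, to sigmoid_derivative. This works because the derivative of the sigmoid can be written in terms of its own output: if a = σ(z), then σ'(z) = a(1 − a). The short sketch below checks that identity numerically with the standard library only; the function names are assumptions chosen for this illustration.

```python
import math

def sigmoid_scalar(z):
    """Logistic sigmoid for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative_from_output(a):
    """Derivative of the sigmoid written in terms of its output a = sigmoid(z)."""
    return a * (1.0 - a)

z = 0.5
a = sigmoid_scalar(z)
h = 1e-6
# Central finite difference as an independent estimate of the slope at z
numeric_slope = (sigmoid_scalar(z + h) - sigmoid_scalar(z - h)) / (2 * h)
print(abs(numeric_slope - sigmoid_derivative_from_output(a)) < 1e-8)
```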

Using keras to build an RNN based Neural Network to predict Google Stock Prices

# Importing Packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import keras
from keras.models import load_model
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

# Importing Dataset
dataset_train = pd.read_csv('google_train.csv')
train_dataset = dataset_train.iloc[:, 1:2].values

# Scaling

sc = MinMaxScaler()
train_dataset_scaled = sc.fit_transform (train_dataset)
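
For reference, min-max scaling maps each value v to (v − min) / (max − min), and the inverse mapping recovers the original prices from scaled predictions. The sketch below shows both directions in plain Python on made-up prices; the function names are assumptions, and scikit-learn's MinMaxScaler handles additional cases such as multiple columns.

```python
def fit_min_max(values):
    """Return the minimum and maximum seen in the training values."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values into the range 0-1 using the fitted minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

def inverse_scale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

# Toy prices (hypothetical)
prices = [100.0, 150.0, 200.0]
lo, hi = fit_min_max(prices)
scaled_prices = scale(prices, lo, hi)
print(scaled_prices)                         # [0.0, 0.5, 1.0]
print(inverse_scale(scaled_prices, lo, hi))  # [100.0, 150.0, 200.0]
```

The minimum and maximum are taken from the training data only and then reused for the test data, so that no information from the test period leaks into the model.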

# Data Preprocessing
X_train = []
y_train = []

# each sample holds the previous 60 scaled prices; the target is the next price
for i in range(60, train_dataset.shape[0]):
    X_train.append(train_dataset_scaled[i-60:i, 0])
    y_train.append(train_dataset_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# reshape to (samples, timesteps, features) as expected by the LSTM layer
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
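
The windowing used in the preprocessing step can be seen more easily on a short series. The sketch below builds the same kind of (previous values, next value) pairs with a window of 3 instead of 60; the function name make_windows and the toy series are assumptions for illustration.

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])  # the `window` values before position i
        y.append(series[i])             # the value to predict
    return X, y

X_demo, y_demo = make_windows([1, 2, 3, 4, 5, 6], window=3)
print(X_demo)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y_demo)  # [4, 5, 6]
```

A series of length n yields n − window samples, which is why the first 60 prices of the training set only ever appear as inputs.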

# Model Generation
regressor = Sequential()

regressor.add(LSTM(units=50, return_sequences=True,
input_shape=(X_train.shape[1],1)))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50))
regressor.add(Dropout(0.2))

regressor.add(Dense(units=1))

# Model Compilation
regressor.compile('adam', loss='mean_squared_error')

## Model Training
regressor.fit(X_train, y_train, batch_size=32, epochs=50)

Snapshot of training output

Epoch 45/50
3213/3213 [==============================] - 12s 4ms/step - loss: 6.9449e-04
Epoch 46/50
3213/3213 [==============================] - 12s 4ms/step - loss: 8.2133e-04
Epoch 47/50
3213/3213 [==============================] - 12s 4ms/step - loss: 7.4369e-04
Epoch 48/50
3213/3213 [==============================] - 12s 4ms/step - loss: 6.2250e-04
Epoch 49/50
3213/3213 [==============================] - 12s 4ms/step - loss: 9.9489e-04
Epoch 50/50
3213/3213 [==============================] - 12s 4ms/step - loss: 6.9578e-04

Inference

# Loading Testing Data
dataset_test = pd.read_csv('google_test.csv')
real_values = dataset_test.iloc[:, 1:2].values

Preprocessing

dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis=0)
# keep the test period plus the 60 days before it
dataset_total = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
dataset_total = dataset_total.reshape(-1, 1)

Predictions and Result generation

# Scale the raw prices with the scaler fitted on the training data,
# then build the 60-step input windows
inputs = sc.transform(dataset_total)

X_test = []
for i in range(60, inputs.shape[0]):
    X_test.append(inputs[i-60:i, 0])

X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

pred = regressor.predict(X_test)

pred = sc.inverse_transform(pred)

# Visualising Results
plt.plot(real_values, label = 'real values')
plt.plot(pred, label = 'predicted values')
plt.xlabel('Time')
plt.ylabel('Stock Prices')
plt.legend()
plt.show()
