
"Machine Marathon"

Submitted in partial fulfillment of the requirements for the degree of


Bachelor of Computer Engineering

by

Jainam Shah

Internal guide
(Prof. )
External guide

(Mr. Pallav Mamtora)

Ahmedabad Institute of Technology


Declaration

We declare that this written submission represents our ideas in our own words,
and where others' ideas or words have been included, we have adequately cited
and referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in our submission. We
understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources which have
thus not been properly cited or from whom proper permission has not been taken
when needed.

Jainam Shah
Abstract

This project is about making the user's life easier using machine learning and Python.
We provide assistance to users in filling up their forms, saving them a lot of time in
entering their basic details during everyday use of the internet. Our goal is to give the
user a platform where he/she has to fill in the details only once; then, every time that
information is required, our app will fill it in automatically. We want the system to learn
the user's basic details every time he/she fills in anything manually and to store them for
future use. Our system's main goal is to save your time and effort in creating new accounts
and filling login pages in day-to-day use as much as possible.

Jainam Shah
BE CE
AIT College
Contents

1 Project Overview 1
1.1 Introduction...........................................................................................................1
1.1.1 Background Introduction..........................................................................2
1.1.2 Motivation................................................................................................2
1.2 Problem Definition................................................................................................3
1.3 Current Systems....................................................................................................3
1.4 The Problems with Current System......................................................................3
1.4.1 Advantages Over Current System............................................................4
1.5 Goals and Objectives.............................................................................................4
1.6 Scope and Applications.........................................................................................5

2 Review Of Literature 6
2.1 Machine Marathon................................................................................................6
2.1.1 Description...............................................................................................6
2.1.2 Pros...........................................................................................................6
2.1.3 Cons..........................................................................................................7
2.2 Technological Review...........................................................................................8
2.2.1 Python, Django, Scikit-Learn, Machine Learning ................8
2.2.2 MySQL......................................................................................................9

3 Requirement Analysis 11
3.1 Platform Requirement :.........................................................................................11
3.1.1 Supportive Operating Systems :...............................................................11
3.2 Software Requirement :.........................................................................................11
3.3 Hardware Requirement :.......................................................................................12
3.4 Database Requirement :........................................................................................13

4 System Design and Architecture 14


4.1 System Architecture.............................................................................................14
4.2 Use Case Diagram.................................................................................................15
4.3 Class Diagram......................................................................................................16

4.4 Block Diagram......................................................................................................17
4.5 E-R Diagram........................................................................................................18

5 Methodology 20
5.1 Modular Description.............................................................................................20
5.1.1

6 Implementation Details 28
6.1 Assumptions And Dependencies...........................................................................28
6.1.1 Assumptions.............................................................................................28
6.1.2 Dependencies............................................................................................28
6.2 Implementation Methodologies............................................................................29
6.2.1
6.3 Competitive Advantages of the project:................................................................34

7 Results and Analysis 35


7.1 Analysis and Result Discussion............................................................................35

8 Conclusion and Future Scope 36


8.1 Conclusion.............................................................................................................36
8.2 Limitations............................................................................................................36
8.3 Future Enhancement..............................................................................................37

References 38

9 Appendix A 39
9.1 Linear Regression.................................................................................................39
9.2 Logistic Regression...............................................................................................39
9.3 Random Forest Classifier......................................................................................40
9.4 Pandas and Numpy.................................................................................................40

10 Data-Dictionary 41
10.1 Data-Dictionary .................................................................................................. 41

Chapter 1

Project Overview

1.1 Introduction

The Machine Marathon takes out-of-the-box models and applies them to different datasets.

This marathon serves three main purposes:

First, you’ll build intuition for model-to-problem fit. Which models are robust to missing
data? Which models handle categorical features well? Yes, you can dig through textbooks to
find the answers, but you’ll learn better by seeing it in action.

Second, this project will teach you the invaluable skill of prototyping models quickly. In the
real world, it’s often difficult to know which model will perform best without simply trying
them.

Finally, this exercise helps you master the workflow of model building. For example, you’ll
get to practice….


1.1.1 Background Introduction

Our project is not intended for any specific user; every person who uses the internet and has an
account or a social media identity can benefit from it. Anyone who goes through the hassle of
typing all their details manually to log in or create an account can use this to make it much easier.

1.1.2 Motivation

Our motivation is to give the user a platform where he/she has to fill in the details only once;
then, every time that information is required, our app will fill it in automatically. We want the
system to learn the user's basic details every time he/she fills in anything manually and to store
them for future use. Our system's main goal is to save your time and effort in creating new
accounts and filling login pages in day-to-day use as much as possible.


1.2 Problem Definition

The Machine Marathon takes out-of-the-box models and applies them to different datasets.
This marathon serves three main purposes:
First, you’ll build intuition for model-to-problem fit. Which models are robust to missing data?
Which models handle categorical features well? Yes, you can dig through textbooks to find the
answers, but you’ll learn better by seeing it in action.
Second, this project will teach you the invaluable skill of prototyping models quickly. In the
real world, it’s often difficult to know which model will perform best without simply trying
them.
Finally, this exercise helps you master the workflow of model building. For example, you’ll get
to practice

1.3 Current Systems

In the current system:
• All the processes were scattered
• Lack of data
• Many categories to process and preprocess
• Manual data entry at some levels
• Data importing was manual
• Data cleaning and analysis were manual

1.4 The Problems with Current System

• Takes up a lot of space: the biggest downfall of manual document filing is the amount of space it can take up.
• Prone to damage and being misplaced.
• Hard to make changes.
• Slow access time.
• Lack of security.
• Higher cost.


1.4.1 Advantages over Current System

The only advantage of the existing system was that data was managed manually under supervision, so
there was a degree of trust in the data; on the other hand, this was time consuming.

1.5 Goals and Objectives

The main objective of the marathon is to carry out the steps below together to achieve an
accurate result (a minimal sketch of this workflow follows the list):
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature Modeling
• Technology Used: Python, Django and Sklearn
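A minimal sketch of this workflow is shown below. The dataset file name, the target column, and the chosen model are illustrative assumptions only, not fixed project choices, and all features are assumed to be numeric.

# Minimal sketch of the marathon workflow; the file name, column names and
# chosen model are illustrative assumptions, and all features are assumed numeric.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Importing data
df = pd.read_csv("dataset.csv")

# Cleaning data: drop duplicates and fill missing numeric values with the median
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Splitting into train/test sets
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-processing / transformations: standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Feature modeling: fit an out-of-the-box model and evaluate it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))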


1.6 Scope and Applications

Our project is not intended for any specific user; every person who uses the internet and has an
account or a social media identity can benefit from it. Anyone who goes through the hassle of
typing all their details manually to log in or create an account can use this to make it much easier.

Chapter 2

Review Of Literature

2.1 Machine Marathon

2.1.1 Description

ABSTRACT: This project is about making the user's life easier using machine learning and Python.
We provide assistance to users in filling up their forms, saving them a lot of time in entering their
basic details during everyday use of the internet. Our goal is to give the user a platform where he/she
has to fill in the details only once; then, every time that information is required, our app will fill it in
automatically. We want the system to learn the user's basic details every time he/she fills in anything
manually and to store them for future use. Our system's main goal is to save your time and effort in
creating new accounts and filling login pages in day-to-day use as much as possible.

2.1.2 Pros

The only advantage of the existing system was that data was managed manually under supervision,
so there was a degree of trust in the data; on the other hand, this was time consuming.


2.1.3 Cons

• Takes up a lot of space: the biggest downfall of manual document filing is the amount of space it can take up.

• Prone to damage and being misplaced.

• Hard to make changes.

• Slow access time.

• Lack of security.

• Higher cost.


2.2 Technological Review


Our project is not intended for any specific user; every person who uses the internet and has an account or
a social media identity can benefit from it. Anyone who goes through the hassle of typing all their details
manually to log in or create an account can use this to make it much easier.

DEVELOPMENT & DEPLOYMENT REQUIREMENTS

• PYTHON
Python is a widely used high-level programming language for general-purpose programming. Apart from
being an open-source programming language, Python is a great object-oriented, interpreted, and interactive
programming language. Python combines remarkable power with very clear syntax. It has modules,
classes, exceptions, very high-level dynamic data types, and dynamic typing.

• DJANGO
Django is a web framework written in Python. Like other Python web frameworks, Django enables
developers to build custom web applications without writing large amounts of boilerplate code. It further
makes it easier for developers to maintain and update web applications by keeping the code base
clean and readable.
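As a hedged illustration only, the user details that the system auto-fills could be stored through a Django model such as the sketch below; the model and field names are assumptions for illustration, not the project's actual schema.

# Hypothetical Django model for the basic details the system could auto-fill;
# the model name and fields are illustrative assumptions, not the actual schema.
from django.db import models

class UserProfile(models.Model):
    full_name = models.CharField(max_length=100)
    email = models.EmailField(unique=True)
    phone = models.CharField(max_length=15, blank=True)
    address = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.full_name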

• SciKit-learn
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
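A small sketch of scikit-learn's uniform estimator interface, which is what makes trying different out-of-the-box models easy; the toy data below is purely illustrative.

# Every scikit-learn estimator exposes the same fit/predict interface, so a
# model can be swapped with a one-line change; toy data for illustration only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

for model in (RandomForestClassifier(n_estimators=10), SVC()):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[1, 1]]))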

• Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn for
themselves.
The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically without human
intervention or assistance and to adjust their actions accordingly.

2.2.2 Development Tools
• MYSQL

MySQL, pronounced either "My S-Q-L" or "My Sequel," is an open-source relational database
management system. It is based on the Structured Query Language (SQL), which is used for adding,
removing, and modifying information in the database. Standard SQL commands, such as SELECT, DROP,
INSERT, and UPDATE, can be used with MySQL.

• Pycharm

PyCharm is an integrated development environment (IDE) used in computer programming, specifically


for the Python language. It is developed by the Czech company JetBrains. It provides code analysis, a
graphical debugger, an integrated unit tester, integration with version control systems (VCSes), and
supports web development with Django as well as Data Science with Anaconda.
PyCharm is cross platform with Windows, macOS and Linux versions. The Community Edition is
released under the Apache License, and there is also Professional Edition with extra features – released
under a proprietary license.

Chapter 3

Requirement Analysis

To be used efficiently, all computer software needs certain hardware components or other software
resources to be present on a computer. These prerequisites are known as (computer) system
requirements and are often used as a guideline as opposed to an absolute rule. Most software
defines two sets of system requirements: minimum and recommended.

3.1 Platform Requirement :

3.1.1 Supportive Operating Systems :

The supported operating systems for the client include Windows XP and later.
We, as the developers of this project, will follow the given responsibilities:
1. Completing the tasks within the deadlines.
2. Informing our internal and external guides regularly about our performance.
3. Maintaining the logbook.
4. Implementing the project in a well-planned manner.
5. Dividing the tasks equally among the project members.

3.2 Software Requirement :


The Software Requirements in this project include:
a. Python SDK
b. PyCharm IDE
c. SciKit-learn
d. MS Excel
e. MySQL, Machine Learning, Django


• Pycharm

PyCharm is an integrated development environment (IDE) used in computer programming,


specifically for the Python language. It is developed by the Czech company JetBrains. It
provides code analysis, a graphical debugger, an integrated unit tester, integration with
version control systems (VCSes), and supports web development with Django as well as Data
Science with Anaconda.
PyCharm is cross platform with Windows, macOS and Linux versions. The Community
Edition is released under the Apache License, and there is also Professional Edition with extra
features – released under a proprietary license.

3.3 Hardware Requirement :

The most common set of requirements defined by any operating system or software application
is the physical computer resources, also known as hardware. A hardware requirements
list is often accompanied by a hardware compatibility list (HCL), especially in case of
operating systems. An HCL lists tested, compatible, and sometimes incompatible hardware
devices for a particular operating system or application.

Components      Minimum                                 Recommended
Processor       Intel Core i3-2100 (2nd generation)     Intel Core i5 (5th generation)
RAM             4 GB                                    8 GB
Camera          HD 720p webcam                          Full HD 1080p webcam
Disk            128 GB                                  512 GB


3.4 Database Requirement :

MYSQL

MySQL, pronounced either "My S-Q-L" or "My Sequel," is an open-source relational database
management system. It is based on the Structured Query Language (SQL), which is used for
adding, removing, and modifying information in the database. Standard SQL commands, such
as SELECT, DROP, INSERT, and UPDATE, can be used with MySQL.
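A hedged sketch of how the application could read and write such a database from Python using the mysql-connector-python package; the connection details, database name, and table are placeholders, not the project's actual configuration.

# Illustrative only: credentials, database and table names are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app_user", password="secret", database="machine_marathon"
)
cursor = conn.cursor()

# Adding a record with a parameterized INSERT statement
cursor.execute(
    "INSERT INTO user_details (full_name, email) VALUES (%s, %s)",
    ("Jainam Shah", "jainam@example.com"),
)
conn.commit()

# Reading records back
cursor.execute("SELECT full_name, email FROM user_details")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()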

Chapter 4

System Design and Architecture

4.1 System Architecture

Figure 4.1: System Architecture


4.2 Use Case Diagram

4.3 Class Diagram

Figure 4.3: Class Diagram

4.4 Block Diagram

Figure 4.4: Block Diagram

4.5 E-R Diagram

Figure 4.5: E-R Diagram

Chapter 5

Methodology

5.1 Modular Description


This marathon consists of three main modules.

First, you’ll build intuition for model-to-problem fit. Which models are robust to missing data? Which
models handle categorical features well? Yes, you can dig through textbooks to find the answers, but
you’ll learn better by seeing it in action.

Second, this project will teach you the invaluable skill of prototyping models quickly. In the real
world, it’s often difficult to know which model will perform best without simply trying them.

Finally, this exercise helps you master the workflow of model building. For example, you’ll get to
practice…

The main modules of the marathon carry out the steps below together to achieve an accurate
result (a sketch of these modules chained into a single pipeline follows the list):
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature Modeling
• Technology Used: Python, Django and Sklearn
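These modules map naturally onto scikit-learn's Pipeline and ColumnTransformer; the sketch below chains pre-processing, transformations, and modeling into one object. The column names are assumptions used only for illustration.

# Sketch of the pre-processing, transformation and modeling modules chained
# into a single pipeline; the column names are illustrative assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

numeric_cols = ["age", "income"]           # assumed numeric features
categorical_cols = ["city", "occupation"]  # assumed categorical features

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=100)),
])
# model.fit(X_train, y_train) would then run every module in order.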

Chapter 6

Implementation Details

6.1 Assumptions And Dependencies

6.1.1 Assumptions

It is assumed that our project is not intended for any specific user; every person who uses the
internet and has an account or a social media identity can benefit from it. Anyone who goes
through the hassle of typing all their details manually to log in or create an account can use
this to make it much easier.

6.1.2 Dependencies
The intended system depends on the tools and technologies below:
PYTHON
Python is a widely used high-level programming language for general-purpose programming. Apart
from being an open-source programming language, Python is a great object-oriented, interpreted, and
interactive programming language. Python combines remarkable power with very clear syntax. It has
modules, classes, exceptions, very high-level dynamic data types, and dynamic typing.

DJANGO
Django is a web framework written in Python. Like other Python web frameworks, Django enables
developers to build custom web applications without writing large amounts of boilerplate code. It further
makes it easier for developers to maintain and update web applications by keeping the code base
clean and readable.

SciKit-learn
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions in the future
based on the examples that we provide. The primary aim is to allow computers to learn automatically
without human intervention or assistance and to adjust their actions accordingly.


6.2 Implementation Methodologies

This system is implemented in three phases.

First, you’ll build intuition for model-to-problem fit. Which models are robust to missing data? Which models handle categorical
features well? Yes, you can dig through textbooks to find the answers, but you’ll learn better by seeing it in action.

Second, this project will teach you the invaluable skill of prototyping models quickly. In the real world, it’s often difficult to know
which model will perform best without simply trying them.

Finally, this exercise helps you master the workflow of model building. For example, you’ll get to practice and follow the steps
below, carried out together to achieve an accurate result (a sketch of a quick model-prototyping loop follows the list):
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature Modeling
• Technology Used: Python, Django and Sklearn
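A minimal sketch of the quick-prototyping phase is shown below: several out-of-the-box models are cross-validated on the same prepared data. The bundled breast-cancer dataset is used only as a stand-in for the marathon datasets.

# Quick model prototyping: same data, several stock models, compare the scores.
# The bundled breast-cancer dataset is a stand-in for the marathon datasets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))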

6.2.1 Step I Implementation

• Home Page
• Model Selection
• Output
Chapter 7

Results and Analysis

7.1 Analysis and Result Discussion

Chapter 8

Conclusion and Future Scope

8.1 Conclusion

It can be concluded that, in the existing system, data was managed manually under supervision, so
there was a degree of trust in the data, but the process was time consuming. All the processes were
scattered, data was lacking, there were many categories to process and preprocess, data entry was
manual at some levels, and data importing, cleaning, and analysis were all manual.

8.2 Limitations

• Takes up a lot of space: the biggest downfall of manual document filing is the amount of space it can take up.
• Prone to damage and being misplaced.
• Hard to make changes.
• Slow access time.
• Lack of security.
• Higher cost.


8.3 Future Enhancement

• In the future, the system will be designed to be SAS-based.
• Modular development has been chosen because of the large volume of data.
• We are planning a cloud-based solution, as many users face server and storage issues while
processing.
• We propose cloud-based servers and GPUs for the future.
• A mobile app for uploading data and reviewing result charts can be developed in the future.

Chapter 9

Appendix A

9.1 Linear Regression

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the
residual sum of squares between the observed targets in the dataset, and the targets predicted
by the linear approximation.

Parameters
fit_intercept : bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in
calculations (i.e. data is expected to be centered).

normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be
normalized before regression by subtracting the mean and dividing by the l2-norm. If you
wish to standardize, please use StandardScaler before calling fit on an estimator with
normalize=False.

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.

n_jobs : int, default=None
The number of jobs to use for the computation. This will only provide speedup for n_targets >
1 and sufficient large problems. None means 1 unless in a joblib.parallel_backend context. -1
means using all processors. See Glossary for more details.

positive : bool, default=False
When set to True, forces the coefficients to be positive. This option is only supported for
dense arrays.

New in version 0.24.

Attributes

coef_ : array of shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during
the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is
passed, this is a 1D array of length n_features.

rank_ : int
Rank of matrix X. Only available when X is dense.

singular_ : array of shape (min(X, y),)


Singular values of X. Only available when X is dense.

intercept_ : float or array of shape (n_targets,)


Independent term in the linear model. Set to 0.0 if fit_intercept = False.
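A short usage sketch of the estimator and the attributes described above; the data is illustrative only.

# Fit y = 1 + 2*x on illustrative data and read back the learned parameters.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X.ravel()

reg = LinearRegression(fit_intercept=True)
reg.fit(X, y)
print(reg.coef_)       # approximately [2.]
print(reg.intercept_)  # approximately 1.0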

9.2 Logistic Regression

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the
‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option
is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’,
‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-
cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can
handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit
floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal
formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2
regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization
is only supported by the ‘saga’ solver.

Read more in the User Guide.

Parameters
penalty : {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’
Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers
support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not
supported by the liblinear solver), no regularization is applied.

New in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)

dual : bool, default=False
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with
liblinear solver. Prefer dual=False when n_samples > n_features.

tol : float, default=1e-4
Tolerance for stopping criteria.

C : float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines,
smaller values specify stronger regularization.

fit_intercept : bool, default=True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

intercept_scaling : float, default=1
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case,
x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to
intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling
* synthetic_feature_weight.

Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To
lessen the effect of regularization on synthetic feature weight (and therefore on the intercept)
intercept_scaling has to be increased.

class_weight : dict or ‘balanced’, default=None


Weights associated with classes in the form {class_label: weight}. If not given, all classes are
supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.

New in version 0.17: class_weight=’balanced’

random_state : int, RandomState instance, default=None


Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.

solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’


Algorithm to use in the optimization problem.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large
ones.

For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
‘liblinear’ is limited to one-versus-rest schemes.

‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty

‘liblinear’ and ‘saga’ also handle L1 penalty

‘saga’ also supports ‘elasticnet’ penalty

‘liblinear’ does not support setting penalty='none'

Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately
the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.

Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in 0.22.

max_iter : int, default=100
Maximum number of iterations taken for the solvers to converge.

multi_class : {‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’


If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the
loss minimised is the multinomial loss fit across the entire probability distribution, even when
the data is binary. ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if
the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.

Changed in version 0.22: Default changed from ‘ovr’ to ‘auto’ in 0.22.

verbose : int, default=0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just
erase the previous solution. Useless for liblinear solver. See the Glossary.

New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.

n_jobs : int, default=None
Number of CPU cores used when parallelizing over classes if multi_class=’ovr’”. This
parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is
specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.

l1_ratio : float, default=None
The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'.
Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to
using penalty='l1'. For 0 < l1_ratio <1, the penalty is a combination of L1 and L2.

Attributes
classes_ : ndarray of shape (n_classes, )
A list of class labels known to the classifier.

coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary. In particular, when
multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to
outcome 0 (False).

intercept_ : ndarray of shape (1,) or (n_classes,)


Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the
given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds
to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).

n_iter_ : ndarray of shape (n_classes,) or (1, )


Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element.
For liblinear solver, only the maximum number of iteration across all classes is given.

Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed
max_iter. n_iter_ will now report at most max_iter.
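A brief usage sketch of the parameters described above; the toy data is illustrative only.

# Binary classification on toy data using the default lbfgs solver with an
# L2 penalty; the data and values printed are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100)
clf.fit(X, y)
print(clf.predict([[1.5], [3.5]]))   # expected: [0 1]
print(clf.predict_proba([[2.5]]))    # class probabilities for one sample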


9.3 Random Forest Classifier

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of
the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample
size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset
is used to build each tree.

Read more in the User Guide.

Parameters
n_estimators : int, default=100
The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

criterion : {“gini”, “entropy”}, default=”gini”


The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and
“entropy” for the information gain. Note: this parameter is tree-specific.

max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all
leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2


The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.

If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum
number of samples for each split.

Changed in version 0.18: Added float values for fractions.

min_samples_leaf : int or float, default=1


The minimum number of samples required to be at a leaf node. A split point at any depth will only be
considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.

If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum
number of samples for each node.

Changed in version 0.18: Added float values for fractions.

min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a
leaf node. Samples have equal weight when sample_weight is not provided.

max_features : {“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a fraction and round(max_features * n_features) features are considered at
each split.

If “auto”, then max_features=sqrt(n_features).

If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

If “log2”, then max_features=log2(n_features).

If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even
if it requires to effectively inspect more than max_features features.

max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in
impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the
number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.

min_impurity_split : float, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold,
otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease
in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed
in 1.0 (renaming of 0.25). Use min_impurity_decrease instead.
bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each
tree.

oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the generalization accuracy.

n_jobs : int, default=None
The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for
more details.

random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if
bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features). See Glossary for details.

verbose : int, default=0
Controls the verbosity when fitting and predicting.

warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble,
otherwise, just fit a whole new forest. See the Glossary.

class_weight : {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None


Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to
have weight one. For multi-output problems, a list of dicts can be provided in the same order as the
columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column
in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1,
1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class
frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the
bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if
sample_weight is specified.

ccp_alpha : non-negative float, default=0.0


Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost
complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal
Cost-Complexity Pruning for details.

New in version 0.22.

max_samples : int or float, default=None


If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.

If int, then draw max_samples samples.

If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).

New in version 0.22.

Attributes
base_estimator_ : DecisionTreeClassifier
The child estimator template used to create the collection of fitted sub-estimators.

estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.

classes_ : ndarray of shape (n_classes,) or a list of such arrays


The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).

n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes for each output
(multi-output problem).

n_features_ : int
The number of features when fit is performed.

n_outputs_ : int
The number of outputs when fit is performed.

feature_importances_ : ndarray of shape (n_features,)


The impurity-based feature importances.

oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when
oob_score is True.

oob_decision_function_ : ndarray of shape (n_samples, n_classes)


Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might
be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_
might contain NaN. This attribute exists only when oob_score is True.
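A short usage sketch of the estimator and the attributes described above; the synthetic dataset is illustrative only.

# Fit a small random forest on synthetic data and inspect a few attributes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.n_classes_)             # number of classes seen during fit
print(forest.feature_importances_)   # impurity-based feature importances
print(forest.oob_score_)             # out-of-bag accuracy estimate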

9.4 Pandas and NumPy

Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely
important for scientific computations. Both NumPy and Pandas have emerged to be essential libraries for
any scientific computation, including machine learning, in python due to their intuitive syntax and high-
performance matrix computation capabilities.

This section provides an overview of the common functionalities of NumPy and Pandas. We will
see the similarity of these libraries with existing toolboxes in R and MATLAB. This similarity and
added flexibility have resulted in the wide acceptance of Python in the scientific community. Topics
covered are:

1. Overview of NumPy
2. Overview of Pandas
3. Using Matplotlib


What is NumPy?

NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open-source module of Python which
provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential
part of the machine learning ecosystem, NumPy, along with machine learning modules like Scikit-learn,
Pandas, Matplotlib, TensorFlow, etc., completes the Python machine learning ecosystem.

NumPy provides the essential multi-dimensional array-oriented computing functionalities designed for
high-level mathematical functions and scientific computation. Numpy can be imported into the notebook
using

>>> import numpy as np

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements of the same
type (homogeneous), i.e., integers, strings, or characters, usually integers. In NumPy, dimensions are
called axes. The number of axes is called the rank.

There are several ways to create an array in NumPy, such as np.array, np.zeros, np.ones, etc. Each of them
provides some flexibility.

np.array

>>> a = np.array([1, 2, 3])
>>> type(a)
<type 'numpy.ndarray'>
>>> b = np.array((3, 4, 5))
>>> type(b)
<type 'numpy.ndarray'>

np.ones

>>> np.ones( (3,4), dtype=np.int16 )
array([[ 1,  1,  1,  1],
       [ 1,  1,  1,  1],
       [ 1,  1,  1,  1]])

np.full

>>> np.full( (3,4), 0.11 )
array([[ 0.11,  0.11,  0.11,  0.11],
       [ 0.11,  0.11,  0.11,  0.11],
       [ 0.11,  0.11,  0.11,  0.11]])

np.arange

>>> np.arange( 10, 30, 5 )
array([10, 15, 20, 25])
>>> np.arange( 0, 2, 0.3 )    # it accepts float arguments
array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8])

np.linspace

>>> np.linspace(0, 5/3, 6)
array([0.        , 0.33333333, 0.66666667, 1.        , 1.33333333, 1.66666667])

np.random.rand

>>> np.random.rand(2,3)
array([[ 0.55365951,  0.60150511,  0.36113117],
       [ 0.5388662 ,  0.06929014,  0.07908068]])

np.empty

>>> np.empty((2,3))
array([[ 0.21288689,  0.20662218,  0.78018623],
       [ 0.35294004,  0.07347101,  0.54552084]])

Some of the important attributes of a NumPy object are:

1. ndim: displays the dimension of the array
2. shape: returns a tuple of integers indicating the size of the array
3. size: returns the total number of elements in the NumPy array
4. dtype: returns the type of elements in the array, i.e., int64, character
5. itemsize: returns the size in bytes of each item
6. reshape: reshapes the NumPy array
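A short illustrative example of these attributes (the exact dtype and itemsize shown may differ by platform):

>>> a = np.array([[1, 2, 3], [4, 5, 6]])
>>> a.ndim
2
>>> a.shape
(2, 3)
>>> a.size
6
>>> a.dtype
dtype('int64')
>>> a.itemsize
8
>>> a.reshape(3, 2)
array([[1, 2],
       [3, 4],
       [5, 6]])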

NumPy array elements can be accessed using indexing. Below are some of the useful examples:

• A[2:5] will print items 2 to 4. Indexing in NumPy arrays starts from 0.
• A[2::2] will print items from 2 to the end, skipping every second item.
• A[::-1] will print the array in reverse order.
• A[1:] will print from row 1 to the end.

This section covers these and some important attributes of the NumPy array object in detail.

Vectors and Machine learning

Machine learning uses vectors. Vectors are one-dimensional arrays. It can be represented either as a row or
as a column array.

What are vectors? A vector quantity is one that is defined by a magnitude and a direction. For example,
force is a vector quantity. It is defined by the magnitude of force as well as a direction. It can be
represented as an array [a,b] of 2 numbers = [2,180] where ‘a’ may represent the magnitude of 2 Newton
and 180 (‘b’) represents the angle in degrees.

Another example, say a rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, and also a
slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s
velocity may be represented by the following vector: [10, 50, 5000] which represents the speed in each of
x, y, and z-direction.

Similarly, vectors have several usages in Machine Learning, most notably to represent observations and
predictions.

For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam,
clickbait) based on what we know about them. For each video, we would have a vector representing what
we know about it, such as: [10.5, 5.2, 3.25, 7.0]. This vector could represent a video that lasts 10.5 minutes,
but only 5.2% of viewers watch it for more than a minute, it gets 3.25 views per day on average, and it was
flagged 7 times as spam.

As you can see, each axis may have a different meaning. Based on this vector, our Machine Learning
system may predict that there is an 80% probability that it is a spam video, 18% that it is clickbait, and 2%
that it is a good video. This could be represented as the following vector: class_probabilities =
[0.8,0.18,0.02].

As can be observed, vectors can be used in Machine Learning to define observations and predictions. The
properties representing the video, i.e., duration, percentage of viewers watching for more than a minute are
called features. 

Since the majority of the time spent building machine learning models goes into data processing, it is
important to be familiar with the libraries that can help in processing such data.

Why NumPy and Pandas over regular Python arrays?

In python, a vector can be represented in many ways, the simplest being a regular python list of numbers.
Since Machine Learning requires lots of scientific calculations, it is much better to use NumPy’s ndarray,
which provides a lot of convenient and optimized implementations of essential mathematical operations on
vectors.

Vectorized operations perform faster than matrix manipulation operations performed using loops in python.
For example, to carry out a 100 * 100 matrix multiplication, vector operations using NumPy are two orders
of magnitude faster than performing it using loops.
Some ways in which NumPy arrays are different from normal Python arrays are:

1. If you assign a single value to an ndarray slice, it is copied across the whole slice

NumPy array:

>>> a = np.array([1, 2, 5, 7, 8])
>>> a[1:3] = -1
>>> a
array([ 1, -1, -1,  7,  8])

Regular Python array:

>>> b = [1, 2, 5, 7, 8]
>>> b[1:3] = -1
TypeError: can only assign an iterable

So, it is easier to assign values to a slice of an array in a NumPy array as compared to a normal array
wherein it may have to be done using loops.

2. ndarray slices are actually views on the same data buffer. If you modify it, it is going to modify the original
ndarray as well.

NumPy array slice:

>>> a = np.array([1, 2, 5, 7, 8])
>>> a_slice = a[1:5]
>>> a_slice[1] = 1000
>>> a
array([   1,    2, 1000,    7,    8])
# Original array was modified

Regular Python array slice:

>>> a = [1, 2, 5, 7, 8]
>>> b = a[1:5]
>>> b[1] = 3
>>> print(a)
[1, 2, 5, 7, 8]
>>> print(b)
[2, 3, 7, 8]

If we need a copy of the NumPy array, we need to use the copy method, as in another_slice =
a[2:6].copy(). If we modify another_slice, a remains the same.

3. The way multidimensional arrays are accessed using NumPy is different from how they are accessed in
normal python arrays. The generic format in NumPy multi-dimensional arrays is:

Array[row_start_index:row_end_index, column_start_index: column_end_index]

NumPy arrays can also be accessed using boolean indexing. For example,

>>> a = np.arange(12).reshape(3, 4)

array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])

>>> rows_on = np.array([True, False, True])

>>> a[rows_on , : ]      # Rows 0 and 2, all columns

array([[ 0,  1,  2,  3],

      [ 8,  9, 10, 11]])

NumPy arrays are capable of performing all basic operations such as addition, subtraction, element-wise
product, matrix dot product, element-wise division, element-wise modulo, element-wise exponents and
conditional operations.

An important feature with NumPy arrays is broadcasting.

In general, when NumPy expects arrays of the same shape but finds that this is not the case, it applies the
so-called broadcasting rules.

Basically, there are 2 rules of Broadcasting to remember:

1. For arrays that do not have the same rank, a 1 will be prepended to the shape of the smaller-ranked array
until the ranks match. For example, when adding arrays A and B of shapes (3,3) and (3,) [rank 2 and rank 1],
a 1 will be prepended to the dimensions of array B to make it (1,3) [rank 2]. The two shapes are compatible
when their dimensions are equal or either one of the dimensions is 1.
2. When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1
are stretched or “copied” to match the other. For example, upon adding a 2D array A of shape (3,3) to a 2D
ndarray B of shape (1, 3). NumPy will apply the above rule of broadcasting. It shall stretch the array B and
replicate the first row 3 times to make array B of dimensions (3,3) and perform the operation.
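A small illustrative example of rule 2, where B of shape (1, 3) is stretched across the rows of A:

>>> A = np.ones((3, 3))
>>> B = np.array([[10, 20, 30]])   # shape (1, 3)
>>> (A + B).shape
(3, 3)
>>> A + B
array([[11., 21., 31.],
       [11., 21., 31.],
       [11., 21., 31.]])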

NumPy provides basic mathematical and statistical functions like mean, min, max, sum, prod, std, var,
summation across different axes, transposing of a matrix, etc.

A particular NumPy feature of interest is solving a system of linear equations. NumPy has a function to
solve linear equations. For example,

2x + 6y = 6

5x + 3y = -9

Can be solved in NumPy using

>>> coeffs  = np.array([[2, 6], [5, 3]])

>>> depvars = np.array([6, -9])

>>> solution = np.linalg.solve(coeffs, depvars)

>>> solution

array([-3.,  2.])

What is Pandas?

Similar to NumPy, Pandas is one of the most widely used python libraries in data science. It provides high-
performance, easy to use structures and data analysis tools. Unlike NumPy library which provides objects
for multi-dimensional arrays, Pandas provides in-memory 2d table object called Dataframe. It is like a
spreadsheet with column names and row labels.

Hence, with 2d tables, pandas is capable of providing many additional functionalities like creating pivot
tables, computing columns based on other columns and plotting graphs. Pandas can be imported into
Python using:

>>> import pandas as pd

Some commonly used data structures in pandas are:

1. Series objects: 1D array, similar to a column in a spreadsheet
2. DataFrame objects: 2D table, similar to a spreadsheet
3. Panel objects: dictionary of DataFrames, similar to sheets in MS Excel

A Pandas Series object is created using the pd.Series function. Each row is provided with an index and,
by default, is assigned numerical values starting from 0. Like NumPy, Pandas also provides basic
mathematical functionalities like addition, subtraction, conditional operations and broadcasting.
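For instance, a small illustrative Series with an explicit index, with a scalar broadcast across every element:

>>> s = pd.Series([68, 83, 112], index=["alice", "bob", "charles"])
>>> s + 2
alice       70
bob         85
charles    114
dtype: int64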

Pandas dataframe object represents a spreadsheet with cell values, column names, and row index labels.
Dataframe can be visualized as dictionaries of Series. Dataframe rows and columns are simple and intuitive
to access. Pandas also provide SQL-like functionality to filter, sort rows based on conditions. For example,

>>> people_dict = {
...     "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
...     "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
...     "children": pd.Series([0, 3], index=["charles", "bob"]),
...     "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
... }

>>> people = pd.DataFrame(people_dict)

>>> people

>>> people[people["birthyear"] < 1990]

New columns and rows can be easily added to the dataframe. In addition to the basic functionalities, pandas
dataframe can be sorted by a particular column.

Dataframes can also be easily exported and imported from CSV, Excel, JSON, HTML and SQL database.
Some other essential methods that are present in dataframes are:

1. head(): returns the top 5 rows in the dataframe object
2. tail(): returns the bottom 5 rows in the dataframe
3. info(): prints the summary of the dataframe
4. describe(): gives a nice overview of the main aggregated values over each column

What is matplotlib?

Matplotlib is a 2d plotting library which produces publication quality figures in a variety of hardcopy
formats and interactive environments. Matplotlib can be used in Python scripts, Python and IPython shell,
Jupyter Notebook, web application servers and GUI toolkits.

matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Majority of
plotting commands in pyplot have MATLAB analogs with similar arguments. Let us take a couple of
examples:

Example 1: Plotting a line graph

>>> import matplotlib.pyplot as plt
>>> plt.plot([1,2,3,4])
>>> plt.ylabel('some numbers')
>>> plt.show()

Example 2: Plotting a histogram

>>> import matplotlib.pyplot as plt
>>> x = [21,22,23,4,5,6,77,8,9,10,31,32,33,34,35,36,37,18,49,50,100]
>>> num_bins = 5
>>> plt.hist(x, num_bins, facecolor='blue')
>>> plt.show()

Summary

Hence, we observe that NumPy and Pandas make matrix manipulation easy. This flexibility makes them
very useful in Machine Learning model development.

Chapter 10

Data-Dictionary
