by
Jainam Shah
Internal guide
(Prof. )
External guide
We declare that this written submission represents our ideas in our own words
and, where others' ideas or words have been included, we have adequately cited
and referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented,
fabricated, or falsified any idea/data/fact/source in our submission. We
understand that any violation of the above will be cause for disciplinary action
by the Institute and can also invoke penal action from the sources which have
not been properly cited or from whom proper permission has not been taken
when needed.
Jainam Shah
Abstract
This project is about making users' lives easier using machine learning and Python.
We assist the user in filling up their forms, saving them a lot of time in entering
their basic details during everyday use of the internet. Our goal is to give users a
platform where they fill in their details once and for all; then, every time that
information is required, our app will fill it in for them. We want the system to
learn the user's basics every time they fill in anything themselves and store it for
future use. Our system's main goal is to save as much of your time and effort as
possible when creating new accounts and logging in day to day.
Jainam Shah
BE CE
AIT College
Contents
1 Project Overview 1
1.1 Introduction...........................................................................................................1
1.1.1 Background Introduction..........................................................................2
1.1.2 Motivation................................................................................................2
1.2 Problem Definition................................................................................................3
1.3 Current Systems....................................................................................................3
1.4 The Problems with Current System......................................................................3
1.4.1 Advantages Over Current System............................................................4
1.5 Goals and Objectives.............................................................................................4
1.6 Scope and Applications.........................................................................................5
2 Review Of Literature 6
2.1 Machine Marathon................................................................................................6
2.1.1 Description...............................................................................................6
2.1.2 Pros...........................................................................................................6
2.1.3 Cons..........................................................................................................7
2.2 Technological Review...........................................................................................8
2.2.1 Python, Django, Scikit-Learn, Machine Learning.......................................8
2.2.2 MySQL......................................................................................................9
3 Requirement Analysis 11
3.1 Platform Requirement :.........................................................................................11
3.1.1 Supportive Operating Systems :...............................................................11
3.2 Software Requirement :.........................................................................................11
3.3 Hardware Requirement :.......................................................................................12
3.4 Database Requirement :........................................................................................13
4 System Design and Architecture 14
4.4 Block Diagram.....................................................................................................17
4.5 E-R Diagram........................................................................................................18
5 Methodology 20
5.1 Modular Description.............................................................................................20
6 Implementation Details 28
6.1 Assumptions And Dependencies...........................................................................28
6.1.1 Assumptions.............................................................................................28
6.1.2 Dependencies............................................................................................28
6.2 Implementation Methodologies............................................................................29
6.3 Competitive Advantages of the project:................................................................34
References 38
9 Appendix A 39
9.1 Linear Regression.................................................................................................39
9.2 Logistic Regression...............................................................................................39
9.3 Random Forest Classifier......................................................................................40
9.4 Pandas and Numpy.................................................................................................40
10 Data-Dictionary 41
10.1 Data-Dictionary .................................................................................................. 41
Chapter 1
Project Overview
1.1 Introduction
The Machine Marathon takes out-of-the-box models and applies them to different datasets.
First, you’ll build intuition for model-to-problem fit. Which models are robust to missing
data? Which models handle categorical features well? Yes, you can dig through textbooks to
find the answers, but you’ll learn better by seeing it in action.
Second, this project will teach you the invaluable skill of prototyping models quickly. In the
real world, it’s often difficult to know which model will perform best without simply trying
them.
Finally, this exercise helps you master the workflow of model building. For example, you’ll
get to practice….
1.1.1 Background Introduction
Our project is not intended for any specific user: every person who uses the internet and
has an account or social-media identity can benefit from it. Anyone who goes through
the hassle of writing all their details themselves for every login or account creation can
use it to make things much easier.
1.1.2 Motivation
Our motivation is to give users a platform where they fill in their details once and for
all; then, every time that information is required, our app will fill it in for them. We want
the system to learn the user's basics every time they fill in anything themselves, and to
store it for future use. Our system's main goal is to save as much of your time and effort
as possible when creating new accounts and logging in day to day.
In the current system:
• All the processes were scattered
• Lack of data
• Many categories to process and preprocess
• Manual data entry at some levels
• Data importing was manual
• Data cleaning and analysis were manual
1.4.1 Advantages Over Current System
The only advantage of the existing system was that data was managed manually under
supervision, so there was a manual trust factor; on the other hand, it was time-consuming.
1.6 Scope and Applications
Our project is not intended for any specific user: every person who uses the internet and
has an account or social-media identity can benefit from it. Anyone who goes through
the hassle of writing all their details themselves for every login or account creation
can use it to make things much easier.
Chapter 2
Review Of Literature
2.1 Machine Marathon

2.1.1 Description
ABSTRACT: This project is about making users' lives easier using machine learning and Python.
We assist the user in filling up their forms, saving them a lot of time in entering their basic
details during everyday use of the internet. Our goal is to give users a platform where they fill
in their details once and for all; then, every time that information is required, our app will fill
it in for them. We want the system to learn the user's basics every time they fill in anything
themselves and store it for future use. Our system's main goal is to save as much of your time
and effort as possible when creating new accounts and logging in day to day.
2.1.2 Pros
The only advantage of the existing system was that data was managed manually under
supervision, so there was a manual trust factor; on the other hand, it was time-consuming.
2.1.3 Cons
• Slow access time
• Lack of security
• Higher cost
2.2 Technological Review

2.2.1 Python, Django, Scikit-Learn, Machine Learning

• PYTHON
Python is a widely used high-level programming language for general-purpose programming. Apart from
being an open-source programming language, Python is a great object-oriented, interpreted, and interactive
language. Python combines remarkable power with very clear syntax. It has modules,
classes, exceptions, very high-level dynamic data types, and dynamic typing.
• DJANGO
Django is a web framework written in Python. Like other Python web frameworks, Django
enables developers to build custom web applications without writing additional boilerplate code. It further
makes it easier for developers to maintain and update web applications by keeping the code base
clean and readable.
• SciKit-learn
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
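As a small illustration of scikit-learn's uniform estimator API (a sketch only, not part of the project's code; the toy dataset and model choice here are arbitrary):

```python
# Fit one of scikit-learn's out-of-the-box models on a bundled toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)              # every estimator exposes fit/predict/score
accuracy = clf.score(X_test, y_test)   # held-out accuracy
```

Swapping in a different estimator (e.g. a support vector machine) requires changing only the constructor line, which is what makes quick model prototyping practical.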
• Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn for
themselves.
The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically, without human
intervention or assistance, and adjust actions accordingly.
2.2.2 Development Tools
• MYSQL
MySQL, pronounced either "My S-Q-L" or "My Sequel," is an open-source relational database
management system. It is based on the Structured Query Language (SQL), which is used for adding,
removing, and modifying information in the database. Standard SQL commands, such as CREATE, DROP,
INSERT, and UPDATE, can be used with MySQL.
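The standard SQL statements above can be sketched in a self-contained way; this example uses Python's built-in sqlite3 purely for illustration (in the project, MySQL would be accessed through a driver instead, and the table and column names below are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE / INSERT / UPDATE / SELECT / DROP: the standard SQL commands
cur.execute("CREATE TABLE users (name TEXT, email TEXT)")
cur.execute("INSERT INTO users VALUES (?, ?)", ("jainam", "j@example.com"))
cur.execute("UPDATE users SET email = ? WHERE name = ?",
            ("new@example.com", "jainam"))
row = cur.execute("SELECT email FROM users WHERE name = ?",
                  ("jainam",)).fetchone()
cur.execute("DROP TABLE users")
conn.close()
```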
• Pycharm
Chapter 3
Requirement Analysis
To be used efficiently, all computer software needs certain hardware components or other software
resources to be present on a computer. These prerequisites are known as (computer) system
requirements and are often used as a guideline as opposed to an absolute rule. Most software
defines two sets of system requirements: minimum and recommended.
3.1 Platform Requirement :

3.1.1 Supportive Operating Systems :

The supported operating systems for the client include Windows XP and later.
We, as the developers of this project, will follow the given responsibilities:
1. Completing the tasks within the deadlines.
2. Informing our internal and external guides regularly about our performance.
3. Maintaining the logbook.
4. Implementing the project in a well-planned manner.
5. Dividing the tasks equally among the project members.
3.3. Hardware Requirement :
The most common set of requirements defined by any operating system or software application
is the physical computer resources, also known as hardware. A hardware requirements
list is often accompanied by a hardware compatibility list (HCL), especially in the case of
operating systems. An HCL lists tested, compatible, and sometimes incompatible hardware
devices for a particular operating system or application.
3.4 Database Requirement :

• MYSQL
MySQL, pronounced either "My S-Q-L" or "My Sequel," is an open-source relational database
management system. It is based on the Structured Query Language (SQL), which is used for
adding, removing, and modifying information in the database. Standard SQL commands, such
as CREATE, DROP, INSERT, and UPDATE, can be used with MySQL.
Chapter 4
System Design and Architecture
4.3 Class Diagram

Figure 4.3: Class Diagram

4.4 Block Diagram

Figure 4.4: Block Diagram

4.5 E-R Diagram

Figure 4.5: E-R Diagram
Chapter 5
Methodology
5.1 Modular Description

The main modules of the marathon concurrently carry out the objectives below to achieve accurate
results:
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature Modeling
• Technology Used: Python, Django and Sklearn
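A minimal sketch of how the modules above chain together with scikit-learn (the toy data and variable names are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Importing data: a synthetic stand-in for the real import step
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cleaning data: drop any rows containing NaNs (none here, shown for shape)
mask = ~np.isnan(X).any(axis=1)
X, y = X[mask], y[mask]

# Splitting it into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Pre-processing / transformations: standardize the features
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Feature modeling: fit a model and evaluate it
model = LogisticRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
```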
Chapter 6
Implementation Details
6.1 Assumptions And Dependencies

6.1.1 Assumptions
It is assumed that our project is not intended for any specific user: every person
who uses the internet and has an account or social-media identity can benefit from
our project. Anyone who goes through the hassle of writing all their details
themselves for every login or account creation can use it to make things much easier.
6.1.2 Dependencies
The intended system depends on the tools and technologies below:
PYTHON
Python is a widely used high-level programming language for general-purpose programming. Apart
from being an open-source programming language, Python is a great object-oriented, interpreted, and
interactive language. Python combines remarkable power with very clear syntax. It has
modules, classes, exceptions, very high-level dynamic data types, and dynamic typing.
DJANGO
Django is a web framework written in Python. Like other Python web frameworks, Django
enables developers to build custom web applications without writing additional boilerplate code. It further
makes it easier for developers to maintain and update web applications by keeping the code base
clean and readable.
SciKit-learn
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions in the future
based on the examples that we provide. The primary aim is to allow computers to learn automatically,
without human intervention or assistance, and adjust actions accordingly.
6.2 Implementation Methodologies

The implementation concurrently follows the steps below to achieve accurate results:
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature Modeling
• Technology Used: Python, Django and Sklearn
Chapter 8
Conclusion and Future Scope

8.1 Conclusion
In the existing system, data was managed manually under supervision, so there was a
manual trust factor, but the process was time-consuming. The processes were scattered;
data was lacking; there were many categories to process and preprocess; data entry was
manual at some levels; and data importing, cleaning, and analysis were all manual.
8.2 Limitations
• Takes up a lot of space: the biggest downfall of manual document filing is the amount
of space it can take up.
• Prone to damage and being misplaced.
• Hard to make changes.
• Slow access time.
• Lack of security.
• Higher cost.
Chapter 8. Conclusion and Future Scope
References
[1] https://www.ion.org/publications/abstract.cfm?articleID=9289
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.259.1400&rep=rep1&type=pdf
[3] https://www.osti.gov/biblio/7097424
Chapter 9
Appendix A
9.1 Linear Regression

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the
residual sum of squares between the observed targets in the dataset, and the targets predicted
by the linear approximation.
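A minimal usage sketch (the data here is made up; since y = 2x + 1 exactly, the least-squares fit recovers the slope and intercept):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # single feature
y = np.array([1.0, 3.0, 5.0, 7.0])          # y = 2x + 1 exactly

# Coefficients minimize the residual sum of squares; here the fit is exact.
model = LinearRegression().fit(X, y)
slope = model.coef_[0]        # approximately 2.0
intercept = model.intercept_  # approximately 1.0
```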
Parameters

fit_intercept : bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in
calculations (i.e. data is expected to be centered).

normalize : bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be
normalized before regression by subtracting the mean and dividing by the l2-norm. If you
wish to standardize, please use StandardScaler before calling fit on an estimator with
normalize=False.

copy_X : bool, default=True
If True, X will be copied; else, it may be overwritten.

n_jobs : int, default=None
The number of jobs to use for the computation. This will only provide speedup for n_targets >
1 and sufficiently large problems. None means 1 unless in a joblib.parallel_backend context. -1
means using all processors. See Glossary for more details.

positive : bool, default=False
When set to True, forces the coefficients to be positive. This option is only supported for
dense arrays.
Attributes

coef_ : array of shape (n_features,) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during
the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is
passed, this is a 1D array of length n_features.

rank_ : int
Rank of matrix X. Only available when X is dense.
9.2 Logistic Regression

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the
‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option
is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’,
‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-
cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default. It can
handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit
floats for optimal performance; any other input format will be converted (and copied).
The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal
formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2
regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization
is only supported by the ‘saga’ solver.
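A small sketch of fitting the estimator with the default solver and L2 regularization (toy separable data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # separable binary labels

# 'lbfgs' (the default solver) with the default L2 penalty
clf = LogisticRegression(solver="lbfgs", C=1.0).fit(X, y)
preds = clf.predict([[0.5], [4.5]])  # points well inside each class
```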
Parameters

penalty : {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
Used to specify the norm used in the penalization. The 'newton-cg', 'sag' and 'lbfgs' solvers
support only l2 penalties. 'elasticnet' is only supported by the 'saga' solver. If 'none' (not
supported by the liblinear solver), no regularization is applied.
New in version 0.19: l1 penalty with SAGA solver (allowing 'multinomial' + L1)

dual : bool, default=False
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with
liblinear solver. Prefer dual=False when n_samples > n_features.
tol : float, default=1e-4
Tolerance for stopping criteria.

C : float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines,
smaller values specify stronger regularization.

fit_intercept : bool, default=True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

intercept_scaling : float, default=1
Useful only when the solver 'liblinear' is used and self.fit_intercept is set to True. In this case,
x becomes [x, self.intercept_scaling], i.e. a "synthetic" feature with constant value equal to
intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling
* synthetic_feature_weight.
Note: the synthetic feature weight is subject to l1/l2 regularization as all other features. To
lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept),
intercept_scaling has to be increased.

class_weight : dict or 'balanced', default=None
The "balanced" mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large
ones. For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss;
'liblinear' is limited to one-versus-rest schemes. 'saga' also supports the 'elasticnet' penalty.
Note that 'sag' and 'saga' fast convergence is only guaranteed on features with approximately
the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
Changed in version 0.22: The default solver changed from 'liblinear' to 'lbfgs' in 0.22.
max_iter : int, default=100
Maximum number of iterations taken for the solvers to converge.
New in version 0.18: Stochastic Average Gradient descent solver for the 'multinomial' case.

verbose : int, default=0
For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just
erase the previous solution. Useless for liblinear solver. See the Glossary.
New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.

n_jobs : int, default=None
Number of CPU cores used when parallelizing over classes if multi_class='ovr'. This
parameter is ignored when the solver is set to 'liblinear' regardless of whether 'multi_class' is
specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.

l1_ratio : float, default=None
The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'.
Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to
using penalty='l1'. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
Attributes

classes_ : ndarray of shape (n_classes,)
A list of class labels known to the classifier.

coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when
multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to
outcome 0 (False).

intercept_ : ndarray of shape (1,) or (n_classes,)
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the
given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds
to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).

n_iter_ : ndarray of shape (n_classes,) or (1,)
Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed
max_iter. n_iter_ will now report at most max_iter.
9.3 Random Forest Classifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of
the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample
size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset
is used to build each tree.
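For instance (a sketch on a synthetic dataset, not the project's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 100 bootstrapped decision trees; predictions are averaged across the forest
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])  # averaged class probabilities, shape (1, 2)
```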
Parameters

n_estimators : int, default=100
The number of trees in the forest.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all
leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum
number of samples for each split.

min_samples_leaf : int or float, default=1
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum
number of samples for each node.

min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a
leaf node. Samples have equal weight when sample_weight is not provided.

max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
The number of features to consider when looking for the best split:
If float, then max_features is a fraction and round(max_features * n_features) features are considered at
each split.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even
if it requires to effectively inspect more than max_features features.
max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in
impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

min_impurity_split : float, default=None
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold,
otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease
in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed
in 1.0 (renaming of 0.25). Use min_impurity_decrease instead.

bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each
tree.

oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the generalization accuracy.

n_jobs : int, default=None
The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees.
None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for
more details.
random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if
bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features). See Glossary for details.

verbose : int, default=0
Controls the verbosity when fitting and predicting.

warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble;
otherwise, just fit a whole new forest. See the Glossary.

class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None
Note that for multioutput (including multilabel) weights should be defined for each class of every column
in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1,
1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class
frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the
bootstrap sample for every tree grown.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if
sample_weight is specified.

max_samples : int or float, default=None
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).
Attributes

base_estimator_ : DecisionTreeClassifier
The child estimator template used to create the collection of fitted sub-estimators.

estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.

n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes for each output
(multi-output problem).

n_features_ : int
The number of features when fit is performed.

n_outputs_ : int
The number of outputs when fit is performed.

oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when
oob_score is True.
9.4 Pandas and Numpy

Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely
important for scientific computations. Both NumPy and Pandas have emerged to be essential libraries for
any scientific computation, including machine learning, in python due to their intuitive syntax and high-
performance matrix computation capabilities.
In this post, we will provide an overview of the common functionalities of NumPy and Pandas. We will
see the similarity of these libraries with existing toolboxes in R and MATLAB. This similarity and
added flexibility have resulted in wide acceptance of Python in the scientific community lately. Topics
covered are:
1. Overview of NumPy
2. Overview of Pandas
3. Using Matplotlib
What is NumPy?
NumPy stands for 'Numerical Python' or 'Numeric Python'. It is an open-source module of Python which
provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential
part of the machine-learning ecosystem, NumPy, along with modules like Scikit-learn,
Pandas, Matplotlib, TensorFlow, etc., completes the Python machine-learning ecosystem.
NumPy provides the essential multi-dimensional array-oriented computing functionalities designed for
high-level mathematical functions and scientific computation. NumPy can be imported into the notebook
using import numpy as np.
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements that are all of the same type (homogeneous), i.e., all integers, all strings, or all characters, usually numbers. In NumPy, dimensions are called axes, and the number of axes is called the rank.
There are several ways to create an array in NumPy, such as np.array, np.zeros, np.ones, etc. Each of them provides some flexibility.
Command to create an array — examples:

np.zeros          >>> type(a)                      # a created with np.zeros
                  <type 'numpy.ndarray'>
np.array          >>> b = np.array((3, 4, 5))
                  >>> type(b)
                  <type 'numpy.ndarray'>
np.ones           >>> np.ones((3, 4), dtype=np.int16)
np.linspace       >>> np.linspace(0, 5/3, 6)
np.random.rand    >>> np.random.rand(2, 3)         # random values; output varies
np.empty          >>> np.empty((2, 3))             # uninitialized values; output varies
NumPy array elements can be accessed using indexing, much like elements of Python lists.
The session covers these and some important attributes of the NumPy array object in detail.
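A minimal sketch of this indexing and of the attributes just mentioned (the array `a` here is illustrative):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3x4 array holding 0..11

# Basic indexing
print(a[0])        # first row
print(a[1, 2])     # single element: 6
print(a[:, 1])     # second column: [1 5 9]

# Important attributes of the ndarray object
print(a.shape)     # (3, 4) -- size along each axis
print(a.ndim)      # 2      -- number of axes (the rank)
print(a.dtype)     # element type (platform dependent, e.g. int64)
print(a.size)      # 12     -- total number of elements
```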
Machine learning uses vectors. Vectors are one-dimensional arrays; they can be represented either as row or as column arrays.
What are vectors? A vector quantity is one defined by both a magnitude and a direction. For example, force is a vector quantity: it is defined by the magnitude of the force as well as a direction. It can be represented as an array [a, b] of two numbers, e.g., [2, 180], where ‘a’ represents a magnitude of 2 Newtons and ‘b’ represents an angle of 180 degrees.
Another example, say a rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, and also a
slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s
velocity may be represented by the following vector: [10, 50, 5000] which represents the speed in each of
x, y, and z-direction.
Similarly, vectors have several usages in Machine Learning, most notably to represent observations and
predictions.
For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam, clickbait) based on what we know about them. For each video, we would have a vector representing what we know about it, such as [10.5, 5.2, 3.25, 7.0]. This vector could represent a video that lasts 10.5 minutes, of which only 5.2% of viewers watch for more than a minute; it gets 3.25 views per day on average, and it was flagged 7 times as spam.
As you can see, each axis may have a different meaning. Based on this vector, our Machine Learning
system may predict that there is an 80% probability that it is a spam video, 18% that it is clickbait, and 2%
that it is a good video. This could be represented as the following vector: class_probabilities =
[0.8,0.18,0.02].
As can be observed, vectors can be used in Machine Learning to define observations and predictions. The
properties representing the video, i.e., duration, percentage of viewers watching for more than a minute are
called features.
Since the majority of the time in building machine learning models is spent on data processing, it is important to be familiar with the libraries that can help in processing such data.
In python, a vector can be represented in many ways, the simplest being a regular python list of numbers.
Since Machine Learning requires lots of scientific calculations, it is much better to use NumPy’s ndarray,
which provides a lot of convenient and optimized implementations of essential mathematical operations on
vectors.
Vectorized operations perform faster than matrix manipulation operations performed using loops in python.
For example, to carry out a 100 * 100 matrix multiplication, vector operations using NumPy are two orders
of magnitude faster than performing it using loops.
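A rough sketch of this comparison, multiplying two 100 × 100 matrices first with a pure-Python triple loop and then with NumPy’s vectorized dot product; exact timings vary by machine, but the vectorized version is dramatically faster:

```python
import time
import numpy as np

n = 100
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def matmul_loops(A, B):
    """Matrix product computed with plain Python loops."""
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.time()
C_loops = matmul_loops(A, B)
t_loops = time.time() - t0

t0 = time.time()
C_vec = A.dot(B)              # vectorized: delegates to optimized routines
t_vec = time.time() - t0

print("loops: %.4fs, vectorized: %.6fs" % (t_loops, t_vec))
```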
Some ways in which NumPy arrays are different from normal Python arrays are:
1. If you assign a single value to an ndarray slice, it is copied across the whole slice.
So it is easier to assign values to a slice of a NumPy array than to a slice of a normal Python array, where it may have to be done using loops.
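A minimal sketch of this slice-assignment behavior (the arrays here are illustrative):

```python
import numpy as np

a = np.arange(8)       # [0 1 2 3 4 5 6 7]
a[2:5] = -1            # the single value is copied across the whole slice
print(a)               # [ 0  1 -1 -1 -1  5  6  7]

# A plain Python list needs a loop for the same assignment
b = list(range(8))
for i in range(2, 5):
    b[i] = -1
print(b)
```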
2. ndarray slices are actually views on the same data buffer. If you modify it, it is going to modify the original
ndarray as well.
# With a plain Python list, a slice is a copy:
>>> a = [1, 2, 5, 7, 8]
>>> b = a[1:5]
>>> b[1] = 3
>>> print(a)
[1, 2, 5, 7, 8]
>>> print(b)
[2, 3, 7, 8]

# With an ndarray, a slice is a view on the same data buffer:
>>> a = np.array([1, 2, 5, 7, 8])
>>> a_slice = a[1:5]
>>> a_slice[1] = 1000
>>> a
array([   1,    2, 1000,    7,    8])   # original array was modified
If we need a copy of the NumPy array, we need to use the copy method, as in another_slice = a[2:6].copy(). If we modify another_slice, a remains the same.
3. The way multidimensional arrays are accessed using NumPy is different from how they are accessed in normal Python arrays. The generic format for NumPy multi-dimensional array access is a single bracket with one index (or slice) per axis, e.g., arr[i, j], rather than the chained arr[i][j] indexing used with nested lists.
NumPy arrays can also be accessed using boolean indexing. For example,
>>> a = np.arange(12).reshape(3, 4)
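Continuing from the array above, a minimal sketch of boolean indexing:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

mask = a % 2 == 0      # a boolean array of the same shape as a
print(mask)
print(a[mask])         # only the even elements: [ 0  2  4  6  8 10]

# Boolean indexing can also select whole rows
rows = np.array([True, False, True])
print(a[rows])         # rows 0 and 2
```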
NumPy arrays are capable of performing all basic operations such as addition, subtraction, element-wise
product, matrix dot product, element-wise division, element-wise modulo, element-wise exponents and
conditional operations.
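A short sketch of these element-wise and matrix operations (the 2 × 2 arrays are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)       # element-wise addition
print(A - B)       # element-wise subtraction
print(A * B)       # element-wise product
print(A.dot(B))    # matrix dot product: [[19 22] [43 50]]
print(B / A)       # element-wise division
print(B % A)       # element-wise modulo
print(A ** 2)      # element-wise exponent
print(A > 2)       # conditional operation, gives a boolean array
```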
In general, when NumPy expects arrays of the same shape but finds that this is not the case, it applies the so-called broadcasting rules.
1. For arrays that do not have the same rank, a 1 is prepended to the shape of the lower-rank array until the ranks match. For example, when adding arrays A and B of shapes (3,3) and (3,) [rank 2 and rank 1], a 1 is prepended to the shape of B to make it (1,3) [rank 2]. Two arrays are compatible when, axis by axis, their dimensions are equal or one of the dimensions is 1.
2. When one of the compared dimensions is 1, the other is used. In other words, dimensions of size 1 are stretched or “copied” to match the other. For example, when adding a 2D array A of shape (3,3) to a 2D ndarray B of shape (1,3), NumPy applies the above rule: it stretches array B, replicating its single row 3 times to make B of dimensions (3,3), and then performs the operation.
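A minimal sketch of these two broadcasting rules in action:

```python
import numpy as np

A = np.ones((3, 3))           # shape (3, 3), rank 2
B = np.array([1, 2, 3])       # shape (3,),  rank 1

# Rule 1: B's shape is treated as (1, 3).
# Rule 2: the size-1 axis is stretched to 3, as if B's row were
# replicated 3 times, before the element-wise addition.
C = A + B
print(C)
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]
```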
NumPy provides basic mathematical and statistical functions like mean, min, max, sum, prod, std, var,
summation across different axes, transposing of a matrix, etc.
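A brief sketch of these mathematical and statistical functions on an illustrative 2 × 3 array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.mean())          # 3.5
print(a.min(), a.max())  # 1 6
print(a.sum())           # 21
print(a.prod())          # 720
print(a.std(), a.var())  # standard deviation and variance
print(a.sum(axis=0))     # sum down each column: [5 7 9]
print(a.sum(axis=1))     # sum across each row:  [ 6 15]
print(a.T)               # transpose: shape (3, 2)
```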
A particular NumPy feature of interest is solving a system of linear equations. NumPy has a function to
solve linear equations. For example,
2x + 6y = 6
5x + 3y = -9
>>> coeffs = np.array([[2, 6], [5, 3]])
>>> depvars = np.array([6, -9])  # right-hand-side values
>>> solution = np.linalg.solve(coeffs, depvars)
>>> solution
array([-3.,  2.])
What is Pandas?
Similar to NumPy, Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Unlike NumPy, which provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called a DataFrame. It is like a spreadsheet, with column names and row labels.
Hence, with 2D tables, Pandas is capable of providing many additional functionalities, like creating pivot tables, computing columns based on other columns, and plotting graphs. Pandas can be imported into Python using:
>>> import pandas as pd
A Pandas Series object is created using the pd.Series function. Each row is given an index; by default, rows are assigned numerical indices starting from 0. Like NumPy, Pandas also provides basic mathematical functionality like addition, subtraction, conditional operations, and broadcasting.
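A minimal sketch of a Series and these operations (the values are illustrative):

```python
import pandas as pd

s = pd.Series([2, -1, 3, 5])   # rows get a default integer index 0..3
print(s)

# Arithmetic and conditional operations broadcast, as in NumPy
print(s + 10)       # adds 10 to every element
print(s[s > 0])     # keeps only the positive values

# A custom index can replace the default numerical one
weights = pd.Series([68, 83], index=["alice", "bob"])
print(weights["alice"])
```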
A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels. A DataFrame can be visualized as a dictionary of Series. DataFrame rows and columns are simple and intuitive to access. Pandas also provides SQL-like functionality to filter and sort rows based on conditions. For example,
>>> people_dict = {
...     "children": pd.Series([0, 3], index=["charles", "bob"]),
...     # (other columns omitted in this excerpt)
... }
>>> people = pd.DataFrame(people_dict)
>>> people
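A self-contained sketch of building such a DataFrame from a dictionary of Series; the weight column is an illustrative addition alongside the children column from the example:

```python
import pandas as pd

people_dict = {
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "weight": pd.Series([95, 83, 68], index=["charles", "bob", "alice"]),
}
people = pd.DataFrame(people_dict)
print(people)      # alice has no "children" entry, so it appears as NaN

# SQL-like filtering: rows where weight is below 90
print(people[people["weight"] < 90])
```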
New columns and rows can be easily added to the DataFrame. In addition to the basic functionalities, a Pandas DataFrame can be sorted by a particular column.
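A minimal sketch of adding a computed column and sorting by a column (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "alice"], "age": [40, 30]})

# New column computed from an existing one
df["birth_decade"] = (2017 - df["age"]) // 10 * 10

# Sort rows by a particular column
print(df.sort_values(by="age"))   # alice (age 30) comes first
```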
DataFrames can also be easily exported to and imported from CSV, Excel, JSON, HTML, and SQL databases.
Some other essential methods present in DataFrames include head, tail, info, and describe.
What is matplotlib?
Matplotlib is a 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments. Matplotlib can be used in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers, and GUI toolkits.
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. The majority of plotting commands in pyplot have MATLAB analogs with similar arguments. Let us take a couple of examples:
(Only fragments of the two plotting examples survive in this excerpt: a histogram drawn with num_bins = 5, each example ending with plt.show().)
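A minimal sketch in the same spirit, using the non-interactive Agg backend and synthetic data; in an interactive session, plt.show() would display each figure instead of savefig:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")               # non-interactive backend for scripts
import matplotlib.pyplot as plt

# A MATLAB-style line plot
x = np.linspace(-2, 2, 100)
plt.plot(x, x ** 2, label="y = x^2")
plt.legend()
plt.savefig("line.png")
plt.clf()

# A histogram with a fixed number of bins
num_bins = 5
data = np.random.randn(1000)
plt.hist(data, bins=num_bins)
plt.savefig("hist.png")
```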
Summary
Hence, we observe that NumPy and Pandas make matrix manipulation easy. This flexibility makes them very useful in machine learning model development.
Chapter 10
Data-Dictionary