Build ETL Using Python

UNDERSTANDING MACHINE LEARNING
AND HOW TO IMPLEMENT IT

Achmad Adyatma Ardi
Programmer, Data Scientist, Engineer
*Corresponding author : achmad541997@gmail.com
Saturday, June 4, 2022
Abstract
Data science is a discipline in which we answer questions with data and logical reasoning
instead of mere assumption, so that each answer can be given with confidence. In business,
even though some uncertainty is always present, it is wise to base decisions on the data
you have. Data can be your best friend in making important decisions. On this occasion, I
will try to analyze data with the aim of answering important business-related questions
using the Python programming language with the Pandas and Matplotlib libraries. Hope you
enjoy it!
Keywords : data science, python programming, sales analysis, python pandas, python
matplotlib, answer questions
Table of contents :
Ch. 1 Introduction
Ch. 2 Mean, median and mode
Ch. 3 Standard deviation
Ch. 4 Percentile
Ch. 5 Data distribution
Ch. 6 Scatter plot
Ch. 7 Regression
Ch. 8 Scale
Ch. 9 Train / test
Ch. 10 Decision tree
Ch. 11 Confusion matrix
Ch. 12 Clustering
Ch. 13 Grid search
Ch. 14 Categorical data
Ch. 15 K – means
Ch. 16 Bootstrap aggregation
Ch. 17 Cross validation
Ch. 18 AUC – ROC curve
Additional attachment :
Attachment. 1 Example of a database
Attachment. 2
Attachment. 3
Attachment. 4

Ch. 1 Introduction
Machine learning is a type of artificial intelligence (AI) that allows software applications
to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
The major difference between machine learning and statistics is their purpose. Machine
learning models are designed to make the most accurate predictions possible. Statistical
models are designed for inference about the relationships between variables.

I. 1 Where to start ?
We will go back to mathematics and study statistics, and how to calculate important numbers
based on data sets.
We will also learn how to use various Python modules to get the answers we need. Then, we
will learn how to make functions that are able to predict the outcome based on what we have
learned.

I. 2 Data set
In the mind of a computer, a data set is any collection of data. It can be anything from an
array to a complete database.
Example of an array :
[99,86,87,88,111,86,103,87,94,78,77,85,86]
Example of a database (see Attachment. 1)
In machine learning it is common to work with very large data sets. In this tutorial we will
try to make it as easy as possible to understand the different concepts of machine learning,
and we will work with small, easy-to-understand data sets.

I. 3 Data types
To analyze data, it is important to know what type of data we are dealing with. We can split
the data types into three main categories :
1. Numerical
Numerical data are numbers, and can be split into two numerical categories :
- Discrete data : limited to integers. Example : the number of cars passing by
- Continuous data : can take infinitely many values. Example : the price of an item, or the
  size of an item
2. Categorical
Categorical data are values that cannot be measured up against each other. Example : a color
value, or any yes/no values
3. Ordinal
Ordinal data are like categorical data, but can be measured up against each other. Example :
school grades where A is better than B and so on
By knowing the data type of your data source, you will be able to know what technique to use
when analyzing it.

Ch. 2 Mean, median and mode
In machine learning (and in mathematics) there are often three values that interest us :
1. Mean – the average value
2. Median – the mid point value
3. Mode – the most common value
figure. 1 Example of calculating the mean, median, and mode using the Python programming
language (Notepad++ v8.4.2)

Ch. 3 Standard deviation
Standard deviation is a number that describes how spread out the values are. A low standard
deviation means that most of the numbers are close to the mean (average) value. A high
standard deviation means that the values are spread out over a wider range.
figure. 2 Example of calculating the std. deviation of 2 data arrays, namely speed1 and
speed2 (Notepad++ v8.4.2)
speed1 has a mean of 86.42 and a standard deviation of 0.90. It means that most of the values
are within 0.90 of the mean value, which is 86.42. Meanwhile, speed2 has a mean of 77.42 and
a standard deviation of 37.84. It means that most of the values are within 37.84 of the mean
value, which is 77.42.
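
The calculations described in Ch. 2 and Ch. 3 can be sketched with Python's built-in
statistics module. This is only an illustration, not the code from the original figures: the
speed array is the example array from I. 2, while speed1 and speed2 are assumed inputs (the
original arrays are not reproduced in the text), chosen because they give results close to
the values quoted above.

```python
import statistics

# Example array from I. 2 (the Speed values).
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_value = statistics.mean(speed)      # sum of the values divided by their count
median_value = statistics.median(speed)  # middle value after sorting
mode_value = statistics.mode(speed)      # value that appears most often

print("mean:", round(mean_value, 2))
print("median:", median_value)
print("mode:", mode_value)

# Assumed arrays (hypothetical, not taken from the original figure).
speed1 = [86, 87, 88, 86, 87, 85, 86]      # values close together
speed2 = [32, 111, 138, 28, 59, 77, 97]    # values spread far apart

# pstdev is the population standard deviation (divides by the number of values).
std1 = statistics.pstdev(speed1)
std2 = statistics.pstdev(speed2)

print("speed1 mean / std:", statistics.mean(speed1), std1)
print("speed2 mean / std:", statistics.mean(speed2), std2)
```

statistics.pstdev treats the array as the whole population. statistics.stdev, which treats it
as a sample and divides by n - 1, would give slightly larger numbers.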

Ch. 4 Percentile
Percentiles are used in statistics to give you a number that describes the value that a given
percent of the values are lower than. Let's say we have an array of the ages of all the
people that live in a street.
Ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]
What is the 75th percentile ? The answer is 43, meaning that 75% of the people are 43 or
younger.
figure. 3 Example of calculating percentiles of an array named ages (Notepad++ v8.4.2)

Ch. 5 Data distribution
A data distribution is a function that determines the values of a variable and quantifies
their relative frequency; it transforms raw data into graphical form to give valuable
information. It is important to understand the kind of distribution that a population has,
because this helps in applying proper statistical techniques / methods.

V. 1 Types of Data Distribution / Statistical Distribution
1. Normal distribution (Gaussian)
The normal distribution is one of the most important distributions. It is also called the
Gaussian distribution after the German mathematician Carl Friedrich Gauss. It fits the
probability distribution of many events, e.g. IQ scores, heartbeat etc. The curve of a normal
distribution is also known as the bell curve because of its bell-shaped curve. When it is
sampled with the numpy.random.normal( ) method, it takes three parameters :
a. loc – (mean) where the peak of the bell exists
b. scale – (standard deviation) how flat the graph distribution should be
c. size – the shape of the returned array
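
The loc, scale and size parameters map directly onto a call to numpy.random.normal. The
following is a minimal sketch, assuming NumPy is installed. It uses the mean of 5.0, the
standard deviation of 1.0 and the 100000 values that this chapter uses for its histogram; the
fixed seed and the variable names are illustrative choices only.

```python
import numpy as np

# Fixed seed so that repeated runs give the same numbers (illustrative choice).
np.random.seed(42)

# loc = mean (peak of the bell), scale = standard deviation, size = shape of the result.
values = np.random.normal(loc=5.0, scale=1.0, size=100000)

# Count the values in 100 equal-width bins: the data behind a 100-bar histogram.
counts, bin_edges = np.histogram(values, bins=100)

# Share of the values that lie within one standard deviation of the mean.
share_within_one_std = np.mean((values >= 4.0) & (values <= 6.0))

print("sample mean:", round(values.mean(), 2))
print("sample std:", round(values.std(), 2))
print("share between 4.0 and 6.0:", round(share_within_one_std, 2))
```

Passing the same values to matplotlib.pyplot.hist with 100 bins would draw the bars instead
of only counting them.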

figure. 4 Typical normal (Gaussian) data distribution (Notepad++ v8.4.2)
We use the array from the numpy.random.normal( ) method, with 100000 values, to draw a
histogram with 100 bars. We specify that the mean value is 5.0, and the standard deviation is
1.0. This means that the values should be concentrated around 5.0, with most of them no
further than 1.0 from the mean. As you can see from the histogram, most values are between
4.0 and 6.0, with a peak at approximately 5.0.
2. Binomial distribution
Is a discrete distribution. It describes the outcome of binary scenarios, e.g. the toss of a
coin, which will either be heads or tails. It has
3. Poisson distribution
4. Uniform distribution
5. Logistic distribution
6. Multinomial distribution
7. Exponential distribution
8. Chi – square distribution
9. Rayleigh distribution
10. Pareto distribution
11. Zipf distribution

Ch. 6 Scatter plot
Ch. 7 Regression
Ch. 8 Scale
Ch. 9 Train / test
Ch. 10 Decision tree
Ch. 11 Confusion matrix
Ch. 12 Clustering
Ch. 13 Grid search
Ch. 14 Categorical data
Ch. 15 K – means
Ch. 16 Bootstrap aggregation
Ch. 17 Cross validation
Ch. 18 AUC – ROC curve

What is Regression ?
Regression analysis is a reliable method of identifying which variables have an impact on a
topic of interest. The process of performing a regression allows you to confidently determine

which factors matter most, which factors can be ignored, and how these factors influence each
other.
The term regression is used when you try to find the relationship between variables. In
machine learning, and in statistical modeling, that relationship is used to predict the
outcome of future events.
There are many types of regression analysis techniques, and the use of each method depends
upon a number of factors. These factors include the type of target variable, the shape of the
regression line, and the number of independent variables.
1. Linear regression
Linear regression is the practice of statistically calculating a straight line that
demonstrates a relationship between two different variables.
2. Polynomial regression
Is a form of regression analysis in which the relationship between the independent variable
x and the dependent variable y is modelled as an nth degree polynomial in x.
3. Logistic regression
Estimates the probability of an event occurring, such as voted or didn't vote, based on a
given dataset of independent variables. Since the outcome is a probability, the dependent
variable is bounded between 0 and 1.
4. Multiple regression
5. Ridge regression
6. Lasso regression
7. Bayesian linear regression
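
As an illustration of linear regression (item 1 in the list above), the sketch below fits a
straight line through the Age and Speed columns of Attachment. 1 using ordinary least
squares. Choosing these two columns, and the helper names fit_line and predict, are
assumptions made for this example; the text does not say which variables to relate.

```python
# Age and Speed columns from Attachment. 1 (one entry per car).
age = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the products of deviations, and sum of squared deviations of x.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x_new, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x_new + intercept


slope, intercept = fit_line(age, speed)
print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("predicted speed of a 10 year old car:", round(predict(10, slope, intercept), 2))
```

The slope comes out negative, so in this small data set older cars tend to be slower. With
only 13 rows this is an illustration of the technique, not a reliable finding.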

Attachment. 1 Example of a database
Carname   Color   Age   Speed   AutoPass
BMW       red       5      99   Y
Volvo     black     7      86   Y
VW        gray      8      87   N
VW        white     7      88   Y
Ford      white     2     111   Y
VW        white    17      86   Y
Tesla     red       2     103   Y
BMW       black     9      87   Y
Volvo     gray      4      94   N
Ford      white    11      78   N
Toyota    gray     12      77   N
VW        white     9      85   N
Toyota    blue      6      86   Y
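
The table above can be held in Python in several ways. The sketch below is one simple option
that uses only built-in types (a list of dictionaries, one per row); the variable names are
illustrative. Extracting the Speed column gives back the example array shown in I. 2, which
shows how an array and a database relate to each other.

```python
# Column names and rows copied from Attachment. 1.
columns = ("Carname", "Color", "Age", "Speed", "AutoPass")
rows = [
    ("BMW", "red", 5, 99, "Y"),
    ("Volvo", "black", 7, 86, "Y"),
    ("VW", "gray", 8, 87, "N"),
    ("VW", "white", 7, 88, "Y"),
    ("Ford", "white", 2, 111, "Y"),
    ("VW", "white", 17, 86, "Y"),
    ("Tesla", "red", 2, 103, "Y"),
    ("BMW", "black", 9, 87, "Y"),
    ("Volvo", "gray", 4, 94, "N"),
    ("Ford", "white", 11, 78, "N"),
    ("Toyota", "gray", 12, 77, "N"),
    ("VW", "white", 9, 85, "N"),
    ("Toyota", "blue", 6, 86, "Y"),
]

# One dictionary per row, keyed by column name.
cars = [dict(zip(columns, row)) for row in rows]

# A single column of the table is just an array of values.
speeds = [car["Speed"] for car in cars]

# Rows can be filtered on a categorical column such as AutoPass.
passed = [car["Carname"] for car in cars if car["AutoPass"] == "Y"]

print(speeds)
print(len(passed), "of", len(cars), "cars have AutoPass = Y")
```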
Attachment. 2
Attachment. 3
Attachment. 4
