Session 2: Python

Data Science with Python, Session 2
Trainer: Etibar Hüseynli
www.qss.az
QSS Analytics / Research and Development Center. All rights reserved.
Lesson 2: Summary
Topics to be covered
Topic 2: data types in Python, the NumPy and pandas packages, Git and GitHub
Case discussion: Marketing campaign for a bank
Probability Distribution
Probability can be used for more than calculating the likelihood of one event; it can summarize the likelihood of all possible outcomes.
A quantity of interest in probability is called a random variable, and the relationship between each possible outcome of a random variable and its probability is called a probability distribution.
The structure and type of the probability distribution vary with the properties of the random variable, such as whether it is continuous or discrete. This, in turn, affects how the distribution can be summarized and how the most likely outcome and its probability are calculated.
Random Variable
A random variable is a quantity that is produced by a random process.
In probability, a random variable can take on one of many possible values, e.g. events from the state space. A specific value or set of values for a random variable can be assigned a probability.
A random variable can be either discrete or continuous.
Discrete Random Variable
Example: Coin toss
Discrete random variables take on a countable number of distinct values.
Consider an experiment where a coin is tossed three times. If X represents the number of times that the coin comes up heads, then X is a discrete random variable that can only have the values 0, 1, 2, 3 (from no heads in three successive coin tosses to all heads). No other value is possible for X.
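The coin-toss example can be checked by listing every outcome. The sketch below assumes a fair coin (the slide does not say so) and counts how often each value of X occurs among the eight equally likely sequences.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 2**3 equally likely outcomes of three tosses (assumes a fair coin).
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability of each possible value of X.
distribution = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```

Only the values 0, 1, 2 and 3 appear as keys, which matches the statement that no other value is possible for X.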

Continuous Random Variable
Example: Time
A continuous random variable is a random variable that can take any value within an interval, so it has infinitely many possible values.
For example, a random variable measuring the time taken for something to be done is continuous, since there are infinitely many possible times that can be taken.
Discrete Random Variable
The two types of discrete random variables most commonly used in machine learning are binary and categorical.
A binary random variable is a discrete random variable where the finite set of outcomes is {0, 1}.
A categorical random variable is a discrete random variable where the finite set of outcomes is {1, 2, …, K}, where K is the total number of unique outcomes.
Each outcome or event for a discrete random variable has a probability.
Probability of discrete random variable
Probability Distribution
A probability distribution is a summary of probabilities for the values of a random variable.
Important properties of a probability distribution are:
• the expected value (the average value of a random variable)
• the variance (the average squared deviation of values from the expected value)
• skewness
• kurtosis
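The four properties can be computed directly from a table of values and probabilities. The sketch below reuses the three-toss coin distribution as an assumed example. The slides do not define skewness or kurtosis, so it uses the usual standardized third and fourth central moments, with kurtosis reported without subtracting 3.

```python
# Distribution of X = number of heads in three fair coin tosses (assumed example).
distribution = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}


def central_moment(dist, mean, k):
    """k-th central moment: the probability-weighted average of (x - mean)**k."""
    return sum(p * (x - mean) ** k for x, p in dist.items())


expected_value = sum(x * p for x, p in distribution.items())
variance = central_moment(distribution, expected_value, 2)
std_dev = variance ** 0.5

# Standardized third and fourth moments (standard definitions, not from the slides).
skewness = central_moment(distribution, expected_value, 3) / std_dev ** 3
kurtosis = central_moment(distribution, expected_value, 4) / std_dev ** 4

print(expected_value, variance, skewness, kurtosis)
```

The skewness is zero here because the distribution is symmetric around 1.5.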
Skewness
Kurtosis
Discrete random variable distribution
[Figure: probability of each value of the sum, from 2 to 12; x-axis: Sum, y-axis: Probability]
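A chart with sums from 2 to 12 is what the total of two six-sided dice produces. Assuming that is what the slide shows (the extracted text only keeps the axis labels), the distribution can be sketched by enumerating all 36 equally likely rolls.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 36 equally likely rolls of two fair dice (an assumption about the figure).
rolls = list(product(range(1, 7), repeat=2))

# Count how many rolls produce each sum.
counts = Counter(a + b for a, b in rolls)

# Probability of each sum from 2 to 12.
sum_distribution = {s: Fraction(n, len(rolls)) for s, n in sorted(counts.items())}

for total, p in sum_distribution.items():
    print(f"P(Sum = {total}) = {p}")
```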
Continuous Probability Distributions
A continuous probability distribution summarizes the probability for a continuous random variable.
The probability density function, or PDF, defines the probability distribution for a continuous random variable.
• The probabilities of the heights of humans form a Normal distribution.
• The probabilities of movies being a hit form a Power-law distribution.
• The probabilities of income levels form a Pareto distribution.
Continuous Probability Distributions
Temperature readings: 30.6, 31.4, 31.2, 32.1, 32.2, 32.7, 33.4, 33.8, 34.6

Frequency Distribution with Bins and Probability of the Bins
Bin        Frequency   Probability
30 – 31    1           11.1%
31 – 32    2           22.22%
32 – 33    3           33.3%
33 – 34    2           22.22%
34 – 35    1           11.1%
[Figure: Probability Density]
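The binning on this slide can be reproduced in a few lines. This is a minimal sketch that assumes one-degree bins starting at whole numbers, as the bin labels suggest. The density is taken as probability divided by bin width, which is the standard definition rather than something stated on the slide.

```python
from collections import Counter

temperatures = [30.6, 31.4, 31.2, 32.1, 32.2, 32.7, 33.4, 33.8, 34.6]
bin_width = 1  # each bin spans one degree, e.g. 30 to 31

# Assign each reading to the bin that starts at its integer part.
counts = Counter(int(t) for t in temperatures)

table = []
for start in sorted(counts):
    frequency = counts[start]
    probability = frequency / len(temperatures)
    density = probability / bin_width  # probability per degree
    table.append((start, start + bin_width, frequency, probability, density))

for low, high, frequency, probability, density in table:
    print(f"{low} to {high}: frequency={frequency}, "
          f"probability={probability:.1%}, density={density:.3f}")
```

With a bin width of 1 the density equals the bin probability; narrower bins would give different density values.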
Normal Distribution
Example of Normal Distribution
• Diastolic Blood Pressure (axis marks: 50, 82, 110)
• Manufacturing (axis marks: 94 mm, 100 mm, 106 mm)
• Arrival Time at office (axis marks: 7:45 AM, 8:00 AM, 8:15 AM)
Positively Skewed Distribution
Computing Normal Probabilities
To compute normal probabilities, you first convert a normally distributed random variable, X, to a standardized normal random variable, Z, using the transformation formula Z = (X − μ) / σ, where μ is the mean and σ is the standard deviation of X.
The time to download a video is normally distributed, with a mean of 7 seconds and a standard deviation of 2 seconds. Therefore, a download time of 9 seconds is equivalent to 1 standardized unit (1 standard deviation) above the mean, because Z = (9 − 7) / 2 = +1.
A download time of 1 second is equivalent to −3 standardized units (3 standard deviations below the mean), because Z = (1 − 7) / 2 = −3.
The standard deviation is the unit of measurement. In other words, a time of 9 seconds is 2 seconds (1 standard deviation) higher, or slower, than the mean time of 7 seconds. Similarly, a time of 1 second is 6 seconds (3 standard deviations) lower, or faster, than the mean time.
To further illustrate the transformation formula, suppose that another website has a download time for a video that is normally distributed, with a mean μ = 4 seconds and a standard deviation σ = 1 second.
• Comparing these results with the previous ones, you see that a download time of 5 seconds is 1 standard deviation above the mean download time, because Z = (5 − 4) / 1 = +1.
• A time of 1 second is 3 standard deviations below the mean download time, because Z = (1 − 4) / 1 = −3.
With the Z value computed, you look up the normal probability using a table of values from the cumulative standardized normal distribution.
Suppose you wanted to find the probability that the download time for the first example is less than 9 seconds. Recall that transforming to standardized Z units, given a mean of 7 seconds and a standard deviation of 2 seconds, leads to a Z value of +1.00.
With this value, you use the table to find the cumulative area under the normal curve less than (to the left of) Z = +1.00. To read the probability, scan down the Z column of the table until you locate the row for the Z value of interest in tenths (1.0), then read across to the column for the hundredths digit (.00). The probability for Z = 1.00 is 0.8413.
Likewise, for the other website, a time of 5 seconds is 1 standardized unit above the mean time of 4 seconds. Thus, the probability that the download time will be less than 5 seconds is also 0.8413.
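The table lookup can be cross-checked in Python. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+) in place of the printed table. The variable names are illustrative, not from the slides.

```python
from statistics import NormalDist

# The two websites from the example: (mean, standard deviation) in seconds.
first_site = NormalDist(mu=7, sigma=2)
second_site = NormalDist(mu=4, sigma=1)

# Transformation formula Z = (X - mean) / standard deviation.
z_first = (9 - 7) / 2
z_second = (5 - 4) / 1

# Cumulative probability to the left of each value.
p_first = first_site.cdf(9)           # P(X < 9) for the first website
p_second = second_site.cdf(5)         # P(X < 5) for the second website
p_standard = NormalDist().cdf(1.0)    # P(Z < 1.00) on the standardized scale

print(z_first, z_second)
print(round(p_first, 4), round(p_second, 4), round(p_standard, 4))
```

All three probabilities agree because both download times sit exactly one standard deviation above their means.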
Challenge
• What is the probability that the video download time for the first website will be more than 9 seconds?
• What is the probability that the video download time for the first website will be under 7 seconds or over 9 seconds?
• What is the probability that the video download time for the first website will be between 5 and 9 seconds?
Golden rule
• For any normal distribution, 68.26% of the values fall within ±1 standard deviation of the mean. Thus, 68.26% of the download times are between 5 and 9 seconds.
• 95.44% of the values fall within ±2 standard deviations of the mean. Thus, 95.44% of the download times are between 3 and 11 seconds.
• 99.73% of the values fall within ±3 standard deviations of the mean. Thus, 99.73% of the download times are between 1 and 13 seconds.
Example
Therefore, it is unlikely (0.0027, or only 27 in 10,000) that a download time will be so fast or so slow that it takes under 1 second or more than 13 seconds. In general, you can use 6σ (that is, from 3 standard deviations below the mean to 3 standard deviations above the mean) as a practical approximation of the range for normally distributed data.
Rule
• Approximately 68.26% of the values fall within ±1 standard deviation of the mean.
• Approximately 95.44% of the values fall within ±2 standard deviations of the mean.
• Approximately 99.73% of the values fall within ±3 standard deviations of the mean.
Example
How much time (in seconds) will elapse before the fastest 10% of the downloads of the video from the first example are complete?
Because 10% of the videos are expected to download in under X seconds, the area under the normal curve less than this value is 0.1000. Using the body of the table, you search for the area or probability of 0.1000. The closest result is 0.1003, as shown in the table.
Working from this area to the margins of the table, you find that the Z value corresponding to the particular Z row (−1.2) and Z column (.08) is −1.28.
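Once Z is known, the download time follows by inverting the transformation formula: X = μ + Zσ. The sketch below does this for the first website (mean 7, standard deviation 2), using both the table value −1.28 and the more precise value from `statistics.NormalDist.inv_cdf`. This inversion step is a standard result and is not spelled out in the extracted slide text.

```python
from statistics import NormalDist

mean, std_dev = 7, 2  # first website's download time, in seconds

# Z value with 10% of the area to its left, computed instead of read from the table.
z_exact = NormalDist().inv_cdf(0.10)

# Invert Z = (X - mean) / std_dev to get X = mean + Z * std_dev.
x_exact = mean + z_exact * std_dev
x_table = mean + (-1.28) * std_dev  # using the rounded table value

print(round(z_exact, 2), round(x_exact, 2), round(x_table, 2))
```

Both routes give about 4.44 seconds, so 10% of the downloads are expected to finish in under roughly 4.44 seconds.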
Tea break
Finding outliers
Outliers are stragglers (extremely high or extremely low values) in a data set that can throw off your stats. For example, if you were measuring children’s nose length, your average value might be thrown off if Pinocchio was in the class.
An outlier is a piece of data that is an abnormal distance from other points. In other words, it’s data that lies outside the other values in the set.
Outliers
In this set of random numbers, 1 and 201 are outliers:
1, 99, 100, 101, 103, 109, 110, 201
“1” is an extremely low value and “201” is an extremely high value.
Outliers aren’t always that obvious. Let’s say you received the following paychecks last month:
$225, $250, $25, $235.
Your average paycheck is $183.75. But that small paycheck ($25) might be because you went on vacation, so a weekly paycheck average of $183.75 isn’t a true reflection of how much you earned. Your average is actually closer to $237 if you take the outlier ($25) out of the set.
Outliers
Of course, trying to find outliers isn’t always that simple. Your data set may look like this:
61, 10, 32, 19, 22, 29, 36, 14, 49, 3.
You could take a guess that 3 might be an outlier and perhaps 61. But you’d be wrong: by the 1.5 × IQR rule (Q1 = 14, Q3 = 36, IQR = 22, limits −19 and 69), neither value is an outlier in this data set.
Boxplot
A box-and-whisker chart (boxplot) often shows outliers.
Finding outliers
A common way to find outliers is to use the interquartile range (IQR). The IQR contains the middle half of your data, so outliers can be found easily once you know the IQR.
An outlier is defined as any data point that lies more than 1.5 IQRs below the first quartile (Q1) or more than 1.5 IQRs above the third quartile (Q3) of a data set.
High = Q3 + 1.5 × IQR
Low = Q1 − 1.5 × IQR
Finding outliers
Sample Question: Find the outliers for the following data set: 3, 10, 14, 22, 19, 29, 70, 49, 36, 32.
Step 1: Find Q1, Q3 and the IQR.
Step 2: Multiply the IQR by 1.5.
Step 3: Add the amount you found in Step 2 to Q3 from Step 1.
Step 4: Subtract the amount you found in Step 2 from Q1.
Step 5: Insert your low and high values into the data set.
Step 6: Highlight any number below the low value or above the high value you inserted in Step 5.
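Finding IQR
- Sort the data: 3, 10, 14, 19, 22, 29, 32, 36, 49, 70
- Find the median, which splits the data into two halves: (3, 10, 14, 19, 22 | 29, 32, 36, 49, 70)
- Find Q1, the median of the lower half (3, 10, 14, 19, 22): Q1 = 14
- Find Q3, the median of the upper half (29, 32, 36, 49, 70): Q3 = 36
- Find the IQR: IQR = Q3 − Q1 = 36 − 14 = 22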
Finding outliers
Step 2: Multiply the IQR by 1.5:
22 × 1.5 = 33
Step 3: Add the amount you found in Step 2 to Q3 from Step 1:
33 + 36 = 69
This is your upper limit.
Finding outliers
Step 4: Subtract the amount you found in Step 2 from Q1 from Step 1:
14 − 33 = −19
This is your lower limit.
Step 5: Insert your low and high values into your data set, in order:
−19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, 70
Step 6: Highlight any number below the low value or above the high value. Here no value is below −19 and only 70 is above 69, so 70 is the only outlier.
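The six steps can be written as a short function. This is a sketch of the slide's method, in which Q1 and Q3 are the medians of the lower and upper halves of the sorted data. The function names are illustrative, and it assumes at least two values. Library routines such as `numpy.percentile` use other quartile conventions by default and can give slightly different limits.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # for an odd count the middle value is left out
    return median(lower), median(upper)


def find_outliers(values, k=1.5):
    """Return the low limit, the high limit, and the values outside them."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low = q1 - k * iqr
    high = q3 + k * iqr
    return low, high, [v for v in values if v < low or v > high]


data = [3, 10, 14, 22, 19, 29, 70, 49, 36, 32]
low, high, outliers = find_outliers(data)
print(low, high, outliers)
```

For the sample data this reproduces the limits −19 and 69 and flags 70 as the only outlier.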
NumPy
Using NumPy, a developer can perform the following operations:
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.
NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing.
NumPy
NumPy is suited to many applications:
• Image processing
• Signal processing
• Linear algebra
• A plethora of others
NumPy: Arrays
An array in NumPy is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, the number of dimensions of the array is called the rank of the array. A tuple of integers giving the size of the array along each dimension is known as the shape of the array. The array class in NumPy is called ndarray. Elements in NumPy arrays are accessed by using square brackets, and arrays can be initialized from nested Python lists.
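A small example makes these terms concrete. The sketch below builds a 2 × 3 array from nested lists and inspects its class, rank, shape and elements. The variable name `matrix` is illustrative.

```python
import numpy as np

# Initialize a two-dimensional array from nested Python lists.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(type(matrix))   # the ndarray class
print(matrix.ndim)    # rank: number of dimensions
print(matrix.shape)   # size along each dimension
print(matrix.dtype)   # the single element type shared by all items

# Elements are accessed with square brackets and a tuple of indices.
print(matrix[0, 2])   # first row, third column
print(matrix[1])      # the whole second row
```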
NumPy: Arrays
The key difference between an array and a list is that arrays are designed to handle vectorized operations, while a Python list is not. That means that if you apply a function, it is performed on every item in the array, rather than on the whole array object.
Another characteristic is that, once a NumPy array is created, you cannot increase its size. To do so, you have to create a new array. Extending the size, by contrast, is natural for a list.
A NumPy array must have all items of the same data type, unlike a list. This is another significant difference.
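The three differences can be seen side by side in a short sketch. The names are illustrative.

```python
import numpy as np

values_list = [1, 2, 3, 4]
values_array = np.array(values_list)

# Vectorized: the multiplication is applied to every element of the array.
doubled_array = values_array * 2

# The same operator on a list repeats the list instead.
doubled_list = values_list * 2

# Fixed size: np.append returns a new array and leaves the original unchanged.
extended = np.append(values_array, 5)

# Single data type: mixing integers and a float converts everything to float.
mixed = np.array([1, 2, 3.5])

print(doubled_array, doubled_list)
print(values_array.size, extended.size)
print(mixed.dtype)
```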
Pandas
This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.
For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame (a table, basically), then let you do things like:
• Calculate statistics and answer questions about the data, like:
  • What's the average, median, max, or min of each column?
  • Does column A correlate with column B?
  • What does the distribution of data in column C look like?
• Clean the data by doing things like removing missing values and filtering rows or columns by some criteria.
• Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
• Store the cleaned, transformed data back into a CSV, other file or database.
Pandas
It takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example). Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Pandas
When you want to use Pandas for data analysis, you’ll usually use it in one of three different ways:
• Convert a Python list, dictionary or NumPy array to a Pandas data frame.
• Open a local file using Pandas, usually a CSV file, but it could also be a delimited text file (like TSV), Excel, etc.
• Open a remote file or database, like a CSV or a JSON on a website through a URL, or read from a SQL table/database.
Pandas
The general command for opening a file is pd.read_filetype(). There are different file types Pandas can work with, so you would replace “filetype” with the actual file type (like CSV). You would give the path, filename, etc. inside the parentheses. Inside the parentheses you can also pass different arguments that relate to how to open the file. There are numerous arguments, and in order to know all of them, you would have to read the documentation (for example, the documentation for pd.read_csv() contains all the arguments you can pass to this Pandas command).
Pandas
In order to convert a certain Python object (dictionary, lists, etc.), the basic command is pd.DataFrame(). Inside the parentheses you would specify the object(s) you’re creating the data frame from.
You can also save a data frame you’re working on to different kinds of files (like CSV, Excel, JSON and SQL tables). The general code for that is df.to_filetype(filename), for example df.to_csv(filename).
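The round trip from a Python object to a data frame, out to CSV and back, can be sketched as follows. The data is hypothetical, and an in-memory `io.StringIO` buffer stands in for a file path so the example leaves nothing on disk.

```python
import io

import pandas as pd

# Hypothetical data: a dictionary of column name -> list of values.
people = {"name": ["Aysel", "Murad", "Leyla"], "age": [31, 45, 28]}

# Convert the Python object to a data frame.
df = pd.DataFrame(people)

# Save it as CSV; a real path such as "people.csv" could be used instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV text back into a new data frame.
buffer.seek(0)
df_from_csv = pd.read_csv(buffer)

print(df_from_csv)
```

Passing `index=False` keeps the row labels out of the CSV, so the frame read back has the same two columns as the original.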
Pandas: Data viewing
Viewing the data:
• Running the name of the data frame gives you the entire table, but you can also get the first n rows with df.head(n) or the last n rows with df.tail(n).
• df.shape gives you the number of rows and columns.
• df.info() gives you the index, data type and memory information.
• The command s.value_counts(dropna=False) lets you view the unique values and their counts for a Series (such as a single column).
• A very useful command is df.describe(), which outputs summary statistics for numerical columns.
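The viewing commands can be tried on a small frame. The data below is made up for illustration and includes missing values so that `dropna=False` has a visible effect.

```python
import pandas as pd

# Hypothetical data with one missing city and one missing sales figure.
df = pd.DataFrame({
    "city": ["Baku", "Ganja", "Baku", None, "Sumqayit"],
    "year": [1980, 1990, 2000, 2010, 2020],
    "sales": [10.0, 12.5, None, 8.0, 15.5],
})

first_two = df.head(2)       # first n rows
last_two = df.tail(2)        # last n rows
n_rows, n_cols = df.shape    # (rows, columns)

df.info()                    # prints index, dtypes and memory usage

# Unique values and counts for one column, including the missing value.
city_counts = df["city"].value_counts(dropna=False)

# Summary statistics for the numerical columns (year and sales).
summary = df.describe()

print(first_two, last_two, city_counts, summary, sep="\n\n")
```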
Pandas: Statistics
• df.mean() Returns the mean of all columns
• df.corr() Returns the correlation between columns in a data frame
• df.count() Returns the number of non-null values in each data frame column
• df.max() Returns the highest value in each column
• df.min() Returns the lowest value in each column
• df.median() Returns the median of each column
• df.std() Returns the standard deviation of each column
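Each of these methods returns one value per column. The sketch below uses a made-up, all-numeric frame. For frames that also contain text columns, recent pandas versions need `numeric_only=True` for methods such as `df.mean()` and `df.corr()`.

```python
import pandas as pd

# Hypothetical all-numeric data.
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 65, 80],
})

means = df.mean()          # mean of each column
medians = df.median()      # median of each column
std_devs = df.std()        # sample standard deviation of each column
counts = df.count()        # non-null values per column
highest = df.max()         # highest value per column
lowest = df.min()          # lowest value per column
correlations = df.corr()   # pairwise correlation matrix

print(means, medians, std_devs, counts, highest, lowest, correlations, sep="\n\n")
```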
Pandas: Data selection
Selection of Data
• df[col] selects a column and returns the column with label col as a Series.
• df[[col1, col2]] selects several columns and returns them as a new DataFrame.
• s.iloc[0] selects by position, and s.loc['index_one'] selects by index label.
• df.iloc[0,:] selects the first row, and df.iloc[0,0] selects the first element of the first column.
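The selection forms can be compared on a small labelled frame. The data and the index labels other than 'index_one' are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string index labels.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "year": [1980, 1990, 2000]},
    index=["index_one", "index_two", "index_three"],
)

year_column = df["year"]            # one column, returned as a Series
two_columns = df[["name", "year"]]  # several columns, returned as a DataFrame

s = df["year"]
by_position = s.iloc[0]             # first element by position
by_label = s.loc["index_one"]       # element with index label 'index_one'

first_row = df.iloc[0, :]           # first row, all columns
first_element = df.iloc[0, 0]       # first row, first column

print(by_position, by_label, first_element)
print(first_row)
```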
Pandas
You can use different conditions to filter rows. For example, df[df[year] > 1984] gives you only the rows where the column year is greater than 1984. You can use & (and) or | (or) to combine different conditions in your filtering. This is also called boolean filtering.
It is possible to sort values by a certain column in ascending order using df.sort_values(col1), and in descending order using df.sort_values(col2,ascending=False). Furthermore, it’s possible to sort values by col1 in ascending order and then col2 in descending order by using df.sort_values([col1,col2],ascending=[True,False]).
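Filtering and sorting can be sketched on a small frame. The column names and values are made up. Note that each condition needs its own parentheses when combined with `&` or `|`.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [1979, 1984, 1991, 2003],
    "rating": [7.1, 8.0, 6.5, 8.0],
})

# Boolean filtering: keep the rows where the condition is True.
after_1984 = df[df["year"] > 1984]
recent_and_good = df[(df["year"] > 1984) & (df["rating"] >= 8.0)]
old_or_good = df[(df["year"] < 1980) | (df["rating"] >= 8.0)]

# Sorting by one column, ascending and descending.
by_year = df.sort_values("year")
by_rating_desc = df.sort_values("rating", ascending=False)

# Sorting by rating ascending, then by year descending within equal ratings.
by_both = df.sort_values(["rating", "year"], ascending=[True, False])

print(after_1984, recent_and_good, old_or_good, by_both, sep="\n\n")
```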
Pandas
The last command in this section is groupby. It involves splitting the data into groups based on some criteria, applying a function to each group independently and combining the results into a data structure. df.groupby(col) returns a groupby object for values from one column, while df.groupby([col1,col2]) returns a groupby object for values from multiple columns.
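The split, apply and combine steps look like this in practice. The frame is made up, and `sum` and `mean` are just two of the functions that can be applied to each group.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "product": ["a", "a", "b", "b"],
    "sales": [10, 20, 30, 40],
})

# Group by one column, then sum sales within each group.
sales_by_region = df.groupby("region")["sales"].sum()

# Group by two columns, then average sales within each pair.
mean_by_region_product = df.groupby(["region", "product"])["sales"].mean()

print(sales_by_region)
print(mean_by_region_product)
```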
Pandas: Data cleaning
Check for missing values in the data by running pd.isnull(df) (or df.isnull()), which checks for null values and returns a boolean array (True for missing values and False for non-missing values).
To count the null/missing values in each column, run df.isnull().sum().
pd.notnull(df) is the opposite of pd.isnull(df).
After you find the missing values, you can get rid of them by using df.dropna() to drop the rows that contain them, or df.dropna(axis=1) to drop the columns that contain them.
Pandas: Data cleaning
A different approach is to fill the missing values with other values by using df.fillna(x), which fills the missing values with x (you can put whatever you want there), or s.fillna(s.mean()) to replace all null values with the mean (mean can be replaced with almost any function from the statistics section).
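Both approaches, dropping and filling, can be compared on a small frame with gaps. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns a and b each have one missing value.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

missing_mask = df.isnull()               # True where a value is missing
missing_per_column = df.isnull().sum()   # count of missing values per column

rows_dropped = df.dropna()               # drop rows with any missing value
columns_dropped = df.dropna(axis=1)      # drop columns with any missing value

filled_with_zero = df.fillna(0)                # fill every gap with 0
a_filled = df["a"].fillna(df["a"].mean())      # fill column a with its mean

print(missing_per_column)
print(rows_dropped, columns_dropped, a_filled, sep="\n\n")
```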
Pandas: join and combine
The last set of basic Pandas commands are for joining or combining data frames or rows/columns. The three commands are:
• pd.concat([df1, df2]): add the rows of df2 to the end of df1 (columns should be identical). Older pandas versions also offered df1.append(df2) for this; it was removed in pandas 2.0.
• pd.concat([df1, df2],axis=1): add the columns of df2 to the right of df1 (rows should be identical).
• df1.join(df2,on=col1,how='inner'): SQL-style join of the columns in df1 with the columns in df2, keeping the rows where the values of col1 in df1 match the index of df2. how can be equal to one of: 'left', 'right', 'outer', 'inner'.
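The sketch below shows row-wise and column-wise combination with `pd.concat`, plus an inner join. For the join it uses `pd.merge` on a shared column instead of `DataFrame.join`, because `join` matches against the other frame's index; that swap is a choice made for this example. All frames are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
df2 = pd.DataFrame({"id": [3, 4], "x": ["c", "d"]})

# Rows of df2 added below the rows of df1 (same columns in both).
stacked = pd.concat([df1, df2], ignore_index=True)

# Columns placed side by side (same row labels in both).
extra = pd.DataFrame({"y": [10, 20]})
side_by_side = pd.concat([df1, extra], axis=1)

# SQL-style inner join on the shared "id" column.
scores = pd.DataFrame({"id": [2, 3], "score": [0.5, 0.9]})
inner = pd.merge(stacked, scores, on="id", how="inner")

print(stacked, side_by_side, inner, sep="\n\n")
```

Only ids 2 and 3 appear in both frames, so the inner join keeps just those two rows; `how='left'` would keep all four rows of `stacked`.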
Tea break
Matplotlib
# pip install matplotlib
Scatter plot
Clustering
Density Plot
GitHub
www.github.com
• Largest web-based Git repository hosting service
• Allows code collaboration
• Allows open source projects and documentation
Git
• Git is a distributed revision control and source code management system.
• A Version Control System (VCS) is software that helps software developers work together and maintain a complete history of their work.
• Git is a version-control system for tracking changes in computer files and coordinating work on those files among multiple people.
• Git helps you keep track of the changes you make to your code.
Git
Install Git
• https://git-scm.com/downloads
Create a GitHub account
• https://github.com/
Git
• A system that keeps records of your changes
• Allows for collaborative development
• Allows you to know who made what changes and when
• Allows you to revert changes
Git
[Diagram: Git commands: git config, git init, git clone, git add, git commit, git diff, git status, git remote, git push, git pull]
Git
• git config
Usage: git config --global user.name "[name]"
Usage: git config --global user.email "[email address]"
Git
• git init
This command is used to start a new repository.
• git clone
This command is used to obtain a repository from an existing URL.
Git
• git remote
Usage: git remote add [variable name] [Remote Server Link]
This command is used to connect your local repository to the remote server.
• git push
Usage: git push [variable name] master
This command sends the committed changes of master branch to your
remote repository.
THANK YOU FOR COMING!
