
Data Science and Analytics

Statistical & Probabilistic Data Analytics

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Kinds of Data

Two Kinds of Data
• Quantitative
• Data that is numerical, counted, or compared on a scale
o Demographic data
o Answers to closed-ended survey items
o Attendance data
o Scores on standardized instruments

• Qualitative
• Narratives, logs, experience
o Focus groups
o Interviews
o Open-ended survey items
o Diaries and journals
o Notes from observations

Statistical Data Analysis
• Data Ingredients :
• Univariate variables
• Multivariate variables

• Types of Data
• Continuous data
• Data that is measured rather than counted
• Example: The intensity of a light can be measured but cannot be counted.

• Discrete data
• Data that can be counted
• Example: The number of light bulbs can be counted.

Data Types
• Continuous data
• Distributed under continuous distribution function called the probability density
function, or simply pdf.
• The word ‘density’ in continuous data is used because:
• A density is measured, not counted: the pdf gives probability per unit of x, and a probability is the area under the pdf over an interval.

• Discrete data
• Distributed under discrete distribution function called the probability mass
function, or simply pmf.
• The word ‘mass’ in discrete data is used because:
• Probability sits in separate lumps ("masses") on individual countable values, so the pmf gives P(X = k) directly.

Data Types
• We find various pdfs and pmfs in statistical data analysis.
• Example: The Poisson distribution is a commonly known pmf, and the normal
distribution is a commonly known pdf.

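• These distributions help us to understand which data falls under which distribution:
• If the data is a count, such as the number of light bulbs, it would fall under a discrete distribution such as the Poisson distribution.
• If the data is a measurement, such as the intensity of a light bulb, it would fall under a continuous distribution such as the normal distribution.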
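
As a rough illustration of the pmf/pdf distinction above, the sketch below evaluates a Poisson pmf and a normal pdf using only the standard library. The function names, the rate `lam = 3.0`, and the integration step are assumptions chosen for the example, not values from the slides.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson-distributed count with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution at x (a density, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# pmf values are probabilities: summed over the countable outcomes they give 1.
pmf_total = sum(poisson_pmf(k, lam=3.0) for k in range(100))

# pdf values are densities: a probability is an area, approximated here
# with a midpoint Riemann sum over the interval [-8, 8].
step = 0.001
pdf_area = sum(normal_pdf(-8.0 + (i + 0.5) * step) * step for i in range(16000))

print(round(pmf_total, 6), round(pdf_area, 6))
```

The pmf is summed and the pdf is integrated, which is the practical difference between "mass" and "density".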

• Major task of statistical data analysis:
• Statistical Inference (drawing conclusions from evidence)

Data Types
• Statistical Inference
• Estimation
• Tests of Hypothesis

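• Estimation:
• Uses sample data to estimate unknown population parameters, such as a mean or a proportion

• Tests of Hypothesis:
• Use sample data to decide whether a stated claim about the population is supported; tests can be parametric (assuming a distribution with parameters) or non-parametric (no such assumption).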

Distributions
• Binomial
• Consider N independent experiments: the outcome of each is ‘success’ or ‘failure’, and the probability of
success on any given trial is p.
• Multinomial
• Like binomial but now m outcomes instead of two
• Uniform
• Consider a continuous random variable x with -∞ < x < ∞. For any continuous r.v. x with
cumulative distribution F(x), y = F(x) is uniform in [0,1]
• Gaussian
• Almost any random variable that is a sum of a large number of small
contributions approximately follows a Gaussian (normal) distribution (central limit theorem)
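
The binomial case above can be sketched in a few lines. The function name and the values N = 10 and p = 0.3 are illustrative assumptions; the formula itself is the standard binomial pmf.

```python
import math

def binomial_pmf(n_success: int, n_trials: int, p: float) -> float:
    """Probability of exactly n_success successes in n_trials independent trials."""
    ways = math.comb(n_trials, n_success)  # number of orderings of the successes
    return ways * p**n_success * (1.0 - p) ** (n_trials - n_success)

n_trials, p = 10, 0.3  # assumed example values
probs = [binomial_pmf(k, n_trials, p) for k in range(n_trials + 1)]

total = sum(probs)                                # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(probs))  # equals N * p

print(round(total, 6), round(mean, 6))
```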

Histograms
• pdf = histogram with an infinite data sample, zero bin width, normalized to unit area.
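
A finite-sample version of this idea can be sketched as follows: draw many values, bin them, and divide each count by (number of samples x bin width) so the histogram has unit area. The sample size, bin range, and seed are assumptions made for the illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

lo, hi, n_bins = -4.0, 4.0, 40
width = (hi - lo) / n_bins
counts = [0] * n_bins
for x in samples:
    if lo <= x < hi:
        # min() guards against a floating-point edge case at the upper boundary
        counts[min(int((x - lo) / width), n_bins - 1)] += 1

in_range = sum(counts)
# Dividing by (count * width) turns raw counts into a density with unit area.
density = [c / (in_range * width) for c in counts]
area = sum(d * width for d in density)

print(round(area, 6), round(density[n_bins // 2], 3))
```

With more samples and narrower bins, the bar heights approach the normal pdf (about 0.399 at its peak).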

Statistics in Data Analytics

Statistical Data Analysis
• It is a procedure of performing various statistical operations.

• It is a kind of quantitative research, which:
• Seeks to quantify the data,
• Applies some form of statistical analysis.

• Quantitative data basically involves:
• Descriptive data, such as survey data and observational data.

• Tools:
• Statistical Analysis System (SAS),
• Statistical Package for the Social Sciences (SPSS),
• StatSoft
Basic terminology and concepts
• Statistical terms
• Ratio
• Proportion
• Percentage
• Rate
• Mean
• Median

Ratio
• Comparison of two numbers expressed as:
• a to b, a per b, a:b
• Used to express such comparisons as clinicians to patients or beds to
clients
• Calculation a/b
• Example – In district X, there are 600 nurses and 200 clinics. What is the
ratio of nurses to clinics?
600 / 200 = 3 nurses per clinic, a ratio of 3:1

Calculating ratios
• In Jamshoro district hospital, there are 160 nurses and 40 clinics
• What is the nurse-to-clinic ratio?
160 / 40 = 4

4:1 or 4 nurses to 1 clinic

Proportion
• A ratio in which all individuals in the numerator are also in the
denominator.
• Used to compare part of the whole, such as proportion of all clients who
are less than 15 years old
• Example: If 20 of 100 clients on treatment are less than 15 years of age,
what is the proportion of young clients in the clinic?
20/100 = 1/5

Calculating proportions
• Example: If a clinic has 12 female clients and 8 male clients, then the
proportion of male clients is 8/20, or 2/5
• 12 + 8 = 20
• 8/20
• Reduce by the common factor of 4: 8/20 = 2/5 of clients are male

Percentage
• A way to express a proportion (proportion multiplied by 100)
• Expresses a number in relation to the whole
• Example: Males comprise 2/5 of the clients, or 40% of the clients are
male (0.40 x 100)
• Allows us to express a quantity relative to another quantity. Can
compare different groups, facilities, countries that may have different
denominators
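
The ratio, proportion, and percentage examples above can be checked with exact fractions. This is a small sketch using the slide figures; the variable names are assumptions.

```python
from fractions import Fraction

# Ratio: 600 nurses to 200 clinics
nurses_per_clinic = Fraction(600, 200)  # reduces to 3, i.e. 3:1

# Proportion: the numerator is part of the denominator
young_proportion = Fraction(20, 100)    # reduces to 1/5
male_proportion = Fraction(8, 12 + 8)   # reduces to 2/5

# Percentage: proportion multiplied by 100
male_percentage = male_proportion * 100

print(nurses_per_clinic, young_proportion, male_proportion, male_percentage)
```

`Fraction` reduces to lowest terms automatically, which mirrors the "reduce this" step on the slide.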

Rate
• Measured with respect to another measured quantity during the same
time period
• Used to express the frequency of specific events in a certain time period
(fertility rate, mortality rate)
• Numerator and denominator must be from same time period
• Often expressed as a ratio (per 1,000)

Infant Mortality Rate
• Calculation
• # of deaths ÷ population at risk in same time period x 1,000
• Example – 75 infants (less than one year) died out of 4,000 infants
born that year
• 75/4,000 = 0.01875; 0.01875 x 1,000 = 18.75
About 19 infants died per 1,000 live births

Calculating mortality rate
In 2009, Jamshoro clinic had 31,155 patients on CVD. During that same
time period, 1,536 CVD clients died.

1,536 / 31,155 ≈ 0.049; 0.049 x 1,000 = 49
49 clients died per 1,000 clients on CVD (mortality rate)
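
Both mortality-rate calculations follow the same pattern, sketched below. The helper name `rate_per` and its `per` parameter are assumptions for illustration.

```python
def rate_per(events: int, population_at_risk: int, per: int = 1000) -> float:
    """Events per `per` members of the population at risk in the same period."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return events * per / population_at_risk

infant_mortality = rate_per(75, 4_000)    # infant deaths per 1,000 live births
cvd_mortality = rate_per(1_536, 31_155)   # deaths per 1,000 clients on CVD

print(infant_mortality, round(cvd_mortality, 1))
```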

Rate of increase
• Calculation
• Total increase ÷ length of the time period
• Used to calculate monthly, quarterly, yearly increases.
• Example: increase in # of new clients, commodities distributed
• Example: Mobile phone purchases in Jan. = 200; as of June = 1,100.
What is the rate of increase?
• 1,100 - 200 = 900; 900/6 = 150 (150 mobiles per month)

Calculating rate of increase
In Q1, there were 50 new Ufone service users, and in Q2 there were 75.
What was the rate of increase from Q1 to Q2?

Answer: 75 - 50 = 25; 25/3 ≈ 8.33 new users per month (one quarter = 3 months)
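
The two rate-of-increase examples can be reproduced with one small function. The function name is an assumption, and the period lengths (6 months, 3 months) are the ones used on the slides.

```python
def rate_of_increase(start: float, end: float, periods: float) -> float:
    """Average increase per period between a start value and an end value."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    return (end - start) / periods

mobiles_per_month = rate_of_increase(200, 1_100, 6)  # January to June, as on the slide
users_per_month = rate_of_increase(50, 75, 3)        # one quarter = 3 months

print(mobiles_per_month, round(users_per_month, 2))
```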

Central tendency
Measures of the location of the middle or the center of a distribution of
data
• Mean
• Median

Mean
• The average of your dataset
• The value obtained by dividing the sum of a set of quantities by the
number of quantities in the set
• Example: (22+18+30+19+37+33) ÷ 6 = 159 ÷ 6 = 26.5
• The mean is sensitive to extreme values

Calculating the mean
• Average number of clients counseled per month

– January: 30
– February: 45
– March: 38
– April: 41
– May: 37
– June: 40

(30+45+38+41+37+40) ÷ 6 = 231 ÷ 6 = 38.5
Mean or average = 38.5

Median
• The middle of a distribution (when numbers are in order: half of the
numbers are above the median and half are below the median)
• The median is not as sensitive to extreme values as the mean
• Odd number of numbers, median = the middle number
• Median of 2, 4, 7 = 4
• Even number of numbers, median = mean of the two middle numbers
• Median of 2, 4, 7, 12 = (4+7) /2 = 5.5

Calculating the median
• Client 1 – 2
• Client 2 – 134
• Client 3 – 67
• Client 4 – 10
• Client 5 – 221
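Ordered values: 2, 10, 67, 134, 221, so the median = 67 (the middle value)
With an even number of values whose two middle values are 67 and 134, the median = (67+134)/2 = 201/2 = 100.5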

Use the mean or median?
CD4 count
Client 1 9
Client 2 11
Client 3 100
Client 4 95
Client 5 92
Client 6 206
Client 7 104
Client 8 100
Client 9 101
Client 10 92

We can see that there are a few outliers that may skew the data, so we want to use the median.
If we rank the values in the table, we get: 9, 11, 92, 92, 95, 100, 100, 101, 104, 206
Since there is an even number of observations, the median is calculated as: (95+100)/2 = 195/2 = 97.5
We are choosing the 2 middle numbers (95 and 100), adding them together to get 195, and then dividing by 2.
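
The mean and median examples can be verified with Python's standard `statistics` module. The data values are taken from the slides; the variable names are assumptions.

```python
import statistics

monthly_clients = [30, 45, 38, 41, 37, 40]               # January to June
five_clients = [2, 134, 67, 10, 221]                     # odd number of values
cd4 = [9, 11, 100, 95, 92, 206, 104, 100, 101, 92]       # CD4 counts, clients 1-10

mean_monthly = statistics.mean(monthly_clients)
median_five = statistics.median(five_clients)  # sorts internally, picks the middle value
mean_cd4 = statistics.mean(cd4)
median_cd4 = statistics.median(cd4)            # averages the two middle values

print(mean_monthly, median_five, mean_cd4, median_cd4)
```

For the CD4 data the mean is pulled down by the two very low values, while the median stays near the bulk of the observations, which is why the slide prefers the median here.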

Conclusion
• Purpose of this analysis is to provide answers to programmatic questions
• Descriptive analyses describe the sample/target population
• Descriptive analyses do not define causality – that is, they tell you what, not
why

Conditional Probability

A definition of probability
• Consider a set S with subsets A, B, ...

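Kolmogorov axioms (1933)
• For all A ⊂ S, P(A) ≥ 0
• P(S) = 1
• If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

• From these axioms we can derive further properties, e.g.
• P(not A) = 1 - P(A)
• P(∅) = 0
• P(A ∪ B) = P(A) + P(B) - P(A ∩ B)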

Conditional probability, independence
• Also define conditional probability of A given B (with P(B) ≠ 0):
• P(A|B) = P(A ∩ B) / P(B)

• E.g. rolling dice:

• Subsets A, B independent if:
• P(A ∩ B) = P(A) P(B)

• If A, B independent,
• P(A|B) = P(A) P(B) / P(B) = P(A)

• N.B. do not confuse with disjoint subsets, i.e.,
• A ∩ B = ∅
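
Conditional probability and independence can be checked by enumerating the outcomes of a single fair die. The slide's own dice example is not recoverable from the text, so the events below ("roll below 3", "roll is even", "roll is odd") are assumptions chosen for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of one fair die

def prob(event: set) -> Fraction:
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event & outcomes), len(outcomes))

def cond_prob(a: set, b: set) -> Fraction:
    """P(A|B) = P(A and B) / P(B), defined only when P(B) is non-zero."""
    if prob(b) == 0:
        raise ValueError("P(B) must be non-zero")
    return prob(a & b) / prob(b)

below_3 = {1, 2}
even = {2, 4, 6}
odd = {1, 3, 5}

p_low_given_even = cond_prob(below_3, even)

# Independent: P(A and B) equals P(A) * P(B)
independent = prob(below_3 & even) == prob(below_3) * prob(even)

# Disjoint events are not independent: P(even and odd) = 0, but P(even) * P(odd) = 1/4
disjoint_independent = prob(even & odd) == prob(even) * prob(odd)

print(p_low_given_even, independent, disjoint_independent)
```

Here P(below 3 | even) equals P(below 3), as expected for independent events, while the disjoint pair fails the independence test.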

Bayes’ theorem
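• From the definition of conditional probability we have,
• P(A|B) = P(A ∩ B) / P(B)
and
• P(B|A) = P(B ∩ A) / P(A)

• but P(A ∩ B) = P(B ∩ A), so
• P(A|B) = P(B|A) P(A) / P(B)

Bayes’ theorem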

Example 1
• You are planning a picnic today, but the morning is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40% of days start cloudy)
• And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%)
• What is the chance of rain during the day?
• We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.
• The chance of Rain given Cloud is written P(Rain|Cloud)
• So let's put that in the formula:
• P(Rain|Cloud) = P(Rain) * P(Cloud|Rain) / P(Cloud)

• P(Rain) is Probability of Rain = 10%
• P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
• P(Cloud) is Probability of Cloud = 40%

• P(Rain|Cloud) = 0.1 * 0.5 / 0.4 = 0.125
• Or a 12.5% chance of rain.
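
The picnic calculation is a direct application of Bayes' theorem, sketched below. The function and argument names are assumptions; the numbers are the ones given in the example.

```python
def bayes(prior: float, likelihood: float, evidence: float) -> float:
    """P(H|E) = P(H) * P(E|H) / P(E)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return prior * likelihood / evidence

p_rain = 0.10              # P(Rain)
p_cloud_given_rain = 0.50  # P(Cloud|Rain)
p_cloud = 0.40             # P(Cloud)

p_rain_given_cloud = bayes(p_rain, p_cloud_given_rain, p_cloud)
print(f"{p_rain_given_cloud:.1%}")
```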

Example 2
• SpamAssassin works by having users train the system. It looks for
patterns in the words in emails marked as spam by the user. For
example, it may have learned that the word “free” appears in 20% of
the emails marked as spam. Assuming 0.1% of non-spam mail
includes the word “free” and 50% of all emails received by the user are
spam, find the probability that an email is spam if the word “free”
appears in it.

• Solution
• Data Given:
• P(Free | Spam) = 0.20
• P(Free | Non Spam) = 0.001
• P(Spam) = 0.50 => P(Non Spam) = 0.50
• P(Spam | Free) = ?
• Using Bayes’ Theorem:
• P(Spam | Free) = P(Spam) * P(Free | Spam) / P(Free)
• where P(Free) = P(Spam) * P(Free | Spam) + P(Non Spam) * P(Free | Non Spam)
• P(Spam | Free) = 0.50 * 0.20 / (0.50 * 0.20 + 0.50 * 0.001)
• P(Spam | Free) = 0.100 / 0.1005 ≈ 0.995
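
The spam solution combines Bayes' theorem with the law of total probability for the denominator P(Free). A minimal sketch, with hypothetical function and parameter names:

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """P(H|E) for a two-way split (H or not H) using total probability for P(E)."""
    evidence = prior * p_e_given_h + (1.0 - prior) * p_e_given_not_h
    if evidence == 0:
        raise ValueError("the evidence has zero probability")
    return prior * p_e_given_h / evidence

p_spam_given_free = posterior(
    prior=0.50,             # P(Spam)
    p_e_given_h=0.20,       # P(Free | Spam)
    p_e_given_not_h=0.001,  # P(Free | Non Spam)
)
print(round(p_spam_given_free, 3))
```

Because "free" is 200 times more likely in spam than in non-spam and the two classes are equally common, seeing the word makes spam very likely.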

Random Sample and Statistics
• Population: the set or universe of all entities under study.
• However, looking at the entire population may not be feasible, or may be too expensive.
• Instead, we draw a random sample from the population and compute appropriate statistics from the
sample, which give estimates of the corresponding population parameters of interest.
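
The idea of estimating a population parameter from a random sample can be sketched with simulated data. The population here (50,000 values drawn from a normal distribution with mean 100 and standard deviation 15), the sample size, and the seed are all assumptions made for the illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Simulated "population"; in practice this would be too large or costly to measure fully.
population = [random.gauss(100.0, 15.0) for _ in range(50_000)]

# Simple random sample without replacement
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)  # the parameter (normally unknown)
sample_mean = statistics.fmean(sample)          # the statistic that estimates it

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean will not match the population mean exactly, but with 500 observations it is typically within about one unit of it.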

