DMA: Review of Probability
Prof. Mangal S, Dandekar
What is Population?
In statistics, a population is the entire set of items from which you draw data for a statistical study. It
makes up the data pool for the study. In everyday usage, population refers to the people who live in a
particular area at a specific time, but in statistics it refers to all the units relevant to your study of
interest: individuals, objects, events, organizations, etc. A population is the group you want to draw
conclusions about.

An example of a population would be the entire student body at a school. It would contain all the students
who study in that school at the time of data collection. Depending on the problem statement, data from
each of these students is collected.

For the above situation, it is easy to collect data. The population is small, can be contacted, and is
willing to provide data, so the data collected will be complete and reliable.

If you had to collect the same data from a larger population, say the entire country of India, it would be
very difficult to collect complete data and draw reliable conclusions because of geographical and
accessibility constraints, not to mention time and resource constraints.

A lot of data would be missing or might be unreliable. Furthermore, due to accessibility issues, marginalized
tribes or villages might not provide data at all, making the data biased towards certain regions or groups.
What is a Sample?
A sample is a smaller and more manageable representation of a larger group: a subset of a larger
population that contains characteristics of that population. A sample is used in statistical testing when the
population is too large for all members or observations to be included in the test.

Ideally, the sample is an unbiased subset of the population that represents the whole population well.

To overcome the constraints of studying a whole population, you can collect data from a subset of the
population and treat the results as representative of it. You collect the data only from the groups who
take part in the study, which makes complete and reliable collection feasible. The results obtained for the
groups who took part in the study can then be generalized to the population.
Samples are used when:

● The population is too large to collect data from every member.
● Data collected from the entire population would not be reliable.

A sample should generally:

● Cover the different variations present in the population and follow a well-defined selection criterion.
● Be unbiased with respect to the properties of the objects being selected.
● Be chosen at random, so that the objects of study are selected fairly.
Similarities Between Population and Sample
Data: Both population and sample involve data.
Descriptive Statistics: Descriptive statistics can be used to analyze both populations and
samples.

Probability: Probability theory can be used to analyze both populations and samples.

Inferential Statistics: Inferential statistics can be used to draw conclusions about the
population based on the sample.

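Sampling Error: Sampling error arises whenever a sample is used to estimate a population quantity. A
complete census of the population has no sampling error, although it can still contain other errors, such
as measurement error.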
How Population and Sample are Used in Statistical Inference

In statistical inference, population and sample are used to estimate population parameters using sample
statistics. The sample is used as a representation of the population, and probability theory and statistical
methods are applied to draw conclusions or make predictions about the population based on the sample
data.
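The idea of estimating a population parameter from a sample statistic can be illustrated with a minimal Python sketch. The population here is hypothetical (made-up scores generated in the code, not data from the lecture), and the population mean is computed only so the two numbers can be compared.

```python
import random
import statistics

# Hypothetical population: 10,000 made-up exam scores (an assumption, not real data).
rng = random.Random(0)
population = [rng.gauss(70, 10) for _ in range(10_000)]

# Population parameter: usually unknown in practice, computed here only for comparison.
population_mean = statistics.fmean(population)

# Sample statistic: the mean of a simple random sample of 200 items.
sample = rng.sample(population, 200)
sample_mean = statistics.fmean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

The two values should be close but not identical; the gap between them is the sampling error of this particular sample.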
Examples of Statistical Inference Using Population and Sample Data
Medical Research: In medical research, clinical trials are conducted on a sample of
the population to estimate the effects of a drug or treatment. Statistical inference is
used to estimate the effect size and to assess how likely it is that the observed effect is due to chance.

Market Research: In market research, a sample of customers is surveyed to
estimate the demand for a product or service. Statistical inference is used to
estimate the proportion of the population that would be interested in the product or
service.

Quality Control: In quality control, a sample of products is tested to estimate the proportion
of defective items in the population. Statistical inference is used to determine whether the
proportion of defects in the sample is significantly different from the population.

Political Polling: In political polling, a sample of voters is surveyed to estimate the proportion
of voters who support a candidate or party. Statistical inference is used to estimate the
margin of error and determine the probability of a candidate winning the election.
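As a rough illustration of the polling example, the sketch below computes a margin of error for a sample proportion. It assumes a simple random sample and the usual normal approximation with z = 1.96 for about 95% confidence; the function name and the poll numbers are hypothetical.

```python
import math

def proportion_margin_of_error(successes, n, z=1.96):
    """Return (p_hat, margin) for a sample proportion using the normal approximation."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, z * standard_error

# Hypothetical poll: 540 of 1,000 surveyed voters support a candidate.
p_hat, margin = proportion_margin_of_error(540, 1000)
print(f"estimate: {p_hat:.3f} +/- {margin:.3f}")
```

With these made-up numbers the margin is about three percentage points, so the interval still includes values close to 50%.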
Skewness
Skewness is a measure of the asymmetry of a distribution.

A distribution is asymmetrical when its left and right sides are not mirror images.

Right skew = Positive skew

Left skew = Negative skew
How to calculate skewness
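The formula from this slide did not survive extraction. One common definition, which may differ from the one originally shown, is the moment coefficient of skewness: the average of the cubed standardized deviations from the mean. A minimal Python sketch with made-up data:

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # made-up data with a long right tail
left_skewed = [-v for v in right_skewed]   # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # approximately 0
```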
What to do if your data is skewed

1. Do nothing. Many statistical tests, including t tests, ANOVAs, and linear regressions, aren’t very sensitive to skewed data. Especially if the skew is mild or moderate, it may be best to ignore it.

2. Use a different model. You may want to choose a model that doesn’t assume a normal distribution.

3. Transform the variable. Another option is to transform a skewed variable so that it’s less skewed.
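Option 3 can be sketched as follows. The log transform is only one possible choice (an assumption here, suitable for strictly positive, right-skewed data), and the variable is simulated from a log-normal distribution rather than taken from any real data set.

```python
import math
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(1)

# Hypothetical right-skewed variable: simulated log-normal values.
incomes = [rng.lognormvariate(10, 1) for _ in range(5_000)]

# Log transform: compresses the long right tail.
log_incomes = [math.log(v) for v in incomes]

print(f"skewness before transform: {skewness(incomes):.2f}")
print(f"skewness after transform:  {skewness(log_incomes):.2f}")
```

The skewness should drop from a clearly positive value to something near zero, because the logarithm of a log-normal variable is normally distributed.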
Kurtosis
Kurtosis and Excess Kurtosis
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur.

Excess kurtosis is the tailedness of a distribution relative to a normal distribution.

● Distributions with medium kurtosis (medium tails) are mesokurtic.

● Distributions with low kurtosis (thin tails) are platykurtic.

● Distributions with high kurtosis (fat tails) are leptokurtic.

Tails are the tapering ends on either side of a distribution. They represent the probability or frequency of values that are extremely high or low compared to the mean. In other words, tails represent how often outliers occur.


Excess Kurtosis = Kurtosis - 3

A normal distribution has a kurtosis of 3.

Platykurtosis is sometimes called negative kurtosis, since the excess kurtosis is negative.

Leptokurtosis is sometimes called positive kurtosis, since the excess kurtosis is positive.

Platykurtic distributions have a low frequency of outliers.

A leptokurtic distribution is fat-tailed, meaning that there are a lot of outliers.
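The three cases can be checked numerically with a short Python sketch. It assumes the moment definition of kurtosis (the mean of the fourth power of the standardized deviations) and uses simulated data: normal for mesokurtic, uniform for platykurtic, and a Laplace-like variable (the difference of two exponentials) for leptokurtic.

```python
import random
import statistics

def kurtosis(values):
    """Moment kurtosis: mean of the fourth power of z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values)

def excess_kurtosis(values):
    """Excess kurtosis = kurtosis - 3, so a normal distribution scores about 0."""
    return kurtosis(values) - 3

rng = random.Random(2)
n = 20_000
normal = [rng.gauss(0, 1) for _ in range(n)]                           # mesokurtic
uniform = [rng.uniform(-1, 1) for _ in range(n)]                       # platykurtic
laplace = [rng.expovariate(1) - rng.expovariate(1) for _ in range(n)]  # leptokurtic

for name, data in [("normal", normal), ("uniform", uniform), ("laplace", laplace)]:
    print(f"{name:>8}: excess kurtosis = {excess_kurtosis(data):.2f}")
```

The normal sample should land near 0, the uniform sample below 0 (thin tails), and the Laplace-like sample above 0 (fat tails).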
Covariance

Covariance is a measure of the relationship between two random variables and of the extent to which they change together.
In other words, it describes how a change in one variable is associated with a change in the other variable.
Types of Covariance

Positive Covariance
If the covariance of two variables is positive, the variables tend to move in the same direction.

Negative Covariance
If the covariance of two variables is negative, the variables tend to move in opposite directions.
Covariance Formula
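The formula on this slide did not survive extraction. A reconstruction consistent with the symbols defined below, assuming the population form that divides by N (the sample form divides by N - 1 instead), would be:

```latex
\mathrm{COV}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
```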
Where,

● xi = data value of x
● yi = data value of y
● x̄ = mean of x
● ȳ = mean of y
● N = number of data values.
● COV(x,y) = Covariance of X and Y

If cov(X, Y) is greater than zero, the covariance is positive and the two variables tend to move in the
same direction.

If cov(X, Y) is less than zero, the covariance is negative and the two variables tend to move in opposite
directions.

If cov(X, Y) is zero, there is no linear relationship between the two variables. They may still be related in
a nonlinear way.
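The three cases can be sketched in Python. The function below assumes the population form of covariance (dividing by N), and all data values are made up for illustration. The last example shows that a covariance of zero does not rule out a nonlinear relationship.

```python
import statistics

def covariance(xs, ys):
    """Population covariance: average product of deviations from the two means."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]     # rises with hours
score_down = [10, 8, 6, 4, 2]   # falls as hours rise

print(covariance(hours, score_up))    # 4.0
print(covariance(hours, score_down))  # -4.0

# y depends entirely on x (y = x squared), yet the covariance is zero.
xs = [-2, -1, 0, 1, 2]
squares = [x * x for x in xs]
print(covariance(xs, squares))        # 0.0
```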
Correlation Coefficient Formula

Correlation measures the strength and direction of the linear relationship between variables. It is
dimensionless: the correlation coefficient has no units and always lies between -1 and 1.

The relationship between the correlation coefficient and covariance is given by:
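The formula itself did not survive extraction. The standard relationship, written with r for the correlation coefficient and with σ_x and σ_y (symbols assumed here) for the standard deviations of x and y, is:

```latex
r_{xy} = \frac{\mathrm{COV}(x, y)}{\sigma_x \, \sigma_y}
```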
Correlation Matrix
A matrix is an array of numbers arranged in rows and columns.

A correlation matrix is simply a table showing the correlation coefficients between variables.

Here, the variables are represented in the first row, and in the first column:

The table above has used data from the full health data set.

Observation:

● We observe that Duration and Calorie_Burnage are closely related, with a correlation
coefficient of 0.89. This makes sense, as the longer we train, the more calories we burn.
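A correlation matrix like the one described can be built with a few lines of Python. This is a minimal sketch: the numbers below are made up and are not the health data set, so the coefficients will not match the 0.89 reported above, and the Average_Pulse column is a hypothetical addition used only to give the table a third variable.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for illustration only.
data = {
    "Duration": [30, 45, 60, 60, 75, 90],
    "Average_Pulse": [110, 105, 115, 100, 120, 108],
    "Calorie_Burnage": [200, 280, 400, 370, 480, 560],
}

names = list(data)
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the table with variable names in the first row and first column.
print(" " * 16 + "".join(f"{name:>16}" for name in names))
for a in names:
    print(f"{a:<16}" + "".join(f"{matrix[a][b]:>16.2f}" for b in names))
```

Every diagonal entry is 1 because each variable is perfectly correlated with itself, and the table is symmetric because the correlation of a with b equals the correlation of b with a.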
Using a Heatmap
We can use a heatmap to visualize the correlation between variables:

The closer the correlation coefficient is to 1, the greener the squares get.

The closer the correlation coefficient is to -1, the browner the squares get.
What is a sampling distribution?
A sampling distribution is the probability distribution of a statistic, such as the mean, computed from
repeated random samples of a small group drawn from a large population. Its primary purpose is to show how
well results from small samples represent a comparatively larger population. Since the population is often
too large to analyze in full, you can select smaller groups and repeatedly sample and analyze them. The
statistic calculated from each sample is used to estimate how likely a given result is. Using a sampling
distribution simplifies the process of making inferences, or conclusions, about large amounts of data.

Each random sample selected may give a different value for the statistic being studied. For
example, if you randomly sample data three times and determine the mean, or the average, of each sample,
all three means are likely to be different and fall at different points along the graph. That's variability.
If you do that many times and plot the sample means, the plot may eventually look like a bell curve. The
distribution of those sample means is the sampling distribution of the mean.
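The repeated-sampling process described above can be sketched in Python. The population is simulated (a hypothetical skewed population with a mean near 5), and the sample size of 40 and the 2,000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical population: 50,000 simulated values from a skewed distribution.
population = [rng.expovariate(1 / 5) for _ in range(50_000)]

def sample_mean(n):
    """Mean of one simple random sample of size n drawn from the population."""
    return statistics.fmean(rng.sample(population, n))

# Three samples give three different means: that is sampling variability.
three_means = [sample_mean(40) for _ in range(3)]
print([round(m, 2) for m in three_means])

# Many sample means together approximate the sampling distribution of the mean.
many_means = [sample_mean(40) for _ in range(2_000)]
print(f"mean of sample means: {statistics.fmean(many_means):.2f}")
print(f"population mean:      {statistics.fmean(population):.2f}")
```

The individual sample means differ, but their average sits very close to the population mean, which is what makes the sample mean a useful estimator.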
Factors that influence sampling distribution

● The population size: The symbol for this variable is "N." It is the number of observations in the whole
population.

● The sample size: The symbol for this variable is "n." It is the number of observations in a random sample
drawn from the population.

● The method of choosing the sample: How you choose the samples can account for variability in some cases.
Types of sampling distributions
1. Sampling distribution of mean

2. Sampling distribution of proportion


Central limit theorem
The central limit theorem says that the sampling distribution of the mean will be approximately normally
distributed, as long as the sample size is large enough and the population has a finite variance. Regardless
of whether the population has a normal, Poisson, binomial, or any other such distribution, the sampling
distribution of the mean will approach a normal distribution.

By convention, we consider a sample size of 30 to be “sufficiently large.” This is a rule of thumb rather than
a strict threshold.

When n < 30, the normal approximation from the central limit theorem is not reliable in general. The sampling
distribution will tend to follow a distribution similar to the population's. Therefore, the sampling
distribution can only be treated as normal if the population is normal.

When n ≥ 30, the central limit theorem is usually taken to apply. The sampling distribution will
approximately follow a normal distribution.
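A small simulation can make the effect of sample size visible. The sketch below assumes a right-skewed exponential population and uses skewness as a rough, informal check of how close the sample means are to a symmetric bell shape; the sizes 2 and 30 and the 5,000 repetitions are arbitrary choices.

```python
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean of cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(4)

def sample_means(n, repetitions=5_000):
    """Means of repeated samples of size n from a right-skewed (exponential) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]

skew_n2 = skewness(sample_means(2))
skew_n30 = skewness(sample_means(30))

print(f"skewness of sample means, n = 2:  {skew_n2:.2f}")
print(f"skewness of sample means, n = 30: {skew_n30:.2f}")
```

With n = 2 the sample means remain clearly right-skewed like the population, while with n = 30 the skewness is much closer to zero, although not exactly zero.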
Sample size and standard deviations

When n is low, the standard deviation of the sampling distribution (the standard error) is high. There’s a
lot of spread in the samples’ means because they aren’t precise estimates of the population’s mean.

When n is high, the standard deviation of the sampling distribution is low. There’s not much spread in the
samples’ means because they’re precise estimates of the population’s mean.
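The link between n and the spread of the sample means can be checked against the standard result that the standard deviation of the sample mean is sigma / sqrt(n) for independent draws. The sketch below assumes a hypothetical normal population with mean 50 and standard deviation 12; the sample sizes are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(5)
SIGMA = 12.0  # standard deviation of the hypothetical population

def sd_of_sample_means(n, repetitions=3_000):
    """Standard deviation of the means of many independent samples of size n."""
    means = [
        statistics.fmean([rng.gauss(50.0, SIGMA) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

results = {n: sd_of_sample_means(n) for n in (5, 30, 120)}
for n, observed in results.items():
    expected = SIGMA / math.sqrt(n)
    print(f"n={n:>3}  observed spread={observed:.2f}  sigma/sqrt(n)={expected:.2f}")
```

The observed spread shrinks as n grows and tracks sigma / sqrt(n) closely, so quadrupling the sample size roughly halves the spread.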
Conditions of Central Limit Theorem

1. Sample size n >= 30

2. Samples are independent and identically distributed (iid) random variables.
Central Limit Theorem
The central limit theorem helps in constructing a sampling distribution. The theorem says that how closely the
sampling distribution approaches a normal distribution depends on the sample size. As the sample size increases,
the variability of the sample mean, measured by the standard error, decreases.

To determine if your sample group is large enough, consider the following:

Accuracy requirements: The most accurate sampling distributions are built from enough sample means to trace out a
bell curve. The closer the plotted distribution looks to a normal distribution, the more accurate it is. More data is
better for the accuracy of sampling distributions.

The starting population's shape: If the starting population closely resembles a normal distribution, fewer
samples may be required to plot the shape in a sampling distribution. However, if the population is not normal, for
example, skewed one way or another, you may need more samples to get the desired result from the sampling
distribution.
