
UNIT 1

1. Explain types of statistical inference.

Several types of statistical inference are widely used for drawing
conclusions. They are:

 One-sample hypothesis testing
 Confidence interval
 Pearson correlation
 Bivariate regression
 Multivariate regression
 Chi-square statistic and contingency table
 ANOVA or t-test

For inferential statistics, we first define the population and then draw a
random sample from it.
Population: 9th-grade students in public schools in Pune city. Use random
sampling to help ensure a representative sample. Assume we are provided a
list of names for the entire population; we draw a random sample of 100
students from it and obtain their test scores.
Students may come from many different schools across the city.
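The sampling step above can be sketched in code. This is a minimal
illustration with entirely hypothetical numbers (the population scores are
simulated, not real Pune data): draw a simple random sample of 100 and
compute a 95% confidence interval for the mean score using the normal
approximation.

```python
import random
import statistics

# Hypothetical stand-in for the population of 9th-grade test scores.
random.seed(0)
population_scores = [random.gauss(65, 12) for _ in range(5000)]

sample = random.sample(population_scores, 100)   # simple random sample, n = 100
n = len(sample)
x_bar = statistics.mean(sample)
se = statistics.stdev(sample) / n ** 0.5         # standard error of the mean
# 95% confidence interval using the normal approximation (z = 1.96)
ci = (x_bar - 1.96 * se, x_bar + 1.96 * se)
print(f"sample mean = {x_bar:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

The interval is a statement about the population mean made from the sample
alone, which is exactly the inferential step the notes describe.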

2. Why is statistical inference important in Machine
Learning?

Statistical inference is the branch of statistics concerned with using
probability concepts to deal with uncertainty in decision-making. The process
involves selecting and using a sample statistic to draw inferences about a
population parameter based on a subset of the population -- the sample
drawn from it.

Statistical inference deals with two classes of situations:

 Hypothesis testing
 Estimation

Statistical methods are required to find answers to the questions that we have
about data. They are needed both to understand the data used to train a
machine learning model and to interpret the results of testing different
machine learning models.

3. Differentiate between ANOVA and ANCOVA

Basis for comparison | ANOVA | ANCOVA
Meaning | ANOVA is a process of examining the difference among the means of multiple groups of data for homogeneity. | ANCOVA is a technique that removes the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research.
Uses | Both linear and non-linear models are used. | Only a linear model is used.
Includes | Categorical variables. | Categorical and interval variables.
Covariate | Ignored. | Considered.
BG variation | Attributes Between Group (BG) variation to treatment. | Divides Between Group (BG) variation into treatment and covariate.
WG variation | Attributes Within Group (WG) variation to individual differences. | Divides Within Group (WG) variation into individual differences and covariate.
4. What do you mean by Hypothesis Testing? Explain with an
example.

A hypothesis is a calculated prediction or assumption about a population
parameter based on limited evidence.
Hypothesis testing is an assessment method that allows researchers to
determine the plausibility of a hypothesis. It involves testing an assumption
about a specific population parameter to know whether it is true or false.
These population parameters include the variance, standard deviation, and
median.
Typically, hypothesis testing starts with developing a null hypothesis and then
performing tests that support or reject it. The researcher uses test statistics
to compare the association or relationship between two or more variables.
To successfully confirm or refute an assumption, the researcher goes through
five (5) stages of hypothesis testing:

1. Determine the null hypothesis.
2. Specify the alternative hypothesis.
3. Set the significance level.
4. Calculate the test statistic and the corresponding p-value.
5. Draw your conclusion.
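The five stages can be walked through in code. The following sketch uses
hypothetical exam scores and a one-sample t-test of H0: mean = 70 against
Ha: mean ≠ 70 (the data, the hypothesized mean, and the critical value are
illustrative, not from the notes).

```python
import math
import statistics

# Hypothetical sample: scores of 10 students.
scores = [72, 68, 75, 71, 69, 74, 73, 70, 76, 67]

# 1-2. H0: mu = 70, Ha: mu != 70
mu0 = 70
# 3. Significance level alpha = 0.05 (two-sided)
# 4. One-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n))
n = len(scores)
x_bar = statistics.mean(scores)
s = statistics.stdev(scores)            # sample standard deviation
t = (x_bar - mu0) / (s / math.sqrt(n))
# 5. Compare |t| to the two-sided critical value for df = 9 at alpha = 0.05
t_crit = 2.262
print(f"t = {t:.3f}; reject H0: {abs(t) > t_crit}")
```

Here |t| falls below the critical value, so the null hypothesis is not
rejected at the 5% level.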

5. Define regression and explain bivariate and
multivariate regression with examples.

Regression is a statistical method for modelling the relationship between a
dependent variable and one or more independent variables. Bivariate and
multivariate analyses are statistical methods for investigating relationships
between data samples. Bivariate analysis looks at two paired data sets,
studying whether a relationship exists between them. Multivariate analysis
uses two or more variables and analyzes which, if any, are correlated with a
specific outcome. The goal in the latter case is to determine which variables
influence or cause the outcome.
Bivariate Analysis

Bivariate analysis investigates the relationship between two data sets, with a
pair of observations taken from a single sample or individual. However, each
sample is independent. You analyze the data using tools such as t-tests and
chi-squared tests to see if the two groups of data correlate with each other.
If the variables are quantitative, you usually graph them on a scatterplot.
Bivariate analysis also examines the strength of any correlation.

Examples:

One example of bivariate analysis is a research team recording the ages of
both husband and wife in a single marriage. This data is paired because both
ages come from the same marriage, but independent because one person's
age doesn't cause the other person's age. Plotting the data shows a
correlation: older husbands have older wives.

A second example is recording measurements of individuals' grip strength


and arm strength. The data is paired because both measurements come from
a single person, but independent because different muscles are used. You
plot data from many individuals to show a correlation: people with higher
grip strength have higher arm strength.
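The grip-strength example can be sketched as a bivariate analysis in code.
The numbers below are hypothetical; the sketch computes the Pearson
correlation and the simple (bivariate) regression line y = a + b*x by hand.

```python
# Hypothetical paired measurements: grip strength vs arm strength.
grip = [30, 35, 40, 45, 50, 55]
arm = [25, 28, 34, 37, 44, 46]

n = len(grip)
mx = sum(grip) / n
my = sum(arm) / n
# Pearson correlation: r = cov(x, y) / (sd_x * sd_y)
cov = sum((x - mx) * (y - my) for x, y in zip(grip, arm))
sx = sum((x - mx) ** 2 for x in grip) ** 0.5
sy = sum((y - my) ** 2 for y in arm) ** 0.5
r = cov / (sx * sy)
# Bivariate regression slope and intercept for y = a + b*x
b = cov / sx ** 2
a = my - b * mx
print(f"r = {r:.3f}, slope = {b:.3f}, intercept = {a:.3f}")
```

An r close to 1 matches the claim in the text: people with higher grip
strength tend to have higher arm strength.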

Multivariate Analysis

Multivariate analysis examines several variables to see if one or more of them


are predictive of a certain outcome. The predictive variables are independent
variables and the outcome is the dependent variable. The variables can be
continuous, meaning they can have a range of values, or they can be
dichotomous, meaning they represent the answer to a yes or no question.
Multiple regression analysis is the most common method used in multivariate
analysis to find correlations between data sets. Others include logistic
regression and multivariate analysis of variance.

Example:

Multivariate analysis was used by researchers in a 2009 Journal of
Pediatrics study to investigate whether negative life events, family
environment, family violence, media violence and depression are predictors
of youth aggression and bullying. In this case, negative life events, family
environment, family violence, media violence and depression were the
independent predictor variables, and aggression and bullying were the
dependent outcome variables. Over 600 subjects, with an average age of 12
years, were given questionnaires to determine the predictor variables for
each child. A survey also determined the outcome variables for each child.
Multiple regression equations and structural equation modelling were used
to study the data set. Negative life events and depression were found to be
the strongest predictors of youth aggression.

6. What is the Chi-square test? Write the formula to
compute the t-test and elaborate.
A chi-square (χ2) statistic is a test that measures how a model compares to
actual observed data. The data used in calculating a chi-square statistic must
be random, raw, mutually exclusive, drawn from independent variables, and
drawn from a large enough sample. For example, the results of tossing a fair
coin meet these criteria.
Chi-square tests are often used in hypothesis testing. The chi-square statistic
compares the size of any discrepancies between the expected results and the
actual results, given the size of the sample and the number of variables in the
relationship.

χ²c = Σ (Oi − Ei)² / Ei

where:

c = degrees of freedom

Oi = observed value(s)

Ei = expected value(s)

In statistics, the term "t-test" refers to a hypothesis test in which the test
statistic follows a Student's t-distribution. It is used to check whether two
data sets are significantly different from each other or not. For two
independent samples with means x̄1 and x̄2, sample variances s1² and s2²,
and sizes n1 and n2, the test statistic is

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
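The coin-toss example above can be worked through directly. This sketch uses
hypothetical counts (58 heads out of 100 tosses) and compares the chi-square
statistic to the 5% critical value for 1 degree of freedom.

```python
# Chi-square goodness-of-fit for a fair coin (hypothetical counts).
# Observed in 100 tosses: 58 heads, 42 tails; expected under fairness: 50/50.
observed = [58, 42]
expected = [50, 50]

# chi2 = sum over categories of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Critical value for 1 degree of freedom at alpha = 0.05
crit = 3.841
print(f"chi2 = {chi2:.2f}; reject fairness: {chi2 > crit}")
```

Here chi2 = 2.56 is below 3.841, so the discrepancy is small enough to be
consistent with a fair coin at the 5% level.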

7. What are the null hypothesis and the alternative
hypothesis? Elaborate with an example.

The actual test begins by considering two hypotheses. They are called the null
hypothesis and the alternative hypothesis. These hypotheses contain opposing
viewpoints.

H0: the null hypothesis

 It is a statement of no difference between sample means or proportions,
 or of no difference between a sample mean or proportion and a
population mean or proportion.
 In other words, the difference equals 0.

Ha: the alternative hypothesis

 It is a claim about the population that is contradictory to H0 and what we
conclude when we reject H0.

Since the null and alternative hypotheses are contradictory, you must examine
the sample data to decide whether there is enough evidence to reject the null
hypothesis.

After you have determined which hypothesis the sample supports, you make
a decision. There are two options for a decision. They are “reject H0” if the
sample information favors the alternative hypothesis or “do not reject H0” or
“decline to reject H0” if the sample information is insufficient to reject the null
hypothesis.
Mathematical Symbols Used in H0 and Ha:

H0: equal (=), greater than or equal to (≥), less than or equal to (≤)
Ha: not equal (≠), greater than (>), less than (<)

8. What are Type I and Type II errors? Elaborate with an
example.
A Type I error occurs when the null hypothesis is actually true but is
rejected as false by the test. A Type I error, or false positive, is asserting
something as true when it is actually false. This false positive is
essentially a "false alarm" -- a result that indicates a given condition has
been fulfilled when it actually has not. It means concluding that results are
statistically significant when, in reality, they came about purely by chance
or because of unrelated factors.

A Type II error occurs when the null hypothesis is actually false but is
accepted as true by the test. A Type II error, or false negative, is where a
test result indicates that a condition failed while it actually held. A Type
II error is committed when we fail to believe a true condition. This is not
quite the same as "accepting" the null hypothesis, because hypothesis testing
can only tell you whether to reject it.
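The meaning of the Type I error rate can be shown with a small simulation
(all numbers hypothetical): when the null hypothesis is true, a two-sided
test at alpha = 0.05 should raise a "false alarm" in roughly 5% of repeated
experiments.

```python
import random
import statistics

random.seed(42)
z_crit = 1.96            # two-sided critical value for alpha = 0.05
n, trials = 30, 2000
false_positives = 0
for _ in range(trials):
    # Sample from N(0, 1): the null hypothesis "mean = 0" is actually true.
    sample = [random.gauss(0, 1) for _ in range(n)]
    stat = statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)
    if abs(stat) > z_crit:
        false_positives += 1   # Type I error: rejecting a true null
rate = false_positives / trials
print(f"observed Type I error rate ≈ {rate:.3f}")
```

The observed rejection rate hovers near the chosen significance level, which
is exactly what "Type I error rate = alpha" means.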
9. How will you differentiate between descriptive statistics
and inferential statistics?

S.No. | Descriptive Statistics | Inferential Statistics
1. | It gives information about raw data, describing the data in some manner. | It makes inferences about a population using data drawn from that population.
2. | It helps in organizing, analyzing and presenting data in a meaningful manner. | It allows us to compare data, and to make hypotheses and predictions.
3. | It is used to describe a situation. | It is used to explain the chance of occurrence of an event.
4. | It explains already known data and is limited to a sample or population of small size. | It attempts to reach conclusions about the population.
5. | It can be achieved with the help of charts, graphs, tables, etc. | It can be achieved by probability.

10. What does a measure of central tendency indicate?
Describe the important measures of central tendency,
pointing out the situations in which one measure is
considered relatively more appropriate than the others.
A measure of central tendency (also referred to as measures of centre or
central location) is a summary measure that attempts to describe a whole set
of data with a single value that represents the middle or centre of its
distribution.
Measures of central tendency – mean, mode, median.

Summary of when to use the mean, median and mode:

1. Mean is the most frequently used measure of central tendency
and is generally considered the best measure of it. However,
there are some situations where either the median or the mode
is preferred.
2. Median is the preferred measure of central tendency when:
a. There are a few extreme scores in the distribution of
the data. (NOTE: a single outlier can have a great
effect on the mean.)
b. There are some missing or undetermined values in your
data.
c. There is an open-ended distribution. (For example, if you
have a data field that measures number of children
and your options are 0, 1, 2, 3, 4, 5 or "6 or
more," then the "6 or more" field is open-ended and
makes calculating the mean impossible, since we do
not know exact values for this field.)
d. You have data measured on an ordinal scale.
3. Mode is the preferred measure when data are measured on a
nominal (and sometimes even an ordinal) scale.

Use the following summary table to identify the best measure of central
tendency for each type of variable.

Type of Variable Best measure of central tendency


Nominal Mode
Ordinal Median
Interval/Ratio (not skewed) Mean
Interval/Ratio (skewed) Median
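The outlier rule in point 2a can be demonstrated with Python's standard
`statistics` module; the income figures below are hypothetical.

```python
import statistics

# Hypothetical incomes (in thousands); 400 is an extreme outlier.
incomes = [30, 32, 35, 36, 38, 40, 400]

mean_v = statistics.mean(incomes)       # pulled far upward by the outlier
median_v = statistics.median(incomes)   # stays near the typical value
mode_v = statistics.mode([1, 2, 2, 3])  # mode: most frequent value

print(f"mean = {mean_v:.1f}, median = {median_v}, mode = {mode_v}")
```

The mean lands far above six of the seven values, while the median still
represents a typical income, which is why the median is preferred for skewed
interval/ratio data.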
