
Assignment

All questions are mandatory.

Total Marks: 50. Marks are given in [ ].

Q1. Differentiate between data and information [2]. Define a variable and its different types and levels of
measurement, and give suitable examples [1+3+1+3].

Ans.
Data are raw, individual facts or figures that carry no specific meaning on their own. Information is data that
has been organized or processed so that it collectively carries a logical meaning. Data does not depend on
information, whereas information depends on data.

A variable is a particular characteristic on which a set of data is recorded.


There are two types of variables:

Quantitative/Numerical
Qualitative/Categorical

Levels of measurement:
Discrete variables: take countable values, usually whole numbers. Example: how old you are in completed
years: 21, 23, 59, and so on. In practice the set of observed values has an end (probably 100 or so at most).

Continuous variables: can take any value within an interval, so their possible values cannot be counted. The
weight of an egg is one example: an egg could weigh 2.01 oz, 2.0031 oz, 2.0000000000002 oz, and so on.
Subtypes of numerical variables:

Interval variables: the differences between values are meaningful, but there is no true zero. As an example,
with temperature in degrees Celsius the difference between 10 and 20 degrees equals the difference between
20 and 30 degrees, but 0 degrees does not mean that there is no temperature.

Ratio variables: interval variables with a meaningful zero. For example, 0 pounds means that you weigh nothing.

Nominal (categorical) variables: can be placed into categories that have no natural order, like blood group or
eye colour.

Ordinal (ranked) variables: categories that have an order, like 1st, 2nd, 3rd, or age bands such as “under 10s”
and “65 or older”.

Q2. Define Population and Sample in Statistics and explain how they differ from each other [1+1+2]. Explain
what is meant by descriptive statistics and inferential statistics with suitable examples [2+2].
Ans.

A population is the entire group that you want to draw conclusions about. A sample is the specific group that
you will collect data from. The size of the sample is always less than the total size of the population.

Descriptive statistics describes data (for example, a chart or graph) and inferential statistics allows you to
make predictions (“inferences”) from that data. With inferential statistics, you take data from samples and
make generalizations about a population.
For example, you might stand in a mall and ask a sample of 100 people if they like shopping at Sears. You could
make a bar chart of yes or no answers (that would be descriptive statistics) or you could use your research
(and inferential statistics) to reason that around 75-80% of the population (all shoppers in all malls) like
shopping at Sears.
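
The mall example can be sketched in a few lines of Python. The count of 78 "yes" answers is a hypothetical
figure chosen only for illustration, and the interval uses the usual normal approximation (z = 1.96), which is
an assumption rather than something stated in the answer above.

```python
import math

# Hypothetical survey result: 78 of 100 sampled shoppers answered "yes".
n = 100
yes = 78

# Descriptive statistic: the proportion observed in the sample itself.
p_hat = yes / n

# Inferential step: approximate 95% confidence interval for the
# population proportion (normal approximation, z = 1.96).
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(f"sample proportion = {p_hat:.2f}")
print(f"approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

The first printed line describes the sample; the second is an inference about the whole population.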

Let’s say you have some sample data about a potential new cancer drug. You could use descriptive statistics to
describe your sample, including:
• Sample mean
• Sample standard deviation
• Making a bar chart or boxplot
• Describing the shape of the sample probability distribution

Q3. We have learned about measures of central tendency [MCT], mainly, Mean, Median and Mode. Explain
their uses [1], strengths [1] and limitations [1]. Also mention, with examples, the types of variables for which
each MCT is preferred.
Ans.
The mode is the most commonly occurring value in a distribution.

Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.

Advantage of the mode:

The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.

Limitations of the mode:

There are some limitations to using the mode. In some distributions, the mode may not reflect the centre of the
distribution very well. When the distribution of retirement age is ordered from lowest to highest value, it is
easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode for the same distribution of data (bi-modal or
multi-modal). The presence of more than one mode can limit the ability of the mode in describing the centre or
typical value of the distribution because a single value to describe the centre cannot be identified.

In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e., if all
values are different).

In cases such as these, it may be better to consider using the median or mean, or to group the data into
appropriate intervals and find the modal class.
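
As a quick check of the retirement age example, the frequency table and the mode can be computed with the
Python standard library. This is only a sketch; `multimode` is used here (an editorial choice) because it
returns every tied value, which makes bi-modal data visible.

```python
from collections import Counter
from statistics import multimode

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Frequency distribution: value -> number of occurrences.
frequency = Counter(ages)

# multimode returns every value tied for the highest frequency,
# so bi-modal or multi-modal data yields more than one entry.
modes = multimode(ages)

print(sorted(frequency.items()))
print(modes)
```

For a list such as [1, 1, 2, 2] the same call returns both 1 and 2, which illustrates the bi-modal limitation
described above.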

The median is the middle value in a distribution when the values are arranged in ascending or descending order.

The median divides the distribution in half (there are 50% of observations on either side of the median value).
In a distribution with an odd number of observations, the median value is the middle value.

Looking at the retirement age distribution (which has 11 observations), the median is the middle value, which
is 57 years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of the two middle
values. In the following distribution, the two middle values are 56 and 57, therefore the median equals 56.5
years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
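
Both cases (odd and even number of observations) can be verified with a short sketch using the standard
library's `median` function on the two lists given above.

```python
from statistics import median

# 11 observations: the median is the single middle value.
odd_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# 12 observations: the median is the mean of the two middle values.
even_ages = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

median_odd = median(odd_ages)
median_even = median(even_ages)

print(median_odd)   # 57
print(median_even)  # 56.5
```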

Advantage of the median:

The median is less affected by outliers and skewed data than the mean, and is usually the preferred measure
of central tendency when the distribution is not symmetrical.

Limitation of the median:

The median cannot be identified for categorical nominal data, as it cannot be logically ordered.

The mean is the sum of the value of each observation in a dataset divided by the number of observations. This
is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623) and
dividing by the number of observations (11) which equals 56.6 years.
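
The same arithmetic can be sketched in Python, using only the numbers already given for the retirement age
data.

```python
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

total = sum(ages)         # sum of all observations
count = len(ages)         # number of observations
mean_age = total / count  # arithmetic mean

print(total, count)         # 623 11
print(round(mean_age, 1))   # 56.6
```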

Advantage of the mean:

The mean can be used for both continuous and discrete numeric data.

Limitations of the mean:

The mean cannot be calculated for categorical data, as the values cannot be summed.

As the mean includes every value in the distribution, the mean is influenced by outliers and skewed
distributions.

(3) What do you mean by skewness of a given data set [1]? What are the different types of skewness [2]? What is
the relationship between the MCTs for symmetrical data [1]?
Ans.
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal
distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed.

Types of Skewness

Broadly speaking, there are two types of skewness: They are (1) Positive skewness and (2) Negative skewness.

Positive skewness
A series is said to have positive skewness when the following characteristics are noticed:

• Mean > Median > Mode.

• The right tail of the curve is longer than its left tail, when the data are plotted through a
histogram, or a frequency polygon.
• The skewness formula and its coefficient give positive values.

Negative skewness
A series is said to have negative skewness when the following characteristics are noticed:

• Mode > Median > Mean.

• The left tail of the curve is longer than the right tail, when the data are plotted through a
histogram, or a frequency polygon.
• The skewness formula and its coefficient give negative values.

When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal. In this
case, analysts tend to use the mean because it includes all of the data in the calculations. However, if we have
a skewed distribution, the median is often the best measure of central tendency.
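
The sign of skewness can be illustrated with a small sketch. The answer above does not say which skewness
formula is meant, so the moment coefficient (third central moment divided by the cubed standard deviation) is
assumed here, and the three data sets are hypothetical.

```python
from statistics import mean, median, pstdev

def moment_skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / sd**3."""
    m = mean(values)
    s = pstdev(values)
    n = len(values)
    return sum((x - m) ** 3 for x in values) / (n * s ** 3)

# Hypothetical data sets chosen only to show the sign of the coefficient.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long right tail
left_skewed = [-x for x in right_skewed]     # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, mean(data), median(data), round(moment_skewness(data), 3))
```

In the right-skewed list the mean (3.5) lies above the median (3), in the mirrored list it lies below, and in
the symmetric list the two coincide.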

Q4. We have also learned about measures of dispersion [MD], mainly, Range, variance, and standard
deviation.

• Explain their uses [1], strengths [1] and limitations [1].

• Why is standard deviation the preferred measure of dispersion over variance, even though it is calculated
from variance [1]?

• What is a Statistic [1] and a Parameter [1]?

• What is a sampling distribution [1]?

• What is standard error in statistics [1], and how is it useful in the estimation of a parameter [2]?

Ans.

1) Merits of Range

• It is the simplest of the measures of dispersion

• Easy to calculate

• Easy to understand

• Independent of change of origin

Demerits of Range

• It is based on only the two extreme observations. Hence, it is strongly affected by extreme values

• A range is not a reliable measure of dispersion

• Dependent on change of scale

Merits of Variance

• It is based on all the observations, unlike the range

• Suitable for further mathematical treatment

• Independent of change of origin

Demerits of Variance

• It is expressed in squared units, so it is harder to interpret

• Dependent on change of scale

• Strongly affected by extreme values, because the deviations are squared

Merits of Standard Deviation

• Squaring the deviations overcomes the drawback of ignoring signs in mean deviations

• Suitable for further mathematical treatment

• Least affected by sampling fluctuations

• The standard deviation is zero if all the observations are equal

• Independent of change of origin

Demerits of Standard Deviation

• Not easy to calculate

• Difficult to understand for a layman

• Dependent on the change of scale

2) Variance is the square of the standard deviation. Being a squared term, it is non-negative. The unit of
variance is the square of the unit of the data, which makes it less intuitive. Standard deviation is preferred over
variance because it is expressed in the same unit as the data, so it can be compared directly with the mean.
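
The three measures, and their behaviour under a change of origin or scale, can be checked with a short sketch.
It reuses the retirement age data from Q3 and treats it as a whole population (hence `pvariance` and `pstdev`);
that choice is an assumption made for illustration.

```python
from statistics import pstdev, pvariance

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]  # years

data_range = max(ages) - min(ages)  # uses only the two extreme values
variance = pvariance(ages)          # in squared units (years squared)
std_dev = pstdev(ages)              # in the original units (years)

# Change of origin: adding a constant leaves the dispersion unchanged.
shifted = [x + 10 for x in ages]
# Change of scale: multiplying by 2 doubles the SD and quadruples the variance.
scaled = [x * 2 for x in ages]

print(data_range, round(variance, 3), round(std_dev, 3))
print(round(pstdev(shifted), 3), round(pstdev(scaled), 3))
```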

3) A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number
describing a sample (e.g., sample mean).
The goal of quantitative research is to understand characteristics of populations by finding parameters. In
practice, it’s often too difficult, time-consuming or unfeasible to collect data from every member of a
population. Instead, data is collected from samples.

4) A sampling distribution is the probability distribution of a statistic obtained from a large number of samples
of the same size drawn from a specific population. It describes the range of values the statistic could take and
how frequently each would occur across repeated samples.
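
A sampling distribution can be approximated by simulation. The sketch below is illustrative only: the
population (10,000 values drawn from a normal distribution with mean 50 and standard deviation 10), the sample
sizes and the number of repetitions are all hypothetical choices.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values.
population = [random.gauss(50, 10) for _ in range(10_000)]

def sample_means(sample_size, n_samples=2_000):
    """Draw many samples and record the mean of each one."""
    return [mean(random.sample(population, sample_size))
            for _ in range(n_samples)]

# The collected means approximate the sampling distribution of the mean.
means_n10 = sample_means(10)
means_n100 = sample_means(100)

print(round(mean(means_n10), 2), round(pstdev(means_n10), 2))
print(round(mean(means_n100), 2), round(pstdev(means_n100), 2))
```

Both sets of sample means centre on the population mean, and the set built from larger samples is less spread
out.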

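5) The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its
sampling distribution or an estimate of that standard deviation. The standard error decreases as the sample size
increases, because sample means cluster more closely around the population mean. A smaller standard error
therefore indicates a more precise estimate of the parameter, and the standard error is used to build confidence
intervals around the estimate.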
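
For the sample mean, the standard error is commonly estimated as s/√n, where s is the sample standard deviation
and n the sample size. The sketch below assumes that formula and, purely for illustration, treats the
retirement age data from Q3 as a sample; `standard_error` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev

def standard_error(sd, n):
    """Estimated standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Retirement ages from Q3, treated here as a sample (an assumption).
sample = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

se = standard_error(stdev(sample), len(sample))
print(round(mean(sample), 2), round(se, 3))

# Quadrupling the sample size halves the standard error.
print(standard_error(10, 100), standard_error(10, 400))
```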

Q5. We have discussed the types of random variables associated with different probability distributions, namely, Binomial
Distribution [BD], Poisson Distribution [PD] and Normal Distribution [ND].

1. Please mention the type of random variable associated with BD, PD and ND [1.5].

2. What is the relationship between mean and variance in BD, PD, ND [1.5]?

3. What is a standardized normal variate [0.5] and what is a test statistic [0.5]?

4. Why is the Normal Distribution important in statistics? Mention its 2 key properties [0.5+0.5].

Ans.

1) A random variable is a numerical description of the outcome of a statistical experiment. A random variable
that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may
assume any value in some interval on the real number line is said to be continuous. For instance, a random
variable representing the number of automobiles sold at a particular dealership on one day would be discrete,
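while a random variable representing the weight of a person in kilograms (or pounds) would be continuous. The
binomial and Poisson distributions describe discrete random variables, while the normal distribution describes a
continuous random variable.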

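2) For a binomial distribution with n trials and success probability p (q = 1 − p), the mean is np and the
variance is npq, so the variance is less than the mean. (For the sample proportion, the mean is p and the
variance is pq/n.) For reasonable sample sizes and for values of p between about .20 and .80, the binomial
distribution is roughly normal. For a Poisson distribution the mean and the variance are equal (both λ). For a
normal distribution the mean µ and the variance σ² are separate parameters and do not depend on each other.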
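
For a binomial count with n trials and success probability p, the standard results are mean np and variance
np(1 − p); for a Poisson distribution with rate λ, both equal λ. The sketch below checks these numerically from
the probability mass functions. The parameter values (n = 20, p = 0.3, λ = 4) are hypothetical, and the Poisson
sum is truncated at 100 terms, which is an approximation.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def moments(pmf_pairs):
    """Mean and variance from a list of (value, probability) pairs."""
    m = sum(k * pr for k, pr in pmf_pairs)
    v = sum((k - m) ** 2 * pr for k, pr in pmf_pairs)
    return m, v

n, p = 20, 0.3  # hypothetical parameters
b_mean, b_var = moments([(k, binomial_pmf(k, n, p)) for k in range(n + 1)])

lam = 4.0       # hypothetical rate; the tail beyond k = 99 is negligible
p_mean, p_var = moments([(k, poisson_pmf(k, lam)) for k in range(100)])

print(round(b_mean, 4), round(b_var, 4))  # np = 6.0, npq = 4.2
print(round(p_mean, 4), round(p_var, 4))  # both equal to lam = 4.0
```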

3) A standard normal variate (SNV) is a normal variate with mean µ = 0 and standard deviation σ = 1. Any normal
variate x is standardized as z = (x − µ)/σ, and the probability density function of z is
f(z) = (1/√(2π)) e^(−z²/2). The probability that the variate takes a value between 0 and z is the area under
this curve between 0 and z.
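
A value x from a normal distribution is standardized as z = (x − µ)/σ, and the area between 0 and z can be read
from the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the
standard library; the values µ = 100, σ = 15 and x = 130 are hypothetical.

```python
from statistics import NormalDist

# Hypothetical normal variable with mean 100 and standard deviation 15.
mu, sigma = 100, 15
x = 130

# Standardize: z measures how many standard deviations x lies from the mean.
z = (x - mu) / sigma

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist()

# Probability that the standard normal variate falls between 0 and z.
area_0_to_z = std_normal.cdf(z) - std_normal.cdf(0)

print(z)                      # 2.0
print(round(area_0_to_z, 4))  # 0.4772
```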

The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your
observed data match the distribution expected under the null hypothesis of that statistical test.

The test statistic is used to calculate the p-value of your results, helping to decide whether to reject your null
hypothesis.
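
As one concrete instance, the test statistic of a one-sample z-test and its two-sided p-value can be sketched
as follows. The function name and the input numbers are hypothetical, and the population standard deviation is
assumed to be known.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and two-sided p-value for H0: mean == mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical study: 100 observations with sample mean 52,
# null-hypothesis mean 50, known population standard deviation 10.
z, p_value = one_sample_z_test(sample_mean=52, mu0=50, sigma=10, n=100)

print(round(z, 2), round(p_value, 4))
```

A p-value below the chosen significance level (for example 0.05) would lead to rejecting the null hypothesis.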

4) As with any probability distribution, the normal distribution describes how the values of a variable are
distributed. It is the most important probability distribution in statistics because it accurately describes the
distribution of values for many natural phenomena. Characteristics that are the sum of many independent
processes frequently follow normal distributions. For example, heights, blood pressure, measurement error,
and IQ scores approximately follow the normal distribution.

Properties of Normal distribution:

1) The normal curve is bell shaped in appearance.
2) The normal curve has one maximum point, which occurs at the mean.
3) As it has only one maximum, it is unimodal.
4) In the binomial and Poisson distributions the variable is discrete, while in the normal distribution it is continuous.
5) Mean = median = mode.

Q6. We have also discussed the different types of variables (categorical etc.) and the different types of statistical
tests based on these types of variables. Please mention which type of variables you will use for these tests
[0.5 each].

Example:

Binomial Proportion test: One categorical variable with two levels

1. Chi-Square test for independence
2. One Sample T-test
3. One sample Z-test
4. Independent Sample T-test
5. Paired Sample T-test
6. ANOVA
7. ANCOVA
8. Linear Regression: type of dependent and independent variables
9. Logistic Regression: type of dependent and independent variables
10. Chi-Square goodness of fit.

Ans.

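1) Two categorical variables

2) One continuous numerical variable

3) One continuous numerical variable

4) One continuous numerical dependent variable and one categorical independent variable with two levels (two
independent groups)

5) One continuous numerical variable measured twice on the same subjects (two paired measurements)

6) Dependent variable is numerical (continuous) and the factor variable is categorical, usually with three or
more levels

7) Numerical (continuous) dependent variable, one categorical factor and at least one continuous covariate

8) Dependent variable should be numerical (continuous); independent variables are usually numerical, and
categorical ones can be included when coded as dummy variables

9) Dependent variable is dichotomous and independent variables can be categorical or continuous (interval level)

10) One categorical variable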
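
To make the first item concrete, the chi-square statistic for independence between two categorical variables
can be computed by hand from a contingency table. The 2x2 table below is hypothetical, and
`chi_square_statistic` is an illustrative helper, not a library function.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = group (A, B), columns = answer (yes, no).
observed = [[30, 20],
            [20, 30]]

chi2 = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi2, dof)  # 4.0 1
```

With 1 degree of freedom, the 5% critical value of the chi-square distribution is about 3.84, so a statistic
of 4.0 would lead to rejecting independence at that level.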

Q7. We have also discussed two types of error in statistical hypothesis testing. Explain them with examples
[1]. What is confidence interval estimation and its importance in statistical inferences [1]?

Ans.
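A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the
population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is
actually false in the population. For example, concluding that a new drug works when it actually has no effect is
a type I error; concluding that the drug has no effect when it actually works is a type II error.
A confidence interval provides a range of plausible values for a population parameter, computed from a sample
statistic at a given level of confidence (usually 95%). Conventional hypothesis testing serves only to either
reject or retain a null hypothesis, whereas a confidence interval also conveys the size of the effect and the
precision of the estimate.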

                                  Truth (for population studied)
Decision (based on sample)        Null Hypothesis True    Null Hypothesis False
Reject Null Hypothesis            Type I Error            Correct Decision
Fail to reject Null Hypothesis    Correct Decision        Type II Error
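
The two error rates can be illustrated by simulation. The sketch below repeatedly runs a two-sided z-test of
the null hypothesis "mean = 0" at the 5% level. All settings (sample size 30, standard deviation 1, a true mean
of 0.3 for the false-null case, 2,000 repetitions) are hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the run is repeatable

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

def rejects_null(true_mean, n=30):
    """Run one z-test of H0: mean == 0 on a simulated sample (sd = 1)."""
    sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
    z = mean(sample) / (1.0 / math.sqrt(n))
    return abs(z) > z_crit

trials = 2_000

# Null hypothesis true: every rejection is a type I error.
type1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials

# Null hypothesis false (true mean 0.3): every non-rejection is a type II error.
type2_rate = sum(not rejects_null(0.3) for _ in range(trials)) / trials

print(round(type1_rate, 3), round(type2_rate, 3))
```

The type I error rate stays close to the chosen significance level, while the type II error rate depends on
how far the true mean is from the null value and on the sample size.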
