Note: Each question carries 10 marks. Answer all the questions.
Q1. Explain the following terms with respect to Statistics: (i) Sample, (ii) Variable, (iii) Population.
A.1 (i) Sample In statistics, a sample is a subset of a population. Typically, the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible. The sample represents a subset of manageable size. Samples are collected and statistics are calculated from the samples so that one can make inferences or extrapolations from the sample to the population. This process of collecting information from a sample is referred to as sampling. A complete sample is a set of objects from a parent population that includes ALL such objects that satisfy a set of well-defined selection criteria. For example, a complete sample of Australian men taller than 2m would consist of a list of every Australian male taller than 2m. But it wouldn't include German males, or tall Australian females, or people shorter than 2m. So to compile such a complete sample requires a complete list of the parent population, including data on height, gender, and nationality for each member of that parent population. In the case of human populations, such a complete list is unlikely to exist, but such complete samples are often available in other disciplines, such as complete magnitude-limited samples of astronomical objects. An unbiased sample is a set of objects chosen from a complete sample using a selection process that does not depend on the properties of the objects. For example, an unbiased sample of
Australian men taller than 2m might consist of a randomly sampled subset of 1% of Australian males taller than 2m. But one chosen from the electoral register might not be unbiased since, for example, males aged under 18 will not be on the electoral register. In an astronomical context, an unbiased sample might consist of that fraction of a complete sample for which data are available, provided the data availability is not biased by individual source properties. The best way to avoid a biased or unrepresentative sample is to select a random sample, also known as a probability sample. A random sample is defined as a sample where each individual member of the population has a known, non-zero chance of being selected as part of the sample. Several types of random samples are simple random samples, systematic samples, stratified random samples, and cluster random samples.
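As a rough illustration of two of the random sampling schemes named above, the sketch below draws a simple random sample and a systematic sample from a made-up population of 1,000 numbered members. The population, the sample size, the seed and the function names are all assumptions chosen for the example.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Draw k members without replacement; every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, k, seed=0):
    """Pick a random start, then take every step-th member after it."""
    step = len(population) // k
    rng = random.Random(seed)
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(k)]

# Hypothetical population: members numbered 1..1000
population = list(range(1, 1001))

srs = simple_random_sample(population, 10)
sys_sample = systematic_sample(population, 10)

print("Simple random sample:", srs)
print("Systematic sample:   ", sys_sample)
```

In the systematic sample, consecutive members are always the same distance apart (here 100), whereas the simple random sample has no such pattern.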
(ii) Variable A variable is a characteristic that may assume more than one set of values to which a numerical measure can be assigned. Height, age, amount of income, province or country of birth, grades obtained at school and type of housing are all examples of variables. Variables may be classified into various categories, some of which are outlined in this section. Categorical variables: A categorical variable (also called qualitative variable) is one for which each response can be put into a specific category. These categories must be mutually exclusive and exhaustive. Mutually exclusive means that each possible survey response should belong to only one category, whereas exhaustive means that the categories should cover the entire set of possibilities. Categorical variables can be either nominal or ordinal. Nominal variables: A nominal variable is one that describes a name or category. Contrary to ordinal variables, there is no 'natural ordering' of the set of possible names or categories.
Ordinal variables: An ordinal variable is a categorical variable for which the possible categories can be placed in a specific order or in some 'natural' way. Numeric variables: A numeric variable, also known as a quantitative variable, is one that can assume a number of real values—such as age or number of people in a household. However, not all variables described by numbers are considered numeric. For example, when you are asked to assign a value from 1 to 5 to express your level of satisfaction, you use numbers, but the variable (satisfaction) is really an ordinal variable. Numeric variables may be either continuous or discrete. Continuous variables: A variable is said to be continuous if it can assume an infinite number of real values. Examples of a continuous variable are distance, age and temperature. The measurement of a continuous variable is restricted by the methods used, or by the accuracy of the measuring instruments. For example, the height of a student is a continuous variable because a student may be 1.6321748755... metres tall. Discrete variables: As opposed to a continuous variable, a discrete variable can only take a finite number of real values. An example of a discrete variable would be the score given by a judge to a gymnast in competition: the range is 0 to 10 and the score is always given to one decimal (e.g., a score of 8.5). (iii) Population A statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest. Notice that if we choose a population like all crows, we will be limited to observing crows that exist now or will exist in the future. Probably, geography will also constitute a limitation in that our resources for studying crows are also limited. 
Population is also used to refer to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable. Suppose, for example, we are interested in the set of all adult crows now alive in the county of Cambridgeshire, and we want to know the mean weight of these birds. For each bird in the population of crows there is a weight, and the set of these weights is called the population of weights.
A subset of a population is called a subpopulation. If different subpopulations have different properties, the properties and response of the overall population can often be better understood if it is first separated into distinct subpopulations.
For instance, a particular medicine may have different effects on different subpopulations, and these effects may be obscured or dismissed if such special subpopulations are not identified and examined in isolation.
Similarly, one can often estimate parameters more accurately if one separates out subpopulations: distribution of heights among people is better modeled by considering men and women as separate subpopulations, for instance.
Populations consisting of subpopulations can be modeled by mixture models, which combine the distributions within subpopulations into an overall population distribution.
Q2. What are the types of classification of data?
A.2 According to Nature
1. Quantitative data: information obtained from numerical variables (e.g. age, bills).
2. Qualitative data: information obtained from variables in the form of categories, characteristics, names or labels, or alphanumeric variables (e.g. birthdays, gender).
According to Source
1. Primary data: first-hand information (e.g. autobiography, financial statement).
2. Secondary data: second-hand information (e.g. biography, weather forecast from newspapers).
According to Measurement
1. Discrete data: countable numerical observations; whole numbers only, with equal whole-number intervals, obtained through counting (e.g. corporate stocks).
2. Continuous data: measurable observations; decimals or fractions obtained through measuring (e.g. bank deposits, volume of liquid).
QUALITATIVE DATA Qualitative data is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data. For example: favorite color = "yellow" height = "tall" Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport. When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables, however we may not know which value is the best or worst of these issues. Note that the distance between these categories is not something we can measure. QUANTITATIVE DATA Quantitative data is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract. For example: favorite color = "450 nm"
height = "1.8 m" Quantitative data are always associated with a scale measure. Probably the most common scale type is the ratio scale. Observations of this type are on a scale that has a meaningful zero value and also has an equidistant measure (i.e., the difference between 10 and 20 is the same as the difference between 100 and 110). For example, a 10-year-old girl is twice as old as a 5-year-old girl. Since you can measure zero years, time is a ratio-scale variable. Money is another common ratio-scale quantitative measure. Observations that you count are usually ratio-scale (e.g., number of widgets). A more general quantitative measure is the interval scale. Interval scales also have an equidistant measure. However, the doubling principle breaks down in this scale. A temperature of 50 degrees Celsius is not "half as hot" as a temperature of 100, but a difference of 10 degrees indicates the same difference in temperature anywhere along the scale. The Kelvin temperature scale, however, constitutes a ratio scale because on the Kelvin scale zero indicates absolute zero in temperature, the complete absence of heat. So one can say, for example, that 200 kelvin is twice as hot as 100 kelvin. PRIMARY DATA Primary data means original data that has been collected specially for the purpose in mind. That is, an authorized organization, investigator or enumerator collects the data for the first time from the original source. Data collected this way is called primary data. SECONDARY DATA Secondary data is data that has been collected for another purpose. When we apply a statistical method to primary data that was collected for a different purpose, we refer to it as secondary data. In other words, one purpose's primary data is another purpose's secondary data. Secondary data is data that is being reused, usually in a different context.
Q3. Find the (i) arithmetic mean and (ii) range of the following data: 15, 17, 22, 21, 19, 26, 20.
A.3 (i) Arithmetic mean = (15 + 17 + 22 + 21 + 19 + 26 + 20)/7 = 140/7 = 20
(ii) Range = highest value - lowest value = 26 - 15 = 11
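The arithmetic for Q3 can be checked with a few lines of Python. This is only a verification sketch; the variable names are chosen for the example.

```python
# Data from Q3
data = [15, 17, 22, 21, 19, 26, 20]

# Arithmetic mean: sum of the observations divided by their count
mean = sum(data) / len(data)

# Range: largest observation minus smallest observation
data_range = max(data) - min(data)

print("Mean :", mean)        # 20.0
print("Range:", data_range)  # 11
```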
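Q4. Suppose two houses in a thousand catch fire in a year and there are 2000 houses in a village. What is the probability that: (i) none of the houses catches fire and (ii) at least one house catches fire?
A.4 Given the probability of a house catching fire is $p = 2/1000 = 0.002$ and $n = 2000$, the mean number of fires per year is $m = np = 2000 \times 0.002 = 4$. Because $n$ is large and $p$ is small, the number of houses catching fire, $X$, can be treated as a Poisson variable with $P(X = x) = e^{-m} m^{x} / x!$.
Therefore, the required probabilities are calculated as follows:
i. The probability that none catches fire is given by: $P(X = 0) = e^{-4} \cdot 4^{0} / 0! = e^{-4} = 0.01832$
Therefore, the probability that none of the houses catches fire is 0.01832.
ii. The probability that at least one catches fire is given by: $P(X \ge 1) = 1 - P(X = 0) = 1 - 0.01832 = 0.98168$
Therefore, the probability that at least one house catches fire is 0.98168.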
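The two probabilities can be reproduced numerically. The sketch below assumes the Poisson approximation with mean m = np = 4, which is consistent with the stated answer 0.01832 = e^{-4}. The helper `poisson_pmf` and the exact binomial value computed for comparison are additions made for illustration.

```python
import math

p = 2 / 1000   # probability that a given house catches fire in a year
n = 2000       # number of houses in the village
m = n * p      # Poisson mean, m = np = 4

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

p_none = poisson_pmf(0, m)     # P(X = 0) = e^(-4)
p_at_least_one = 1 - p_none    # P(X >= 1) = 1 - P(X = 0)

# For comparison only: the exact binomial probability of no fires
p_none_binomial = (1 - p) ** n

print(f"P(none)          = {p_none:.5f}")
print(f"P(at least one)  = {p_at_least_one:.5f}")
print(f"Binomial P(none) = {p_none_binomial:.5f}")
```

The binomial figure differs only slightly from the Poisson one, which is why the approximation is commonly used when n is large and p is small.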
Q5. (i) What are the characteristics of Chi-square test? (ii) The data given in the table below shows the production in three shifts and the number of defective goods that turned out in three weeks. Test at 5% level of significance whether the weeks and shifts are independent.
Shift        I    II   III   Total
1st Week    15    20    25      60
2nd Week     5    10    15      30
3rd Week    20    20    20      60
Total       40    50    60     150
A.5 (i)
1. It is not symmetric.
2. The shape of the chi-square distribution depends upon the degrees of freedom, just like Student's t-distribution.
3. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetric, as is illustrated in Figure 1.
4. The values are non-negative. That is, the values of $\chi^2$ are greater than or equal to 0.
5. This is not a test, but a distribution. The Chi-square distribution is derived from the Normal distribution. It is the distribution of a sum of squared Normal distributed variables. That is, if all Xi are independent and all have an identical, standard Normal distribution, then X^2 = X1*X1 + X2*X2 + X3*X3 + ... + Xv*Xv is Chi-square distributed with v degrees of freedom, with mean = v and variance = 2*v. The importance of the Chi-square distribution stems from the fact that it describes the distribution of the variance of a sample taken from a Normal distributed population.
6. Chi-square is non-negative. It is the ratio of two non-negative values and therefore must be non-negative itself.
7. There are many different chi-square distributions, one for each degree of freedom.
8. The degrees of freedom when working with a single population variance is n-1.
Since the chi-square distribution isn't symmetric, the method for looking up left-tail values is different from the method for looking up right-tail values.
Area to the right - just use the area given. Area to the left - the table requires the area to the right, so subtract the given area from one and look this area up in the table.
Area in both tails - divide the area by two. Look up this area for the right critical value and one minus this area for the left critical value.
(ii) Observed data (number of defective goods by week and shift):
Shift        I    II   III   Total
1st Week    15    20    25      60
2nd Week     5    10    15      30
3rd Week    20    20    20      60
Total       40    50    60     150
Table b. Observed and expected values for data of above problem (ii)
Observed Value (O)   Expected Value (E)     (O - E)^2   (O - E)^2 / E
15                   40 x 60/150 = 16        1          0.0625
20                   50 x 60/150 = 20        0          0.0000
25                   60 x 60/150 = 24        1          0.0417
 5                   40 x 30/150 = 8         9          1.1250
10                   50 x 30/150 = 10        0          0.0000
15                   60 x 30/150 = 12        9          0.7500
20                   40 x 60/150 = 16       16          1.0000
20                   50 x 60/150 = 20        0          0.0000
20                   60 x 60/150 = 24       16          0.6667
Total                                                   3.6459
The steps followed to calculate $\chi^2$ are described below.
1. Null hypothesis 'Ho': The weeks and shifts are independent. Alternate hypothesis 'HA': The weeks and shifts are dependent.
2. Level of significance is 5% and D.O.F. = (3 - 1)(3 - 1) = 4
3. Test statistic: $\chi^2 = \sum (O - E)^2 / E$, where each expected value is E = (row total x column total) / grand total.
4. Test: $\chi^2_{cal} = 3.6459$
5. Conclusion: Since $\chi^2_{cal}$ (3.6459) < $\chi^2_{tab}$ (9.49), 'Ho' is accepted. Hence, the attributes 'week' and 'shifts' are independent.
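The chi-square calculation can be reproduced with a short script. This is a minimal sketch using only the observed counts from the question. The critical value 9.49 is taken from the answer above rather than computed, and the variable names are chosen for the example.

```python
# Observed defective goods: rows are weeks 1-3, columns are shifts I-III
observed = [
    [15, 20, 25],
    [5, 10, 15],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]        # [60, 30, 60]
col_totals = [sum(col) for col in zip(*observed)]  # [40, 50, 60]
grand_total = sum(row_totals)                      # 150

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under independence
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (3 - 1)(3 - 1) = 4
critical_value = 9.49  # tabulated value for 4 d.o.f. at the 5% level, from the text

print(f"chi-square = {chi2:.4f}, d.o.f. = {dof}")
if chi2 < critical_value:
    print("Ho is not rejected: weeks and shifts appear independent.")
else:
    print("Ho is rejected: weeks and shifts appear dependent.")
```

The script gives a value of about 3.6458, whereas the table total is 3.6459. The small difference arises because the table adds terms that were already rounded to four decimals.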
Q6. Find Karl Pearson’s correlation co-efficient for the data given in the below table:
A.6 The table below displays the data and the sums calculated from it:
X          Y          XY           X^2           Y^2
20         22         440          400           484
16         14         224          256           196
12          4          48          144            16
 8         12          96           64           144
 4          8          32           16            64
ΣX = 60    ΣY = 60    ΣXY = 840    ΣX^2 = 880    ΣY^2 = 904
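Solution: Applying the formula for 'r' with n = 5 and substituting the respective values from the table we get r as:
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} = \frac{5(840) - (60)(60)}{\sqrt{[5(880) - 60^2][5(904) - 60^2]}} = \frac{600}{\sqrt{800 \times 920}} \approx 0.70$
Hence, Karl Pearson's correlation coefficient is 0.70.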
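As a check on the sums and on the final value, the coefficient can be computed directly from the X and Y columns. This is a sketch using the standard computational formula for Pearson's r; the variable names are chosen for the example.

```python
import math

# Data from the table for Q6
x = [20, 16, 12, 8, 4]
y = [22, 14, 4, 12, 8]
n = len(x)

sum_x = sum(x)                              # 60
sum_y = sum(y)                              # 60
sum_xy = sum(a * b for a, b in zip(x, y))   # 840
sum_x2 = sum(a * a for a in x)              # 880
sum_y2 = sum(b * b for b in y)              # 904

# Pearson's r = (n*SXY - SX*SY) / sqrt((n*SX2 - SX^2) * (n*SY2 - SY^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"r = {r:.2f}")  # r = 0.70
```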