Statistical Techniques

Inferential Statistics

Term paper - Inferential Statistics (Rajarajan & Shoaib)

1

Statistical Techniques
Table of Contents 1.0 WHAT IS INFERENTIAL STATISTICS?.................................................................................................3 2.0 BRIEF TIMELINE OF INFERENTIAL STATISTICS .............................................................................4 3.0 TEST OF WHAT? ..........................................................................................................................................6 3.1 HYPOTHESIS – INTRODUCTION ...............................................................................................6 3.1.2 ERRORS IN SAMPLING.................................................................................................7 3.1.3 STUDENT’s T-TEST ........................................................................................................9 3.1.4 CHI-SQUARE TEST.......................................................................................................10 3.2 REGRESSION? ...............................................................................................................................12 3.2.1 REGRESSION MODELS...............................................................................................12 3.2.2 SCATTER-PLOTS ..........................................................................................................12 3.2.3 REGRESSION EQUATION ..........................................................................................12 3.2.4 REGRESSION INTERPRETATION............................................................................15 3.2.5 R SQUARRED .................................................................................................................15 4.0 BIBLIOGRAPHY.........................................................................................................................................16

Term paper - Inferential Statistics (Rajarajan & Shoaib)

2

Statistical Techniques
1.0 WHAT IS INFERENTIAL STATISTICS?

The vital key to the difference between descriptive and inferential statistics are the capitalized words in the description: CAN DESCRIBE, COULD NOT CONCLUDE, AND REPRESENTATIVE OF. Descriptive statistics can only describe the actual sample you study. But to extend your conclusions to a broader population, like all such classes, all workers, all women, you must use inferential statistics, which means you have to be sure the sample you study is representative of the group you want to generalize to. Allow me to exemplify: i. The study at the local mall and cannot be used to claim that what you find is valid for all shoppers and all malls. ii. Another example would be a study conducted on an intermediate college can’t claim that what you find is valid for the colleges of all levels (i.e. General Population). iii. Also visualize a survey conducted at a women's club that includes a majority of a particular single ethnic group cannot claim that what you find is valid for women for all ethnic groups. As you can see, descriptive statistics are useful and serviceable if you don't need to extend your results to whole segments of the population. But the social sciences tend to esteem studies that give us more or less "universal" truths, or at least truths that apply to large segments of the population, like all teenagers, all parents, all women, all perpetrators, all victims, or a fairly large segment of such groups. Leaving aside the theoretical and mechanical soundness of such an investigation for some kind of broad conclusion, various statistical approaches are to be utilized if one aspires to generalize. And the primary distinction is that of SAMPLING. One must choose a sample that is REPRESENTATIVE OF THE GROUP TO WHICH YOU PLAN TO GENERALIZE. To round up, Descriptive statistics are for describing data on the group you study, While Inferential statistics are for generalizing your findings to a broader population group.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

3

Statistical Techniques
2.0 BRIEF TIMELINE OF INFERENTIAL STATISTICS

1733

1733 - In the 1700s, it was Thomas Bayer who gave birth to the concept of inferential statistics. The normal distribution was discovered in 1733 by a Huguenot refugee de Moivre as an approximation to the binomial distribution when the number of trials is too large. Today, not only do scientists but also many professions rely on statistics to understand behaviour and ideally make predictions about what circumstances relate to or cause these behaviours.

1796

Historical Note: In 1796, Adophe Quetelet investigated the characteristics of French conscripts to determine the "average man." Florence Nightingale was so influenced by Quetelet's work that she began collecting and analyzing medical records in the military hospitals during the Crimean War. Based on her work hospitals began keeping accurate records on their patients, to provide better follow-up care.

1894

1894 - At the inception of the social survey, research results were confronted with the developments in inferential statistics. In 1894, Booth wrote The Aged Poor in England and Wales: Conditions? In this volume Booth claimed that there was no relationship between the ratio of welfare (out-of-doors relief) and workhouse relief (in-relief and the incidence of poverty by parish (or poor law union).

1896

Dec 1896 - Walker died rather suddenly at the age of 56, just days after giving the address opening the first meeting of ASA outside Boston—in Washington, DC in December 1896. That meeting led to the founding of the Washington Statistical Society. His achievements in developing major federal data systems, in promoting the organizational development of statistics, and of bringing statistical ideas to a wide audience, left the field much richer than he found it.

1899

1899 - Since inferential social statistics are primarily concerned with correlation and regression. To prove this Yule published his paper on poverty in London in 1899, this concern has occurred in a context of establishing causality. Often investigators seem to view statistical modeling as being equivalent to a regression model. The reader is cautioned that my critique of regression analysis is not necessarily equivalent to denying the value of empirical research.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

4

Statistical Techniques
1925 1925 – First, Sigmund Freud had developed a theory that self-explained the reasons for aggression and juvenile criminal behaviours in terms of childhood experiences. Second, in 1925, RA Fisher published Statistical Methods for Research Workers in which he identified an effective experimental paradigm that included control groups and inferential statistics. Freud's theory and Fisher's paradigm provided a basis so that mental health professionals could initiate studies to identify many mental behaviours 1930 IN 1930, THE YEAR the CH Stoelting Co. of Chicago published what was to be the largestever catalog of psychological apparatus, there was virtually no use of inferential statistics in psychology, in spite of the fact that William Sealey Cosset had long since presented the T-test and Sir Ronald Fisher had presented the general logic of null hypothesis testing. Only after Fisher's epochal introduction to analysis of variance procedures did psychologists even notice the procedure. 1930 1930 - The fiducial argument, which Fisher produced in 1930, generated much controversy and did not survive the death of its creator. Fisher created many terms in everyday use, eg statistic and sampling distribution and so there are many references to his work on the Words pages. Symbols in Statistics are his contributions to notation. 1935 1935 - In the two decades following the publication of Ronald Aylmer Fisher's Design of Experiments in 1935, Fisher's link between experimental design and inferential statistics became institutionalized in American experimental psychology. 1936 Apr 27, 1936 - . Pearson founded the journal Biometrics and was the editor of Annals of Eugenics. Because of his fundamental work in the development of modern statistics, many scholars today regard Pearson as the founder of 20th-century statistics. He died in Coldharbour, England, on April 27, 1936. 1977 1977 - The youth violence prevention landscape has changed drastically in the last quarter century. In 1977, Wright and Dixon published a review of “Juveniles delinquency prevention program” reports. The results were disappointing. From approximately 6600 program abstracts, empirical data were available from only 96 . Of the 96 empirical reports, only 9 used random assignment of subjects, inferential statistics, outcomes measure of delinquency, and a follow-up period of at least six months. Of those 9, only 3 reported positive outcomes, and these three were based on the three smallest sample sizes among the 9 reports. The authors concluded that the literature was low in both scientific and policy utility. By contrast today dozens of summaries of research on prevention practices are available.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

5

Statistical Techniques
1981 Jun 3, 1981 - Education Practical Statistics for Educators An introduction to the basic ideas of descriptive and inferential statistics as applied to the work of the classroom teachers counselors and administrators In the public schools Emphasis Is upon practical applications of statistics to problems. 1986 Dec 17, 1986 - Koop acknowleged that the proof of these smoker’s deaths was "inferential, of course," based on analyses of statistics gathered in past studies, including several in Japan, Hong Kong, Taiwan, Europe and the United States. 1995 Jan 1995 - A jury trial on compensatory damages was held in January 1995. Dannemiller testified that the selection of the random sample met the standards of inferential statistics, that the successful efforts to locate and obtain testimony from the claimants in the random sample "were of the highest standards " in his profession, that the procedures followed conformed to the standards of inferential statistics.

3.0 TEST OF WHAT?

Tests of significance are helpful in problems of generalization. A Chi-Square or a T-Test tells you the probability that the results you found in the group under study represent the population of the chosen group. It can be frequently observed, Chi-Square or a t-test gives you the probability that the results found could have occurred by chance when there is really no relationship at all between the variables you studied in the population. A known method used in inferential statistics is estimation. In estimation, the sample is used to estimate a parameter, and a confidence interval about the estimate is constructed. Other examples of inferential statistics methods include i. ii. Hypothesis testing Linear regression

3.1 HYPOTHESIS – INTRODUCTION Hyptothesis is a statement about the population parameter or about a population distribution. The testing of hypothesisis conducted in two phases. In the first phase, a test is designed where we decide as to when can the null hypothesis be rejected. In the second phase, the designed test is used to draw the conclusion. Hypothesis testing is to test some hypothesis about parent population from which the sample is drawn. DEFINITIONS PARAMETER - The statistical constants of the population namely mean (µ) , variance are usually referred to as parameters. STATISTIC - Statistical measures computed from the sample observations alone namely mean X Variance S2 have been termed as Statistic.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

6

Statistical Techniques
UNBIASED ESTMATE - A statistic t = t (X1, X2, …..Xn), a function of the sample values X1, X2,…….Xn is an unbiased estimate of the population parameter 0, if E(t) = 0. In other words, if E(Statistic) = Paramater, then statistic is said to be an unbiased estimate of the parameter. SAMPLING DISTRIBUTION OF A STATISTIC - If we draw a sample of size n from a given finite population of size N, then the total number of possible samples is /n!(N-n)! = K

STANDARD ERROR - The standard deviation of the sampling distribution of a statistic is known as it standard error. NULL HYPOTHESIS - A definite statement about the population parameter which is usually a hypothesis of no difference is called Null Hypothesis an is usually denoted by Ho ALTERNATIVE HYPOTHESIS - Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis usually denoted by H1. For example, if we want to test the null hypothesis that the population has a specified mean Mo (say) is Ho : µ - µo then the alternative hypothesis could be a) H1 : µ ≠ µ o b) H1 : µ > µ o c) H1 : µ < µ o 3.1.1 PROCEDURE FOR TESTING OF HYPOTHESIS Various steps in testing of a statistical hypothesis in a systematic manner : 1. Null hypothesis : Set up the null hypothesis H0 2. Alternative Hypothesis : Set up the alternative hypothesis H1. This will be enable us to decide whether we have to use a single tailed(right or left) test of two-tailed test. 3. Level of Significance : To choose the appropriate level of significance (x) 4. Test Statistic : To compute the test statistic : Z = t-E(t)/S1E(t) , under Ho 5. Conclusion : We compare the computed value of Z with the significant value Z2, at the given level of significance, if │z │ <z2 we say it not significant, it │z│ >z is then we say that it is significant and the null hypothesis is rejected at level of significance.

3.1.2 ERRORS IN SAMPLING The main objective in sampling theory is to draw valid inference about the population parameters on the basis of the sample results. In practice, we decide to accept or reject the lot after examining a sample from it. As such we are able to commit the following two types of errors : TYPE 1 ERROR : Reject Ho when it is true, TYPE II ERROR : Accept Ho when it is wrong, ie. Accept Ho when H1 is true If we mention P(Accept Ho when it is wrong) = P(Accept Ho/H1) = β and P(Reject Ho when it is true) = P(Reject Ho/H1) = x then 2 and β are called the sizes of type 1 error and type II error, respectively. In practice, type I error amounts to rejecting a lot when it is good and type II error may be

Term paper - Inferential Statistics (Rajarajan & Shoaib)

7

Statistical Techniques
regarded as accepting the lot when it is bad. Thus P(Reject a lot when it is good) = α and P (Accept a lot when it is bad) = β where α and β are referred to as producer’s risk and consumer’s risk respectively. CRITICAL REGION A region in the sample splace S which amounts to rejection of Ho is termed as critical region of rejection. ONE – TAILED AND TWO-TAILED TESTS Ho: µ > µ o (Right-tailed), the critical region lies entirely in the right tail H1: µ < µ o (left-tailed), the critical region lies entirely in the left tail. A test of statistical hypothesis where the alternative hypothesis is two – tailed tests such as Ho:µ=µ o against the alternative hypothesis H1:µ=µ o isknown as two tailed test and in such a case the critical region is given by the portion of the area lying on both tailsof the probability curve of the test statistic. CRITICAL VALUE OR SIGNIFICANT VALUES The value of test statistic which separates the critical (or rejection) region and the acceptance region is called the critical value or significant value. It depends on : 1) The level of significance used, and 2) The alternative hypothesis, whether it is two-tailed of single-tailed The standardized variable corresponding to the statistic t namely Z = The value of z above under the null hypothesis is known as test statistic. The critical value of the test statistic at level of significance 2 for a two-tailed test is given by Z, where Z is determined by the equation : P(1Z1>Z o ) = α i.e., Zα is the value so that the total area of the critical region on both tails is 2. Since normal probability curve is a symmetrical curve. In case of a single-tail alternative, the critical value of Zα is determined so that total area to the right of it (for right-tailed test) is α and for left-tailed test the total area to the left of (-Zα) is α Thus the significant or critical value of Z for a single-tailed test (left or right) at level of significance α is same as the critical value of Z for a two-tailed test at level of significance ‘α’. Please find below the critical values of Z at commonly used levels of significance for both two-tailed and single-tailed tests

Critical Value Z2 1% Two tailed test Right tailed test Left tailed test │Zα│ = 2.58 Zα = 2.33 Zα = 2.33

LEVEL OF SIGNIFICANCE 5% │Zα│ = 1.96 Zα = 1.645 Zα = 1.645 10% │Zα│ = 1.645 Zα = 1.28 Zα = 1.28

TEST OF SIGNIFICANCE OF A SINGLE MEAN If X1, X2, …….Xn, in a random sample of size n from a normal population with mean M and variance 2, then the sample mean is distributed normally with mean M and variance .

Term paper - Inferential Statistics (Rajarajan & Shoaib)

8

Statistical Techniques
Null Hypothesis, Ho - The sample has been drawn from a population with mean M and variance , ie there is no significance difference between the sample mean(X) and population mean(M), the test statistic (for large samples), is Z = If the population Standard Deviation is unknown, then we use its estimate provided by the sample variance given by (for large samples)

TEST OF SIGNIFICANCE FOR DIFFERENCE OF MEANS The mean of random sample of size n, from a population with Mean M, and Variance and let be the mean of an independent random sample of size n2 from another population with mean M2 and variance ? then, since sample size are large. TEST OF SIGNIFICANCE FOR THE DIFFERENCE OF STANDARD DEVIATION If S1 and S2 are the standard deviation of two independent samples, then under null hypothesis, Ho : the sample standard deviations don’t differ significantly. 1) (for large samples)
1= 2

i.e

But in case of large samples, the S.E of the difference of the sample standard deviations is given by

3.1.3 STUDENT’s T-TEST The entire large sample theory was based on the application of “normal test”. However if the sample size n is small, the distribution of the various statistics are far from normally and as such ‘normal test’ cannot be applied if n is mall. In such cases exact sample tests, pioneered by W.S.Gosst(1908) who wrote under the pen name-of student, and later on developed and extended by Prof.R.A.Fisher(1926) are used. Applications Of T-Distribution The t-distribution has a wide number of applications in statistics, and some of which are 1) To test if the sample mean( mean. 2) To test the significance of the difference between two sample means. 3) To test the significance of an observed sample correlation and sample regression coefficient. 4) To test the significance of observed partial correlation coefficient. T-Test For Single Mean All hypothesis testing is done under the assumption the null hypothesis is true ) differs significantly from the hypothetical value µ of the population

Population Standard Deviation Known If the population standard deviation, sigma, is known, then the population mean has a normal distribution, and you will be using the z-score formula for sample means. The test Term paper - Inferential Statistics (Rajarajan & Shoaib) 9

Statistical Techniques statistic is the standard formula you've seen before. The critical value is obtained from the normal table, or
the bottom line from the t-table. Population Standard Deviation Unknown If the population standard deviation, sigma, is unknown, then the population mean has a student's T-distribution, and you will be using the t-score formula for sample means. The test statistic is very similar to that for the z-score, except that sigma has been replaced by s and z has been replaced by t. The critical value is obtained from the t-table. The degree of freedom for this test is n-1. If you're performing a t-test where you found the statistics on the calculator (as opposed to being given them in the problem), then use the VARS key to pull up the statistics in the calculation of the test statistic. This will save you data entry and avoid round off errors. General Pattern Notice the general pattern of these test statistics is (observed - expected) / standard deviation.

3.1.4 CHI-SQUARE TEST A chi-square test (also chi squared test or χ2 test) is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. Chi-Square Test In Contigency Table CHI-SQUARE distribution is utlised to determine the critical value of the chi-square variate at various level of significance. Properties : (1) The value of chi-square varies from 0 to α. (3) Chi-square can never be negative (2) When each Oi = Ei, the value of chi-square is zero.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

10

Statistical Techniques
CONTIGENCY TABLE : The test if independence of attributes when the frequencies are presented in a two way table according to two attributes classified to various categories known as the contigency table. Test of hypothesis in a contingency table. A contingency is a rectangular array having rows and colums ascertaining to the categories of the attributes of A & B. The null hypothesis : H0 : Two attributes are independent vs H1 : two attributes are dependant on each other

. Statistics X2 has (p-1) (q-1) d.f

Under Ho, the indepdendence of attributes, the expected frequency, Eij = ith row total x jith column N = Ri x Cj n Decision : The calculated value compared with tabulated value of X2 for (P-1) (Q-1) d.f. & prefixed level of significance α. Calculation X2 > reject Ho, if Calculation < X2 – tab – accept Ho. CONTIGENCY TABLE OF ORDER 2X2 DIRECT FORMULAR FOR 2x2 = n(ad-bc)2 (a+b) (c+d) (a+c) (b+d) B1 A1 A2 A C (cell) a+c Calculation X2>X2α1, reject Ho Calculation X2<x2α1, accept Ho B2 B D B+d a+b c+d a+b+c+d = n X2 has 1 d.f.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

11

Statistical Techniques

3.2 REGRESSION? Regression analysis is a method for determining the relationship between two variables. The regression statistical skeleton is at the core of observed social and political science research. Regression analysis works as a statistical substitute for controlled experiments, and can be used to make causal inferences.

3.2.1 REGRESSION MODELS Researchers render verbal theories, hypothesis, even intuition into models. A model illustrates how and under what circumstances two (or more) variables are linked. A regression model with a dependent variable and one independent variable is known as a bi-variate regression model. A regression model with a dependent variable and two or more independent variables and/or control variables is known as a multivariate regression model. Example: The dataset "Televisions, Doctors, and Smokers" contains, among other variables, the number of smokers per television set and the number of smokers per physician for 50 countries.

3.2.2 SCATTER-PLOTS The X axis normally depicts the values of the independent variable, while the Y axis represents the value of the dependent variable. Scatter-plots allow you to study the flow of the dots, or the relationship between the two variables Scatter-plots allow political scientists to identify : • • Positive or negative relationships Monotonic or linear relationships

3.2.3 REGRESSION EQUATION The linear equation is specified as follows: Y = a + bX Where Y = dependent variable X = independent variable a = constant (value of Y when X = 0) b = is the slope of the regression line “a” can be positive or negative. Referred to “a” as the intercept, “a” is the point at which the slope line passes through the Y axis. “b” (the slope coefficient) can be positive or negative. A positive coefficient denotes a positive relationship and a negative coefficient denotes a negative relationship. The significant interpretation of the slope coefficient depends on the variables involved, how they are coded and the dimension of the variables. Larger coefficients may indicate a solid relationship, but not necessarily.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

12

Statistical Techniques
The goal of regression analysis is to find an equation which “best fits” the data.

In regression, the equation is found in such a manner such that its graph is a line that reduces the squared vertical distances amid the data points and the lines drawn. “d1” and “d2” illustrate the distances of observed data points from an approximate regression line. Regression analysis bring into play a mathematical equation that locates the single line that reduces the squared distances from the line. The standard regression equation is the same as the linear equation with one exception: the error factor. Y = α + βX + ε Where Y = dependent variable α = constant term β = slope or regression coefficient X = independent variable ε = error term This regression process is called ordinary least squares (OLS). α (the constant term) interpreted the same as earlier β (the regression coefficient) tells how much Y changes if X changes by one unit. The regression coefficient indicates the inclination and strength of the relationship between the two quantitative variables. The error (ε) denotes that observed data does not follow a tidy pattern that can be summarized with a straight line. A observation's score on Y can be split as the following two parts: α + βX is due to the independent variable ε is due to error Observed value = Predicted value (α + βX) + error (ε) The error is the difference between the predicted value of Y and the observed value of Y. This is known as the residual.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

13

Statistical Techniques

Example: Lets take an example to clarify what we theoretically know:

In above data on the scatterplot: Y (dependent variable) = telephone lines for 1,000 people X (independent variable) = Infant mortality We will utilize regression to look at the relationship connecting communication capacity (measured here as telephone lines per capita) and infant mortality. In this example, the intercept and regression coefficient are as follows: α (or constant) = 121 Means that when X (infant deaths) is 0 deaths, there are 121 phone lines per 1,000 population. β = -1.25 Means that when X (deaths) increases by 1, there is a predicted or estimated decrease of 1.25 phone lines.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

14

Statistical Techniques

3.2.4 REGRESSION INTERPRETATION These computations can be helpful because they allow us to make useful predictions about the data. For example, an increase from 1 to 10 mortalities per 1,000 live births is related with a drop of 119.75 – 108.5 = 11.25 telephone lines. Interpreting the meaning of a coefficient could be a bit fiddly. What does a coefficient of -1.25 mean? Well, it means a negative association between infant mortality and phone lines. It means for every extra infant death there is a reduction of 1.25 phone lines. This is useful info, however is there a gauge that tells us how good we do predicting the observed values? Yes, the measure is known as R-squared. 3.2.5 R SQUARRED As stated earlier, there are two components of the total deviation from the mean, which is calculated by the addition of squares (or total variance).The difference between the mean and the predicted value of Y, this is the explained part of the deviation, or (Regression Sum of Squares). The second component is the residual sum of squares (Residual Sum of Squares), which measures prediction errors. The is the unexplained part of the deviation. Total SS = Regression SS + Residual SS In other words, the total sum of squares is the sum of the regression sum of squares and the residual sum of squares. R2 = Regression SS/TSS The more variance the regression model explains, the higher the R2.

Term paper - Inferential Statistics (Rajarajan & Shoaib)

15

Statistical Techniques 4.0 BIBLIOGRAPHY 1. Inferential statistics Timeline: http://www.google.ae/search?q=inferential+statistics&hl=en&tbo=1&rls=com.microsoft:enus:IE-SearchBox&output=search&source=lnt&tbs=tl:1&sa=X&ei=HMsFTqyKInIrQej2LSmDA&ved=0CBEQpwUoAw&biw=1366&bih=596 [Online]. [Accessed: 23th June 2011].
2. Handbook of Injury and Violence Prevention By Lynda S. Doll, E. N. Haas Chapter 9.3 – Brief history of youth violence prevention efforts, pg#159

Term paper - Inferential Statistics (Rajarajan & Shoaib)

16