JIGSAW ACADEMY
Analytics for Professionals

BASIC STATISTICAL CONCEPTS

INFERENTIAL STATISTICS
- We have reviewed many types of descriptive statistics: mean, standard deviation, range, etc. These help summarize or describe information about a particular dataset or sample.
- Inferential statistics are used to make inferences about the population from the sample data available.
- Sample vs Population
  * Population: all possible data for a situation or case
  * Sample: a subset of the population, usually all that is available
- Parameters and Statistics
  * Summary measures of a population are called parameters, whereas the corresponding measures for samples are always referred to as statistics

HYPOTHESIS TESTING
- Let's take the airline example. As the sales manager, you believe that the average no-show rate is 5%. You base this on average no-shows for every flight in the last 5 months.
- Your GM wants to check for the next 10 randomly chosen flights. No-shows (out of 100) are: 3, 3, 4, 7, 5, 4, 5, 3, 1, 2
- What would you conclude? The average for these 10 flights is 3.7%. This is different from 5%, but can you conclude that it is so different that you need to disregard the 5%?

HYPOTHESIS TESTING
Where does statistics come into the picture here?
- The % of no-shows over all flights for this airline is a random variable
- We are now working with a sample and trying to determine if the sample shows the same characteristics as the population
- We determine a sample statistic (in this case, the mean)
- We are now trying to determine the chances of seeing a sample mean of 3.7% if the population mean is actually 5%
  * If the chances are high, we cannot reject the hypothesis that the population mean is 5%
  * If the chances are low, we are more confident about rejecting the hypothesis that the population mean is 5%

CENTRAL LIMIT THEOREM
- We will introduce one important concept here that will aid us in calculating the probability or chance: the Central Limit Theorem
- Similar to a probability distribution, sampling distributions are probability distributions of a particular statistic
  * If we take 10-day observations 30 times, we are creating how many samples?
  * The distribution of the mean of each 10-day set is a sampling distribution of the mean (approximately normal for large samples, even if the underlying distribution is not normal)

CENTRAL LIMIT THEOREM
- When we select simple random samples of size n, the sample means we find will vary from sample to sample. We can model the distribution of these sample means with a probability model that is $N(\mu, \sigma/\sqrt{n})$
- For the purpose of applying the central limit theorem, we will consider a sample size to be large when n > 30.
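The behaviour the Central Limit Theorem describes can be checked with a small simulation. The sketch below is illustrative only: the skewed population (an exponential distribution with mean 5), the sample size of 40, and the number of repetitions are all assumptions chosen for the demonstration, not values from the slides.

```python
import random
import statistics


def sample_means(population_draw, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(population_draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]


def draw(rng):
    # Hypothetical, clearly non-normal population: exponential with mean 5
    # (for an exponential distribution the SD equals the mean, so SD = 5).
    return rng.expovariate(1 / 5)


means = sample_means(draw, n=40, num_samples=5000)
center = statistics.fmean(means)   # should sit near the population mean, 5
spread = statistics.stdev(means)   # should sit near sigma / sqrt(n) = 5 / sqrt(40)

print(f"mean of sample means: {center:.2f}")
print(f"SD of sample means:   {spread:.2f}")
print(f"sigma / sqrt(n):      {5 / 40 ** 0.5:.2f}")
```

Even though individual draws are strongly skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n), which is what lets us use the normal model for sample means.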
HYPOTHESIS TESTING
- Given the CLT, if the sampling distribution of means is approximately normal, we already know how to calculate the probability of any certain value under a normal distribution
- So, if we have a sample mean, we can calculate the probability of obtaining that value given our knowledge of the true population mean
[Figure: The Normal Distribution]

HYPOTHESIS TESTING
Basic set up of a hypothesis test:
- Null Hypothesis (H0): example: population mean = 5%
- Alternative Hypothesis (Ha): example: population mean < 5%
- Test Statistic: a function of the random sample, usually measured as standardized distance from the mean
- Test Distribution: depending on sample size, a t-distribution, a standard normal distribution, or another appropriate distribution
- Significance Level: criterion used for rejecting the null hypothesis
- P-Value: the probability of an outcome at least as extreme as the one observed, assuming the null is true
- Rejection Region: values of the test statistic that are unlikely if the null is true, associated with the probability distribution

HYPOTHESIS TESTING
IQ Testing:
- We are testing the IQs of a group of people (say Jigsaw students), and we find that the average IQ is 121 for a random sample of students (say the Nov batch).
- In general, IQs for the population are standardized such that mean = 100 and SD = 15.
- Would you be able to state with certainty that this group of people (Jigsaw) is more intelligent than average?
- Two options for the Null (and therefore the Alt):
  1. Null: This group is no different from the average
     Alt: This group is different from the average
  2. Null: This group's IQ is not greater than the average
     Alt: This group's IQ is greater than the average

HYPOTHESIS TESTING
- Essentially, what we need to find out is: what is the probability that purely by random chance we picked a sample that gave us an average of 121 when the actual population average is 100?
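A minimal sketch of how this probability could be computed, using Python's standard-library `statistics.NormalDist`. It assumes, as the slides do, that the observed value 121 is standardized directly against the population SD of 15; if the sample size n were given, the standard error 15 / sqrt(n) would be the more usual denominator. Variable names are illustrative.

```python
from statistics import NormalDist

population_mean = 100   # population IQ mean, from the slide
population_sd = 15      # population IQ SD, from the slide
observed = 121          # average IQ reported for the sample

# Standardized distance of the observed value from the population mean.
# Assumption: the value is treated as one draw from N(100, 15).
z = (observed - population_mean) / population_sd

# Upper-tail probability P(Z >= z) under the standard normal distribution.
p_upper = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(Z >= z) = {p_upper:.4f}")
```

The same upper-tail probability is what a printed normal table gives once you subtract the tabulated "probability content from -infinity to Z" from 1.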
- We know that IQ is distributed normally
- We can look up the probability of any outcome using a probability table

HYPOTHESIS TESTING
- In order to read the probability of a particular outcome from the table, we will need to standardize the distance of this outcome
- In this case: X = 121, Mean = 100, SD = 15, so Z = (121 - 100)/15 = 1.4
- So, we look at the table to determine the probability of a Z score of 1.4

HYPOTHESIS TESTING
[Table: Tables of the Normal Distribution, probability content from -infinity to Z]

HYPOTHESIS TESTING
- So, the Z score of 1.4 implies that the probability of seeing a Z score of 1.4 or higher, if the sample truly came from a population with a mean of 100, is about 0.08
- If the test statistic is beyond the critical value, then you would reject the null
- For this example, suppose I decided that I want to be extra sure of my conclusion, so I want to reject the null only if the probability of this outcome is less than .01. What is my critical value?

HYPOTHESIS TESTING
- An insurance company is reviewing its current policy rates. When originally setting the rates, they believed that the average claim amount was $1,800. They are concerned that the true mean is actually higher than this, because they could potentially lose a lot of money.
- They randomly select 40 claims and calculate a sample mean of $1,950.
- Assuming that the population standard deviation of claims was $250, and setting alpha = .05, test to see if the insurance company should be concerned.
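One way to work this problem numerically is sketched below. It is an illustration under stated assumptions rather than the deck's own solution: the test is taken to be upper-tailed (the concern is that the true mean is higher), and the standard error is computed as the population SD divided by sqrt(n). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 1800      # average claim assumed under the null hypothesis
x_bar = 1950    # sample mean of the randomly selected claims
sigma = 250     # population standard deviation, as stated in the problem
n = 40          # number of claims sampled
alpha = 0.05    # significance level

# Standard error of the sample mean, then the z test statistic.
standard_error = sigma / sqrt(n)
z = (x_bar - mu0) / standard_error

# Upper-tail critical value and p-value for H1: mean > 1800.
z_critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)
reject_null = z > z_critical

print(f"z = {z:.2f}, critical value = {z_critical:.3f}")
print(f"p-value = {p_value:.5f}, reject null: {reject_null}")
```

Comparing `z` with `z_critical` and comparing `p_value` with `alpha` are two views of the same decision; either one is enough to reach the conclusion.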
- Null: Avg claim <= $1,800
- Alt: Avg claim > $1,800
- Test Statistic (Z score) = (1950 - 1800)/(250/sqrt(40)) = 3.79
- Significance level: 0.05
- Critical value?
- Conclusion?

HYPOTHESIS TESTING
- In many cases, the population std deviation may be unavailable. In that case, we use the sample std deviation divided by the square root of the sample size (the std error)
- The std error estimates the std deviation of the sampling distribution of the mean
- As sample size increases, the std error will?
- In the previous example, suppose the population std deviation of the claim amount was unknown, but the sample std deviation was $500:
  $z = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{1950 - 1800}{500/\sqrt{40}} = 1.90$
- Significance level: 0.05
- Critical value = 1.65
- Conclusion?

HYPOTHESIS TESTING
T-TESTS
- If sample size is < 30, the sample mean may not be approximately normally distributed. In real life there are many instances of sample sizes being less than 30, so in order to compute the probability of an observed outcome when sample size < 30, use the t-distribution, or a t-test
- What is a t-distribution?
  Suppose we have a simple random sample of size n drawn from a Normal population with mean $\mu$ and standard deviation $\sigma$. Let $\bar{X}$ denote the sample mean and $s$ the sample standard deviation. Then the quantity
  $t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
  has a t-distribution with n - 1 degrees of freedom
- As sample size increases and approaches 30, the t-distribution approximates a normal distribution

HYPOTHESIS TESTING
SIGNIFICANCE LEVEL & CRITICAL VALUES
- Remember, you can never be 100% sure of an outcome. You use significance levels (confidence levels) to set a baseline cut-off beyond which you reject the null. That is, if the probability of seeing this outcome, or one more extreme, falls below the cut-off, you reject the null
- The level is subjective: it depends on the business problem and how sure you want to be
- If you decide your cut-off p-value is .05, you are using a significance level of 5%
- Identify the test statistic associated with a p-value of 0.05; that is your critical value of the test statistic
- Once you calculate your sample test statistic, will you reject the null if the sample statistic is higher or lower than the critical value?
[Figure: value of the test statistic, with the region where H0 is rejected]

HYPOTHESIS TESTING
T-TESTS
- If sample size is < 30, the sample mean may not be approximately normally distributed.
- In that case (or if you do not have the population std deviation), use a t-test

HYPOTHESIS TESTING
T-TESTS
- Suppose you want to test if college students sleep a lot less than the general population
- Average sleep for the population is (supposed) to be 8 hours. Say you take a random sample of 10 college students and you ask them how many hours they sleep
[Table: hours of sleep reported by students 1 to 10]
- Based on this data, can you conclude that students sleep less than the general population?
- Null?
- Alternate?
- Sample size here is 10, and the population std deviation is not known
- Sample mean = 6.64; sample std dev = 1.00
- Test stat = (sample mean - 8)/(sample std dev/10^0.5) = -3.9
- Critical value for a t-distribution with 9 df at the 5% significance level (95% confidence)? Let's check in the next slide

HYPOTHESIS TESTING
- Now you have a test statistic, and have decided on a significance level (say alpha = 0.05)
- The next step is to assess the probability associated with the calculated test statistic. How would you do that?
- Using a probability distribution table for the type of distribution chosen

HYPOTHESIS TESTING
- The t-distribution probability table has two value options in the column label: one tail vs two tails
- If your hypothesis test is set up to test:
  * Mean not equal to zero
  * Weight loss not equal to 6 kgs
  you are essentially saying that the mean could be greater or less than zero, the medicine could lead to weight loss higher or lower than 6 kgs, and so on.
- This implies that you are looking at the probability of events that are on both sides of the mean
- However, if your hypothesis test is set up to test:
  * Mean < [...]
  * Defects less than 3 per 1000
  you are essentially calculating the probability of an event on only one side of the mean (either higher or lower)

HYPOTHESIS TESTING
- In our sleep example, what kind of tail are we testing?
- One tail, because our alternate hypothesis is that sleep time < 8 hrs

HYPOTHESIS TESTING
[Table: t-distribution critical values by degrees of freedom]

HYPOTHESIS TESTING
- What about the negative value of the test statistic? Because the t-distribution is symmetrical, p(t > 3.9) = p(t < -3.9)
- So, if the absolute value of the calculated test statistic is greater than the critical value for the pre-determined significance level, you reject the null
- If it is not greater, you fail to reject the null (never accept the null!)
- Finally, if you are using a two-tailed test and you want a significance level of 0.05, then your critical values will correspond to 0.025 on both sides of the distribution

HYPOTHESIS TESTING
ONE TAILED VS TWO TAILED TESTS
- If you are testing X > A or X < [...] > 6 minutes
- What about the test statistic?
  Test stat = (Sample Mean - Population Mean)/(Sample Standard Deviation/sqrt(n))

ONE SAMPLE T-TESTS
What distribution table should we use to look up probability values?
- We are assuming that call times are normally distributed, so should we use a normal distribution table?
- Because the sample size here is less than 30, we should actually use the t-distribution
- There are many different t-distributions that depend on the degrees of freedom (df = sample size - 1)
- So the next step is to look up the t-distribution table for a particular degree of freedom

ONE SAMPLE T-TESTS
For the example we are reviewing:
- Sample Mean = 7.3
- Population Mean = 6
- DF = 9
- Type of test: One Tailed (WHY?)

T-DISTRIBUTION TABLE
[Table: t-distribution critical values by degrees of freedom and significance level]

ONE SAMPLE T-TESTS
- So, the critical region for rejecting the null hypothesis is t > 1.83 for an upper-tailed test (by symmetry, t < -1.83 for a lower-tailed test)
- The statistic we have calculated is 1.71, that is (7.3 - 6)/(2.41/sqrt(10)), taking the std deviation of 2.41 listed for this group in the two-sample example
- Therefore we cannot reject the null hypothesis at the 5% significance level (95% confidence)
- What will you report back?
  * That, at the 5% significance level, the data do not show that the average call time differs from 6 minutes

TWO-SAMPLE T-TESTS
- The two-sample t-test is used when means across two groups (or samples) are compared
- Extend the previous example: average call times are recorded before the project is started, and average call times are recorded after the project is implemented
- Can we use the data to show a significant improvement (that is, a reduction) in average call time post project implementation?

TWO-SAMPLE T-TESTS
The test statistic is:
- Assuming equal variance between the two samples:
  $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
- If variances are unequal:
  $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

TWO-SAMPLE T-TESTS
If we calculate the test stat:
1. Mean of group 1 = 8.75
2. Mean of group 2 = 7.3
3. Std Deviation Group 1 = 2.16
4. Std Deviation Group 2 = 2.41
5. DF = 10 + 10 - 2 = 18
6. T-stat = 1.41, assuming equal variance
- Critical value for the t-distribution at 18 df for the 5% significance level (95% confidence), two-tailed: 2.10

TWO-SAMPLE T-TESTS
- In the previous example, we compared two samples that had equal numbers of observations
- If the numbers of observations are unequal, we can still use the t-test, but the degrees of freedom used should be 1 less than the smaller sample size
- We can also assume equal variance across the two samples, in which case the t-stat is simplified

PAIRED DIFFERENCE T-TESTS
- In the previous example, we looked at comparisons of average call time for a random 10 calls before the project was implemented and a random 10 calls after the solution was designed
- In some cases, we may want to test observations that are paired, to see if there is a true difference in their means before and after the experiment
- For example, let's say we are testing the efficacy of a particular drug that claims it will help patients lose weight
- We record average weight for 8 respondents before and after they take the drug for 20 weeks

Respondent | Before | After
1 | 162 | 168
2 | 170 | 158
3 | 184 | 186
4 | 164 | 155
5 | 172 | 183
6 | 176 | 164
7 | 759 | 760
8 | 470 | 135

PAIRED DIFFERENCE T-TESTS
- Test statistic is:
  $t = \frac{\bar{d}}{s_d/\sqrt{n}}$
  where $d$ is the difference in scores, $\bar{d}$ is the mean of the differences and $s_d$ is their std deviation
- We need to calculate the mean and standard error of the pre-post differences for each pair and then use that for the test statistic
- Let's say our hypothesis is that the drug does have a positive impact on weight loss, and we would like to use a 95% confidence level. How would you test the hypothesis?

PAIRED DIFFERENCE T-TESTS
If we calculate the test statistic:
1. Mean of differences = -11.37
2. Std deviation of differences = 14.74
   Therefore, standard error = std deviation/sqrt(n) = 14.74/sqrt(8) = 5.21
3. T-statistic = -11.37/5.21 = -2.18
- Critical value of the t-distribution for 7 degrees of freedom at the 5% one-tailed significance level (95% confidence) is -1.89
- Critical region: T-stat < -1.89
- So do we reject or fail to reject the null?
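The two-sample and paired calculations above can be reproduced from the summary statistics alone. The sketch below is illustrative: the function names are made up for this example, the equal-variance (pooled) form is assumed for the two-sample case, and the critical values are hard-coded from a t table because the Python standard library has no t-distribution.

```python
from math import sqrt


def pooled_t(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


def paired_t(mean_diff, sd_diff, n):
    """Paired-difference t statistic; returns (t, df)."""
    standard_error = sd_diff / sqrt(n)
    return mean_diff / standard_error, n - 1


# Call-time example: group means, std deviations and sizes from the slides.
t_two, df_two = pooled_t(8.75, 7.3, 2.16, 2.41, 10, 10)

# Weight-loss example: mean and std deviation of the 8 paired differences.
t_paired, df_paired = paired_t(-11.37, 14.74, 8)

# Critical values taken from a t table (alpha = 0.05), not computed here.
T_CRIT_TWO_TAILED_18DF = 2.10   # two-tailed, 18 df
T_CRIT_ONE_TAILED_7DF = 1.89    # one-tailed, 7 df

reject_two_sample = abs(t_two) > T_CRIT_TWO_TAILED_18DF
reject_paired = t_paired < -T_CRIT_ONE_TAILED_7DF

print(f"two-sample: t = {t_two:.2f}, df = {df_two}, reject = {reject_two_sample}")
print(f"paired:     t = {t_paired:.2f}, df = {df_paired}, reject = {reject_paired}")
```

Working from the rounded summary statistics gives a two-sample statistic that can differ from the slide's 1.41 in the second decimal place; the decision against the critical value is unaffected.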
T-TESTS IN EXCEL
- We can run all the discussed t-tests in Excel using the Data Analysis add-in (Tools menu)
- For example, for the paired t-test example that we just reviewed, the output will be:
[Screenshot: Excel paired t-test output, including Pearson Correlation, Hypothesized Mean Difference, t Critical one-tail and t Critical two-tail]
- The critical values to look at are the t-critical values for one-tailed and two-tailed tests, and the p-values

T-TESTS IN SAS
- Use PROC TTEST
[Screenshot: sample output]

HYPOTHESIS TESTING
1. Used to test a hypothesis about a population using a sample statistic
   * If I have a sample with a mean of 10, what is the probability that it could have come from a population that has a mean of 15 and a std deviation of 2.5?
2. We can never be 100% confident of an outcome when using a hypothesis test set up, because there always exists some non-zero probability of a possible outcome (however extreme it is) occurring purely by chance
3. If you are testing a hypothesis about the mean of a population when the population standard deviation is known, then the standardized test statistic will be:
   $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

HYPOTHESIS TESTING
ONE SAMPLE MEANS

Sample size | Population Std Deviation is Known | Population Std Deviation is Unknown
Sample Size > 30 | Normal Distribution, Z table | Normal Distribution, Z table
Sample Size < 30 (but assume pop is approx normally distributed) | Normal Distribution, Z table | t-Distribution, t table

HYPOTHESIS TESTING
ONE SAMPLE PROPORTIONS
- When dealing with proportions, the binomial distribution is the correct distribution to use
- However, as sample size increases, the binomial distribution approaches normal, and so we can use the normal to approximate the binomial as long as np and nq are each at least some minimum (common rules of thumb use 5 or 10)
- Test statistic for proportions is:
  $z = \frac{\hat{p} - p}{\sqrt{pq/n}}$

HYPOTHESIS TESTING
TWO SAMPLE MEANS
- If sample size is large, use:
  $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
- If sample size is small, use the t-distribution. Test stat is:
  $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}}$
  with n1 + n2 - 2 df

HYPOTHESIS TESTING
TWO SAMPLE PROPORTIONS
  $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\,(1/n_1 + 1/n_2)}}$, where $\hat{p}$ is the pooled sample proportion and $\hat{q} = 1 - \hat{p}$

JIGSAW ACADEMY
Analytics for Professionals
www.jigsawacademy.com
