
Checking for Non-normality and Outliers in ANOVA and MANOVA
Briefing Paper and Tutorial
France Goulard
APSY 607

Abstract

This briefing paper and tutorial presents a short review of ANOVA and MANOVA and explores various ways to check for non-normality and outliers in both types of analyses. Definitions of non-normality and outliers will be introduced, along with how outliers affect normality in research, the importance of identifying outliers, and practices for testing for outliers. Furthermore, different methods of dealing with outliers, as well as different types of outlier approaches, will be discussed.

Checking for Non-normality and Outliers in ANOVA and MANOVA

This briefing paper and tutorial will discuss the importance of checking for outliers and non-normality in an analysis of variance (ANOVA) and a multivariate analysis of variance (MANOVA). An ANOVA tests the difference in means between two or more groups, while a MANOVA tests for the difference in two or more vectors of means; a MANOVA is simply an ANOVA with several dependent variables. This paper will now take a closer look at what normality and outliers are, as well as their roles and importance in an analysis.

Normality

Normality concerns how well the data are normally distributed, that is, how closely they follow the bell-shaped curve. The normal distribution is considered the most prominent probability distribution in statistics (Taylor & McGill, 2011). Several reasons for this are as follows (Taylor & McGill, 2011). First, the normal distribution is very tractable analytically: a large number of results involving this distribution can be derived in explicit form. Second, the normal distribution arises as the outcome of the Central Limit Theorem, which states that, with reasonable sample sizes, the sampling distribution of the sample mean is approximately normal, with the same mean as the population and standard deviation sigma over the square root of n (Taylor & McGill, 2011). This holds regardless of the shape of the original population distribution, and the approximation becomes increasingly accurate as the sample size increases. Finally, the bell shape of the normal distribution makes it a convenient choice for modeling the large variety of random variables encountered in practice. According to Stevens (2009), each of the individual variables must be normally distributed in order to follow a multivariate normal distribution. Please refer to figure 1 for an example of normally distributed data compared with non-normal data.

Data are normally distributed when the observation points follow the red normal-distribution line. Non-normality is evident when some of the points deviate from that line; any data points that do not follow the red line are possible outliers, as seen in figure 1b.

Figure 1: (a) normal distribution line; (b) possible outliers deviating from the line.
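The Central Limit Theorem result above can be checked empirically. The following is a minimal sketch (the population, sample size, and seed are arbitrary choices, not from the paper): many samples are drawn from a decidedly non-normal population, and the sample means are compared with the predicted mean and standard deviation.

```python
import numpy as np

# Draw many samples from a skewed, non-normal (exponential) population and
# check that the sample means behave as the Central Limit Theorem predicts:
# mean ~ mu and standard deviation ~ sigma / sqrt(n).
rng = np.random.default_rng(0)
mu = sigma = 1.0          # an exponential(1) population has mu = sigma = 1
n, reps = 50, 20_000      # sample size, and number of repeated samples

sample_means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print(round(sample_means.mean(), 2))       # close to mu = 1.0
print(round(sample_means.std(ddof=1), 2))  # close to sigma / sqrt(50), about 0.14
```

Even though the exponential population is strongly skewed, the distribution of the 20,000 sample means is nearly normal, illustrating why normality of the sampling distribution is so often a reasonable working assumption.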


In general, outliers are very influential data points. In a univariate analysis, an outlier is an extreme value on one variable. In a multivariate analysis, an outlier is an unusual combination of scores on two or more variables, and multivariate techniques are known to be very sensitive to such cases (Stevens, 2009). In both cases, outliers have the potential to distort statistical results. According to Stevens (2009), outliers are found in both univariate and multivariate situations; among both dichotomous and continuous variables; among both dependent and independent variables; and in both data and results of analyses. They are atypical, infrequent observations that diverge from the overall pattern and are unusual in size (big or small) compared with the other observed values. They are so far separated in value from the remainder of the group that they may come from a different population or result from an error in measurement.

Identification of outliers

According to Stevens (2009), outliers occur for four fundamental reasons: a data recording or entry error was made; a missing-value code was not specified in the computer syntax, so that missing values were read as real data; the outlier is not a member of the population from which the sample was intended to be drawn; or the subject was simply different from the rest. One can detect an outlier by visually examining the data at hand (Seaman & Allen, 2010), for example, a score of 9% on a test where the median score is 85%. Visual plots are helpful in identifying outliers, as some are otherwise hard to recognize. Four kinds of plots are commonly used: histograms, normal probability plots, scatter plots, and box plots. Figure 2 shows examples of each.

Figure 2: Types of visual plots to help identify outliers. Panels: a histogram against the normal distribution; normal probability plots without and with outliers; scatter plots showing (a) a shotgun scatter with low correlation, (b) strong positive correlation, (c) strong negative correlation, (d) low correlation, (e) low correlation, and (f) spurious high correlation because of the points shaded in gray; and a box plot.

Box Plots
The main features of a box plot include the outliers, or extreme values, which fall outside the range spanned by the whiskers.
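The idea behind the normal probability plot in figure 2 can be sketched numerically without any plotting library. The scores below are hypothetical, echoing the 9%-versus-median-85% example above:

```python
import statistics

# Hypothetical test scores with one suspect low value (the 9% score).
scores = sorted([82, 85, 79, 88, 91, 84, 86, 90, 83, 9])
n = len(scores)

# Pair each ordered observation with the quantile expected under normality,
# using plotting positions (i - 0.5) / n. Near-normal data fall close to a
# straight line; an outlier breaks away from it.
expected = [statistics.NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
pairs = list(zip(expected, scores))

print(pairs[0])  # the 9% score sits far below where the straight line would be
```

Plotting `expected` against `scores` reproduces the normal probability plot: the nine ordinary scores line up, while the pair for the 9% score falls far below the line they define.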

The shape of the distribution is an important aspect of the description of a variable, as it tells you the frequency of values over different ranges of the variable. Bivariate normality for correlated variables implies that the scatter plot for each pair of variables will be elliptical; the higher the correlation, the thinner the ellipse (Stevens, 2009). Figure 3 shows how the elliptical formation changes when an outlier is included.

Figure 3: Change in elliptical formation with the presence of an outlier (left: no outlier; right: outlier).

Stevens (2009) notes that an outlier due to a data recording or entry error can be identified by listing the data and checking that they have been read accurately. In figure 4, Stevens (2009) stresses the importance of using the median as a robust measure of central tendency when extreme values such as the one shown are present, because the median is unaffected by outliers.

Figure 4: An outlier visible both in the data set and on the scatter plot.


In the data set, looking at subject number 6, it is visible that the x and y values are markedly different from those of the rest of the subjects. This means that subject 6 is an outlier. The scatter plot on the right-hand side shows the outlier very clearly: it sits far away from the other points, in the top right corner.

Importance of Identifying Outliers

Outliers are important to identify because they can wrongly inflate the value of a correlation coefficient or deflate the value of a proper correlation (Stevens, 2009). This can lead to Type I errors (false positives) and Type II errors (false negatives). Furthermore, excluding an outlier could drastically change the interpretation of the results.

The effect of outliers on normality in research

Outliers are problematic because just one or two errant data points can strongly influence the results (Stevens, 2009). Researchers want the results of a statistical analysis to reflect most of the data and, ideally, to represent the whole data set. Statistical procedures are sensitive to outliers, and there is a risk that outliers may have a profound influence on a researcher's results (Stevens, 2009).

Types of practices for testing outliers

A number of tests exist to help detect outliers, and the choice depends on whether or not the data are grouped. With ungrouped data, univariate and multivariate outliers are sought among all cases at once; examples of ungrouped analyses are regression, canonical correlation, and factor analysis. With grouped data, outliers are sought separately within each group; examples of grouped analyses are ANOVA and MANOVA. The following are some of the recommended tests for best practice when testing for outliers.
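Before turning to the formal tests, the distortion described above, and the median's robustness noted for figure 4, can be illustrated with a few hypothetical scores:

```python
import statistics

# Five ordinary scores, then the same scores with one extreme value added.
scores = [82, 85, 86, 88, 90]
scores_with_outlier = scores + [9]

# The outlier drags the mean sharply downward but barely moves the median.
print(statistics.mean(scores), statistics.median(scores))
print(statistics.mean(scores_with_outlier), statistics.median(scores_with_outlier))
```

A single 9 among scores in the 80s pulls the mean down by more than twelve points, while the median shifts by only half a point, which is exactly why the median is the recommended robust summary in the presence of extreme values.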

Univariate Tests

Stevens (2009) identifies the following methods as useful for assessing univariate normality:


i. Normal Probability Plot: Observations are arranged in increasing order of magnitude and plotted against their expected normal-distribution values. The plot should resemble a straight line if normality is tenable; outliers stand out from the line formation.
ii. Histogram or Stem-and-Leaf Plot: Examining the variable in each group gives an indication of whether normality might be violated. With small or moderate sample sizes it is difficult to tell whether apparent non-normality is real, because of considerable sampling error.
iii. Chi-square Goodness of Fit: The chi-square value depends on the number of intervals used for grouping.
iv. Kolmogorov-Smirnov: This test is not as powerful as the Shapiro-Wilk test or the skewness and kurtosis coefficients.
v. Shapiro-Wilk Test with Skewness and Kurtosis Coefficients: This combination is the most powerful at detecting departures from normality.

Multivariate Tests

Tabachnick and Fidell (2007) identify the following methods as useful for determining multivariate outliers:
i. Mahalanobis Distance (hat elements): This is used to identify influential data points, or outliers, on the predictors. It is the distance of a case from the centroid of the remaining cases, where the centroid is the point created at the intersection of the means of all variables. Such outliers will not necessarily be influential (Stevens, 2009).
ii. Leverage: Leverage is directly related to Mahalanobis distance (hat elements = leverage).
iii. Discrepancy: The extent to which a case is in line with the others.
iv. Influence: The product of leverage and discrepancy.


v. Cook's Distance: A measure of the change in the regression coefficients that would occur if the case were omitted, revealing which cases are most influential in affecting the regression equation (Stevens, 2009). It is useful for identifying the combined influence of a case being an outlier on y and on the set of predictors. A value of 1 would be considered large and would warrant further investigation of that case.
vi. Weisberg Test: This test detects y outliers.
vii. DFFITS: Indicates how much the fitted value will change if the observation is deleted.
viii. DFBETAS: Indicates how much each regression coefficient will change if the observation is deleted.

Range Tests

There are three main methods for identifying outliers using the range (Jubal, 2001).
i. Upper and lower quartile values: The upper quartile (UQ) is the value that 75% of the data set is equal to or less than; the lower quartile (LQ) is the value that 25% of the data set is equal to or less than. The interquartile range is the difference between the two (IQR = UQ - LQ). Statistically, outliers are defined as those data points that are at least 1.5 IQR greater than the upper quartile or 1.5 IQR less than the lower quartile.
ii. Z-test: Among continuous variables, univariate outliers are cases with very large standardized scores (z scores) on one or more variables that are disconnected from the other z scores (Tabachnick & Fidell, 2007). The mean and standard deviation of the data set are calculated, and anything that falls more than three standard deviations from the mean is identified as an outlier. That is, x is an outlier if

    |x - mean| / s > 3


About 99% of the scores should lie within three standard deviations of the mean (Stevens, 2009); therefore, any z value greater than 3 indicates a value very unlikely to occur.
iii. Q-test: The Q-test compares how far out the possible outlier lies relative to the total range of the data. To perform a Q-test, first find the ratio

    Q = |x_a - x_b| / R

where x_a is the possible outlier, x_b is the data point closest to it, and R is the total range of the data set. If Q is greater than a certain critical value Q_crit (which depends on the number of data points and on how sure you want to be before rejecting x_a as an outlier), then x_a is an outlier. Jubal (2001) cautions that it is possible to reject almost the entire data set by applying the Q-test several times in succession, so never apply it more than once.

F.E. Grubbs' Parametric Tests

Grubbs' test statistic can be given as follows, in which x_i denotes an individual data point, x-bar the sample mean, s the sample standard deviation, and n the sample size (Seaman & Allen, 2010):

    G = max |x_i - x-bar| / s

which finds outliers at either extreme.
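Several of the statistics from this section can be computed directly. Below is a hedged sketch on made-up measurements; the critical values needed to declare a point an outlier come from published tables and are not reproduced here.

```python
import numpy as np

# Six made-up measurements with one suspiciously high value.
data = np.array([10.1, 10.2, 10.3, 10.2, 10.4, 12.9])

# z-score rule: flag |x - mean| / s > 3.
z = np.abs(data - data.mean()) / data.std(ddof=1)

# 1.5 * IQR rule: flag points beyond the quartile fences.
lq, uq = np.percentile(data, [25, 75])
iqr = uq - lq
iqr_flags = (data < lq - 1.5 * iqr) | (data > uq + 1.5 * iqr)

# Grubbs' statistic: the largest deviation from the mean, in SD units.
g = z.max()

# Dixon's Q for the high end: gap to the nearest neighbour over the range.
x = np.sort(data)
q = (x[-1] - x[-2]) / (x[-1] - x[0])

print(round(g, 2), round(q, 2))   # compare against tabled critical values
print(iqr_flags)                  # only the 12.9 lies outside the fences
```

Note that the z rule does not flag the 12.9 here (its z is only about 2, not 3): with so few points, the outlier inflates the standard deviation used to judge it, which is one reason several different methods are listed.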

Other tests that might be useful include Dixon's Q test, which is similar to G2 for a small number of observations (between 3 and 25), and Rosner's test, a generalization of Grubbs' test that detects up to k outliers when the sample size is 25 or more (Seaman & Allen, 2010).

Chauvenet's Criterion


Chauvenet's criterion is a means of assessing whether one piece of experimental data, an outlier, from a set of observations is likely to be spurious. To apply the criterion, first calculate the mean and standard deviation of the observed data. Then use the normal distribution to determine the probability that a given data point would differ from the mean by at least as much as the suspect data point does. Multiply this probability by the number of data points taken; if the result is less than 0.5, the suspicious data point may be discarded.

Peirce's Criterion

Peirce's criterion is derived from a statistical analysis of the Gaussian distribution. Unlike some other criteria for removing outliers, Peirce's method can be applied to identify two or more outliers: "It is proposed to determine in a series of m observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as n such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations" (Peirce, 1878).

Dealing with Outliers


Outliers can be misleading for various reasons: limited measurement precision; results that lie many standard deviations away from the mean of the remaining results; successive outlying points produced by a genuine long tail on the distribution; and the risk of an outlier being identified by chance when there is very little data. Upon discovering a suspected outlier, therefore, the initial temptation is to eliminate the point from the data and simplify the analysis to make the results easier to explain (Seaman & Allen, 2010). On the basis of some simple assumptions, outlier tests tell you where you are most likely to have a technical error, but they do not tell you that the point is wrong (Seaman & Allen, 2010). It is also not recommended that identified outliers simply be dropped: no matter how extreme a data point is, it could be a correct piece of information (Stevens, 2009). With that said, Stevens (2009) advises the following for dealing with outliers: if an outlier is due to a recording or entry error, correct the data value and redo the analysis; if the outlier is due to an instrumentation error or process error, it is legitimate to drop it; and if neither is the case, do not drop the outlier, but report two analyses, one that includes the outlier and one that does not.

Different Types of Outlier Approach

An outlier may be the result of an error in measurement or data entry, in which case it could distort the interpretation of the data. Therefore, once an outlier is identified, it may be necessary to investigate the analysis and fix any errors that occurred. Identified data points should only be removed if a technical reason can be found for the unusual behaviour (Ostle, 1988). It is imperative that outliers be examined thoroughly and


carefully before starting any formal analysis. Dropping an outlier without good reason is not recommended and should not be practiced. According to Seaman and Allen (2010), removing an outlier may miss intricacies of the data, have large effects on any analysis of the data, and lead to serious biases. In the case where more than 20% of the data are identified as outliers, the researcher should start questioning the assumed data distribution and the quality of the collected data (Timm, 1975). Another approach is to report two different analyses, one with the outlier and one without, allowing readers to decide for themselves which analysis to use. The only downside of this approach is that reporting both analyses can be very time consuming.

Discussion

In conclusion, checking normality shows how well the data are normally distributed and how closely they follow the bell-shaped curve. Outliers should not be regarded as bad, as they can provide interesting cases for future study (Stevens, 2009). Testing for outliers is a necessary part of data analysis and must be conducted with care and caution. If an outlier is a genuine result, it should not be disregarded or dropped, as it can bring important value to the study at hand and can help in discovering why certain results are more extreme than others.

Links for Checking Non-normality and Outliers in ANOVA and MANOVA

Assessing Classical Test Assumptions
http://www.statmethods.net/stats/anovaAssumptions.html
Missing Values, Outliers, Robust Statistics, & Non-parametric Methods
http://www.lcgceurope.com/lcgceurope/data/articlestandard/lcgceurope/502001/4509/article.pdf
Multivariate Analysis of Variance (MANOVA)


http://www.stat.psu.edu/~ajw13/stat505/fa06/12_1wMANOVA/05_1wMANOVA_ex.html
Numbers: Numerical methods for biosciences students
http://web.anglia.ac.uk/numbers/graphsCharts.html
Quality Progress
http://www.asq.org/quality-progress/2010/02/statistics-roundtable/outlier-options.html
StatSoft Electronic Statistics Textbook
http://www.statsoft.com/textbook/basic-statistics/
Stat Trek: Teach yourself statistics
http://stattrek.com/AP-Statistics-1/Residual.aspx
The Math Doctor
http://mathforum.org/dr.math/

References

Taylor, H.J., & McGill, J.I. (2011). Analysis based decision making: Executive MBA programs school of business. Kingston, Ontario: Queen's School of Business.
Jubal, Dr. (2001, June). Using the range to find outliers. Retrieved from http://mathforum.org/library/drmath/view/52720.html


Ostle, B., & Malone, L. C. (1988). Statistics in research: Basic concepts and techniques for research workers (4th ed.). Ames, IA: Iowa State Press.
Peirce, B. (1878). On Peirce's criterion. Proceedings of the American Academy of Arts and Sciences, 13, 348-351. doi: 10.2307/2513498
Seaman, J.E., & Allen, E.I. (2010, June). Consider simple parametric tests to find an outlier's significance. Quality Progress. Retrieved from http://www.asq.org/qualityprogress/2010/02/statistics-roundtable/outlier-options.html
Stevens, J.P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge, Taylor & Francis Group.
StatSoft Electronic Statistics Textbook. (2011). Basic statistics. Retrieved from http://www.statsoft.com/textbook/basic-statistics/
Tabachnick, B.G., & Fidell, L.S. (2007). Cleaning up your act: Screening data prior to analysis (5th ed.). NY: Routledge.
Timm, N. H. (1975). Multivariate analysis with applications in education and psychology. Monterey, CA: Brooks/Cole.
