P. 1
How to Calculate Outliers

How to Calculate Outliers

|Views: 1,035|Likes:

More info:

Published by: Celina Pae J. Borillo on Apr 01, 2012
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as DOCX, PDF, TXT or read online from Scribd
See more
See less

12/11/2012

pdf

text

original

How to Calculate Outliers

Calculate Outliers
An outlier is a value in a data set that is far from the other values. Outliers can be caused by experimental or measurement errors, or by a long-tailed population. In the former cases, it can be desirable to identify outliers and remove them from data before performing a statistical analysis, because they can throw off the results so that they do not accurately represent the sample population. The simplest way to identify outliers is with the quartile method

1
Sort the data in ascending order. For example take the data set {4, 5, 2, 3, 15, 3, 3, 5}. Sorted, the example data set is {2, 3, 3, 3, 4, 5, 5, 15}.

2
Find the median. This the number at which half the data points are larger and half are smaller. If there are an even number of data points, the middle two are averaged. For the example data set, the middle points are 3 and 4, so the median is (3 + 4) / 2 = 3.5.

3
Find the upper quartile, Q2; this is the data point at which 25 percent of the data are larger. If the data set is even, average the 2 points around the quartile. For the example data set, this is (5 + 5) / 2 = 5.

4
Find the lower quartile, Q1; this the data point at which 25 percent of the data are smaller. If the data set is even, average the 2 points around the quartile. For the example data, (3 + 3) / 2 = 3.

5
Subtract the lower quartile from the higher quartile to get the interquartile range, IQ. For the example data set, Q2 -- Q1 = 5 -- 3 = 2.

6
Multiply the interquartile range by 1.5. Add this to the upper quartile and subtract it from the lower quartile. Any data point outside these values is a mild outlier. For the example set, 1.5 x 2 = 3. 3 -- 3 = 0 and 5 + 3 = 8. So any value less than 0 or greater than 8 would be a mild outlier. This means that 15 qualifies as a mild outlier.

7
Multiply the interquartile range by 3. Add this to the upper quartile and subtract it from the lower quartile. Any data point outside these values is an extreme outlier. For the example set, 3 x 2 = 6. 3 -- 6 = --3 and 5 + 6 = 11. So any value less than --3 or greater than 11 would be a extreme outlier. This means that 15 qualifies as an extreme outlier.

Tips & Warnings
 
Extreme outliers are more indicative of a bad data point than a mild outlier. Examine the causes of all outliers carefully.

Read more: How to Calculate Outliers | eHow.com http://www.ehow.com/how_5201412_calculate-outliers.html#ixzz1lknpL6We

Detection of Outliers
Introduction An outlier is an observation that appears to deviate markedly from other observations in the sample.
Identification of potential outliers is important for the following reasons.

1. An outlier may indicate bad data. For example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying

Outliers may be due to random variation or may indicate something scientifically interesting. the plot can help determine whether we need to check for a single outlier or whether we need to check for multiple outliers. Identification Iglewicz and Hoaglin distinguish the three following issues with regards to outliers. That is. if the data contains significant outliers. Labeling.use robust statistical techniques that will not be unduly affected by outliers. Accomodation. If the normality assumption for the data being tested is not valid. do we need modify our statistical analysis to more appropriately account for these observations? 3. outlier identification . 2. In any event. In addition to checking the normality assumption.value should be deleted from the analysis (or corrected if possible). This section focuses on the labeling and identification issues. the lower and upper tails of the normal probability plot can be a useful graphical technique for identifying potential outliers. Normality Assumption Identifying an observation as an outlier depends on the underlying distribution of the data. Single Versus Some outlier tests are designed to detect the prescence of a . In this section. the prescence of one or more outliers may cause the tests to reject normality when it is in fact a reasonable assumption for applying the outlier test. we limit the discussion to univariate data sets that are assumed to follow an approximately normal distribution. it may not be possible to determine if an outlying point is bad data. For this reason. In particular. we may need to consider the use of robust statistical techniques. outlier accomodation . The box plot and the histogram can also be useful graphical tools in checking the normality assumption and in identifying potential outliers.e. it is recommended that you generate a normal probability plot of the data before applying an outlier test. 1.formally test whether observations are outliers. we typically do not want to simply delete the outlying observation. However. and so on). Although you can also perform formal tests for normality. 2. indicative of an inappropriate distributional model. In some cases. are the potential outliers erroneous data..flag potential outliers for further investigation (i. if we cannot determine that potential outliers are erroneous observations. outlier labeling . then a determination that there is an outlier may in fact be due to the non-normality of the data rather than the prescence of an outlier.

5 be labeled as potential outliers. In addition.Multiple Outliers single outlier while other tests are designed to detect the prescence of multiple outliers. data is given in units of how many standard deviations it is from the mean. Although it is common practice to use Z-scores to identify possible outliers. Masking and Swamping Masking can occur when we specify too few outliers in the test. masking is one reason that trying to apply a single outlier test sequentially can fail. if we are testing for two or more outliers when there is in fact only a single outlier. some tests that detect multiple outliers may require that you specify the number of suspected outliers exactly. For example. These can be grouped by the following characteristics:  What is the distributional model for the data? We restrict our discussion to tests that assume the data follow an approximately normal distribution. with MAD denoting the median absolute deviation and denoting the median. In other words. Z-Scores and Modified ZScores The Z-score of an observation is defined as with and s denoting the sample mean and sample standard deviation. These authors recommend that modified Z-scores with an absolute value of greater than 3. masking may cause the outlier test for the first outlier to return a conclusion of no outliers (and so the testing for any additional outliers is not performed). Also. Formal Outlier Tests A number of formal outlier tests have proposed in the literature. For example. On the other hand. respectively. if we are testing for a single outlier when there are in fact two (or more) outliers. Due to the possibility of masking and swamping. Swamping and masking are also the reason that many tests require that the exact number of outliers being tested must be specified. Graphics can often help identify cases where masking or swamping may be an issue. . It is not appropriate to apply a test for a single outlier sequentially in order to detect multiple outliers. both points may be declared outliers (many tests will declare either all or none of the tested points as outliers). if there are multiple outliers. it is useful to complement formal outlier tests with graphical methods. these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers. this can be misleading (partiucarly for small sample sizes) due to the fact that the maximum Z-score is at most Iglewicz and Hoaglin recommend using the modified Z-score . For example. swamping can occur when we specify too many outliers in the test.

This list is not exhaustive (a large number of outlier tests have been proposed in the literature). In addition to discussing additional tests for data that follow an approximately normal distribution. these sources also discuss the case where the data are not normally distributed. It has the limitation that the number of outliers must be specified exactly. Grubbs' Test . For example.this is the recommended test when testing for a single outlier. Tietjen-Moore Test . .this is a generalization of the Grubbs' test to the case of more than one outlier. 1. you can transform the data to normality by taking the logarithms of the data and then applying the outlier tests discussed here. 3. Further Information Iglewicz and Hoaglin provide an extensive discussion of the outlier tests given above (as well as some not given above) and also give a good tutorial on the subject of outliers. Lognormal Distribution The tests discussed here are specifically based on the assumption that the data follow an approximately normal disribution. is based a value being too large (or small) compared to its nearest neighbor. If your data follow an approximately lognormal distribution. the Dixon test. The tests given here are essentially based on the criterion of "distance from the mean". This is not the only criterion that could be used. Barnett and Lewis provide a book length treatment of the subject. which is not discussed here. does the number of outliers need to be specified exactly or can we specify an upper bound for the number of outliers? The following are a few of the more commonly used outlier tests for normally distributed data. Generalized Extreme Studentized Deviate (ESD) Test this test requires only an upper bound on the suspected number of outliers and is the recommended test when the exact number of outliers is not known. 2.  Is the test designed for a single outlier or is it designed for multiple outliers? If the test is designed for multiple outliers.

1 Causes o 1. the free encyclopedia Jump to: navigation. For example. naively interpreting the mean as "a typical sample". is incorrect. which may be two distinct sub-populations. but they are often indicative either of measurement error or that the population has a heavy-tailed distribution. For other uses. this is modeled by a mixture model. some data points will be further away from the sample mean than what is deemed reasonable. search This article is about the statistical term. may include the sample maximum or sample minimum. Estimators capable of coping with outliers are said to be robust: the median is a robust statistic. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions. Box plot of data from the Michelson-Morley Experiment displaying outliers in the middle column. Figure 1. outliers may be indicative of data points that belong to a different population than the rest of the sample set. or areas where a certain theory might not be valid.2 Caution 2 Identifying outliers . while in the latter case they indicate that the distribution has high kurtosis and that one should be very cautious in using tools or intuitions that assume a normal distribution. Grubbs[2] defined an outlier as: An outlying observation. Naive interpretation of statistics derived from data sets that include outliers may be misleading. Contents [hide]   1 Occurrence and causes o 1. if one is calculating the average temperature of 10 objects in a room. or it may be that some observations are far from the center of the data. or both. However. equivalent to the median.5 and 40 °C. or outlier. however. a small number of outliers is to be expected (and not due to any anomalous condition). in large samples. Outlier points can therefore indicate faulty data. but an oven is at 175 °C. Outliers. In statistics. being the most extreme observations. depending on whether they are extremely high or low. while the mean is not. the median better reflects the temperature of a randomly sampled object than the mean. is one that appears to deviate markedly from other members of the sample in which it occurs. In the former case one wishes to discard them or use statistics that are robust to outliers. As illustrated in this case. However. the sample maximum and minimum are not always outliers because they may not be unusually far from other observations. or may indicate 'correct trial' versus 'measurement error'. an outlier[1] is an observation that is numerically distant from the rest of the data. erroneous procedures. Outliers can occur by chance in any distribution. A frequent cause of outliers is a mixture of two distributions. and most are between 20 and 25 degrees Celsius. In most larger samplings of data. the median of the data may be 23 °C but the mean temperature will be between 35. In this case.Outlier From Wikipedia. see Outlier (disambiguation).

principled and systematic techniques are used. and flags them as potential outliers. being less than twice the expected number and hence within 1 standard deviation of the expected number – see Poisson distribution. If the sample size is only 100. just three such outliers are already reason for concern. Outliers arise due to changes in system behaviour. and thus for 1. where appropriate.1 Retention o 3. Outlier detection[4] has been used for centuries to detect and. Thus if one takes a normal distribution with cutoff 3 standard deviations from the mean. see three sigma rule[3] for details. however. and 1 in 370 will deviate by three times the standard deviation. instrument error or simply through natural deviations in populations. roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean. This is essentially a learning approach analogous to unsupervised clustering.Model both normality and abnormality. it is possible to test if the number of outliers deviate significantly from what can be expected: for a given cutoff (so samples fall beyond the cutoff with probability p) of a given distribution. an outlier could be the result of a flaw in the assumed theory. This approach is analogous to supervised classification and requires pre-labeled data. The original outlier detection methods were arbitrary but now. it is ill-advised to ignore the presence of outliers. the pathological appearance of outliers of a certain form appears in a variety of datasets. In general.3%. if the nature of the population distribution is known a priori. indicating that the causative mechanism for the data might differ at the extreme end (King effect). the number of outliers will follow a binomial distribution with parameter p. and not indicative of an anomaly. the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected. tagged as normal or abnormal.3 Non-normal distributions o 3. In a sample of 1000 observations. It may be considered semi-supervised as the normal class is taught but the algorithm learns to recognize abnormality.4 Alternative models 4 See also 5 References 6 External links [edit] Occurrence and causes In the case of normally distributed data. There are three fundamental approaches to the problem of outlier detection:    Type 1 . being more than 11 times the expected number. A sample may have been contaminated with elements from outside the population being examined. [edit] Identifying outliers There is no rigid mathematical definition of what constitutes an outlier. Alternatively. pinpoints the most remote points. Additionally. [edit] Caution Unless it can be ascertained that the deviation is not significant.000 trials one can approximate the number of samples whose deviation exceeds 3 sigmas by a Poisson distribution with λ = 3. and identify observations which are deemed "unlikely" based on mean and standard deviation:   Chauvenet's criterion Grubbs' test for outliers . There may have been an error in data transmission or transcription. human error. fraudulent behaviour. determining whether or not an observation is an outlier is ultimately a subjective exercise. Outlier detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. The approach processes the data as a static distribution. p is approximately . This is analogous to a semisupervised recognition or detection task.    3 Working with outliers o 3.2 Exclusion o 3. Type 3 . Type 2 . which can generally be well-approximated by the Poisson distribution with λ = pn. Model-based methods which are commonly used for identification assume that the data are from a normal distribution.Determine the outliers with no prior knowledge of the data. A physical apparatus for taking measurements may have suffered a transient malfunction. remove anomalous observations from data. [edit] Causes Outliers can have many anomalous causes. drawn from the full gamut of computer science and statistics. Outliers that cannot be readily explained demand special attention – see kurtosis risk and black swan theory.Model only normality (or in a few cases model abnormality). calling for further investigation by the researcher.

while mathematical criteria provide an objective and quantitative method for data rejection. The application should use a classification algorithm that is robust to outliers to model data with naturally occurring outlier points. In regression problems. for example by using a hierarchical Bayes model or a mixture model. then one could for some constant k. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. this should be clearly stated on any subsequent report. and no more. having "fat tails". that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many. using a measure such as Cook's distance. Peirce's criterion[5][6][7][8] It is proposed to determine in a series of m observations the limit of error. [edit] Exclusion Deletion of outlier data is a controversial practice frowned on by many scientists and science instructors.) n   Dixon's Q test ASTM E178 Standard Practice for Dealing With Outlying Observations Other methods flag observations based on measures such as the interquartile range. provided there are as many as such observations. especially in small sets or where a normal distribution cannot be assumed. the sample mean fails to converge as the sample size increases. (Quoted in the editorial note on page 516 to Peirce (1982 edition) from A Manual of Astronomy 2:558 by Chauvenet. [edit] Working with outliers The choice of how to deal with an outlier should depend on the cause. [edit] Alternative models In cases where the cause of the outliers is known. an alternative approach may be to only exclude points which exhibit a large degree of influence on the parameters. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.[13][14] .[11] If a data point (or points) is excluded from the data analysis. [edit] Retention Even when a normal distribution model is appropriate to the data being analyzed. The principle upon which it is proposed to solve this problem is. For instance. and outliers are expected at far larger rates than for a normal distribution. Other approaches are distance-based[9][10] and frequently use the distance to the k-nearest neighbors to label observations as outliers or non-outliers.[12] the sample variance increases with the sample size. For example. beyond which all observations involving so great an error may be rejected. outliers are expected for large sample sizes and should not automatically be discarded if that is the case. abnormal observations. they do not make the practice more scientifically or methodologically sound. if define an outlier to be any observation outside the range: Q1 and Q3 are the lower and upper quartiles respectively. when sampling from a Cauchy distribution. it may be possible to incorporate this effect into the model structure. [edit] Non-normal distributions The possibility should be considered that the underlying distribution of the data is not approximately normal.

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->