About These Notes

Introduction

Descriptive Statistics

Introduction to Probability and Statistics For Engineers and Scientists


The maths, the computation, and examples.

Dr Asad Ali
Department of Space Science, Institute of Space Technology, Islamabad, Pakistan

About Me

Name: Asad Ali

PhD (2007-2011) in Astro-Statistics from the Department of Statistics, University of Auckland, New Zealand.

During the PhD:
- Worked as a gravitational wave data analyst for the NASA-ESA space mission, the Laser Interferometer Space Antenna (LISA), a space-borne GW detector.
- Developed Bayesian Monte Carlo algorithms for gravitational wave spectrum analysis.
- Used supercomputers such as BeSTGRID (AUS-NZ) and ATLAS (Max Planck Institute for Gravitational Physics (AEI), Hannover, Germany).

Currently associated with a European project, the Einstein Telescope (a deep Earth GW detector), in the same role.

Office 022, Hostel Block.

Why am I here?

I want to help you:
- learn statistics so that you can conduct research effectively,
- develop critical thinking and analytical skills,
- act as an informed engineering scientist.

Statistics is the fundamental tool for conducting research and analysis in all disciplines. Without proper knowledge of statistics, a scientist is like a blind person looking in a dark room for a black cat that is not actually there. If you cannot present and interpret your measurements, your knowledge is of an unsatisfactory kind.

About These Notes

- These are just reference notes; there is no need to memorize them.
- The purpose of these lecture notes is to teach you the "how to" of statistics.
- They are self-explanatory.
- You are an END user. You are not supposed to worry about how your car is manufactured; rather, you need to learn how to drive it. So spend your effort on understanding the statistical concepts and their applications.
- If you cannot present and interpret your measurements, your knowledge is of an unsatisfactory kind.
- There are several good books on statistics available in the IST library, and there are many good websites on the Internet as well.

Recommended Study Materials

Textbooks
1. Modern Mathematical Statistics with Applications, Second Edition, by Jay L. Devore
2. Mathematical Statistics with Applications, by John E. Freund

Reference Material
1. Probability and Statistics for Engineering and the Sciences, Fifth Edition, by Jay L. Devore and Kenneth N. Berk
2. Probability & Statistics for Engineers and Scientists, Fifth Edition, by Ronald E. Walpole

You can pick any book and visit any website that you think can help you in learning statistics. For example, I would recommend:
1. Introduction to Statistical Theory, by Sher Muhammad Chaudhry and Dr Shahid Kamal (Part I, for now).

Introduction

Chapter 1: Introduction

Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs to the domain of statistics. In engineering, this includes such diversified tasks as:
- calculating the average length of the downtimes of a computer (System Engineering)
- collecting and analyzing data on various weather variables: temperature, air pressure, water vapor (Meteorology)
- evaluating the effectiveness of commercial products (Quality Control)
- predicting the reliability of a rocket, or studying the vibrations of airplane wings (Aerospace and Aeronautics)
- estimating the break-points and analyzing the stress-strain relationships of materials (Materials)
- calculating the average life of electrical equipment (Electrical Engineering)

and many other tasks pertaining to engineering and other disciplines of science and art.

What is Statistics?
Statistics is the area of science that deals with the collection, organization, analysis, and interpretation of data to assist in making more effective decisions in the face of uncertainty (that is, incomplete information; why? Note it.).

Branches of statistics: statistics can be divided into two major branches.
- Descriptive statistics involves the organization, summarization, and display of data. Descriptive statistics are typically presented graphically, in tabular form (in tables), or as summary statistics (single values).
- Inferential statistics are procedures that allow researchers to infer, or generalize, observations made on samples to the larger population from which the samples were selected. They are used to interpret the meaning of descriptive statistics.

In this course you will learn how to apply and interpret both types of statistics in science and in practice, to make you a good interpreter of statistical information and an excellent decision maker in the face of incomplete information.

Further reading and exercises: Have a look at the introduction and Section 1.1 of Devore's book and the examples therein.

Basic Terms and Concepts


Some fundamental terms and concepts that we need to know:

Data
Data consist of information coming from observations, e.g. counts, measurements, or responses.

Variable
A variable is a characteristic (or an attribute) that describes a person, place, thing, or idea. The value of the variable varies from one entity to another. Variables are generally denoted by $X$, $Y$, $Z$ and their values (realizations) by $x_i$, $y_i$, $z_i$, with the subscript $i$ denoting the $i$th object or item for which the observation is made. Put simply, $x_i$ is the $i$th observation on $X$.

Types of variables
- Quantitative variable: A variable is called quantitative when the characteristic can be expressed numerically, e.g. temperature, time, weight, or the number of students in a class.
- Qualitative variable: A variable is called qualitative when the characteristic can be expressed only in categories, e.g. eye color (blue, brown, black), education (BA, MA, MS), or survey response (yes, no, agree, disagree).

Types of quantitative variables
- Discrete variables: Discrete variables vary only by whole numbers or integers (e.g. 1, 2, 3, ...). A discrete variable represents count data, such as the number of students in a class (it would not make sense to have half a student, would it?) or the number of defective mobile phones in a lot.
- Continuous variables: A quantitative variable is continuous if its possible values can be any number in a given interval (e.g. 10.0, 1.2, 87.2). A continuous variable represents measurement data such as temperature, weight, or length.

Measurement scales: The types of measurement of observations are usually called measurement scales. There are four of them, listed below.
- Nominal scale: Categorical with no ordering or ranking, e.g. red, blue, green
- Ordinal scale: Categorical with ordering or ranking, e.g. low, medium, high
- Interval scale: A constant interval size, but with no meaningful zero point, e.g. temperature in degrees Celsius or Fahrenheit
- Ratio scale: An interval scale with a meaningful zero point, e.g. length, age, weight

Errors

Errors are everywhere. "Essentially, all models are wrong, but some are useful." (George E. P. Box)

We cannot develop a model or a formula that represents the exact state of nature. We try to predict the behavior of nature, but all our instruments are imperfect. For example, I cannot say that my height is exactly 6 feet, even if it is measured with the most sensitive and sophisticated instrument. In fact, it is somewhere between 5.95 and 6.05 feet, or between 5.995 and 6.005 feet, depending on the sensitivity of the measuring equipment. Thus there are errors, no matter how small; they exist always and everywhere. Errors of this kind are called measurement errors. The good thing about them is that they tend to cancel out in repeated measurements (in the long run).

Another type of error is bias. When the observed value is consistently higher or lower than the true value, we say there is bias in the measurements. Such errors arise from the personal limitations of the observer, the imperfection of the measuring instrument, or some other conditions that control the measurements. They are cumulative in nature: the greater the number of measurements, the greater the magnitude of the accumulated error.
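
The difference between the two kinds of error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_measurements`, the noise level of 0.05 feet, and the bias of 0.10 feet are hypothetical choices made for illustration, and the random errors are assumed to be normally distributed.

```python
import random
import statistics

def simulate_measurements(true_value, noise_sd, bias, n, seed=0):
    """Return n simulated readings: true value + constant bias + random error.

    The normal random error and all numeric settings are illustrative
    assumptions, not values taken from the notes.
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_HEIGHT = 6.0  # feet

# Random measurement error only: readings scatter around the true value.
unbiased = simulate_measurements(TRUE_HEIGHT, noise_sd=0.05, bias=0.0, n=10_000)

# Same random error plus a constant bias of +0.10 feet.
biased = simulate_measurements(TRUE_HEIGHT, noise_sd=0.05, bias=0.10, n=10_000)

mean_unbiased = statistics.mean(unbiased)
mean_biased = statistics.mean(biased)

print("mean of unbiased readings:", round(mean_unbiased, 3))
print("mean of biased readings:  ", round(mean_biased, 3))
```

Averaging many readings brings the unbiased mean close to the true height, whereas the biased mean stays off by roughly the size of the bias no matter how many readings are taken.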

Population and samples
- Population: A population is the collection of all observations (outcomes, responses, measurements, or counts) that have some characteristic of interest. The total number of observations in a population is called its size, generally denoted by $N$.
- Sample: A sample is a subset of a population. Its size is denoted by $n$.

If the desired information is available for all items in the population, we have what is referred to as a census. In practice, we rarely have a complete set of data. We usually collect data in samples.

Parameters and statistics

The numbers used to describe a population are called parameters and are often denoted by Greek letters ($\mu$, $\sigma$), whereas the numbers used to describe a sample data set are called statistics, often denoted by Greek letters with hats over them ($\hat{\mu}$, $\hat{\sigma}$) or by English letters ($\bar{X}$, $S$). A statistic may be used to estimate a population parameter; for example, the average of a data set, $\bar{X}$, provides an estimate of the unknown population mean $\mu$. The difference between a statistic and a parameter is important to understand, because in statistical data analysis we often make inferences about a population based on sample statistics.

Another type of error: Since we rarely know every observation in a population, any conclusions or recommendations based on sample statistics are subject to error. However, we typically accept some margin of error rather than incur the cost of measuring every observation. This error is usually known as sampling error. We will learn more about it at a later point.
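
A small sketch may help to separate the ideas of parameter, statistic, and sampling error. Everything in it is an assumption made for illustration: the population is simply the whole numbers 1 to 1000, the sample size is 50, and the variable names are hypothetical.

```python
import random
import statistics

# Hypothetical population of size N = 1000 (assumed for illustration only).
population = list(range(1, 1001))
mu = statistics.mean(population)       # parameter: population mean

# Draw a simple random sample of size n = 50 without replacement.
rng = random.Random(42)
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)        # statistic: sample mean

# Sampling error: how far the statistic falls from the parameter.
sampling_error = x_bar - mu

print("population mean (parameter):", mu)
print("sample mean (statistic):    ", x_bar)
print("sampling error:             ", sampling_error)
```

A different seed gives a different sample and therefore a different value of the statistic, while the parameter stays fixed.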

Further reading: Have a look at Section 1.1 in Devore and try to solve the exercises at the end of the section.

Descriptive Statistics

Chapter 2: Descriptive Statistics - Presentation

Researchers can measure many physical processes, such as pressure, strength, survival time, and amount. Often, hundreds or thousands of measurements are made, and procedures have been developed to organize, summarize, and make sense of these measurements. These procedures, referred to as descriptive statistics, are specifically used to condense and summarize numerical observations, to extract the initial (meaningful) information, and to make the data ready for further manipulation.

In the univariate case, descriptive statistics mainly covers the following tasks of data analysis:
- Presentation of data using
  - tabulation methods (frequency distributions)
  - graphical methods (diagrams and graphs)
- Measures of central tendency (averages and quantiles)
- Measures of dispersion (ranges, deviations, variations)

In the multivariate case, descriptive statistics also covers the analysis of the relationships between different variables (covariance, correlation, regression, etc.).

Presentation
Tabulation methods

Frequency distribution: The frequency ($f$) of a particular observation is the number of times that observation occurs in the data. A frequency distribution is a table that lists the observations along with their respective frequencies.
- Frequency distribution with no grouping: For discrete data with a small range (or a small number of distinct values), the frequency table is constructed by arranging the distinct data values in ascending order of magnitude along with their corresponding frequencies.
- Frequency distribution with grouping: If the range of values is very broad, or if the data are continuous, the data are divided into non-overlapping groups or classes, and the number of observations falling in each group or class is recorded.

A frequency distribution condenses bulky data into a small table, which tells us about the pattern and shape of the distribution of values of the underlying variable or population.

A very simple example (without grouping)

Example 1. The marks awarded for an assignment set for a BE (MS&E) class of 20 students were as follows:

6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8

Present this information in a frequency table.

Solution: To construct a frequency table, we proceed as follows:
- Draw a three-column table with the column headings Marks, Tally, and Frequency.
- Put all the distinct values, without repetition, in the first column in ascending (or descending) order, as shown below.

Marks   Tally   Frequency
4
5
6
7
8
9
10

Data: 6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8
- The first data value is 6, so put a tally bar against 6; the second is 7, so put a tally bar against 7. Go on and put tallies for all the values.
- When a fifth tally is needed for a value, draw it as a slash across the previous four bars to form a bundle of five (written ||||/ here).
- Count the bars for each data value; that count is the frequency.

Marks   Tally    Frequency
4       ||       2
5       ||       2
6       ||||     4
7       ||||/    5
8       ||||     4
9       ||       2
10      |        1

So we now have the data in a meaningful form. We can now answer questions such as the following: Where is the concentration (peak) of the data? How does it decline on either side? Is this a normal distribution of marks, or is there something wrong with the class performance? Do we need further investigation?
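
The tallying procedure of Example 1 can also be carried out by a computer. The following is a minimal sketch using Python's standard `collections.Counter`; the helper name `frequency_table` is a hypothetical choice, and the printed tallies use plain bars without the bundle-of-five slash.

```python
from collections import Counter

# Marks of the 20 students from Example 1.
marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7, 4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

def frequency_table(data):
    """Return (value, frequency) pairs sorted by value (ungrouped data)."""
    counts = Counter(data)
    return [(value, counts[value]) for value in sorted(counts)]

table = frequency_table(marks)

print("Marks  Tally   Frequency")
for value, freq in table:
    # One bar per observation stands in for the hand-drawn tally marks.
    print(f"{value:>5}  {'|' * freq:<6}  {freq:>3}")

# The frequencies must add up to the number of observations.
total = sum(freq for _, freq in table)
print("Total:", total)
```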

The "how to" of a frequency distribution with grouping

When there are too many values in the data and they are widely spread out, it is difficult to set up a frequency table with a row for every data value, as there would be too many rows in the table. Before proceeding, we need to learn a few terms and rules that we will need for the construction of a frequency distribution with grouping (classes).

- Class-limits: The numbers that describe a class or group. The two limits are called the lower class-limit and the upper class-limit. The class-limits (CL) should be inclusive and should not cause any overlap between adjacent classes, e.g. age in years can be classified as 10-14, 15-19, 20-24 or as 10.0-14.9, 15.0-19.9, 20.0-24.9, etc.
- Class-boundaries: The class-boundaries (CB) are the precise numbers that separate one class from its immediate neighbours. A CB is the midpoint between the upper limit of one class and the lower limit of the next class; e.g. for the first two classes 10-14 and 15-19, the class-boundary is $\frac{14+15}{2} = 14.5$. Thus, for 10-14, 15-19, 20-24, the CBs are 9.5-14.5, 14.5-19.5, 19.5-24.5, so the CBs are one decimal place more precise than the class-limits. The upper class-boundary of one class coincides with the lower class-boundary of the next class, thus leaving no gap.
- Class marks: Class marks are simply the midpoints of the classes. For example, the class mark of the class 10-14 is $\frac{10+14}{2} = 12$.
- Class interval or class width: The class interval, traditionally denoted by $h$, is the difference between the two class-boundaries of the same class, or the difference between the lower (or upper) limits of two consecutive classes. In the above case the class interval is 5. Ideally, all classes should have equal intervals. Unequal intervals can occur, but they should be avoided unless required, because they make interpretation difficult.
- Class frequency: The frequency of a particular class is the number of data values that fall within the limits of that class.
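
The definitions above can be checked numerically. The sketch below computes class-boundaries, class marks, and the class width for the classes 10-14, 15-19, 20-24; the function names are hypothetical, and the code assumes equally spaced classes with the same gap between every pair of adjacent classes.

```python
def class_boundaries(limits):
    """Return (lower, upper) class-boundaries for inclusive class-limits.

    Each boundary is the midpoint between the upper limit of one class and
    the lower limit of the next; the same half-gap is assumed at both ends.
    """
    gap = limits[1][0] - limits[0][1]   # e.g. 15 - 14 = 1
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in limits]

def class_marks(limits):
    """Return the midpoint of each class."""
    return [(lower + upper) / 2 for lower, upper in limits]

limits = [(10, 14), (15, 19), (20, 24)]

boundaries = class_boundaries(limits)
marks = class_marks(limits)
width = boundaries[0][1] - boundaries[0][0]   # class interval h

print("class-boundaries:", boundaries)
print("class marks:     ", marks)
print("class width h:   ", width)
```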

A typical frequency distribution with grouping looks like the following table.

Classes   Class-boundaries   Tally bars   Class-marks        Frequency
10-14     9.5-14.5                        (10+14)/2 = 12
15-19     14.5-19.5                       17
20-24     19.5-24.5                       22

The columns of class-boundaries and class-marks help in the calculation of different statistical quantities, such as the mean, median, and quantiles, as we will see in the next chapter.

A few rules
- How many classes? There is no hard rule for deciding how many classes to make. Both too few and too many classes defeat the purpose of constructing a frequency distribution: too few classes result in the loss of a lot of information, and too many classes defeat the purpose of condensation. As a rule of thumb, a number between 5 and 15 gives reasonable results. (I think 15 is still too large; I would not take a number larger than 10 unless I am using a computer.)
- Find the range, that is, the difference between the maximum and the minimum values in the data.
- Calculate the class width (interval) $h$ by dividing the range of the data by the number of classes. If the division results in a decimal number, take the next higher whole number. Avoid fractional numbers as intervals; they bring headaches. Taking a multiple of 5 or 10 eases the problem and also increases the readability of the table. The resulting classes should cover the whole of the data. Note: you can also choose a convenient interval first and then calculate the number of classes, provided the whole data set is covered by a reasonable number of classes.
- Where to start the first class? Usually the lower class-limit is put at or below the smallest data value. Remember, the lower class-limit of the first class should never be larger than the smallest value in the data; otherwise the values at the lower end of the data will be lost. Starting from a multiple of 5 or 10 would not hurt.
- Find the upper class-limit by counting from the lower class-limit to the end of the interval. Note that adding the interval directly to the lower class-limit is erroneous, because the classes are inclusive: adding the interval to the lower class-limit of a class gives the lower class-limit of the next class, not the upper limit of the same class. For example, with $h = 5$ and a lower limit of 10, the class is 10-14 (five values: 10, 11, 12, 13, 14), and 15 is the lower limit of the next class. (Most students forget this; be careful.)

- Find the rest of the classes by adding the interval to the lower and upper class-limits of each class to get the lower and upper class-limits of the next class.
- Now the hard part: scanning the data (the mouse hunt), putting the values in the appropriate classes, and recording the tally marks and frequencies.
- Determine the sum of the frequencies to check that all the values were included.
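
The rules for setting up the classes can be sketched as a small routine. This is an illustration under stated assumptions, not a definitive procedure: the function name `build_classes`, its parameters, and the sample ages are hypothetical, and the data are assumed to be whole numbers so that the upper limit of a class is its lower limit plus $h - 1$.

```python
import math

def build_classes(data, num_classes, round_to=1):
    """Return (h, classes) with inclusive (lower, upper) class-limits.

    Assumes whole-number data. round_to lets the width and the starting
    point be rounded to a convenient multiple (e.g. 5 or 10).
    """
    smallest, largest = min(data), max(data)
    data_range = largest - smallest

    # Width = range / number of classes, rounded up to a whole number.
    h = max(1, math.ceil(data_range / num_classes))
    # Optionally round the width up to a multiple of round_to.
    h = math.ceil(h / round_to) * round_to

    # Start at or below the smallest value, on a multiple of round_to.
    lower = (smallest // round_to) * round_to

    # Keep adding classes until the largest value is covered, so the
    # count can exceed num_classes when the range divides exactly.
    classes = []
    while lower <= largest:
        classes.append((lower, lower + h - 1))   # inclusive limits
        lower += h                               # next lower limit
    return h, classes

# Hypothetical ages in years, chosen only for illustration.
ages = [12, 10, 24, 17, 19, 21, 14, 15]
h, classes = build_classes(ages, num_classes=3, round_to=5)
print("h =", h)
print("classes:", classes)
```

Adding `h` moves from one lower limit to the next lower limit, while the upper limit of each class is `lower + h - 1`, which mirrors the warning in the rules.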

An example of a frequency distribution with grouping

Example 2. Thirty energy saver light bulbs were tested to determine how long they usually last. The results, to the nearest day, were recorded as follows:

423 369 387 411 393 394 371 377 389 409
392 408 431 401 363 391 405 382 400 381
399 415 428 422 396 372 410 419 386 390

Construct a frequency distribution for these values.

Solution: First we need to find the range:

$\text{Range} = \text{Largest} - \text{Smallest} = 431 - 363 = 68$

Let there be 8 classes; the class interval is then

$h = \frac{\text{Range}}{\text{Number of classes}} = \frac{68}{8} = 8.5$

which we round up to $h = 10$ because it eases the data-scanning process.

Now let's make the table and set the classes. The smallest value is 363, so we start from 360 and set the first class as 360-369, the second as 370-379, and so on. Now start scanning the data, allocate the values to their corresponding classes, and put tallies for them accordingly. When a data value is allocated to a class, cross that value out in the original data set to indicate that it has been counted and to avoid recounting.

Classes    Tally bars (TBs)   Frequency (f)
360-369
370-379
380-389
390-399
400-409
410-419
420-429
430-439
Total

Go on scanning, crossing out, and counting, and put the tallies accordingly. Fill in the rest of the columns.

Classes    Tally bars (TBs)   Frequency (f)
360-369    ||                 2
370-379    |||                3
380-389    ||||/              5
390-399    ||||/ ||           7
400-409    ||||/              5
410-419    ||||               4
420-429    |||                3
430-439    |                  1
Total                         $\sum f = n = 30$

Sum up the frequencies to check that all the data values have been picked up. Looking at this frequency distribution, we can quickly see that more bulbs have a life between 390 and 399 days than in any other class, as this group has the largest frequency (7). Thus, this group can be regarded as the representative group of the data. We can also see how the frequencies decrease toward the tails of the distribution, and that the distribution looks fairly symmetric.
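
The scanning and counting step of Example 2 is easy to automate. The following sketch assumes equal-width classes starting at 360 with $h = 10$, as chosen in the example; the function name `grouped_frequencies` is hypothetical.

```python
# Bulb lifetimes in days from Example 2.
bulb_life = [
    423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
    401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
    410, 377, 382, 419, 389, 400, 386, 409, 381, 390,
]

def grouped_frequencies(data, start, width, num_classes):
    """Count how many whole-number values fall in each inclusive class."""
    classes = [(start + i * width, start + (i + 1) * width - 1)
               for i in range(num_classes)]
    freqs = [0] * num_classes
    for x in data:
        index = (x - start) // width      # which class x belongs to
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} is not covered by the classes")
        freqs[index] += 1
    return classes, freqs

classes, freqs = grouped_frequencies(bulb_life, start=360, width=10,
                                     num_classes=8)

for (lower, upper), f in zip(classes, freqs):
    print(f"{lower}-{upper}  {f}")
print("Total", sum(freqs))
```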

Relative frequency and percentage frequency

While studying these data we may want to know not only how long the bulbs last, but also what proportion of the bulbs falls into each class of bulb life. This is called the relative frequency (RF) of a particular observation or class, and it is found by dividing its frequency ($f$) by the total number of observations $n$, that is,

$RF = \frac{f}{n}$

A clearer measure is the percentage frequency (PRF), which is found by multiplying each relative frequency by 100:

$PRF = RF \times 100$

The PRF tells us what percent of the observations fall in a particular class, which gives a somewhat clearer picture than the RF.

Example 3. Let's calculate the RF and PRF for Example 2.

Classes    f    RF = f/n        PRF = RF × 100
360-369    2    2/30 = 0.07     (2/30) × 100 = 7
370-379    3    3/30 = 0.10     (3/30) × 100 = 10
380-389    5    0.17            17
390-399    7    0.23            23
400-409    5    0.17            17
410-419    4    0.13            13
420-429    3    0.10            10
430-439    1    0.03            3
Total      $\sum f = n = 30$    1.00    100

Looking at this table we can now say that:
- The chance that a randomly selected bulb has a life in the 390-399 day class is approximately 0.23.
- 23% of the bulbs have a life of at least 390 days but less than 400 days.
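
The RF and PRF columns of Example 3 can be reproduced in a few lines. This is a minimal sketch; the variable names are hypothetical, and values are rounded to two decimals (RF) and whole percents (PRF) only for display.

```python
# Class frequencies from Example 2 (classes 360-369, ..., 430-439).
class_labels = ["360-369", "370-379", "380-389", "390-399",
                "400-409", "410-419", "420-429", "430-439"]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

n = sum(freqs)                          # total number of observations
rf = [f / n for f in freqs]             # relative frequency  RF = f / n
prf = [100 * r for r in rf]             # percentage frequency PRF = RF * 100

rf_rounded = [round(r, 2) for r in rf]
prf_rounded = [round(p) for p in prf]

for label, f, r, p in zip(class_labels, freqs, rf_rounded, prf_rounded):
    print(f"{label}  f={f}  RF={r:.2f}  PRF={p}")
```

The unrounded relative frequencies always add up to 1, which is a useful check on the arithmetic.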

Cumulative frequency distribution

A cumulative frequency distribution table is the same as a frequency distribution table with additional columns that give the cumulative frequency (CF) and the cumulative percentage (CP) of the data. The cumulative frequency distribution gives us an idea of how many observations fall below or above a given value. It also tells us the number of observations that lie between two given values.

The CFs are obtained by adding the frequency of each class, in succession, to the cumulative total of the previous frequencies, that is, by accumulating (keeping a running total of) the elements of the frequency column. The accumulation can be carried out either from the top class (or value), in which case the CF is called the "less than" type CF, or from the bottom class (or value), in which case it is known as the "more than" type CF. In grouped data, the upper class boundaries are used for the "less than" type CF and the lower class boundaries for the "more than" type.

Example 4. We calculate a "less than" type CF and CP for the data in Example 2.

Upper class boundary   f    CF            CP = (CF/n) × 100
< 369.5                2    2             (2/30) × 100 = 7
< 379.5                3    2+3 = 5       (5/30) × 100 = 17
< 389.5                5    5+5 = 10      33
< 399.5                7    10+7 = 17     57
< 409.5                5    17+5 = 22     73
< 419.5                4    22+4 = 26     87
< 429.5                3    26+3 = 29     97
< 439.5                1    29+1 = 30     100
Total                  n = 30

Suppose we are asked how many, or what percent, of the observations lie below 399.5. From the table we quickly learn that there are 17 observations below the given value, which makes them 57% of the entire data.

Note: We use the upper class boundaries for a "less than" (<) type CF distribution.
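
The running total used for the "less than" type CF corresponds to `itertools.accumulate` in Python's standard library. The sketch below reproduces the table of Example 4; the variable names are hypothetical, and the percentages are rounded to whole numbers as in the table.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from Examples 2 and 4.
upper_boundaries = [369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5, 439.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(freqs)

# "Less than" type CF: running total from the top class downwards.
cf_less = list(accumulate(freqs))
cp_less = [round(100 * cf / n) for cf in cf_less]

for boundary, cf, cp in zip(upper_boundaries, cf_less, cp_less):
    print(f"< {boundary}  CF={cf}  CP={cp}%")

# How many observations lie below 399.5?
below_399_5 = dict(zip(upper_boundaries, cf_less))[399.5]
print("observations below 399.5:", below_399_5)
```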

Example 5. Now let's calculate a "more than" type CF and CP for the data in Example 2.

Lower class boundary   f    CF             CP = (CF/n) × 100
> 359.5                2    28+2 = 30      (30/30) × 100 = 100
> 369.5                3    25+3 = 28      (28/30) × 100 = 93
> 379.5                5    20+5 = 25      83
> 389.5                7    13+7 = 20      67
> 399.5                5    8+5 = 13       43
> 409.5                4    4+4 = 8        27
> 419.5                3    1+3 = 4        13
> 429.5                1    1              3
Total                  n = 30

Suppose now we are asked how many, or what percent, of the observations lie above 399.5. From the table we quickly learn that there are 13 observations above the given value, which makes them 43% of the entire data.

Note: We use the lower class boundaries for a "more than" (>) type CF distribution.
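
For the "more than" type, the accumulation runs from the bottom class upwards. A minimal sketch, with hypothetical variable names, is to accumulate the reversed frequency column and then reverse the result; it also confirms that the last cumulative percentage is $\frac{1}{30} \times 100 \approx 3$.

```python
from itertools import accumulate

# Lower class boundaries and frequencies from Examples 2 and 5.
lower_boundaries = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(freqs)

# "More than" type CF: running total from the bottom class upwards.
cf_more = list(accumulate(reversed(freqs)))[::-1]
cp_more = [round(100 * cf / n) for cf in cf_more]

for boundary, cf, cp in zip(lower_boundaries, cf_more, cp_more):
    print(f"> {boundary}  CF={cf}  CP={cp}%")

# How many observations lie above 399.5?
above_399_5 = dict(zip(lower_boundaries, cf_more))[399.5]
print("observations above 399.5:", above_399_5)
```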

Graphical Methods

We now introduce the graphic displays widely used for data presentation in the engineering sciences. Most of the time we want a visual presentation of the data in order to see its patterns clearly. Patterns in data are commonly described in terms of center, spread, shape, and unusual features. Some common distributions have special descriptive labels, such as symmetric, bell-shaped, skewed, etc. We often need answers to questions like:
- Where are the data located (center)?
- How spread out are the data?
- Are the data symmetric or skewed?
- Are there outliers in the data?

Histogram

A histogram is a visual version of a frequency table. The main purpose of a histogram is to enhance the presentation of data. You can present the same information in a table; however, the graphic format usually makes it easier to see the nature of the distribution. A histogram consists of vertical bars, usually called bins or frequency bins, that represent the different classes of a frequency table. Usually, there is no space between adjacent bars. The height of each bar indicates the frequency of its class. A histogram can typically help you answer the following questions:
- What is the most frequent observation?
- What distribution (center, variation, and shape) do the data have?
- Does the distribution of the data look symmetric, or is it skewed towards the left or right?

Example 6. Let's construct a histogram and a relative frequency histogram for the energy saver bulb data given in Example 2. We have already constructed the frequency table in Examples 2 and 3. Let's now depict it.

[Figure: two panels, "Histogram of Data" (vertical axis: Frequency) and "Relative Frequency Histogram of Data" (vertical axis: Frequency, on a relative scale from 0.00 to 0.20); horizontal axis in both panels: Data values, from 360 to 440.]

One can also construct a percentage frequency histogram by multiplying the relative frequencies by 100.
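
When no plotting tool is at hand, the shape of a histogram can be previewed in plain text. The sketch below is only an approximation of the idea: it draws horizontal bars of `#` characters (one per observation) rather than the vertical bars of a real histogram, and the function name `text_histogram` is hypothetical.

```python
# Classes and frequencies of the bulb data from Example 2.
class_labels = ["360-369", "370-379", "380-389", "390-399",
                "400-409", "410-419", "420-429", "430-439"]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

def text_histogram(labels, frequencies, symbol="#"):
    """Return one text line per class; bar length equals the frequency."""
    return [f"{label} | {symbol * f} ({f})"
            for label, f in zip(labels, frequencies)]

lines = text_histogram(class_labels, freqs)
for line in lines:
    print(line)

# The class with the longest bar holds the most frequent observations.
modal_class = class_labels[freqs.index(max(freqs))]
print("class with the highest bar:", modal_class)
```

Replacing the frequencies by relative frequencies changes only the scale of the bars, not the shape of the picture.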

Some of the key features that we usually look for in a histogram:
- Center: Graphically, the center of a distribution is located at the median of the distribution. The median is the point in a graphic display with about half of the observations on either side. In the chart to the right, the height of each column indicates the frequency of observations. Here, the observations are centered over 4.
- Spread: The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are more clustered around a single value, the spread is smaller.

Shape: The shape of a distribution is described by the following characteristics.
- Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal.
- Symmetry. When it is graphed, a unimodal symmetric distribution can be divided at the center so that each half is a mirror image of the other. A single-peaked symmetric distribution is referred to as a bell-shaped distribution.
- Skewness. When displayed graphically, some unimodal distributions have many more observations on one side of the graph than on the other. Distributions with most of their observations on the left (toward lower values) are said to be skewed right, and distributions with most of their observations on the right (toward higher values) are said to be skewed left.
- Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peak(s).
- Gaps. Gaps refer to areas of a distribution where there are no observations. The second-last figure on the next slide has a gap; there are no observations in that part of the distribution.
- Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers.

Different shapes of histograms.

[Figure: panels plotting $f(x_i)$ against $x_i$, titled "A normal distribution", "A skewed distribution", "A uniform distribution", "A bimodal distribution", "A distribution with outliers", and "Cliplike distribution".]

Cumulative Histogram

Just as a histogram pairs with a frequency table, the cumulative histogram is a visual version of the cumulative frequency table. It tells us how many (or what percentage) of the total number of observations have accumulated up to each bin (or interval). It makes finding the percentage or proportion of observations falling within a given interval considerably easier. An ordinary and a cumulative histogram of the same data are given in the following figures.

[Figure: two panels, "Histogram of Data" (vertical axis: Frequency) and "Cumulative Histogram of Data" (vertical axis: Cumulative Frequency, rising to 30); horizontal axis in both panels: Data values, from 360 to 440.]

The cumulative histogram illustrates the concept that most probability distributions use to calculate the probabilities associated with different events, so learning about it and understanding it is a must.

Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values, especially discrete values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Example 7. The study included 33 students whose first-grade IQ scores are given here:

The following figure shows a dotplot for the above data. A representative IQ value is around 110, and the data are fairly symmetric about the center.

Stem and Leaf Displays

A stem-and-leaf plot (also known as a stemplot) of a quantitative variable is a textual graph that classifies data items according to their most significant digits. It is generally used for small data sets (50 or fewer observations). A stem-and-leaf display is similar to a histogram, since it shows how many values in a set fall within a certain interval. It carries even more information, because it also shows the actual values within each interval.
- A stem consists of the leading digit(s) of an observation, whereas the remaining digits are the leaf. For example, the observation 327 can be split as stem = 3 and leaf = 27, or as stem = 32 and leaf = 7.
- The stemplot is drawn in two columns separated by a vertical line, with the stems listed to the left of the line.
- Each stem is listed only once and no stems are skipped, even if a stem has no leaves.
- The leaves are listed in increasing order in a row to the right of each stem.
- When a number is repeated in the data (such as two 72s), the plot must reflect this (e.g. the plot of 72 72 75 76 would look like 7 | 2 2 5 6).

Example 8. The stem-and-leaf plot of the energy saver bulb data is constructed as below.

Stem | Leaves
 36  | 3 9
 37  | 1 2 7
 38  | 1 2 6 7 9
 39  | 0 1 2 3 4 6 9
 40  | 0 1 5 8 9
 41  | 0 1 5 9
 42  | 2 3 8
 43  | 1

(Key: 40|8 = 408)

In this example we could also use a single-digit stem, but then there would be only two stems, 3 and 4, resulting in a much less informative plot. In the case of values with decimal points (continuous data), the decimal part of each number is taken as the leaf. Rounding may be used so that all data values have the same number of decimal places.

Further reading and exercises: Have a look at the introduction and Section 1.2 of Devore's book and the examples therein. Then solve questions 10, 11, 12, 13, 14, 15, 16.a, 16.b, 17, 20, 24, 25, 29 in Exercise 1.2.
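The construction rules above (every stem listed once, no stem skipped, leaves in increasing order, repeats shown) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stem_and_leaf` and its `leaf_digits` parameter are assumptions, not part of the notes, and it handles non-negative integers only. The 30 bulb values are read back from the display in Example 8.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=1):
    """Return the rows of a text stem-and-leaf display (non-negative integers)."""
    base = 10 ** leaf_digits
    rows = defaultdict(list)
    for v in sorted(values):            # sorting puts the leaves in increasing order
        rows[v // base].append(v % base)
    lines = []
    # List every stem from the smallest to the largest, even one with no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(f"{leaf:0{leaf_digits}d}" for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Bulb data read back from the display in Example 8 (key: 40|8 = 408).
bulbs = [363, 369, 371, 372, 377, 381, 382, 386, 387, 389,
         390, 391, 392, 393, 394, 396, 399, 400, 401, 405,
         408, 409, 410, 411, 415, 419, 422, 423, 428, 431]

display = stem_and_leaf(bulbs)
print("\n".join(display))
```

Calling `stem_and_leaf(bulbs, leaf_digits=2)` gives the two-stem version (stems 3 and 4) mentioned above, which shows why the two-digit stem is the more informative choice here.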

Chapter 3: Descriptive Statistics Measures of Central Tendency


Measures of Central Tendency


Instead of describing or presenting a whole data set, we sometimes use one value to represent it. This makes it easier to compare data of the same type. A data set can be summarized by a single value, usually somewhere in its center. It is the value around which the data tend to concentrate; the point at which the distribution is in balance. These measures are also called averages, or measures of location or position. The most commonly used measures of central tendency are the mean, median and mode.

Mean: Also known as the arithmetic mean, the mean is typically what is meant by the word "average". It is perhaps the most common measure of central tendency. The mean of a data set is given by

$$\text{Mean} = \frac{\text{the sum of all data values}}{\text{the number of values}}$$

For example, the mean of 4, 8, and 9 is

$$\frac{4 + 8 + 9}{3} = \frac{21}{3} = 7.$$

The sample mean is written as $\bar{x}$, and the population mean as the Greek letter mu ($\mu$). Despite its popularity, the mean may not be an appropriate measure of central tendency in skewed distributions, or in data with outliers.



Definition. The sample mean $\bar{x}$ of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is defined as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad i = 1, \ldots, n.$$

You can ignore the limits ($i = 1, \ldots, n$) of the summation symbol and simply write it as $\sum$.

For grouped data, arranged in a frequency table with $k$ classes with midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the mean is given by

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}, \qquad i = 1, \ldots, k.$$

Note that $\sum_{i=1}^{k} f_i = n$.



Example 9. In a mathematics test, the marks of six students are 36, 52, 57, 58, 67 and 90. Find the mean of their marks.

Solution: The mean is given by

$$\bar{x} = \frac{\sum x_i}{n} = \frac{36 + 52 + 57 + 58 + 67 + 90}{6} = \frac{360}{6} = 60 \text{ marks}$$

The arithmetic mean is quite sensitive to a change in any single value, which makes it an inappropriate measure under certain circumstances. It gives good results when the observations are reasonably similar, but its value can be greatly affected by the presence of a single outlier (an extreme observation). For example, suppose that in the above example one data value, say 52, was mistakenly recorded as 352. The resulting mean would then be 660/6 = 110, which is quite different from the previous one and could lead to a different decision about the data.
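The effect of a single mis-recorded value on the mean is easy to check numerically. The short sketch below recomputes Example 9 with and without the recording error; the helper name `mean` is just an illustrative choice.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

marks = [36, 52, 57, 58, 67, 90]
marks_with_error = [36, 352, 57, 58, 67, 90]   # 52 mistakenly recorded as 352

mean_ok = mean(marks)
mean_bad = mean(marks_with_error)
print(mean_ok, mean_bad)   # 60.0 110.0
```

One wrong entry out of six moves the mean from 60 to 110, a value larger than every correctly recorded mark.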



Example 10. Find the mean for the data in Example 2.

Solution: Let's put the necessary columns in the frequency table obtained in Example 2.

Classes    Midpoints (x)    f     fx
360-369    364.5            2     729.0
370-379    374.5            3     1123.5
380-389    384.5            5     1922.5
390-399    394.5            7     2761.5
400-409    404.5            5     2022.5
410-419    414.5            4     1658.0
420-429    424.5            3     1273.5
430-439    434.5            1     434.5
Total                       30    11925

Thus, with $k = 8$ classes, the mean is

$$\bar{x} = \frac{\sum_{i=1}^{8} f_i x_i}{\sum_{i=1}^{8} f_i} = \frac{11925}{30} = 397.5$$
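The grouped-mean calculation of Example 10 can be reproduced with a small sketch like the following. The function name `grouped_mean` is an assumption made for illustration; the midpoints and frequencies are those of the table above.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class midpoints."""
    total_fx = sum(f * x for x, f in zip(midpoints, freqs))
    return total_fx / sum(freqs)

midpoints = [364.5, 374.5, 384.5, 394.5, 404.5, 414.5, 424.5, 434.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

xbar = grouped_mean(midpoints, freqs)
print(xbar)   # 397.5
```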



Median: The median is the middle value of a set of data when they are arranged in ascending or descending order. It is the value above and below which 50% of the ordered data lie. The median is insensitive to outliers.

Definition. The sample median $\tilde{x}$ (the symbol ~ is pronounced "tilde" (till-day)) of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is obtained as follows.

Arrange the $n$ observations in ascending (or descending) order.
If $n$ is an odd number, the median is the $\left(\frac{n+1}{2}\right)$th observation.
If $n$ is an even number, the median is the mean of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2} + 1\right)$th (i.e. the two middle) observations.

For grouped data the median is calculated by the following formula

$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right)$$

where $l$ is the lower class boundary of the median class, $h$ is the class interval, $f$ is the frequency of the median class, and $C$ is the cumulative frequency of the preceding class. The median class is the class containing the $\left(\frac{n}{2}\right)$th observation. There is no need to worry about $n$ being odd or even.



Example 11. Consider the following 12 data points.
15.2 20.4 9.3 9.4 7.6 11.5 11.9 16.2 10.4 9.4 9.7 8.3

(a) Find the median for the full data. (b) Omit the largest (or the smallest) observation and find the median again.

Solution: Rearranging the values in ascending order, we have
7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9 15.2 16.2 20.4

(a) Here $n = 12$ is even, and hence the median is the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations. The two middle values in the ordered data are the 6th and 7th, i.e. 9.7 and 10.4, so

$$\tilde{x} = \frac{9.7 + 10.4}{2} = 10.05$$




(b) Let's omit the last observation, 20.4, and calculate the median again. Now, since $n = 11$ is odd, the median is just the middle observation, that is the $\left(\frac{n+1}{2}\right)$th observation, which is the 6th here:
7.6 8.3 9.3 9.4 9.4 [9.7] 10.4 11.5 11.9 15.2 16.2

Thus the median is $\tilde{x} = 9.7$.

For the full data ($n = 12$), the mean is $\bar{x} = \frac{139.3}{12} = 11.61$. Now let's replace 20.4 by 50.0 in the full data. The mean becomes $\bar{x} = \frac{168.9}{12} = 14.075$, but the median remains unchanged at 10.05. Thus the median is insensitive to outliers.
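The odd/even rule for the median, and its insensitivity to the outlier, can be checked with a sketch along these lines. The function name `median` is an illustrative choice; the data are those of Example 11.

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # the (n/2)th and (n/2 + 1)th

data = [15.2, 20.4, 9.3, 9.4, 7.6, 11.5, 11.9, 16.2, 10.4, 9.4, 9.7, 8.3]

median_full = median(data)                          # part (a), n = 12
without_largest = sorted(data)[:-1]                 # part (b), n = 11
median_reduced = median(without_largest)

with_outlier = [50.0 if x == 20.4 else x for x in data]
mean_full = sum(data) / len(data)
mean_outlier = sum(with_outlier) / len(with_outlier)
median_outlier = median(with_outlier)

print(median_full, median_reduced)
print(round(mean_full, 2), round(mean_outlier, 3), median_outlier)
```

The mean shifts by about 2.5 units when 20.4 becomes 50.0, while the median of the full data stays where it was.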



Example 12. Find the median for the grouped data in Example 2.

Solution: Let's put the necessary columns in the frequency table obtained in Example 2. First we find the median group using $\frac{n}{2} = \frac{30}{2} = 15$. The 15th observation falls within the cumulative frequency 17 in the cf column, so 389.5-399.5 is our target or median group, where the median lies.

Classes    CBs            f    cf
360-369    359.5-369.5    2    2
370-379    369.5-379.5    3    5
380-389    379.5-389.5    5    10   <- C
390-399    389.5-399.5    7    17   <- n/2 = 30/2 = 15 falls here
400-409    399.5-409.5    5    22
410-419    409.5-419.5    4    26
420-429    419.5-429.5    3    29
430-439    429.5-439.5    1    30

Now we have $l = 389.5$, $h = 10$, $f = 7$, $C = 10$. Thus, by putting the values in the formula, we have

$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) = 389.5 + \frac{10}{7}(15 - 10) = 396.64$$
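The same steps (locate the median class through the cumulative frequency, then interpolate inside it) can be sketched as below. The function name `grouped_median` and its argument layout are assumptions for illustration; the class boundaries and frequencies are those of Example 12, with equal class interval h = 10.

```python
def grouped_median(lower_bounds, freqs, width):
    """Grouped-data median: l + (h / f) * (n/2 - C)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    target = n / 2
    cum = 0   # C: cumulative frequency of the classes before the current one
    for l, f in zip(lower_bounds, freqs):
        if cum + f >= target:          # this is the median class
            return l + (width / f) * (target - cum)
        cum += f
    raise ValueError("median class not found")

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

x_med = grouped_median(lower_bounds, freqs, width=10)
print(round(x_med, 2))   # 396.64
```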


Mode: The mode is simply the most frequent observation in a data set. A data set can have more than one mode; if it has two, it is said to be bimodal. When you have categorical data, i.e. data that appear as words instead of numbers, you need to use the mode. For example, if a sandwich shop sells 10 different types of sandwiches, the mode would represent the most popular sandwich. The mode is also insensitive to outliers.

Example 13. The mode of {1, 1, 2, 3, 5, 8} is 1. The modes of {1, 3, 5, 7, 9, 9, 21, 25, 25, 31} are 9 and 25. Thus, the data are bimodal.

Definition. For grouped data, the mode is calculated by the following formula.

$$\text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

where $l$ is the lower class boundary of the modal class (the class with the largest frequency), $f_m$ is the frequency of the modal class (the largest frequency), $f_1$ is the frequency of the class preceding the modal class, $f_2$ is the frequency of the class succeeding the modal class, and $h$ is the class interval.



Example 14. Find the mode for the grouped data in Example 2.

Solution: We know that the class with the highest frequency is 389.5-399.5, so it is our modal group.

Classes    CBs            f
360-369    359.5-369.5    2
370-379    369.5-379.5    3
380-389    379.5-389.5    5    <- f1
390-399    389.5-399.5    7    <- fm
400-409    399.5-409.5    5    <- f2
410-419    409.5-419.5    4
420-429    419.5-429.5    3
430-439    429.5-439.5    1

From the above table we have $l = 389.5$, $h = 10$, $f_1 = 5$, $f_m = 7$ and $f_2 = 5$. So the mode is

$$\text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h = 389.5 + \frac{7 - 5}{2 \times 7 - 5 - 5} \times 10 = 394.5$$
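A hedged sketch of the grouped-mode formula follows. Two choices here are assumptions that the notes do not state: when several classes share the largest frequency the first one is used, and a missing neighbour (modal class at either end of the table) is treated as having frequency 0. The function name `grouped_mode` is illustrative.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Grouped-data mode: l + (fm - f1) / (2*fm - f1 - f2) * h."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                   # preceding class (assumed 0 if absent)
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0      # succeeding class (assumed 0 if absent)
    return lower_bounds[m] + (fm - f1) / (2 * fm - f1 - f2) * width

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

mode = grouped_mode(lower_bounds, freqs, width=10)
print(mode)   # 394.5
```

Because the two neighbouring frequencies are equal in this table, the mode lands exactly at the midpoint of the modal class.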



Quantiles: Quantiles are relatives of the median, as they represent equally spaced points around it. The median divides an ordered data set into two equal parts, while quantiles divide it into more than two parts, which helps to indicate the extent to which the data lie near the median or near the extremes. Some of the commonly used quantiles are defined here.

Quartiles: Quartiles divide an ordered data set into four equal parts. The values that separate the parts are called the first, second, and third quartiles, and they are denoted by $Q_1$, $Q_2$, and $Q_3$, respectively.

$Q_1$ is the middle value in the first half of the rank-ordered data set; 25% of the data lie below it.
$Q_2$ is equal to the median of the data set.
$Q_3$ is the middle value in the second half of the rank-ordered data set; 75% of the data lie below it.

Quartiles are calculated in the same manner as the median, except for multiplication by the extra factors 1, 2 and 3 for the first, second and third quartiles respectively. Since $Q_2$ = Median, we usually calculate only the first and the third quartiles for a given data set. Quartiles are also known as fourths (Devore's terminology): $Q_1$ is called the lower quartile or lower fourth, and $Q_3$ is known as the upper quartile or upper fourth.



Definition. For a sample of size $n$, the $j$th ($j = 1, 3$) quartile is defined as follows. First arrange the $n$ observations in ascending (or descending) order. Then

$$Q_j = j\left(\frac{n+1}{4}\right)\text{th observation}, \qquad j = 1, 3,$$

that is, for $j = 1$,

$$Q_1 = \left(\frac{n+1}{4}\right)\text{th observation}$$

and for $j = 3$,

$$Q_3 = 3\left(\frac{n+1}{4}\right)\text{th observation}$$

If the resulting position contains a fraction (that is, $n + 1$ is not divisible by 4), then the quartile is the mean of the values at the positions just below and just above it.



Example 15. Find the two quartiles for the following ungrouped data.
65 55 89 56 35 14 56 55 87 45 92

Solution: First arrange the values in ascending order,
14 35 45 55 55 56 56 65 87 89 92

We have $n = 11$, so

$$Q_1 = \frac{(n+1)}{4}\text{th} = \frac{(11+1)}{4}\text{th} = \frac{12}{4}\text{th} = 3\text{rd observation}, \qquad Q_1 = 45$$

and

$$Q_3 = \frac{3(n+1)}{4}\text{th} = \frac{3(11+1)}{4}\text{th} = \frac{36}{4}\text{th} = 9\text{th observation}, \qquad Q_3 = 87$$

Now, if $n$ were 10, the position of the first quartile would be $(10 + 1)/4 = 2.75$, which is not a whole number. The quartile would then be the average of the 2nd and 3rd values in the ordered list.
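The position rule of the definition, including the averaging step for a fractional position, can be sketched as follows. The function name `quartile` is illustrative, and the guard for very small samples is an added assumption (the rule can point past the last observation when n < 3).

```python
def quartile(values, j):
    """j-th quartile (j = 1 or 3) as the j*(n + 1)/4-th ordered observation.

    A fractional position is resolved by averaging the two neighbouring values.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 observations for this rule")
    pos = j * (n + 1) / 4          # 1-based position in the ordered data
    below = int(pos)               # position just below (or equal to) pos
    if pos == below:
        return ordered[below - 1]
    return (ordered[below - 1] + ordered[below]) / 2

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3)   # 45 87

# With only the 10 smallest values, (10 + 1)/4 = 2.75 is fractional,
# so Q1 is the average of the 2nd and 3rd ordered values.
q1_ten = quartile(sorted(data)[:10], 1)
print(q1_ten)   # 40.0
```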


There are several other ways of finding quartiles, but this one is simple. Different methods may give different results. For example, one method, which can be found in many texts, divides the ordered data into two halves. The median of the first half is $Q_1$ and that of the second half is $Q_3$. If $n$ is odd, then the left-over middle value of the entire data set is included in both halves. Now let's find the quartiles for the data in the previous example by this method. The ordered data are given below.

Lower half: 14 35 45 55 55 56    with $Q_1 = \frac{45 + 55}{2} = 50$
Upper half: 56 56 65 87 89 92    with $Q_3 = \frac{65 + 87}{2} = 76$

Since $n$ is odd, there is a single middle value, 56. We divide the data into two halves and include 56 in both sets. Both halves then have an even number of data points. Using the method of finding the median for even $n$, we get $Q_1 = 50$ and $Q_3 = 76$.

Deciles: Deciles divide a rank-ordered data set into ten equal parts. These are defined in the same way as quartiles, except that now the divisor is 10 instead of 4 and $j$ runs from 1 to 10. These are denoted by $D_1, D_2, \ldots, D_{10}$.

Percentiles: Percentiles divide a rank-ordered data set into a hundred equal parts. These are also defined in the same way as quartiles, except that now the divisor is 100 instead of 4 and $j$ runs from 1 to 100. These are denoted by $P_1, P_2, \ldots, P_{100}$.
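The "median of each half" method can be sketched as below, which also makes it easy to see that it gives different quartiles from the position rule for the same data (50 and 76 here, against 45 and 87). The function names are illustrative assumptions.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    When n is odd, the middle value is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half + n % 2]   # includes the middle value when n is odd
    upper = ordered[half:]
    return median_sorted(lower), median_sorted(upper)

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
q1, q3 = quartiles_by_halves(data)
print(q1, q3)   # 50.0 76.0
```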


In reality, the quartiles and deciles are also percentiles. We can relate them as follows.

$P_{25}$ is equal to $Q_1$.
$P_{50}$ is the median of the data set.
$P_{75}$ is equal to $Q_3$.
$P_{60}$ is equal to $D_6$, and so on.

For grouped data (frequencies) one has to use the cumulative frequency, as was done in the calculation of the median. Percentiles are useful for giving the relative standing of an individual observation in a population; they are essentially the rank position of an individual observation. For grouped data, we calculate the quartiles, deciles and percentiles using the same formula as for the median, with a slight modification. Thus,

Quartiles

$$Q_j = l + \frac{h}{f}\left(\frac{jn}{4} - C\right) \qquad \text{where } j = 1, 2, 3$$

Deciles

$$D_j = l + \frac{h}{f}\left(\frac{jn}{10} - C\right) \qquad \text{where } j = 1, 2, \ldots, 10$$

and Percentiles

$$P_j = l + \frac{h}{f}\left(\frac{jn}{100} - C\right) \qquad \text{where } j = 1, 2, \ldots, 100$$

Solve an exercise using these formulas.
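As one possible exercise, the three formulas can be applied to the grouped bulb data used in Examples 12 and 14. The sketch below uses a single helper, `grouped_quantile`, whose name and `parts` argument (4, 10 or 100) are assumptions made for illustration; $l$, $f$ and $C$ are taken from the class that contains the $(jn/\text{parts})$th observation, as for the median.

```python
def grouped_quantile(lower_bounds, freqs, width, j, parts):
    """Grouped-data quantile: l + (h / f) * (j*n/parts - C).

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    n = sum(freqs)
    target = j * n / parts
    cum = 0   # C: cumulative frequency before the current class
    for l, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= target:
            return l + (width / f) * (target - cum)
        cum += f
    raise ValueError("quantile class not found")

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freqs = [2, 3, 5, 7, 5, 4, 3, 1]

q1 = grouped_quantile(lower_bounds, freqs, 10, j=1, parts=4)
q2 = grouped_quantile(lower_bounds, freqs, 10, j=2, parts=4)
q3 = grouped_quantile(lower_bounds, freqs, 10, j=3, parts=4)
d6 = grouped_quantile(lower_bounds, freqs, 10, j=6, parts=10)
p60 = grouped_quantile(lower_bounds, freqs, 10, j=60, parts=100)
print(q1, round(q2, 2), q3, d6, p60)
```

Here `q2` reproduces the grouped median of Example 12, and `d6` and `p60` coincide, in line with the relation $P_{60} = D_6$.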




In J. L. Devore's book (the soft copy): Have a look at Section 1.3 in Chapter 1. Exercises, Section 1.3: Questions 30-37. Note: Ignore the trimmed mean, trimmed median etc. for now. However, the concept is quite easy, and if you would like, I can explain them to you.


Chapter 4: Descriptive Statistics Measures of Dispersion


Measures of Dispersion
Imagine you are comparing two different data sets (for now, measured in the same units, e.g. kg, km etc.). By chance, it happens that the two data sets have the same means, medians or modes. Does it mean that the two data sets are the same, or that they have the same features? No. Here we need some extra insight into the data; as a first step, we need to measure their respective dispersions or variabilities about the center and then compare them. Some of the most commonly used measures of dispersion are

Range, Mid-range
Inter-quartile Range (also called the fourth-spread), Semi-inter-quartile Range
Mean Deviation
Variance and Standard Deviation

The range is quite a simple measure: as you know, it is just the difference between the two extreme values in the data. The mid-range is just the average of the two extreme values, i.e.

$$\text{mid-range} = \frac{\text{max-value} + \text{min-value}}{2}$$

So we will start from the inter-quartile and semi-inter-quartile range.


Inter-quartile range (aka fourth-spread): The inter-quartile range, denoted IQR, is a measure of spread from the lower quartile to the upper quartile,

$$\text{IQR} = Q_3 - Q_1$$

From now on we will denote IQR by $f_s$ for simplicity.

Semi-inter-quartile range: The SIQR is just half of the IQR,

$$\text{SIQR} = \frac{Q_3 - Q_1}{2}$$

The pure measure (free of units of measurement) is the coefficient of quartile deviation (CQD), defined as

$$\text{CQD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Because this measure is free of measurement units, it can be used to compare two or more data sets with different units of measurement.
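For a quick numerical feel, the three quartile-based measures can be computed from the quartiles $Q_1 = 45$ and $Q_3 = 87$ found in Example 15. The helper name `quartile_spread` is an illustrative assumption.

```python
def quartile_spread(q1, q3):
    """IQR, SIQR and coefficient of quartile deviation from Q1 and Q3."""
    iqr = q3 - q1
    return {"IQR": iqr, "SIQR": iqr / 2, "CQD": iqr / (q3 + q1)}

spread = quartile_spread(45, 87)   # quartiles from Example 15
print(spread["IQR"], spread["SIQR"], round(spread["CQD"], 3))   # 42 21.0 0.318
```

The IQR and SIQR carry the units of the data, while the CQD (about 0.318 here) is a pure number.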

Mean Deviation: The mean (or median) deviation (MD), or mean absolute deviation (MAD), is also a measure of dispersion, defined as the average of the absolute differences/deviations between the data values and the data center (usually the mean or the median). Mathematically, using the mean as the data center,

$$\text{MD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}.$$

Similarly, for the median, the MedD is defined as

$$\text{MedD} = \frac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}$$

where $\bar{x} = \frac{\sum x}{n}$.

For grouped data, arranged in a frequency table with $k$ classes having midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the MD and MedD are given by

$$\text{MD} = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n} \qquad \text{and} \qquad \text{MedD} = \frac{\sum_{i=1}^{k} f_i |x_i - \tilde{x}|}{n}$$

where $\bar{x} = \frac{\sum f x}{n}$.

Example 16. Find the MD and MedD for the following simple data.
65 55 89 56 35 14 56 55 87 45 92

Solution: Let's denote the data by $X$. What we need first are the mean and the median. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{65 + 55 + \cdots + 92}{11} = \frac{649}{11} = 59.$$

Since $n$ is odd, the median is just the middle observation of the ordered data,
14 35 45 55 55 56 56 65 87 89 92

hence the median is 56. Now let's proceed to find the MD and MedD.

Let's arrange the data in a table and calculate the quantities required for the above formulas, i.e. $\sum |x_i - \bar{x}|$ and $\sum |x_i - \tilde{x}|$.

$x_i$    $x_i - \bar{x}$    $|x_i - \bar{x}|$    $x_i - \tilde{x}$    $|x_i - \tilde{x}|$
65       65-59 = 6          6                    65-56 = 9            9
55       55-59 = -4         4                    55-56 = -1           1
89       30                 30                   33                   33
56       -3                 3                    0                    0
35       -24                24                   -21                  21
14       -45                45                   -42                  42
56       -3                 3                    0                    0
55       -4                 4                    -1                   1
87       28                 28                   31                   31
45       -14                14                   -11                  11
92       33                 33                   36                   36
Total                       194                  33                   185

Hence,

$$\text{MD} = \frac{\sum |x_i - \bar{x}|}{n} = \frac{194}{11} = 17.6 \qquad \text{and} \qquad \text{MedD} = \frac{\sum |x_i - \tilde{x}|}{n} = \frac{185}{11} = 16.8$$
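The two columns of absolute deviations can be checked with a short sketch like this one. The helper name `mean_abs_deviation` is an illustrative assumption; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

def mean_abs_deviation(values, center):
    """Average absolute deviation of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

md = mean_abs_deviation(data, mean(data))       # about the mean (59)
medd = mean_abs_deviation(data, median(data))   # about the median (56)
print(round(md, 1), round(medd, 1))   # 17.6 16.8
```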

Example 17. Find the MD and MedD for the following grouped data.

x :  14  35  45  55  56  65  87  89  92
f :   4   7  11  13  18  13   8   6   3

Solution: Again, first we need the mean and the median to calculate the necessary columns. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{4870}{83} = 58.7$$

and the median is

$$\tilde{x} = \text{the } \left(\frac{n}{2}\right)\text{th observation} = 41.5\text{th observation}$$

From the cf column of the table on the following slide, we find that the median is $\tilde{x} = 56$.

$x$      $f$    $x_i - \bar{x}$    $f_i|x_i - \bar{x}|$    cf    $x_i - \tilde{x}$    $f_i|x_i - \tilde{x}|$
14       4      -44.7              178.7                   4     -42                  168
35       7      -23.7              165.7                   11    -21                  147
45       11     -13.7              150.4                   22    -11                  121
55       13     -3.7               47.8                    35    -1                   13
56       18     -2.7               48.1                    53    0                    0
65       13     6.3                82.2                    66    9                    117
87       8      28.3               226.6                   74    31                   248
89       6      30.3               182.0                   80    33                   198
92       3      33.3               100.0                   83    36                   108
Total    83                        1182                                               1120

We have all the stuff for the MD and MedD:

$$\text{MD} = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} = \frac{1182}{83} = 14.2$$

$$\text{MedD} = \frac{\sum f_i |x_i - \tilde{x}|}{\sum f_i} = \frac{1120}{83} = 13.5$$

That's it!
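The grouped calculation can be sketched as follows. The function names are illustrative assumptions, and the median is located as in the example: it is the first value whose cumulative frequency reaches $n/2$.

```python
def weighted_median(xs, fs):
    """First value whose cumulative frequency reaches n/2 (as in the example)."""
    target = sum(fs) / 2
    cum = 0
    for x, f in zip(xs, fs):
        cum += f
        if cum >= target:
            return x
    raise ValueError("frequency table is empty")

def grouped_abs_deviation(xs, fs, center):
    """sum(f * |x - center|) / sum(f)."""
    return sum(f * abs(x - center) for x, f in zip(xs, fs)) / sum(fs)

xs = [14, 35, 45, 55, 56, 65, 87, 89, 92]
fs = [4, 7, 11, 13, 18, 13, 8, 6, 3]

n = sum(fs)
xbar = sum(f * x for x, f in zip(xs, fs)) / n
xmed = weighted_median(xs, fs)

md = grouped_abs_deviation(xs, fs, xbar)
medd = grouped_abs_deviation(xs, fs, xmed)
print(n, round(xbar, 1), xmed, round(md, 1), round(medd, 1))   # 83 58.7 56 14.2 13.5
```

The code uses the unrounded mean 4870/83, so its intermediate sums can differ from the table in the last digit, but the rounded results agree.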

Variance and Standard Deviation: Variance is defined as the mean of the squared deviations of all the observations from the mean. The population variance is denoted by $\sigma^2$ and the sample variance is denoted by $S^2$ or $s^2$. Mathematically, for simple data,

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \qquad \text{for population data,}$$

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n}, \qquad \text{for large sample data } (n > 30),$$

or

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}, \qquad \text{for small sample data } (n \le 30).$$

The standard deviation is just the positive square root of the variance, defined as

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}, \qquad \text{for population data,}$$

$$S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}, \qquad \text{for large sample data } (n > 30),$$

or

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \qquad \text{for small sample data } (n \le 30).$$

Standard deviation (SD) is a widely used measure of variability or diversity in statistics and probability theory. It shows how much variation or dispersion exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data points are spread out over a large range of values.

Do you remember the distance formula $D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$? Do you notice the similarity between the SD and this formula? The SD does almost the same job as $D$, except that it averages the squared deviations and that the coordinates of the second point (here the mean) are the same for all pairs, i.e. for two data points

$$\text{SD} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2}}$$

Generalizing to $n$ observations $x_1, x_2, \ldots, x_n$, the SD is

$$\text{SD} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

SD is commonly used to measure confidence in statistical conclusions.

For grouped data arranged in a frequency table with $k$ classes with midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the variance and standard deviation are given by

$$S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}$$

and

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}$$

where

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

In the next slides we will solve an example problem using the data from Example 17.

Example 18. Find the SD and variance of the data in Example 17.

Solution: Looking at the nature of the data (i.e. observations with frequencies), we need to use the formula for the SD given on the previous page. That is,

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}$$

For this we need the mean $\bar{x}$, which is 58.7 (from Example 17), so let's just calculate the quantity required by the above formula, i.e. $\sum_{i=1}^{k} f_i (x_i - \bar{x})^2$. The variance is calculated by taking the square of the SD. So we need to construct the following table.

$x$      $f$    $x_i - \bar{x}$    $f_i (x_i - \bar{x})^2$
14       4      -44.7              7983
35       7      -23.7              3923
45       11     -13.7              2057
55       13     -3.7               176
56       18     -2.7               129
65       13     6.3                520
87       8      28.3               6419
89       6      30.3               5518
92       3      33.3               3332
Total    83                        30056

We now have the required stuff for the formula; let's put the values in it.

$$S = \sqrt{\frac{30056}{83}} = 19.0$$

Thus the standard deviation (SD) of the said data is 19.0. The variance is simply the square of $S$, i.e. $S^2 = \frac{30056}{83} \approx 362.1$.
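The grouped variance and SD can be recomputed with a sketch such as the one below, which applies the divisor $\sum f_i$ used in the formula above. The function name `grouped_variance` is an illustrative assumption; the values and frequencies are those of Example 17.

```python
from math import sqrt

def grouped_variance(xs, fs):
    """sum(f * (x - xbar)^2) / sum(f) for values xs with frequencies fs."""
    n = sum(fs)
    xbar = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - xbar) ** 2 for x, f in zip(xs, fs)) / n

xs = [14, 35, 45, 55, 56, 65, 87, 89, 92]
fs = [4, 7, 11, 13, 18, 13, 8, 6, 3]

var = grouped_variance(xs, fs)
sd = sqrt(var)
print(round(var, 1), round(sd, 1))   # 362.1 19.0
```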

In practice, the variance and SD are calculated using the computationally friendly formulas given below. For a sample of size $n$ with values $x_i$; $i = 1, 2, \ldots, n$,

$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

Similarly, for grouped data distributed in $k$ groups, with midpoints $x_i$ and frequencies $f_i$ ($i = 1, 2, \ldots, k$), we use

$$S^2 = \frac{1}{n}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}\right]$$

The benefit of these formulas is that one does not need to calculate the column of differences, i.e. $x_i - \bar{x}$. By taking the positive square root of $S^2$, we get the SD.
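A quick check that the shortcut agrees with the definition, using the data of Example 16, might look like this. The two function names are illustrative assumptions; `pvariance` (divisor $n$) and `variance` (divisor $n - 1$) are the functions of Python's standard `statistics` module and are included only to show the effect of the divisor.

```python
from statistics import pvariance, variance

def variance_definition(values):
    """(1/n) * sum((x - xbar)^2): needs the column of differences."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / n

def variance_shortcut(values):
    """(1/n) * [sum(x^2) - (sum(x))^2 / n]: no column of differences needed."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

v_def = variance_definition(data)
v_short = variance_shortcut(data)
print(round(v_def, 2), round(v_short, 2))   # 514.18 514.18
print(round(pvariance(data), 2), round(variance(data), 2))   # 514.18 565.6
```

With only 11 observations the two divisors give noticeably different answers (about 514.2 against 565.6), which is why the small-sample formula matters.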


Box-plots

Stem-and-leaf displays and histograms convey rather general impressions about a data set, whereas a single summary such as the mean or standard deviation focuses on just one aspect of the data. A pictorial summary called a box-and-whisker plot, or simply box-plot, has been used successfully to describe several of the most prominent features of a data set. These features include (1) center, (2) spread, (3) the extent and nature of any departure from symmetry, and (4) identification of outliers, observations that lie unusually far from the main body of the data. Because even a single outlier can drastically affect the values of $\bar{x}$ and $s$, a box-plot is based on measures that are resistant to the presence of a few outliers, i.e. the median and a measure of spread called the fourth spread (inter-quartile range). A typical boxplot is given in the following figure.

[Figure: a typical boxplot of a variable, with the whiskers, the lower quartile $Q_1$, the median and the upper quartile $Q_3$ labelled along the data axis.]

The whiskers usually extend no farther than $1.5 f_s$ from the quartiles.


Boxplots That Show Outliers

A boxplot can be decorated further to indicate explicitly the presence of outliers.

Definition. Any observation farther than $1.5 f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3 f_s$ from the nearest fourth, and it is mild otherwise.

Example 19. The relevant summary quantities for Example 1.17 (page 39, Devore) are

$\tilde{x} = 92.17$    $Q_1 = 45.64$    $Q_3 = 167.79$
$f_s = Q_3 - Q_1 = 122.15$    $1.5 f_s = 183.225$    $3 f_s = 366.45$

Subtracting $1.5 f_s$ from $Q_1$ gives a negative number, and none of the observations are negative, so there are no outliers on the lower end of the data. However,

$$Q_3 + 1.5 f_s = 351.015 \qquad \text{and} \qquad Q_3 + 3 f_s = 534.24$$

Thus the four largest observations, 563.92, 690.11, 826.54, and 1529.35, are extreme outliers, and 352.09, 371.47, 444.68, and 460.86 are mild outliers. The box-plot for the above data can then be sketched as follows.
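The outlier rule can be expressed as a small classifier. This is only a sketch: the function name `classify` and its return labels are assumptions, and only the observations quoted in the example are checked, since the full data set is not reproduced in these notes.

```python
def classify(x, q1, q3):
    """Label an observation using the 1.5*fs and 3*fs rules."""
    fs = q3 - q1
    distance = max(q1 - x, x - q3, 0.0)   # distance from the closest fourth (0 inside the box)
    if distance > 3 * fs:
        return "extreme"
    if distance > 1.5 * fs:
        return "mild"
    return "not an outlier"

q1, q3 = 45.64, 167.79
quoted = [9.69, 312.45, 352.09, 371.47, 444.68, 460.86,
          563.92, 690.11, 826.54, 1529.35]

labels = {x: classify(x, q1, q3) for x in quoted}
for x, label in labels.items():
    print(x, label)
```

The smallest observation 9.69 and the value 312.45 come out as ordinary points, the next four as mild outliers, and the four largest as extreme outliers.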


The whiskers in the boxplot in Figure 1.19 extend out to the smallest observation, 9.69, on the low end and to 312.45, the largest observation that is not an outlier, on the upper end. There is some positive skewness in the middle half of the data (the median line is somewhat closer to the left edge of the box than to the right edge, since the median 92.17 lies nearer to $Q_1 = 45.64$ than to $Q_3 = 167.79$) and a great deal of positive skewness overall. We will learn about positive/negative skewness in the next few slides.

Most importantly, boxplots can be used to compare several data sets at once; e.g. see the following figure of the monthly boxplots of the daily temperatures in some country.

[Figure: side-by-side boxplots of daily temperatures for each month, January to December, with outliers marked as individual points.]
