About These Notes

Introduction

Descriptive Statistics

Introduction to Probability and Statistics For Engineers and Scientists
The maths, the computation, and examples.

Dr Asad Ali
Department of Space Science, Institute of Space Technology, Islamabad, Pakistan


About Me

Name: Asad Ali

PhD (2007-2011) in Astro-Statistics from the Department of Statistics, University of Auckland, New Zealand.

During PhD:
- Worked as a gravitational wave data analyst on the NASA-ESA space mission, the Laser Interferometer Space Antenna (LISA), a space-borne GW detector.
- Developed Bayesian Monte Carlo algorithms for gravitational wave spectrum analysis.
- Used supercomputers such as BeSTGRID (AUS-NZ) and ATLAS (Max Planck Institute for Gravitational Physics (AEI), Hannover, Germany).

Currently associated with a European project, the Einstein Telescope (a deep Earth GW detector), in the same role.

Office 022, Hostel Block.


Why am I here?

I want to help you!
I want to help you learn statistics so that you can conduct research effectively, develop critical thinking and analytical skills, and act as an informed engineering scientist. Statistics is the fundamental tool for conducting research and analysis in all disciplines.
“Without proper knowledge of statistics a scientist is like a blind person who is looking, in a dark room, for a black cat, which is not actually there.”
“If you can not present and interpret your measurements, your knowledge is of unsatisfactory kind.”


About These Notes

These are just reference notes. They are self-explanatory.
There are several good books on statistics available in the IST library. There are lots of good websites on the Internet as well.
The purpose of these lecture notes is just to teach you the “how to” of statistics.
You are an END user. You are not supposed to worry about how your car is manufactured; rather, you need to learn how to drive it.
So there is no need to memorize. Spend your minds on understanding the statistical concepts and their applications.

Recommended Study Materials

Textbooks
1. Probability and Statistics for Engineering and the Sciences, Fifth Edition, by Jay L. Devore
2. Mathematical Statistics with Applications by John E. Freund

Reference Material
1. Modern Mathematical Statistics with Applications, Second Edition, by Jay L. Devore and Kenneth N. Berk
2. Probability & Statistics for Engineers and Scientists, Fifth Edition, by Ronald E. Walpole

You can pick any book and visit any website which you think can help you in learning statistics. For example, I would recommend
1. Introduction to Statistical Theory by Sher Muhammad Chaudhry and Dr Shahid Kamal (Part I, for now).

Chapter 1: Introduction

Introduction

Everything dealing with the collection, processing, analysis, and interpretation of numerical data belongs to the domain of statistics. In engineering, this includes such diversified tasks as
- calculating the average length of the downtimes of a computer (System Engineering)
- collecting and analyzing data on various weather events, temperature, air pressure, water vapor (Meteorology)
- evaluating the effectiveness of commercial products (Quality Control)
- predicting the reliability of a rocket, or studying the vibrations of airplane wings (Aerospace and Aeronautics)
- estimating the break-points and analyzing the stress-strain relationships of materials (Materials)
- calculating the average life of an electrical equipment (Electrical Engineering)
and many other tasks pertaining to engineering and other disciplines of science and art.

What is Statistics?

Statistics is the area of science that deals with the collection, organization, analysis, and interpretation of data to assist in making more effective decisions in the face of uncertainty (incomplete information; why? Note it.).

Branches of statistics: statistics can be divided into two major branches.

Descriptive statistics involve the organization, summarization, and display of data. Descriptive statistics are typically presented graphically, in tabular form (in tables), or as summary statistics (single values).

Inferential statistics are procedures that allow researchers to infer or generalize observations made with samples to the larger population from which they were selected. Inferential statistics are used to interpret the meaning of descriptive statistics.

In this course you will learn how to apply and interpret both types of statistics in science and in practice, to make you a good interpreter of statistical information and an excellent decision maker in the face of incomplete information.

Further reading and exercises: Have a look at the introduction and Section 1.1 of Devore’s book and the examples therein.

Basic Terms and Concepts

Some fundamental terms and concepts that we need to know:

Data
Data consist of information coming from observation, e.g. counts, measurements, or responses.

Variable
A variable is a characteristic (or an attribute) that describes a person, place, thing, or idea. The value of the variable “varies” from one entity to another. Variables are generally expressed by $X$, $Y$, $Z$ and their values/realizations by $x_i$, $y_i$, $z_i$, with subscript “i” denoting the ith object/item for which the observation is made. More clearly, $x_i$ is simply the ith observation on $X$.

Types of variables
Quantitative variable: A variable is called quantitative when a characteristic can be expressed numerically, such as temperature, weight, time, number of students in classes, etc.
Qualitative variable: A variable is called qualitative when a characteristic can be expressed only with different categories, such as eye color (blue, brown, black), education (BA, MA, MS), survey response (yes, no, agree, disagree), etc.

Basic Terms and Concepts

Types of quantitative variables
Discrete Variables: Discrete variables vary only by whole numbers or integers (e.g. 0, 1, 2, 3, ...), such as the number of students in a class (it would not make sense to have half a student, would it?) and the number of defective mobiles in a lot. A discrete variable represents count data.
Continuous Variables: A quantitative variable is continuous if its possible values come from a given interval (e.g. 1.2, 10.0, 87.2). A continuous variable represents measurement data such as temperature, age, weight, length.

Measurement scales: The types of measurements of observations are usually called measurement scales. There are four, which are listed below.
Nominal scale: Categorical with no ordering or ranking, e.g. red, blue, green
Ordinal scale: Categorical with ordering or ranking, e.g. low, medium, strong
Interval scale: A constant interval size, but with no meaningful zero point, e.g. temperature
Ratio scale: An interval scale with a meaningful zero point, e.g. length, weight

Basic Terms and Concepts

Errors
Errors are everywhere. We are trying to predict the behavior of nature, but all our instruments are erroneous. We can not develop a model or a formula that can represent the “exact” state of nature. “Essentially all models are wrong but some are useful.” (George E. P. Box).

For example, I can not say that my height is exactly 6 feet, even if it’s measured with the most sensitive and sophisticated instrument. In fact, it is somewhere between 5.95 and 6.05 feet, or you can say between 5.995 and 6.005 feet, depending on the sensitivity of the measuring equipment. Thus there are errors, no matter how small, but they do exist always and everywhere. Errors of this kind are called measurement errors. These errors arise due to the personal limitation of the observer, the imperfection of the measuring instrument, or some other conditions that control the measurements. The good thing about these errors is that they cancel out in repeated measurements (long run).

Another type of error is bias. When the observed value is consistently and constantly higher or lower than the true value, we say there is bias in measurements. Such errors are cumulative in nature, that is, the greater the number of measurements, the greater would be the magnitude of error.
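The contrast between random measurement error and bias can be sketched with a small simulation. This is only an illustration under stated assumptions: the Gaussian error model and the values of `NOISE_SD` and `BIAS` are hypothetical choices, not something given in these notes.

```python
import random
from statistics import mean

random.seed(1)            # fixed seed so the run is repeatable
TRUE_HEIGHT = 6.0         # feet; the "true" value, assumed known for the demo
NOISE_SD = 0.05           # hypothetical spread of the random measurement error
BIAS = 0.10               # hypothetical constant offset of a miscalibrated tape


def measure(n, bias=0.0):
    """Return n simulated measurements: true value + bias + random error."""
    return [TRUE_HEIGHT + bias + random.gauss(0.0, NOISE_SD) for _ in range(n)]


unbiased = measure(10_000)
biased = measure(10_000, bias=BIAS)

# Random errors largely cancel in the average; the bias does not.
error_unbiased = mean(unbiased) - TRUE_HEIGHT
error_biased = mean(biased) - TRUE_HEIGHT

# Summed over all measurements, the biased error keeps growing with n.
total_bias_error = sum(x - TRUE_HEIGHT for x in biased)

print(f"average error, random only : {error_unbiased:+.4f} ft")
print(f"average error, with bias   : {error_biased:+.4f} ft")
print(f"accumulated error with bias: {total_bias_error:+.1f} ft")
```

With many repeated measurements the first average error should sit close to zero, while the second stays near the assumed bias however many measurements are taken.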

Basic Terms and Concepts

Population and samples
Population: A population is the collection of all observations (outcomes, responses, measurements, or counts) that have some characteristic of interest. The total number of observations in a population is called its size, generally denoted by ‘N’. If the desired information is available for all items in the population, we have what is referred to as a census. In practice, we rarely have a complete set of data. We usually collect data in samples.
Samples: A sample is a subset of a population. Its size is denoted by ‘n’.

Parameters and statistics
The numbers used to describe a population are parameters and are often denoted using Greek letters ($\mu$, $\sigma$). The numbers used to describe a sample data set are called statistics, often denoted by Greek letters with a hat over them ($\hat{\mu}$, $\hat{\sigma}$) or English letters ($\bar{X}$, $S$). A statistic may be used to estimate a population parameter such as the average of a data set, e.g. $\bar{X}$ or $\hat{\mu}$ provides an estimate of the unknown population mean, $\mu$. The difference between a statistic and a parameter is important to understand, because in statistical data analysis we often make inferences about a population based on sample statistics.

Another type of error: Since we rarely know every observation in a population, we typically will accept some margin of error rather than incur the cost of measuring every observation. However, any conclusions or recommendations that are made based on sample statistics are subject to error. This error is usually known as sampling error. We will learn more about it at a later point.

Further reading: Have a look at Section 1.1 in Devore and try to solve the exercises at the end of the section.
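As a rough illustration of sampling error, the sketch below compares a sample mean with the population mean. The population is invented for the demonstration (1000 hypothetical bulb lives), and `sampling_error` is an illustrative helper name, not notation from these notes.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population of N = 1000 bulb lives in days (invented data).
population = [random.randint(360, 440) for _ in range(1000)]
mu = mean(population)  # parameter: the population mean


def sampling_error(n):
    """Draw a sample of size n without replacement and return x_bar - mu."""
    sample = random.sample(population, n)
    return mean(sample) - mu  # statistic minus parameter


for n in (10, 100, 1000):
    print(f"n = {n:4d}  sampling error = {sampling_error(n):+.3f}")
```

When n equals N the sample is a census and the sampling error vanishes; for smaller samples the error is usually nonzero and changes from sample to sample.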

Chapter 2: Descriptive Statistics
Presentation

Descriptive Statistics

Researchers can measure many physical processes, such as pressure, strength, survival time, and amount. Often, hundreds or thousands of measurements are made, and procedures were developed to organize, summarize, and make sense of these measurements. These procedures, referred to as descriptive statistics, are specifically used to condense and summarize numerical observations to get the initial (meaningful) information and make the data ready for further manipulations.

In the univariate case, descriptive statistics mainly covers the following tasks of data analysis.
- Presentation of data using
  - Tabulation methods (frequency distributions)
  - Graphical methods (diagrams and graphs)
- Measures of central tendency (averages and quantiles)
- Measures of dispersion (ranges, deviations, variations)

In the multivariate case, descriptive statistics covers, along with the above, the analysis of the relationships (covariance, correlation, regression, etc.) between different variables as well.

Tabulation methods

Frequency distribution: The frequency ($f$) of a particular observation is the number of times that observation occurs in the data. A frequency distribution is a table that lists the observations along with their respective frequencies. A frequency distribution condenses bulky data to a small table, which tells us about the pattern and shape of the distribution of values of the underlying variable or population.

Frequency distribution with no grouping: For discrete data with a small range (or a small number of actually distinct values), the frequency table is constructed by arranging the collected data values in ascending order of magnitude with their corresponding frequencies.

Frequency distribution with grouping: In case of a very broad range of values, or if the data is continuous, the entire data is divided into different non-overlapping groups or classes, with the number of observations falling in each group or class.

A very simple example (without grouping)

Example 1. The marks awarded for an assignment set for a BE (MS&E) class of 20 students were as follows:
6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8.
Present this information in a frequency table.

Solution: To construct a frequency table, we proceed as follows. Draw a three-column table with the column headings “Marks”, “Tally”, and “Frequency”. Put all the possible distinct values, without repetition, in the first column in ascending (or descending) order as shown below.

Marks   Tally   Frequency
4
5
6
7
8
9
10

Data: 6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8.

The first data value is 6, put a tally bar against it; the second is 7, put a tally bar for it too. Go ahead and put tallies for all the values. When the number of tally bars equals 5, bundle them in a group of 4 with a slash across it (written here as ||||/). Count the bars for each data value and that’s the frequency.

Marks   Tally   Frequency
4       ||      2
5       ||      2
6       ||||    4
7       ||||/   5
8       ||||    4
9       ||      2
10      |       1

So we now have the data in a meaningful form. We can now answer the following questions:
- Where is the data concentration (peak) point?
- How is it declining?
- Is this a normal marks’ distribution, or is there something wrong with class performance?
- Do we need further investigations?
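The tallying procedure of Example 1 can also be sketched in a few lines of Python. This is only one possible way to do it, using the standard-library `Counter`; the tally rendering (`||||/` for a bundle of five) is a plain-text convention chosen for this sketch.

```python
from collections import Counter

marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7, 4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

freq = Counter(marks)  # maps each distinct mark to the number of times it occurs

print(f"{'Marks':>5}  {'Tally':<8} {'Frequency':>9}")
for mark in sorted(freq):
    f = freq[mark]
    # A bundle of five is drawn as four bars plus a slash; leftovers as bars.
    bundles = ["||||/"] * (f // 5)
    rest = ["|" * (f % 5)] if f % 5 else []
    tally = " ".join(bundles + rest)
    print(f"{mark:>5}  {tally:<8} {f:>9}")

# The frequencies must add up to the number of students.
print(f"{'Total':>5}  {'':<8} {sum(freq.values()):>9}")
```

Checking that the frequencies sum to 20 is the same sanity check one would do by hand after tallying.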

The “how to” of a frequency distribution with grouping

When there are too many values in the data and they are widely spread out, it is difficult to set up a frequency table for every data value, as there would be too many rows in the table. Before proceeding, we need to learn a few terms and rules for the construction of a frequency distribution with grouping or classes.

Class-limits: The numbers that describe a class or group. The two limits are called the lower class-limit and the upper class-limit. The class-limits (CL) should be inclusive and should not cause any overlap between adjacent classes, e.g. age in years can be classified as 10-14, 15-19, 20-24 or 10.0-14.9, 15.0-19.9, 20.0-24.9, etc.

Class-boundaries: The class-boundaries (CB) are precise numbers that separate one class from its neighbours. A CB is just the midpoint of the upper limit of one class and the lower limit of the next class, e.g. for the first two classes 10-14 and 15-19, the class boundary is $\frac{14+15}{2} = 14.5$. Thus, for 10-14, 15-19, 20-24, the CBs are 9.5-14.5, 14.5-19.5, 19.5-24.5, so CBs are one decimal place more precise than class-limits. The upper class-boundary of one class coincides with the lower class-boundary of the next class, thus leaving no gap.

Class marks: Class marks are simply the midpoints of classes. For example, the class mark of class 10-14 is $\frac{10+14}{2} = 12$.

Class interval or class width: The class interval, traditionally denoted by “h”, is the difference between the two class-boundaries of the same class, or the difference between the lower (or upper) limits of two consecutive classes. In the above case the class interval is 5. Ideally, all the classes should have equal intervals. Unequal intervals can also happen, but should be avoided unless required, because they make interpretation difficult.

Class frequency: The frequency of a particular class is the number of data values that fall within the limits of that class.
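The definitions above can be checked numerically. The following sketch derives class-boundaries, class marks and the class interval from the inclusive class-limits of the age example; the variable names are illustrative only.

```python
# Inclusive class-limits from the age example (whole years).
classes = [(10, 14), (15, 19), (20, 24)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]  # 15 - 14 = 1

rows = []
for lower, upper in classes:
    lower_cb = lower - gap / 2   # lower class-boundary
    upper_cb = upper + gap / 2   # upper class-boundary
    mark = (lower + upper) / 2   # class mark (midpoint of the class)
    h = upper_cb - lower_cb      # class interval (width)
    rows.append((lower_cb, upper_cb, mark, h))
    print(f"{lower}-{upper}: CB {lower_cb}-{upper_cb}, class mark {mark}, h = {h}")
```

Note that the width comes out as 5, not 4: subtracting the class-limits of one class (14 - 10) undercounts because the limits are inclusive.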
A typical frequency distribution with grouping looks like the following table.

Classes   Class-boundaries   Tally bars   Class-marks                 Frequency
10-14     9.5-14.5           ···          $\frac{10+14}{2} = 12$      ···
15-19     14.5-19.5          ···          17                          ···
20-24     19.5-24.5          ···          22                          ···
···       ···                ···          ···                         ···

The columns of class-boundaries and class-marks help in the calculation of different statistical quantities such as the mean, median and quantiles, as we will see in the next chapter.

A few rules
How many classes? There is no hard rule for deciding how many classes to make. Both too few and too many classes defeat the purpose of constructing the frequency distribution: too few classes lose a lot of information, and too many classes defeat the purpose of condensation. As a rule of thumb, a number between 5 and 15 gives reasonable results.
(I think 15 is still too large; I would not take a number larger than 10, unless I am using a computer.)

Find the range, that is, the difference between the maximum and the minimum values in the data.

Calculate the class width/interval “h” by dividing the range of the data by the number of classes. If the division results in a decimal number, take the next higher whole number. Avoid using fractional numbers as intervals; they bring you a headache. Taking a multiple of 5 or 10 eases the problem and also increases the readability of the table. The resulting classes should cover the whole of the data. Note: you can also choose a proper interval first and then calculate the number of classes, provided the whole data is covered in a reasonable number of classes.

Where to start the first class from? Usually the lower class-limit is put at or below the smallest data value. Remember, the lower class-limit of the first class should never be larger than the smallest value of the data, otherwise the values at the lower end of the data will be lost. Starting from a multiple of 5 or 10 would not hurt.

Find the upper class-limit by counting from the lower class-limit to the end of the interval. Note that adding the interval directly to the lower class-limit is erroneous, because the classes are inclusive. Adding the interval to the lower class-limit of a class gives you the lower class-limit of the next class, rather than the upper limit of the same class. For example, with h = 5 the class starting at 10 is 10-14, and 10 + 5 = 15 starts the next class. (Most students forget this; be careful.)


Find the rest of the classes by just adding the interval to the lower and the upper class-limits to get the lower and upper class-limits of the next class.

Now the hard part: scanning the data (mouse hunt) and putting the values in appropriate classes, placing tally marks and frequencies.

Determine the sum of frequencies to check whether all the values were included.

An example of frequency distribution with grouping

Example 2. Thirty energy saver light bulbs were tested to determine how long they usually last. The results, to the nearest day, were recorded as follows:
423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390
Construct a frequency distribution for these values.

Solution: First we need to find the range

Range = Largest - Smallest = 431 - 363 = 68

Let there be 8 classes; therefore the class interval is

$$h = \frac{\text{Range}}{\text{Number of classes}} = \frac{68}{8} = 8.5$$

We take h = 10.0 (rounding 8.5 up to a multiple of 10) because it eases the data scanning process.

Now let’s make the table and set the classes. The smallest value is 363, so we start from 360 and set the first class as 360-369, the second as 370-379, and so on. Now start scanning the data, allocate the values to their corresponding classes, and put tallies for them accordingly. When a data value is allocated to some class, cancel that value in the actual data set, indicating that it has been counted, to avoid recounting.

423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390

Classes   TBs   Frequency (f)
360-369
370-379
380-389
390-399
400-409
410-419
420-429
430-439
Total

Go on scanning, canceling and counting, and put the tallies accordingly. Fill up the rest of the columns. Sum up the frequencies to check whether all the data values are picked up.

Classes   TBs        Frequency (f)
360-369   ||         2
370-379   |||        3
380-389   ||||/      5
390-399   ||||/ ||   7
400-409   ||||/      5
410-419   ||||       4
420-429   |||        3
430-439   |          1
Total                $\sum f = n = 30$

By looking at this frequency distribution, we can quickly find that the most common bulb life is between 390 and 399 days, as this group has the largest frequency (7). Thus, this group can be regarded as a representative group of this data. We can also see how the frequencies decrease toward the tails of the distribution, and the distribution looks fairly symmetric.
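For readers using a computer, the whole of Example 2 (range, class interval, class-limits, frequencies) can be sketched as follows. The rounding of h up to a multiple of 10 mirrors the choice made in the solution; it is a convenience rule, not the only valid one.

```python
import math

bulb_life = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
             401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
             410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

data_range = max(bulb_life) - min(bulb_life)  # 431 - 363 = 68
k = 8                                         # chosen number of classes
h_raw = data_range / k                        # 68 / 8 = 8.5
h = math.ceil(h_raw / 10) * 10                # round up to a multiple of 10
start = (min(bulb_life) // h) * h             # multiple of h at or below the minimum

# Inclusive class-limits: the upper limit is lower + h - 1 for whole-day data.
classes = [(start + i * h, start + (i + 1) * h - 1) for i in range(k)]

# Class frequency: number of values falling within the limits of each class.
freq = [sum(lo <= x <= hi for x in bulb_life) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"{lo}-{hi}  {f}")
print("Total", sum(freq))  # should equal n = 30
```

The final total plays the same role as summing the frequency column by hand: if it differs from n, some value was missed or counted twice.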

Relative frequency and percentage frequency

While studying these data we may want to know not only how long the bulbs last, but also what proportion of the bulbs falls into each class of bulb life. This is called the relative frequency (RF) of a particular observation or class and is found by dividing its corresponding frequency ($f$) by the total number of observations $n$, that is:

$$RF = \frac{f}{n}$$

A clearer measure is the percentage frequency, which is found by multiplying each relative frequency value by 100. Thus:

$$PRF = RF \times 100$$

The PRF tells us what percent of observations fall in a particular class. This gives us a somewhat clearer picture than RF.

Example 3. Let’s calculate the RF and PRF for Example 2.

Classes   f                   RF = f/n        PRF
360-369   2                   2/30 = 0.07     (2/30) × 100 = 7
370-379   3                   3/30 = 0.10     (3/30) × 100 = 10
380-389   5                   0.17            17
390-399   7                   0.23            23
400-409   5                   0.17            17
410-419   4                   0.13            13
420-429   3                   0.10            10
430-439   1                   0.03            3
Total     $\sum f = n = 30$   1.0             100

Looking at this table we can now say that:
- 23% of bulbs have a life of from 390 days up to but less than 400 days.
- The chance of any randomly selected bulb having a life in this range is approximately 0.23.
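The RF and PRF columns follow directly from the two formulas. A minimal sketch, with the class frequencies of Example 2 typed in by hand:

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
freq = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(freq)  # total number of observations

rf = [f / n for f in freq]    # relative frequency, RF = f / n
prf = [100 * r for r in rf]   # percentage frequency, PRF = RF x 100

for c, f, r, p in zip(classes, freq, rf, prf):
    print(f"{c}  f = {f}  RF = {r:.2f}  PRF = {p:.0f}")

# The relative frequencies of all classes together make up the whole data set.
print("Sum of RF:", round(sum(rf), 2))
```

The values in the table are rounded to two decimals (RF) and whole percents (PRF), which is why the unrounded figures printed by a computer may show more digits.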

Cumulative frequency distribution

A cumulative frequency distribution table is the same as a frequency distribution table with additional columns that give the cumulative frequency (CF) and the cumulative percentage (CP) of the data. The CFs are obtained by adding the frequencies of different classes successively to the cumulative total of the previous frequencies, that is, by accumulating (taking the running total of) the elements of the frequency column.

The accumulation can be conducted either from the top class (or value), in which case the CF is called the “less than” type CF, or from the bottom class (or value), which is known as the “more than” type CF. In grouped data, the upper class boundaries are used for the “less than” type CF and the lower class boundaries for the “more than” type.

The cumulative frequency distribution gives us an idea of how many observations of the data fall below or above a given value. It also tells us the number of observations that lie in a given interval between two values.

Example 4. We calculate a “less than” type CF and CP for the data in Example 2.
Note: We use the upper class boundaries for a “less than” (<) type CF distribution.

Upper Class Boundaries   f        CF          CP = (CF/n) × 100
< 369.5                  2        2           (2/30) × 100 = 7
< 379.5                  3        2+3=5       (5/30) × 100 = 17
< 389.5                  5        5+5=10      33
< 399.5                  7        10+7=17     57
< 409.5                  5        17+5=22     73
< 419.5                  4        22+4=26     87
< 429.5                  3        26+3=29     97
< 439.5                  1        29+1=30     100
Total                    n = 30

Suppose we have been asked to find how many, or what percent of, observations lie below 399.5. From the table we quickly learn that there are 17 observations below the given value, which makes them 57% of the entire data.

Example 5. Now let’s calculate a “more than” type CF and CP for the data in Example 2.
Note: We use the lower class boundaries for a “more than” (>) type CF distribution.

Lower Class Boundaries   f        CF          CP = (CF/n) × 100
> 359.5                  2        28+2=30     (30/30) × 100 = 100
> 369.5                  3        25+3=28     (28/30) × 100 = 93
> 379.5                  5        20+5=25     83
> 389.5                  7        13+7=20     67
> 399.5                  5        8+5=13      43
> 409.5                  4        4+4=8       27
> 419.5                  3        1+3=4       13
> 429.5                  1        1           3
Total                    n = 30

Suppose now we are asked how many, or what percent of, observations lie above 399.5. From the table we quickly learn that there are 13 observations above the given value, which makes them 43% of the entire data.
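Both cumulative tables are running totals of the same frequency column, taken from opposite ends. A short sketch using `itertools.accumulate` from the standard library (the variable names are illustrative):

```python
from itertools import accumulate

freq = [2, 3, 5, 7, 5, 4, 3, 1]
upper_cb = [369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5, 439.5]
lower_cb = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
n = sum(freq)

# "Less than" type: running total starting from the top class.
less_than = list(accumulate(freq))
# "More than" type: running total starting from the bottom class.
more_than = list(accumulate(reversed(freq)))[::-1]

for cb, cf in zip(upper_cb, less_than):
    print(f"< {cb}  CF = {cf:2d}  CP = {100 * cf / n:.0f}")
for cb, cf in zip(lower_cb, more_than):
    print(f"> {cb}  CF = {cf:2d}  CP = {100 * cf / n:.0f}")

# Observations between two boundaries, e.g. between 379.5 and 409.5,
# are the difference of two "less than" cumulative frequencies.
between = less_than[upper_cb.index(409.5)] - less_than[upper_cb.index(379.5)]
print("Between 379.5 and 409.5:", between)
```

The last step shows the use mentioned in the text: the number of observations lying between two values comes from subtracting one cumulative frequency from another.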

Graphical Methods

We now introduce the widely used graphic displays for data presentation in engineering sciences. Most of the time we want a visual presentation of data for clearly seeing patterns in the data. Patterns in data are commonly described in terms of: center, spread, shape, and unusual features. Some common distributions have special descriptive labels, such as: symmetric, bell-shaped, skewed, etc. We often need answers to questions like
- Where are the data (center) located?
- How spread out are the data?
- Are the data symmetric or skewed?
- Are there outliers in the data?

Histogram
A histogram is a visual version of a frequency table. It consists of vertical bars, usually called ‘bins’ or ‘frequency bins’, that represent the different classes of a frequency table. The height of the bars indicates the frequency of the classes. Usually, there is no space between adjacent bars.

The main purpose of a histogram is to enhance the presentation of data. You can present the same information in a table; however, the graphic presentation format usually makes it easier to see the nature of the distribution. A histogram can typically help you answer the following questions:
- What is the most frequent observation?
- What distribution (center, variation and shape) does the data have?
- Does the distribution of data look symmetric or is it skewed towards the left or right?

Example 6. Let’s construct a histogram and a relative frequency histogram for the energy saver bulbs data given in Example 2. We have already constructed the frequency table in Example 3. Let’s now depict it.

[Figure: two panels titled “Histogram of Data”. Left panel: Frequency (0 to 7) against Data values (360 to 440). Right panel: Relative Frequency (0.00 to 0.20) against Data values (360 to 440).]

One can also construct a percentage relative frequency histogram by multiplying the relative frequencies by 100.
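When no plotting software is at hand, a rough text version of the histogram can be printed directly from the frequency table. The sketch below draws the bars horizontally (one row per class) purely for convenience; it is a stand-in for the figure, not a reproduction of it.

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
freq = [2, 3, 5, 7, 5, 4, 3, 1]
n = sum(freq)

# One bar per class; the bar length equals the class frequency.
bars = ["#" * f for f in freq]
for c, bar, f in zip(classes, bars, freq):
    print(f"{c} | {bar:<7} {f}  (RF = {f / n:.2f})")

# The class with the longest bar answers "what is the most frequent observation?"
modal_class = classes[freq.index(max(freq))]
print("Most frequent class:", modal_class)
```

Even in this crude form the rise toward 390-399 and the roughly symmetric fall on both sides are visible.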

Some of the key features that we usually look for in a histogram.

Center: Graphically, the center of a distribution is located at the median of the distribution. The median is the point in a graphic display where about half of the observations are on either side. In the chart to the right, the height of each column indicates the frequency of observations. Here, the observations are centered over 4.

Spread: The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are more clustered around a single value, the spread is smaller.

Shape: The shape of a distribution is described by the following characteristics.

Symmetry. When it is graphed, a unimodal symmetric distribution can be divided at the center so that each half is a mirror image of the other. A single-peaked symmetric distribution is referred to as a bell-shaped distribution.

Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal.

Skewness. When displayed graphically, some unimodal distributions have many more observations on one side of the graph than the other side. Distributions with most of their observations on the left (toward lower values) are said to be skewed right, and distributions with most of their observations on the right (toward higher values) are said to be skewed left.

Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peak(s).

Gaps. Gaps refer to areas of a distribution where there are no observations. The second last figure on the next slide has a gap; there are no observations in that part of the distribution.

Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers.

Different shapes of histogram.

[Figure: six histogram panels, each showing $f(x_i)$ against $x_i$: a normal distribution, a skewed distribution, a uniform distribution, a bi-modal distribution, a distribution with outliers, and a clip-like distribution.]

Cumulative Histogram

Like the histogram and frequency table pairing, the cumulative histogram is a visual version of the cumulative frequency table. It tells how many, or what percentage, of the total number of observations have accumulated up to each bin (or interval). It makes finding the percentage or proportion of observations falling within a given interval much easier. An ordinary and a cumulative histogram of the same data are given in the following figures.

[Figure: left panel “Histogram of Data”, Frequency against Data values (360 to 440), bar heights 2, 3, 5, 7, 5, 4, 3, 1. Right panel “Cumulative Histogram of Data”, Cumulative Frequency against Data values (360 to 440), bar heights 2, 5, 10, 17, 22, 26, 29, 30.]

The cumulative histogram is the actual concept that most probability distributions use to calculate probabilities associated with different events. So learning about it, and understanding it, is a must.

Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values, especially discrete values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.

Example 7. The study included 33 students whose first-grade IQ scores are given here:

[IQ scores not reproduced]

The following figure shows a dotplot for the above data.

[Figure: dotplot of the first-grade IQ scores]

A representative IQ value is around 110, and the data is fairly symmetric about the center.

and leaf=7.g.About These Notes Introduction Descriptive Statistics Presentation Stem and Leaf Displays A stem-and-leaf plot (aka stemplot) of a quantitative variable is a textual graph that classifies data items according to their most significant numeric digits.) 36 / 81 . the plot of 72 72 75 76 would look like 7 | 2 2 5 6. It is generally used for small data sets (50 or fewer observations). Each stem is listed only once and no numbers are skipped. The stemplot is drawn with two columns separated by a vertical line with stems listed to the left of the vertical line. When there is a repeated number in the data (such as two 72s) then the plot must reflect such (e. even if it has no leaves. The leaves are listed in increasing order in a row to the right of each stem. it shows the actual values within the interval. since it shows how many values in a set fall under a certain interval. For example the observation 327 can be split as stem=3. It has even more information. and leaf=27 or stem=32. A stem and leaf display is similar to a histogram. A stem is the leading digit of an observation whereas the remaining digits are leaves.

Example 8. The stem-and-leaf plot of the energy saver bulb data is constructed as below.

Stem | Leaves
  36 | 39
  37 | 127
  38 | 12679
  39 | 0123469
  40 | 01589
  41 | 0159
  42 | 238
  43 | 1
(Key: 40|8 = 408)

In this example we could also use a single-digit stem, but then there would have been only two stems, resulting in a much less informative plot.

In the case of values with decimal points (continuous data), the decimal part of each number is taken as the leaf. Rounding may be used to suppress a certain number of decimal places so that all data values have the same number of decimal places.

Further reading and exercises: Have a look at the introduction and Section 1.2 of Devore's book and the examples therein. Then solve questions 10, 11, 12, 13, 14, 15, 16, 17, 20, 24, 25 and 29 in Exercise 1.2.
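The construction rules for a stemplot can be sketched in a few lines. This is only an illustration: the function name `stem_and_leaf` is hypothetical, it assumes non-negative integers with the last digit as the leaf, and the list `bulbs` is the data as read back from the display in Example 8.

```python
def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its sorted list of leaves.

    Assumes non-negative integers; the last digit is the leaf.
    """
    plot = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot.setdefault(stem, []).append(leaf)
    # Every stem between the smallest and largest is listed, even with no leaves.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}

# The 30 bulb values as read back from the display (key: 40|8 = 408).
bulbs = [363, 369, 371, 372, 377, 381, 382, 386, 387, 389,
         390, 391, 392, 393, 394, 396, 399, 400, 401, 405,
         408, 409, 410, 411, 415, 419, 422, 423, 428, 431]

for stem, leaves in stem_and_leaf(bulbs).items():
    print(f"{stem} | {''.join(str(d) for d in leaves)}")
```

Counting the leaves in each row gives back the class frequencies of the histogram, which is the sense in which a stemplot carries more information than a histogram.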

Chapter 3: Descriptive Statistics
Measures of Central Tendency

Instead of describing or presenting a whole data set, sometimes we use one value to represent it. In this way, it is easier to compare data of the same type. A data set can be summarized in a single value, usually somewhere in the center of the data. It's the value at which the data have a tendency to concentrate. These measures are also called averages, or measures of location or position. The most commonly used measures of central tendency are the mean, median and mode.

Mean (aka Arithmetic Mean): It's perhaps the most common measure of central tendency; the mean is typically what is meant by the word average. It is the point at which the distribution is in balance. The mean of a data set is given by

$$\text{Mean} = \frac{\text{the sum of all data values}}{\text{the number of values}}$$

For example, the mean of 4, 8, and 9 is

$$\frac{4 + 8 + 9}{3} = \frac{21}{3} = 7.$$

The sample mean is written as $\bar{x}$, and the population mean as the Greek letter mu ($\mu$). Despite its popularity, the mean may not be an appropriate measure of central tendency in skewed distributions, or in data with outliers.

Definition. The sample mean $\bar{x}$ of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is defined as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad i = 1, \ldots, n.$$

You can ignore the limits ($i = 1, \ldots, n$) of the summation symbol and simply write it as $\sum x_i$.

For grouped data, arranged in a frequency table with $k$ classes with midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the mean is given by

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}, \qquad i = 1, \ldots, k.$$

Note that $\sum_{i=1}^{k} f_i = n$.

Example 9. In a mathematics test, the marks of six students are 36, 52, 57, 58, 67 and 90. Find the mean of their marks.

Solution: The mean is given by

$$\bar{x} = \frac{\sum x_i}{n} = \frac{36 + 52 + 57 + 58 + 67 + 90}{6} = \frac{360}{6} = 60 \text{ marks}$$

The arithmetic mean is quite sensitive to any change in a single value. Its value can be greatly affected by the presence of a single outlier (an extreme observation), which makes it an inappropriate measure under certain circumstances. It gives good results when the observations are reasonably similar. For example, if one data value in the above example, say 52, was mistakenly recorded as 352, the resulting mean would be 110, which is quite different from the previous one and could lead to a different decision about the data.
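The sensitivity of the mean to a single bad value is easy to check numerically. The sketch below uses the standard library `statistics.mean` on the six marks of Example 9; the variable names are only illustrative.

```python
from statistics import mean

marks = [36, 52, 57, 58, 67, 90]
print(mean(marks))

# Same data with 52 mistakenly recorded as 352.
recorded = [352 if m == 52 else m for m in marks]
print(mean(recorded))
```

One wrong entry out of six moves the mean from 60 to 110, far above five of the six actual marks.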

Example 10. Find the mean for the data in Example 2.

Solution: Let's put the necessary columns in the frequency table obtained in Example 2.

Classes    Midpoints (x)    f     fx
360-369    364.5            2     729.0
370-379    374.5            3     1123.5
380-389    384.5            5     1922.5
390-399    394.5            7     2761.5
400-409    404.5            5     2022.5
410-419    414.5            4     1658.0
420-429    424.5            3     1273.5
430-439    434.5            1     434.5
Total                       30    11925.0

Thus the mean is

$$\bar{x} = \frac{\sum_{i=1}^{8} f_i x_i}{\sum_{i=1}^{8} f_i} = \frac{11925}{30} = 397.5$$
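The grouped mean is a frequency-weighted average of the class midpoints. A minimal sketch with the midpoints and frequencies of Example 10 (the function name `grouped_mean` is hypothetical):

```python
midpoints = [364.5, 374.5, 384.5, 394.5, 404.5, 414.5, 424.5, 434.5]
freq = [2, 3, 5, 7, 5, 4, 3, 1]

def grouped_mean(x, f):
    """Sum of f*x divided by the sum of f."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)

print(grouped_mean(midpoints, freq))
```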

Median: The median is the middle value of a set of data when they are arranged in ascending or descending order. It's a value above and below which 50% of the ordered data lie. The median is insensitive to outliers.

Definition. The sample median $\tilde{x}$ ($\sim$ is pronounced as "tilde" (till-day)) of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is obtained as follows. Arrange the $n$ observations in ascending (or descending) order.

(1) If $n$ is an odd number, the median is the $\left(\frac{n+1}{2}\right)$th observation.
(2) If $n$ is an even number, the median is the mean of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2} + 1\right)$th (i.e. the two middle) observations.

For grouped data the median is calculated by the following formula

$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right)$$

where $l$ is the lower class boundary of the median class, $h$ is the class interval, $f$ is the frequency of the median class, and $C$ is the cumulative frequency of the preceding class. The median class is the class containing the $\left(\frac{n}{2}\right)$th observation. In the grouped case there is no need to worry about $n$ being odd or even.

Example 11. Consider the following 12 data points (shown in ascending order in the solution).
(a) Find the median for the full data.
(b) Omit the largest (or the smallest) observation and find the median again.

Solution: Rearranging the values in ascending order, we have

7.6  8.3  9.3  9.4  9.4  [9.7  10.4]  11.5  11.9  15.2  16.2  20.4

(a) Here $n = 12$ is even and hence the median is the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations. The two middle values are indicated by brackets in the ordered data, so

$$\tilde{x} = \frac{9.7 + 10.4}{2} = 10.05$$

(b) Let's omit the last observation, 20.4, and calculate the median again.

7.6  8.3  9.3  9.4  9.4  9.7  10.4  11.5  11.9  15.2  16.2

Now since $n = 11$ is odd, the median is just the middle observation, that is the $\left(\frac{n+1}{2}\right)$th observation, which is the 6th here. Thus the median is $\tilde{x} = 9.7$.

Now let's replace 20.4 by 50 in the full data. For $n = 12$, the mean was $\bar{x} = \frac{139.3}{12} = 11.61$; we see that now the mean is $\bar{x} = \frac{168.9}{12} = 14.07$, but the median remains unchanged at $\tilde{x} = 10.05$. Thus the median is insensitive to outliers.
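The odd/even rule for the median, and its resistance to an outlier, can be checked with a short sketch. The function name `sample_median` is hypothetical, and the data are the 12 ordered values of Example 11.

```python
def sample_median(values):
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # the ((n+1)/2)th observation
    return (xs[mid - 1] + xs[mid]) / 2    # mean of the (n/2)th and (n/2 + 1)th

data = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]

print(sample_median(data))          # full data, n = 12
print(sample_median(data[:-1]))     # largest value omitted, n = 11

# Replace the largest value by 50 and compare mean and median.
with_outlier = data[:-1] + [50]
for d in (data, with_outlier):
    print(round(sum(d) / len(d), 2), sample_median(d))
```

The mean changes when 20.4 becomes 50, while the median does not, because only the two middle positions enter its calculation.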

Example 12. Find the median for the grouped data in Example 2.

Solution: Let's put the necessary columns in the frequency table obtained in Example 2.

Classes    CBs            f    cf
360-369    359.5-369.5    2    2
370-379    369.5-379.5    3    5
380-389    379.5-389.5    5    10   ← C
390-399    389.5-399.5    7    17   ← n/2 = 30/2 = 15
400-409    399.5-409.5    5    22
410-419    409.5-419.5    4    26
420-429    419.5-429.5    3    29
430-439    429.5-439.5    1    30

First we find the median group with $\frac{n}{2} = \frac{30}{2} = 15$. We see that 15 falls below the 17 of the cf column, so 389.5-399.5 is our target or median group where the median lies. Now we have $l = 389.5$, $h = 10$, $f = 7$, $C = 10$. Thus by putting the values in the formula we have

$$\tilde{x} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) = 389.5 + \frac{10}{7}(15 - 10) = 396.64$$
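The two steps of Example 12 (locate the median class, then interpolate inside it) can be sketched as follows. The function name `grouped_median` and its argument layout are assumptions made for illustration, with equal class widths assumed.

```python
def grouped_median(lower_bounds, freq, h):
    """Median of grouped data: l + (h / f) * (n / 2 - C)."""
    n = sum(freq)
    cum = 0                                # C: cumulative frequency so far
    for l, f in zip(lower_bounds, freq):
        if f > 0 and cum + f >= n / 2:     # first class whose cf reaches n/2
            return l + (h / f) * (n / 2 - cum)
        cum += f
    raise ValueError("empty frequency table")

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freq = [2, 3, 5, 7, 5, 4, 3, 1]
print(round(grouped_median(lower_bounds, freq, h=10), 2))
```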

Mode: The mode is simply the most frequent observation in a data set. A data set can have more than one mode; if it has two, it is said to be bimodal. The mode is also insensitive to outliers. When you have categorical data, or data that appears as words instead of numbers, you need to use the mode. For example, if a sandwich shop sells 10 different types of sandwiches, the mode would represent the most popular sandwich.

Example 13. The mode of {1, 1, 2, 3, 5, 8} is 1. The modes of {1, 3, 5, 7, 9, 9, 21, 25, 25, 31} are 9 and 25. Thus, the data is bimodal.

Definition. For grouped data, the mode is calculated by the following formula,

$$\text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

where $l$ is the lower class boundary of the modal class (the class with the largest frequency), $f_m$ the frequency of the modal class (the largest frequency), $f_1$ the frequency of the class preceding the modal class, $f_2$ the frequency of the class succeeding the modal class, and $h$ is the class interval.

Example 14. Find the mode for the grouped data in Example 2.

Classes    CBs            f
360-369    359.5-369.5    2
370-379    369.5-379.5    3
380-389    379.5-389.5    5   ← f1
390-399    389.5-399.5    7   ← fm
400-409    399.5-409.5    5   ← f2
410-419    409.5-419.5    4
420-429    419.5-429.5    3
430-439    429.5-439.5    1

Solution: We know that the class with the highest frequency is 389.5-399.5, so it's our modal group. From the above table we have $l = 389.5$, $h = 10$, $f_1 = 5$, $f_m = 7$ and $f_2 = 5$. So the mode is

$$\text{Mode} = l + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h = 389.5 + \frac{7 - 5}{2 \times 7 - 5 - 5} \times 10 = 394.5$$
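A small sketch of the grouped-mode formula, applied to the table of Example 14. The function name `grouped_mode` is hypothetical. Treating a missing neighbouring class as having frequency 0 is an assumption of this sketch, not something stated in the notes, and the formula is undefined when the modal class and both neighbours have equal frequencies.

```python
def grouped_mode(lower_bounds, freq, h):
    """Mode of grouped data: l + (fm - f1) / (2*fm - f1 - f2) * h."""
    m = max(range(len(freq)), key=freq.__getitem__)   # index of the modal class
    fm = freq[m]
    f1 = freq[m - 1] if m > 0 else 0                  # preceding class (assumed 0 if none)
    f2 = freq[m + 1] if m < len(freq) - 1 else 0      # succeeding class (assumed 0 if none)
    return lower_bounds[m] + (fm - f1) / (2 * fm - f1 - f2) * h

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freq = [2, 3, 5, 7, 5, 4, 3, 1]
print(grouped_mode(lower_bounds, freq, h=10))
```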

Quantiles: Quantiles are the kin of the median. The median divides an ordered data set into two equal parts, while quantiles divide it into more than two parts, which helps in indicating the extent to which the data lie near the median or near the extremes. Some of the commonly used quantiles are defined here.

Quartiles: Quartiles divide an ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles, and they are denoted by $Q_1$, $Q_2$, and $Q_3$, respectively. $Q_1$ is the "middle" value in the first half of the rank-ordered data set, that is, the value below which 25% of the data lie. $Q_2$ is equal to the median value in the set. $Q_3$ is the "middle" value in the second half of the rank-ordered data set, that is, the value below which 75% of the data lie. Since $Q_2$ = Median, we usually calculate only the first and the third quartiles for a given data set.

Quartiles are also known as fourths (Devore's terminology), as they represent equidistant points around the median. $Q_1$ is called the lower quartile or lower fourth and $Q_3$ is known as the upper quartile or upper fourth. Quartiles are calculated in the same manner as the median, except for the multiplication by extra factors of 1, 2, 3 for the first, second and third quartiles respectively.

Definition. For a sample of size $n$, the $j$th ($j = 1, 3$) quartile is defined as follows. First arrange the $n$ observations in ascending (or descending) order. Then

$$Q_j = j \times \left(\frac{n+1}{4}\right)\text{th observation}, \qquad j = 1, 3,$$

that is, the observation at position $j(n+1)/4$ in the ordered data. For $j = 1$,

$$Q_1 = 1 \times \left(\frac{n+1}{4}\right)\text{th observation},$$

and for $j = 3$,

$$Q_3 = 3 \times \left(\frac{n+1}{4}\right)\text{th observation}.$$

If the result contains a fraction (which happens whenever $n + 1$ is not a multiple of 4, e.g. when $n$ is even), then the value is the mean of the values at the index above and below.

14 We have.5. (n + 1) (11 + 1) 12 th = th = th = 3rd 4 4 4 Q1 = 45 Q1 = and 3(n + 1) 3(11 + 1) 36 th = th = th = 9th 4 4 4 Q3 = 87 Q3 = observation observation 35 45 55 55 56 56 65 87 89 92 Now. 65 55 89 56 35 14 56 55 87 45 92 Solution : First arrange the values in ascending order.About These Notes Introduction Descriptive Statistics Measures of Central Tendency Example 15. if n was 10. then the index of the 1st quartile is 2. Find the two quartiles for the following ungrouped data. 51 / 81 . The quartile is the average of the 2nd and 3rd value in the list.

There are several other ways of finding quartiles, but this one is simple. Different methods may give different results. For example, one method, which can be found in many texts, divides the ordered data into two halves. If $n$ is odd then the left-over middle value of the entire data is included in both halves. The median of the first half is $Q_1$ and that of the second half is $Q_3$. Now let's find the quartiles for the data in the previous example by this method. The ordered data is given below.

14  35  45  55  55  56  56  65  87  89  92

Since $n$ is odd, there is a single middle value: 56. We divide the above data into two halves and include 56 in both sets.

Lower half: 14  35  45  55  55  56, so $Q_1 = \frac{45 + 55}{2} = 50$
Upper half: 56  56  65  87  89  92, so $Q_3 = \frac{65 + 87}{2} = 76$

Both halves have an even number of data points. Using the method of finding the median for even $n$, we get $Q_1 = 50$ and $Q_3 = 76$.

Deciles: Deciles divide a rank-ordered data set into ten equal parts. These are denoted by $D_1, D_2, \ldots, D_{10}$. These are defined in the same way as quartiles, except that now the divisor is 10 instead of 4 and $j$ runs from 1 to 10.

Percentiles: Percentiles divide a rank-ordered data set into a hundred equal parts. These are denoted by $P_1, P_2, \ldots, P_{100}$. These are also defined in the same way as quartiles, except that now the divisor is 100 instead of 4 and $j$ runs from 1 to 100.
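To see how the two quartile methods can disagree on the same data, here is a hedged sketch of both. The function names are hypothetical. The first follows the $j(n+1)/4$ position rule (averaging the neighbouring values when the position is fractional) and does not handle samples so small that the position falls outside 1..n; the second is the "two halves" method just described.

```python
import math
from statistics import median

def quartile_position_method(values, j):
    """Q_j as the observation at 1-based position j*(n+1)/4."""
    xs = sorted(values)
    pos = j * (len(xs) + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)
    return (xs[lo - 1] + xs[hi - 1]) / 2   # equals xs[pos - 1] when pos is whole

def quartiles_halves_method(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2       # for odd n the middle value goes into both halves
    return median(xs[:half]), median(xs[n - half:])

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(quartile_position_method(data, 1), quartile_position_method(data, 3))
print(quartiles_halves_method(data))
```

Neither result is wrong; they are different conventions, which is why software packages can report slightly different quartiles for the same data.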

In reality, the quartiles and deciles are also percentiles. We can relate them as follows.

$P_{25}$ is equal to $Q_1$,
$P_{50}$ is the median value in the set,
$P_{75}$ is equal to $Q_3$,
$P_{60}$ is equal to $D_6$, and so on.

Percentiles are useful for giving the relative standing of an individual observation in a population; they are essentially the rank position of an individual observation.

For grouped data, we calculate the quartiles, deciles and percentiles using the same formula as that for the median, with a slight modification. For grouped data (frequencies) one has to use the cumulative frequency, as was used in the calculation of the median. Thus,

Quartiles:
$$Q_j = l + \frac{h}{f}\left(\frac{jn}{4} - C\right), \qquad j = 1, 2, 3$$

Deciles:
$$D_j = l + \frac{h}{f}\left(\frac{jn}{10} - C\right), \qquad j = 1, 2, \ldots, 10$$

and Percentiles:
$$P_j = l + \frac{h}{f}\left(\frac{jn}{100} - C\right), \qquad j = 1, 2, \ldots, 100$$

Here $l$, $f$ and $C$ refer to the class containing the $\left(\frac{jn}{4}\right)$th, $\left(\frac{jn}{10}\right)$th or $\left(\frac{jn}{100}\right)$th observation, respectively.

Solve an exercise using these formulas.
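As one possible worked exercise, the three formulas can be folded into a single function with the divisor as a parameter. The function name `grouped_quantile` and the parameter `parts` are hypothetical; the class boundaries and frequencies are those of the grouped data from Example 2 used in the earlier median example, with equal class widths assumed.

```python
def grouped_quantile(lower_bounds, freq, h, j, parts):
    """j-th quantile of grouped data: l + (h / f) * (j*n/parts - C).

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    n = sum(freq)
    target = j * n / parts
    cum = 0                                  # C: cumulative frequency so far
    for l, f in zip(lower_bounds, freq):
        if f > 0 and cum + f >= target:      # class containing the target position
            return l + (h / f) * (target - cum)
        cum += f
    raise ValueError("target position lies beyond the table")

lower_bounds = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
freq = [2, 3, 5, 7, 5, 4, 3, 1]

q1 = grouped_quantile(lower_bounds, freq, 10, 1, 4)
q3 = grouped_quantile(lower_bounds, freq, 10, 3, 4)
d6 = grouped_quantile(lower_bounds, freq, 10, 6, 10)
p60 = grouped_quantile(lower_bounds, freq, 10, 60, 100)
print(q1, q3, d6, p60)
```

With `j=2, parts=4` (or `j=50, parts=100`) the same function reproduces the grouped median.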

Further reading (J. L. Devore's book, the soft copy): Have a look at Section 1.3 in Chapter 1. Exercises Section 1.3, Questions: 30-37.

Note: Ignore the trimmed mean and trimmed median etc. for now. However, the concept is quite easy and, if you would like, I can explain them to you.

Chapter 4: Descriptive Statistics
Measures of Dispersion

Imagine you are comparing two different data sets (for now, measured in the same units, e.g. kg, km etc). By chance, it happens that the two data sets have the same means, medians or modes. Does it mean that the two data sets are the same or that they have the same features? No. Here we need some extra insight into the data: as a first step, we need to measure their respective dispersions or variabilities about the center and then compare them. Some of the most commonly used measures of dispersion are

Range, Mid-range
Inter-quartile Range (also called the fourth-spread), Semi-inter-quartile Range
Mean Deviation
Variance and Standard Deviation

Range is quite a simple measure: as you know, it is just the difference of the two extreme values in the data. The mid-range is just the average of the two extreme values, i.e.

$$\text{mid-range} = \frac{\text{max-value} + \text{min-value}}{2}$$

So we will start from the inter-quartile and semi-inter-quartile range.

Inter-quartile range (aka fourth-spread): The interquartile range, denoted IQR, is a measure of spread from the lower quartile to the upper quartile.

$$\text{IQR} = Q_3 - Q_1$$

From now on we will denote IQR by $f_s$ for simplicity.

Semi-inter-quartile range (SIQR) is just half of the IQR.

$$\text{SIQR} = \frac{Q_3 - Q_1}{2}$$

The pure measure (free of units of measurement) is the coefficient of quartile deviation (CQD), defined as

$$\text{CQD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Because it is free of measurement units, it can be used to compare two or more data sets with different units of measurement.
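A quick numerical illustration of why CQD is unit-free while IQR and SIQR are not. The quartiles 45 and 87 are those found for the 11 observations in Example 15; the function name and the factor of 1000 (a stand-in for a change of units, such as kg to g) are assumptions of this sketch.

```python
def quartile_measures(q1, q3):
    """IQR, SIQR and CQD from the two quartiles."""
    iqr = q3 - q1
    return {"IQR": iqr, "SIQR": iqr / 2, "CQD": iqr / (q3 + q1)}

original = quartile_measures(45, 87)
# The same quartiles expressed in a unit 1000 times smaller.
rescaled = quartile_measures(45_000, 87_000)

print(original)
print(rescaled)
```

IQR and SIQR scale with the unit, whereas CQD stays the same, which is what makes it usable across data sets measured in different units.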

Mean Deviation: Mean (or median) deviation (MD), or mean absolute deviation (MAD), is also a measure of dispersion, defined as the average of the absolute differences/deviations between the data values and the data center (usually, the mean or median). Mathematically, using the mean as the data center,

$$\text{MD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}, \qquad \text{where } \bar{x} = \frac{\sum x}{n}.$$

Similarly, for the median the MedD is defined as

$$\text{MedD} = \frac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}.$$

For grouped data, arranged in a frequency table with $k$ classes having midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the MD and MedD are given by

$$\text{MD} = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n} \qquad \text{and} \qquad \text{MedD} = \frac{\sum_{i=1}^{k} f_i |x_i - \tilde{x}|}{n}, \qquad \text{where } \bar{x} = \frac{\sum f x}{n}.$$

Example 16. Find the MD and MedD for the following simple data.

65  55  89  56  35  14  56  55  87  45  92

Solution: Let's denote the data by $X$. What we need first are the mean and median. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{65 + 55 + \cdots + 92}{11} = \frac{649}{11} = 59$$

Since $n$ is odd, the median is just the middle observation of the ordered data,

14  35  45  55  55  56  56  65  87  89  92

hence the median is 56. Now let's proceed to find MD and MedD.

Let's arrange the data in a table and calculate the required quantities for the above formulas, i.e. $|x_i - \bar{x}|$ and $|x_i - \tilde{x}|$.

x_i     x_i − x̄        |x_i − x̄|    x_i − x̃        |x_i − x̃|
65      65 − 59 = 6     6             65 − 56 = 9     9
55      55 − 59 = −4    4             55 − 56 = −1    1
89      30              30            33              33
56      −3              3             0               0
35      −24             24            −21             21
14      −45             45            −42             42
56      −3              3             0               0
55      −4              4             −1              1
87      28              28            31              31
45      −14             14            −11             11
92      33              33            36              36
Total                   194                           185

Hence,

$$\text{MD} = \frac{\sum |x_i - \bar{x}|}{n} = \frac{194}{11} = 17.6 \qquad \text{and} \qquad \text{MedD} = \frac{\sum |x_i - \tilde{x}|}{n} = \frac{185}{11} = 16.8$$
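The table calculation can be reproduced with a short sketch using the standard library `statistics` module. The helper name `mean_abs_deviation` is hypothetical, and the data are those of Example 16.

```python
from statistics import mean, median

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

def mean_abs_deviation(values, center):
    """Average absolute deviation of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)

md = mean_abs_deviation(data, mean(data))       # center = mean
medd = mean_abs_deviation(data, median(data))   # center = median
print(round(md, 1), round(medd, 1))
```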

Example 17. Find the MD and MedD for the following grouped data.

x :  14   35   45   55   56   65   87   89   92
f :   4    7   11   13   18   13    8    6    3

Solution: Again, first we need the mean and the median to calculate the necessary columns. The mean is

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{4870}{83} = 58.7$$

and the median is

$$\tilde{x} = \text{the } \left(\frac{n}{2}\right)\text{th observation} = 41.5\text{th observation}$$

From the cumulative frequency (cf) column of the table on the following slide, we find that the median is $\tilde{x} = 56$.

x      f     x_i − x̄    f_i|x_i − x̄|    cf    x_i − x̃    f_i|x_i − x̃|
14     4     −44.7       178.7            4     −42         168
35     7     −23.7       165.7            11    −21         147
45     11    −13.7       150.4            22    −11         121
55     13    −3.7        47.8             35    −1          13
56     18    −2.7        48.1             53    0           0
65     13    6.3         82.2             66    9           117
87     8     28.3        226.6            74    31          248
89     6     30.3        182.0            80    33          198
92     3     33.3        100.0            83    36          108
Total  83                ≈ 1182                             1120

(The entries of the $f_i|x_i - \bar{x}|$ column are computed with the unrounded mean 4870/83.)

We have all the stuff for MD and MedD.

$$\text{MD} = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} = \frac{1182}{83} = 14.2$$

$$\text{MedD} = \frac{\sum f_i |x_i - \tilde{x}|}{\sum f_i} = \frac{1120}{83} = 13.5$$

That's it!
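The grouped calculation of Example 17 can be checked with the sketch below. The helper names are hypothetical. The median is taken as the smallest value whose cumulative frequency reaches n/2, which is how the cf column is read in the example.

```python
x = [14, 35, 45, 55, 56, 65, 87, 89, 92]
f = [4, 7, 11, 13, 18, 13, 8, 6, 3]
n = sum(f)

mean = sum(fi * xi for xi, fi in zip(x, f)) / n

def grouped_median_value(x, f):
    """Smallest x whose cumulative frequency reaches n/2 (x sorted ascending)."""
    half, cum = sum(f) / 2, 0
    for xi, fi in zip(x, f):
        cum += fi
        if cum >= half:
            return xi

def grouped_abs_deviation(x, f, center):
    """Frequency-weighted average absolute deviation from a chosen center."""
    return sum(fi * abs(xi - center) for xi, fi in zip(x, f)) / sum(f)

median = grouped_median_value(x, f)
md = grouped_abs_deviation(x, f, mean)
medd = grouped_abs_deviation(x, f, median)
print(round(mean, 1), median, round(md, 1), round(medd, 1))
```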

Variance and Standard Deviation: Variance is defined as the mean of the squared deviations of all the observations from the mean. Population variance is denoted by $\sigma^2$ and the sample variance is denoted by $S^2$ or $\hat{\sigma}^2$. Mathematically, for simple data, these are defined as

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}, \quad \text{for population data,}$$

$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n}, \quad \text{for large sample data } (n > 30),$$

or

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}, \quad \text{for small sample data } (n \le 30).$$

The standard deviation is just the positive square root of the variance:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}, \quad \text{for population data,}$$

$$S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}, \quad \text{for large sample data } (n > 30),$$

or

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \quad \text{for small sample data } (n \le 30).$$
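The difference between the divisor $n$ and the divisor $n - 1$ can be seen directly with Python's standard `statistics` module, where `pvariance`/`pstdev` divide by $n$ and `variance`/`stdev` divide by $n - 1$. The data used here are the 11 values from Example 16, chosen only for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

# Divisor n: matches the formulas for sigma^2 and S^2.
print(pvariance(data), pstdev(data))

# Divisor n - 1: matches the formula for s^2.
print(variance(data), stdev(data))
```

The two versions differ noticeably for a sample this small and move closer together as $n$ grows.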

Standard deviation (SD) is a widely used measure of variability or diversity, used in statistics and probability theory. It shows how much variation or "dispersion" exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data points are spread out over a large range of values.

Do you remember the formula $D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$? Do you notice the similarity between SD and this formula? SD does almost the same function as $D$, except that it averages the squared deviations and that the coordinates of the second (here mean) point are the same for all pairs, i.e. for two data points

$$\text{SD} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2}}$$

Generalizing to $n$ observations $x_1, x_2, \ldots, x_n$, the SD is

$$\text{SD} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

SD is commonly used to measure confidence in statistical conclusions.

For grouped data arranged in a frequency table with $k$ classes with midpoints $x_1, x_2, \ldots, x_k$ and frequencies $f_1, f_2, \ldots, f_k$, the variance and standard deviation are given by

$$S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}$$

and

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}, \qquad \text{where } \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}.$$

In the next slides we will solve an example problem using the data from Example 17.

Example 18. Find the SD and variance of the data in Example 17.

Solution: Looking at the nature of the data (i.e. observations with frequencies), we need to use the formula for SD on the previous page. That is,

$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}}$$

For this we need the mean, $\bar{x}$, which is 58.7 (from Example 17), so let's just calculate the required quantity for the above formula, i.e. $\sum_{i=1}^{k} f_i (x_i - \bar{x})^2$. So we need to construct the following table.

x      f     x_i − x̄    f_i(x_i − x̄)²
14     4     −44.7       7983
35     7     −23.7       3923
45     11    −13.7       2057
55     13    −3.7        176
56     18    −2.7        129
65     13    6.3         520
87     8     28.3        6419
89     6     30.3        5518
92     3     33.3        3332
Total  83                30056

We now have the required stuff for the formula; let's put the values in it.

$$S = \sqrt{\frac{30056}{83}} = 19.0$$

Thus the standard deviation (SD) of the said data is 19.0. The variance is simply the square of $S$, i.e. $S^2 = \frac{30056}{83} \approx 362.1$.
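The numbers in Example 18 can be verified with a short sketch (variable names are illustrative). It uses the unrounded mean, which is why the variance comes out as about 362.1 rather than the square of the rounded SD.

```python
from math import sqrt

x = [14, 35, 45, 55, 56, 65, 87, 89, 92]
f = [4, 7, 11, 13, 18, 13, 8, 6, 3]
n = sum(f)

mean = sum(fi * xi for xi, fi in zip(x, f)) / n

# Sum of f * (x - mean)^2, the last column of the table.
ss = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))

variance = ss / n
sd = sqrt(variance)
print(round(ss), round(variance, 1), round(sd, 1))
```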

In practice, variance and SD are calculated by using the computationally friendly formulas given below. For a sample of size $n$ with values $x_i$, $i = 1, 2, \ldots, n$,

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2$$

Similarly for grouped data, distributed in $k$ groups, with midpoints $x_i$ and frequencies $f_i$ ($i = 1, 2, \ldots, k$) we use

$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{k} f_i x_i^2}{n} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{n}\right)^2$$

The benefit of these formulas is that here one does not need to calculate the column of differences, i.e. $x_i - \bar{x}$. By taking the positive square root of $S^2$, we get the SD.
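That the definition and the shortcut give the same number can be confirmed on a small data set. The function names are hypothetical and the data are the 11 values of Example 16.

```python
from math import isclose

def variance_definition(xs):
    """Mean of squared deviations from the mean (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def variance_shortcut(xs):
    """Mean of squares minus square of the mean (divisor n)."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(variance_definition(data), variance_shortcut(data))
print(isclose(variance_definition(data), variance_shortcut(data)))
```

The shortcut is convenient for hand calculation. In floating-point arithmetic it can lose accuracy when the mean is very large compared with the spread, so the definition form is often the safer choice in code.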

Box-plots

Stem-and-leaf displays and histograms convey rather general impressions about a data set, whereas a single summary such as the mean or standard deviation focuses on just one aspect of the data. A pictorial summary called a box-and-whisker plot, or simply box-plot, has been used successfully to describe several of the most prominent features of a data set. These features include (1) center, (2) spread, (3) the extent and nature of any departure from symmetry, and (4) identification of "outliers," observations that lie unusually far from the main body of the data. Because even a single outlier can drastically affect the values of $\bar{x}$ and $s$, a box-plot is based on measures that are "resistant" to the presence of a few outliers, i.e. the median and a measure of spread called the fourth spread (quartile range). A typical boxplot is given in the following figure.

[Figure: a typical boxplot, labelled with Variable Name, Whisker, Lower Quartile Q1, Median and Upper Quartile Q3; horizontal axis: Data, from −4 to 2.]

The whisker usually ends at $1.5 f_s$ from the nearest quartile.

Boxplots That Show Outliers

A boxplot can be decorated further to indicate explicitly the presence of outliers.

Definition. Any observation farther than $1.5 f_s$ from the closest fourth is an outlier. An outlier is extreme if it is more than $3 f_s$ from the nearest fourth, and it is mild otherwise.

Example 19. The relevant summary quantities for Example 1.17 (page 39 Devore) are

$\tilde{x} = 92.17$, $Q_1 = 45.64$, $Q_3 = 167.79$
$Q_3 - Q_1 = f_s = 122.15$, $1.5 f_s = 183.225$, $3 f_s = 366.45$

Subtracting $1.5 f_s$ from $Q_1$ gives a negative number, and none of the observations are negative, so there are no outliers on the lower end of the data. However,

$Q_3 + 1.5 f_s = 351.015$ and $Q_3 + 3 f_s = 534.24$

Thus the four largest observations (563.92, 690.11, 826.54, and 1529.35) are extreme outliers, and 352.09, 371.47, 444.68, and 460.86 are mild outliers. The box-plot for the above data can then be sketched as follows.
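The outlier definition translates directly into a small classifier. The function name `classify` is hypothetical; the quartiles are those of Example 19, and the test values are observations quoted in the notes for that data set.

```python
def classify(x, q1, q3):
    """Label x as a mild or extreme outlier relative to the fourths q1 and q3."""
    fs = q3 - q1                          # fourth spread
    distance = max(q1 - x, x - q3, 0)     # distance beyond the nearest fourth
    if distance > 3 * fs:
        return "extreme outlier"
    if distance > 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 45.64, 167.79
for x in (9.69, 312.45, 352.09, 460.86, 563.92, 1529.35):
    print(x, classify(x, q1, q3))
```

Observations inside the box have distance 0 and are never flagged, so only the two tails need checking.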

The whiskers in the boxplot in Figure 1.19 extend out to the smallest observation, 9.69, on the low end and to 312.45, the largest observation that is not an outlier, on the upper end. There is some positive skewness in the middle half of the data (the median line is somewhat closer to the left edge of the box than to the right edge) and a great deal of positive skewness overall. We will learn about positive/negative skewness in the next few slides.

Most importantly, boxplots can be used to compare several data sets at once; e.g. see the following figure of the monthly boxplots of the daily temperatures in some country.

[Figure: side-by-side boxplots of daily temperatures for each month, January to December; vertical axis from −10 to 5, with outliers marked as individual points.]
