
APPLIED STATISTICS

Submitted to : Dr. IMELDA E. CUATEL

GRADUATE SCHOOL UNIVERSITY OF LUZON Sunday 8:00-12:30

DESCRIBING, EXPLORING, AND COMPARING DATA


Prepared by : SAIFULDEEN SINAN

Introduction to Statistics

What is Statistics?
A set of procedures and rules for reducing large masses of data to manageable proportions and for allowing us to draw conclusions from those data.
Statistics is a branch of mathematics that deals with the effective management and analysis of data.

What can Stats do?


Allows us to draw conclusions from the data
Makes data more manageable
Allows us to do this objectively and quantitatively

Why Statistics?

To develop an appreciation for variability and how it affects products and processes.
To build an appreciation for the advantages and limitations of informed observation and experimentation.
To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.

Grouped Frequency Distributions


A frequency distribution is a table used to organize data. The left column (called classes or groups) lists numerical intervals of the variable being studied. The right column lists the frequency, or number of observations, for each class.

Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.

Construction of a Frequency Distribution


1. Find the highest and lowest values.
2. Find the range.
3. Select the number of classes desired.
4. Find the width by dividing the range by the number of classes and rounding up.
5. Select a starting point (usually the lowest value); add the width repeatedly to get the lower class limits.
6. Find the upper class limits.
7. Find the class boundaries.
8. Tally the data, find the frequencies, and find the cumulative frequencies.

Example

In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.

10 22 11 13 5

8 13 9 12 11

6 17 18 15 16

14 19 14 15 11

Answer
Step 1: Find the highest and lowest values: H = 22 and L = 5.
Step 2: Find the range: R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes desired. In this case it is 6.
Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.
Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.
Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.


Note: In the class boundaries column, the dash (-) means "to".

Class Limits    Class Boundaries    Frequency    Cumulative Frequency
05 to 07        4.5 - 7.5           2            2
08 to 10        7.5 - 10.5          3            5
11 to 13        10.5 - 13.5         6            11
14 to 16        13.5 - 16.5         5            16
17 to 19        16.5 - 19.5         3            19
20 to 22        19.5 - 22.5         1            20
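The eight construction steps can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name grouped_frequency is made up here, the data are whole numbers, the first class starts at the smallest value, and the width is simply the range divided by the number of classes, rounded up (as in the worked example).

```python
import math

def grouped_frequency(data, num_classes):
    """Grouped frequency table for whole-number data (hypothetical helper).

    Each row is (lower limit, upper limit, lower boundary,
    upper boundary, frequency, cumulative frequency).
    """
    low, high = min(data), max(data)
    # Steps 2-4: range divided by number of classes, rounded up.
    width = math.ceil((high - low) / num_classes)
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = low + k * width            # Step 5: lower class limit
        upper = lower + width - 1          # Step 6: upper class limit
        freq = sum(1 for x in data if lower <= x <= upper)  # Step 8: tally
        cumulative += freq
        # Step 7: boundaries are the limits widened by 0.5 on each side.
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows

cigarettes = [10, 22, 11, 13, 5, 8, 13, 9, 12, 11,
              6, 17, 18, 15, 16, 14, 19, 14, 15, 11]

table = grouped_frequency(cigarettes, 6)
for row in table:
    print(row)
```

For the cigarette data this reproduces the frequencies 2, 3, 6, 5, 3, 1 from the table. If the range happens to divide evenly by the number of classes, the last class would stop just short of the maximum, so the width would need to be increased by one in that case.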

Histogram
What is a histogram?
It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies." A histogram is like a bar chart, but there are some important differences:
It can only be used to show continuous data.
It can only be used to show numerical data.
The data are always grouped.

So the width of a bar represents a quantitative variable x, such as age, rather than a category. The height of each bar indicates frequency (when all classes have the same width).
How is a Real Histogram Made?
Example: consider the set below.

{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}
A graph which shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as below.
Bin = the class size (width of the rectangles) in a histogram.


SOLUTION
{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}, with a bin width of 10:

Data Range    Frequency
0-10          1
10-20         3
20-30         6
30-40         4
40-50         2

Note: Changing the size of the bin changes the appearance of the graph.
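A small Python sketch can show how the bin width changes the counts. The helper name bin_counts is hypothetical, and it assumes that a value lying exactly on a bin edge belongs to the bin that starts there (the slides do not say which convention is intended). Bins that contain no values are simply left out.

```python
def bin_counts(data, bin_width, start=0):
    """Count how many values fall in each bin [lower, lower + bin_width)."""
    counts = {}
    for x in data:
        k = (x - start) // bin_width       # index of the bin holding x
        lower = start + k * bin_width
        key = (lower, lower + bin_width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

values = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]

width_10 = bin_counts(values, 10)   # the grouping used in the solution
width_25 = bin_counts(values, 25)   # a coarser grouping of the same data
print(width_10)
print(width_25)
```

With a width of 10 the counts match the solution table (1, 3, 6, 4, 2); with a width of 25 the same data collapse into two bars of 7 and 9 values, which is why the graph looks different.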

Histogram shapes

Box plot
A box plot (also referred to as a box-and-whisker diagram) is a diagram showing a statistical distribution.
A box plot summarizes data using the median, the upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.

We need 5 numbers, called the 5-number summary:

1. minimum value
2. Q1
3. median
4. Q3
5. maximum value

Construction of BOX PLOT


MPG of 4-cylinder cars:

28 30 24 38 31 32 25 32 34 28 42 44 33 30 31 37 38 44 44 29 39 29 32 29

To make a box plot, first arrange the data in order from least to greatest:
24 25 28 28 29 29 29 30 30 31 31 32 32 32 33 34 37 38 38 39 42 44 44 44

Then find the median of the data. There are 24 values, so the median is the average of the 12th and 13th values: (32 + 32)/2 = 32.
This divides the data in half.
The lower half: 24 25 28 28 29 29 29 30 30 31 31 32
The upper half: 32 32 33 34 37 38 38 39 42 44 44 44

Find the median of the upper half of the data: 32 32 33 34 37 38 38 39 42 44 44 44. This is called the high median, upper quartile, or quartile 3. Q3 = 38.
Take the lower half of the data and find its median: 24 25 28 28 29 29 29 30 30 31 31 32. This is called the low median, lower quartile, or quartile 1. Q1 = 29.
Next, find the lowest value, 24, and the highest value, 44.
Let's organize all 5 pieces of data together so we can see them:
Lower extreme = 24
Lower quartile (Q1) = 29
Median (Q2) = 32
Upper quartile (Q3) = 38
Upper extreme (maximum) = 44
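The median-of-halves procedure can be sketched in Python as follows. The function names are made up for illustration. The slides only cover an even number of values; leaving the middle value out of both halves when the count is odd is an assumption added here, and other quartile conventions exist.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    values = sorted(data)
    n = len(values)
    lower_half = values[: n // 2]
    upper_half = values[(n + 1) // 2:]   # skips the middle value when n is odd
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])

mpg = [28, 30, 24, 38, 31, 32, 25, 32, 34, 28, 42, 44,
       33, 30, 31, 37, 38, 44, 44, 29, 39, 29, 32, 29]

summary = five_number_summary(mpg)
print(summary)
```

For the MPG data this gives 24, 29, 32, 38 and 44, the same five numbers found by hand.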

Next, make a number line that will best display the 5 pieces of data (24, 29, 32, 38, 44):
20 24 28 32 36 40 44

Place a dot above the number line to show the lower extreme and one for the upper extreme. Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.

[Figure: number line from 20 to 44 with the five values marked]

Enclose the vertical slashes in a box. Draw a line from the right side of the box to the upper extreme and one from the left side of the box to the lower extreme, forming the whiskers.

All graphs must have a title that clearly represents what the graph is showing.

[Figure: box plot titled "Miles per Gallon of 4-cylinder Cars"; horizontal axis: Miles per gallon (mpg), 20 to 44]

OGIVE
An ogive, sometimes called a cumulative line graph, is a line that connects points showing the cumulative frequency (or cumulative percentage) of observations below the upper boundary of each class in a cumulative frequency distribution.
How to Construct Ogives?
1. Make a frequency table showing class boundaries and cumulative frequencies.
2. For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.
3. Place a dot on the horizontal axis at the lower class boundary of the first class.
4. Connect the dots.
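As a rough illustration, the points of an ogive can be computed in Python from the class boundaries and frequencies. The sketch below reuses the cigarette frequency distribution from earlier in these slides; the function name ogive_points is hypothetical, and drawing the line itself is left to whatever plotting tool is at hand.

```python
def ogive_points(boundaries, frequencies):
    """Return (class boundary, cumulative frequency) points for an ogive.

    boundaries must have one more entry than frequencies.
    """
    # First dot: on the horizontal axis at the lowest class boundary.
    points = [(boundaries[0], 0)]
    running = 0
    for upper, f in zip(boundaries[1:], frequencies):
        running += f
        points.append((upper, running))   # dot over each upper boundary
    return points

boundaries = [4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5]
frequencies = [2, 3, 6, 5, 3, 1]

points = ogive_points(boundaries, frequencies)
total = points[-1][1]
# The same points expressed as cumulative percentages.
percent_points = [(b, 100 * c / total) for b, c in points]
print(points)
print(percent_points)
```

Connecting these points in order gives the ogive; the last point always sits at the total number of observations (or at 100 percent).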

Example

Draw the x and y axes, then plot the points.

Pie Chart
Pie graph: a pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
How to make a Pie Chart?
1. Organize your information.
2. Add the data together to get the total.
3. Work out the angle of each section (its share of the total times 360 degrees).
4. Use a compass to draw a circle.
5. Draw a radius.
6. Draw each section division.
7. Color each segment.

Example
A family's weekly expenditure on its house mortgage, food and fuel is as follows:

Draw a pie chart to display the information.

Solution :

We can find what percentage of the total expenditure each item equals. Percentage of weekly expenditure on:

To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
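The percentage and angle arithmetic can be sketched in Python. The expenditure table did not survive in these slides, so the amounts below are hypothetical values chosen only to illustrate the calculation; the function name pie_shares is also made up.

```python
def pie_shares(amounts):
    """Return {item: (percentage of total, angle in degrees)}."""
    total = sum(amounts.values())
    return {item: (100 * value / total, 360 * value / total)
            for item, value in amounts.items()}

# Hypothetical weekly amounts, NOT the figures from the original example.
weekly = {"mortgage": 300, "food": 225, "fuel": 75}

shares = pie_shares(weekly)
for item, (percent, angle) in shares.items():
    print(item, percent, angle)
```

The percentages always add up to 100 and the angles to 360, which is a quick check before drawing the sections.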

Measures of Central Tendency (Averages)


A measure of central tendency is a univariate statistic that indicates, in one manner or another, the average or typical observed value of a variable in a data set.
Central tendency = values that summarize or represent the majority of scores in a distribution.
Three main measures of central tendency:
Mean
Median
Mode

Averages

Mode
The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, the value at the point of greatest density).
Note: the mode is the value that is observed most frequently, not the frequency itself.
The mode is defined for every type of variable [i.e., nominal, ordinal, interval, or ratio].

[Figure: bar chart of Frequency (0 to 40) against DV]

Mode = most frequently occurring data point

Mode = 4 (the data point with the highest frequency, 15)

Data Point    Frequency
0             2
1             5
2             7
3             14
4             15
5             8
6             5

Median
Middle-most value.
50% of observations are above the median and 50% are below it.
The difference in magnitude between the observations does not matter; therefore, the median is not sensitive to outliers.
Formula: Median location = (n + 1)/2

Median = the middle number when the data are arranged in numerical order.

Data: 3 5 1
Step 1: Arrange in numerical order: 1 3 5
Step 2: Pick the middle number (3)

Data: 3 5 7 11 14 15
Median = (7 + 11)/2 = 9

Median location = (N + 1)/2 = (56 + 1)/2 = 28.5, so the median is the average of the 28th and 29th values.
Median = (3 + 4)/2 = 3.5

Data Point    Frequency
0             2
1             5
2             7
3             14
4             15
5             8
6             5

Mean
The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average, as the term is most commonly used).
The mean is defined if and only if the variable is at least interval in nature [i.e., interval or ratio].

Mean = Average = ΣX/N
ΣX = 191
Mean = 191/56 = 3.41

Data Point (X)    Frequency (f)    f × X
0                 2                0
1                 5                5
2                 7                14
3                 14               42
4                 15               60
5                 8                40
6                 5                30
Total             56               191
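All three averages can be computed directly from a frequency table. The following Python sketch uses the data points 0 to 6 and their frequencies from the table above; the variable names are chosen here for illustration only.

```python
# Frequency table: data point -> frequency.
freq = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}

n = sum(freq.values())                          # number of observations
total = sum(x * f for x, f in freq.items())     # sum of all observed values
mean = total / n

# Mode: the data point with the largest frequency (not the frequency itself).
mode = max(freq, key=freq.get)

# Median: write every observation out in order and take the middle one(s).
expanded = [x for x, f in sorted(freq.items()) for _ in range(f)]
mid = n // 2
if n % 2 == 1:
    median = expanded[mid]
else:
    median = (expanded[mid - 1] + expanded[mid]) / 2

print(n, total, round(mean, 2), median, mode)
```

With 56 observations the median is the average of the 28th and 29th values (3 and 4), the sum of the values is 191, and the most frequent data point is 4 with a frequency of 15.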

Advantages and Disadvantages of the Measures: Median
1. Unaffected by extreme scores
   Data: 5 8 11, Median = 8
   Data: 5 8 5 million, Median = 8
2. Usually its value actually occurs in the data
3. But it cannot be entered into equations, because there is no equation that defines it
4. And it is not as stable from sample to sample, because it depends on the number of scores in the sample

Advantages and Disadvantages of the Measures: Mean
1. Defined algebraically
2. Stable from sample to sample
3. But usually does not actually occur in the data
4. And heavily influenced by outliers
   Data: 5 8 11, Mean = 8
   Data: 5 8 5 million, Mean = 1,666,671
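The effect of a single extreme score on the two measures can be checked with Python's standard statistics module. This is just a quick demonstration using the two small data sets from the lists above.

```python
from statistics import mean, median

ordinary = [5, 8, 11]
with_outlier = [5, 8, 5_000_000]

# Replacing 11 with 5 million leaves the median where it was,
# while the mean is pulled far away from the other two scores.
print(mean(ordinary), median(ordinary))
print(mean(with_outlier), median(with_outlier))
```

Both data sets have a median of 8, while the mean jumps from 8 to 1,666,671.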

Measures of Variation
A measure of variation describes how spread out or scattered a set of data is. It is also known as a measure of dispersion or measure of spread.
Measures of variation include:

1. The range
2. The variance
3. The standard deviation

The standard deviation is just the square root of the variance.

Range: the difference between the extreme values (max - min). In the literature, the actual values (min, max) are reported more often than the difference.
Variance: a measure of variation in a sample of data, the mean of the squared deviations of the values from the mean; often referred to as the mean square or MS.
Standard deviation: the square root of the variance; measures the amount of variation of the values around the mean.

Example
Heights (in inches) of the 5 starting players from basketball team A:
A: 72, 73, 76, 76, 78
The range is the difference between the maximum and minimum values of the data set.
Range of team A: 78 - 72 = 6
The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.

Step 1. Find the mean of the data.

x̄ = (72 + 73 + 76 + 76 + 78)/5 = 375/5 = 75

Step 2. Find the deviation of each score from the mean.

xi    xi - x̄
72    72 - 75 = -3
73    73 - 75 = -2
76    76 - 75 = 1
76    76 - 75 = 1
78    78 - 75 = 3

Note that the sum of the deviations is zero.

Step 3. Square each deviation from the mean. Find the sum of the squared deviations.

xi     xi - x̄          (xi - x̄)²
72     72 - 75 = -3    9
73     73 - 75 = -2    4
76     76 - 75 = 1     1
76     76 - 75 = 1     1
78     78 - 75 = 3     9
Sum    0               24

Step 4. The sample variance is determined by dividing the sum of the squared deviations by (n - 1) (the number of scores minus one).

For Team A, the sample variance is
s² = 24/(5 - 1) = 6

Step 5. The standard deviation is the square root of the variance.

The mathematical formula for the sample standard deviation is
s = sqrt( Σ(xi - x̄)² / (n - 1) )

The sample standard deviation for Team A is
s = sqrt(6) ≈ 2.45
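The five steps can be followed line by line in a short Python sketch. The function name sample_variance is chosen here for illustration; it divides by n - 1, as Step 4 describes, and is applied to the Team A heights.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n                                   # Step 1
    deviations = [x - mean for x in values]                  # Step 2
    squared = [d ** 2 for d in deviations]                   # Step 3
    return sum(squared) / (n - 1)                            # Step 4

team_a = [72, 73, 76, 76, 78]

variance = sample_variance(team_a)
std_dev = math.sqrt(variance)                                # Step 5
print(variance, round(std_dev, 2))
```

For Team A the squared deviations sum to 24, so the sample variance is 24/4 = 6 and the sample standard deviation is about 2.45 inches.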

Measures of Position
Measures of position identify the position of a data value in a data set, using measures such as percentiles and quartiles.
They are used to locate the relative position of a data value in a data set.
They can be used to compare data values from different data sets.
They can be used to compare data values within the same data set.
They can be used to help determine outliers within a data set.
They include the z-score (standard score), percentiles, and quartiles.

z-scores
Also called the standard score.
Represents the number of standard deviations that a data value is from the mean of a specific distribution.
Always round the value to 2 decimal places.
Can be used to compare data values from different data sets by converting raw data to a standardized scale.
Calculation involves the mean and standard deviation of the data set.

Z-score
Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation.
The symbol for both population and sample is z.
Can be positive, negative, or zero.
A data point can be considered unusual if its z-score is sufficiently large or small.

Formula

Sample: z = (x - x̄)/s
Population: z = (x - μ)/σ

Example
Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees. Find the z-score for temperatures of:
a. 100 degrees
b. 97 degrees

Solution
a. z = (100 - 98.20)/0.62 = 2.90
b. z = (97 - 98.20)/0.62 = -1.94

Significance of Z
Z-scores above 2 or below -2 are considered UNUSUAL.
Z-scores above 3 or below -3 are considered VERY UNUSUAL.
So the temperature of 100 degrees (z = 2.90) is UNUSUAL.
The temperature of 97 degrees (z = -1.94) is ordinary.
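The z-score calculation and the rule of thumb for unusual values can be sketched in Python. The function names z_score and describe are made up for this illustration; the cut-offs of 2 and 3 are the ones stated above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def describe(z):
    """Label a z-score using the cut-offs of 2 and 3 standard deviations."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"

mean_temp, sd_temp = 98.20, 0.62

z_100 = round(z_score(100, mean_temp, sd_temp), 2)
z_97 = round(z_score(97, mean_temp, sd_temp), 2)
print(z_100, describe(z_100))
print(z_97, describe(z_97))
```

The temperature of 100 degrees gives z = 2.90 and is labelled unusual, while 97 degrees gives z = -1.94 and is labelled ordinary.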

Percentiles
Are position measures used to indicate the position of an individual in a group.
Divide the data set into 100 (per cent) equal groups.
Used to compare an individual data value with the national norm.
Symbolized by P1, P2, ...
Percentile rank indicates the percentage of data values that fall below the specified value.

Percentile rank of x = (B + 0.5E)/n × 100

where B = number of scores below x, E = number of scores equal to x, n = number of scores.

A percentile tells the percent of scores that are lower than a given score.
Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:

(125 + 0.5 × 1)/150 × 100 = 83.67, which rounds to 84

Jason's standing in the class, at the 84th percentile, is as high as or higher than that of 84% of the graduates.
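The percentile rank can be sketched in Python using B, E and n as defined above. The sketch assumes the commonly used formula (B + 0.5E)/n × 100, which reproduces the 84th percentile quoted for Jason; the function name percentile_rank is hypothetical.

```python
def percentile_rank(below, equal, n):
    """Percentile rank from the counts of scores below and equal to x.

    Assumes the formula (B + 0.5 * E) / n * 100.
    """
    return (below + 0.5 * equal) / n * 100

# Jason: 125 students ranked below him, 1 score equal to his (his own),
# in a class of 150.
jason = percentile_rank(below=125, equal=1, n=150)
print(round(jason, 2), round(jason))
```

The result is about 83.67, which is rounded to the 84th percentile.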

Quartiles
Quartiles divide the data set into 4 groups, each of which has the same number of members.
Q1 corresponds to P25.
Q2 corresponds to P50, or the median.
Q3 corresponds to P75.
Q1, Q2, and Q3 divide the ranked scores into four equal parts.

Example

Find Q1, Q2, and Q3.

Q2 (Median)
The median is the average of the 6th and 7th scores.
Q2 = (80.2 + 82.5)/2 = 81.35

Q1
Find the median of the first 6 scores.
Q1 = (78.6 + 79.2)/2 = 78.9

Q3
Find the median of the last 6 scores.
Q3 = (84.3 + 84.6)/2 = 84.45

THE END
