
BAS 115

Probability – Statistics
Lecture 1
What is Statistics?
Statistics is the science of
conducting studies to collect,
organize, present, summarize,
and analyze data, and to draw
conclusions (“decisions”) from it.
Types of Data
(i) Quantitative: consists of numbers
representing counts or measurements.

(ii) Qualitative: can be separated into categories
that are distinguished by nonnumeric
characteristics (blood types, colours, genders
(male/female), letter grades in an exam, etc.).
Types of Data (cont.)
Throughout this course we are mostly concerned with
quantitative data.
• Quantitative data assume numeric values:
◦ Discrete: the number of possible values is finite or
countable, e.g., the number of students in a class.
◦ Continuous: the data can take infinitely many values
on a scale that covers a range of values
without gaps, e.g., temperature, height, weight.
Types of statistical Applications

There are two possible types of
studies, depending on why the study
is conducted:
1. Descriptive statistics
2. Inferential statistics.
1-Descriptive Statistics
• Involves only the collection,
organization, presentation, and
summarization of data.

• The point is to describe a certain situation
as represented by a particular data set.
2-Inferential Statistics
• Involves drawing conclusions from data.

• The point is to make inferences about a
certain situation represented by a
particular data set.
Populations and Samples
Population is the complete collection of
all elements to be studied.
Populations are normally so large that it is
logistically impossible to examine all the
individuals. So…
• One takes a subgroup of a population and
examines the desired characteristic for the
subcollection.

• Such subcollections are called samples.

• A major problem is to ensure that a selected
sample is representative of the population.

So the major task of inferential statistics is to
draw conclusions about a whole population on
the basis of analyzing a sample.
Descriptive Statistics:
Summarization

• We’ll consider three aspects:

1. Measures of Central Tendency.

2. Measures of Variation (Dispersion).

3. Measures of Position.
Measures of Central Tendency

• We’re interested in a value that represents the
center of the distribution.
We’ll study three measures:
• 1-Mean
• 2-Median
• 3-Mode
i. The (Arithmetic) Mean
• For a population of size N, the mean, $\mu$, is
given by

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

• For a sample of size n, the mean, $\bar{X}$, is given
by

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Example 1
The data represent the number of days off per year for a
sample of individuals selected from 9 different countries.
Find the sample mean.

20 26 40 36 23

42 35 24 30
Solution
• n = 9.

$$\bar{X} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7$$

• The mean is rounded to one more decimal
place than occurs in the data.
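The calculation in Example 1 can be sketched in Python. This is only an illustration added here; the names `days_off` and `sample_mean` are chosen for the example and do not come from the lecture.

```python
# Days off per year for the 9 sampled countries (Example 1).
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]


def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


mean_days = sample_mean(days_off)
# Round to one more decimal place than the data, as the slide suggests.
print(round(mean_days, 1))  # 30.7
```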
ii. The Median

• The median, MD, is the midpoint of the entire data
array.
• To determine the median:
1. Sort the data values.
2. Pick the value in the middle.
– For n data values,
(i) If n is odd,
then MD = middle point,
i.e., the value in position $(n + 1)/2$.
(ii) If n is even,
then MD = (sum of the two middle points) / 2,
i.e., the average of the two numbers in positions
$n/2$ and $(n/2) + 1$.

(Note: MD need not be a data value)


Example (2): Find the median for the following
sample

1 2 6 7 12

13 2 6 9 5

18 7 3 15 15

4 17 1 14 5
Step 1: Sorting…

1 1 2 2 3

4 5 5 6 6

7 7 9 12 13

14 15 15 17 18
Step 2: n = 20, which is an even number, so the two values
in the middle are in positions n/2 = 20/2 = 10 and
(n/2) + 1 = 11.

1 1 2 2 3

4 5 5 6 6

7 7 9 12 13

14 15 15 17 18

Then the median MD = (6 + 7) / 2 = 6.5
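The two-case rule for the median can be sketched as a small Python function. This is a minimal illustration rather than part of the lecture; the function name `median` and the variable `sample` are assumptions made for the example. Python lists are indexed from 0, so position k in the slides corresponds to index k - 1.

```python
def median(values):
    """Median by the odd/even rule from the slides."""
    ordered = sorted(values)          # Step 1: sort the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the value in position (n + 1) / 2, i.e. index mid
        return ordered[mid]
    # n even: average the values in positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data from Example (2)
sample = [1, 2, 6, 7, 12, 13, 2, 6, 9, 5,
          18, 7, 3, 15, 15, 4, 17, 1, 14, 5]
print(median(sample))  # 6.5
```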


iii. The Mode
• The mode is a data value that has the
highest frequency in a data set.
• A distribution may have one, more than
one, or no mode at all.
• Defined for both qualitative and
quantitative data.
Example (3): The following data represent the
duration (in days) of U.S. Space Shuttle voyages
for the years 1992–1994.

8 9 9 14 8 8
10 7 6 9 7 8
10 14 11 8 14 11
Solution : Sort the data for convenience
6 7 7 8 8 8
8 8 9 9 9 10
10 11 11 14 14 14
Identify the value with highest frequency
6 7 7 8 8 8
8 8 9 9 9 10
10 11 11 14 14 14

Then the mode is 8 (it occurs five times, more often than any other value).


Example (4) :The following data represent the
number of coal employees per county for 10
selected counties in southwestern
Pennsylvania.
110 731 1031 84 20
118 1162 1977 103 752

Solution: Since each value occurs exactly once,
there is no mode.
Note that the mode is not 0.
Example(5):The following data represent the favorite
subject of 10 MIU students.

CS Math Math Physics Physics


Math CS CS Chemistry Physics

Solution: Each of Math, CS and Physics occurs three times, so
the distribution has three modes:
Math, CS and Physics.
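Examples (3) to (5) cover the three possible outcomes: one mode, no mode, and several modes. The sketch below handles all three. It is an illustration only; the function name `modes` is an assumption, and so is the choice to return an empty list when every value occurs exactly once (following the "no mode" convention of Example (4)).

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once: no mode (convention of Example 4).
        return []
    return sorted(v for v, c in counts.items() if c == highest)


shuttle_days = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
coal_employees = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
subjects = ["CS", "Math", "Math", "Physics", "Physics",
            "Math", "CS", "CS", "Chemistry", "Physics"]

print(modes(shuttle_days))    # [8]
print(modes(coal_employees))  # []
print(modes(subjects))        # ['CS', 'Math', 'Physics']
```

Because the function only counts occurrences, it works for qualitative data (the subjects) as well as quantitative data.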
Measures of Variability

• Measures of central tendency locate the center of a
distribution.
• They do not indicate how the values are distributed
around the center.
– Measures of variability examine the spread, or
variation, of data values around the center.
Example (6): Consider two very small populations, each
consisting of 10 measurements.

A: 55 60 65 70 75 80 85 90 95 100

B: 73 74 75 76 77 78 79 80 81 82

$$\mu_A = \frac{55 + 60 + 65 + 70 + 75 + 80 + 85 + 90 + 95 + 100}{10} = 77.5$$

$$\mu_B = \frac{73 + 74 + 75 + 76 + 77 + 78 + 79 + 80 + 81 + 82}{10} = 77.5$$
• The two distributions have the same mean!
• Are they the same?
• How exactly are they different?
– The difference is in the spread of values around the mean.
– In population B, the data values are clustered closer to the
mean.
– The values of population B are more consistent.
Measures of Variation
(Dispersion)
• We’ll consider three measures of variation:
1. The variance.

2. The standard deviation.

3. The coefficient of variation.


1. The Variance
• The population variance is defined by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

• Note
– The variance is an average.
– It is the mean of the squared distances to the
population mean.
– The squaring makes every term nonnegative, so that
positive and negative deviations do not cancel.
Example(7) : Computing the Variance of population A in example(6)
55 60 65 70 75 80 85 90 95 100
1. We know that $\mu = 77.5$

2. Subtract the mean from each data value, i.e., $(x - \mu)$:

$55 - 77.5 = -22.5$   $60 - 77.5 = -17.5$   $65 - 77.5 = -12.5$
$70 - 77.5 = -7.5$   $75 - 77.5 = -2.5$   $80 - 77.5 = 2.5$
$85 - 77.5 = 7.5$   $90 - 77.5 = 12.5$   $95 - 77.5 = 17.5$
$100 - 77.5 = 22.5$

3. Square each result:

$(-22.5)^2 = 506.25$   $(-17.5)^2 = 306.25$   $(-12.5)^2 = 156.25$
$(-7.5)^2 = 56.25$   $(-2.5)^2 = 6.25$   $(2.5)^2 = 6.25$
$(7.5)^2 = 56.25$   $(12.5)^2 = 156.25$   $(17.5)^2 = 306.25$
$(22.5)^2 = 506.25$

4. Find the sum of the squares = 2062.5

5. Divide by N (= 10):
$\sigma^2 = 2062.5 / 10 = 206.25 \approx 206.3$
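The five steps above can be sketched as a short Python function and applied to both populations from Example (6). This is an illustration only; the name `population_variance` is chosen here, and the variance of population B is computed by the code rather than taken from the slides.

```python
population_a = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
population_b = [73, 74, 75, 76, 77, 78, 79, 80, 81, 82]


def population_variance(values):
    """Mean of the squared deviations from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n                          # step 1: the mean
    squared = [(x - mu) ** 2 for x in values]     # steps 2-3: deviations, squared
    return sum(squared) / n                       # steps 4-5: sum, divide by N


var_a = population_variance(population_a)
var_b = population_variance(population_b)
print(var_a, var_b)  # 206.25 8.25
```

Both populations have mean 77.5, but the much smaller variance of B reflects how tightly its values cluster around that mean.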
Sample Variance
• The sample variance is used as an estimate
of the population variance.
• For a better estimate, the sample
variance is defined by:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Sample Variance: Shortcut
Formula
Rearranging the terms in the formula for the
variance we arrive at an expression that does
not involve the mean explicitly:

$$S^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$$
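The rearrangement is not spelled out on the slide. One way to fill in the steps, using only the definitions of the sample mean $\bar{x}$ and the sample variance $S^2$, is the following sketch:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
     && \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2 \\[4pt]
S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \frac{n\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2}{n(n-1)}
\end{align*}
```

The last line follows by multiplying the numerator and denominator by $n$.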
2. The Standard Deviation

• The standard deviation is the square root of
the variance.
– It has the same units as the raw data.
• For a population

$$\sigma = \sqrt{\sigma^2}$$

• For a sample

$$S = \sqrt{S^2}$$
Example (8) : Find the sample standard deviation for the amount of European
auto sales for a sample of 6 years shown. The data are in millions of dollars:
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
Solution: Use the shortcut formula.
1. Find the sum of the values:

$$\sum_{i=1}^{6} x_i = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6$$

2. Square each value and find the sum:

$$\sum_{i=1}^{6} x_i^2 = (11.2)^2 + (11.9)^2 + (12.0)^2 + (12.8)^2 + (13.4)^2 + (14.3)^2 = 958.94$$

3. Substitute into the formula:

$$S^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)} = \frac{6(958.94) - (75.6)^2}{6 \cdot 5} = 1.276$$

4. Compute the square root:

$$S = \sqrt{S^2} = \sqrt{1.276} \approx 1.13$$
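The four steps of Example (8) map directly onto code. The sketch below is an illustration only; the names `sales` and `sample_variance_shortcut` are chosen for the example. Results are rounded for display because floating-point arithmetic introduces tiny errors.

```python
import math

# European auto sales for 6 years, in millions of dollars (Example 8).
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]


def sample_variance_shortcut(values):
    """Sample variance via the shortcut formula (no explicit mean)."""
    n = len(values)
    total = sum(values)                       # step 1: sum of the values
    total_sq = sum(x * x for x in values)     # step 2: sum of the squares
    return (n * total_sq - total ** 2) / (n * (n - 1))  # step 3


s2 = sample_variance_shortcut(sales)
s = math.sqrt(s2)                             # step 4: square root
print(round(s2, 3), round(s, 2))  # 1.276 1.13
```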


Example (9): Find the variance and standard deviation of the set of
numbers: 4, 5, 8, 10, 13
Solution: Arithmetic mean

$$\bar{X} = \frac{4 + 5 + 8 + 10 + 13}{5} = 8$$

So,

$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(4 - 8)^2 + (5 - 8)^2 + (8 - 8)^2 + (10 - 8)^2 + (13 - 8)^2}{4} = 13.5$$

Also,

$$S = \sqrt{13.5} \approx 3.67$$
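Example (9) uses the definition of the sample variance rather than the shortcut. A minimal Python sketch of that definition follows, cross-checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The names `data` and `sample_variance` are assumptions made for the example.

```python
import math
import statistics

data = [4, 5, 8, 10, 13]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


s2 = sample_variance(data)
s = math.sqrt(s2)
print(s2, round(s, 2))  # 13.5 3.67

# The standard library agrees with the hand calculation.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
```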
3. Coefficient of Variation (C.V)

Sample coefficient of variation:

$$\text{C.V} = \frac{S}{\bar{X}} \cdot 100\%$$

Population coefficient of variation:

$$\text{C.V} = \frac{\sigma}{\mu} \cdot 100\%$$

The coefficient of variation is generally expressed as a percentage. This measure allows us to
compare the relative variability of two data sets.
Example (10): Page 29. Measurements of the diameter of a ball bearing
made with one micrometer have a mean of 3.92 mm and a standard
deviation of 0.0152 mm, whereas measurements of the unstretched length
of a spring made with another micrometer have a mean of 1.54
inches and a standard deviation of 0.0086 inches. Which of these two
measuring instruments is relatively more precise?
Solution: Calculating the two coefficients of variation, we get

Ball bearing: $\bar{X} = 3.92$ mm, $S = 0.0152$ mm, so $\text{C.V} = \frac{0.0152}{3.92} \cdot 100\% \approx 0.39\%$

Spring: $\bar{X} = 1.54$ inches, $S = 0.0086$ inches, so $\text{C.V} = \frac{0.0086}{1.54} \cdot 100\% \approx 0.56\%$

Thus, the measurements made with the first micrometer are
relatively more precise.
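Because the coefficient of variation is unit-free, the two instruments can be compared even though one is measured in millimetres and the other in inches. A minimal sketch of the comparison follows; the function name `coefficient_of_variation` is an assumption made for the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_ball = coefficient_of_variation(0.0152, 3.92)    # ball bearing, in mm
cv_spring = coefficient_of_variation(0.0086, 1.54)  # spring, in inches

print(round(cv_ball, 2), round(cv_spring, 2))  # 0.39 0.56
# The smaller C.V marks the relatively more precise instrument.
print("first" if cv_ball < cv_spring else "second")  # first
```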
Measures of Position
1-Quartiles,

2-Deciles,

3-Percentiles
Measures of Position
Quartiles, Deciles and Percentiles:
• Used to locate the relative position of a data
value in a data set.

• If a set of data is arranged in order of
magnitude, the middle value, which divides the
set into two equal parts, is the median.

• By extending this idea we can think of the
values which divide the set into four, ten and
a hundred equal parts.
1-Quartiles
Q1, Q2, Q3 divide ranked data
into four equal parts:

(minimum) | 25% | Q1 | 25% | Q2 (median) | 25% | Q3 | 25% | (maximum)
2-Deciles
D1, D2, D3, D4, D5, D6, D7, D8, D9
divide ranked data into ten equal
parts:

10% | D1 | 10% | D2 | 10% | D3 | 10% | D4 | 10% | D5 | 10% | D6 | 10% | D7 | 10% | D8 | 10% | D9 | 10%
3-Percentiles
99 Percentiles

P1, P2, P3, …, P99 divide ranked data
into a hundred equal parts
Deciles & Percentiles
The values which divide the data into ten equal parts are
called deciles and are denoted by D1, D2,.....,D9

The values dividing the data into one hundred parts are
called percentiles and are denoted by P1, P2,....., P99.

Example: the 90th percentile is the value such that 90% of the
observations are less than or equal to it.
Quartiles      Deciles
Q1 = P25       D1 = P10
Q2 = P50       D2 = P20
Q3 = P75       D3 = P30
               ...
               D9 = P90
Sample Percentiles
• How do we get the 100p-th percentile?

1- Order the n observations from smallest to largest.

2- Determine the product np.
(i) If np is not an integer, round it up to the next
integer and take the ordered value in that position.

(ii) If np is an integer, say k, calculate the mean of
the kth and (k+1)st ordered observations.
Example (11) : Consider: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Find the value corresponding to the 25th percentile?

Solution:
1- Sort: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

2- m = n.p = (10)(0.25) = 2.5, not a whole number, so we round it up
to L = 3.
Hence data value #3 (the 3rd value), which is 5, corresponds to the 25th
percentile.
Example(12) : Consider: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Find the value corresponding to the 60th percentile?

Solution
1- Sort: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

2- m = n.p = (10)(0.60) = 6, a whole number. Hence we need
data value #6, which is 10, and data value #7, which is 12.
The value corresponding to the 60th percentile is the
average of 10 and 12: (10 + 12)/2 = 11.
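The percentile rule used in Examples (11) and (12) can be sketched as a Python function. This is an illustration only; the name `sample_percentile` is an assumption, and so is the choice to pass the percentile as a whole number (25 rather than 0.25). That choice lets `divmod` test exactly whether np is an integer, avoiding floating-point surprises.

```python
def sample_percentile(values, percent):
    """Percentile by the lecture's rule; `percent` is a whole number, 1 to 99."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    ordered = sorted(values)                  # step 1: order the observations
    n = len(ordered)
    # step 2: np = n * percent / 100, split into whole part k and remainder
    k, remainder = divmod(n * percent, 100)
    if remainder:
        # np is not an integer: round up to position k + 1 (index k from 0)
        return ordered[k]
    # np is the integer k: mean of the kth and (k+1)st ordered values
    return (ordered[k - 1] + ordered[k]) / 2


data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(sample_percentile(data, 25))  # 5
print(sample_percentile(data, 60))  # 11.0
```

Statistical software often uses other percentile conventions (for example, interpolation between neighbouring values), so library results may differ slightly from this rule.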
Quartiles Example
Example (13): The following are the weights of a sample of 12
patients: 175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270.
Find Q1, Q2, and Q3.
Solution: Q1 = 25th, Q2 = 50th, and Q3 = 75th percentile.
• Step 1:
Rank the data and divide into 4 parts:
150, 165, 170 [Q1] 175, 180, 190 [Q2] 210, 210, 235 [Q3] 240, 260, 270

• Step 2:
Q1 = (170 + 175)/2 = 172.5
Q2 = (190 + 210)/2 = 200.0
Q3 = (235 + 240)/2 = 237.5
Definitions
1- Range = Max. value – Min. value

2- Interquartile Range (or IQR): Q3 - Q1

3- Semi-interquartile Range: (Q3 - Q1) / 2
Example (14): Back to Example (13). Find the IQR.
Solution:
• Step 1:
Rank the data and divide into 4 parts:
150, 165, 170 [Q1] 175, 180, 190 [Q2] 210, 210, 235 [Q3] 240, 260, 270

• Step 2:
Q1 = (170 + 175)/2 = 172.5
Q2 = (190 + 210)/2 = 200.0
Q3 = (235 + 240)/2 = 237.5

• Step 3: Calculate the IQR: Q3 - Q1
Q3 - Q1 = 237.5 - 172.5 = 65
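The quartiles and spread measures for the patient weights can be checked with a short script. This is a sketch under the same assumptions as the percentile rule given earlier in the lecture (round np up, or average two neighbours when np is an integer); the helper name `sample_percentile` and the variable names are chosen for the example. The range and semi-interquartile range are computed here from the definitions and are not stated in the slides.

```python
def sample_percentile(values, percent):
    """Percentile by the lecture's rule; `percent` is a whole number, 1 to 99."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * percent, 100)
    if remainder:
        return ordered[k]                      # np not an integer: round up
    return (ordered[k - 1] + ordered[k]) / 2   # np an integer: average


weights = [175, 260, 150, 165, 170, 180, 190, 210, 210, 235, 240, 270]

q1 = sample_percentile(weights, 25)
q2 = sample_percentile(weights, 50)
q3 = sample_percentile(weights, 75)
print(q1, q2, q3)  # 172.5 200.0 237.5

data_range = max(weights) - min(weights)   # Range = max - min
iqr = q3 - q1                              # Interquartile range
semi_iqr = iqr / 2                         # Semi-interquartile range
print(data_range, iqr, semi_iqr)  # 120 65.0 32.5
```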
Example (15): Pages 30 & 31
Consider the data collected in a nanotechnology setting. Engineers
fabricating a new transmission-type electron multiplier created an array
of silicon nanopillars on a flat silicon membrane. The precise structure
can influence the electrical properties, so the heights of 50 nanopillars
were measured in nanometers (nm), or $10^{-9}$ meters. The ordered
heights of the nanopillars are:
221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391

Obtain the following:
(i) The quartiles Q1, Q2 and Q3 and the 93rd percentile
(ii) The range and interquartile range
Solution: (i) 1- Q1 = 25th percentile
1- Sort: 221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391

2- m = n.p = (50)(0.25) = 12.5, not a whole number, so we round it
up to L = 13.
Hence data value #13 (the 13th value), which is 278, corresponds to
Q1 (the 25th percentile).
Solution: (i) 2- Q2 = 50th percentile
1- Sort: 221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391
2- m = n.p = (50)(0.50) = 25, a whole number. Hence we need
data value #25, which is 304, and data value #26, which is 305.
The value corresponding to Q2 (the 50th percentile) is the
average of 304 and 305: (304 + 305)/2 = 304.5
Solution: (i) 3- Q3 = 75th percentile
1- Sort: 221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391

2- m = n.p = (50)(0.75) = 37.5, not a whole number, so we round it
up to L = 38.
Hence data value #38 (the 38th value), which is 330, corresponds to
Q3 (the 75th percentile).
Solution: (i) 4- 93rd percentile
1- Sort: 221 234 245 253 265 266 271 272 274 276
276 276 278 284 289 290 290 292 292 296
297 298 300 303 304 305 305 308 308 309
310 311 312 314 315 315 323 330 333 336
337 338 343 346 355 364 366 373 390 391

2- m = n.p = (50)(0.93) = 46.5, not a whole number, so we round it
up to L = 47.
Hence data value #47 (the 47th value), which is 366, corresponds to the
93rd percentile.
Solution: (ii)

1- Range = maximum - minimum = 391 - 221 = 170

2- Interquartile Range (or IQR): Q3 - Q1 = 330 - 278 = 52
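All parts of Example (15) can be reproduced in a few lines. The sketch below again assumes the lecture's percentile rule; the helper `sample_percentile` and the variable names are hypothetical, introduced only for this illustration.

```python
def sample_percentile(values, percent):
    """Percentile by the lecture's rule; `percent` is a whole number, 1 to 99."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * percent, 100)
    if remainder:
        return ordered[k]                      # np not an integer: round up
    return (ordered[k - 1] + ordered[k]) / 2   # np an integer: average


# Heights of the 50 nanopillars, in nanometers.
heights = [
    221, 234, 245, 253, 265, 266, 271, 272, 274, 276,
    276, 276, 278, 284, 289, 290, 290, 292, 292, 296,
    297, 298, 300, 303, 304, 305, 305, 308, 308, 309,
    310, 311, 312, 314, 315, 315, 323, 330, 333, 336,
    337, 338, 343, 346, 355, 364, 366, 373, 390, 391,
]

q1 = sample_percentile(heights, 25)
q2 = sample_percentile(heights, 50)
q3 = sample_percentile(heights, 75)
p93 = sample_percentile(heights, 93)
print(q1, q2, q3, p93)  # 278 304.5 330 366

print(max(heights) - min(heights))  # range: 170
print(q3 - q1)                      # interquartile range: 52
```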
