
STAT 210

Probability and Statistics


Unit 1: Descriptive Statistics
Outline
 Introduction to Statistics:

 Graphical methods:
Bar and pie charts, histogram

 Summary Statistics:
Measures of location, measures of variability,
boxplot

STAT210: Probability and Statistics 2


Why Statistics?
 Statistics deals with collecting, processing, summarizing,
analyzing and interpreting data. On the other hand,
engineering and industrial management deal with such
diverse issues as solving production problems, effective use
of materials and labor, development of new products,
quality improvement and reliability and, of course, basic
research.
 The field of statistics involves methods for:
1. Designing and carrying out research studies.
2. Describing collected data.
3. Making decisions, predictions, or inferences about
phenomena represented by the data by designing valid
experiments and drawing reliable conclusions.
Why Statistics?
 Branches of Statistics
1. Descriptive statistics: statistical
methods that summarize and describe the
prominent features of data.
2. Inferential statistics: statistical methods that
generalize results from a sample to a
population.



 Example: UAEU (average grade of students)
 Goal: compare the grades over the last 10 years

 Collect the grades of the students
 Put the grades into a table
 Calculate the average
 Draw graphs



 Sample: 100 students
 Grades in the sample: average = 17

 Estimate the value of the average grade in UAEU

 The population average is around 17



Sample         Population
Average        Average
Percentage     Percentage
Proportion     Proportion

Statistic      Parameter
(estimator)



Sampling
 As it is generally impossible or impractical to find out something about
the entire population, we examine a part of it to make inferences.
 A population is the entire collection of objects or outcomes about
which information is sought.
 A sample is a subset of a population, containing the objects or
outcomes that are actually observed.
 A parameter is a numerical characteristic of a population, which is
usually unknown.
 A statistic is a numerical characteristic computed from the sample. It
varies from sample to sample and is used as an estimate of the
population parameter.

Example: A researcher is interested in measuring customer satisfaction
with the internet connection in a certain city. He randomly sampled 50
customers from a list of subscribers. The population of interest is all
customers in the city, while the sample is the 50 selected customers.
Data Collection
Besides organizing and analyzing data, statistics
deals with the development of techniques for
collecting the data. If data is not properly collected,
an investigator may not be able to answer the
questions under consideration with a reasonable
degree of confidence.
Observational Studies: The engineer simply observes the
process without disturbing it and records the quantities
of interest. Such a study may reveal a relationship between
input and output, but it cannot establish the relationships
among all factors because no deliberate changes were made
to them.
 Example: running Python code
 Observe and record the result

 Provide the Python code and record the result



Data Collection
Controlled (Designed) Experiments: Measurements
are recorded while controlling some factors that
might influence the results of the study. The response
(output) variable of interest is then measured.

Surveys: Questionnaires designed to solicit
information from people. Data may be collected by
face-to-face interview, telephone interview, postal
mail, email, or fax.



Simple Random Sampling (SRS)
A simple random sample (SRS) of size n is a sample chosen by a
method in which each collection of n population items is equally
likely to comprise the sample.
 An SRS is not guaranteed to reflect the population perfectly.
 SRSs always differ in some ways from each other.
 Two samples from the same population may vary from each
other. This is known as sampling variation.
 Items in an SRS may be treated as independent in most
cases encountered in practice. The exception occurs when
the population is finite and the sample comprises a
substantial fraction (more than 5%) of the population.



Simple Random Sampling
Sampling with replacement: Replace each item after it is
sampled.
 The population remains the same on every draw. The
sampled units are truly independent.
 In the sample the researcher collected, 80% of users were
satisfied with their internet connection.
 In the population of customers, it is unlikely there will be
exactly 80% who are satisfied with their internet connection.
 It is more realistic to think that there will be somewhere
around 80% of the customers who are satisfied with their
internet connection.
 Another researcher repeats the study with a different SRS of
50 customers. She finds 90% are satisfied with their internet
connection.
Simple Random Sampling
 Did she do something wrong or did the first researcher do
something wrong?
 Neither did anything wrong: this is sampling variation at work.
Two different samples from the same population will differ from
each other and from the population.
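
The two researchers' situation can be imitated with a short simulation. This is only a sketch: the population size (10,000 subscribers) and the true satisfaction rate (85%) are invented for illustration and do not come from the example above.

```python
import random

# Hypothetical population: 10,000 subscribers, 85% of them satisfied.
# Both numbers are assumptions made up for this illustration.
random.seed(210)  # fixed seed so the run is repeatable
population = [True] * 8500 + [False] * 1500

def sample_proportion(pop, n):
    """Draw an SRS of size n (without replacement) and return the
    proportion of satisfied customers in it."""
    sample = random.sample(pop, n)
    return sum(sample) / n

p_first = sample_proportion(population, 50)
p_second = sample_proportion(population, 50)
print(f"first researcher:  {p_first:.0%}")
print(f"second researcher: {p_second:.0%}")
```

Running the last lines repeatedly without the seed gives a different pair of percentages each time, although the population never changes.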



Stratified Sampling
Sometimes alternative sampling methods can be used to
make the selection process easier, to obtain extra information,
or to increase the degree of confidence in conclusions.
 One such method, stratified sampling, entails separating
the population units into non-overlapping groups and
taking a sample from each one.
 For example, a TV manufacturer might want
information about customer satisfaction for units produced
during the previous year. If three different models were
manufactured and sold, a separate sample could be
selected from each of the three corresponding strata.
 This would result in information on all three models and
ensure that no one model was over- or underrepresented
in the entire sample.
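
A minimal sketch of stratified sampling for the TV example follows. The model names, the number of units per model, and the 5% sampling fraction are hypothetical, and sampling the same fraction from every stratum (proportional allocation) is only one possible choice.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical sales records: (unit_id, model). The model names and
# counts are invented for illustration.
units = ([(i, "Model A") for i in range(600)]
         + [(i, "Model B") for i in range(600, 900)]
         + [(i, "Model C") for i in range(900, 1000)])

def stratified_sample(items, key, fraction):
    """Group items into strata by key(item), then draw an SRS of the
    given fraction from each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(units, key=lambda u: u[1], fraction=0.05)

# Count how many sampled units come from each model.
counts = {}
for _, model in sample:
    counts[model] = counts.get(model, 0) + 1
print(counts)
```

Every model is guaranteed to appear in the sample, which a single SRS of 50 units from all 1000 would not guarantee.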
Convenience Sampling
Frequently a convenience sample is obtained by selecting
individuals or objects without systematic randomization. Such
a sample is not drawn by a well-defined random method.
Example: A computer engineer received a shipment of
1000 monitors in a huge container. He wants to test the
brightness of the monitors by testing a sample of 10.
The engineer takes 10 monitors from the top of the
container as the sample.
 Things to consider with convenience samples:
 They may differ systematically in some way from the population.
 Use them only when it is not feasible to draw a random
sample.



Types of Variable
 A variable is any characteristic whose value may change from one object to
another. The variables can be classified as either quantitative or qualitative.
 Quantitative (Numerical) variables: A numerical quantity is assigned to
each item in the sample. Quantitative variables can be classified as either
discrete or continuous:
 A discrete variable is a variable whose possible values can be listed,
even though the list may continue indefinitely. For example, the
number of visits to a particular Web site during a specified period, the
number of PCs owned by a family, or the number of students in an
introductory statistics class.
 A continuous variable is a variable whose possible values form some
interval of numbers. Typically, a continuous variable involves a
measurement of something, such as the price of a laptop, the CPU time
of a certain task (in seconds), or the length of time a PC battery lasts.
 Quantitative (quantity)
 Continuous: age, weight (e.g. 75.8 kg), temperature,
time, …; values in an interval such as [-1, 1]
 Discrete: number of students, number of
laptops (0, 1, …)

 Qualitative (quality)
 Nominal: names, color, gender, nationality,
brand
 Ordinal: level of education (school, high
school, college)



Types of Variable
Qualitative (Categorical) variables: The sample items are
placed into categories, groups or levels.
Examples: brand of laptop owned by a student, the defective
status (defective or not), computer knowledge (beginner,
intermediate, expert), education level (less than high school,
high school, etc.).
Values of a qualitative variable are sometimes coded with numbers.
We cannot do arithmetic with such numbers, in contrast to those of a
quantitative variable.
Qualitative data can be classified as either nominal or ordinal. The
categories of ordinal data can be ranked or meaningfully ordered,
but the categories of nominal data cannot. Of the four
qualitative data sets listed above, brand of laptop and defective
status are nominal, while computer knowledge and education level
are ordinal.



Exercises
(1) An IT student, working on his thesis, plans a survey to determine the
proportion of all computer users who regularly scan flash disks before
using them. He decides to interview his classmates in the three classes
in which he is currently enrolled.
a) What is the population of interest?
Answer: all computer users.
b) What is the sample?
Answer: the classmates in the three classes.
c) What are the parameter and the statistic?
Parameter: the proportion of all computer users who regularly scan flash
disks before using them.
Statistic: the proportion of his classmates in the three classes who
regularly scan flash disks before using them.



Exercises
(2) Are the following data quantitative or qualitative?
a) Number of hard drives a PC has. Quantitative: discrete (0, 1, 2, …)

b) Employment status (employed, unemployed). Qualitative: nominal

c) The price of a laptop. Quantitative: continuous (e.g. in [300, 4000])

d) Quality of an item (low, medium, high). Qualitative: ordinal

Size of clothes:
32, 36, 37, 38, 40: quantitative
S, M, L, XL: qualitative
Graphical Methods
Descriptive statistics can be divided into two general areas;
graphical and numerical. In this part, we consider
representing a data set using graphical techniques.
Appropriate graphs are:
 For qualitative data: Bar chart and Pie chart
 For quantitative data: Histogram; Boxplot



Bar and Pie Charts
 Bar chart: A vertical or horizontal rectangle represents the
frequency for each category.
Height can be frequency, relative frequency, or percent
frequency.
In some cases there is a natural ordering of the groups (for
example, freshmen, sophomores, juniors, seniors, graduate
students), whereas in other cases the order is arbitrary (for
example, Dell, HP, etc.).
What to Look For: Frequently and infrequently occurring
categories.
In Minitab: Graph - Bar Chart
 Pie chart: A circle divided into slices where the size of each slice
represents its relative frequency or percent frequency.
What to Look For: Categories that form large and small
proportions of the data set.
In Minitab: Graph - Pie Chart
Example
A quality manager uses a questionnaire to ask customers how
they rate the customer support services offered by the IT
Services center. The services are rated on a scale of
outstanding (O), very good (V), good (G), average (A), and
poor (P). The responses of 50 customers were:
GOVGAOVOVGOVAVOPVOGAOOOGOVVAGOVP
VOOGOOVOGAOVOOGVAG
The data are summarized in the following frequency table:
Rating Frequency
Outstanding 19
Very good 13
Good 10
Average 6
Poor 2
Rating        Frequency   Relative frequency   Percent
Outstanding   19          19/50 = 0.38         38%
Very good     13          13/50 = 0.26         26%
Good          10          10/50 = 0.20         20%
Average        6           6/50 = 0.12         12%
Poor           2           2/50 = 0.04          4%

Relative frequency of a category = frequency / n, where n is the size of
the sample (the sum of the frequencies).

Percent frequency = relative frequency × 100%.
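
As a sketch, the frequency, relative frequency, and percent columns can be computed from the 50 recorded responses with Python's standard library. The variable names and the output layout are illustrative choices, not part of the course material.

```python
from collections import Counter

# The 50 customer responses from the example, one letter per customer.
responses = ("GOVGAOVOVGOVAVOPVOGAOOOGOVVAGOVP"
             "VOOGOOVOGAOVOOGVAG")
labels = {"O": "Outstanding", "V": "Very good", "G": "Good",
          "A": "Average", "P": "Poor"}

counts = Counter(responses)   # frequency of each rating
n = len(responses)            # sample size = sum of the frequencies

for code in "OVGAP":
    freq = counts[code]
    rel = freq / n            # relative frequency = frequency / n
    print(f"{labels[code]:<12}{freq:>4}{rel:>7.2f}{rel:>6.0%}")
```

The counts obtained this way (19, 13, 10, 6, 2) agree with the frequency table.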




Exercise
The top three internet browsers in 2011 were Internet Explorer (IE),
Firefox (FF) and Chrome (GC) besides others (OT). Data indicating
the preferred browser for a sample of 60 internet users follow.
GC FF FF IE IE IE IE GC OT GC

IE FF GC GC OT FF FF FF FF IE

GC FF FF OT FF FF IE GC FF FF

GC IE IE IE GC FF OT OT OT OT

FF IE IE IE OT IE FF OT IE FF

FF IE IE GC IE FF GC GC GC FF

(a) Are these data categorical or quantitative?


(b) Provide frequency and percent frequency distributions.
(c) Construct a bar chart and a pie chart.
(d) On the basis of the sample, which browser has the largest
share? Which one is second?



Histogram
Graphical display that gives an idea of the shape of
the data distribution.
The bars of the histogram touch each other. A space
indicates that there are no observations in that
interval.
What to Look For: Central or typical value, extent
of spread or variation, general shape, location and
number of peaks, presence of gaps and outliers.

In Minitab: Graph - Histogram



Shapes of Histogram
 A histogram is perfectly symmetric if its right half is a
mirror image of its left half.
 Histograms that are not symmetric are referred to as
skewed.
 A histogram with a long right-hand tail is said to be
skewed to the right, or positively skewed.
 A histogram with a long left-hand tail is said to be skewed
to the left, or negatively skewed.

A histogram is unimodal if it has only one peak, or mode, and
bimodal if it has two clearly distinct modes. Bimodality can occur
when the data set consists of observations on two quite different
kinds of individuals or objects. In principle, a histogram can have
more than two modes, but this does not happen often in practice.


Example
To evaluate the effectiveness of a processor for a certain type
of task, a researcher recorded the CPU time (in seconds) for
n = 30 randomly chosen jobs:
70 36 43 69 82 48
34 62 35 15 59 139
46 37 42 30 55 56
36 82 38 89 54 25
35 24 22 9 56 19

Construct a histogram and describe the distribution of the CPU times.



Example

The distribution of the CPU times is skewed to the right with one potential outlier.
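
A rough text histogram of the 30 CPU times can be produced without any plotting library. This is a sketch only: the class width of 20 seconds is an arbitrary choice made for illustration, and a different width would change the picture somewhat.

```python
cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

width = 20                               # class width (assumed)
n_bins = max(cpu_times) // width + 1     # classes 0-19, 20-39, ...
counts = [0] * n_bins
for t in cpu_times:
    counts[t // width] += 1              # class index of this value

# One row of stars per class; an empty row is a gap in the data.
for i, c in enumerate(counts):
    lo = i * width
    print(f"{lo:>3}-{lo + width - 1:<3} {'*' * c}")
```

The class counts are 3, 11, 9, 3, 3, 0, 1: most jobs fall between 20 and 59 seconds, the right tail is long, and the single value in the last class (139) is separated from the rest by an empty class.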



Exercises
For each of the following data sets, draw a histogram and
determine whether the distribution is right-skewed,
left-skewed, or symmetric.
(1) 19, 24, 12, 19, 18, 24, 8, 5, 9, 20, 13, 11, 1, 12, 11, 10,
22, 21, 7, 16, 15, 15, 26, 16, 1, 13, 21, 21, 20, 19

(2) 17, 24, 21, 22, 26, 22, 19, 21, 23, 11, 19, 14, 23, 25,
26, 15, 17, 26, 21, 18, 19, 21, 24, 18, 16, 20, 21, 20, 23, 33

(3) 56, 52, 13, 34, 33, 18, 44, 41, 48, 75, 24, 19, 35, 27, 46,
62, 71, 24, 66, 94, 40, 18, 15, 39, 53, 23, 41, 78, 15, 35



Descriptive Statistics
Visual summaries of data are excellent tools for obtaining preliminary
impressions and insights. More formal data analysis often requires
the calculation and interpretation of numerical summary measures.
In practice, the entire population is rarely observed, so the
population parameters usually cannot be calculated directly. Instead,
sample statistics are used to estimate the parameters.
Percentages and proportions are used to summarize the distribution
of qualitative variables. For quantitative data, we will look at:
 Measures of location (center): mean, median, trimmed mean,
percentiles and quartiles.
 Measures of variability (spread): variance, standard deviation
(SD), range, interquartile range (IQR).

In Minitab, all summary statistics can be produced using:


Stat - Basic Statistics - Display Descriptive Statistics



Mean
Let $x_1, x_2, \dots, x_n$ be the values of the sample data; then the
mean is the average of these values.
The sample mean, denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Similarly, the population mean, denoted by $\mu$, is given by

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N is the population size.
Sometimes a sample may contain a few points that are much
larger or smaller than the rest. Such points are called outliers
and may affect the mean.
Median
The median is the value in the middle when the data are
arranged in ascending order (smallest value to largest value).
To find the median the values in the sample are ordered from
smallest to largest, then
 If n is odd, the sample median is the number in position
(n+1)/2.
 If n is even, the sample median is the average of the
numbers in positions n/2 and (n/2)+1.
Although the mean is the more commonly used measure of
central location, it is influenced by extremely small and large
data values. When such values are present, the median is often
the preferred measure of central location.
Mean vs. Median
 The mean tends to be drawn in the direction of the tail of a
skewed distribution. The median is more appropriate when
the distribution is highly skewed.
 The mean can be greatly affected by the presence of outliers,
whereas the median is not.
 For symmetric distributions, the mean and the median are the
same.
 For skewed distributions, the mean lies towards the longer
tail relative to the median.
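
The following sketch illustrates these points on the 30 CPU times from the histogram example, using Python's `statistics` module. Dropping the largest observation is done here purely to show its influence; it is not a recommendation to discard data.

```python
from statistics import mean, median

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

m_all = mean(cpu_times)
med_all = median(cpu_times)

# Same data without the largest value (139 seconds).
without_max = sorted(cpu_times)[:-1]
m_cut = mean(without_max)
med_cut = median(without_max)

print(f"all 30 jobs:  mean = {m_all:.2f}, median = {med_all}")
print(f"without 139:  mean = {m_cut:.2f}, median = {med_cut}")
```

With all 30 jobs the mean is 48.23 and the median is 42.5, so the mean lies on the side of the long right tail. Removing the single value 139 lowers the mean to 45.10 but moves the median only to 42.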



Mode and Trimmed Mean
Mode:
 The mode is the value that occurs most frequently in the
sample. There may be no mode, or there may be several modes.
 The mode is not affected by extreme values.
 It is mainly used for grouped numerical data or categorical data.

Trimmed Mean:
 The trimmed mean is a measure of center that is resistant to
outliers.
 With the p% trimmed mean, p% of the data is trimmed from each
end of the data set.
 First, arrange the sample values in (ascending or descending)
order. Then, trim an equal number of them (np/100 points)
from each end. Finally, compute the sample mean of the
remaining points.
Note: Minitab prints the 5% trimmed mean.
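
The three steps can be sketched as a small function. The function name is illustrative, and truncating np/100 to a whole number when it is not an integer is an assumption of this sketch; software packages may round differently. The example uses a 10% trim on the 30 CPU times so that np/100 = 3 is exact.

```python
def trimmed_mean(values, p):
    """Return the p% trimmed mean: drop np/100 points from each end
    of the ordered sample and average what remains."""
    ordered = sorted(values)
    # Assumption: truncate when np/100 is not a whole number.
    k = int(len(ordered) * p / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

tm10 = trimmed_mean(cpu_times, 10)   # drops 9, 15, 19 and 82, 89, 139
print(f"10% trimmed mean = {tm10:.2f}")
```

The result, about 45.58, lies between the median (42.5) and the ordinary mean (48.23), because the outlier 139 no longer enters the average.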



Percentile and Quartile
The pth percentile of a sample, for a number p between 0 and 100,
divides the sample so that as nearly as possible p% of the sample
values are less than the pth percentile.

To find the percentiles, order the sample values from smallest to
largest. Then compute the quantity i = (n+1)p/100, where n is the
sample size. If this quantity is an integer, the sample value in this
position is the pth percentile. Otherwise, average the two sample
values on either side.

The first quartile, Q1, is the value that has approximately 25% of
the observations below it. It represents the median of the lower half
of the data and corresponds to the 25th percentile.
The second quartile or median is the 50th percentile.
The third quartile, Q3, has approximately 75% of the observations
below it and corresponds to the 75th percentile.
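
A minimal sketch of the position rule i = (n+1)p/100 is given below, applied to the 30 CPU times from the histogram example. The function name and the error raised when the position falls outside the sample are choices made for this sketch.

```python
import math

def percentile(values, p):
    """pth percentile by the rule i = (n+1)p/100: take the value in
    position i if i is an integer, otherwise average the two
    neighbouring values."""
    ordered = sorted(values)
    n = len(ordered)
    i = (n + 1) * p / 100
    if i < 1 or i > n:
        raise ValueError("position falls outside the sample")
    if i == int(i):
        return ordered[int(i) - 1]       # positions are 1-based
    lo = math.floor(i)
    return (ordered[lo - 1] + ordered[lo]) / 2

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

q1 = percentile(cpu_times, 25)   # i = 7.75  -> average of 30 and 34
q2 = percentile(cpu_times, 50)   # i = 15.5  -> average of 42 and 43
q3 = percentile(cpu_times, 75)   # i = 23.25 -> average of 59 and 62
print(q1, q2, q3)
```

This rule gives Q1 = 32, median = 42.5, and Q3 = 60.5. Statistical software often interpolates between the two neighbouring values instead of averaging them, so its quartiles can differ slightly (for these data, interpolation at the same positions gives 33.00 and 59.75).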
Measures of Variability: Variance and Standard
Deviation
The variance is the average of the squared deviations of the values from
the mean. The population variance ($\sigma^2$) is given by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

while the sample variance ($s^2$) is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The sample variance is a reasonable estimate of the population
variance.

The standard deviation is the square root of the variance.
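
As a check on the formula for $s^2$, the sketch below computes the sample variance of the 30 CPU times from the histogram example directly from the definition and compares it with `statistics.variance`, which also divides by n - 1.

```python
from statistics import variance

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

n = len(cpu_times)
x_bar = sum(cpu_times) / n                        # sample mean
ss = sum((x - x_bar) ** 2 for x in cpu_times)     # squared deviations
s2 = ss / (n - 1)                                 # sample variance
s = s2 ** 0.5                                     # standard deviation

print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"library value of s^2 = {variance(cpu_times):.2f}")
```

Both routes give a sample variance of about 703.15 and a standard deviation of about 26.52 seconds.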



Range and Inter Quartile Range
The Range (R) is the simplest measure of variation but is of limited use.

It is the difference between the largest and the smallest observations.

R = max(xi) - min(xi)
It is not commonly used, as it is based on only two observations and
is highly influenced by extreme values.

The Interquartile Range (IQR) is the range of the middle 50% of
the data.
IQR = Q3 - Q1
It is not influenced by outliers, but it is used to detect them.

Detection of outliers: Measure 1.5×(IQR) down from the first
quartile and up from the third quartile. All data points observed
outside of this interval are classified as outliers.
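
The detection rule can be sketched as follows for the 30 CPU times from the histogram example. The quartiles come from `statistics.quantiles` with its default "exclusive" method, which interpolates at position (n+1)p/100; this is an assumption of the sketch, and a different quartile convention would shift the fences slightly.

```python
from statistics import quantiles

cpu_times = [70, 36, 43, 69, 82, 48, 34, 62, 35, 15, 59, 139,
             46, 37, 42, 30, 55, 56, 36, 82, 38, 89, 54, 25,
             35, 24, 22, 9, 56, 19]

q1, q2, q3 = quantiles(cpu_times, n=4)   # quartiles (interpolated)
iqr = q3 - q1

lower = q1 - 1.5 * iqr                   # lower fence
upper = q3 + 1.5 * iqr                   # upper fence
outliers = [x for x in cpu_times if x < lower or x > upper]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: [{lower}, {upper}]")
print(f"outliers: {outliers}")
```

Here Q1 = 33.0 and Q3 = 59.75, so IQR = 26.75 and the fences are -7.125 and 99.875. Only the value 139 falls outside them.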
Example
To evaluate the effectiveness of a processor for a certain type of
task, a researcher recorded the CPU time (in seconds) for n = 30
randomly chosen jobs:
70 36 43 69 82 48 34 62 35 15
59 139 46 37 42 30 55 56 36 82
38 89 54 25 35 24 22 9 56 19

Minitab Output:
Descriptive Statistics: CPU Time

Variable   N   N*  Mean   SE Mean  StDev  Minimum  Q1     Median  Q3     Maximum
CPU Time   30  0   48.23  4.84     26.52  9.00     33.00  42.50   59.75  139.00



Boxplot
 The boxplot is a graphical display that simultaneously
describes several important features of a data set, such as
center, spread, departure from symmetry, and identification
of outliers.
 The plot is based on the five number summary:
(minimum; Q1; median; Q3; maximum)
 Comparative or side-by-side boxplots are a very effective
way of comparing two or more data sets consisting of
observations on the same variable: fuel efficiency
observations for four different types of automobiles, prices
for three different brands of notebooks, and so on.

In Minitab: Graph - Boxplot

Distribution Shape and Boxplot

[Figure: boxplots for different distribution shapes]


Example

The distribution of the CPU times is skewed to the right with
one outlier.



Comparative or Side-by-Side Boxplot
The following comparative boxplots represent the amount of internet traffic
handled by a certain center during a week. What we can see:
 Traffic is heaviest on Fridays and lightest on Saturdays and Sundays.
 The greatest spread occurs on Fridays and the least on Saturdays and
Sundays.
 The distributions all appear to be slightly right skewed, although there is
little skew in the distributions on Saturday and Sunday. There are large
outliers on Monday, Thursday, and Friday.



Exercises
(1) The following data set represents the number of new computer
accounts registered during ten consecutive days:
43 37 50 51 58 105 52 45 45 10
a) Compute the mean, median, quartiles, and standard deviation.
b) Delete the outliers and redo part (a).
c) Draw a conclusion about the effect of outliers.

(2) The numbers of blocked intrusion attempts on each day during
the first two weeks of the month were
56 47 49 37 38 60 50 43 43 59 50 56 54 58
After the change of firewall settings, the numbers of intrusions during the next
20 days were
53 21 32 49 45 38 44 33 32 43
53 46 36 48 39 35 37 36 39 45
Compare the number of intrusions before and after the change, construct
parallel boxplots, and comment on your findings.
Exercise
(3) Match each histogram to the boxplot that represents the
same data set.



Exercise
(4) A network provider investigates the load of its network. The
number of concurrent users (in thousands of people) is recorded
at 50 locations:
17.2 22.1 18.5 17.2 18.6 14.8 21.7 15.8 16.3 22.8

24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7

13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4

15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7

19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9

a) Compute the sample mean, variance, and standard deviation of
the number of concurrent users.
b) Compute the five-number summary and construct a boxplot.
c) Compute the interquartile range. Are there any outliers?
d) It is reported that the number of concurrent users approximately
follows a normal distribution. Does the histogram support
this claim?