
LESSON 1: REVIEW OF BASIC STATISTICAL CONCEPTS

CA51018: STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION


PROFESSOR: Asst. Prof. Arnold Petalver, Ph.D.
UST-AMV College of Accountancy Second Semester | A.Y. 2023 - 2024

WHAT IS STATISTICS? (DOANE & SEWARD, 2019)
● Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
○ Root word: State
● A statistic is a single measure, reported as a number, used to summarize a sample data set; for example, the average height of students in a university.
● Examples of statistics – average height for the length of the gowns, maximum height to design the height of the doorways of the classrooms, etc.

KINDS OF STATISTICS (DOANE & SEWARD, 2019)
● Descriptive Statistics refers to the collection, presentation, and summary of data (either using charts and graphs or using numerical summary).
● Inferential Statistics refers to generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions.

STATISTICS (LEVINE ET AL., 2008)
● The branch of mathematics that transforms data into useful information for decision makers.
● Descriptive Statistics
○ Collecting, summarizing, and describing data
○ Collect data – Survey
○ Present data – Tables and graphs
○ Characterize data – Sample mean
● Inferential Statistics
○ Drawing conclusions and/or making decisions concerning a population based only on sample data
○ Estimation – Ex. Estimate the population mean weight using the sample mean weight
○ Hypothesis testing – Ex. Test the claim that the population mean weight is 120 pounds
● Drawing conclusions and/or making decisions concerning a population based on sample results.

Tasks in Statistics (Doane & Seward, 2019)

WHY STUDY STATISTICS? (LEVINE ET AL., 2008)
● Decision Makers Use Statistics To:
○ Present and describe business data and information properly
○ Draw conclusions about large populations, using information collected from samples
○ Make reliable forecasts about a business activity
○ Improve business processes

BASIC VOCABULARY OF STATISTICS (LEVINE ET AL., 2008)
● VARIABLE
○ A variable is a characteristic of an item or individual.
● DATA
○ Data are the different values associated with a variable.
● OPERATIONAL DEFINITIONS
○ Variable values are meaningless unless their variables have operational definitions, universally accepted meanings that are clear to all associated with an analysis.
● POPULATION
○ A population consists of all the items or individuals about which you want to draw a conclusion.
● SAMPLE
○ A sample is the portion of a population selected for analysis.
● PARAMETER
○ A parameter is a numerical measure that describes a characteristic of a population.
● STATISTIC
○ A statistic is a numerical measure that describes a characteristic of a sample.

POPULATION VS. SAMPLE (LEVINE ET AL., 2008)
● Population
○ Measures used to describe the population are called parameters
● Sample
○ Measures computed from sample data are called statistics

DATA MEASUREMENT
● Why collect data?
○ A marketing research analyst needs to assess the
effectiveness of a new television advertisement.
○ A pharmaceutical manufacturer needs to determine
whether a new drug is more effective than those
currently in use.
○ An operations manager wants to monitor a
manufacturing process to find out whether the

1 TJMS, JLLA
quality of product being manufactured is
conforming to company standards.
○ An auditor wants to review the financial transactions
of a company in order to determine whether the
company is in compliance with generally accepted
accounting principles.

SOURCES OF DATA
● Primary Sources: The data collector is the one using
the data for analysis.
○ Data from a market survey
○ Data collected from an experiment
○ Observed data
● Secondary Sources: The person performing data analysis is not the data collector
○ Analyzing census data
○ Examining data from print journals, from government agencies or data published on the internet

TYPES OF VARIABLES
● Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no.”
● Numerical (quantitative) variables have values that represent quantities.
● The difference between numerical and categorical data.
NOTE: Ambiguity is introduced when continuous data are rounded to whole numbers.
● The levels of measurement in data and ways of coding data.

LEVELS OF MEASUREMENT
CATEGORICAL VARIABLES
● A nominal scale classifies data into distinct categories in which no ranking is implied.
Categorical Variables | Categories
I-Phone Ownership | Yes / No
Type of Bank Accounts | Savings/Current/Investment
Cable TV Provider | Sky/Destiny
● An ordinal scale classifies data into distinct categories in which ranking is implied.
Categorical Variables | Categories
Student class designation | Freshman, Sophomore, Junior, Senior
Product satisfaction | Satisfied, Neutral, Unsatisfied
Faculty rank | Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings | AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D
Student Grades | A, B, C, D, F

LEVELS OF MEASUREMENT
NUMERICAL VARIABLES
● An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
● A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.

LEVELS OF MEASUREMENT
INTERVAL MEASUREMENT
● Data can not only be ranked, but also have meaningful intervals between scale points (e.g., difference between 60°F and 70°F is same as difference between 20°F and 30°F).
● Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average).
● Zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).
● Likert Scales
○ A special case of interval data frequently used in
survey research.

○ The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).
○ A neutral midpoint (“Neither Agree Nor Disagree”) is allowed if an odd number of scale points is used or omitted to force the respondent to “lean” one way or the other.
Likert coding: 1-5 scale | Likert coding: -2 to +2 scale
5 = Help a lot | +2 = Help a lot
4 = Help a little | +1 = Help a little
3 = No effect | 0 = No effect
2 = Hurt a little | −1 = Hurt a little
1 = Hurt a lot | −2 = Hurt a lot
○ Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is “the same” as the interval, say, from 3 to 4).
○ Ratios are not meaningful (e.g., here 4 is not twice 2).
○ Many statistical calculations can be performed (e.g., averages, correlations, etc.).
○ More variants of Likert scales:

LEVELS OF MEASUREMENT
RATIO MEASUREMENT
● Ratio data have all properties of nominal, ordinal and interval data types and also possess a meaningful zero (absence of quantity being measured).
● Because of this zero point, ratios of data values are meaningful (e.g., $20 million profit is twice as much as $10 million).
● Zero does not have to be observable in the data; it is an absolute reference point.
● Use the following procedure to recognize data types:
Question | If “Yes”
Q1. Is there a meaningful zero point? | Ratio data (all statistical operations are allowed)
Q2. Are intervals between scale points meaningful? | Interval data (common statistics allowed, e.g., means and standard deviations)
Q3. Do scale points represent rankings? | Ordinal data (restricted to certain types of nonparametric statistical tests)
Q4. Are there discrete categories? | Nominal data (only counting allowed, e.g., finding the mode)

INTERVAL AND RATIO SCALES
NUMERICAL VARIABLES
● Changing Data By Recoding
○ In order to simplify data or when exact data magnitude is of little interest, ratio data can be recoded downward into ordinal or nominal measurements (but not conversely).
■ For example, recode systolic blood pressure as “normal” (under 130), “elevated” (130 to 140), or “high” (over 140).
■ The above recoded data are ordinal (ranking is preserved), but intervals are unequal and some information is lost.

SAMPLING CONCEPTS
● A population involves all of the items one is interested in. It may be finite (e.g., all of the passengers on a plane) or effectively infinite (e.g., all of the Cokes produced in an ongoing bottling process).
● A sample is a subset of the population and involves looking only at some of the items selected from the population.

SAMPLE VS. CENSUS
● A sample involves looking only at some items selected from the population.
● A census is an examination of all items in a defined population.
● Why can’t the United States Census survey every person in the population? – mobility, undocumented workers, budget constraints, incomplete responses, etc.

PARAMETER VS. STATISTIC
● A parameter is a measurement or characteristic of the population.
● A parameter is usually unknown because we can rarely observe the entire population.
● A statistic is a numerical value computed from a sample.
● Statistics can be used as estimates of parameters found in the population.
● Symbols are used to represent population parameters and sample statistics.
● From a sample of n items, chosen from a population, we compute statistics that can be used as estimates of parameters found in the population.
● To avoid confusion, we use different symbols for each parameter and its corresponding statistic.
● Thus, the population mean is denoted by μ (the lowercase Greek letter mu), while the sample mean is x̄.
● The population proportion is denoted π (the lowercase Greek letter pi), while the sample proportion is p.
● Target Population
○ The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.
■ The target population is the population we are interested in (e.g., U.S. gasoline prices).
■ The sampling frame is the group from which we take the sample (e.g., 115,000 stations).
■ The frame should not differ from the target population.

DATA PRESENTATION USING EXCEL

BENEFITS OF EXCEL’S ATTRACTIVE FEATURES
● Not having to incur the extra costs of using specialized statistical programs
● Familiarity with Excel
● Easy to use and easy to learn
● Allows users to reuse the same worksheet-based data that they have created for other business purposes
● Some graphical functions produce more vivid visual outputs.

UNIVARIATE AND BIVARIATE DATA
● Univariate: set of n observations in one variable, e.g., height of Grade 3 pupils, GWA of second year students, etc.
● Bivariate: set of n observations involving two variables, e.g., height and weight of newly born babies, monthly income and expenses of selected employees

ORGANIZING CATEGORICAL DATA

UNIVARIATE

SUMMARY TABLE
● A summary table indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories.
How do you spend the holidays? | Frequency | Percentage
At home with family | 25 | 50
Travel to visit family | 7 | 14
Vacation | 6 | 12
Catching up on work | 8 | 16
Other | 4 | 8

BAR CHART
● In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category.

PIE CHART
● The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

BIVARIATE

CROSS TABULATIONS: THE CONTINGENCY TABLE
● A cross-classification (contingency) table presents the results of two categorical variables. The joint responses are classified so that the categories of one variable are located in the rows and the categories of the other variable are located in the columns.
● The cell is the intersection of the row and column and the value in the cell represents the data corresponding to that specific pairing of row and column categories.
● A useful way to visually display the results of cross-classification data is by constructing a side-by-side bar chart.
● Example:
○ A survey was conducted to determine whether students stay in the Philippines during holidays. The result, classified by gender, is as follows:
Where do you stay during holidays? | Male | Female | Total
In the Philippines | 16 | 20 | 36
Other countries | 2 | 12 | 14
Total | 18 | 32 | 50

ORGANIZING NUMERICAL DATA

ORDERED ARRAY
● An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
Age of Surveyed College Students
Day Students: 16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students: 18 18 19 19 20 21 23 28 32 33 41 45

STEM AND LEAF DISPLAY
● A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.
Age of College Students
Day Students | Night Students
Stem | Leaf | Stem | Leaf
1 | 67788899 | 1 | 8899
2 | 0012257 | 2 | 0138
3 | 28 | 3 | 23
4 | 2 | 4 | 15

FREQUENCY DISTRIBUTION AND HISTOGRAM
● Bins and Bin Limits
○ A frequency distribution is a table formed by classifying n data values into k classes (bins)
○ Bin limits define the values to be included in each bin. Widths must all be the same.
○ Frequencies are the number of observations within each bin.
○ Express as relative frequencies (frequency divided by the total) or percentages (relative frequency times 100)
● Constructing a Frequency Distribution
1. Find the smallest and largest data values
Quiz Scores in Math 607
55 23 45 50 58
22 34 38 38 42
36 30 30 50 52
54 54 49 54 60
20 27 39 36 47
53 52 57 50 50
37 45 48 55 59
35 55 46 41 42
39 45 42 38 50
2. Choose the number of bins (k)
■ k should be much smaller than n
■ Too many bins results in sparsely populated bins, too few and dissimilar data values are lumped together
■ Herbert Sturges proposes the following rule:
Sample Size (n) | Suggested Number of Bins (k)
16 | 5
32 | 6
64 | 7
128 | 8
256 | 9
512 | 10
1024 | 11
3. Set the bin limits
Bin Width = (xmax - xmin) / k
■ For example, for k = 7 bins, the approximate bin width is:
Bin Width = (60 - 20) / 7 = 5.7 ≈ 6
4. Put the data values in the appropriate bin.
5. Create the table, you can include:
■ Frequencies - counts for each bin
■ Relative frequencies - absolute frequency divided by the total number of data
■ Cumulative frequency - accumulated relative frequency values as bin limit increases

THE HISTOGRAM (BAR CHART)
Quiz Score | Frequency | Relative Frequency | Percentage
20 - 25 | 3 | 0.07 | 6.67
26 - 31 | 3 | 0.07 | 6.67
32 - 37 | 5 | 0.11 | 11.11
38 - 43 | 9 | 0.20 | 20.00
44 - 49 | 7 | 0.16 | 15.56
50 - 55 | 14 | 0.31 | 31.11
56 - 61 | 4 | 0.09 | 8.89
Total | 45 | 1.00 | 100.00
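The five construction steps can be checked with a short Python sketch. It uses the 45 quiz scores and the lesson's choice of k = 7; the closed form of Sturges' rule, k = 1 + log2(n), and every variable name are assumptions added for illustration, not part of the original notes.

```python
import math

# Quiz scores in Math 607, copied from the lesson (n = 45).
scores = [
    55, 23, 45, 50, 58, 22, 34, 38, 38, 42,
    36, 30, 30, 50, 52, 54, 54, 49, 54, 60,
    20, 27, 39, 36, 47, 53, 52, 57, 50, 50,
    37, 45, 48, 55, 59, 35, 55, 46, 41, 42,
    39, 45, 42, 38, 50,
]

n = len(scores)

# Sturges' rule in closed form (assumed here): k = 1 + log2(n).
k_sturges = 1 + math.log2(n)

k = 7  # the number of bins chosen in the lesson
# Bin width = (xmax - xmin) / k, rounded up to a convenient whole number.
width = math.ceil((max(scores) - min(scores)) / k)

lower = min(scores)
frequencies = [0] * k
for x in scores:
    # Integer division finds which bin the score falls into.
    frequencies[(x - lower) // width] += 1

for i, f in enumerate(frequencies):
    lo = lower + i * width
    hi = lo + width - 1
    print(f"{lo} - {hi}: freq={f:2d}  rel={f / n:.2f}  pct={100 * f / n:.2f}")
```

Rounding the width up (5.7 to 6) keeps the largest score, 60, inside the last bin.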

THE HISTOGRAM IN EXCEL

THE POLYGON (LINE GRAPH)
● A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
● The cumulative percentage polygon, or ogive, displays the variable of interest along the X axis, and the cumulative percentages along the Y axis.
● NOTE: In a percentage polygon the vertical axis would be defined to show the percentage of observations per class

BIVARIATE DATA - SCATTER PLOTS
● Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables
● One variable is measured on the vertical axis and the other variable is measured on the horizontal axis

TIME SERIES - BIVARIATE DATA
● A time-series plot is used to study patterns in the values of a numerical variable over time. Each value is plotted as a point in two dimensions with the time period on the horizontal X axis and the variable of interest on the Y axis.

GRAPHICAL ERRORS
● Chart Junk

● No Relative Basis

● No Zero Point on the Vertical Axis


SCATTER PLOT IN EXCEL (97-2003)

CHAPTER SUMMARY
PRINCIPLES OF EXCELLENT GRAPHS
● The graph should not distort the data.
● The graph should not contain unnecessary
adornments (sometimes referred to as chart junk).
● The scale on the vertical axis should begin at zero.
● All axes should be properly labeled.
● The graph should contain a title.
● The simplest possible graph should be used for a given
set of data.

● In this chapter, we have
○ Organized categorical data using the summary table, bar chart, pie chart, and Pareto diagram.
○ Organized numerical data using the ordered array, stem and leaf display, frequency distribution, histogram, polygon, and ogive.
○ Examined cross tabulated data using the contingency table and side-by-side bar chart.
○ Developed scatter plots and time series graphs.
○ Examined the do’s and don'ts of graphically displaying data.

MEASURES OF CENTRAL TENDENCY, POSITION, AND VARIABILITY

DESCRIPTIVE STATISTICS
● The central tendency is the extent to which all the data values group around a typical or central value.
○ Single number that represents all the distribution
● The variation is the amount of dispersion, or scattering, of values.
● The shape is the pattern of the distribution of values from the lowest value to the highest value.
○ Normality of the data. (Skewness etc.)
● Used to describe the data. (Describe first before analysis)

NUMERICAL DESCRIPTION
Characteristic | Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion or Variation | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

FIVE MEASURES OF CENTRAL TENDENCY
Statistic | Formula | Excel Function | Pro | Con
Mean | x̄ = Σxᵢ / n | =AVERAGE(Data) | Familiar and uses all the sample information. | Influenced by extreme values.
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Ignores extremes and can be affected by gaps in data values.
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data.
Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data,Percent) | Mitigates effects of extreme values. | Excludes some data values that could be relevant.
Geometric mean (G) | G = (x₁ x₂ ⋯ xₙ)^(1/n) | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data.

ARITHMETIC MEAN
● The arithmetic mean (mean) is the most common measure of central tendency.
● For a sample of size n: x̄ = (X₁ + X₂ + ⋯ + Xₙ) / n
● The most common measure of central tendency
● Mean = sum of values divided by the number of values
● Affected by extreme values (outliers)
● Used if scale data

GEOMETRIC MEAN
● Geometric mean
○ Used to measure the rate of change of a variable over time.
● Geometric mean rate of return
○ Measures the status of an investment over time.
○ R̄G = [(1 + R₁)(1 + R₂) ⋯ (1 + Rₙ)]^(1/n) − 1
○ Where Ri is the rate of return in time period i
MEASURES OF CENTRAL TENDENCY LOCATING THE
MEDIAN
● The median of an ordered set of data is located at the (n + 1)/2 ranked value.
● If the number of values is odd, the median is the middle number.
● If the number of values is even, the median is the average of the two middle numbers.
● Note that (n + 1)/2 is NOT the value of the median, only the position of the median in the ranked data.
● Use the 1-year returns to compute the arithmetic mean and the geometric mean:
MODE
● Value that occurs most often
● Not affected by extreme values
● Used for categorical data
● Used for numerical data primarily when the data are grouped
● There may be no mode
● There may be several modes
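As a rough illustration of the mean, median, and mode, the sketch below applies Python's standard `statistics` module to the day students' ages from the ordered-array example. Pairing that data set with these measures is an assumption made for illustration; the notes do not work this example themselves.

```python
import statistics

# Ages of the surveyed day students (from the ordered-array example).
ages = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22, 22, 25, 27, 32, 38, 42]
n = len(ages)

mean = statistics.mean(ages)        # sum of values divided by n

# The median sits at ranked position (n + 1) / 2; with n = 18 that is 9.5,
# so the median is the average of the 9th and 10th ranked values.
median_position = (n + 1) / 2
median = statistics.median(ages)

# multimode returns every value tied for the highest frequency,
# which covers the "several modes" case.
modes = statistics.multimode(ages)

print(f"mean={mean:.2f}  median={median}  modes={modes}")
```

The few older students pull the mean (about 22.8) above the median (20), which is the sensitivity to extreme values noted for the mean.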
Geometric Mean Example
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at end of year two:
1. Convert the rates, –50% and 100%, to decimals (divide by 100) to get –0.5 and 1.00
2. Add 1 to each decimal, which yields 0.5 and 2
3. Find the geometric mean using the GEOMEAN function
4. Subtract 1 from the answer to get a decimal
5. Convert back to a rate of return (% icon).

TRIMMED MEAN
● Used if your data has outliers
● To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
● To determine how many observations to trim, multiply k by n and round off the result.
● Let us say that k × n = 3.4, which rounds to 3. So, we would remove the three smallest and three largest observations before averaging the remaining values.

CENTRAL TENDENCY
● Growth Rates
○ A variation on the geometric mean used to find the average growth rate for a time series.

● The average growth rate is given by taking the geometric mean of the ratios of each year’s revenue to the preceding year.
● Due to cancellations, only the first and last years are relevant: GR = (xₙ / x₁)^(1/(n − 1)) − 1
● Here GR = (2363 / 635)^(1/4) − 1 = 0.389, or 38.9% per year. In Excel, we would use

=(2363/635)^(1/4)-1
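The two geometric-mean calculations in this part of the lesson (the $100,000 investment and the growth rate from 635 to 2363) can be reproduced in a few lines of Python. The function names and the `periods` parameter are hypothetical, chosen only for this sketch.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return for per-period returns given as decimals."""
    growth_factors = [1 + r for r in returns]          # step 2: add 1
    g = math.prod(growth_factors) ** (1 / len(growth_factors))  # step 3
    return g - 1                                       # step 4: subtract 1

def average_growth_rate(first, last, periods):
    """Average growth per period using only the first and last values."""
    return (last / first) ** (1 / periods) - 1

# $100,000 falls to $50,000 (-50%) and then recovers to $100,000 (+100%).
returns = [-0.50, 1.00]
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

# Same calculation as the Excel formula =(2363/635)^(1/4)-1.
growth = average_growth_rate(635, 2363, 4)

print(f"arithmetic mean return: {arithmetic:.1%}")
print(f"geometric mean return:  {geometric:.1%}")
print(f"average growth rate:    {growth:.1%}")
```

The arithmetic mean suggests a 25% average return even though the investment ended where it started; the geometric mean of 0% matches what actually happened.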

MEDIAN
● In an ordered array, the median is the “middle”
number (50% above, 50% below)
● Not affected by extreme values.
● Used if ordinal data
LOCATING EXTREME OUTLIERS Z-SCORE
● To compute the Z-score of a data value, subtract the mean and divide by the standard deviation.
● The Z-score is the number of standard deviations a data value is from the mean.
● A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
● The larger the absolute value of the Z-score, the farther the data value is from the mean.
● Suppose the mean math SAT score is 490, with a standard deviation of 100.
● Compute the z-score for a test score of 620: Z = (620 − 490) / 100 = 1.3
● A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
● Suppose the mean math SAT score is 490, with a standard deviation of 100.
● Compute the scores that lie 3 standard deviations from the mean: 490 ± 3(100) gives 190 and 790.
● A score less than 190 or greater than 790 would be considered an outlier.

QUARTILE MEASURES
● Quartiles split the ranked data into 4 segments with an equal number of values per segment.
● The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
● Q2 is the same as the median (50% are smaller, 50% are larger).
● Only 25% of the values are greater than the third quartile.

QUARTILE MEASURES GUIDELINES
● Rule 1: If the result is a whole number, then the quartile is equal to that ranked value.
● Rule 2: If the result is a fractional half (2.5, 3.5, etc.), then the quartile is equal to the average of the corresponding ranked values.
● Rule 3: If the result is neither a whole number nor a fractional half, you round the result to the nearest integer and select that ranked value.

QUARTILE MEASURES LOCATING THE FIRST QUARTILE
● Example: Find the first quartile
○ Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
■ First, note that n = 9.
■ Q1 is in the (9+1)/4 = 2.5 ranked value of the ranked data, so use the value halfway between the 2nd and 3rd ranked values, so Q1 = 12.5
■ Q1 and Q3 are measures of non-central location; Q2 = median, a measure of central tendency.

DISPERSION
● Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

MEASURES OF VARIATION
Statistic | Formula | Excel Formula | Pro | Con
Range | xmax − xmin | =MAX(Data) - MIN(Data) | Easy to calculate | Sensitive to extreme data values.
Variance (s²) | Σ(xᵢ − x̄)² / (n − 1) | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning.
Standard deviation (s) | √s² | =STDEV(Data) | Most common measure. Uses the same units as the raw data ($, £, ¥, etc.). | Non-intuitive meaning.
Coefficient of variation (CV) | 100 × s / x̄ | None | Measures relative variation in percent so can compare data sets. | Requires non-negative data.
Mean absolute deviation (MAD) | Σ|xᵢ − x̄| / n | =AVEDEV(Data) | Easy to understand. | Lacks “nice” theoretical properties.

MEASURES OF VARIATION RANGE
● Simplest measure of variation
● Difference between the largest and the smallest values: Range = Xlargest − Xsmallest

DISADVANTAGES OF THE RANGE
● Ignores the way in which data are distributed.
● Sensitive to outliers

VARIANCE
● The population variance (σ²) is defined as the sum of squared deviations around the mean μ divided by the population size: σ² = Σ(xᵢ − μ)² / N
● For the sample variance (s²), we divide by n – 1 instead of n, otherwise s² would tend to underestimate the unknown population variance σ²: s² = Σ(xᵢ − x̄)² / (n − 1)

STANDARD DEVIATION
● The square root of the variance.
● Explains how individual values in a data set vary from the mean.
● Units of measure are the same as X.

COMPARING THE STANDARD DEVIATION

COEFFICIENT OF VARIATION
● The coefficient of variation (CV) is the ratio of the standard deviation to the mean. The higher the coefficient of variation, the greater the level of dispersion around the mean.
● Useful for comparing variables measured in different units or with different means.
● A unit-free measure of dispersion
● Expressed as a percent of the mean: CV = 100 × s / x̄
● Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
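The dispersion measures in this part of the lesson (variance, standard deviation, coefficient of variation, and the mean absolute deviation covered next) can be compared side by side in a minimal Python sketch. The data values are hypothetical and were picked only so the arithmetic is easy to follow by hand.

```python
import statistics

# Hypothetical sample; its mean is 5 and its squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = statistics.mean(data)
value_range = max(data) - min(data)

# Population variance divides by N; sample variance divides by n - 1.
population_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

# Coefficient of variation: standard deviation as a percent of the mean.
cv_percent = 100 * sample_std / mean

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean) for x in data) / n

print(f"range={value_range}  pop var={population_variance}  "
      f"sample var={sample_variance:.3f}")
print(f"s={sample_std:.3f}  CV={cv_percent:.1f}%  MAD={mad}")
```

Dividing by n − 1 makes the sample variance (32/7) larger than the population formula would give (32/8), which is the correction for underestimation described above.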

MEAN ABSOLUTE DEVIATION
● The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
● Uses absolute values of the deviations around the mean.

CENTRAL TENDENCY VS. DISPERSION
● Job Performance
○ Consider student ratings of four professors on eight teaching attributes (10-point scale).
○ A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?

SKEWNESS AND KURTOSIS
● Skewness
○ Generally, skewness may be indicated by looking at the sample histogram.
○ This visual indicator is imprecise and does not take into consideration sample size n.
○ Skewness may be indicated by comparing the mean and median.
○ Skewness is a unit-free statistic.
○ The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).
○ Calculate the sample’s skewness coefficient as: Skewness = [n / ((n − 1)(n − 2))] × Σ((xᵢ − x̄) / s)³
○ In Excel, use Data Analysis/Descriptive Statistics or the function =SKEW(array)
○ Consider the following table showing the 90% range for the sample skewness coefficient.
○ Coefficients within the 90% range may be attributed to random variation.
○ Coefficients outside the range suggest the sample came from a nonnormal population.
○ As n increases, the range of chance variation narrows.
● Kurtosis
○ Kurtosis is the relative length of the tails and the degree of concentration in the center.
○ Consider three kurtosis prototype shapes.

○ A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
○ Excel and MINITAB calculate kurtosis as: Kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × Σ((xᵢ − x̄) / s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))
○ Consider the following table of expected 90% range for sample kurtosis coefficient.
○ A sample coefficient within the ranges may be attributed to chance variation.
○ Coefficients outside the range would suggest the sample differs from a normal population.
○ As sample size increases, the chance range narrows.
○ Inferences about kurtosis are risky for n < 50.

BIVARIATE DATA
● Sample Covariance
○ The sample covariance measures the strength of the linear relationship between two numerical variables.
○ The sample covariance: cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
○ The covariance is only concerned with the strength of the relationship.
○ No causal effect is implied.
○ Covariance between two random variables:
○ In Excel, use the statistical function COVAR; covariance is also available in Data Analysis.
○ cov(X,Y) > 0: X and Y tend to move in the same direction.
○ cov(X,Y) < 0: X and Y tend to move in opposite directions.
○ cov(X,Y) = 0: X and Y have no linear relationship (this alone does not make them independent).
● The Correlation Coefficient
○ Unit free
○ Ranges between –1 and 1
○ The closer to –1, the stronger the negative linear relationship
○ The closer to 1, the stronger the positive linear relationship
○ The closer to 0, the weaker any linear relationship
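A minimal sketch of the sample covariance and the correlation coefficient follows, assuming the usual n − 1 divisor for the sample covariance and r = cov(X, Y) / (sX sY). The function names and both data sets are hypothetical. The second data set shows that a covariance of zero only rules out a linear relationship: there Y is completely determined by X.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations, so it is unit free."""
    std_x = math.sqrt(sample_covariance(x, x))  # cov(X, X) is the variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Hypothetical paired observations that tend to rise together.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
cov_hs = sample_covariance(hours, score)
r_hs = correlation(hours, score)

# Y = X squared: fully dependent on X, yet the covariance is zero.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
cov_xy = sample_covariance(x, y)

print(f"cov(hours, score) = {cov_hs:.2f}, r = {r_hs:.3f}")
print(f"cov(x, x^2) = {cov_xy:.2f}")
```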

GENERAL DESCRIPTIVE STATS USING MICROSOFT EXCEL
1. Select Data tab.
2. Select Data Analysis.
3. Select Descriptive Statistics and click OK.
4. Enter the cell range.
5. Check the Summary Statistics box.
6. Click OK
● An example of a Microsoft Excel descriptive statistics output.

● The Correlation Coefficient Using Microsoft Excel
1. Select Data tab / Data Analysis
2. Choose Correlation from the selection menu
3. Click OK
4. Enter Input Range
5. Check Labels in First Row if data has labels
6. Click OK to get output

ETHICAL CONSIDERATIONS
● Should document both good and bad results.
● Should be presented in a fair, objective and neutral manner.
● Should not use inappropriate summary measures to distort facts.

NORMAL DISTRIBUTION AND TEST OF NORMALITY

CONTINUOUS RANDOM VARIABLE
● A variable that can assume any value on a continuum (can assume an uncountable number of values)
● Examples are as follows:
○ thickness of an item
○ time required to complete a task
○ temperature of a solution
○ Height
● Discrete: One which may take on only a countable number of distinct values
○ Class Size
○ Shoe Size

NORMAL DISTRIBUTION
● It is the most common continuous distribution.
● Also known as the Gaussian distribution or the bell curve.
● In this distribution, the probability that various values occur within certain ranges or intervals can be calculated.

NORMAL DISTRIBUTION PROPERTIES
● ‘Bell Shaped’
● Symmetrical
● Mean, Median and Mode are equal
● Location is characterized by the mean, μ
● Spread is characterized by the standard deviation, σ
● The random variable has an infinite theoretical range: -∞ to +∞

NORMAL DISTRIBUTION SHAPE
NORMAL PROBABILITIES
● Probability is measured by the area under the curve.

● The total area under the curve is 1.0, and the curve is
symmetric, so half is above the mean, half is below.

STANDARDIZED NORMAL DISTRIBUTION


● Also known as the “Z” distribution
● Mean is 0
● Standard Deviation is 1

● Values above the mean have positive Z-values, values below the mean have negative Z-values.
● Example:
○ If X is distributed normally with mean of 100 and standard deviation of 50, the Z value for X = 200 is Z = (200 − 100) / 50 = 2.0
○ This says that X = 200 is two standard deviations (2 increments of 50 units) above the mean of 100.
○ Note that the distribution is the same, only the scale has changed. We can express the problem in original units (X) or in standardized units (Z)
● Finding Normal Probability (Example 1)
○ Let X represent the time it takes (in seconds) to download an image file from the internet.
○ Suppose X is normal with mean 8.0 and standard deviation 5.0
○ Find P(X < 8.6)
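Both examples can be checked with `NormalDist` from Python's standard `statistics` module, used here as a stand-in for a printed Z table. This is a sketch: the variable names are assumptions, and the last line answers the inverse question posed in the lesson's "Try this" exercise (the download time below which 20% of downloads fall).

```python
from statistics import NormalDist

# Z value for X = 200 when the mean is 100 and the standard deviation is 50.
z_200 = (200 - 100) / 50

# Download time X is normal with mean 8.0 and standard deviation 5.0 seconds.
download = NormalDist(mu=8.0, sigma=5.0)

# P(X < 8.6) in original units ...
p_original = download.cdf(8.6)

# ... and the same probability after standardizing: Z = (8.6 - 8.0) / 5.0.
z_86 = (8.6 - 8.0) / 5.0
p_standardized = NormalDist().cdf(z_86)

# Inverse problem: the value x with P(X < x) = 0.20.
x_20 = download.inv_cdf(0.20)

print(f"Z for X=200: {z_200}")
print(f"P(X < 8.6) = {p_original:.4f} (via Z: {p_standardized:.4f})")
print(f"20% of downloads take less than {x_20:.2f} seconds")
```

Working in X or in Z gives the same area under the curve, which is the point of the note that only the scale changes.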

● Try this: Given the normal probability, find the value of x
○ Let X represent the time it takes (in seconds) to download an image file from the internet.
○ Suppose X is normal with mean 8.0 and standard deviation 5.0
○ Find X such that 20% of download times are less than X.

ASSESSING NORMALITY
● It is important to evaluate how well the data set is approximated by a normal distribution.
● Normally distributed data should approximate the theoretical normal distribution:
○ The normal distribution is bell shaped (symmetrical) where the mean is equal to the median.
○ The empirical rule applies to the normal distribution.
○ The interquartile range of a normal distribution is approximately 1.33 standard deviations (the exact value is closer to 1.35).
■ Interquartile Range: Q3 - Q1
● Construct charts or graphs
○ For small- or moderate-sized data sets, do the stem-and-leaf display and box-and-whisker plot look symmetric?
○ For large data sets, does the histogram or polygon appear bell-shaped?
● Compute descriptive summary measures
○ Do the mean, median and mode have similar values?
○ Is the interquartile range approximately 1.33 σ?
○ Is the range approximately 6 σ?
● Observe the distribution of the data set
○ Do approximately 2/3 of the observations lie within mean ± 1 standard deviation?
○ Do approximately 80% of the observations lie within mean ± 1.28 standard deviations?
○ Do approximately 95% of the observations lie within mean ± 2 standard deviations?
● Evaluate normal probability plot
○ Is the normal probability plot approximately linear with positive slope?
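The numerical checks in the list above can be bundled into one helper, sketched below. The function name `normality_checks`, the dictionary keys, and the sample itself are hypothetical. The sample is built from 200 evenly spaced quantiles of a normal distribution so the result is repeatable, and `statistics.quantiles` interpolates between ranked values, so its quartiles may differ slightly from a hand calculation.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

def normality_checks(data):
    """Return the summary measures the lesson suggests comparing."""
    m, s = mean(data), stdev(data)
    q1, _, q3 = quantiles(data, n=4)   # quartiles, with interpolation

    def share_within(k):
        # Fraction of observations within mean +/- k standard deviations.
        return sum(abs(x - m) <= k * s for x in data) / len(data)

    return {
        "mean": m,
        "median": median(data),
        "iqr_over_s": (q3 - q1) / s,                    # about 1.33 to 1.35
        "range_over_s": (max(data) - min(data)) / s,    # about 6 for large n
        "within_1s": share_within(1),                   # about 2/3
        "within_1_28s": share_within(1.28),             # about 80%
        "within_2s": share_within(2),                   # about 95%
    }

# Hypothetical, deterministic sample that follows a normal shape.
model = NormalDist(mu=8.0, sigma=5.0)
sample = [model.inv_cdf((i + 0.5) / 200) for i in range(200)]

report = normality_checks(sample)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

On real data these figures will only be roughly right; large departures on several checks at once are what point away from normality.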

THE EMPIRICAL RULE AS APPLIED TO THE NORMAL DISTRIBUTION
● This rule states that for symmetrical bell-shaped data sets, one can find that roughly two out of every three observations are contained within a distance of 1 standard deviation around the mean and roughly.

THE NORMAL PROBABILITY PLOT
● A normal probability plot for data from a normal distribution will be approximately linear:

OTHER WAYS OF ASSESSING NORMALITY OF DATA
● Checking for skewness with the Pearson coefficient (PC) of skewness as PC = 3(mean − median) / s
○ Note: The data is considered significantly skewed when PC is greater than or equal to +1 or less than or equal to -1.
● Checking for outliers
○ Note: An outlier is a data value that lies more than 1.5(IQR) units below Q1 or 1.5(IQR) units above Q3.

EXPLORATORY DATA ANALYSIS

THE FIVE NUMBER SUMMARY
● The five numbers that describe the spread of data are:
○ Minimum
○ First Quartile (Q1)
○ Median (Q2)
○ Third Quartile (Q3)
○ Maximum

EXPLORATORY DATA ANALYSIS

THE BOX-AND-WHISKER PLOT
● The Box-and-Whisker Plot is a graphical display of the five number summary.
● The box and central line are centered between the endpoints if data are symmetric around the median.
● A Box-and-Whisker plot can be shown in either vertical or horizontal format.
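The five-number summary, the 1.5(IQR) outlier check, and the Pearson coefficient of skewness can be tied together in one sketch. It reuses the lesson's ordered array (11 12 13 16 16 17 18 21 22) and its three quartile rules. The Q3 position 3(n + 1)/4, the formula PC = 3(mean − median)/s, and all function names are assumptions made for illustration.

```python
import statistics

def ranked_value(sorted_data, position):
    """Apply the lesson's three quartile rules to a 1-based ranked position."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:        # Rule 1: whole number, take that ranked value
        return sorted_data[whole - 1]
    if fraction == 0.5:      # Rule 2: fractional half, average the neighbors
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # Rule 3: otherwise round to the nearest integer position
    return sorted_data[round(position) - 1]

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum (assumes at least two values)."""
    xs = sorted(data)
    n = len(xs)
    q1 = ranked_value(xs, (n + 1) / 4)
    q2 = ranked_value(xs, (n + 1) / 2)
    q3 = ranked_value(xs, 3 * (n + 1) / 4)
    return xs[0], q1, q2, q3, xs[-1]

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
minimum, q1, q2, q3, maximum = five_number_summary(data)

# Outlier fences: 1.5 * IQR below Q1 and above Q3.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Pearson coefficient of skewness (assumed form: 3 * (mean - median) / s).
pc = 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)

print(f"five-number summary: {minimum}, {q1}, {q2}, {q3}, {maximum}")
print(f"fences: [{lower_fence}, {upper_fence}]  outliers: {outliers}")
print(f"Pearson coefficient of skewness: {pc:.3f}")
```

For this sample Q1 = 12.5 agrees with the worked example, no value falls outside the fences, and PC is well inside ±1, so the data would not be flagged as significantly skewed.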
