
Introduction To Statistics

Recommended literature:
Hanke, J.E. - Reitsch, A.G.: Understanding Business Statistics
Anderson, D.R. - Sweeney, D.J. - Williams, T.A.: Statistics for Business and Economics. South-Western Pub., 2005, 320 p., ISBN 978-032-422-486-3
Jaisingh, L.R.: Statistics for the Utterly Confused. McGraw Hill, 2005, 352 p., ISBN 978-007-146-193-1
Everitt, B.S.: The Cambridge Dictionary of Statistics. Cambridge University Press, 2006, 442 p., ISBN 978-052-169-027-0
Illowsky, B. - Dean, S. (2009, August 5). Collaborative Statistics. Retrieved from the Connexions Web site: http://cnx.org/content/col10522/1.36
Why study statistics?

1. Data are everywhere
2. Statistical techniques are used to make many decisions that affect our lives
3. No matter what your career, you will make professional decisions that involve data. An understanding of statistical methods will help you make these decisions effectively.
Applications of statistical concepts in the
business world
Finance – correlation and regression, index numbers,
time series analysis
Marketing – hypothesis testing, chi-square tests,
nonparametric statistics
Personnel – hypothesis testing, chi-square tests, nonparametric tests
Operations management – hypothesis testing, estimation, analysis of variance, time series analysis
Statistics
The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.
Statistical analysis – used to manipulate, summarize, and investigate data, so that useful decision-making information results.
Types of statistics
Descriptive statistics – Methods of organizing,
summarizing, and presenting data in an informative
way
Inferential statistics – The methods used to determine
something about a population on the basis of a sample
Population –The entire set of individuals or objects of
interest or the measurements obtained from all
individuals or objects of interest
Sample – A portion, or part, of the population of interest
Inferential Statistics
 Estimation
e.g., estimate the population mean weight using the sample mean weight
 Hypothesis testing
e.g., test the claim that the population mean weight is 70 kg

Inference is the process of drawing conclusions or making decisions about a population based on sample results.
Sampling
A sample should have the same characteristics as the population it is representing.
Sampling can be:
 with replacement: a member of the population may be chosen more than once (picking the candy from the bowl)
 without replacement: a member of the population may be chosen only once (lottery ticket)
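The two schemes can be sketched in a few lines of Python (the candy-bowl population and its colors are illustrative, not from the lecture):

```python
import random

random.seed(0)  # for reproducibility
bowl = ["red", "blue", "green", "yellow", "white"]  # hypothetical population

# With replacement: the same member may be chosen more than once,
# like picking a candy from the bowl and putting it back each time.
with_replacement = random.choices(bowl, k=3)

# Without replacement: each member may be chosen at most once,
# like drawing lottery tickets.
without_replacement = random.sample(bowl, k=3)

print(with_replacement, without_replacement)
```

Note that `random.sample` can never repeat a member, while `random.choices` may.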
Sampling methods
Sampling methods can be:
 random (each member of the population has an equal chance
of being selected)
 nonrandom

The actual process of sampling causes sampling errors. For example, the sample may not be large enough or representative of the population. Factors not related to the sampling process cause nonsampling errors. A defective counting device can cause a nonsampling error.
Random sampling methods
 simple random sample (each sample of the same size has
an equal chance of being selected)
 stratified sample (divide the population into groups called
strata and then take a sample from each stratum)
 cluster sample (divide the population into strata and then
randomly select some of the strata. All the members from
these strata are in the cluster sample.)
 systematic sample (randomly select a starting point and
take every n-th piece of data from a listing of the
population)
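Three of these methods can be sketched in Python on a hypothetical population of 100 numbered members (the two strata below are an illustrative split, not from the lecture):

```python
import random

random.seed(1)
population = list(range(1, 101))  # hypothetical population of 100 members

# Simple random sample: each sample of the same size is equally likely.
simple = random.sample(population, k=10)

# Systematic sample: random starting point, then every n-th member.
n = 10
start = random.randrange(n)
systematic = population[start::n]

# Stratified sample: divide the population into strata,
# then draw a sample from each stratum.
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [member for stratum in strata.values()
              for member in random.sample(stratum, k=5)]

print(len(simple), len(systematic), len(stratified))
```

A cluster sample would instead randomly select whole strata and keep every member of the selected strata.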
Descriptive Statistics

Collect data
e.g., Survey

Present data
e.g., Tables and graphs

Summarize data
e.g., Sample mean = (ΣXᵢ)/n
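The sample-mean formula, x̄ = (ΣXᵢ)/n, amounts to one line of Python (the weights below are a hypothetical sample, in kg):

```python
weights = [68.0, 72.5, 70.5, 65.0, 74.0]  # hypothetical sample of weights (kg)

n = len(weights)
sample_mean = sum(weights) / n  # x̄ = (Σ Xᵢ) / n

print(sample_mean)  # → 70.0
```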
Statistical data
The collection of data that are relevant to the problem
being studied is commonly the most difficult, expensive,
and time-consuming part of the entire research project.
Statistical data are usually obtained by counting or
measuring items.
Primary data are collected specifically for the analysis
desired
Secondary data have already been compiled and are
available for statistical analysis
A variable is an item of interest that can take on many
different numerical values.
A constant has a fixed numerical value.
Data
Statistical data are usually obtained by counting or
measuring items. Most data can be put into the
following categories:
Qualitative – data are measurements that each fall into one of several categories (hair color, ethnic group, and other attributes of the population)
Quantitative – data are observations that are measured on a numerical scale (distance traveled to college, number of children in a family, etc.)
Qualitative data
Qualitative data are generally described by words or
letters. They are not as widely used as quantitative data
because many numerical techniques do not apply to the
qualitative data. For example, it does not make sense to
find an average hair color or blood type.
Qualitative data can be separated into two subgroups:
 dichotomic (if it takes the form of a word with two options: gender – male or female)
 polynomic (if it takes the form of a word with more than two options: education – primary school, secondary school, and university)
Quantitative data
Quantitative data are always numbers and are the
result of counting or measuring attributes of a population.
Quantitative data can be separated into two
subgroups:
 discrete (if it is the result of counting (the number of
students of a given ethnic group in a class, the number of
books on a shelf, ...)
 continuous (if it is the result of measuring (distance
traveled, weight of luggage, …)
Types of variables
Variables
 Qualitative
   dichotomic (gender, marital status)
   polynomic (brand of PC, hair color)
 Quantitative
   discrete (children in family, strokes on a golf hole)
   continuous (amount of income tax paid, weight of a student)
Numerical scale of measurement:
 Nominal – consist of categories in each of which the number of respective
observations is recorded. The categories are in no logical order and have
no particular relationship. The categories are said to be mutually exclusive
since an individual, object, or measurement can be included in only one of
them.
 Ordinal – contains more information. Consists of distinct categories in which order is implied. Values in one category are larger or smaller than values in other categories (e.g. rating: excellent, good, fair, poor).
 Interval – is a set of numerical measurements in which the distance between numbers is of a known, constant size.
 Ratio – consists of numerical measurements where the distance between numbers is of a known, constant size; in addition, there is a nonarbitrary zero point.
References
 An Introduction to Statistical Learning, with Applications in R (2013), by G. James, D. Witten, T. Hastie, and R. Tibshirani.
 The Elements of Statistical Learning (2009), by T. Hastie, R. Tibshirani, and J. Friedman.
 Learning from Data: A Short Course (2012), by Y. Abu-Mostafa, M. Magdon-Ismail, and H. Lin.
 Machine Learning: A Probabilistic Perspective (2012), by K. Murphy.
 R and Data Mining: Examples and Case Studies (2013), by Y. Zhao.


Lesson Goals:
 Understand the basic concepts of the learning problem and why/how machine learning methods are used to learn from data to find underlying patterns for prediction and decision-making.
 Understand the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
 Differentiate between supervised and unsupervised learning methods as well as regression versus classification methods.
 Understand the basic concepts of assessing model accuracy and the bias-variance trade-off.
 Become familiar with using the R statistical programming language and practice by exploring data using basic statistical analysis.
Big Data is Everywhere
We are in the era of big data!
40 billion indexed web pages
100 hours of video are uploaded
to YouTube every minute
The deluge of data calls for
automated methods of data
analysis, which is what
machine learning provides!
What is Machine Learning?
Machine learning is a set of methods that can
automatically detect patterns in data.

These uncovered patterns are then used to predict future data, or to perform other kinds of decision-making under uncertainty.

The key premise is learning from data!!
What is Machine Learning?
Addresses the problem of analyzing huge bodies of data so that they can be understood.
Provides techniques to automate the analysis and exploration of large, complex data sets.
Tools, methodologies, and theories for revealing patterns in data – a critical step in knowledge discovery.
What is Machine Learning?
Driving Forces:
Explosive growth of data in a great variety of fields
 Cheaper storage devices with higher capacity
 Faster communication
 Better database management systems
Rapidly increasing computing power

We want to make the data work for us!!
Examples of Learning Problems
 Machine learning plays a key role in many areas of science, finance
and industry:
 Predict whether a patient, hospitalized due to a heart attack, will have a second
heart attack. The prediction is to be based on demographic, diet and clinical
measurements for that patient.
 Predict the price of a stock in 6 months from now, on the basis of company
performance measures and economic data.
 Identify the numbers in a handwritten ZIP code, from a digitized image.
 Estimate the amount of glucose in the blood of a diabetic person, from the
infrared absorption spectrum of that person’s blood.
 Identify the risk factors for prostate cancer, based on clinical and demographic
variables.
Research Fields
Statistics / Statistical Learning
Data Mining
Pattern Recognition
Artificial Intelligence
Databases
Signal Processing
Applications
Business
Walmart data warehouse mined for advertising and
logistics
Credit card companies mined for fraudulent use of your
card based on purchase patterns
Netflix developed movie recommender system
Genomics
Human genome project: collection of DNA sequences,
microarray data
Applications (cont.)
Information Retrieval
Terabytes of data on the internet, multimedia information
(video/audio files)

Communication Systems
Speech recognition, image analysis
The Learning Problem
Learning from data is used in situations where we don't have any analytic solution, but we do have data that we can use to construct an empirical solution.

The basic premise of learning from data is the use of a set of observations to uncover an underlying process.
The Learning Problem (cont.)
The Learning Problem: Example

[Figure: scatter plot of simulated data, y versus x for x in [0, 1]]
The Learning Problem: Example (cont.)
Different estimates for the target function f that depend on the
standard deviation of the ε’s
Why do we estimate f?
We use modern machine learning methods to estimate
f by learning from the data.
The target function f is unknown.
We estimate f for two key purposes:
Prediction
Inference
Prediction
Prediction (cont.)
Prediction (cont.)
Prediction (cont.)
Average of the squared difference between the predicted and actual value of Y.
Var(ε) represents the variance associated with ε.
Our aim is to minimize the reducible error!!
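Written out (in the notation of the ISLR reference listed above, where Y = f(X) + ε and Ŷ = f̂(X), with X and f̂ held fixed), the decomposition behind these statements is:

```latex
E(Y - \hat{Y})^2
  = \underbrace{\bigl[f(X) - \hat{f}(X)\bigr]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}
```

The first term shrinks as the estimate f̂ improves; the second is a floor set by the noise ε that no method can remove.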


Example: Direct Mailing Prediction
 We are interested in predicting how much money an
individual will donate based on observations from 90,000
people on which we have recorded over 400 different
characteristics.
 We don’t care too much about each individual
characteristic.
 Learning Problem:
 For a given individual, should I send out a mailing?
Inference
Instead of prediction, we may also be interested in the
type of relationship between Y and the X’s.
Key questions:
Which predictors actually affect the response?
Is the relationship positive or negative?
Is the relationship a simple linear one or is it more
complicated?
Example: Housing Inference
We wish to predict median house price based on
numerous variables.
We want to learn which variables have the largest
effect on the response and how big the effect is.
For example, how much impact does the number of
bedrooms have on the house value?
How do we estimate f?
First, we assume that we have observed a set of
training data.

{(X1, Y1), (X2, Y2), …, (Xn, Yn)}
Second, we use the training data and a machine
learning method to estimate f.
Parametric or non-parametric methods
Parametric Methods
This reduces the learning problem of estimating the
target function f down to a problem of estimating a set
of parameters.

This involves a two-step approach…


Parametric Methods (cont.)
Step 1:
Make some assumptions about the functional form of f.
The most common example is a linear model:

f(Xi) = β0 + β1Xi1 + β2Xi2 + … + βpXip

In this course, we will examine far more complicated and flexible models for f.
Parametric Methods (cont.)
Step 2:
We use the training data to fit the model (i.e., estimate the unknown parameters of f).
The most common approach for estimating the parameters in a linear model is via ordinary least squares (OLS) linear regression.
However, there are superior approaches, as we will see in this course.
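The two steps can be sketched for a single predictor using the textbook OLS formulas (the data are hypothetical; in practice you would call a routine such as R's lm()):

```python
# Hypothetical training data, roughly y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

# Step 1: assume a linear form f(x) = b0 + b1*x.
# Step 2: estimate b0 and b1 from the training data by ordinary least squares.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

print(round(b0, 2), round(b1, 2))  # → 0.11 1.97
```

The fitted slope (about 1.97) and intercept (about 0.11) recover the underlying line y ≈ 2x despite the noise.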
Example: Income vs. Education and Seniority
Example: OLS Regression Estimate
Even if the standard deviation is low, we will still get a
bad answer if we use the incorrect model.
Non-Parametric Methods
As opposed to parametric methods, these do not make
explicit assumptions about the functional form of f.
Advantages:
Accurately fit a wider range of possible shapes of f.
Disadvantages:
Requires a very large number of observations to acquire
an accurate estimate of f.
Example: Thin-Plate Spline Estimate
 Non-linear regression methods are more flexible and can potentially provide more accurate estimates.
 However, these methods can run the risk of over-fitting the data (i.e. following the errors, or noise, too closely), so too much flexibility can produce poor estimates for f.
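The over-fitting risk can be illustrated by comparing a maximally flexible model (1-nearest neighbour, which memorizes the training set) with a simple OLS line; the data-generating process y = 2x + ε below is hypothetical:

```python
import random

random.seed(3)

# Hypothetical data: y = 2x + Gaussian noise
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 0.3))
            for x in (random.uniform(0, 1) for _ in range(n))]

train, test = make_data(30), make_data(30)

# Maximally flexible model: 1-nearest neighbour memorizes the training
# data, so it follows the noise exactly.
def knn_predict(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Simple parametric model: OLS line fitted to the training data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
b1 = (sum((x - mx) * (y - my) for x, y in train)
      / sum((x - mx) ** 2 for x, _ in train))
b0 = my - b1 * mx

def ols_predict(x):
    return b0 + b1 * x

def mse(model, data):
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

print("1-NN train MSE:", mse(knn_predict, train))  # zero: noise memorized
print("1-NN test MSE: ", mse(knn_predict, test))
print("OLS  test MSE: ", mse(ols_predict, test))
```

The 1-NN training error is exactly zero by construction; its test error on fresh data, however, typically exceeds that of the straight line when the true relationship is close to linear, which is the over-fitting pattern described above.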
Predictive Accuracy vs. Interpretability
Conceptual Question:
Why not just use a more flexible method if it is more
realistic?

Reason 1:
A simple method (such as OLS regression) produces a
model that is easier to interpret (especially for inference
purposes).
