Agenda
1) Orientation
2) Course Overview & Challenges
3) Introduction to Data Analysis / Fundamental Principles
ORIENTATION
About Me:
– Matthew Beck
– Rm 444; Merewether Building (H04)
– matthew.beck@sydney.edu.au
– 02 9114 1834
Course Overview
Learning Objectives
• Worthwhile addition?
• Machine performance?
• What colour package?
• Monitor production levels
• What method best to ship?
• What markets to ship to?
“To increase the business value of information, we need data from various angles. In
addition to sales data, we need to know why sales increased. We need to know how
and where we are influential.”
CEO, German Pharmaceutical Company
“With thousands of customers, products, and contractual terms and conditions, pricing
and incentive models become very complex. Analytics is a key way to get this
complexity under control. But we are not yet good enough in this regard.”
CEO, Japanese Electronics Manufacturer
Information is Valuable:
“On one NBA team, an analyst quickly gained the reputation as being the smartest
guy in the room but had virtually no impact on the decision-making process...while the
work he was producing was innovative, it was wasted because he did not have the
ability to communicate it in a manner that was understandable to the decision-makers.”
“It's not just about the data. It's what you do with the data.”
Mike Rhodin, Senior Vice President, IBM
Course Overview
Optimisation Modelling:
– Linear Programming (optimal allocation of limited resources)
Simulation Modelling:
– Incorporating uncertainty into the modelling process (probabilities)
Course Overview
What is Data?
– https://www.kaggle.com/c/flight
Class Activity
Descriptive Statistics:
– Procedures that describe the data we are studying
– Results help us organize and understand the data
– The results cannot be generalized to any larger group
Inferential Statistics:
– Trying to reach conclusions that extend past the immediate data
– How might the population behave based on our collected data
– The probability that our result is systematic rather than due to random chance
Hypothesis Testing:
– Also called tests of significance
– Chi-square tests, t-tests, F-tests and regression models are examples
– Typically tests for differences or relationships between variables
Predictive Analysis:
– Regression modelling (and others)
– Determine patterns and predict future outcomes and trends
Types of Data
– Categorical
– Continuous
Class Activity
Nominal Data
[Figure: two dot plots. Left panel (axis 0 to 14): Mode = 9. Right panel (axis 0 to 6): No Mode.]
The University of Sydney Page 27
1. Descriptive Statistics
Ordinal Data
In an ordered array:
– The median is the “middle” number
– 50% of observations are above this point
– 50% of observations are below this point
– (n+1)/2 gives the position of the median
[Figure: two dot plots on an axis from 13 to 21. Left panel: Median = 19. Right panel: Median = 18.]
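The (n+1)/2 position rule can be sketched in Python as follows. The sample values are hypothetical and chosen only for illustration. When n is even, the position falls halfway between two observations, and this sketch assumes the usual convention of averaging them.

```python
import math

def median_by_position(values):
    """Return the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)               # the rule applies to an ordered array
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position of the median
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2             # lower == upper when n is odd

odd_sample = [17, 13, 21, 15, 19]          # hypothetical data, n = 5, position 3
even_sample = [13, 15, 17, 19, 21, 40]     # hypothetical data, n = 6, position 3.5

print(median_by_position(odd_sample))
print(median_by_position(even_sample))
```

Half of the observations lie on each side of the returned value, and the large value 40 in the second sample does not pull the median upward.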
Interval Data
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$$
– Numerator: the sum of the observed values
– Denominator: the sample size
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
– $\bar{X}$ = arithmetic mean
– n = sample size
– $X_i$ = ith value of the variable X
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
– Has the same unit of measurement as the original data
– Provides a "standard" way of knowing what is “average” and what is
“extra large” or “extra small”
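As a worked illustration of the mean, sample variance and sample standard deviation, the following Python sketch computes each quantity step by step and cross-checks the result against the standard library. The sample values are hypothetical and are not taken from the slides.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations

n = len(sample)
mean = sum(sample) / n                   # sum of observed values / sample size

# Sum of squared deviations from the mean, divided by n - 1
squared_deviations = [(x - mean) ** 2 for x in sample]
variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance,
# so it is in the same units as the original data.
std_dev = math.sqrt(variance)

print(f"mean     = {mean:.3f}")
print(f"variance = {variance:.3f}")
print(f"std dev  = {std_dev:.3f}")

# Cross-check with the standard library's sample (n - 1) versions
print(statistics.variance(sample), statistics.stdev(sample))
```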
Data B (axis 11 to 21): Mean = 15.5, S = 0.926
Data C (axis 11 to 21): Mean = 15.5, S = 4.570
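The point that two data sets can share a mean yet differ widely in spread can be sketched as below. Both samples are hypothetical and are not the Data B and Data C values from the slides, which are not recoverable here.

```python
import statistics

tight = [9, 10, 10, 10, 10, 11]     # hypothetical: values cluster near the mean
wide = [2, 5, 10, 10, 15, 18]       # hypothetical: values spread far from the mean

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation (n - 1)
    print(f"{name}: mean = {mean}, S = {s:.3f}")
```

Both samples report the same mean, so only the standard deviation reveals how differently the observations are distributed around it.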
Ratio Data
Mode:
– Can really only be used when dealing with nominal data
Mean:
– When data is symmetric and continuous
Median:
– When the data is skewed (has extreme or outlying observations)
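A short Python sketch of these guidelines, using hypothetical data: the mode summarises a nominal variable, while a single extreme observation shifts the mean far more than the median.

```python
import statistics

# Nominal data: categories have no order, so only the mode applies
colours = ["red", "blue", "blue", "green", "blue", "red"]   # hypothetical
most_common = statistics.mode(colours)
print("mode:", most_common)

# Skewed data: one outlying observation (hypothetical incomes in $000s)
incomes = [42, 45, 47, 50, 52, 55, 400]
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(f"mean = {mean_income:.1f}, median = {median_income}")
```

The mean is pulled well above every typical income by the one outlier, whereas the median stays in the middle of the bulk of the data.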
Case-wise deletion:
– Cases or respondents with any missing responses are discarded
– Useful if extent of missing data is small
Pair-wise deletion:
– Only the piece of information requiring cleaning is discarded
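The difference between the two deletion strategies can be sketched in plain Python. The survey records and variable names are hypothetical, and `None` is assumed to mark a missing response.

```python
# Hypothetical survey responses; None marks a missing answer.
responses = [
    {"age": 34, "income": 52, "satisfaction": 4},
    {"age": 29, "income": None, "satisfaction": 5},
    {"age": None, "income": 61, "satisfaction": 3},
    {"age": 45, "income": 75, "satisfaction": None},
    {"age": 51, "income": 68, "satisfaction": 2},
]

def casewise_delete(rows):
    """Discard any respondent with at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def pairwise_delete(rows, *variables):
    """Discard a respondent only if a variable used in this analysis is missing."""
    return [row for row in rows if all(row[var] is not None for var in variables)]

complete_cases = casewise_delete(responses)
age_income = pairwise_delete(responses, "age", "income")
income_only = pairwise_delete(responses, "income")

print(len(complete_cases), len(age_income), len(income_only))
```

Case-wise deletion leaves the fewest records because one gap removes the whole respondent. Pair-wise deletion keeps more data, but each analysis then rests on a different subset of respondents.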
Case Substitution:
– Observations with missing data are replaced by choosing another non-sampled observation.
Mean Substitution:
– Widely used; replaces missing values for a variable with the mean value of that variable based on all valid responses.
Regression Imputation:
– Regression analysis used to predict the missing values of a variable based on its relationship to other variables in the data set.
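Mean substitution and regression imputation can be sketched as follows. This is a minimal illustration on hypothetical data, assuming a single fully observed predictor and a simple least-squares line; the helper names are invented for this example.

```python
def mean_substitute(values):
    """Replace each missing value with the mean of the valid responses."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def regression_impute(x, y):
    """Replace missing y values with predictions from a line fitted on complete pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

years_experience = [1, 2, 3, 4, 5]        # hypothetical predictor, fully observed
salary = [40, 45, 50, None, 60]           # hypothetical outcome, one missing value

by_mean = mean_substitute(salary)
by_regression = regression_impute(years_experience, salary)

print(by_mean)
print(by_regression)
```

Mean substitution gives the missing respondent the overall average regardless of experience, while regression imputation uses the relationship with the other variable, so the two methods fill the same gap with different values.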
Class Activity
Presenting Data
http://www.edwardtufte.com/tufte/
http://kirkgoldsberry.com/
Presenting Data
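[Figure: paired charts by year (1st Yr to 4th Yr), one on a count axis from 0 to 300 and one on a percentage axis from 0% to 30%.]
[Figure: paired charts by quarter (Q1 to Q4), one on an axis from 0 to 200 and one on an axis from 0 to 50.]
[Figure: "Monthly Sales Numbers", January to June, drawn on truncated axes ($36 to $42 and $36 to $45) and on a full axis ($0 to $60).]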
Pie Charts:
– Best for showing the composition of a single group of data
Line Charts:
– Illustrate trends or changes over time
– Movements in the same or different direction
Scatterplots:
– Illustrate distribution of data
– Help identify relationships between variables
– https://infogr.am/
– http://articles.sysev.com/principles-practices-effective-presentation-communication/
– http://www.kaushik.net/avinash/data-presentation-tips-focus-think-simplify-visualize/
– http://www.forbes.com/sites/kateharrison/2015/01/20/a-good-presentation-is-about-data-and-story/
Class Activity
5. Email to matthew.beck@sydney.edu.au
Descriptives in Excel
Descriptives in SPSS
Class Activity
5. Email to matthew.beck@sydney.edu.au