Techniques of Data Analysis: Assoc. Prof. Dr. Abdul Hamid B. Hj. Mar Iman
Objectives
Specific:
* Concepts of data analysis
* Some data analysis techniques
* Some tips for data analysis
Method
Systematic
Breaking
Statistical Methods
Something to do with statistics
Statistics: meaningful quantities about a sample of
objects, things, persons, events, phenomena, etc.
Widely used in social sciences.
Simple to complex issues. E.g.
* correlation
* ANOVA
* MANOVA
* regression
* econometric modelling
Two main categories:
* Descriptive statistics
* Inferential statistics
Descriptive statistics
Use
[Charts omitted: only axis tick values survived extraction; one axis is labelled "% prediction error".]
Inferential statistics
Using
Use
parametric analysis
Examples of relationship
Dep=9t 215.8
Dep=7t 192.6
Coefficients (Model 1)
              Unstandardized Coefficients    Standardized Coefficients
Variable      B            Std. Error        Beta                         t          Sig.
(Constant)    1993.108     239.632                                        8.317      .000
Tanah           -4.472       1.199           -.190                       -3.728      .000
Bangunan         6.938        .619            .705                       11.209      .000
Ansilari         4.393       1.807            .139                        2.431      .017
Umur           -27.893       6.108           -.241                       -4.567      .000
Flo_go          34.895      89.440            .020                         .390      .697
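As a quick check on how the columns of the Coefficients table relate, the sketch below recomputes each t value as B divided by its standard error. The figures are copied from the table; the names `rows` and `t_statistic` are illustrative only.

```python
# Rows copied from the Coefficients table: (variable, B, std. error, reported t)
rows = [
    ("(Constant)", 1993.108, 239.632, 8.317),
    ("Tanah", -4.472, 1.199, -3.728),
    ("Bangunan", 6.938, 0.619, 11.209),
    ("Ansilari", 4.393, 1.807, 2.431),
    ("Umur", -27.893, 6.108, -4.567),
    ("Flo_go", 34.895, 89.440, 0.390),
]


def t_statistic(b, std_error):
    """t value of a regression coefficient: estimate divided by its std. error."""
    return b / std_error


# Recompute t for every row and show it next to the reported value.
computed = {name: t_statistic(b, se) for name, b, se, _ in rows}
for name, b, se, reported in rows:
    print(f"{name:<11} t = {computed[name]:8.3f} (reported {reported:8.3f})")
```

A coefficient whose |t| is small relative to its standard error (here Flo_go, Sig. = .697) would not usually be treated as statistically significant.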
Issue
Correct technique
Correct technique
Using a regression parameter
Multi-dimensional scaling, Likert scaling
Simple regression coefficient
Using $R^2$
Hold-out samples
MAPE
Multi-dimensional scaling, Likert scaling
Principles of analysis
Goal of an analysis:
* To explain cause-and-effect phenomena
* To relate research to real-world events
* To predict/forecast real-world phenomena based on research
* Finding answers to a particular problem
* Making conclusions about real-world events based on the problem
* Learning a lesson from the problem
Number
Female
Old
Young
6
4
10
15
analysing:
* Be objective
* Accurate
* True
Separate facts and opinion
Avoid wrong reasoning/argument. E.g.
mistakes in interpretation.
Basic concepts
Central tendency
Variability
Probability
Statistical Modelling
Basic Concepts
= 120,000
2
SD
SST
= 210,000
3
DST
J.B. houses
=?
Central Tendency
Measure                                     Advantages            Disadvantages
Mean (sum of all values ÷ no. of values)    Exactly calculable
Median (middle value)
Mode (most frequent value)
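The three measures can be sketched with Python's standard `statistics` module. The rental figures below are hypothetical and are not taken from the slides.

```python
import statistics

# Hypothetical monthly rentals (RM); illustrative data only.
rentals = [350, 400, 400, 450, 500, 1200]

mean_rent = statistics.mean(rentals)      # sum of all values / no. of values
median_rent = statistics.median(rentals)  # middle value of the sorted data
mode_rent = statistics.mode(rentals)      # most frequent value

print("mean:", mean_rent, "median:", median_rent, "mode:", mode_rent)
```

In this made-up sample the single high rental pulls the mean well above the median, which is one reason the median is often preferred for skewed data.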
10 12
f 3
14 24 18 20 12
Thus, $\sum fx = 96$; $\sum f = 12$
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{96}{12} = 8$
Class      Midpoint (x)    fx
135-140    137.5            687.5
140-145    142.5           1282.5
145-150    147.5            885.0
150-155    152.5            305.0
155-160    157.5            157.5
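A minimal sketch of the grouped-data mean, using the class midpoints and fx values listed above. The pairing of each fx value with its midpoint is an assumption (it is the pairing that gives whole-number frequencies), and the frequencies are derived here as fx / x rather than read from the slide.

```python
# Class midpoints (x) and the corresponding fx products from the grouped table.
midpoints = [137.5, 142.5, 147.5, 152.5, 157.5]
fx = [687.5, 1282.5, 885.0, 305.0, 157.5]

# Assumption: frequencies are recovered as f = fx / x and rounded to integers.
freqs = [round(v / x) for x, v in zip(midpoints, fx)]

sum_f = sum(freqs)
sum_fx = sum(fx)
grouped_mean = sum_fx / sum_f  # mean of grouped data = sum(fx) / sum(f)

print("frequencies:", freqs)
print("sum f =", sum_f, " sum fx =", sum_fx)
print("grouped mean =", round(grouped_mean, 2))
```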
What
130-135
135-140
140-145 155-50
150-155
Rental (RM/month)
>135
> 140
> 145
> 150
> 155
Cumulative frequency
17
23
25
Taman
Variability
Indicates
The
standard deviation
standard deviation
Variability (contd.)
Why
Variability (contd.)
Coefficient
Could
Variability (contd.)
Std.
The following table shows the age distribution of second-time home buyers:
x^
Probability Distribution
Defined
Sum of two dice (Dice1 across, Dice2 down):
Dice2 \ Dice1    1    2    3    4    5    6
1                2    3    4    5    6    7
2                3    4    5    6    7    8
3                4    5    6    7    8    9
4                5    6    7    8    9   10
5                6    7    8    9   10   11
6                7    8    9   10   11   12
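The two-dice table can be turned into a discrete probability distribution by counting how often each sum occurs among the 36 equally likely outcomes. The sketch below assumes two fair six-sided dice; the names `counts` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Count how many (Dice1, Dice2) pairs produce each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# Probability of each sum = number of favourable pairs / 36.
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in pmf.items():
    print(f"P(sum = {total:2d}) = {p}")
```

The probabilities over all possible sums add up to exactly 1, which is the defining property of a probability distribution.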
Discrete values
P(Rental = RM 8) = 0
0.206
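* Bell-shaped, symmetrical
* Has a function of
  $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  where $\mu$ = mean of variable x; $\sigma$ = std. dev. of x; $\pi$ = ratio of circumference of a circle to its diameter $\approx$ 3.14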
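A minimal sketch of the normal density function, written directly from its textbook form with mean `mu` and standard deviation `sigma`. The function name `normal_pdf` is illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2.0 * math.pi))


# The curve peaks at the mean and is symmetrical about it.
print(normal_pdf(0.0, 0.0, 1.0))
print(normal_pdf(-1.5, 0.0, 1.0), normal_pdf(1.5, 0.0, 1.0))
```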
Probability distribution
1 = ?
2 = ?
3 = ?
Probability distribution
* Has the following distribution of observation
Probability distribution
There
Note: p(AGE=age) 1
How to turn this graph into
a probability distribution
function (p.d.f.)?
Not
Z-Distribution
E.g. $Z = \frac{160 - 155}{5.4} = 0.926$
Probability is distributed such that:
* Approx. 68% of values lie within -1 < z < 1
* Approx. 95% of values lie within -1.96 < z < 1.96
* Approx. 99% of values lie within -2.58 < z < 2.58
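The z-score example and the three coverage figures can be checked numerically. This sketch computes the standard normal CDF from the error function in Python's `math` module; the helper names are illustrative.

```python
import math


def z_score(x, mu, sigma):
    """Convert x to a standard normal score: subtract the mean, divide by the std. dev."""
    return (x - mu) / sigma


def std_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def central_probability(z):
    """P(-z < Z < z): the probability mass within z std. devs. of the mean."""
    return std_normal_cdf(z) - std_normal_cdf(-z)


print("Z =", round(z_score(160, 155, 5.4), 3))
for limit in (1.0, 1.96, 2.58):
    print(f"P(-{limit} < Z < {limit}) = {central_probability(limit):.4f}")
```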
Z-distribution (contd.)
When $X = \mu$, $Z = 0$
When $X = \mu + \sigma$, $Z = 1$
When $X = \mu + 2\sigma$, $Z = 2$
When $X = \mu + 3\sigma$, $Z = 3$, and so on.
It can be proven that P(X1 <X< Xk) = P(Z1 <Z< Zk)
SND
Normal distribution: Questions
Your sample found that the mean price of affordable homes in Johor
Bahru, Y, is RM 155,000 with a variance of RM $3.8 \times 10^7$. On the basis of a
normality assumption, how sure are you that:
(a)
(b)
Answer (a):
$Z = \frac{160,000 - 155,000}{\sqrt{3.8 \times 10^7}} = 0.811$
Always remember: to convert to SND, subtract the mean and divide by the std. dev.
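Normal distribution: Questions
Answer (b):
$Z_1 = \frac{X_1 - \mu}{\sigma} = \frac{145,000 - 155,000}{\sqrt{3.8 \times 10^7}} = -1.622$
$Z_2 = \frac{X_2 - \mu}{\sigma} = \frac{160,000 - 155,000}{\sqrt{3.8 \times 10^7}} = 0.811$
$P(Z < -1.622) \approx 0.0524$; $P(Z > 0.811) \approx 0.2087$
$P(145,000 < X < 160,000) = 1 - (0.0524 + 0.2087) \approx 0.7389$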
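The calculation in answer (b) can be reproduced without a printed Z-table. The sketch below takes the mean (RM 155,000) and variance (3.8 x 10^7) from the question and evaluates the standard normal CDF through `math.erf`; the variable names are illustrative.

```python
import math


def std_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu = 155_000.0        # sample mean price (RM), from the question
variance = 3.8e7      # sample variance, from the question
sigma = math.sqrt(variance)

# Convert both price limits to standard normal scores.
z1 = (145_000.0 - mu) / sigma
z2 = (160_000.0 - mu) / sigma

p_below = std_normal_cdf(z1)        # P(X < 145,000)
p_above = 1.0 - std_normal_cdf(z2)  # P(X > 160,000)
p_between = 1.0 - (p_below + p_above)

print(f"z1 = {z1:.3f}, z2 = {z2:.3f}")
print(f"P(X < 145,000) = {p_below:.4f}")
print(f"P(X > 160,000) = {p_above:.4f}")
print(f"P(145,000 < X < 160,000) = {p_between:.4f}")
```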
Normal distribution: Questions
You are told by a property consultant that the
average rental for a shop house in Johor Bahru is
RM 3.20 per sq. After searching, you discovered
the following rental data:
2.20, 3.00, 2.00, 2.50, 3.50, 3.20, 2.60, 2.00,
3.10, 2.70
What is the probability that the rental is greater
than RM 3.00?
Student's t-Distribution
Similar to Z-distribution:
* $t(0, \sigma)$ but $\sigma \neq 1$
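* $-\infty < t < +\infty$
* Flatter with thicker tails
* As $n \to \infty$, $t(0, \sigma) \to N(0, 1)$
* Has a function of
  $f(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}$
where $\Gamma$ = gamma function; $v = n - 1$ = d.o.f.; $\pi \approx 3.142$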
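To illustrate the "flatter with thicker tails" and "approaches N(0,1)" properties, the sketch below evaluates the standard Student's t density with `v` degrees of freedom using `math.gamma`, and compares it with the standard normal density. The function names are illustrative, and the formula is the usual textbook form rather than anything specific to these slides.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    coef = math.gamma((v + 1) / 2.0) / (math.sqrt(v * math.pi) * math.gamma(v / 2.0))
    return coef * (1.0 + t * t / v) ** (-(v + 1) / 2.0)


def std_normal_pdf(z):
    """Density of the standard normal distribution N(0, 1)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)


# Small samples: lower peak and heavier tails than the normal curve.
print("peak, v=5:  ", t_pdf(0.0, 5), " normal:", std_normal_pdf(0.0))
print("tail, v=5:  ", t_pdf(3.0, 5), " normal:", std_normal_pdf(3.0))
# Large samples: the t density moves close to the normal density.
print("peak, v=100:", t_pdf(0.0, 100))
```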
Student's t-Distribution
Given
Student's t-Distribution
Student's
* defining
The
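Student's t-Distribution
$f_r(t) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\,\Gamma\left(\frac{r}{2}\right)\left(1 + \frac{t^2}{r}\right)^{(r+1)/2}} = \frac{\left(\frac{r}{r+t^2}\right)^{(1+r)/2}}{\sqrt{r}\,B\left(\frac{r}{2}, \frac{1}{2}\right)}$
$F_r(t) = \int_{-\infty}^{t} f_r(u)\,du = \frac{1}{2} + \frac{1}{2}\left[1 - I\left(\frac{r}{r+t^2}; \frac{r}{2}, \frac{1}{2}\right)\right]\operatorname{sgn}(t)$
where $r = n - 1$ is the number of degrees of freedom, $-\infty < t < \infty$, $\Gamma(z)$ is the gamma function, $B(a,b)$ is the beta function, and $I(z;a,b)$ is the regularized beta function defined by $I(z;a,b) = B(z;a,b)/B(a,b)$, with $B(z;a,b)$ the incomplete beta function.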
* Causal
* Feedback
* Multi-directional
* Recursive
The last two categories are normally dealt with
through regression
Correlation
Co-exist. E.g.
* left shoe & right shoe, sleep & lying down, food & drink
Indicate some co-existence relationship. E.g.
* Linearly associated (-ve or +ve)
Formula:
* Co-dependent, independent
But nothing to do with a cause-and-effect (C-A-E) relationship!
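As an illustration of measuring linear association, the sketch below computes the Pearson product-moment correlation coefficient. This is an assumption about which formula the slide intended, and the two distance series are hypothetical, not the survey data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y scaled by both std. devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical distances (km); illustrative data only.
distance_to_work = [2.0, 4.0, 6.0, 8.0, 10.0]
distance_to_city = [3.0, 5.0, 8.0, 9.0, 12.0]

r = pearson_r(distance_to_work, distance_to_city)
print("r =", round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear association, but on its own it says nothing about which variable, if either, causes the other.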
Example: After a field survey, you have the following
data on the distance to work and distance to the city
of residents in J.B. area. Interpret the results?
Contingency
Test yourselves!
Q1: Calculate the min and std. variance of the following data:
PRICE - RM '000
SQ. M OF FLOOR
Q2: Calculate the mean price of the following low-cost houses, in various
localities across the country:
36
37
38
39
40
41
42
43
14
10
36
73
27
20
17
Test yourselves!
Q3: From sample information, a population of a housing
estate is believed to have a normal distribution of X ~ N(155,
45). What is the general adjustment to obtain a Standard
Normal Distribution of this population?
Q4: Consider the following ROI for two types of investment:
A: 3.6, 4.6, 4.6, 5.2, 4.2, 6.5
B: 3.3, 3.4, 4.2, 5.5, 5.8, 6.8
Decide which investment you would choose.
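One possible way to approach Q4 is to compare average return against relative variability. The sketch below assumes the coefficient of variation (sample std. dev. divided by mean) is the comparison criterion; that choice is an assumption, and other criteria could lead to a different decision.

```python
import statistics

# ROI figures for the two investments, as given in Q4.
roi_a = [3.6, 4.6, 4.6, 5.2, 4.2, 6.5]
roi_b = [3.3, 3.4, 4.2, 5.5, 5.8, 6.8]


def coefficient_of_variation(values):
    """Relative variability: sample std. dev. divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


mean_a, mean_b = statistics.mean(roi_a), statistics.mean(roi_b)
cv_a, cv_b = coefficient_of_variation(roi_a), coefficient_of_variation(roi_b)

print(f"A: mean = {mean_a:.3f}, CV = {cv_a:.3f}")
print(f"B: mean = {mean_b:.3f}, CV = {cv_b:.3f}")
```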
Test yourselves!
Q5: Find:
(AGE > 30-34)
(AGE 20-24)
( 35-39 AGE < 50-54)
Test yourselves!
Q6: You are asked by a property marketing manager to ascertain whether
or not distance to work and distance to the city are equally important
factors influencing people's choice of house location.
You are given the following data for the purpose of testing:
Explore the data as follows:
* Create histograms for both distances. Comment on the shape of the histograms. What is your conclusion?
* Construct a scatter diagram of both distances. Comment on the output.
* Explore the data and give some analysis.
* Set a hypothesis that the means of both distances are the same. Make your conclusion.
Thank you