06 Transformation
Institute of Information Systems and Marketing (IISM), Karlsruhe Service Research Institute (KSRI)
utilization
technology
law
society
2 Management of Information Systems – Prof. Christof Weinhardt, Ewa Lux Institute of Information Systems and Marketing (IISM)
Outline of the lecture
Basics, Filtering
Transformation Regression, Cluster Analysis
Evaluation Utility Analysis, AHP, Decision Rules, Information Value, Page Rank
Overview: Transformation
Basics
Classification
Aggregation
OLAP
Filtering
Regression
Cluster analysis
The t-Test (two samples, independent)
• Different sample size, same variance
• Different sample size, different variance (Welch's test)
• Variance unknown
• Variance known
The t-Test
“To call in the statistician after the experiment is done may
be no more than asking him to perform a postmortem
examination: he may be able to say what the experiment
died of.”
Ronald Fisher
(1890 – 1962)
Regression as a method in econometrics
Econometrics deals with the application of statistical and mathematical methods
to economic questions
With the help of econometric models we try to answer economic questions by
statistical inference based on data
The term was coined at the beginning of the 20th century.
Nowadays econometric models are essential analysis methods in economic
science and financial econometrics
Comparable statistical models are used in physics (time series analysis) or in
biology and medicine (panel data)
Frequently used software packages are SAS (especially in finance and industry),
SPSS, Stata or R
Steps to create an economic model
0. Real-world problem and First Hypothesis
3. Model assessment/-calculation
Basics Regression
Set up a theoretical model based on practical intuition or on phenomena
documented in the literature
The new factors often represent a modification or further interpretation of the known
phenomenon (e.g. the sales volume of a product depends not only on the price but also on
the advertising used)
The scientific work can contribute by researching a new market or by researching a known market with a larger
data volume
Example: How does the risk aversion of an investor
influence his portfolio decisions?
Example: Risky investments and the age of investors
Person t   Age   % Stocks
1          16    50
2          20    80
3          29    80
4          31    60
5          37    60
6          44    65
7          48    30
8          50    50
9          59    10
10         61    20
[Figure: scatter plot of % stocks (0% to 100%) against age (10 to 70)]
Regression: Analytical view
Classical linear regression
$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \to \min$
Result: fitted line
$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t$
[Figure: scatter plot of % stocks against age with the fitted regression line]
OLS estimator and estimated regression term
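$\hat{\beta} = \frac{\sum x_t y_t - T \bar{x} \bar{y}}{\sum x_t^2 - T \bar{x}^2} = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}$
$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$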
Example
t 1 2 3 4 5 6 7 8 9 10
xt 16 20 29 31 37 44 48 50 59 61
yt 0.5 0.8 0.8 0.6 0.6 0.65 0.3 0.5 0.1 0.2
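The OLS estimator can be tried out on the example data. The following is a minimal sketch in plain Python; the function name `ols_fit` and the variable names are illustrative assumptions, not part of the lecture.

```python
# Example data from the table: investor age (x) and share of stocks (y).
x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]


def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the model y = alpha + beta * x + u."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    # Numerator and denominator of the slope estimator.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - T * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - T * x_bar ** 2
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = ols_fit(x, y)
print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```

For these ten observations the slope comes out negative (about -0.0117 per year of age), which is in line with the scatter plot: older investors hold a smaller share of stocks.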
Example: Draft of the first problem
Hypothesis:
Investors reduce the ratio of risky to risk-free assets in their portfolio as they grow older; for example, the portfolios of older investors contain a lower percentage of risky stocks than those of younger investors.
[Figure: scatter plot of % stocks (0% to 100%) against age (10 to 70)]
Given assumptions of the CLRM
$y_t = \alpha + \beta x_t + u_t$
The following assumptions are made about the error term $u_t$:
The expected value of the error terms is 0: $E(u_t) = 0$
The variance of the error term is constant and finite for all values of $x_t$: $\mathrm{var}(u_t) = \sigma^2$
The error terms are linearly independent of one another: $\mathrm{cov}(u_i, u_j) = 0$ for $i \neq j$
There is no relation between the error terms and the variable x: $\mathrm{cov}(u_t, x_t) = 0$
The error terms are normally distributed: $u_t \sim N(0, \sigma^2)$
If the first four assumptions are fulfilled, the OLS estimators are
Best Linear Unbiased Estimators (BLUE); the normality assumption is additionally needed for inference about the parameters
cf. Brooks (2008), p. 43ff.
Excurse: Standard error (arithmetic mean)
Variance
The standard error shows how far, on average, the estimates of the mean are
scattered around the true mean.
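The formula for this slide did not survive, so the following sketch assumes the usual estimator: the sample standard deviation (with divisor n - 1) divided by the square root of n. The function name is hypothetical, and the data are the stock shares from the regression example.

```python
import math


def standard_error_of_mean(values):
    """Estimated standard error of the arithmetic mean (assumed form: s / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with n - 1 degrees of freedom.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


stock_shares = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]
se_mean = standard_error_of_mean(stock_shares)
print(f"standard error of the mean = {se_mean:.4f}")
```

With more observations the divisor n grows, so the standard error of the mean shrinks even if the spread of the data stays the same.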
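How “good” are the estimators $\hat{\alpha}$ and $\hat{\beta}$?
The estimators $\hat{\alpha}$ and $\hat{\beta}$ are specific to the sample. The standard error can
serve as a measure of reliability and accuracy:
$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$
$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$
With: $s = \sqrt{\frac{\sum \hat{u}_t^2}{T - k - 1}}$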
“If the standard errors are small, it shows that the coefficients are likely to be
precise on average, not how precise they are for this particular sample.”
(Brooks 2008, p. 46)
Calculation of the standard error
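$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$
$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$
Example
t 1 2 3 4 5 6 7 8 9 10
xt 16 20 29 31 37 44 48 50 59 61
yt 0.5 0.8 0.8 0.6 0.6 0.65 0.3 0.5 0.1 0.2
$SE(\hat{\alpha}) = \sqrt{\frac{0.215}{8}} \cdot \sqrt{\frac{17769}{10 \cdot 2166.5}} \approx 0.1485$
$SE(\hat{\beta}) = \sqrt{\frac{0.215}{8}} \cdot \sqrt{\frac{1}{2166.5}} \approx 0.00352$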
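The calculation can be retraced numerically. The sketch below recomputes the residual sum of squares and both standard errors from the example data; the variable names are illustrative, and k = 1 is assumed because the model has one regressor.

```python
import math

x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]
T = len(x)
k = 1  # assumption: one regressor (age)

x_bar = sum(x) / T
y_bar = sum(y) / T
s_xx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x

# OLS estimates for the regression line.
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residual sum of squares and the estimated standard deviation of the errors.
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (T - k - 1))

se_alpha = s * math.sqrt(sum(xi ** 2 for xi in x) / (T * s_xx))
se_beta = s * math.sqrt(1 / s_xx)

print(f"RSS = {rss:.3f}, SE(alpha) = {se_alpha:.4f}, SE(beta) = {se_beta:.5f}")
```

The intermediate quantities match the numbers used on the slide: the sum of squared x values is 17769, the sum of squared deviations is 2166.5, and the residual sum of squares is about 0.215.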
Influences on the standard error
$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$
$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$
The bigger…
the sample size T,
the sum of squared deviations $\sum (x_t - \bar{x})^2$,
… the smaller the standard errors.
Variance structure in the linear regression
In linear regression there are three different kinds of variation for every
measured value $y_i$: total, explained, and unexplained (residual) variation
Goodness of Fit: R²
How well does the model describe the variation of the dependent
variable?
Possibility: look at the residual sum of squares (RSS)
Problem: What does an RSS value of 0.215 with OLS estimation mean?
Use of the scaled version of RSS: R²
Idea: Explain the variability of y by its deviation from the average value
(benchmark)
Calculation of total sum of squares (TSS): $TSS = \sum (y_t - \bar{y})^2$
Separation of TSS into ESS (explained sum of squares) and RSS: $TSS = ESS + RSS$, $R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$
R² has to have a value between 0 and 1
R² = 0: the model explains no variability of y (since TSS = RSS)
R² = 1: the model explains all the variability of y (since RSS = 0)
Example for R²
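As a sketch, R² for the age and stock-share example can be computed from the three sums of squares. The decomposition TSS = ESS + RSS and the ratio ESS / TSS are the standard definitions; the variable names are illustrative.

```python
x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]
T = len(x)

x_bar = sum(x) / T
y_bar = sum(y) / T

# OLS fit of y on x.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
y_fit = [alpha_hat + beta_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ess = sum((fi - y_bar) ** 2 for fi in y_fit)           # explained variation
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))  # unexplained variation

r2 = ess / tss
print(f"TSS = {tss:.4f}, ESS = {ess:.4f}, RSS = {rss:.4f}, R2 = {r2:.3f}")
```

In this example age alone accounts for roughly 58% of the variation in the stock share, which gives the RSS value of 0.215 a scale to be judged against.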
Problem with the interpretation of R2
Problems with R² and the adjusted R²
Problems:
R² does not tell us whether the correct regression was used or whether important
variables are missing
R² allows no comparison between models with different variables
R² never decreases when further independent variables are added
Many models have similar or identical values of R²
Adjusted R²
Takes into account the loss of degrees of freedom caused by adding more variables:
$R^2_{adj} = R^2 - \frac{k (1 - R^2)}{T - k - 1}$
Rule of choice: add a variable if $R^2_{adj}$ increases; do not add it if $R^2_{adj}$ decreases
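The rule of choice can be illustrated with a small sketch. The function implements the formula above; the R² values of 0.58 and 0.59 and the second regressor are hypothetical numbers chosen only for illustration.

```python
def adjusted_r2(r2, T, k):
    """Adjusted R² for T observations and k regressors."""
    return r2 - k * (1 - r2) / (T - k - 1)


# Hypothetical comparison: a second regressor raises R² only slightly.
T = 10
r2_adj_one = adjusted_r2(0.58, T, k=1)
r2_adj_two = adjusted_r2(0.59, T, k=2)

add_second_variable = r2_adj_two > r2_adj_one
print(f"k=1: {r2_adj_one:.4f}, k=2: {r2_adj_two:.4f}, add variable: {add_second_variable}")
```

Here the plain R² rises, but the adjusted R² falls, so under the rule the second variable would not be added.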
Hypothesis and statistical null hypothesis
Research hypotheses must be distinguished from statistical null hypotheses
Statistical null hypotheses have to be derived from research hypotheses in order to
test the research hypotheses
Statistical tests check whether a parameter equals a given value
H0: β = 0 and H1: β ≠ 0 (two-sided test)
H0: β = 1 and H1: β > 1 (one-sided test)
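A two-sided test of H0: β = 0 can be sketched with the rounded estimates from the age and stock-share example. The test statistic (estimate minus hypothesized value, divided by the standard error) and the critical value 2.306 (5% level, two-sided, t-distribution with 8 degrees of freedom, taken from a standard t-table) are assumptions of this sketch and are not stated on the slide.

```python
def t_statistic(estimate, hypothesized, standard_error):
    """t-ratio for testing H0: parameter = hypothesized value."""
    return (estimate - hypothesized) / standard_error


beta_hat = -0.0117  # rounded OLS slope from the example
se_beta = 0.00352   # rounded standard error from the example
t_crit = 2.306      # assumption: 5% two-sided critical value, 8 degrees of freedom

t_stat = t_statistic(beta_hat, 0.0, se_beta)
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Under these assumptions the absolute t-ratio exceeds the critical value, so H0: β = 0 would be rejected for the example data.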
References
Andrew, C. and Thomas, S. (1995). The overreaction hypothesis and the UK stockmarket. Journal of Business Finance & Accounting 22, pp. 961-973.
Brooks, C. (2008). Introductory econometrics for finance. Second edition. Cambridge University Press.
Canner, N., Mankiw, N. G. and Weil, D. N. (1997). An Asset Allocation Puzzle. American Economic Review 87, pp. 181-191.
Debondt, W.F.M. and Thaler, R.H. (1985). Does the stock market overreact? Journal of Finance 40, pp. 793-805.
Debondt, W.F.M. and Thaler, R.H. (1987). Further evidence on investor overreaction and stock market seasonality. Journal of Finance 42, pp. 557-580.
Fan, J. and Yao, Q. (2002). Nonlinear Time Series. Nonparametric and Parametric Methods. Springer Series in Statistics.
Lintner, J. (1965). The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets. Review of Economics and Statistics 47, pp. 13-37.
Percival, D.B. and Walden, A.T. (2000). Wavelet Methods for Time Series Analysis.
Tobin, J. (1958). Liquidity Preference as Behavior Towards Risk. Review of Economic Studies 25, pp. 65-86.
Wooldridge, J.M. (2009). Introductory econometrics: A modern approach. Fourth edition. South-Western.
Clustering
… or how a well-sorted data collection emerges from a large number of
individual data points
Motivation
Today, internet use produces large amounts of data: "Big Data"
These data are hard to evaluate and to work with
Methods are needed to reduce complexity
Cluster analysis
Cluster analysis – the sense
How does clustering work?
The primary goal is to create a data structure in which the most similar data
elements are grouped together
Questions that arise:
• How is similarity defined and measured?
• How are clusters built?
• How many clusters are built?
Similarity
Similarity is the degree to which objects agree across all of the
characteristics considered
It can be measured with correlation measures and distance measures
[Figure: two charts comparing values (0 to 7) across Category 1 to Category 4]
Similarity of sets
The Jaccard coefficient is a similarity measure for two sets A and B: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Example:
Also useful for the similarity of the vectors V1 and V2 with binary entries:
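A minimal sketch of the Jaccard coefficient for sets and for binary vectors follows. It uses the usual definition (size of the intersection divided by size of the union); the function names and the sample values are illustrative, and returning 1.0 for two empty inputs is a convention chosen here.

```python
def jaccard_sets(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_binary(v1, v2):
    """Jaccard coefficient of two equally long vectors with 0/1 entries."""
    both = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 1)
    either = sum(1 for p, q in zip(v1, v2) if p == 1 or q == 1)
    return both / either if either else 1.0


set_similarity = jaccard_sets({1, 2, 3}, {2, 3, 4})
vector_similarity = jaccard_binary([1, 1, 0, 1], [1, 0, 0, 1])
print(set_similarity, vector_similarity)
```

For binary vectors, positions where both entries are 0 are ignored, so the measure only counts characteristics that at least one of the two objects has.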
Distance measures
Euclidean distance
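The formula on this slide did not survive. In its standard form, the Euclidean distance between two points p and q with n characteristics is (the symbols p, q and n are assumptions of this sketch):

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

For example, the points (0, 0) and (3, 4) have a distance of 5.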
Voronoi diagrams
Distances between clusters
Single link: smallest distance between an element in one cluster and one in the
other cluster, $dist(K_i, K_j) = \min_{p,q} dist(t_{ip}, t_{jq})$
Complete link: biggest distance between an element in one cluster and one in the
other cluster, $dist(K_i, K_j) = \max_{p,q} dist(t_{ip}, t_{jq})$
Average: average distance between an element in one cluster and one in the other
cluster, $dist(K_i, K_j) = \mathrm{avg}_{p,q} \, dist(t_{ip}, t_{jq})$
Centroid: distance between the centers of the two clusters, $dist(K_i, K_j) = dist(C_i, C_j)$
Medoid: distance between the medoids of the two clusters, $dist(K_i, K_j) = dist(M_i, M_j)$
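The first four cluster distances can be sketched as follows, assuming Euclidean distance between elements. The function names and the two small clusters `K1` and `K2` are hypothetical; the medoid variant is left out for brevity.

```python
import math
from itertools import product


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(ci, cj):
    """All distances between an element of ci and an element of cj."""
    return [euclidean(p, q) for p, q in product(ci, cj)]


def single_link(ci, cj):
    return min(pairwise(ci, cj))


def complete_link(ci, cj):
    return max(pairwise(ci, cj))


def average_link(ci, cj):
    distances = pairwise(ci, cj)
    return sum(distances) / len(distances)


def centroid(cluster):
    """Coordinate-wise mean of all points in the cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_distance(ci, cj):
    return euclidean(centroid(ci), centroid(cj))


K1 = [(0.0, 0.0), (0.0, 2.0)]  # hypothetical cluster
K2 = [(3.0, 0.0), (3.0, 2.0)]  # hypothetical cluster
print(single_link(K1, K2), complete_link(K1, K2),
      average_link(K1, K2), centroid_distance(K1, K2))
```

Single link always yields the smallest and complete link the largest of these values, with the average in between; this is why the choice of measure influences which clusters get merged first.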
Differentiation of cluster algorithms
Cluster algorithms
• Graph theory
• Hierarchical: divisive, agglomerative
• Partitioning: exchange processes, minimum distance processes
• Optimizing
Hierarchical processes
Start with one big cluster or with many single-element clusters, which are then
successively split or merged
Stop when the stop criterion is fulfilled (e.g. all data points combined or
separated)
Distinction between divisive (top-down) and agglomerative (bottom-up) methods
Example for algorithm AGNES
Distance matrix
Observation   A       B       C       D       E       F       G
A             ---
B             3.162   ---
C             5.099   2.000   ---
D             5.099   2.828   2.000   ---
E             5.000   2.236   2.236   4.123   ---
F             6.403   3.606   3.000   5.000   1.414   ---
G             3.606   2.236   3.606   5.000   2.000   3.162   ---
Agglomerative building of clusters
Agglomerative process: Step, Minimum Distance, Observation Pair
Cluster solution: Cluster Membership, Number of Clusters, Overall Similarity Measure (Average Within-Cluster Distance)
Step               Minimum Distance   Observation Pair   Cluster Membership        Number of Clusters   Overall Similarity Measure
Initial Solution   -                  -                  (A)(B)(C)(D)(E)(F)(G)     7                    0
1                  1.414              E-F                (A)(B)(C)(D)(E-F)(G)      6                    1.414
2                  2.000              E-G                (A)(B)(C)(D)(E-F-G)       5                    2.192
3                  2.000              C-D                (A)(B)(C-D)(E-F-G)        4                    2.144
4                  2.000              B-C                (A)(B-C-D)(E-F-G)         3                    2.234
5                  2.236              B-E                (A)(B-C-D-E-F-G)          2                    2.896
6                  3.162              A-B                (A-B-C-D-E-F-G)           1                    3.420
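The merge sequence can be retraced with a small agglomerative sketch that uses the distance matrix above and the single-link rule. This is an illustrative implementation, not the lecture's code; the names `rows`, `dist` and `agnes_single_link` are assumptions.

```python
# Lower triangle of the distance matrix, row by row.
rows = {
    "B": [3.162],
    "C": [5.099, 2.000],
    "D": [5.099, 2.828, 2.000],
    "E": [5.000, 2.236, 2.236, 4.123],
    "F": [6.403, 3.606, 3.000, 5.000, 1.414],
    "G": [3.606, 2.236, 3.606, 5.000, 2.000, 3.162],
}
labels = "ABCDEFG"
dist = {}
for row, values in rows.items():
    for col, d in zip(labels, values):
        dist[frozenset((row, col))] = d


def single_link(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)


def agnes_single_link(items):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [(item,) for item in items]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        history.append((d, list(clusters)))
    return history


history = agnes_single_link(labels)
for step, (d, clusters) in enumerate(history, start=1):
    print(step, d, clusters)
```

Three merges happen at the same distance of 2.000, so the order among them depends on how ties are broken and may differ from the table; the sequence of merge distances and the three-cluster solution (A)(B-C-D)(E-F-G) are the same either way.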
Graphic Visualization
Visualization - Dendrogram
Shows which clusters are combined at which step until they form the
complete cluster in the end
Outliers are recognizable
Distances are given
(Center-based) partitioning method
k-Means Algorithm
Partitioning method
First, the number of center points (k) is set, and the centers are positioned
in the data set randomly or selectively
Afterwards, each data point is allocated to the closest center (see
Voronoi diagrams)
The center points are then repositioned (calculation of the mean of each cluster)
Variations:
• k-Means++: faster, better selection of the initial centers
• k-Medoids: update according to the medoid rule (less sensitive to outliers)
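The steps above can be sketched in a few lines. This is a minimal illustration with Euclidean distance and fixed starting centers, not a production implementation; the function names, the sample points and the starting centers are assumptions.

```python
import math


def closest(point, centers):
    """Index of the center with the smallest Euclidean distance to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, start_centers, max_iter=100):
    """Alternate between assigning points and recomputing centers until stable."""
    centers = [tuple(c) for c in start_centers]
    for _ in range(max_iter):
        assignment = [closest(p, centers) for p in points]
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                # New center = coordinate-wise mean of the cluster members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # no center moved: converged
        centers = new_centers
    return centers, [closest(p, centers) for p in points]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
centers, assignment = k_means(points, start_centers=[(1, 1), (8, 8)])
print(centers, assignment)
```

With other starting centers the algorithm can end in a different solution, which is why the choice of k and of the starting positions matters.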
[Figure sequence: k-Means Algorithm, showing the steps "Recalculate Clusters" and "Repeat"]
Weaknesses of k-means
[Figure: 200 and 600 points in 2 regions; panels: desired result, result of k-means clustering]
Quality criteria
Three different kinds of quality measurement:
External: compare clusters with external expert knowledge or information sources
Internal: complexity, variance criteria, silhouette coefficient; how large is the distance between the
clusters, and are they well separated?
Relative: comparison of different cluster methods, different numbers of clusters, different start
parameters
Possible measures:
Silhouette
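The slide names the silhouette without giving a formula. The sketch below assumes the usual definition: for each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and the silhouette is (b - a) / max(a, b). The function name and the sample data are hypothetical, and at least two clusters are assumed.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette value over all points (requires at least two clusters)."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        own = [math.dist(p, q)
               for j, (q, lab) in enumerate(zip(points, labels))
               if lab == own_label and j != i]
        if not own:
            scores.append(0.0)  # convention: a single-element cluster scores 0
            continue
        a = sum(own) / len(own)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own_label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0,), (1,), (10,), (11,)]  # hypothetical one-dimensional data
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(f"good clustering: {good:.3f}, bad clustering: {bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to a different cluster than to their own.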
Conclusion
Clustering groups individual observations according to their similarity
Clustering has applications ranging from data mining in marketing to genetics
There are different types of clustering algorithms. The most important are hierarchical
methods (AGNES, Single-Linkage) as well as partitioning methods (k-Mea
A problem with the k-means algorithm is the specification of the clusters (number and
starting positions)
The quality of clustering results and algorithms is not trivial to evaluate and
should be assessed with various criteria