06 Transformation

Management of Information Systems
Prof. Dr. Christof Weinhardt – Ewa Lux
Institute of Information Systems and Marketing (IISM), Karlsruhe Service Research Institute (KSRI)
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association, www.kit.edu
[Figure: the information process chain Acquisition, Storing, Transformation, Evaluation, Commercialization, surrounded by economy, utilization, technology, law, and society]
Outline of the lecture
Acquisition: Information, Measuring/Observation, Experiments, Forecasting, Simulation, Survey, Interviews
Storing: Databases, SQL, Pivoting, Semantics and Ontologies
Transformation: Basics, Filtering, Regression, Cluster Analysis
Evaluation: Utility Analysis, AHP, Decision Rules, Information Value, Page Rank
Marketing: Internet Economics, Digital Goods, Network Effects, Standardization Networks, Pricing, Bundling
Overview: Transformation
Basics
Classification
Aggregation
OLAP
Filtering
Regression
Cluster analysis
The t-Test (two samples, independent)
Variance known
Variance unknown:
  Different sample size, same variance
  Different sample size, different variance (Welch's test)
The t-Test
“To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”
Ronald Fisher (1890 – 1962), Indian Statistical Congress, Sankhya, ca. 1938
Regression as a method in econometrics
Econometrics deals with the application of statistical and mathematical methods to economic questions
With the help of econometric models we try to answer economic questions by statistical inference based on data
The term was coined at the beginning of the 20th century.
Nowadays econometric models are essential analysis methods in economic science and financial econometrics
Comparable statistical models are used in physics (time series analysis) or in biology and medicine (panel data)
Frequently used software packages are SAS (especially in finance and industry), SPSS, Stata, and R
Steps to create an econometric model
0. Real-world problem and first hypothesis
1a. Literature research about the problem
1b. Draft of an assessable theoretical model and draft of a testable hypothesis
2. Data collection and acquisition
3. Model estimation/calculation
4. Is the model correct and significant?
   No: modify the model
   Yes: continue with step 5
5. Theoretical model interpretation
6. Application of the model
Brooks (2008), p. 9ff.
Basics Regression
Set-up of a theoretical model based on practical intuition or on a phenomenon documented in the literature
New factors often represent a modification or further interpretation of a known phenomenon (e.g. the sales volume of a product depends not only on the price but also on the advertising used)
Scientific work can contribute by researching a new market or by researching a known market with a larger data volume
At least two variables should be linked in the work
E.g. on average, exam success increases with the time spent on exam preparation.
The model has to give an assessable, acceptable approximation of the real-world problem.
Example: How does the risk aversion of an investor influence his portfolio decisions?
An approach to explain the choice between risky and risk-free investments:
Risk-free investments can be e.g. bonds or government securities
Risky investments can be e.g. shares of a biotechnology start-up
=> Risky investments can bring higher returns, but the risk of a loss is also bigger
Which factors influence the ratio between risky and risk-free investments in the portfolio?
The current assessment of the world economy
The long-term relation between share price and dividend
The risk aversion of the investor
The age of the investor
…
Example: Risky investments and the age of investors
Person t   Age   % Stocks
1          16    50
2          20    80
3          29    80
4          31    60
5          37    60
6          44    65
7          48    30
8          50    50
9          59    10
10         61    20
[Figure: scatter plot of % Stocks (0% to 100%) against Age (10 to 70)]
Regression: Analytical view
“A regression model is concerned with describing and evaluating the relationship between a given variable [of interest] and one or more other variables” (Brooks 2008, p. 27)
Variations in y are explained by x1, x2, …, xk
y is called the dependent variable because its value depends on x1, x2, …, xk
x1, x2, …, xk are called independent variables
Important notes and problems:
Difference between correlation (no causal dependence implied) and regression
How many independent variables xk and equations are necessary to describe the problem appropriately?
Which estimation methods are suitable for a given data set?
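Classical linear regression
Equation for a line with a disturbance term $u_t$:
$$y_t = \alpha + \beta x_t + u_t$$
The fitted value is $\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t$, and the residual is $\hat{u}_t = y_t - \hat{y}_t$.
Ordinary least squares (OLS) / minimizing the residual sum of squares (RSS):
$$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \to \min$$
Result: fitted line
$$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t$$
[Figure: scatter plot of % Stocks against Age with the fitted line; α is the intercept, β the slope, and the residual û_t is the vertical distance between an observation y_t and its fitted value ŷ_t]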
OLS estimator and estimated regression term
Estimated coefficients for the slope and the intercept:
$$\hat{\beta} = \frac{\sum x_t y_t - T \bar{x} \bar{y}}{\sum x_t^2 - T \bar{x}^2} = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$
Example
t    1    2    3    4    5    6     7    8    9    10
xt   16   20   29   31   37   44    48   50   59   61
yt   0.5  0.8  0.8  0.6  0.6  0.65  0.3  0.5  0.1  0.2
$$\hat{\beta} = \frac{16 \cdot 0.5 + \dots + 61 \cdot 0.2 - 10 \cdot 39.5 \cdot 0.505}{256 + \dots + 3721 - 10 \cdot 1560.25} = \frac{-25.375}{2166.5} = -0.01171$$
$$\hat{\alpha} = 0.505 + 0.01171 \cdot 39.5 = 0.9676$$
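The calculation above can be reproduced with a few lines of Python. This is a minimal sketch that applies the OLS formulas from the previous slide to the example data; the names `ages`, `stock_share`, and `ols_fit` are illustrative choices and do not come from the lecture.

```python
# Data from the example: age (x_t) and share of stocks (y_t) of ten investors.
ages = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
stock_share = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]


def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the fitted line y = alpha + beta * x."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    # Numerator and denominator of the slope formula.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - T * x_bar * y_bar
    s_xx = sum(xi * xi for xi in x) - T * x_bar ** 2
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = ols_fit(ages, stock_share)
print(f"beta_hat  = {beta_hat:.5f}")   # about -0.01171
print(f"alpha_hat = {alpha_hat:.4f}")  # about 0.9676
```

The negative slope matches the hypothesis of the example: in this sample, each additional year of age goes along with roughly 1.2 percentage points less stock in the portfolio.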

Example: Draft of the first problem
Hypothesis: Investors reduce the ratio of risky to risk-free investments in their portfolio depending on their age; e.g. the portfolios of older investors contain a smaller percentage of risky stocks than those of younger investors.
[Figure: scatter plot of % Stocks (0% to 100%) against Age (10 to 70)]
Given assumptions of the CLRM
For the classical linear regression model (CLRM)
$$y_t = \alpha + \beta x_t + u_t$$
the following assumptions are made about the error term $u_t$:
1. The expected value of the error terms is 0: $E(u_t) = 0$
2. The variance of the error terms is constant and finite for all values of $x_t$: $\operatorname{var}(u_t) = \sigma^2 < \infty$
3. The error terms are linearly independent of one another: $\operatorname{cov}(u_i, u_j) = 0$ for $i \neq j$
4. There is no relation between the error terms and the variable x: $\operatorname{cov}(u_t, x_t) = 0$
5. The error terms are normally distributed: $u_t \sim N(0, \sigma^2)$
If the assumptions are fulfilled, the estimators are called Best Linear Unbiased Estimators (BLUE)
cf. Brooks (2008), p. 43ff.
Excursus: Standard error (arithmetic mean)
Arithmetic mean (concrete sample)
Variance
Standard error (square root of the variance)
The standard error shows how far the estimates of the mean are, on average, spread around the true mean.
How “good” are the estimators $\hat{\alpha}$ and $\hat{\beta}$?
The estimators $\hat{\alpha}$ and $\hat{\beta}$ are specific to the sample. The standard error can serve as a measure of reliability and accuracy:
$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$$
$$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$
With:
$$s = \sqrt{\frac{\sum \hat{u}_t^2}{T - k - 1}}$$
(k is the number of independent variables; in the example k = 1)
“If the standard errors are small, it shows that the coefficients are likely to be precise on average, not how precise they are for this particular sample.” (Brooks 2008, p. 46)
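Calculation of the standard error
$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}} \qquad SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$
Example
t    1    2    3    4    5    6     7    8    9    10
xt   16   20   29   31   37   44    48   50   59   61
yt   0.5  0.8  0.8  0.6  0.6  0.65  0.3  0.5  0.1  0.2
$$SE(\hat{\alpha}) = \sqrt{\frac{0.215}{8}} \sqrt{\frac{17769}{10 \cdot 2166.5}} \approx 0.14845$$
$$SE(\hat{\beta}) = \sqrt{\frac{0.215}{8}} \sqrt{\frac{1}{2166.5}} \approx 0.00352$$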
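As a cross-check of these numbers, the following sketch recomputes the residual sum of squares and both standard errors from the raw data. It assumes the standard-error formulas given above with k = 1; all variable names are illustrative.

```python
import math

# Example data: age (x_t) and share of stocks (y_t).
ages = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
stock_share = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

T = len(ages)
k = 1  # number of independent variables
x_bar = sum(ages) / T
y_bar = sum(stock_share) / T

# OLS estimates of slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in ages)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, stock_share)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residual sum of squares and the estimated standard deviation s of the errors.
rss = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(ages, stock_share))
s = math.sqrt(rss / (T - k - 1))

# Standard errors of the two coefficients.
se_alpha = s * math.sqrt(sum(x * x for x in ages) / (T * s_xx))
se_beta = s * math.sqrt(1 / s_xx)

print(f"RSS = {rss:.3f}, SE(alpha) = {se_alpha:.3f}, SE(beta) = {se_beta:.5f}")
```

Small differences in the last digits compared with the slide are to be expected, because the slide works with the rounded value RSS = 0.215.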
Influences on the standard error
$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}} \qquad SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$
The bigger…
 the sample size T,
 the sum of squared deviations $\sum (x_t - \bar{x})^2$,
and the smaller…
 the estimate $s^2$ of the variance of the error terms,
 the sum of squares $\sum x_t^2$,
… the smaller the standard errors of the coefficients are
Variance structure in the linear regression
In linear regression there are three different kinds of variation for every measured value yi:
1. Every value deviates from the mean; from this you can calculate the total variation (→ TSS)
2. The values estimated by the regression line deviate from the mean as well (→ ESS)
3. There is a discrepancy between empirical and predicted values (→ RSS)
Cf. Rasch, Friese, Hofmann, Naumann: Quantitative Methoden 1
Goodness of Fit: R²
How well does the model describe the variation of the dependent variable?
Possibility: look at the residual sum of squares (RSS)
Problem: What does an RSS value of 0.215 from the OLS estimation mean?
Use the scaled version of RSS: R²
Idea: Explain the variability of y by its deviation from the average value (benchmark)
Calculation of the total sum of squares (TSS): $TSS = \sum (y_t - \bar{y})^2$
Separation of TSS into ESS (explained sum of squares) and RSS:
$$TSS = ESS + RSS$$
$$\sum (y_t - \bar{y})^2 = \sum (\hat{y}_t - \bar{y})^2 + \sum \hat{u}_t^2$$
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
R² has to have a value between 0 and 1:
R² = 0: the model explains no variability of y (since TSS = RSS)
R² = 1: the model explains all the variability of y (since RSS = 0)
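To make the decomposition concrete, the sketch below computes TSS, ESS, RSS, and R² for the age and stock-share example. It assumes the sums-of-squares definitions stated above; the variable names are illustrative.

```python
# Example data: age (x_t) and share of stocks (y_t).
ages = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
stock_share = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

T = len(ages)
x_bar = sum(ages) / T
y_bar = sum(stock_share) / T

# OLS fit of the line y = alpha + beta * x.
beta_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, stock_share))
            / sum((x - x_bar) ** 2 for x in ages))
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * x for x in ages]

# The three sums of squares.
tss = sum((y - y_bar) ** 2 for y in stock_share)             # total
ess = sum((f - y_bar) ** 2 for f in fitted)                  # explained
rss = sum((y - f) ** 2 for y, f in zip(stock_share, fitted))  # residual

r_squared = 1 - rss / tss
print(f"TSS = {tss:.4f}, ESS = {ess:.4f}, RSS = {rss:.4f}, R^2 = {r_squared:.3f}")
```

For this sample R² comes out at roughly 0.58, which puts the otherwise hard-to-read RSS of 0.215 into perspective: age alone explains a bit more than half of the variation in the stock share.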
Example for R²
[Figure: three scatter plots with fitted regression lines, R² = 0.976, R² = 0.732, and R² = 0.058]
Problem with the interpretation of R²
Problem with R² and the corrected R²
Problems:
R² gives no answer as to whether the correct regression was used or whether important variables are missing
R² allows no comparison between models with different variables
R² never decreases when the number of independent variables increases
Many models have similar or identical values of R²
Corrected R²
Considers the loss of degrees of freedom caused by adding more variables:
$$R^2_{adj} = R^2 - \frac{k (1 - R^2)}{T - k - 1}$$
Rule of choice: add a variable if R²adj increases; do not add it if R²adj decreases
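The correction can be tried out with a small helper. This is a sketch of the formula above; the function name `adjusted_r_squared` is a hypothetical choice, and the input R² = 0.58 is the approximate value for the age and stock-share example with T = 10 and k = 1.

```python
def adjusted_r_squared(r_squared, T, k):
    """Corrected R^2 for T observations and k independent variables."""
    return r_squared - k * (1 - r_squared) / (T - k - 1)


# One regressor, ten observations.
print(round(adjusted_r_squared(0.58, T=10, k=1), 4))

# Adding a second variable that leaves R^2 almost unchanged lowers the corrected R^2,
# so by the rule of choice this variable should not be added.
print(adjusted_r_squared(0.585, T=10, k=2) < adjusted_r_squared(0.58, T=10, k=1))
```

The same value results from the equivalent and widely used form 1 - (1 - R²)(T - 1)/(T - k - 1).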
Hypothesis and statistical null hypothesis
Research hypotheses have to be distinguished from statistical null hypotheses
Statistical null hypotheses have to be derived from research hypotheses in order to test the research hypotheses
Statistical tests check whether a parameter equals a given value:
H0: β = 0 and H1: β ≠ 0 (two-sided test)
H0: β = 1 and H1: β > 1 (one-sided test)
In statistics there is a null hypothesis H0 and an alternative hypothesis H1
The null hypothesis is the hypothesis that is tested and then either rejected or not rejected.
The p-value is the probability of obtaining the observed data (or more extreme data) if the null hypothesis is true (interpretation as probability of error)
A null hypothesis can only be rejected or not rejected; it can never be “accepted”.
Therefore the tests have to be designed in such a way that rejecting the null hypothesis supports the research hypothesis.
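For the regression example, the two-sided test of H0: β = 0 can be sketched as a t-test on the slope. The test statistic (estimate minus hypothesized value, divided by the standard error) follows the usual textbook approach; the critical value 2.306 is the two-sided 5% value of the t-distribution with 8 degrees of freedom, taken from a standard table and not from the slides.

```python
# Rounded values from the regression example.
beta_hat = -0.01171   # estimated slope
se_beta = 0.00352     # standard error of the slope
beta_null = 0.0       # value of beta under H0

T, k = 10, 1
degrees_of_freedom = T - k - 1   # 8

# Assumption: two-sided 5% critical value for 8 degrees of freedom from a t-table.
t_critical = 2.306

t_stat = (beta_hat - beta_null) / se_beta
reject_null = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, reject H0: {reject_null}")
```

Since |t| is about 3.33 and exceeds the critical value, H0: β = 0 is rejected at the 5% level in this sample, which supports the research hypothesis that the stock share depends on age.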
References
Andrew, C. and Thomas, S. (1995). The overreaction hypothesis and the UK stockmarket. Journal of Business Finance & Accounting 22, pp. 961-973.
Brooks, C. (2008). Introductory econometrics for finance. Second edition. Cambridge University Press.
Canner, N., Mankiw, N.G. and Weil, D.N. (1997). An Asset Allocation Puzzle. American Economic Review 87, pp. 181-191.
Debondt, W.F.M. and Thaler, R.H. (1985). Does the stock market overreact? Journal of Finance 40, pp. 793-805.
Debondt, W.F.M. and Thaler, R.H. (1987). Further evidence on investor overreaction and stock market seasonality. Journal of Finance 42, pp. 557-580.
Fan, J. and Yao, Q. (2002). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer Series in Statistics.
Lintner, J. (1965). The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets. Review of Economics and Statistics 47, pp. 13-37.
Percival, D.B. and Walden, A.T. (2000). Wavelet Methods for Time Series Analysis.
Tobin, J. (1958). Liquidity Preference as Behavior Towards Risk. Review of Economic Studies 25, pp. 65-86.
Wooldridge, J.M. (2009). Introductory econometrics: A modern approach. Fourth edition. South-Western.
Clustering
… or how a well-sorted data collection emerges from a large number of individual data points
Motivation
Today large amounts of data arise from the use of the internet: “Big Data”
Hard to evaluate and to work with
Need for methods to reduce complexity
Cluster analysis
A cluster (bunch, swarm, heap, group) describes a group of individual data objects
The individual data objects within a cluster are supposed to be “similar”
Cluster analysis (clustering) refers to procedures that uncover similarity structures and form clusters based on them
In contrast to classification, clustering creates new groups; no prior knowledge is necessary (unsupervised process)
Cluster analysis: purpose
Practical applications are e.g.
• Marketing (market or customer segmentation)
• Image processing (pattern recognition)
• Biology, genetics (sequence analysis)
• Meteorology (search for periodic patterns)
• Data mining for Big Data (social network analysis, bioinformatics)
How does clustering work?
The primary goal is to create a data structure in which the most similar data elements are combined
Upcoming questions:
• How is similarity defined and measured?
• How are clusters built?
• How many clusters are built?
[Figure: How many clusters? The same data set grouped into two, four, and six clusters]
Similarity
Similarity is the degree of agreement between objects, measured across all of the characteristics used
Measurable by correlation measures and distance measures
[Figure: two charts, each comparing two profiles over Category 1 to Category 4 on a scale from 0 to 7; panel captions: "Low distance, low correlation" and "Higher distance, higher correlation"]
Similarity of sets
The Jaccard coefficient is a similarity measure for sets:
The number of common elements (intersection) is divided by the number of elements of the total set (union).
Example:
Also useful for the similarity of two vectors V1 and V2 with binary entries:
M11: number of entries in which both vectors contain a 1
M00: number of entries in which both vectors contain a 0
M10: number of entries in which V1 contains a 1 and V2 a 0
M01: number of entries in which V1 contains a 0 and V2 a 1
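A minimal sketch of both variants follows. The set version implements the verbal definition above. For binary vectors it assumes the standard form M11 / (M01 + M10 + M11), in which M00 is ignored; this formula did not survive on the slide. The function names and the example values are illustrative.

```python
def jaccard_sets(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_binary(v1, v2):
    """Jaccard coefficient of two equally long 0/1 vectors."""
    m11 = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 1)
    m10 = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 0)
    m01 = sum(1 for p, q in zip(v1, v2) if p == 0 and q == 1)
    denominator = m11 + m10 + m01   # M00 does not enter the coefficient
    return m11 / denominator if denominator else 1.0


print(jaccard_sets({1, 2, 3}, {2, 3, 4}))                  # 2 common / 4 in union
print(jaccard_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))    # M11 = 2, M10 = 1, M01 = 1
```

Both calls yield 0.5: two shared elements out of four in the union.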
Distance measures
Important because clustering algorithms work with the distances between points and clusters. Suitable for metric scales:
Euclidean distance
Squared Euclidean distance
City-block (Manhattan) distance
[Figure: Voronoi diagrams under the Euclidean distance and under the Manhattan metric]
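The three distance measures listed above can be written down directly. This is a short sketch with their standard definitions for points given as coordinate tuples; the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Euclidean distance without the square root; weights large differences more."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (1, 2), (4, 6)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))
```

For the points (1, 2) and (4, 6) the differences are 3 and 4, so the Euclidean distance is 5, the squared Euclidean distance 25, and the Manhattan distance 7.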
Distances between clusters
Single link: smallest distance between an element in one cluster and an element in the other cluster, dist(Ki, Kj) = min(tip, tjq)
Complete link: biggest distance between an element in one cluster and an element in the other cluster, dist(Ki, Kj) = max(tip, tjq)
Average: average distance between an element in one cluster and an element in the other cluster, dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centers of the two clusters, dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of the two clusters, dist(Ki, Kj) = dist(Mi, Mj)
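The linkage definitions can be compared on a toy example. The sketch below implements single, complete, average, and centroid linkage with Euclidean distances (`math.dist`, available from Python 3.8); the medoid variant is left out, and the two clusters `K_I` and `K_J` are made-up points on a line.

```python
import math
from itertools import product


def single_link(k1, k2):
    """Smallest distance between any pair of points from the two clusters."""
    return min(math.dist(p, q) for p, q in product(k1, k2))


def complete_link(k1, k2):
    """Biggest distance between any pair of points from the two clusters."""
    return max(math.dist(p, q) for p, q in product(k1, k2))


def average_link(k1, k2):
    """Average distance over all pairs of points from the two clusters."""
    return sum(math.dist(p, q) for p, q in product(k1, k2)) / (len(k1) * len(k2))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_link(k1, k2):
    """Distance between the two cluster centers."""
    return math.dist(centroid(k1), centroid(k2))


K_I = [(0.0, 0.0), (1.0, 0.0)]
K_J = [(4.0, 0.0), (6.0, 0.0)]
for link in (single_link, complete_link, average_link, centroid_link):
    print(link.__name__, link(K_I, K_J))
```

Here single link gives 3, complete link 6, and both average and centroid link 4.5, which shows how strongly the choice of linkage can change the distance between the same two clusters.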
Differentiation of cluster algorithms
Different possibilities with advantages and disadvantages, depending e.g. on data size, goal, and structure
There is never only one suitable algorithm
Cluster algorithms:
  Graph theory
  Hierarchical: divisive, agglomerative
  Partitioning: exchange processes, minimum distance processes
  Optimizing
Hierarchical processes
Start with one big cluster or with many single-element clusters, which are refined step by step
Stop when a stop criterion is fulfilled (e.g. all data points combined or all separated)
Differentiation between divisive (top-down) and agglomerative (bottom-up) methods
+ Advantages: flexibility, number of clusters flexible
- Disadvantages: runtime complexity, effort for analyzing the results, clusters cannot be changed once they are built
Example for algorithm AGNES
Agglomerative method (AGNES = agglomerative nesting)
Survey with 7 participants about brand recognition (V1) and buying behavior (V2), measured on a scale from 0 to 10
Distance matrix
Distance matrix with Euclidean distances between the observations
Observation   A       B       C       D       E       F       G
A             ---
B             3.162   ---
C             5.099   2.000   ---
D             5.099   2.828   2.000   ---
E             5.000   2.236   2.236   4.123   ---
F             6.403   3.606   3.000   5.000   1.414   ---
G             3.606   2.236   3.606   5.000   2.000   3.162   ---
Agglomerative building of clusters
Agglomerative process and cluster solution (overall similarity measure, OSM = average within-cluster distance)
Step      Minimum distance   Observation pair   Cluster membership        Number of clusters   OSM
Initial                                         (A)(B)(C)(D)(E)(F)(G)     7                    0
1         1.414              E-F                (A)(B)(C)(D)(E-F)(G)      6                    1.414
2         2.000              E-G                (A)(B)(C)(D)(E-F-G)       5                    2.192
3         2.000              C-D                (A)(B)(C-D)(E-F-G)        4                    2.144
4         2.000              B-C                (A)(B-C-D)(E-F-G)         3                    2.234
5         2.236              B-E                (A)(B-C-D-E-F-G)          2                    2.896
6         3.162              A-B                (A-B-C-D-E-F-G)           1                    3.420
Successive combination of clusters until all observations are connected
Afterwards, analysis of which cluster size and fragmentation makes sense
Here you should stop after step 4 because the OSM increases significantly at step 5
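The merge sequence in the table can be retraced with a small single-link implementation that works directly on the distance matrix. This is a sketch, not the lecture's own code; the names `DIST`, `dist`, `single_link`, and `agnes` are illustrative. Three pairs are tied at distance 2.000, so the order of steps 2 to 4 may differ from the table, but the merge distances and the three clusters after step 4 are the same.

```python
from itertools import combinations

# Euclidean distances between the seven observations (from the distance matrix).
DIST = {
    ("A", "B"): 3.162, ("A", "C"): 5.099, ("A", "D"): 5.099, ("A", "E"): 5.000,
    ("A", "F"): 6.403, ("A", "G"): 3.606, ("B", "C"): 2.000, ("B", "D"): 2.828,
    ("B", "E"): 2.236, ("B", "F"): 3.606, ("B", "G"): 2.236, ("C", "D"): 2.000,
    ("C", "E"): 2.236, ("C", "F"): 3.000, ("C", "G"): 3.606, ("D", "E"): 4.123,
    ("D", "F"): 5.000, ("D", "G"): 5.000, ("E", "F"): 1.414, ("E", "G"): 2.000,
    ("F", "G"): 3.162,
}


def dist(a, b):
    """Look up the symmetric distance between two observations."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def single_link(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(dist(a, b) for a in c1 for b in c2)


def agnes(items):
    """Merge the two closest clusters until only one cluster is left."""
    clusters = [frozenset([item]) for item in items]
    history = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        distance = single_link(clusters[i], clusters[j])
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        history.append((distance, list(clusters)))
    return history


history = agnes("ABCDEFG")
for step, (distance, clusters) in enumerate(history, start=1):
    print(step, distance, ["-".join(sorted(c)) for c in clusters])
```

Computing the overall similarity measure for each step and picking the stopping point are left out to keep the sketch short.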
Graphic Visualization
Stopping after step 4 results in the following 3 clusters: (A), (B-C-D), (E-F-G)
Visualization: Dendrogram
[Figure: original data set and dendrogram after AGNES]
Shows which clusters are combined at which step until they merge into one complete cluster at the end
Outliers are recognizable
Distances are given
(Center-based) partitioning method
In general: iterative shifting of the cluster centers while minimizing an error function
Partitioning the data set D with n objects into a set of k clusters so that the sum of the squared Euclidean distances is minimized:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$$
where $c_i$ is the center of cluster $C_i$
Cluster assignments can always be changed
+ Advantages: clusters flexible, efficient with O(tkn) for t iterations, k clusters, and n objects
- Disadvantages: number of clusters fixed, not suitable for non-convex clusters, sensitive to outliers
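The error function E can be evaluated for any given partition. The sketch below assumes points as coordinate tuples and one center per cluster; the function name and the six sample points are made up for illustration.

```python
def sum_squared_error(clusters, centers):
    """E: sum over all clusters of the squared Euclidean distances to the center."""
    return sum(
        sum((pc - cc) ** 2 for pc, cc in zip(point, center))
        for members, center in zip(clusters, centers)
        for point in members
    )


def mean_point(members):
    """Coordinate-wise mean of the points in one cluster."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


clusters = [
    [(1, 1), (1, 2), (2, 1)],
    [(8, 8), (9, 8), (8, 9)],
]
centers = [mean_point(members) for members in clusters]

print(sum_squared_error(clusters, centers))            # centers at the cluster means
print(sum_squared_error(clusters, [(1, 1), (8, 8)]))   # centers placed on data points
```

With the centers at the cluster means E is 8/3; placing the centers on the corner points instead raises E to 4, which illustrates why the mean is used as the center in each update.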
k-Means Algorithm
Partitioning method
First, the number of center points (k) is set and they are positioned randomly or selectively in the data set
Afterwards, the data points are allocated to the closest center (see Voronoi diagrams)
New position of the center points (calculation of the mean of each cluster)
Variations:
• k-Means++: faster, better selection of the initial centers
• k-Medoids: update by the medoid rule (less sensitive to outliers)
k-Means Algorithm: steps
1. Randomly initialize the cluster centers (starting situation: the original data set)
2. Assign the data points to the nearest cluster center
3. Recalculate the cluster centers
4. Repeat steps 2 and 3 until convergence
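The loop of assigning points and recalculating centers can be sketched in plain Python. To keep the result reproducible, this version takes the initial centers as an argument instead of drawing them at random; the function name `k_means` and the sample points are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(points, centers, max_iter=100):
    """Plain k-means: alternate assignment and center update until nothing changes."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move every center to the mean of its assigned points.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old_center)  # keep the center of an empty cluster
        if new_centers == centers:   # convergence: the centers no longer move
            break
        centers = new_centers
    return labels, centers


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, centers=[(1, 1), (8, 8)])
print(labels)
print(centers)
```

On this small data set the algorithm converges after two passes, with the first three points in one cluster and the last three in the other. With other starting centers the result can differ, which is the sensitivity to the starting position noted for k-means.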
Weaknesses of k-means
[Figure: 200 and 600 points in 2 regions; panels: desired result, result of k-means clustering]
Quality criteria
3 different kinds of quality measurement:
external: compare the clusters with external expert knowledge or information sources
internal: complexity, variance criteria, silhouette coefficient; how far apart are the clusters, and are they well separated?
relative: comparison of different cluster methods, different numbers of clusters, different start parameters
Possible measures:
Within-cluster sum of squares (see above)
Silhouette
a(i): average distance from point i to the other points in its own cluster
b(i): minimal average distance from point i to the points of another cluster (the nearest neighboring cluster)
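The slide defines a(i) and b(i) but not how they are combined. The sketch below assumes the usual silhouette definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), which is not stated in the source. It also assumes at least two clusters and Euclidean distances; the function name and the sample points are illustrative.

```python
import math


def silhouette(points, labels, i):
    """Silhouette value of point i (assumed form: (b - a) / max(a, b))."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention chosen here for clusters with a single point
    a = sum(same) / len(same)
    # b: smallest average distance to the points of any other cluster.
    b = min(
        sum(math.dist(points[i], p) for p, label in zip(points, labels) if label == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
labels = [0, 0, 1, 1]
print([round(silhouette(points, labels, i), 3) for i in range(len(points))])
```

Values close to 1 indicate that a point lies much closer to its own cluster than to the nearest other cluster; values near 0 or below 0 point to overlapping or badly assigned clusters.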
Conclusion
Clustering groups individual observations according to their similarity
Clustering has applications from data mining in marketing to genetics
There are different types of clustering algorithms. The most important are hierarchical methods (AGNES, single linkage) and partitioning methods (k-Means)
A problem with the k-means algorithm is the specification of the clusters (number and starting positions)
The quality of clustering results and algorithms is not trivial to evaluate and should be assessed with various criteria
