
Management of Information Systems

Prof. Dr. Christof Weinhardt – Ewa Lux

Institute of Information Systems and Marketing (IISM), Karlsruhe Service Research Institute (KSRI)

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association, www.kit.edu
[Figure: the stages Acquisition, Storing, Transformation, Evaluation and Commercialization, with utilization, framed by economy, technology, law and society]

Outline of the lecture

Acquisition: Information, Measuring/Observation, Experiments, Forecasting, Simulation, Survey, Interviews
Storing: Databases, SQL, Pivoting, Semantics and Ontologies
Transformation: Basics, Filtering, Regression, Cluster Analysis
Evaluation: Utility Analysis, AHP, Decision Rules, Information Value, Page Rank
Marketing: Internet Economics, Digital Goods, Network Effects, Standardization Networks, Pricing, Bundling

Overview: Transformation

Basics
Classification
Aggregation
OLAP

Filtering

Regression

Cluster analysis

The t-Test (two samples, independent)
Variance unknown:
Different sample size, same variance
Different sample size, different variance (Welch's test)
Variance known
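
The two cases with unknown variance can be sketched in Python. This is a minimal sketch under assumptions: it uses the standard textbook forms of the pooled statistic and of Welch's statistic, and the function names and both samples are hypothetical, invented purely for illustration.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)

def pooled_t(a, b):
    """Two-sample t statistic assuming equal but unknown variances."""
    n1, n2 = len(a), len(b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(a, b):
    """Welch's t statistic for unequal, unknown variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples of different size (illustration only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.8, 4.4, 5.0, 4.6, 4.1, 4.7]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.2f}")
```

With equal sample sizes both statistics coincide; they differ only in the degrees of freedom used to look up the critical value.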

The t-Test

“To call in the statistician after the experiment is done may
be no more than asking him to perform a postmortem
examination: he may be able to say what the experiment
died of.”

Ronald Fisher (1890 – 1962), Indian Statistical Congress, Sankhya, ca. 1938

Regression as a method in econometrics
Econometrics deals with the application of statistical and mathematical methods
to economic questions
With the help of econometric models we try to answer economic questions by
statistical inference based on data
The term was coined at the beginning of the 20th century.
Nowadays econometric models are essential analysis methods in economics
and finance
Comparable statistical models are used in physics (time series analysis) or in
biology and medicine (panel data)
Frequently used software packages are SAS (especially in finance and industry),
SPSS, Stata and R

Steps to create an economic model
0. Real-world problem and first hypothesis
1a. Literature research about the problem
1b. Draft of an estimable theoretical model and of a testable hypothesis
2. Data collection and acquisition
3. Model estimation
4. Is the model correct and significant?
   No: modify the model
   Yes: continue with step 5
5. Theoretical interpretation of the model
6. Application of the model

Brooks (2008), p. 9ff.

Basics Regression
Set up a theoretical model based on practical intuition or on a phenomenon
documented in the literature
New factors often modify or extend the interpretation of a known phenomenon
(e.g. the sales volume of a product depends not only on the price but also on
the advertising used)
Scientific work can also contribute by studying a new market or by studying a market
with a larger data set

The work should link at least two variables
E.g. on average, exam success increases with the time spent on exam preparation.

The model has to give an estimable, acceptable approximation of the real-world problem.

Example: How does the risk aversion of an investor
influence his portfolio decisions?

An approach to explain the choice between risky and risk-free investments:
Risk-free investments can be e.g. bonds or government securities
Risky investments can be e.g. shares of a biotechnology start-up
→ Risky investments can bring higher returns, but the risk of a loss is also higher

Which factors influence the ratio of risky to risk-free investments in the portfolio?
The current assessment of the world economy
The long-term relation between share price and dividend
The risk aversion of the investor
The age of the investor

Example: Risky investments and the age of investors

Person t   Age   % Stocks
1          16    50
2          20    80
3          29    80
4          31    60
5          37    60
6          44    65
7          48    30
8          50    50
9          59    10
10         61    20

[Figure: scatter plot of % Stocks (0% to 100%) against Age (10 to 70)]

Regression: Analytical view

“A regression model is concerned with describing and evaluating the
relationship between a given variable [of interest] and one or more
other variables” (Brooks 2008, p. 27)
Variations in y are explained by x1, x2, …, xk
y is called the dependent variable because its value depends on x1, x2, …, xk
x1, x2, …, xk are called independent variables

Important notes and problems:
Difference between correlation (which implies no causal dependence)
and regression
How many independent variables xk and equations are necessary to
describe the problem appropriately?
Which estimation methods are suitable for a given data set?

Classical linear regression

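Equation for a line with a disturbance term $u_t$ (estimated: $\hat{u}_t$):
$$y_t = \hat{\alpha} + \hat{\beta} x_t + \hat{u}_t$$

Ordinary least squares (OLS) / minimizing the residual sum of squares (RSS):
$$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \rightarrow \min$$

Result: fitted line
$$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t$$

[Figure: scatter plot of % Stocks (0% to 100%) against Age (25 to 70) with the fitted line; the intercept α, the slope β, and the residual $\hat{u}_t$ between an observation $y_t$ and its fitted value $\hat{y}_t$ at $x_t$ are marked]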

OLS estimator and estimated regression term

Estimated coefficients for the slope and the intercept:

$$\hat{\beta} = \frac{\sum x_t y_t - T \bar{x} \bar{y}}{\sum x_t^2 - T \bar{x}^2} = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sum (x_t - \bar{x})^2}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$

Example
t 1 2 3 4 5 6 7 8 9 10
xt 16 20 29 31 37 44 48 50 59 61
yt 0.5 0.8 0.8 0.6 0.6 0.65 0.3 0.5 0.1 0.2

$$\hat{\beta} = \frac{16 \cdot 0.5 + \dots + 61 \cdot 0.2 - 10 \cdot 39.5 \cdot 0.505}{256 + \dots + 3721 - 10 \cdot 1560.25} = \frac{-25.375}{2166.5} = -0.01171$$
$$\hat{\alpha} = 0.505 - (-0.01171) \cdot 39.5 = 0.9676$$
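
The estimation for the example data can be retraced with a few lines of Python. This is a minimal sketch of the simple (one-regressor) OLS formulas; the function name `ols_fit` is hypothetical, and the data are the ten age and stock-share observations of the example.

```python
# Example data: age (x) and share of stocks in the portfolio (y).
x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the model y = alpha + beta * x + u."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    # Numerator and denominator of the slope estimator.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - T * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - T * x_bar ** 2
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

alpha_hat, beta_hat = ols_fit(x, y)
print(f"beta_hat  = {beta_hat:.5f}")
print(f"alpha_hat = {alpha_hat:.4f}")
```

The negative slope says that, in this sample, the stock share falls by roughly 1.2 percentage points per additional year of age.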

Example: Draft of the first problem

Hypothesis:
Investors reduce the ratio of risky to risk-free investments in their portfolio
depending on their age; e.g. the portfolios of older investors contain a lower
percentage of risky stocks than those of younger investors.

[Figure: scatter plot of % Stocks (0% to 100%) against Age (10 to 70)]

Given assumptions of the CLRM

For the classical linear regression model (CLRM)
$$y_t = \alpha + \beta x_t + u_t$$
the following assumptions are made about the error term $u_t$:
The expected value of the error terms is 0: $E(u_t) = 0$
The variance of the error terms is constant and finite for all values of $x_t$: $var(u_t) = \sigma^2 < \infty$
The error terms are linearly independent of one another: $cov(u_i, u_j) = 0$ for $i \neq j$
There is no relation between the error terms and the variable x: $cov(u_t, x_t) = 0$
The error terms are normally distributed: $u_t \sim N(0, \sigma^2)$
If the assumptions are fulfilled, the estimators are called
Best Linear Unbiased Estimators (BLUE)
cf. Brooks (2008), p. 43ff.

Excursus: Standard error (arithmetic mean)

Arithmetic mean (concrete sample)
Variance
Standard error (square root of the variance)

The standard error shows how far, on average, the estimates of the mean are
spread around the exact mean.
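
The three quantities can be illustrated with a short Python sketch. It rests on assumptions: the variance is taken to be the sample variance with n − 1 in the denominator, and the standard error is read as the square root of the variance of the sample mean, s²/n. The function name is hypothetical; the data are the ages from the investor example.

```python
import math

def standard_error_of_mean(sample):
    """Return (mean, sample variance, standard error of the mean)."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance with n - 1 in the denominator.
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    # Variance of the sample mean is var / n; its square root is the standard error.
    return mean, var, math.sqrt(var / n)

ages = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
mean_age, var_age, se_age = standard_error_of_mean(ages)
print(f"mean = {mean_age}, variance = {var_age:.2f}, standard error = {se_age:.2f}")
```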

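How “good” are the estimators $\hat{\alpha}$ and $\hat{\beta}$?
The estimators $\hat{\alpha}$ and $\hat{\beta}$ are specific to the sample. The standard error can
serve as a measure of reliability and accuracy:

$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$$
$$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$

With:
$$s = \sqrt{\frac{\sum \hat{u}_t^2}{T - k - 1}}$$
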
“If the standard errors are small, it shows that the coefficients are likely to be
precise on average, not how precise they are for this particular sample.”
(Brooks 2008, p. 46)

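Calculation of the standard error

$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$$
$$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$

Example
t 1 2 3 4 5 6 7 8 9 10
xt 16 20 29 31 37 44 48 50 59 61
yt 0.5 0.8 0.8 0.6 0.6 0.65 0.3 0.5 0.1 0.2

$$SE(\hat{\alpha}) = \sqrt{\frac{0.215}{8}} \cdot \sqrt{\frac{17769}{10 \cdot 2166.5}} = 0.14845$$
$$SE(\hat{\beta}) = \sqrt{\frac{0.215}{8}} \cdot \sqrt{\frac{1}{2166.5}} = 0.00352$$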
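
The calculation can be retraced in Python. This is a minimal sketch with a hypothetical function name; it recomputes the residual sum of squares from the fitted line rather than using the rounded value 0.215, so the last digits may differ slightly from the figures in the example.

```python
import math

x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

def coefficient_standard_errors(x, y, k=1):
    """Return (rss, se_alpha, se_beta) for a regression with k regressors (here k = 1)."""
    T = len(x)
    x_bar, y_bar = sum(x) / T, sum(y) / T
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # Residual sum of squares of the fitted line.
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    # Estimated standard deviation of the error term.
    s = math.sqrt(rss / (T - k - 1))
    se_alpha = s * math.sqrt(sum(xi ** 2 for xi in x) / (T * s_xx))
    se_beta = s * math.sqrt(1 / s_xx)
    return rss, se_alpha, se_beta

rss, se_alpha, se_beta = coefficient_standard_errors(x, y)
print(f"RSS = {rss:.3f}, SE(alpha) = {se_alpha:.4f}, SE(beta) = {se_beta:.5f}")
```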

Influences on the standard error

$$SE(\hat{\alpha}) = s \sqrt{\frac{\sum x_t^2}{T \sum (x_t - \bar{x})^2}}$$
$$SE(\hat{\beta}) = s \sqrt{\frac{1}{\sum (x_t - \bar{x})^2}}$$

The bigger…
→ the sample size T,
→ the sum of squared deviations $\sum (x_t - \bar{x})^2$

and the smaller…
→ the estimate $s^2$ of the variance of the error terms
→ the sum of squares $\sum x_t^2$

… the smaller the standard errors of the coefficients are

Variance structure in the linear regression

In linear regression there are three different kinds of variation for every
measured value yi

1. Every value deviates from the mean; the total variation is calculated from these deviations (→ TSS)
2. The values estimated by the regression line deviate from the mean as well (→ ESS)
3. There is a discrepancy between empirical and predicted values (→ RSS)

Cf. Rasch, Friese, Hofmann, Naumann: Quantitative Methoden 1

Goodness of Fit: R²
How well does the model describe the variation of the dependent
variable?
Possibility: look at the residual sum of squares (RSS)
Problem: What does an RSS value of 0.215 from an OLS estimation mean?
Use the scaled version of RSS: R²
Idea: explain the variability of y relative to its variation around the mean value
(benchmark)
Calculation of the total sum of squares (TSS): $TSS = \sum (y_t - \bar{y})^2$
Separation of TSS into ESS (explained sum of squares) and RSS:

$$TSS = ESS + RSS$$
$$\sum (y_t - \bar{y})^2 = \sum (\hat{y}_t - \bar{y})^2 + \sum \hat{u}_t^2$$
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

R² has to have a value between 0 and 1
R² = 0: the model explains no variability of y (since TSS = RSS)
R² = 1: the model explains all the variability of y (since RSS = 0)
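
The decomposition can be checked numerically on the age and stock-share example. This is a minimal sketch with a hypothetical function name; it refits the simple regression line and then computes TSS, ESS, RSS and R² from the fitted values.

```python
x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

def sums_of_squares(x, y):
    """Return (tss, ess, rss, r2) for the simple regression of y on x."""
    T = len(x)
    x_bar, y_bar = sum(x) / T, sum(y) / T
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    y_hat = [alpha_hat + beta_hat * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the line
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left unexplained
    return tss, ess, rss, ess / tss

tss, ess, rss, r2 = sums_of_squares(x, y)
print(f"TSS = {tss:.4f}, ESS = {ess:.4f}, RSS = {rss:.4f}, R2 = {r2:.3f}")
```

For this sample the line explains a bit more than half of the variation in the stock share.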
Example for R²

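[Figure: three example plots with R² = 0.976, R² = 0.732 and R² = 0.058]

$$TSS = ESS + RSS$$
$$\sum (y_t - \bar{y})^2 = \sum (\hat{y}_t - \bar{y})^2 + \sum \hat{u}_t^2$$
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$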

Problem with the interpretation of R²

Problem with R² and the corrected R²

Problems:
R² does not tell whether the correct regression was used or whether important
variables are missing
R² allows no comparison between models with different variables
R² never decreases when further independent variables are added
Many models have similar or identical values of R²

Corrected R²
Takes into account the loss of degrees of freedom caused by adding more variables:

$$R^2_{adj} = R^2 - \frac{k (1 - R^2)}{T - k - 1}$$

Rule of choice: add a variable if $R^2_{adj}$ increases; do not add it if $R^2_{adj}$ decreases
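
The rule of choice can be sketched in Python. The function name and both R² values are hypothetical, chosen only to show a case where adding a variable raises R² slightly but lowers the corrected R².

```python
def adjusted_r2(r2, T, k):
    """Corrected R2 for T observations and k independent variables."""
    return r2 - k * (1 - r2) / (T - k - 1)

# Hypothetical comparison: a second regressor lifts R2 from 0.58 to 0.59.
T = 10
adj_one_variable = adjusted_r2(0.58, T, k=1)
adj_two_variables = adjusted_r2(0.59, T, k=2)

# Rule of choice: keep the extra variable only if the corrected R2 increases.
add_variable = adj_two_variables > adj_one_variable
print(f"{adj_one_variable:.4f} vs {adj_two_variables:.4f}, add variable: {add_variable}")
```

The formula is algebraically the same as the common form 1 − (1 − R²)(T − 1)/(T − k − 1).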

Hypothesis and statistical null hypothesis
Research hypotheses have to be distinguished from statistical null hypotheses
Statistical null hypotheses have to be derived from research hypotheses in order to
test the research hypotheses
Statistical tests check whether a parameter equals a given value
H0: β = 0 and H1: β ≠ 0 (two-sided test)
H0: β = 1 and H1: β > 1 (one-sided test)

In statistics there is a null hypothesis H0 and an alternative hypothesis H1
The null hypothesis is the hypothesis that is tested and then either rejected or not.
The p-value gives the probability of obtaining the observed data
(or more extreme data) if the null hypothesis is true (interpretation as error
probability)
A null hypothesis can only be rejected or not rejected; it can never be “accepted”.
Therefore the tests have to be designed in such a way that rejecting the null
hypothesis supports the research hypothesis.
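
For the age and stock-share example, the two-sided test of H0: β = 0 can be sketched as follows. This is a minimal sketch under assumptions: the test statistic is the usual ratio of the estimated slope to its standard error, the function name is hypothetical, and the critical value 2.306 is the tabulated two-sided 5% value of the t distribution with T − 2 = 8 degrees of freedom.

```python
import math

x = [16, 20, 29, 31, 37, 44, 48, 50, 59, 61]
y = [0.5, 0.8, 0.8, 0.6, 0.6, 0.65, 0.3, 0.5, 0.1, 0.2]

def slope_t_statistic(x, y, beta_null=0.0):
    """t statistic for H0: beta = beta_null in y = alpha + beta * x + u."""
    T = len(x)
    x_bar, y_bar = sum(x) / T, sum(y) / T
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope with T - 2 degrees of freedom.
    se_beta = math.sqrt(rss / (T - 2)) / math.sqrt(s_xx)
    return (beta_hat - beta_null) / se_beta

# Tabulated two-sided 5% critical value for 8 degrees of freedom.
T_CRIT_8DF = 2.306

t_stat = slope_t_statistic(x, y)
reject_h0 = abs(t_stat) > T_CRIT_8DF
print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject_h0}")
```

Rejecting H0 here supports the research hypothesis that age and stock share are related; failing to reject it would not have shown that β is zero.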

Literature
Andrew, C. and Thomas, S. (1995). The overreaction hypothesis and the UK stockmarket. Journal of
Business Finance & Accounting 22, pp.961-973.
Brooks, C. (2008). Introductory econometrics for finance. Second edition. Cambridge University Press
2008.
Canner N., Mankiw N. G. and Weil D. N (1997) An Asset Allocation Puzzle, American Economic Review
87, 181–191.
Debondt, W.F.M. and Thaler, R.H. (1985). Does the stock market overreact? Journal of Finance 40, pp.
793-805.
Debondt, W.F.M. and Thaler, R.H. (1987). Further evidence on investor overreaction and stock market
seasonality? Journal of Finance 42, pp. 557-580.
Fan, J. and Yao, Q. (2002). Nonlinear Time Series. Nonparametric and Parametric Methods. Springer
Series in Statistics.
Lintner J. (1965) The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios
and Capital Budgets, Review of Economics and Statistics 47, 13-37.
Percival, D.B. and Walden A.T. (2000) Wavelet Methods for Time Series Analysis.
Tobin, J. (1958) Liquidity Preference as Behavior Towards Risk, Review of Economic Studies 25, 65-86.
Wooldridge, J.M. (2009). Introductory econometrics: A modern approach. Fourth edition. South-Western
2009.

Overview: Transformation

Basics
Classification
Aggregation
OLAP

Filtering

Regression

Cluster analysis

Clustering
… or how a large number of single data points becomes a well-sorted
data collection

Motivation

Today, the use of the internet generates large amounts of data: “Big Data”
These data are hard to evaluate and to work with
Methods are needed to reduce complexity

Cluster analysis

A cluster (bunch, swarm, heap, group) is a group of single data objects
The single data objects within a cluster are supposed to be “similar”
Cluster analysis (clustering) refers to methods that uncover similarity
structures and form clusters based on them
In contrast to classification, clustering creates new groups;
no prior knowledge is necessary (unsupervised process)

Cluster analysis – the purpose

Practical applications are e.g.
• Marketing (market or customer segmentation)
• Image processing (pattern recognition)
• Biology, genetics (sequence analysis)
• Meteorology (search for periodic patterns)
• Data mining for Big Data (social network analysis, bioinformatics)

How does clustering work?

The primary goal is to create a data structure in which the most similar data
elements are combined
Questions that arise:
• How is similarity defined and measured?
• How are clusters built?
• How many clusters are built?

[Figure: "How many clusters?" The same data set grouped into two, four and six clusters]

Similarity

Similarity
Similarity is the degree of agreement between objects, measured across all of the
characteristics used
Measurable by correlation measures and distance measures

[Figure: two charts of profiles over Category 1 to Category 4 (values 0 to 7); one panel: low distance, low correlation; other panel: higher distance, higher correlation]
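
The difference between the two kinds of measure can be illustrated with a short Python sketch. All four profiles are hypothetical, invented for illustration; the function names are assumptions as well. One pair lies close together but moves in opposite directions, the other lies far apart but has the same shape.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation coefficient between two profiles."""
    n = len(p)
    p_bar, q_bar = sum(p) / n, sum(q) / n
    cov = sum((a - p_bar) * (b - q_bar) for a, b in zip(p, q))
    var_p = sum((a - p_bar) ** 2 for a in p)
    var_q = sum((b - q_bar) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

# Hypothetical profiles over four categories.
close_a, close_b = [3, 4, 3, 4], [4, 3, 4, 3]  # near each other, opposite shape
far_a, far_b = [1, 2, 1, 2], [5, 6, 5, 6]      # far apart, same shape

print(euclidean(close_a, close_b), pearson(close_a, close_b))
print(euclidean(far_a, far_b), pearson(far_a, far_b))
```

Which pair counts as "more similar" therefore depends on whether level (distance) or shape (correlation) matters for the application.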

Similarity of sets
The Jaccard coefficient is a similarity measure for sets:

The number of shared elements (intersection) is divided by the number of
elements of the total set (union).
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Example:

Also useful for the similarity of two vectors V1 and V2 with binary entries:

M11 Number of entries in which both vectors contain a 1
M00 Number of entries in which both vectors contain a 0
M10 Number of entries in which V1 contains a 1 and V2 a 0
M01 Number of entries in which V1 contains a 0 and V2 a 1
$$J(V_1, V_2) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$
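
Both variants of the Jaccard coefficient can be sketched in Python. The function names and the example sets and vectors are hypothetical; returning 1.0 for two empty sets (or two all-zero vectors) is a convention assumed here, not something the definition fixes.

```python
def jaccard_sets(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # assumed convention

def jaccard_binary(v1, v2):
    """Jaccard coefficient of two binary vectors; M00 does not enter."""
    m11 = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 1)
    m10 = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 0)
    m01 = sum(1 for p, q in zip(v1, v2) if p == 0 and q == 1)
    denom = m11 + m10 + m01
    return m11 / denom if denom else 1.0  # assumed convention

# Hypothetical examples.
print(jaccard_sets({1, 2, 3}, {2, 3, 4}))
print(jaccard_binary([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Ignoring M00 means that two objects do not become more similar just because both lack many of the same attributes.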

Distance measures

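Important because clustering algorithms work with the distances
between points and clusters. Suitable for metric scales:

Euclidean distance
$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$

Squared Euclidean distance
$$d(p, q) = \sum_i (p_i - q_i)^2$$

City-Block (Manhattan) distance
$$d(p, q) = \sum_i |p_i - q_i|$$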
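
The three distance measures can be written down directly in Python. This is a minimal sketch using their standard definitions; the function names and the two example points are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Euclidean distance without the square root; emphasizes large differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points.
p, q = (0, 0), (3, 4)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))
```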

Voronoi diagrams

[Figure: two Voronoi diagrams, one for the Euclidean distance and one for the Manhattan metric]

Distances between clusters
Single link: smallest distance between an element in one cluster and one in the
other cluster, dist(Ki, Kj) = min(tip, tjq)

Complete link: largest distance between an element in one cluster and one in the
other cluster, dist(Ki, Kj) = max(tip, tjq)

Average: average distance between the elements of one cluster and those of the other
cluster, dist(Ki, Kj) = avg(tip, tjq)

Centroid: distance between the centers of the two clusters, dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of the two clusters, dist(Ki, Kj) = dist(Mi, Mj)
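
The cluster distances can be sketched in Python for clusters given as lists of points. This is a minimal sketch assuming Euclidean distances between elements; all function names and the two example clusters are hypothetical, and the medoid is taken to be the element with the smallest total distance to the other elements of its cluster.

```python
import math
from itertools import product

def single_link(k1, k2):
    return min(math.dist(p, q) for p, q in product(k1, k2))

def complete_link(k1, k2):
    return max(math.dist(p, q) for p, q in product(k1, k2))

def average_link(k1, k2):
    return sum(math.dist(p, q) for p, q in product(k1, k2)) / (len(k1) * len(k2))

def centroid(k):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(coords) / len(k) for coords in zip(*k))

def medoid(k):
    """Cluster element with the smallest total distance to all elements."""
    return min(k, key=lambda p: sum(math.dist(p, q) for q in k))

def centroid_link(k1, k2):
    return math.dist(centroid(k1), centroid(k2))

def medoid_link(k1, k2):
    return math.dist(medoid(k1), medoid(k2))

# Hypothetical clusters.
k1 = [(0, 0), (0, 2)]
k2 = [(3, 0), (3, 2)]
print(single_link(k1, k2), complete_link(k1, k2), average_link(k1, k2), centroid_link(k1, k2))
```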

Differentiation of cluster algorithms

Different approaches with advantages and disadvantages, depending e.g.
on data size, goal and structure …
There is never only one algorithm

Cluster algorithms
  Graph theory
  Hierarchical
    divisive
    agglomerative
  Partitioning
    Exchange processes
    Minimum distance processes
  Optimizing

Hierarchical processes

Start with one big cluster or with many single-object clusters and refine the
clustering step by step
Stop when a stop criterion is fulfilled (e.g. all data points are combined or
separated)
Differentiation between divisive (top-down) and agglomerative (bottom-up) methods

+ Advantages: flexibility, number of clusters flexible

- Disadvantages: runtime complexity, effort needed to analyze the results,
clusters cannot be changed once they are built

Example for algorithm AGNES

Agglomerative method (AGNES = agglomerative nesting)


Survey with 7 participants about brand recognition (V1) and buying
behavior (V2) measured on a scale from 0 to 10

Distance matrix

Distance matrix with Euclidean distances

Observation   A       B       C       D       E       F       G
A             ---
B             3.162   ---
C             5.099   2.000   ---
D             5.099   2.828   2.000   ---
E             5.000   2.236   2.236   4.123   ---
F             6.403   3.606   3.000   5.000   1.414   ---
G             3.606   2.236   3.606   5.000   2.000   3.162   ---

Agglomerative building of clusters
Agglomerative process and cluster solution
(OSM = overall similarity measure, the average within-cluster distance)

Step      Minimum distance   Observation pair   Cluster membership       Number of clusters   OSM
Initial   -                  -                  (A)(B)(C)(D)(E)(F)(G)    7                    0
1         1.414              E-F                (A)(B)(C)(D)(E-F)(G)     6                    1.414
2         2.000              E-G                (A)(B)(C)(D)(E-F-G)      5                    2.192
3         2.000              C-D                (A)(B)(C-D)(E-F-G)       4                    2.144
4         2.000              B-C                (A)(B-C-D)(E-F-G)        3                    2.234
5         2.236              B-E                (A)(B-C-D-E-F-G)         2                    2.896
6         3.162              A-B                (A-B-C-D-E-F-G)          1                    3.420

Clusters are combined successively until all observations are connected
Afterwards, analyze which cluster size and partition make sense
Here you should stop after step 4 because the OSM increases significantly
at step 5
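
The merge sequence can be retraced from the distance matrix with a short Python sketch. It rests on assumptions: clusters are compared by single linkage (smallest distance between any two members), ties between equally close pairs are broken by list order, and the names `rows`, `dist`, `single_link` and `agnes` are hypothetical. The distances are those of the Euclidean distance matrix for the observations A to G.

```python
from itertools import combinations

labels = "ABCDEFG"
# Lower triangle of the distance matrix, row by row.
rows = {
    "B": [3.162],
    "C": [5.099, 2.000],
    "D": [5.099, 2.828, 2.000],
    "E": [5.000, 2.236, 2.236, 4.123],
    "F": [6.403, 3.606, 3.000, 5.000, 1.414],
    "G": [3.606, 2.236, 3.606, 5.000, 2.000, 3.162],
}
dist = {}
for row, values in rows.items():
    for col, d in zip(labels, values):
        dist[frozenset((row, col))] = d

def single_link(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

def agnes(items):
    """Merge the two closest clusters until only one cluster is left."""
    clusters = [frozenset([i]) for i in items]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        d = single_link(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((d, list(clusters)))
    return history

history = agnes(labels)
for step, (d, clusters) in enumerate(history, start=1):
    print(step, d, sorted("".join(sorted(c)) for c in clusters))
```

Because three pairs share the distance 2.000, the order of steps 2 to 4 depends on the tie-breaking, but the partition after step 4 does not.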

Graphic Visualization

Stopping after step 4 results in the following 3 clusters: (A), (B-C-D), (E-F-G)

Visualization - Dendrogram

[Figure: original data set and the dendrogram after AGNES]

Visualization - Dendrogram

Shows when which clusters are combined until they form one
complete cluster in the end
Outliers are recognizable
Distances are given

(Centroid-based) partitioning method

In general: iterative shifting of cluster centers while minimizing an
error function
Partitioning the data set D with n objects into a set of k clusters so that
the sum of the squared Euclidean distances is minimized
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} (d(p, c_i))^2$$
Cluster memberships can always change

+ Advantages: clusters flexible, efficient O(tkn) (t iterations, k clusters, n objects)

- Disadvantages: number of clusters fixed, not suitable for non-convex
clusters, sensitive to outliers

k-Means Algorithm

Partitioning method
First, the number of center points (k) is set and the centers are positioned
in the data set, randomly or selectively
Afterwards, each data point is allocated to the closest center (see
Voronoi diagrams)
Then the center points are repositioned (calculation of the mean of each cluster)
Variations:
• k-Means++: faster, better selection of the initial cluster centers
• k-Medoids: update according to the medoid rule (less sensitive to outliers)
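
The steps above can be sketched in plain Python. This is a minimal sketch, not a production implementation: the start centers are passed in explicitly instead of being drawn at random, an empty cluster simply keeps its old center, and the function name and the six data points are hypothetical.

```python
import math

def k_means(points, centers, max_iter=100):
    """Alternate assignment and mean update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # convergence: no center moved
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two clearly separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Different start centers can lead to different final partitions, which is why the choice of k and of the starting positions matters.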

k-Means Algorithm

Randomly Initialize Clusters

k-Means Algorithm

Assign data points to nearest clusters

k-Means Algorithm

Recalculate Clusters

k-Means Algorithm

Recalculate Clusters

k-Means Algorithm

Repeat

k-Means Algorithm

Repeat … until convergence

k-Means Algorithm

Initial situation with the original data set

Weaknesses of k-means

Weaknesses in k-means

[Figure: 200 and 600 points in 2 regions; panels: desired result, result of k-means clustering]

Quality criteria
3 different kinds of quality measurement:
External: compare clusters with external expert knowledge / information sources
Internal: complexity, variance criteria, silhouette coefficient; how large is the distance between the
clusters, are they well separated?
Relative: comparison of different cluster methods, different numbers of clusters, different start
parameters

Possible measures:

Within Cluster Sum of Squares (see above)

Silhouette
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
a(i): average distance from point i to the other points in its own cluster
b(i): minimal average distance from point i to the points of another cluster (the nearest neighboring cluster)
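
The silhouette value of a single point can be sketched in Python. This is a minimal sketch of the usual definition s(i) = (b(i) − a(i)) / max(a(i), b(i)) with Euclidean distances; the function name and the one-dimensional example clusters are hypothetical, and returning 0 for a point that is alone in its cluster is an assumed convention.

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of `point`, which must be a member of `own_cluster`."""
    if len(own_cluster) < 2:
        return 0.0  # assumed convention for single-point clusters
    # a: average distance to the other points of the own cluster
    # (the distance of the point to itself is 0 and does not change the sum).
    a = sum(math.dist(point, q) for q in own_cluster) / (len(own_cluster) - 1)
    # b: smallest average distance to the points of any other cluster.
    b = min(sum(math.dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

# Hypothetical one-dimensional clusters.
cluster_1 = [(0,), (1,)]
cluster_2 = [(10,), (11,)]
s = silhouette((0,), cluster_1, [cluster_2])
print(round(s, 3))
```

Values close to 1 indicate a point that sits well inside its cluster; values near 0 or below indicate a point between clusters or in the wrong one.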

Conclusion
Clustering groups single observations according to their similarity
Clustering has applications ranging from data mining in marketing to genetics
There are different types of clustering algorithms. The most important are hierarchical
methods (AGNES, Single-Linkage) and partitioning methods (k-Means)
A problem with the k-means algorithm is the specification of the clusters (number and
starting positions)
The quality of clustering results and algorithms is not trivial to evaluate and
should be assessed with various criteria
