
Basic Data Analysis

Lecture 1, 2023: Intro, 1

A class in basic concepts and methods, so that students who have passed it should be able to produce analogous new methods for new data types and/or issues.

Versus, say, learning systems like “scikit-learn” that can be used with no knowledge of the methods.

BDA 2023 Boris Mirkin 1


Basic Data Analysis
Instructor

Boris Mirkin / Борис Григорьевич Миркин (myself)

Professor, Data Analysis and AI, NRU HSE, Moscow,


bmirkin@hse.ru, +7(963)-7234021

Professor Emeritus, Computer Science, Birkbeck UL,


London UK, boris.mirkin@bbk.ac.uk



Basic Data Analysis
Instructor
Boris Mirkin PhD (Abstract Automata), Saratov, Russia
DSc (Data Science, 1990), Moscow, Russia
Prof. University of London (Computer Science, Emeritus from 2010) UK
Experience:
- Managed data analysis research projects: Data Analysis (1973-present), Sociology (1974-82), Genomics (1975-2005), Organization Structures (1973-1980), etc.
- Supervised: PhD (20+), MSc (60+), BSc (100+) research projects
- Traveled (1991-2011): France (Telecom, OECD, 1991-93), USA (Rutgers, NJ, 1993-98), Germany (DKFZ-Cancer, 1996-99), UK (Birkbeck U London, 2000-10)
- Published (Google Scholar Hirsch index = 43): 80+ journal papers, 10 monographs, 3 textbooks
Main text for the class
Boris Mirkin,
Core Data Analysis
Springer, UTiCS Series,
2019, 2nd Ed., 527 p.

(1st Ed., 2011, 390 p.)


Boris Mirkin,
Core Concepts in Data Analysis, 2011
“To single out just one of the
text’s many successes: I doubt
readers will ever encounter
again such a detailed and
excellent treatment of
correlation concepts.”
Computing Reviews of the ACM,
June 27, 2011



Lecture 1 Contents
• General
◦ Administration
◦ Data science, data analysis, and machine learning
◦ Three examples of data analysis: two successful and one not
◦ Goal and contents of the class
• Data
◦ Data and metadata: Iris data table and its analysis issues
◦ Developing data table from a dataset: Titanic
◦ Two formalizations of the concept of feature: vector and random variable
Lecture 1 Contents
• Administration
• Brief history of Data Science
• Three examples of data analysis: two successful and one not
• Goal and contents of the class
• Data and metadata: Iris dataset and problems of its analysis
• Two formalizations of the concept of feature: vector and random variable
• Feature scales: quantitative, ranking, nominal, and binary
Administration: Lectures and Labs
• Two modules (all of the Fall 2023)
• In-class Exam Paper (E) at the end of December (preceded by Revision lecture(s))
• Individual home-work project report (H):
◦ An assignment at the end of each lecture
◦ A report, about 10 December, over
• The home work may be done by a team of up to three individuals.


Administration: Assessment
• The final mark:

  M = 0.7E + 0.3H (*)
  where
  H – Homework project mark
  E – Exam mark
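Formula (*) can be sketched as a one-line function. This is only an illustration: the function name `final_mark` and the sample marks are hypothetical, and both marks are assumed to be on the same 10-point scale.

```python
def final_mark(exam: float, homework: float) -> float:
    """Weighted final mark M = 0.7E + 0.3H, as in formula (*)."""
    return 0.7 * exam + 0.3 * homework


# Hypothetical marks, both assumed to be on the 10-point scale
m = final_mark(exam=8.0, homework=10.0)
print(f"Final mark: {m:.1f}")
```

Because the exam carries weight 0.7, one extra point in the exam moves the final mark by more than twice as much as one extra point in the homework project.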


Homework project
• Individual home-work (HW) report
• By a team (1, 2, or 3 individuals)
• A dataset of 80 or more objects and 5 or more features; it must be approved by me by 20 October (if not, a penalty of 1 point, out of 10, applies)
• A few home assignments based on lectures, each including:
◦ code,
◦ computational application of a method to the data, and
◦ comments/interpretation of the result(s).


Generic Home-Work project report
(in parentheses, share of the mark, %)
• A1: Shaping report including Data description (10%)
• A2: K-means clustering (10%)
• A3: Cluster Interpretation (15%)
• A4: Bootstrap for comparing averages (15%)
• A5: Contingency table analysis (20%)
• A6: PCA: Hidden factor & data visualization (15%)
• A7: 2D Regression and correlation (15%)


General:

Data Science,
Data Analysis,
Machine Learning

Is there any difference among them?



General:

Data Science,
Data Analysis,
Machine Learning

Is there any difference?

In my view, there is.


Data Analysis ≠ Machine Learning
[Data science is not well shaped yet]
• Methods are similar but perspectives do differ:
• ML: Providing computers with rules and methods for learning from data to solve a problem
◦ No need for interpretation
◦ Quality assessment of a method: over hold-out samples
◦ Truth is in the model
• DA: Using data for enhancement of theoretical thinking (concepts, relations) in the data domain
◦ Special care for interpretation/interpretability
◦ Quality assessment of a method: over the whole data
◦ Truth is in the data
The structure of mathematical thinking in Data science

[Diagram: Data science, with components labelled Machine learning, Statistics, Text analysis, Data]


Why not Neural Networks only? 1
• NNs are based on internal link weights: not interpretable.
- So what?
- Tell this to a judge or doctor. (New development: towards Explainable ML)

• Each NN solves a specific problem (not universal), which is a bit odd for intelligent systems

• Big companies (Amazon, Facebook, etc.) spend huge money, advertise successes such as BERT or ChatGPT, and are silent on many a drawback (arbitrary architectures, overfitting issues, lack of structural models for AI, etc.)


Why not NN only? 2
• Using artificial neurons for building AI is similar to drawing schemas of buildings by using bricks only.

• Interpretable methods can work better than NN/CNN (e.g. a method in community detection by my former PhD student Dr Soroosh Shalileh (2021)).

• Current DA jobs require various skills; the job market clearly recognizes ML jobs and DS jobs as different

• Each MS student will have at least 2 or 3 classes with deep learning further on.
KD Nuggets Poll 2019



KD Nuggets Poll 2019 (just before the Covid-19 scare)

Absent from that are:

- Data Analysis
- GIS
- Data mining
- Fuzzy systems
- ………..
KD Nuggets Poll 2019
Absent from that: WHY?
My guess:
- Data Analysis: absent from the USA, present in France, Russia, Germany
- GIS: no industrial apps
- Data mining: outdated
- Fuzzy systems: not in vogue among those people (?)
- ………..
Community views of DA and ML

Russian-language queries:
Query                                   Google, pages   HH.ru, jobs   Superjob.ru, jobs
Machine learning (Машинное обучение)        2 660 000         2 474                 859
Data analysis (Анализ данных)              70 100 000        13 605               1 190

English-language queries:
Query               Google, pages   LinkedIn, jobs   Jooble.org, jobs
Machine learning      970 000 000           58 842              6 313
Data analysis       3 120 000 000          674 932            297 472

DA by far surpasses ML
Queries made 2/09/2023
Data analysis



Two examples of successful data analysis

• Pluto: a planet?
• Planetary motion: Johannes Kepler’s 3rd law

One example of “unsuccessful” data analysis

• Risk factors of respiratory diseases


Planets: Is any of them a planet indeed?

Example of a good cluster structure: W. Jevons (1835-1882), updated in Mirkin 1996

[Figure: two clusters of planets, Cl. 1 and Cl. 2, with Pluto marked “???”]

Pluto doesn’t fit in the two clusters of planets: it started a new cluster in September 2006.
Planetary motion:
A very successful example of small data analysis by J. Kepler (1571-1630)


Double success 1

A history of laws for planetary motion

• Ptolemy (c. 150 AD):
• Sun and planets “circle” Earth
• Does not match data well


Double success 2

A history of laws for planetary motion

• Copernicus (c. 1540):
• Planets circle Sun
• Does not match data well either
Double success 3
Laws for planetary motion: J. Kepler (c. 1605):
• 0th Law: All planets move in the same plane
• 1st Law: Planets revolve around the Sun in ellipses (ovals)
• 2nd Law: Speed changes: the further away from the Sun, the slower (equal sectors in any time unit)


Double success 4: Kepler’s 3rd Law
Kepler’s thinking after 1605: there should be a relation between speed/period and distance; which one?

Planet     Period (years)   Distance (average, relative to that of Earth)
Mercury         0.241             0.39
Venus           0.615             0.72
Earth           1.00              1.00
Mars            1.88              1.52
Jupiter        11.8               5.20
Saturn         29.5               9.54
Uranus         84.0              19.18
Neptune       165                30.06
Pluto         248                39.44
Double success 5
Kepler’s 3rd Law:

Is there any relation between speed/period and distance?
Points on the plane “Distance-Period” fit no line…
Example of Small Data Analysis
Double success 6
Kepler’s 3rd Law (1619):

[J. Napier invented the logarithm (1614)]

$\log P = \tfrac{3}{2} \log D$

$P^2 = D^3$
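The log-log relation can be checked on the Period/Distance table from the "Double success 4" slide. The sketch below fits a straight line to (log D, log P) by ordinary least squares, using the standard library only. The helper name `fit_line` is hypothetical, and the fit is merely a modern way of reproducing the check, not a claim about how Kepler worked.

```python
import math

# (period in years, average distance relative to Earth's), from the slide table
planets = {
    "Mercury": (0.241, 0.39),
    "Venus": (0.615, 0.72),
    "Earth": (1.00, 1.00),
    "Mars": (1.88, 1.52),
    "Jupiter": (11.8, 5.20),
    "Saturn": (29.5, 9.54),
    "Uranus": (84.0, 19.18),
    "Neptune": (165.0, 30.06),
    "Pluto": (248.0, 39.44),
}


def fit_line(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


log_d = [math.log(d) for _, d in planets.values()]
log_p = [math.log(p) for p, _ in planets.values()]
slope, intercept = fit_line(log_d, log_p)
print(f"log P = {slope:.3f} * log D + {intercept:.3f}")
```

A slope close to 3/2 with an intercept close to 0 corresponds to $P^2 = D^3$, whereas the raw (D, P) points fit no straight line.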
Double success 7

Kepler’s three laws: What is so grand?

Substantiated theoretically by R. Hooke (1635-1703) and I. Newton (1642-1727):
UNIVERSAL GRAVITATION LAW

Equation above, a cornerstone of science
An example of unsuccessful data analysis

• From my own data analysis experience

• Risk factors of respiratory diseases in Akademgorodok, Novosibirsk, Russia (Rostovtsev, Mirkin, Shanin, 1981)


[Diagram: Respiratory diseases: Ear-Nose, Lungs, Bronchi]

Rostovtsev, Mirkin, Shanin (1981, unpublished): Investigation of local respiratory diseases and their risk factors

~50 000 respondents: 14 hierarchical clusters
Rostovtsev, Mirkin, Shanin (1981, unpublished), 1: Respiratory diseases survey

Risk factors suggested according to the views of that time:

Smoking, Drinking
Rostovtsev, Mirkin, Shanin (1981
unpublished), 2: Respiratory diseases
survey
Risk factors according to the data:

The disease in family

Poor housing



Rostovtsev, Mirkin, Shanin (1981
unpublished), 3: Respiratory diseases
Risk factors according to data :
- The disease in family
- Poor housing

Smoking/Drinking:
Statistically independent, not risk factors

These conclusions, now commonplace, were rejected in 1981 as contradicting “firmly established principles” (like those of J. Snow in 1854)
My class's contents



My class: Principled view of DA
• My view:
◦ Main Goals of DA → Main Problems of DA → Main Methods of DA:
This is what I teach (see Mirkin B. “Core Data Analysis”, Springer, 2019), a few main methods including PCA

• Common view differs:
◦ See “Data Science from Scratch” by J. Grus (2019), Russian translation 2021: 12 popular methods, no PCA
Textbook 2019 (1st ed. 2011)
Boris Mirkin,

Core Data Analysis,

Springer,
UTiCS Series,
2019,
527 p.



B.G. Mirkin, “Введение в анализ данных” (Introduction to Data Analysis), URAIT, Moscow, 2014 (reprinted annually since).

Currently under my updating revision, which should be completed by the end of October 2023.


38 video clips of me teaching BDA, about 6-8 minutes each, are at Yandex Disk:

https://yadi.sk/d/s0jjzuBTMGDoxQ


Andrew Ng (208K citations) 2022


What is Basic Data Analysis
• Two main pathways for Knowledge Enhancement
◦ Summarization: Developing Concepts
◦ Correlation: Deriving Statements of relation between concepts
• Two major formats (both concepts and statements):
◦ Quantitative
◦ Categorical

Four main sections in my class: SumQ, SumC, CorQ, CorC


BDA: Summarization section

SUMMARIZATION
CATEGORICAL: Clustering
- Partition: K-means et al.
- Interpretation: nominal scales, quantitative scales
- Comparing clusters: Bootstrap
- Hierarchical clustering
QUANTITATIVE: PCA
- PCA as SVD
- Correspondence Analysis
- Latent Semantic Indexing


BDA: Correlation section

CORRELATION
CATEGORICAL: Classifiers
- Discriminant analysis
- SVM
- Classification tree
- Naïve Bayes
QUANTITATIVE: Linear regression
- Correlation coefficient
- Neural Network
- Error back propagation


Data and Metadata



What is DATA? Data & Metadata
• Table
• Signal
• Text
• Sequence
• Map
• Image
• Video
• …….

This class concentrates on data tables as the generic, simplest, and best explored object


Iris, 150x4 table, a most popular dataset
w1 Sepal length    w3 Petal length
w2 Sepal width     w4 Petal width

Data (matrix) & Metadata (margins)

#     w1    w2    w3    w4
1     5.1   3.5   1.4   0.3
2     4.4   3.2   1.3   0.2
3     4.4   3.0   1.3   0.2
4     5.0   3.5   1.6   0.6
5     5.1   3.8   1.6   0.2
6     4.9   3.1   1.5   0.2
7     5.0   3.2   1.2   0.2
8     4.6   3.2   1.4   0.2
9     5.0   3.3   1.4   0.2
…
150   6.5   3.2   5.1   2.0

What type of analysis to do with that?
A typical data table: Anderson–Fisher’s Iris
Iris flower: Sepal, Petal
150×4 data table of three taxa:
Taxon
1-50 Iris setosa (diploid)
51-100 Iris versicolor (tetraploid)
101-150 Iris virginica (hexaploid)
Metadata: features
w1 Sepal length
w2 Sepal width
w3 Petal length
w4 Petal width
and T Taxa
Three Iris taxa:

Setosa Virginica Versicolor



Some data analysis problems at Iris, 1
- Visualise data: map similar specimens at points near each other; dissimilar specimens, at far away points
- Build a predictor of sepal sizes from the petal sizes (say, to lessen the burden of measurement)
Some data analysis problems at Iris, 2
- Build a predictor of taxa (classifier) based on the petal/sepal sizes
- Check how relevant features w1-w4 are (for example, by making clusters and comparing them to the taxa)


Iris dataset structure: 2D visualized with SVD
[Figure: 2D visualization of the Iris dataset obtained with SVD]


Developing data table from a data set, 1
Table 1. A fragment of the Titanic (ship sank in 1912) dataset.
[At: S – Southampton, England; C – Cherbourg, France; Q – Queenstown, Ireland;
SS – Siblings/Spouses; PCh – Parents/Children; Cl. – Class]

Is this a data table? No. Why?

№   Surv.  Cl.  Name                     Sex  Age  SS  PCh  Price  At
1   0      3    Braund, Mr. Owen         M    22   1   0    7.25   S
2   1      1    Cumings, Mrs. John       F    38   1   0    71.28  C
3   1      3    Heikkinen, Miss. Lai     F    26   0   0    7.92   S
4   1      1    Futrelle, Mrs. Jacques   F    35   1   0    53.1   S
5   0      3    Allen, Mr. William       M    35   0   0    8.05   S
6   0      3    Moran, Mr. James         M         0   0    8.46   Q
7   1      3    Johnson, Mrs. Oscar      F    27   0   2    11.13  S


Developing data table from a data set, 2
What is wrong with this dataset?
- Missing entry in “Age” column?
- String values? In “Sex”, “At”, “Name”?
- “Name” containing commas, dots and spaces?
Developing data table from a data set, 3
- Missing entry in “Age” column?
Nothing wrong with this. A typical situation. No typical solution, though, because there is no general data model. I am going to give you some advice later on.
Developing data table from a data set, 4
- String values? In “Sex”, “At”, “Name”?
Nothing wrong with this either. Categorical features frequently have string values. Both “Sex” and “At” are nominal features: each partitions the entity set into non-overlapping parts, one per feature value. Will be treated further on.
Developing data table from a data set, 5
- “Name” containing commas, dots, and spaces?
Nothing wrong with commas, dots, and spaces. Yet “Name” is not a
nominal feature. Its values are individual and not related to other
entities, in contrast to, say, “Sex” values, M and F. “Name” is
metadata, not a feature.
Developing and quantifying data table from data, 6
- Remove “Name”
- Add together Family = SS + PCh
- Unify “Price”
- Encode “Sex” and “At” as dummy (0/1) columns, one per category

№   Surv.  Cl.  F  M  Age  Fam  Price  At S  At C  At Q
1   0      3    0  1  22   1    8      1     0     0
2   1      1    1  0  38   1    53     0     1     0
3   1      3    1  0  26   0    8      1     0     0
4   1      1    1  0  35   1    53     1     0     0
5   0      3    0  1  35   0    8      1     0     0
6   0      3    0  1  28   0    8      0     0     1
7   1      3    1  0  27   2    8      1     0     0

This is a data table!
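Three of the steps listed above (removing “Name”, adding Family = SS + PCh, and dummy coding of “Sex” and “At”) can be sketched in plain Python on the Table 1 fragment. The names `raw` and `to_data_row` are hypothetical. The slide does not state the rule used to unify “Price” or to fill in the missing “Age”, so the sketch leaves both as they are in Table 1.

```python
# Rows of the Titanic fragment: (No, Surv, Cl, Name, Sex, Age, SS, PCh, Price, At)
raw = [
    (1, 0, 3, "Braund, Mr. Owen", "M", 22, 1, 0, 7.25, "S"),
    (2, 1, 1, "Cumings, Mrs. John", "F", 38, 1, 0, 71.28, "C"),
    (3, 1, 3, "Heikkinen, Miss. Lai", "F", 26, 0, 0, 7.92, "S"),
    (4, 1, 1, "Futrelle, Mrs. Jacques", "F", 35, 1, 0, 53.1, "S"),
    (5, 0, 3, "Allen, Mr. William", "M", 35, 0, 0, 8.05, "S"),
    (6, 0, 3, "Moran, Mr. James", "M", None, 0, 0, 8.46, "Q"),  # Age is missing
    (7, 1, 3, "Johnson, Mrs. Oscar", "F", 27, 0, 2, 11.13, "S"),
]


def to_data_row(record):
    """Drop Name, merge SS and PCh into Fam, dummy-code Sex and At."""
    no, surv, cl, _name, sex, age, ss, pch, price, at = record
    row = {"No": no, "Surv": surv, "Cl": cl}
    for category in ("F", "M"):          # one 0/1 column per Sex category
        row[category] = int(sex == category)
    row["Age"] = age                     # missing value kept as None (assumption)
    row["Fam"] = ss + pch                # Family = SS + PCh
    row["Price"] = price                 # left as is: the unifying rule is not stated
    for port in ("S", "C", "Q"):         # one 0/1 column per embarkation port
        row["At " + port] = int(at == port)
    return row


table = [to_data_row(r) for r in raw]
for row in table:
    print(row)
```

Each dummy column answers a yes/no question about the passenger, so every entry of the resulting table except the missing “Age” is a number.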
All 2D data tables are similar:

a 2D data array [data] and information on features (columns) and objects (rows) [metadata]


Modeling Feature in Data Analysis



Mathematical Model of Feature
• Dual perspective (like the “photon” in quantum physics, both particle and wave)
◦ DA: Mapping I → R
◦ ST: Density function f(x)


Iris, features w1, w2, w3, w4

Consider feature w1. How to model it? Data Science: only the entries in w1 matter!
What is feature w1? According to the Data Analysis view, just the contents of column w1, specimen sepal length:
• Index 1 through 9
5.1 4.4 4.4 5.0 5.1 4.9 5.0 4.6 5.0
. . . . . . . . . . . . . . . . . . . . . . . . .
• Index 142 through 150
6.7 6.3 6.5 6.5 7.3 6.7 5.6 6.4 6.5
What is this as a mathematical object?
What are the contents of column w1 as a mathematical object?

Two different views co-exist (like the photon, unit of light, in quantum physics: both a particle and a wave)
Two different views of the w1 feature as a mathematical object:

A) Vector of 150x1 dimension
B) 150-strong sample from a random variable
A) Feature as vector, 1:
Math: Given a set I of object indices or names, a feature is a mapping f: I → R, where R is the set of all reals; that is, $f = (f_i),\ i \in I$, an |I|-dimensional vector
A) Feature as vector, 2:
Math: $f = (f_i),\ i \in I$, an |I|-dimensional vector, N = |I|
• Two main characteristics:
• Center: Mean
$\bar{f} = \frac{1}{N}\sum_{i=1}^{N} f_i$
• Spread: Variance
$s_f^2 = \frac{1}{N}\sum_{i=1}^{N} (f_i - \bar{f})^2$
Standard deviation $s_f$ (square root of $s_f^2$)
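The two characteristics can be computed directly from their definitions. The sketch below uses only the 18 entries of w1 shown on the slides (indices 1-9 and 142-150), not the full 150-entry column, so the printed numbers describe this fragment only. The function names are hypothetical, and the `statistics` module is used as a cross-check.

```python
import math
import statistics

# The 18 entries of w1 shown on the slides (indices 1-9 and 142-150)
w1_fragment = [5.1, 4.4, 4.4, 5.0, 5.1, 4.9, 5.0, 4.6, 5.0,
               6.7, 6.3, 6.5, 6.5, 7.3, 6.7, 5.6, 6.4, 6.5]


def mean(f):
    """Center: the sum of the entries divided by N."""
    return sum(f) / len(f)


def variance(f):
    """Spread: the average squared deviation from the mean (division by N)."""
    m = mean(f)
    return sum((x - m) ** 2 for x in f) / len(f)


m = mean(w1_fragment)
s2 = variance(w1_fragment)
s = math.sqrt(s2)
print(f"mean = {m:.3f}, variance = {s2:.3f}, standard deviation = {s:.3f}")

# Cross-check: statistics.pvariance also divides by N (population variance)
print(math.isclose(s2, statistics.pvariance(w1_fragment)))
```

Note the division by N rather than N - 1: this follows the formula on the slide, which treats the feature as a given vector rather than as a sample used to estimate a population variance.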
A) Feature as vector, 3:
Pro: a) Intuitive;
b) Objects are explicit (rows)
c) Linear algebra applies

Con: d) Empirical (depends on I, cannot be extended to the universe)
B) Feature as random variable, 1:

Histogram: the range is divided into n (=10) bins; the numbers of objects falling in the bins are presented by bar heights.
B) Feature as random variable, 2:

[Figure: (a) histogram; (b) relative histogram]
Histogram: (a) the range is divided into n (=20) bins; the numbers of objects falling in the bins are presented by bars.
Relative histogram: (b) bars express proportions of objects in the bins (they sum to 1).
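Both kinds of histogram can be computed in a few lines. The sketch below is a minimal illustration with hypothetical function names. It assumes equal-width bins over [min, max], places the maximum value in the last bin, and requires that the values are not all equal. It is applied to the 18 entries of w1 shown on the slides, with a smaller number of bins than on the slides because the fragment is small.

```python
def histogram(values, n_bins):
    """Counts of values falling in n_bins equal-width bins over [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins          # assumes high > low
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - low) / width), n_bins - 1)  # the maximum goes to the last bin
        counts[k] += 1
    return counts


def relative_histogram(values, n_bins):
    """Proportions of values in the bins; they sum to 1."""
    return [c / len(values) for c in histogram(values, n_bins)]


# The 18 entries of w1 shown on the slides (indices 1-9 and 142-150)
w1_fragment = [5.1, 4.4, 4.4, 5.0, 5.1, 4.9, 5.0, 4.6, 5.0,
               6.7, 6.3, 6.5, 6.5, 7.3, 6.7, 5.6, 6.4, 6.5]

counts = histogram(w1_fragment, n_bins=5)
shares = relative_histogram(w1_fragment, n_bins=5)
print(counts)
print([round(p, 3) for p in shares])
```

The relative histogram does not depend on the number of objects N, which is what allows it to be carried over to the density function as N and the number of bins grow.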
B) Feature as random variable, 3:

Relative histogram: bars express proportions of objects in the bins.
Density function, an abstraction of the histogram at N and n tending to infinity: a measurable non-negative function (curve) f(x) such that $\int_{-\infty}^{+\infty} f(x)\,dx = 1$.
B) Feature as random variable, 4:

[Figure: density curve f(x) over the x axis, with points a and b marked]

Density function, an abstraction of the relative histogram at N, n tending to infinity: a “measurable” non-negative function f(x) such that
$\int_{-\infty}^{+\infty} f(x)\,dx = 1$
$\int_{a}^{b} f(x)\,dx$ = probability of the variable to fall in [a, b]
B) Feature as random variable, 5:

Density function f(x)
Two characteristics:
Mean
$E(f) = \mu = \int_{-\infty}^{+\infty} x f(x)\,dx$
Variance
$\mathrm{Var}(f) = E([f - \mu]^2)$
Standard deviation: square root of Var(f)
B) Feature as random variable, 6:

Math: Random variable = Density function

Pro: (a) Universal, does not depend on set I
(b) Probability theory can be used

Con: (c) Objects are implicit
Popular density functions:

• Gaussian/Normal N(0,1)
$f(x) = \exp(-x^2/2)/\sqrt{2\pi}$
• Power law / Hyperbolic law
$f(x) = c x^{-\alpha}$
• Uniform distribution
f(x) = const on [a, b]


B) Popular density functions: Gaussian N(0,1)

$f(x) = \exp(-x^2/2)/\sqrt{2\pi}$


B) Popular density functions: general Gaussian N(μ, σ)
• $f(x) = \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) / \sqrt{2\pi\sigma^2}$
• μ is the mean
• σ² is the variance


B) General Gaussian N(μ, σ)
• Bell curve (symmetric over μ)
• σ² is the variance, σ is the standard deviation (same scale)
• 2σ rule, 3σ rule


B) General Gaussian N(μ, σ)
• Bell curve (symmetric over μ)
• Central interval to account for 0.95 = 95% of the area:
• [μ − 1.96σ, μ + 1.96σ]
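The 95% figure, as well as the 2σ and 3σ rules, can be verified with the error function from the standard library: for a Gaussian variable, the probability of falling within k standard deviations of the mean equals erf(k/√2), whatever μ and σ are. The function name below is hypothetical.

```python
import math


def central_probability(k: float) -> float:
    """P(mu - k*sigma <= X <= mu + k*sigma) for a Gaussian X; independent of mu, sigma."""
    return math.erf(k / math.sqrt(2.0))


for k in (1.96, 2.0, 3.0):
    print(f"k = {k}: {central_probability(k):.4f}")
```

The value for k = 1.96 is about 0.95, and the values for k = 2 and k = 3 are about 0.954 and 0.997, which is what the 2σ and 3σ rules refer to.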


B) Popular density functions: power law
• $f(x) = c x^{-\alpha}$
• α: the steepness
• Scale-free (why? Can you tell?)
• Mean exists at α > 2; variance, at α > 3

Matthew effect (see next slide)
B) Power law: Matthew effect
• For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath. Gospel of Matthew 25:29

Examples:
Wealth
Quotations
Web site popularity


B) Popular density functions: uniform distribution over [a, b] interval

Why is the density equal to 1/(b-a)?

Mean = (a+b)/2

Var = (b-a)^2/12
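The three statements about the uniform density can be checked numerically from the definitions of the mean and variance of a density function. The sketch below uses a simple midpoint rule. The endpoints a = 2 and b = 5, the number of subintervals, and the helper name `integrate` are arbitrary choices made for illustration.

```python
def integrate(g, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


a, b = 2.0, 5.0                      # arbitrary interval, for illustration only


def density(x):
    """Constant density on [a, b]; the constant 1/(b-a) makes the area equal to 1."""
    return 1.0 / (b - a)


area = integrate(density, a, b)
mu = integrate(lambda x: x * density(x), a, b)
var = integrate(lambda x: (x - mu) ** 2 * density(x), a, b)

print(f"area = {area:.6f}")
print(f"mean = {mu:.6f}  vs (a+b)/2 = {(a + b) / 2:.6f}")
print(f"var  = {var:.6f}  vs (b-a)^2/12 = {(b - a) ** 2 / 12:.6f}")
```

The area check is the answer to the question on the slide: a constant c on [a, b] integrates to c(b-a), so the requirement that the total area be 1 forces c = 1/(b-a).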


Mechanisms:
• Gaussian/Normal
◦ Sum of many “small” independent random variables, Central Limit Theorem

• Power law / Hyperbolic law
◦ Success generates success

• Uniform distribution
◦ Nothing is known except for the interval [a, b]
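The Gaussian mechanism can be illustrated by a small simulation: sums of independent uniform variables, once centered and scaled, behave approximately like N(0,1). This is a sketch with arbitrary settings (30 terms per sum, 20 000 sums, seed 0) and a hypothetical function name. It uses the mean 1/2 and variance 1/12 of the uniform distribution on [0, 1].

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so that the run is reproducible


def standardized_sum(n_terms: int) -> float:
    """Sum of n_terms uniform [0, 1] variables, centered and scaled to unit variance."""
    total = sum(random.random() for _ in range(n_terms))
    # Each term has mean 1/2 and variance 1/12, so the sum has
    # mean n_terms/2 and variance n_terms/12.
    return (total - n_terms / 2) / math.sqrt(n_terms / 12)


sample = [standardized_sum(30) for _ in range(20_000)]
share_within = sum(abs(z) <= 1.96 for z in sample) / len(sample)

print(f"mean     = {statistics.fmean(sample):.3f}")
print(f"variance = {statistics.pvariance(sample):.3f}")
print(f"share within [-1.96, 1.96] = {share_within:.3f}")
```

None of the individual terms is Gaussian, yet the share of sums within [-1.96, 1.96] comes out close to 0.95, as expected for a Gaussian variable.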


Lecture 1: Concepts learnt

• What is Data analysis
• Views of Data Scientists
• Data and metadata
• Developing data table from a dataset
• Feature as mapping
• Feature as probabilistic distribution
• Popular distributions: Gaussian, Power Law, Uniform
• Mean and Variance


Home work 1: Data finding and Report writing
• 1. Each to form/join a team of one, two or three teammates; the team finds a meaningful dataset of their liking on the internet: say, by googling “data analysis dataset” (see next slide)
• Number of entities ≥ 80, of features ≥ 5
• No missings
• The dataset is to be approved by me. Deadline for sending the team’s request for approval to bmirkin@hse.ru: 20 October 2023. Those missing the deadline will have their mark reduced by 1 (on the scale of 10).
• To get my approval, a team member is to send me a message including:
◦ A very short explanation of the choice of the dataset
◦ Information on the dataset: the meaning and number of entities, list of features with their scale types and feature categories if categorical, source address
• 2. Start writing the team’s report file, to submit it later as either a Word or Adobe pdf file (pdf files of former years’ reports will be provided as examples for your convenience.)
• Project title page
• Section 1. Intro with information on the dataset
Finding a dataset of your liking, 1:
• Choose a subject of your liking, say “banking” or “global warming” or …
• Google that (like “banking datasets” or “global warming datasets”)
• Take a look at the first pages and click on a site of your liking; if this does not show any interesting dataset, repeat the attempt at a different site. Otherwise, go to the next item.
• If the data set is too large (say, more than a thousand objects), select a smaller subset over a convenient feature.
• Select a few features (fewer than a dozen but more than four) and develop the corresponding data table
• Present that to the Instructor, as explained above, and get approval.


Computation and comments
You may use any computing environment, including the most popular: Python, MatLab/Octave, R, etc. You may write your own code or use that provided by the environments.
In your report, you are to state exactly which tool was used, provide its code, and specify the parameters of your application. The less you comment on your solutions, the greater the penalty to be imposed on your mark. [It is assumed that the failure to comment is because of ignorance rather than out of laziness.]
