BData Analysis 2023 - 1
M=0.7E+0.3H (*)
where
H – Homework project mark
E – Exam mark
Data Science,
Data Analysis,
Machine Learning
Data science
Text analysis
Data
- Data Analysis
- GIS
- Data mining
- Fuzzy systems
- ………..
BDA 2023 Boris Mirkin 19
KDnuggets Poll 2019
Absent from that: WHY?
My guess:
- Data Analysis: absent from USA, present in France, Russia, Germany
- GIS: no industrial apps
- Data mining: outdated
- Fuzzy systems: not in vogue among those people (?)
- ………..
Community views of DA and ML
Query              Google, pages   HH.ru, vacancies   Superjob.ru, vacancies
Machine learning   2 660 000       2 474              859
Data analysis      70 100 000      13 605             1 190
DA by far surpasses ML
Queries made 2/09/2023
Data analysis
Pluto: a planet?
Planetary motion: Johannes Kepler's 3rd law
Cl. 1
Cl. 2
???
Copernicus (c. 1540): planets circle the Sun
Does
either
Planet    Period   Distance
(both in units of Earth's values)
Mercury   0.241    0.39
Venus     0.615    0.72
Earth     1.00     1.00
Mars      1.88     1.52
Jupiter   11.8     5.20
Saturn    29.5     9.54
Uranus    84.0     19.18
Neptune   165      30.06
Pluto     248      39.44
There should be a relation between speed/period and distance; which one?
Double success 5
Kepler's 3rd law:
Is there any relation between speed/period and distance?
Points on the "Distance-Period" plane fit no line…
Example of Small Data Analysis
Double success 6
Kepler's 3rd law (1619):
P^2 = D^3
(P is the period, D is the distance)
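The relation P^2 = D^3 becomes a straight line after taking logarithms: log P = 1.5 log D. The following is a minimal sketch, not the lecturer's own code, that checks this on the period and distance values tabulated on the earlier slide. The helper name `fit_log_log` is hypothetical; the fit is an ordinary least-squares line in log-log coordinates.

```python
import math

# (period, distance) pairs relative to Earth, copied from the slide's table.
PLANETS = {
    "Mercury": (0.241, 0.39),
    "Venus": (0.615, 0.72),
    "Earth": (1.00, 1.00),
    "Mars": (1.88, 1.52),
    "Jupiter": (11.8, 5.20),
    "Saturn": (29.5, 9.54),
    "Uranus": (84.0, 19.18),
    "Neptune": (165.0, 30.06),
    "Pluto": (248.0, 39.44),
}


def fit_log_log(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


periods = [p for p, _ in PLANETS.values()]
distances = [d for _, d in PLANETS.values()]
slope, intercept = fit_log_log(distances, periods)

# A slope near 1.5 and an intercept near 0 correspond to P = D^(3/2), i.e. P^2 = D^3.
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The points fit no line on the raw "Distance-Period" plane, but they fall almost exactly on a line once both axes are logarithmic.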
Double success 7
Ear-Nose
Lungs Bronchi
Smoking Drinking
Poor housing
Smoking/Drinking:
Statistically independent, not risk factors
Springer, UTiCS Series, 2019, 527 p.
Currently being revised and updated by me; the revision should be completed by the end of October 2023.
https://yadi.sk/d/s0jjzuBTMGDoxQ
CATEGORICAL QUANTITATIVE
Clustering PCA
Partition
SUMMARIZATION PCA as SVD
K-means et al.
Correspondence Analysis
Interpretation: Latent Semantic Indexing
Nominal scales
Quantitative scales
Comparing clusters
Bootstrap
Hierarchical clustering
CATEGORICAL QUANTITATIVE
1 0 3 0 1 22 1 8 1 0 0
2 1 1 1 0 38 1 53 0 1 0
3 1 3 1 0 26 0 8 1 0 0
4 1 1 1 0 35 1 53 1 0 0
5 0 3 0 1 35 0 8 1 0 0
6 0 3 0 1 28 0 8 0 0 1
7 1 3 1 0 27 2 8 1 0 0
◦ DA: Mapping I → R
Histogram (a): the range is divided into n (=20) bins; the numbers of objects falling into the bins are shown by bars.
Relative histogram (b): bars express the proportions of objects in the bins (they sum to 1).
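The two kinds of histogram can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `histogram` is hypothetical, bins are of equal width over [min, max], and the sample data are synthetic rather than taken from the course.

```python
import random


def histogram(values, n_bins=20):
    """Return (counts, proportions) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range cannot be binned")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins; clamp it into the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    total = len(values)
    proportions = [c / total for c in counts]
    return counts, proportions


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic feature values
counts, proportions = histogram(sample, n_bins=20)

# Crude text rendering: one '#' per 5 objects in the bin.
for c in counts:
    print("#" * (c // 5))
```

The counts give histogram (a); dividing by the number of objects gives the relative histogram (b), whose bars sum to 1.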
B) Feature as random variable, 3:
Density function f(x)
Two characteristics:
Mean:
E(f) = \mu = \int_{-\infty}^{+\infty} x f(x)\,dx
Variance:
Var(f) = E([f - \mu]^2)
Standard deviation: Square root of Var(f)
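To make the mean, variance and standard deviation concrete, here is a small numerical sketch. It assumes a density that vanishes outside a finite interval and uses the midpoint rule. The function name `mean_and_variance` and the example density f(x) = 2x on [0, 1] are illustrative choices, not taken from the slides.

```python
import math


def mean_and_variance(density, lo, hi, steps=10_000):
    """Approximate the mean and variance of a density on [lo, hi] by the midpoint rule."""
    h = (hi - lo) / steps
    xs = [lo + (i + 0.5) * h for i in range(steps)]
    mean = sum(x * density(x) for x in xs) * h
    variance = sum((x - mean) ** 2 * density(x) for x in xs) * h
    return mean, variance


def example_density(x):
    """Illustrative density f(x) = 2x on [0, 1]; it integrates to 1."""
    return 2.0 * x


mu, var = mean_and_variance(example_density, 0.0, 1.0)
sd = math.sqrt(var)  # standard deviation: square root of the variance
print(f"mean = {mu:.4f}, variance = {var:.4f}, std = {sd:.4f}")
```

For this density the exact values are mean 2/3 and variance 1/18, so the numerical result can be checked by hand.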
B) Feature as random variable, 6:
Gaussian/Normal N(0,1):
f(x) = \frac{1}{\sqrt{2\pi}} \exp\{-x^2/2\}
Power law / Hyperbolic law (Matthew effect, see next slide):
f(x) = c x^{-\alpha}
Uniform distribution:
f(x) = const on [a, b]
B) Power law: Matthew effect
"For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken away even that which he hath." Gospel of Matthew 25:29
Examples:
Wealth
Quotations
Web site popularity
Uniform distribution
◦ Nothing is known except for the interval [a, b]
Why is the constant 1/(b-a)?
Mean = (a+b)/2
Var = (b-a)^2/12
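The constant, the mean and the variance of the uniform density all follow from requiring that the density integrate to 1 over [a, b]. A short derivation, writing c for the constant value of f(x) (the symbol c is introduced here for illustration):

```latex
\int_a^b c \, dx = c\,(b-a) = 1
\quad\Longrightarrow\quad
c = \frac{1}{b-a}

\mu = \int_a^b \frac{x}{b-a}\, dx
    = \frac{b^2 - a^2}{2\,(b-a)}
    = \frac{a+b}{2}

\mathrm{Var} = \int_a^b \frac{(x-\mu)^2}{b-a}\, dx
    = \frac{1}{b-a}\left[\frac{(x-\mu)^3}{3}\right]_a^b
    = \frac{1}{b-a}\cdot\frac{2}{3}\left(\frac{b-a}{2}\right)^3
    = \frac{(b-a)^2}{12}
```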