Statistics I
Viktor Nagy, Ph.D.
Óbuda University
Pro Scientia et Futuro
Statistics: the science that deals with the collection, tabulation, and systematic
classification of quantitative data, especially as a basis for inference and
induction.
+ Interpretation, presentation
Discrete: items that can be counted; they take on possible values that can be
listed out. The list may be fixed (finite) or it may go from 0, 1, 2, on to infinity
(making it countably infinite).
Continuous: can only be described using intervals; the possible values cannot be
counted.
[Diagram: classification of studies. Labels: Controls / No controls; Contemporaneous / Historical; Controlled experiment / Observational studies]
[Diagram: types of data. Labels: Qualitative / Quantitative; Measurement Scales]
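Means (unweighted form for N individual values; weighted form for k distinct values with frequencies $f_i$ and relative frequencies $g_i$, where $\sum_{i=1}^{k} f_i = N$)
Harmonic mean:
$\bar{X}_h = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}$, weighted: $\bar{X}_h = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} \frac{f_i}{X_i}} = \frac{1}{\sum_{i=1}^{k} \frac{g_i}{X_i}}$
Geometric mean:
$\bar{X}_g = \sqrt[N]{\prod_{i=1}^{N} X_i}$, weighted: $\bar{X}_g = \sqrt[N]{\prod_{i=1}^{k} X_i^{f_i}} = \prod_{i=1}^{k} X_i^{g_i}$
Arithmetic mean:
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$, weighted: $\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \sum_{i=1}^{k} g_i X_i$
Root mean square / quadratic mean:
$\bar{X}_q = \sqrt{\frac{\sum_{i=1}^{N} X_i^2}{N}}$, weighted: $\bar{X}_q = \sqrt{\frac{\sum_{i=1}^{k} f_i X_i^2}{\sum_{i=1}^{k} f_i}} = \sqrt{\sum_{i=1}^{k} g_i X_i^2}$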
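As a minimal sketch of the weighted forms of the four means, the following Python computes them for a small grouped data set. The function name `weighted_means` and the sample values are hypothetical and only serve as illustration.

```python
import math

def weighted_means(values, freqs):
    """Return (harmonic, geometric, arithmetic, quadratic) means of grouped data."""
    n = sum(freqs)  # N = sum of the frequencies f_i
    pairs = list(zip(values, freqs))
    harmonic = n / sum(f / x for x, f in pairs)
    # Geometric mean via logarithms: exp(sum(f_i * ln X_i) / N); needs X_i > 0.
    geometric = math.exp(sum(f * math.log(x) for x, f in pairs) / n)
    arithmetic = sum(f * x for x, f in pairs) / n
    quadratic = math.sqrt(sum(f * x * x for x, f in pairs) / n)
    return harmonic, geometric, arithmetic, quadratic

# Hypothetical grouped data: value X_i occurs f_i times.
values = [1, 2, 4]
freqs = [1, 2, 1]
h, g, a, q = weighted_means(values, freqs)
print(h, g, a, q)
```

For positive values that are not all equal, the results come out ordered as harmonic < geometric < arithmetic < quadratic, which is a quick sanity check on any implementation.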
Same ratio: [figure omitted; source: Wikipedia]
Frequency distribution: a table that shows the number of data observations that fall
into specific intervals; it is the starting point for describing the data graphically.
Points accumulated
33 22 15 20 17 13 34 22 5 17
17 14 26 27 26 34 12 6 21 23
26 6 34 20 20 4 13 34 4 18
28 17 14 38 11 12 23 11 32 26
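A frequency table for the points above can be sketched as follows. The class width of 10 and the helper name `frequency_table` are assumptions made for illustration; the slides do not state which intervals were used.

```python
from collections import Counter

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

def frequency_table(data, width):
    """Count observations per class interval [lower, lower + width)."""
    counts = Counter((x // width) * width for x in data)
    return [(lower, lower + width, counts[lower]) for lower in sorted(counts)]

table = frequency_table(points, 10)
for lower, upper, freq in table:
    print(f"{lower:>2} to under {upper:<2}: {freq}")
```

Classes that contain no observations are not listed by this sketch; with these data every class from 0 to under 40 is occupied.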
Measures of Central Tendency: describe the center point of a data set with a single value.
Mean or average: add all the values and divide by the number of observations.
Mean of grouped data from a frequency distribution: calculated by using the class midpoints, each weighted by its class frequency.
Median: the value in the data set for which half the observations are higher
and the other half are lower. When there is an even number of data points, the median
is the average of the two center points.
Mode: the observation that occurs most frequently. (A data set can have more than one mode.)
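The measures of central tendency can be checked on the 40 "Points accumulated" values with Python's standard `statistics` module. The grouped mean at the end assumes classes of width 10, which is an illustrative choice and not something the slides specify.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

mean = statistics.mean(points)        # sum of the values divided by their count
median = statistics.median(points)    # average of the 20th and 21st sorted values
modes = statistics.multimode(points)  # every value that ties for highest frequency

# Grouped mean from (lower, upper, frequency) classes, using class midpoints.
classes = [(0, 10, 5), (10, 20, 14), (20, 30, 14), (30, 40, 7)]
grouped_mean = (
    sum((lower + upper) / 2 * freq for lower, upper, freq in classes)
    / sum(freq for _, _, freq in classes)
)

print(mean, median, sorted(modes), grouped_mean)
```

The data set has three modes (17, 26 and 34 each occur four times), and the grouped mean (20.75) differs slightly from the exact mean (19.875) because midpoints only approximate the values within each class.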
Percentile: the percentage of observations that fall below a particular value.
Points accumulated
4 4 5 6 6 11 11 12 12 13
13 14 14 15 17 17 17 17 18 20
20 20 21 22 22 23 23 26 26 26
26 27 28 32 33 34 34 34 34 38
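Reading the percentile as "percentage of observations below a value", a minimal sketch on the sorted points is shown here. The function name `percent_below` is hypothetical, and other percentile conventions (for example counting ties as half) give slightly different numbers.

```python
points_sorted = [
    4, 4, 5, 6, 6, 11, 11, 12, 12, 13,
    13, 14, 14, 15, 17, 17, 17, 17, 18, 20,
    20, 20, 21, 22, 22, 23, 23, 26, 26, 26,
    26, 27, 28, 32, 33, 34, 34, 34, 34, 38,
]

def percent_below(data, value):
    """Percentage of observations strictly below `value`."""
    return 100 * sum(1 for x in data if x < value) / len(data)

rank_26 = percent_below(points_sorted, 26)  # 27 of 40 values are below 26
rank_20 = percent_below(points_sorted, 20)  # 19 of 40 values are below 20
print(rank_26, rank_20)
```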
Measures of Dispersion: describe how far the individual data values have strayed from
the mean.
Range: the difference between the highest value and the lowest value in the data set.
Interquartile range (IQR), the "middle fifty": the difference between the upper and lower
quartiles; it measures the spread of the center half of the data.
The IQR is used to identify outliers: extreme values whose accuracy is questioned and which can
cause unwanted distortions in statistical results. Values to be discarded:
$< Q_1 - 1.5 \cdot IQR$ or $> Q_3 + 1.5 \cdot IQR$
Interquantile range:….
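The range, the interquartile range and the 1.5 IQR outlier rule can be sketched on the points data as follows. Quartile conventions differ between textbooks and software; this sketch assumes the default ("exclusive") method of `statistics.quantiles`, which may not be the one used in the course.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

data_range = max(points) - min(points)

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(points, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in points if x < lower_fence or x > upper_fence]

print(data_range, q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With these data the fences lie outside the observed range, so no value is flagged as an outlier.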
Variance: the average squared distance between the data points and the mean of the
data set.
Variance of the population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
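The population variance, the mean of the squared deviations from the mean, can be sketched in a few lines. The sample values are hypothetical, and the comparison with `statistics.pvariance` is only a cross-check.

```python
import statistics

def population_variance(data):
    """Mean of squared deviations from the population mean (divides by N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

# Hypothetical population of eight values with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(sample)
std_dev = variance ** 0.5
print(variance, std_dev)
```

A sample variance would divide by N - 1 instead of N; the sketch deliberately covers only the population case named above.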
Mean absolute difference (MD) or absolute mean difference: the average absolute
difference of two independent values drawn from the population.
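A minimal sketch of the mean absolute difference follows. Two conventions exist: averaging over the N(N - 1) ordered pairs of distinct observations, or over all N² ordered pairs including each value paired with itself. Which one the course uses is not stated here, so the flag `include_self_pairs` is an assumption that lets either be computed.

```python
from itertools import combinations

def mean_absolute_difference(data, include_self_pairs=False):
    """Average |x_i - x_j| over ordered pairs of observations."""
    n = len(data)
    # Each unordered pair appears twice among the ordered pairs.
    total = 2 * sum(abs(a - b) for a, b in combinations(data, 2))
    denominator = n * n if include_self_pairs else n * (n - 1)
    return total / denominator

# Hypothetical data: pairwise differences are 1, 3 and 2.
values = [1, 2, 4]
md_distinct = mean_absolute_difference(values)
md_all = mean_absolute_difference(values, include_self_pairs=True)
print(md_distinct, md_all)
```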
Box plot
Box and whisker diagram
A histogram is a graphical representation of the distribution of numerical data. To construct a histogram, the
first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and
then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping
intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to
be) of equal size.
If the bins are of equal size, a rectangle is erected over the bin with height proportional to the frequency — the
number of cases in each bin. However, bins need not be of equal width; in that case, the erected rectangle is
defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the
frequency but frequency density — the number of cases per unit of the variable on the horizontal axis.
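The frequency density for bins of unequal width can be sketched as below. The bin edges 0, 10, 20, 40 are a hypothetical choice, picked only to show one wider bin; the data are the "Points accumulated" values from earlier in the deck.

```python
points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

def frequency_density(data, edges):
    """Return (lower, upper, count, count / width) for each bin [lower, upper)."""
    rows = []
    for lower, upper in zip(edges, edges[1:]):
        count = sum(1 for x in data if lower <= x < upper)
        rows.append((lower, upper, count, count / (upper - lower)))
    return rows

rows = frequency_density(points, [0, 10, 20, 40])
for lower, upper, count, density in rows:
    print(f"[{lower}, {upper}): count={count}, density={density}")
```

The last bin holds the most observations (21) but is twice as wide, so its bar is drawn lower than the middle bin's (density 1.05 against 1.4) while its area still reflects the larger count.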
At least $\left(1 - \frac{1}{k^2}\right) \times 100$ percent of the values will fall within k standard
deviations from the mean (for k > 1).
In general: $\mu \pm k\sigma$
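The bound can be evaluated for a few values of k with a short sketch; the function name `chebyshev_lower_bound` is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k**2) * 100

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 2))
```

The bound holds for any distribution with a finite variance, which is why it is much weaker than the percentages that apply to a normal distribution.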
Smaller SD: tighter and taller around the mean
Probability density function (PDF) or density of a continuous (random) variable describes the relative likelihood for this
variable to take on a given value.
The probability of the random variable falling within a particular range of values is given by the integral of this variable’s
density over that range. Its integral over the entire space is equal to one.
Standard normal distribution: $X \sim N(0, 1)$
Changing an x value to a z-value is called standardizing. The z-formula: $z = \frac{x - \mu}{\sigma}$
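Cumulative distribution function
Normal distribution: $F(x) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t - \mu)^2}{2\sigma^2}} \, dt$
Standard normal distribution: $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}} \, dt$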
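Standardizing and the normal cumulative distribution function can be sketched with the standard library alone, using the identity Φ(z) = (1 + erf(z / √2)) / 2. The function names and the example parameters (mean 100, standard deviation 15) are hypothetical.

```python
import math

def z_score(x, mu, sigma):
    """Standardize x: distance from the mean measured in standard deviations."""
    return (x - mu) / sigma

def standard_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1), expressed through the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal X with mean mu and standard deviation sigma."""
    return standard_normal_cdf(z_score(x, mu, sigma))

# Hypothetical example: X normal with mean 100 and standard deviation 15.
p_below_115 = normal_cdf(115, 100, 15)  # same as the standard normal CDF at z = 1
print(z_score(115, 100, 15), round(p_below_115, 4))
```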
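Skewness: the measure of asymmetry
Quartile-based: $F_{0.25} = \frac{(Q_3 - Me) - (Me - Q_1)}{(Q_3 - Me) + (Me - Q_1)}$
Decile-based: $F_{0.1} = \frac{(D_9 - Me) - (Me - D_1)}{(D_9 - Me) + (Me - D_1)}$
Mode-based: $A = \frac{\bar{X} - Mo}{\sigma}$
Median-based: $P = \frac{3 \cdot (\bar{X} - Me)}{\sigma}$
Interpretation: F > 0, A > 0, P > 0: right-skewed (long right tail); F = 0, A = 0, P = 0: symmetric; F < 0, A < 0, P < 0: left-skewed (long left tail)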
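The quartile-based measure F and the median-based measure P can be sketched on the "Points accumulated" data from earlier in the deck. The quartiles here come from the default method of `statistics.quantiles`, which is an assumption; the mode-based measure A is left out because these data have three modes.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

q1, me, q3 = statistics.quantiles(points, n=4)  # Q1, median, Q3

# Quartile-based skewness: compares the upper and lower halves of the middle fifty.
f_quartile = ((q3 - me) - (me - q1)) / ((q3 - me) + (me - q1))

# Median-based skewness, using the population standard deviation.
p_median = 3 * (statistics.mean(points) - me) / statistics.pstdev(points)

print(round(f_quartile, 4), round(p_median, 4))
```

Both measures come out slightly negative here, pointing to a mild left skew.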
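General contingency table (rows: classes of variable A; columns: classes of variable B; cells: joint frequencies $f_{ij}$)
           $C_1^B$    $C_2^B$    ...   $C_j^B$    ...   $C_{k-1}^B$     $C_k^B$    Total
$C_1^A$    $f_{11}$   $f_{12}$   ...   $f_{1j}$   ...   $f_{1(k-1)}$    $f_{1k}$   $f_{1\bullet}$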
Association: quality by factory
Quality    Factory A    Factory B    Factory C
Accept     31           7            4
Waste      3            35           45
Cramer’s V
Independent
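Association: each cell shows observed frequency / expected frequency under independence / chi-square contribution
Quality (Q)    Factory A             Factory B            Factory C            Total
Accept (A)     31 / 11.42 / 33.57    7 / 14.11 / 3.58     4 / 16.46 / 9.43     42
Waste (W)      3 / 22.58 / 16.98     35 / 27.89 / 1.81    45 / 32.54 / 4.77    83
Total          34                    42                   49                   125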
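The expected frequencies, the chi-square statistic and Cramér's V for the factory table can be reproduced with a short sketch. The function names are hypothetical, and the formula used for V, the square root of χ² / (N · min(r - 1, c - 1)), is the standard definition rather than one quoted from the slides.

```python
import math

# Observed frequencies: rows Accept, Waste; columns Factory A, B, C.
observed = [
    [31, 7, 4],
    [3, 35, 45],
]

def chi_square(table):
    """Return (chi-square statistic, grand total) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, freq in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            chi2 += (freq - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V: 0 means independence, 1 means a perfect association."""
    chi2, n = chi_square(table)
    smaller_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * smaller_dim))

chi2, n = chi_square(observed)
v = cramers_v(observed)
print(round(chi2, 2), round(v, 3))
```

Summing the rounded cell contributions from the table gives about 70.14; the unrounded computation gives about 70.13, and V comes out near 0.75, a fairly strong association between factory and quality.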
Mixed
Example interpretation: the type of the movie explains 93.44 percent of the variation in the number of tickets sold.
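Rank correlation: two referees (R1, R2) rank ten candidates (C1 to C10)
           C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
R1          8   10    5    4    7    1    2    6    3     9
R2          8    9    7    4    5    2    1    6    3    10
$d_i^2$     0    1    4    0    4    1    1    0    0     1     $\sum d_i^2 = 12$
$\rho = 1 - \frac{6 \sum d_i^2}{N (N^2 - 1)} = 1 - \frac{6 \cdot 12}{10 \cdot (10^2 - 1)} = 0.9273$   Strong, positive correlation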
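The rank correlation for the two referees can be recomputed with a minimal sketch. The function name `spearman_rho` is hypothetical, and the shortcut formula assumes rankings without ties, as in this example.

```python
# Ranks given by the two referees to candidates C1..C10.
r1 = [8, 10, 5, 4, 7, 1, 2, 6, 3, 9]
r2 = [8, 9, 7, 4, 5, 2, 1, 6, 3, 10]

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (N * (N^2 - 1))."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
rho = spearman_rho(r1, r2)
print(sum_d2, round(rho, 4))
```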
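Linear correlation: price and age of used cars, with $d_X = X - \bar{X}$ and $d_Y = Y - \bar{Y}$
Nr     Price of the car (€) X    Age of the car (years) Y    $d_X$    $d_Y$    $d_X d_Y$    $d_X^2$    $d_Y^2$
1      1300                      3                           150      -1.8     -270         22500      3.24
2      1000                      5                           -150     0.2      -30          22500      0.04
3      800                       8                           -350     3.2      -1120        122500     10.24
4      850                       7                           -300     2.2      -660         90000      4.84
5      1800                      1                           650      -3.8     -2470        422500     14.44
Sum    5750                      24                          0        0        -4550        680000     32.8
Average: $\bar{X} = 1150$, $\bar{Y} = 4.8$
$r_{XY} = \frac{C_{XY}}{\sigma_X \sigma_Y} = \frac{\sum d_X d_Y}{\sqrt{\sum d_X^2 \cdot \sum d_Y^2}} = \frac{-4550}{\sqrt{680000 \cdot 32.8}} = -0.9634$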
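The correlation between price and age can be recomputed from the five cars with a short sketch; the function name `pearson_r` is hypothetical.

```python
import math

prices = [1300, 1000, 800, 850, 1800]  # X, price of the car in euros
ages = [3, 5, 8, 7, 1]                 # Y, age of the car in years

def pearson_r(xs, ys):
    """Linear correlation: sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

r = pearson_r(prices, ages)
print(round(r, 4))
```

The coefficient is negative because older cars in this sample sell for less; its magnitude close to 1 indicates a strong linear relationship.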
INDEX NUMBERS
An index is a means of comparing changes in some variable, often price over time. This
is particularly useful when there are many items involved and when the prices and
quantities are in different units. The best known index is the consumer price index (CPI):
it is the official measure of inflation when it comes to up-rating public sector pensions
and benefits. It compares the price of a "basket" of goods from one month to another.
Consumer Price Index
The Consumer Price Index measures the average change in the prices of
goods and services purchased by households for their own consumption. It
measures the inflation of the national currency.
Consumer price index for pensioners
The consumer price index for pensioners shows how the differences in the
structure of pensioners' consumption influence the indices for this stratum of the
population. The three groups of commodities (foods, medicines, expenditures
related to housing) that have a major impact on pensioners'
consumption amount to 60 percent of the consumer basket of this stratum. The
index is calculated by eliminating products and services related to child care.
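SIMPLE INDICES
Changes in quantity, price, value: $i_q = \frac{q_1}{q_0}$, $i_p = \frac{p_1}{p_0}$, $i_v = \frac{v_1}{v_0}$, where $v = q \cdot p$
Value index:
$I_v = \frac{\sum q_1 p_1}{\sum q_0 p_0} = \frac{\sum v_1}{\sum v_0} = \frac{\sum v_0 i_v}{\sum v_0} = \frac{\sum v_1}{\sum \frac{v_1}{i_v}}$
Quantity index
Laspeyres (base-weighted): $I_q^0 = \frac{\sum q_1 p_0}{\sum q_0 p_0} = \frac{\sum q_0 p_0 i_q}{\sum q_0 p_0} = \frac{\sum q_1 p_0}{\sum \frac{q_1 p_0}{i_q}}$
Paasche (current-weighted): $I_q^1 = \frac{\sum q_1 p_1}{\sum q_0 p_1} = \frac{\sum q_0 p_1 i_q}{\sum q_0 p_1} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_q}}$
Price index
Laspeyres (base-weighted): $I_p^0 = \frac{\sum q_0 p_1}{\sum q_0 p_0} = \frac{\sum q_0 p_0 i_p}{\sum q_0 p_0} = \frac{\sum q_0 p_1}{\sum \frac{q_0 p_1}{i_p}}$
Paasche (current-weighted): $I_p^1 = \frac{\sum q_1 p_1}{\sum q_1 p_0} = \frac{\sum q_1 p_0 i_p}{\sum q_1 p_0} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_p}}$
Fisher indices: $I_q^F = \sqrt{I_q^0 \cdot I_q^1}$, $I_p^F = \sqrt{I_p^0 \cdot I_p^1}$
$I_v = I_q^0 \cdot I_p^1 = I_q^1 \cdot I_p^0 = I_q^F \cdot I_p^F$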
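The Laspeyres, Paasche and Fisher indices and the decomposition of the value index can be sketched on a two-product basket. All quantities and prices below are hypothetical illustration data, and the helper name `aggregate` is an assumption.

```python
import math

# Hypothetical basket of two products: base period (0) and current period (1).
q0, p0 = [10, 20], [2.0, 5.0]
q1, p1 = [12, 18], [2.5, 6.0]

def aggregate(quantities, prices):
    """Aggregate value: sum of quantity * price over all products."""
    return sum(q * p for q, p in zip(quantities, prices))

value_index = aggregate(q1, p1) / aggregate(q0, p0)

laspeyres_q = aggregate(q1, p0) / aggregate(q0, p0)  # base-period prices as weights
paasche_q = aggregate(q1, p1) / aggregate(q0, p1)    # current-period prices as weights
laspeyres_p = aggregate(q0, p1) / aggregate(q0, p0)  # base-period quantities as weights
paasche_p = aggregate(q1, p1) / aggregate(q1, p0)    # current-period quantities as weights

fisher_q = math.sqrt(laspeyres_q * paasche_q)
fisher_p = math.sqrt(laspeyres_p * paasche_p)

print(round(value_index, 4), round(fisher_q, 4), round(fisher_p, 4))
```

The value index equals each "crossed" product (Laspeyres quantity times Paasche price, and the reverse) as well as the product of the two Fisher indices, which the sketch lets a reader verify numerically.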