Statistics I

Óbuda University
Pro Scientia et Futuro
Viktor Nagy, Ph.D.
Statistics: the science that deals with the collection, tabulation, and systematic
classification of quantitative data, especially as a basis for inference and
induction.
+ Interpretation, presentation
In everyday life?

• National census
• Entire sports industry
Descriptive: to summarize and display data so we can quickly obtain an overview.
Inferential: to make claims and draw conclusions about a population based on a sample.
Population: a set of items which is of interest for some question.

Subpopulation: a subset that shares one or more additional properties.
Sample: a subset of the population.
Data: value assigned to an observation or a measurement.

Parameter: data that describes a characteristic about a population.
Statistic: …about a sample
Variable: any characteristic or numerical value that varies from individual to
individual.
Information: data transformed into useful facts
Sources: primary and secondary
Quantitative data: numerical values (Numerical)

Qualitative data: descriptive terms to classify (Categorical)
Discrete: items that can be counted; they take on possible values that can be
listed out. The list may be fixed (finite) or it may go from 0, 1, 2, on to infinity
(making it countably infinite).
Continuous: can only be described using intervals; the possible values cannot be
counted.
The method of comparison: to learn the effect of a treatment, compare the
responses of a treatment group with those of a control group. To make sure that the
treatment group is like the control group, investigators assign subjects to the groups at
random. The control group is given a placebo, which is neutral but resembles the treatment,
so that the response is to the treatment itself rather than to the idea of treatment.
Double-blind experiment: the subjects do not know whether they are in treatment
or in control; neither do those who evaluate the responses.
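Random assignment can be sketched in a few lines of Python. This is only an illustration: the function name `randomize`, the subject IDs and the fixed seed are assumptions made for the example, not part of the lecture.

```python
import random


def randomize(subjects, seed=None):
    """Split subjects into a treatment and a control group at random."""
    rng = random.Random(seed)          # seed only makes the example repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)              # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject IDs 1..20
treatment, control = randomize(range(1, 21), seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides the group, known and unknown characteristics of the subjects tend to be balanced between the two groups.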
Diagram (types of studies): controlled experiment / observational studies; randomized / not randomized; controls / no controls; contemporaneous / historical.
In an observational study, the investigators do not assign the subjects to
treatment or control. An observational study can establish association: one thing
is linked to another. Association may point to causation, but the effects of
treatment may be confounded. The confounder is typically a third variable, linked to
both the treatment and the response.
Types of data: qualitative and quantitative.
Measurement scales: nominal, ordinal, interval, ratio.

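Means: unweighted and weighted

Notation: the unweighted forms use N observations $X_i$. In the weighted forms, $f_i$ is the frequency of the value $X_i$ (k distinct values, $\sum_{i=1}^{k} f_i = N$) and $g_i = f_i / \sum_{i=1}^{k} f_i$ is the relative frequency.

Harmonic mean:
$$\bar{X}_h = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}, \qquad \bar{X}_h = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} \frac{f_i}{X_i}} = \frac{1}{\sum_{i=1}^{k} \frac{g_i}{X_i}}$$

Geometric mean:
$$\bar{X}_g = \sqrt[N]{\prod_{i=1}^{N} X_i}, \qquad \bar{X}_g = \sqrt[\sum f_i]{\prod_{i=1}^{k} X_i^{f_i}} = \prod_{i=1}^{k} X_i^{g_i}$$

Arithmetic mean:
$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}, \qquad \bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \sum_{i=1}^{k} g_i X_i$$

Root mean square (quadratic mean):
$$\bar{X}_q = \sqrt{\frac{\sum_{i=1}^{N} X_i^2}{N}}, \qquad \bar{X}_q = \sqrt{\frac{\sum_{i=1}^{k} f_i X_i^2}{\sum_{i=1}^{k} f_i}} = \sqrt{\sum_{i=1}^{k} g_i X_i^2}$$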
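The four means can be checked numerically with a short Python sketch. The function names and the sample values are assumptions chosen for illustration; passing frequencies `fs` gives the weighted form, omitting them gives the unweighted form.

```python
import math


def _weights(xs, fs):
    # Without frequencies every value counts once (unweighted form).
    return [1] * len(xs) if fs is None else list(fs)


def harmonic_mean(xs, fs=None):
    fs = _weights(xs, fs)
    return sum(fs) / sum(f / x for f, x in zip(fs, xs))


def geometric_mean(xs, fs=None):
    fs = _weights(xs, fs)
    # Logarithms avoid overflow in the product; values must be positive.
    return math.exp(sum(f * math.log(x) for f, x in zip(fs, xs)) / sum(fs))


def arithmetic_mean(xs, fs=None):
    fs = _weights(xs, fs)
    return sum(f * x for f, x in zip(fs, xs)) / sum(fs)


def quadratic_mean(xs, fs=None):
    fs = _weights(xs, fs)
    return math.sqrt(sum(f * x * x for f, x in zip(fs, xs)) / sum(fs))


values = [1, 2, 4]  # hypothetical data
for mean in (harmonic_mean, geometric_mean, arithmetic_mean, quadratic_mean):
    print(mean.__name__, mean(values))
```

For positive data the results are ordered: harmonic ≤ geometric ≤ arithmetic ≤ quadratic.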
Lorenz curve: a graphical representation of the distribution of concentration.
Gini coefficient: a ratio (fraction) that measures the inequality of a distribution.
The numerator is the area between the Lorenz curve and the line of equality
(the uniform distribution line); the denominator is the area under the line of
equality (the triangle).
Gini index: the coefficient expressed as a percentage.
Same ratio:
Figure: Wikipedia
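The area ratio behind the Gini coefficient can be sketched in Python. This is a minimal illustration that builds the Lorenz curve from individual values and uses the trapezoidal rule; the function names and the sample incomes are assumptions, not taken from the slides.

```python
def lorenz_points(values):
    """Cumulative population share vs. cumulative value share."""
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(ordered, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini(values):
    pts = lorenz_points(values)
    # Area under the Lorenz curve (trapezoidal rule).
    area_under_curve = sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    triangle = 0.5  # area under the line of equality
    return (triangle - area_under_curve) / triangle


print(gini([5, 5, 5, 5]))   # perfectly equal incomes
print(gini([0, 0, 0, 1]))   # one person holds everything
```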
Frequency distribution: a table that shows the number of data observations that fall
into specific intervals (classes); a simple way to describe the data.

Points accumulated
33 22 15 20 17 13 34 22 5 17
17 14 26 27 26 34 12 6 21 23
26 6 34 20 20 4 13 34 4 18
28 17 14 38 11 12 23 11 32 26

Points     Number of Students
0 - 25     27
26 - 31    6
32 - 37    6
38 - 43    1
44 - 50    0

Rules for constructing classes:
1. Classes of equal size
2. Mutually exclusive classes (no overlapping)
3. No fewer than 5 and no more than 15 classes (otherwise the true characteristics will be hidden)
4. Avoid open-ended classes
5. Include all data values (exhaustive)
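The class counts in the table can be reproduced with a short Python sketch. The class limits are the ones shown in the table; the function name is an assumption for the example.

```python
points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]
classes = [(0, 25), (26, 31), (32, 37), (38, 43), (44, 50)]


def frequency_distribution(data, classes):
    """Count the observations that fall into each closed class [low, high]."""
    return [sum(1 for x in data if low <= x <= high) for low, high in classes]


counts = frequency_distribution(points, classes)
for (low, high), count in zip(classes, counts):
    print(f"{low:>2} - {high:<2}  {count}")
```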
Relative frequency distribution: displays the share of observations in each class
relative to the total number of observations.

Points     Number of Students   Relative frequency
0 - 25     27                   27/40 = 0.675
26 - 31    6                    6/40 = 0.150
32 - 37    6                    6/40 = 0.150
38 - 43    1                    1/40 = 0.025
44 - 50    0                    0/40 = 0.000
Total      40                   1.000
Cumulative relative frequency distribution: indicates the share of observations
that are less than or equal to the upper limit of the current class.

Points     Number of Students   Relative frequency   Cumulative relative frequency
0 - 25     27                   27/40 = 0.675        0.675
26 - 31    6                    6/40 = 0.150         0.825
32 - 37    6                    6/40 = 0.150         0.975
38 - 43    1                    1/40 = 0.025         1.000
44 - 50    0                    0/40 = 0.000         1.000
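The relative and cumulative columns follow directly from the class counts. A minimal Python sketch, using the counts from the table (the variable names are assumptions):

```python
from itertools import accumulate

counts = [27, 6, 6, 1, 0]          # students per class, from the table
total = sum(counts)

relative = [c / total for c in counts]
# Accumulate the counts first, then divide, to avoid summing rounded fractions.
cumulative = [c / total for c in accumulate(counts)]

for rel, cum in zip(relative, cumulative):
    print(f"{rel:.3f}  {cum:.3f}")
```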
Measures of Central Tendency: describe the center point of a data set with a single value.
Mean or average: add all the values and divide by the number of observations.
Weighted mean: multiply each value by its weight, add the products, and divide by the sum of the weights.

Type       Score   Weight (Percent)
Exam       94      50
Project    89      35
Homework   83      15

Mean of grouped data from a frequency distribution: calculated by using the class midpoints in place of the individual values.

Points     Number of Students
0 - 25     27
26 - 31    6
32 - 37    6
38 - 43    1
44 - 50    0

The result is only an approximation to the mean.
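Both calculations are weighted means, so one small function covers them. The sketch below uses the scores, weights, classes and counts from the tables; the function and variable names are assumptions.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Course grade: scores weighted by their percentage weights.
course_grade = weighted_mean([94, 89, 83], [50, 35, 15])

# Grouped data: class midpoints weighted by class frequencies.
classes = [(0, 25), (26, 31), (32, 37), (38, 43), (44, 50)]
counts = [27, 6, 6, 1, 0]
midpoints = [(low + high) / 2 for low, high in classes]
grouped_mean = weighted_mean(midpoints, counts)

print(course_grade)   # weighted mean of the three scores
print(grouped_mean)   # approximation based on midpoints
```

The weighted course grade comes to 90.6, and the grouped-data approximation to 18.9.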
Median: the value in the data set for which half the observations are higher
and the other half are lower. When there is an even number of data points, the median
is the average of the two center points.
Mode: the observation that occurs most frequently. (More than one mode is possible.)
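For the 40 point scores used earlier, the median and the modes can be obtained with Python's standard `statistics` module. A brief sketch (the variable names are assumptions):

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

median = statistics.median(points)             # average of the 20th and 21st ordered values
modes = sorted(statistics.multimode(points))   # all values tied for the highest frequency

print(median, modes)
```

Here three scores share the highest frequency, which illustrates that a data set can have more than one mode.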
Percentile: the kth percentile is the value below which k percent of the observations fall.
Finding the kth percentile:

1. Order all numbers in the data set from smallest to largest.
2. Multiply k percent by the total number of values (N).
3. If the result is a whole number, go to Step 5. If not, round it up to the nearest
whole number and go to Step 4.
4. Count the values from left to right until you reach the position from Step 3. The
corresponding value is the kth percentile.
5. Count the values from left to right until you reach that whole number. The kth
percentile is the average of that corresponding value and the next value in
your data set.

Points accumulated (ordered)
4 4 5 6 6 11 11 12 12 13
13 14 14 15 17 17 17 17 18 20
20 20 21 22 22 23 23 26 26 26
26 27 28 32 33 34 34 34 34 38
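The five steps above can be sketched as a Python function. The function name is an assumption, and the sketch is restricted to 0 < k < 100 so that the "next value" in Step 5 always exists.

```python
import math


def percentile(data, k):
    """kth percentile following the five-step procedure (0 < k < 100)."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(data)                         # Step 1
    position = k * len(ordered) / 100              # Step 2
    if position == int(position):                  # Step 3: whole number
        i = int(position)                          # Step 5: 1-based position
        return (ordered[i - 1] + ordered[i]) / 2   # average with the next value
    i = math.ceil(position)                        # Step 3: round up
    return ordered[i - 1]                          # Step 4


points = [
    4, 4, 5, 6, 6, 11, 11, 12, 12, 13,
    13, 14, 14, 15, 17, 17, 17, 17, 18, 20,
    20, 20, 21, 22, 22, 23, 23, 26, 26, 26,
    26, 27, 28, 32, 33, 34, 34, 34, 34, 38,
]
print(percentile(points, 25), percentile(points, 33), percentile(points, 90))
```

For example, 25 percent of 40 is the whole number 10, so the 25th percentile is the average of the 10th and 11th ordered values.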
Measures of Dispersion: describe how far the individual data values have strayed from
the mean.
Range: the difference between the highest value and the lowest value in the data set.
Interquartile range (IQR): the "middle fifty"; the difference between the upper and lower
quartiles. It measures the spread of the center half of the data.
The IQR is used to identify outliers: extreme values whose accuracy is questionable and which
can cause unwanted distortions in statistical results. Values to be discarded:
< Q1 - 1.5·IQR or > Q3 + 1.5·IQR
Interquantile range:….
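The range, the IQR and the outlier fences can be sketched for the 40 point scores as follows. Quartiles are computed with `statistics.quantiles` using the `inclusive` method, which is one of several quartile conventions and an assumption of this example.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

data_range = max(points) - min(points)
q1, _, q3 = statistics.quantiles(points, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in points if x < lower_fence or x > upper_fence]

print(data_range, q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With these scores no value lies outside the fences, so nothing would be discarded.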
Average absolute deviation (mean absolute deviation) from the mean: the average of the
absolute deviations from the mean.
Variance: the average of the squared deviations of the data points from the mean of the
data set.
Variance of the population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
Standard deviation (SD): the square root of the variance.
Relative standard deviation (RSD) or coefficient of variation (CV): a standardized
measure of the dispersion of a (probability or frequency) distribution; the standard
deviation divided by the mean.
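These measures can be computed with the standard `statistics` module. The data set below is hypothetical and chosen so the results are round numbers; `pvariance` and `pstdev` are the population versions.

```python
import statistics


def average_absolute_deviation(data):
    mean = statistics.fmean(data)
    return sum(abs(x - mean) for x in data) / len(data)


data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical observations, mean 5

aad = average_absolute_deviation(data)
variance = statistics.pvariance(data)  # population variance
sd = statistics.pstdev(data)           # population standard deviation
cv = sd / statistics.fmean(data)       # coefficient of variation

print(aad, variance, sd, cv)
```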
Mean absolute difference (MD) or absolute mean difference: the average absolute
difference of two independent values drawn from the population.
Relative mean absolute difference: the mean absolute difference divided by the mean;
equal to twice the Gini coefficient (Lorenz curve).
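A small Python sketch of the mean absolute difference and its relative form. It averages over all ordered pairs, including a value paired with itself, which is one common convention and an assumption of this example; the data are hypothetical.

```python
def mean_absolute_difference(values):
    """Average |a - b| over all ordered pairs, pairs (a, a) included."""
    n = len(values)
    return sum(abs(a - b) for a in values for b in values) / (n * n)


def relative_mean_absolute_difference(values):
    mean = sum(values) / len(values)
    return mean_absolute_difference(values) / mean


incomes = [0, 0, 0, 1]                     # hypothetical: one person holds everything
rmd = relative_mean_absolute_difference(incomes)
gini_from_rmd = rmd / 2                    # RMD is twice the Gini coefficient

print(rmd, gini_from_rmd)
```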

The five-number summary & box and whisker diagram(s)
Histogram & (relative) frequency polygon
Probability density function, (standard) normal distribution, distribution function, z-formula
Skewness & kurtosis

The Five-Number Summary:
1. The minimum (smallest) number in the data set.
2. The 25th percentile (lower quartile, Q1).
3. The median.
4. The 75th percentile (upper quartile, Q3).
5. The maximum (largest) number in the data set.
Figure: box plot (box and whisker diagram)
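A five-number summary for the 40 point scores can be sketched as below. The quartile convention (`method="inclusive"` of `statistics.quantiles`) and the function name are assumptions of the example; other conventions can give slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]
print(five_number_summary(points))
```

A box plot draws the box from Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.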
A histogram is a graphical representation of the distribution of numerical data. To construct a histogram, the
first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and
then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping
intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to
be) of equal size.
If the bins are of equal size, a rectangle is erected over the bin with height proportional to the frequency, i.e. the
number of cases in each bin. However, bins need not be of equal width; in that case, the erected rectangle is
defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the
frequency but the frequency density: the number of cases per unit of the variable on the horizontal axis.
Figure: histogram shapes (symmetric; skewed right, i.e. positive skew; bimodal).
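With unequal bins, the bar height is the frequency density rather than the frequency. A minimal Python sketch, reusing the class counts of the points table; the bin edges (half-open intervals on whole-number scores) are an assumption of the example.

```python
# Bin edges for the classes 0-25, 26-31, 32-37, 38-43, 44-50 (integer scores).
edges = [0, 26, 32, 38, 44, 51]
counts = [27, 6, 6, 1, 0]

widths = [high - low for low, high in zip(edges, edges[1:])]
densities = [count / width for count, width in zip(counts, widths)]

for low, high, density in zip(edges, edges[1:], densities):
    print(f"[{low:>2}, {high:>2})  height = {density:.3f}")

# The bar areas (height x width) add up to the number of observations.
total_area = sum(d * w for d, w in zip(densities, widths))
```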

Frequency Polygons
Frequency polygons are a graphical device for understanding the shapes of
distributions. They serve the same purpose as histograms, but are
especially helpful for comparing sets of data. Frequency polygons are also
a good choice for displaying cumulative frequency distributions.
The values of many large data sets tend to cluster around the mean (or median or mode), so that the
histogram resembles a bell-shaped, symmetrical curve. When this is the case, the empirical rule applies:
approximately 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean, respectively.
Chebyshev’s Theorem: a mathematical rule, similar to the empirical rule except that
it applies to any distribution rather than just bell-shaped, symmetrical
distributions. It states that for any number k greater than 1, at least
$$\left(1 - \frac{1}{k^2}\right) \times 100$$
percent of the values will fall within k standard deviations of the mean.
In general: $\mu \pm k\sigma$
Figure: normal curves with different standard deviations (smaller SD: tighter and taller around the mean).
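Chebyshev's bound is easy to tabulate next to the empirical rule. A short Python sketch (the function name is an assumption):

```python
def chebyshev_percent(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100


for k in (1.5, 2, 3):
    print(k, round(chebyshev_percent(k), 2))
```

For k = 2 the bound is 75 percent, lower than the roughly 95 percent of the empirical rule, because Chebyshev's theorem makes no assumption about the shape of the distribution.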
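Standard normal distribution:
$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}, \qquad X \sim N(0, 1)$$
Normal distributions:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad X \sim N(\mu, \sigma)$$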
Probability density function (PDF) or density of a continuous (random) variable describes the relative likelihood for this
variable to take on a given value.
The probability of the random variable falling within a particular range of values is given by the integral of this variable’s
density over that range. Its integral over the entire space is equal to one.
Changing an x value to a z-value is called standardizing. The z-formula:
$$z = \frac{x - \mu}{\sigma}, \qquad \frac{X - \mu}{\sigma} \sim N(0, 1)$$
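The density, the z-formula and the probability of a range can be sketched numerically in Python. The standard normal distribution function is evaluated here through `math.erf`, a standard identity used as a substitute for numerical integration; the function names and the example values (mean 100, standard deviation 15) are assumptions.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def z_score(x, mu, sigma):
    return (x - mu) / sigma


def standard_normal_cdf(z):
    # Integral of the standard normal density from minus infinity to z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def probability_between(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma), via standardizing."""
    return standard_normal_cdf(z_score(b, mu, sigma)) - standard_normal_cdf(z_score(a, mu, sigma))


# Hypothetical example: share of values within one SD of the mean.
print(probability_between(85, 115, mu=100, sigma=15))
```

The result reproduces the first figure of the empirical rule, about 68 percent.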

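Cumulative distribution function:
$$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\, dt, \qquad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt$$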
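Skewness: the measure of asymmetry
$$F_{0.25} = \frac{(Q_3 - Me) - (Me - Q_1)}{(Q_3 - Me) + (Me - Q_1)}$$
$$F_{0.1} = \frac{(D_9 - Me) - (Me - D_1)}{(D_9 - Me) + (Me - D_1)}$$
$$A = \frac{\bar{X} - Mo}{\sigma}, \qquad P = \frac{3\,(\bar{X} - Me)}{\sigma}$$
Here $Q_1$, $Q_3$ are the quartiles, $D_1$, $D_9$ the first and ninth deciles, $Me$ the median and $Mo$ the mode.
Figure: three distribution shapes, labeled F > 0, A > 0, P > 0 (skewed right); F = 0, A = 0, P = 0 (symmetric); F < 0, A < 0, P < 0 (skewed left).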
Kurtosis: the sharpness of the peak of a distribution
$$K = \frac{Q_3 - Q_1}{2\,(D_9 - D_1)}$$
Sharp peak: K < 0.263
Flat: K > 0.263
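The quantile-based skewness and kurtosis measures can be sketched in Python. Quartiles and deciles come from `statistics.quantiles` with the `inclusive` method, which is one possible convention and an assumption here; the mode-based measure A is left out because the mode need not be unique. The data are hypothetical.

```python
import statistics


def shape_measures(data):
    q1, me, q3 = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    d1, d9 = deciles[0], deciles[-1]
    mean = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return {
        "F_0.25": ((q3 - me) - (me - q1)) / ((q3 - me) + (me - q1)),
        "F_0.1": ((d9 - me) - (me - d1)) / ((d9 - me) + (me - d1)),
        "P": 3 * (mean - me) / sigma,
        "K": (q3 - q1) / (2 * (d9 - d1)),
    }


symmetric = list(range(1, 12))      # hypothetical data: 1, 2, ..., 11
measures = shape_measures(symmetric)
print(measures)
```

For this symmetric data set all three skewness measures are zero.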
Cross-tabulation (contingency table, or crosstab): shows the relationship between two
categorical variables.
1. Association (between nominal data)
2. Mixed (nominal and ratio data)
3. Correlation (quantitative data)
General contingency table (rows: classes of variable A; columns: classes of variable B):

             C_1^B     C_2^B     …   C_j^B     …   C_(k-1)^B     C_k^B     Σ_j
C_1^A        f11       f12       …   f1j       …   f1(k-1)       f1k       f1●
C_2^A        f21       f22       …   f2j       …   f2(k-1)       f2k       f2●
…            …         …         …   …         …   …             …         …
C_i^A        fi1       fi2       …   fij       …   fi(k-1)       fik       fi●
…            …         …         …   …         …   …             …         …
C_(r-1)^A    f(r-1)1   f(r-1)2   …   f(r-1)j   …   f(r-1)(k-1)   f(r-1)k   f(r-1)●
C_r^A        fr1       fr2       …   frj       …   fr(k-1)       frk       fr●
Σ_i          f●1       f●2       …   f●j       …   f●(k-1)       f●k       N
Association (binary variable)
Yule’s coefficient of association

        Dog   Bear   Σ
boy     3     10     13
girl    15    2      17
Σ       18    12     30
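For a 2×2 table with cells a, b (first row) and c, d (second row), Yule's coefficient is usually written as Q = (ad − bc) / (ad + bc). A short Python sketch applied to the table above (the function name is an assumption):

```python
def yules_q(a, b, c, d):
    """Yule's coefficient of association for the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)


# boy: 3 dog, 10 bear; girl: 15 dog, 2 bear
q = yules_q(3, 10, 15, 2)
print(round(q, 4))
```

The value is close to −1, which indicates a strong association between the two binary variables in this table.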
Association

Quality    Factory A   Factory B   Factory C
Accept     31          7           4
Waste      3           35          45

Cramer’s V
Independent
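Association: observed frequencies, expected frequencies under independence, and chi-square contributions

Quality (Q)                 Factory A   Factory B   Factory C   Total
A (Accept)   observed       31          7           4           42
             expected       11.42       14.11       16.46
             contribution   33.57       3.58        9.43
W (Waste)    observed       3           35          45          83
             expected       22.58       27.89       32.54
             contribution   16.98       1.81        4.77
Total                       34          42          49          125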
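The expected frequencies, the chi-square statistic and Cramér's V for the factory table can be reproduced with the sketch below. It assumes the usual definitions: expected frequency = row total × column total / N, and V = sqrt(χ² / (N · min(r − 1, k − 1))). The function names are assumptions.

```python
import math

observed = [
    [31, 7, 4],     # Accept
    [3, 35, 45],    # Waste
]


def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # under independence
            chi2 += (f - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    chi2, n = chi_square(table)
    smaller_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * smaller_dim))


chi2, n = chi_square(observed)
v = cramers_v(observed)
print(round(chi2, 2), round(v, 3))
```

The small difference from the sum of the rounded contributions in the table comes from rounding the expected frequencies to two decimals.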
Mixed

Tickets sold per day, by type of movie:

Day     Action   Romantic   Comedy
1.      57       14         38
2.      61       5          50
3.      -        20         -
Mean    59       13         44

Sum of Squares Within groups
Sum of Squares Between groups
Sum of Squares Total
The type of the movie explains 93.44 percent of the variation in the number of tickets sold.
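The three sums of squares for the movie data can be computed as follows. The sketch assumes the usual decomposition, total = within + between, with the explained share given by between / total; the names are chosen for the example.

```python
groups = {
    "Action": [57, 61],
    "Romantic": [14, 5, 20],
    "Comedy": [38, 50],
}


def sums_of_squares(groups):
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_within = 0.0
    ss_between = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((x - group_mean) ** 2 for x in values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_within, ss_between, ss_total


ssw, ssb, sst = sums_of_squares(groups)
explained_share = ssb / sst
print(ssw, ssb, sst, explained_share)
```

The explained share is roughly 93.4 percent, in line with the figure quoted on the slide.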
Spearman’s rank correlation coefficient (rho)
Candidates and Referees

Candidate    C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
Referee R1   8    10   5    4    7    1    2    6    3    9
Referee R2   8    9    7    4    5    2    1    6    3    10
d_i^2        0    1    4    0    4    1    1    0    0    1     Σ d_i^2 = 12

$$\rho = 1 - \frac{6 \sum d_i^2}{N (N^2 - 1)} = 1 - \frac{6 \cdot 12}{10 \cdot (10^2 - 1)} = 0.9273$$
Strong, positive correlation
The coefficient varies: $-1 \le \rho \le 1$
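The rank correlation of the two referees can be verified with a few lines of Python. The function name is an assumption, and the formula used here presupposes rankings without ties, as in this example.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two rankings without ties."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


referee_1 = [8, 10, 5, 4, 7, 1, 2, 6, 3, 9]
referee_2 = [8, 9, 7, 4, 5, 2, 1, 6, 3, 10]

rho = spearman_rho(referee_1, referee_2)
print(round(rho, 4))
```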

Pearson’s correlation coefficient

Nr        Price of the car (€), X   Age of the car (years), Y   d_X = X - X̄   d_Y = Y - Ȳ   d_X·d_Y   d_X^2    d_Y^2
1         1300                      3                           150           -1.8          -270      22500    3.24
2         1000                      5                           -150          0.2           -30       22500    0.04
3         800                       8                           -350          3.2           -1120     122500   10.24
4         850                       7                           -300          2.2           -660      90000    4.84
5         1800                      1                           650           -3.8          -2470     422500   14.44
Sum       5750                      24                          0             0             -4550     680000   32.8
Average   X̄ = 1150                  Ȳ = 4.8

$$R_{XY} = \frac{C_{XY}}{\sigma_X \sigma_Y} = \frac{\sum d_X d_Y}{\sqrt{\sum d_X^2 \cdot \sum d_Y^2}} = \frac{-4550}{\sqrt{680000 \cdot 32.8}} = -0.9634$$
Strong, negative correlation
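The same calculation, following the columns of the table (deviations from the means, their products and squares), can be sketched in Python. The function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


prices = [1300, 1000, 800, 850, 1800]   # price of the car (€)
ages = [3, 5, 8, 7, 1]                  # age of the car (years)

r = pearson_r(prices, ages)
print(round(r, 4))
```

The sign is negative: older cars in this sample tend to be cheaper.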

INDEX NUMBERS
An index is a means of comparing changes in some variable, often price, over time. This
is particularly useful when there are many items involved and when the prices and
quantities are in different units. The best-known index is the consumer price index (CPI):
it is the official measure of inflation when it comes to up-rating public sector pensions
and benefits. It compares the price of a "basket" of goods from one month to another.
Consumer Price Index
The Consumer Price Index measures the average change in the prices of
goods and services purchased by households for their own consumption. It
measures the inflation of the national currency.
Consumer price index for pensioners
The consumer price index for pensioners shows how the differences in the
structure of pensioners' consumption influence the indices of this stratum of the
population. The three groups of commodities (foods, medicines, expenditures
related to housing) that have a major impact on pensioners'
consumption amount to 60 percent of the consumer basket of this stratum. The
index is calculated by eliminating products and services related to child care.
SIMPLE INDICES
Changes in quantity, price, value (for a single item; 0 = base period, 1 = current period)
$$i_q = \frac{q_1}{q_0}, \qquad i_p = \frac{p_1}{p_0}, \qquad i_v = \frac{v_1}{v_0} = \frac{q_1 \cdot p_1}{q_0 \cdot p_0} = i_q \cdot i_p$$
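A quick numerical check of the simple indices for one item. The quantities and prices below are hypothetical values chosen for illustration.

```python
# Hypothetical item: base period (0) and current period (1)
q0, q1 = 10, 12      # quantities
p0, p1 = 5.0, 6.0    # unit prices

i_q = q1 / q0                    # quantity index
i_p = p1 / p0                    # price index
i_v = (q1 * p1) / (q0 * p0)      # value index

print(i_q, i_p, i_v)
```

The value index equals the product of the quantity and price indices, here 1.2 × 1.2 = 1.44.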
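WEIGHTED AGGREGATE INDICES
Sums run over all items; $i_q$, $i_p$, $i_v$ are the simple indices of the individual items.
Value:
$$I_v = \frac{\sum q_1 p_1}{\sum q_0 p_0} = \frac{\sum v_1}{\sum v_0} = \frac{\sum v_0\, i_v}{\sum v_0} = \frac{\sum v_1}{\sum \frac{v_1}{i_v}}$$
Quantity, Laspeyres (base-weighted):
$$I_q^0 = \frac{\sum q_1 p_0}{\sum q_0 p_0} = \frac{\sum q_0 p_0\, i_q}{\sum q_0 p_0} = \frac{\sum q_1 p_0}{\sum \frac{q_1 p_0}{i_q}}$$
Quantity, Paasche (current-weighted):
$$I_q^1 = \frac{\sum q_1 p_1}{\sum q_0 p_1} = \frac{\sum q_0 p_1\, i_q}{\sum q_0 p_1} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_q}}$$
Price, Laspeyres (base-weighted):
$$I_p^0 = \frac{\sum q_0 p_1}{\sum q_0 p_0} = \frac{\sum q_0 p_0\, i_p}{\sum q_0 p_0} = \frac{\sum q_0 p_1}{\sum \frac{q_0 p_1}{i_p}}$$
Price, Paasche (current-weighted):
$$I_p^1 = \frac{\sum q_1 p_1}{\sum q_1 p_0} = \frac{\sum q_1 p_0\, i_p}{\sum q_1 p_0} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_p}}$$
Fisher indices:
$$I_q^F = \sqrt{I_q^0 \cdot I_q^1}, \qquad I_p^F = \sqrt{I_p^0 \cdot I_p^1}$$
$$I_v = I_q^0 \cdot I_p^1 = I_q^1 \cdot I_p^0 = I_q^F \cdot I_p^F$$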
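The aggregate indices and the relations between them can be checked numerically. The basket below (three items with base and current quantities and prices) is entirely hypothetical, and the helper name `aggregate` is an assumption.

```python
import math

# Hypothetical basket of three items: base period (0) and current period (1)
q0 = [10, 20, 5]
q1 = [12, 18, 6]
p0 = [4.0, 2.5, 10.0]
p1 = [4.4, 3.0, 9.0]


def aggregate(quantities, prices):
    """Sum of quantity x price over all items."""
    return sum(q * p for q, p in zip(quantities, prices))


value_index = aggregate(q1, p1) / aggregate(q0, p0)

laspeyres_q = aggregate(q1, p0) / aggregate(q0, p0)   # base-period prices as weights
paasche_q = aggregate(q1, p1) / aggregate(q0, p1)     # current-period prices as weights
laspeyres_p = aggregate(q0, p1) / aggregate(q0, p0)   # base-period quantities as weights
paasche_p = aggregate(q1, p1) / aggregate(q1, p0)     # current-period quantities as weights

fisher_q = math.sqrt(laspeyres_q * paasche_q)
fisher_p = math.sqrt(laspeyres_p * paasche_p)

print(value_index, laspeyres_q * paasche_p, paasche_q * laspeyres_p, fisher_q * fisher_p)
```

All four printed numbers agree up to rounding, which illustrates that the value index factors into a quantity index and a price index with opposite weighting, or into the two Fisher indices.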