
Óbuda University

Pro Scientia et Futuro

Statistics I
Viktor Nagy, Ph.D.

Statistics: the science that deals with the collection, tabulation, and systematic
classification of quantitative data, especially as a basis for inference and
induction.
+ Interpretation, presentation

In everyday life?


• National census
• Entire sports industry

Descriptive: to summarize and display data so we can quickly obtain an
overview.
Inferential: to make claims and draw conclusions about a population based on a
sample.

Population: a set of items which is of interest for some question.


Subpopulation: a subset that shares one or more additional properties.
Sample: a subset of the population.

Data: value assigned to an observation or a measurement.


Parameter: data that describes a characteristic about a population.
Statistic: …about a sample
Variable: any characteristic or numerical value that varies from individual to
individual.
Information: data transformed into useful facts

Sources: primary and secondary

Quantitative data: numerical values (Numerical)


Qualitative data: descriptive terms to classify (Categorical)

Discrete: items that can be counted; they take on possible values that can be
listed out. The list may be fixed (finite) or it may go from 0, 1, 2, on to infinity
(making it countably infinite).
Continuous: can only be described using intervals; the possible values cannot be
counted.

The method of comparison: to know the effect of a treatment, compare the
responses of a treatment group with a control group. To make sure that the
treatment group is like the control group, investigators put subjects into them at
random. The control group is given a placebo: neutral but resembles the treatment.
The response should be to the treatment itself rather than to the idea of treatment.
Double-blind experiment: the subjects do not know whether they are in treatment
or in control; neither do those who evaluate the responses.

Figure: classification of studies (controls vs. no controls; contemporaneous vs. historical controls; controlled experiment vs. observational studies; randomized vs. not randomized)

In an observational study, the investigators do not assign the subjects to
treatment or control. An observational study can establish association: one thing
is linked to another. Association may point to causation, but the effects of
treatment may be confounded. The confounder can be a third variable.

Types of data: Qualitative, Quantitative

Measurement scales: Nominal, Ordinal, Interval, Ratio



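Means: unweighted and weighted forms

Harmonic mean
Unweighted: \[ \bar{X}_h = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}} \]
Weighted: \[ \bar{X}_h = \frac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} \frac{f_i}{X_i}} = \frac{1}{\sum_{i=1}^{k} \frac{g_i}{X_i}} \]

Geometric mean
Unweighted: \[ \bar{X}_g = \sqrt[N]{\prod_{i=1}^{N} X_i} \]
Weighted: \[ \bar{X}_g = \sqrt[\sum f_i]{\prod_{i=1}^{k} X_i^{f_i}} = \prod_{i=1}^{k} X_i^{g_i} \]

Arithmetic mean
Unweighted: \[ \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} \]
Weighted: \[ \bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \sum_{i=1}^{k} g_i X_i \]

Root mean square (quadratic mean)
Unweighted: \[ \bar{X}_q = \sqrt{\frac{\sum_{i=1}^{N} X_i^2}{N}} \]
Weighted: \[ \bar{X}_q = \sqrt{\frac{\sum_{i=1}^{k} f_i X_i^2}{\sum_{i=1}^{k} f_i}} = \sqrt{\sum_{i=1}^{k} g_i X_i^2} \]

Here $f_i$ is the frequency and $g_i = f_i / N$ the relative frequency of the value $X_i$, with $\sum_{i=1}^{k} f_i = N$.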

The Lorenz curve is a graphical representation of the distribution of concentration.

The Gini coefficient is a ratio (fraction) and a measure of the inequality of a
distribution: the numerator is the area between the Lorenz curve and the line
of equality (= the uniform distribution line), the denominator is the area
under the line (= the triangle).
Gini index: the coefficient expressed as a percentage.

Same ratio:
Figure: Wikipedia
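The area-based definition of the Gini coefficient can be sketched numerically. The example below assumes a small hypothetical list of incomes, builds the Lorenz curve as a piecewise-linear curve and measures the areas with the trapezoid rule; the function names are illustrative only.

```python
# Hypothetical incomes (not from the slides).
incomes = [10, 20, 30, 40]


def lorenz_points(values):
    """Return the (population share, cumulative value share) points."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points


def gini(values):
    """Area between the line of equality and the Lorenz curve, over the triangle."""
    pts = lorenz_points(values)
    area_under_lorenz = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    area_triangle = 0.5
    return (area_triangle - area_under_lorenz) / area_triangle


g = gini(incomes)
print(f"Gini coefficient: {g:.4f}, Gini index: {100 * g:.2f}%")
```

A perfectly equal distribution puts the Lorenz curve on the line of equality, so the coefficient is 0.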

Frequency distribution: a way to describe the data compactly (simply a table). It
shows the number of data observations that fall into specific intervals.

Points accumulated
33 22 15 20 17 13 34 22 5 17
17 14 26 27 26 34 12 6 21 23
26 6 34 20 20 4 13 34 4 18
28 17 14 38 11 12 23 11 32 26

Points     Number of Students
0 - 25     27
26 - 31    6
32 - 37    6
38 - 43    1
44 - 50    0

Rules for constructing classes:
1. Classes of equal size
2. Mutually exclusive classes (no overlapping)
3. No fewer than 5 and no more than 15 classes (otherwise the true characteristics will be hidden)
4. Avoid open-ended classes
5. Include all data values (exhaustive)

Relative frequency distribution: displays the proportion of observations in each class
relative to the total number of observations.

Points     Number of Students   Relative frequency
0 - 25     27                   27/40 = 0.675
26 - 31    6                    6/40 = 0.150
32 - 37    6                    6/40 = 0.150
38 - 43    1                    1/40 = 0.025
44 - 50    0                    0/40 = 0.000
Total      40                   1.000

Cumulative relative frequency distribution: indicates the proportion of observations
that fall in the current class or in any lower class.

Points     Number of Students   Relative frequency   Cumulative relative frequency
0 - 25     27                   27/40 = 0.675        0.675
26 - 31    6                    6/40 = 0.150         0.825
32 - 37    6                    6/40 = 0.150         0.975
38 - 43    1                    1/40 = 0.025         1.000
44 - 50    0                    0/40 = 0.000         1.000
Total      40                   1.000
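The three tables can be reproduced with a few lines of Python. This is a minimal sketch that uses the 40 point scores and the class limits from the slides; the variable names are illustrative.

```python
points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]
classes = [(0, 25), (26, 31), (32, 37), (38, 43), (44, 50)]

n = len(points)

# Frequency: number of observations inside each (inclusive) class.
counts = [sum(lo <= p <= hi for p in points) for lo, hi in classes]

# Relative frequency: share of the total in each class.
relative = [c / n for c in counts]

# Cumulative relative frequency: share in this class or any lower class.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running / n)

for (lo, hi), c, r, cum in zip(classes, counts, relative, cumulative):
    print(f"{lo:>2} - {hi:<2}  {c:>2}  {r:.3f}  {cum:.3f}")
```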

Measures of Central Tendency: describe the center point of a data set with a single value.

Mean or average: add all the values and divide by the number of observations.

Weighted mean:

Type       Score   Weight (Percent)
Exam       94      50
Project    89      35
Homework   83      15
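A short sketch of the weighted mean for the table above: each score is multiplied by its weight and the sum is divided by the total weight. The dictionary layout is just one convenient way to hold the table.

```python
# (score, weight in percent) taken from the table above.
grades = {
    "Exam": (94, 50),
    "Project": (89, 35),
    "Homework": (83, 15),
}

total_weight = sum(weight for _, weight in grades.values())
weighted_mean = sum(score * weight for score, weight in grades.values()) / total_weight

print(weighted_mean)
```

With these weights the result is 90.6, compared with an unweighted mean of about 88.7, because the exam counts for half of the grade.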

Mean of grouped data from a frequency distribution: calculated by using the class midpoints

Points     Number of Students
0 - 25     27
26 - 31    6
32 - 37    6
38 - 43    1
44 - 50    0

The result is only an approximation to the mean.
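To see how close the approximation is, the sketch below computes the grouped mean from the class midpoints and compares it with the exact mean of the 40 raw scores listed earlier in the slides.

```python
classes = [(0, 25), (26, 31), (32, 37), (38, 43), (44, 50)]
counts = [27, 6, 6, 1, 0]

# Each class is represented by its midpoint, weighted by its frequency.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
grouped_mean = sum(m * f for m, f in zip(midpoints, counts)) / sum(counts)

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]
exact_mean = sum(points) / len(points)

print(grouped_mean, exact_mean)
```

The grouped mean is 18.9 while the exact mean is 19.875; the gap comes from treating every score in a class as if it sat at the midpoint.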

Median: represents the value in the data set for which half the observations are higher
and the other half are lower. When there is an even number of data points, the median
will be the average of the two center points.

Mode: the observation that occurs most frequently. (More than one mode is possible.)
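Both measures are available in Python's standard `statistics` module. The sketch below applies them to the 40 point scores from the slides; `multimode` is used because a data set may have several modes.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

# Even number of observations: the median averages the two center values.
median = statistics.median(points)

# multimode returns every value that shares the highest frequency.
modes = sorted(statistics.multimode(points))

print(median, modes)
```

Here the median is 20 and the data set has three modes (17, 26 and 34), each occurring four times.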

Percentile: the percentage of individuals that fall below the point where a particular number is
located.

Finding the kth percentile:


1. Order all numbers in the data set from smallest to largest.
2. Multiply k percent by the total number of values (N).
3. If your result is a whole number, go to Step 5. If not, round it up to the nearest
whole number and go to Step 4.
4. Count the numbers from left to right until you reach the value from Step 3. This
corresponding number is the kth percentile.
5. Count the numbers from left to right until you reach that whole number. The kth
percentile is the average of that corresponding number and the next number in
your data set.

Points accumulated
4 4 5 6 6 11 11 12 12 13
13 14 14 15 17 17 17 17 18 20
20 20 21 22 22 23 23 26 26 26
26 27 28 32 33 34 34 34 34 38
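The five steps above can be sketched as a small function. This follows the slide's procedure literally (other textbooks and libraries use slightly different conventions); the function name `kth_percentile` is illustrative.

```python
import math

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]


def kth_percentile(values, k):
    """Percentile by the counting procedure in steps 1 to 5 (0 < k < 100)."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)            # Step 1
    position = k * len(ordered) / 100   # Step 2
    if position == int(position):       # Step 3: whole number, go to Step 5
        i = int(position)
        # Step 5: average the i-th value and the next one (1-based counting).
        return (ordered[i - 1] + ordered[i]) / 2
    # Steps 3 and 4: round up and take the value at that position.
    i = math.ceil(position)
    return ordered[i - 1]


print(kth_percentile(points, 25), kth_percentile(points, 50), kth_percentile(points, 75))
```

For the 40 scores, k = 25 gives position 10, so the 25th percentile is the average of the 10th and 11th ordered values (13 and 13).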

Measures of Dispersion: describe how far the individual data values have strayed from
the mean.

Range: the difference between the highest value and the lowest value in the data set.

Interquartile range (IQR): middle fifty: difference between the upper and lower
quartiles; measures the spread of the center half.

The IQR is used to identify outliers: extreme values whose accuracy is questionable and which can
cause unwanted distortions in statistical results. Values to be discarded:

x < Q1 - 1.5·IQR or x > Q3 + 1.5·IQR

Interquantile range:….
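The outlier rule based on the interquartile range can be sketched as follows. The quartiles come from `statistics.quantiles`, whose default interpolation convention is an assumption here and may differ slightly from a hand calculation; on the 40 point scores it gives Q1 = 13 and Q3 = 26.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

data_range = max(points) - min(points)

# Three cut points that split the ordered data into four parts.
q1, _, q3 = statistics.quantiles(points, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in points if p < lower_fence or p > upper_fence]

print(data_range, iqr, lower_fence, upper_fence, outliers)
```

No score falls outside the fences (-6.5 and 45.5), so nothing would be discarded from this data set.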

Average absolute deviation (mean absolute deviation) from the mean:
the average of the absolute deviations from the mean.

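Variance: the average of the squared deviations of the data points from the mean; it describes
how far the data points lie from the mean in the data set.

Variance of the population:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \]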

Standard deviation (SD): the square root of the variance.

Relative standard deviation (RSD) or coefficient of variation (CV): a standardized
measure of the dispersion of a (probability or frequency) distribution, computed as the
standard deviation divided by the mean.
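A minimal sketch of these dispersion measures, treating the 40 point scores from the slides as the whole population (hence `pvariance` and `pstdev` rather than the sample versions):

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

n = len(points)
mean = statistics.fmean(points)

# Average absolute deviation from the mean.
mean_abs_dev = sum(abs(x - mean) for x in points) / n

# Population variance and standard deviation (divide by N).
variance = statistics.pvariance(points)
sd = statistics.pstdev(points)

# Coefficient of variation: dispersion relative to the mean.
cv = sd / mean

print(f"mean={mean:.3f} MAD={mean_abs_dev:.3f} var={variance:.3f} sd={sd:.3f} CV={cv:.3f}")
```

Because the CV is unit-free, it can be used to compare the spread of variables measured on different scales.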

Mean absolute difference (MD) or absolute mean difference: the average absolute
difference of two independent values drawn from the population.

Relative mean absolute difference: the mean absolute difference divided by the mean;
equal to twice the Gini coefficient (Lorenz curve).
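The link to the Gini coefficient can be checked on a small example. The sketch assumes a hypothetical list of incomes and averages the absolute difference over all ordered pairs, including a value paired with itself, which matches two independent draws from the population.

```python
from itertools import product

# Hypothetical incomes (not from the slides).
incomes = [10, 20, 30, 40]

n = len(incomes)
mean = sum(incomes) / n

# Average |a - b| over all n * n ordered pairs of independent draws.
mean_abs_diff = sum(abs(a - b) for a, b in product(incomes, repeat=2)) / n ** 2

relative_mean_abs_diff = mean_abs_diff / mean

# Twice the Gini coefficient equals the relative mean absolute difference.
gini = relative_mean_abs_diff / 2

print(mean_abs_diff, relative_mean_abs_diff, gini)
```

For these values the mean absolute difference is 12.5, the relative version is 0.5 and the implied Gini coefficient is 0.25.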


The five-number summary & box and whisker diagram(s)

Histogram & (relative) frequency polygon

Probability density function, (standard) normal distribution, distribution function, z-formula

Skewness & kurtosis


The Five-Number Summary:
1. The minimum (smallest) number in the data set.
2. The 25th percentile.
3. The median.
4. The 75th percentile.
5. The maximum (largest) number in the data set.

Box plot
Box and whisker diagram
A histogram is a graphical representation of the distribution of numerical data. To construct a histogram, the
first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and
then count how many values fall into each interval. The bins are usually specified as consecutive, non-
overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to
be) of equal size.
If the bins are of equal size, a rectangle is erected over the bin with height proportional to the frequency — the
number of cases in each bin. However, bins need not be of equal width; in that case, the erected rectangle is
defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the
frequency but frequency density — the number of cases per unit of the variable on the horizontal axis.

Figure: histogram shapes (symmetric; skewed right, i.e. positive skew; bimodal)


Frequency Polygons

Frequency polygons are a graphical device for understanding the shapes of
distributions. They serve the same purpose as histograms, but are
especially helpful for comparing sets of data. Frequency polygons are also
a good choice for displaying cumulative frequency distributions.
The values of many large data sets tend to cluster around the mean (or median or mode) so that the data
distribution in the histogram resembles a bell-shaped, symmetrical curve. When this is the case, the empirical rule applies:
approximately 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean, respectively.

Chebyshev’s Theorem: a mathematical rule, similar to the empirical rule except that
it applies to any distribution rather than just bell-shaped, symmetrical
distributions. It states that for any number k greater than 1, at least
\[ \left(1 - \frac{1}{k^2}\right) \times 100 \]
percent of the values will fall within k standard deviations from the mean.
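The two rules can be compared numerically. The sketch below evaluates Chebyshev's lower bound and, for contrast, the exact share of a normal distribution within k standard deviations using `statistics.NormalDist`; the function names are illustrative.

```python
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations (any distribution)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (2, 3):
    print(k, f"{chebyshev_lower_bound(k):.4f}", f"{normal_share_within(k):.4f}")
```

For k = 2 Chebyshev guarantees at least 75%, while a bell-shaped distribution holds about 95%: the theorem is weaker because it makes no assumption about the shape.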

In general: 𝜇 ± kσ
Smaller SD: the curve is tighter and taller around the mean.

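Standard normal distribution:
\[ \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \qquad X \sim N(0, 1) \]

Normal distributions:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad X \sim N(\mu, \sigma) \]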

Probability density function (PDF) or density of a continuous (random) variable describes the relative likelihood for this
variable to take on a given value.

The probability of the random variable falling within a particular range of values is given by the integral of this variable’s
density over that range. Its integral over the entire space is equal to one.

Changing an x value to a z-value is called standardizing. The z-formula:
\[ z = \frac{x - \mu}{\sigma}, \qquad \frac{X - \mu}{\sigma} \sim N(0, 1) \]
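Standardizing can be illustrated with a short sketch. The mean, standard deviation and x value below are hypothetical; the point is that the probability below x under N(μ, σ) equals the probability below z under the standard normal distribution.

```python
from statistics import NormalDist

# Hypothetical parameters and observation (not from the slides).
mu, sigma = 100.0, 15.0
x = 130.0

# z-formula: distance from the mean in units of standard deviations.
z = (x - mu) / sigma

# P(X <= x) computed directly and via the standard normal distribution.
p_direct = NormalDist(mu, sigma).cdf(x)
p_via_z = NormalDist().cdf(z)

print(z, round(p_direct, 5), round(p_via_z, 5))
```

Here z = 2, so x lies two standard deviations above the mean and roughly 97.7% of the distribution lies below it.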

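Cumulative distribution function

\[ F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}} \, dt \]
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}} \, dt \]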
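Skewness: the measure of asymmetry

\[ F_{0.25} = \frac{(Q_3 - Me) - (Me - Q_1)}{(Q_3 - Me) + (Me - Q_1)} \]
\[ F_{0.1} = \frac{(D_9 - Me) - (Me - D_1)}{(D_9 - Me) + (Me - D_1)} \]
\[ A = \frac{\bar{X} - Mo}{\sigma} \]
\[ P = \frac{3 (\bar{X} - Me)}{\sigma} \]

Sign of F, A and P: > 0 for right (positive) skew, = 0 for a symmetric distribution, < 0 for left (negative) skew.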

Kurtosis: the sharpness of the peak of a distribution

\[ K = \frac{Q_3 - Q_1}{2 (D_9 - D_1)} \]

Sharp peak: K < 0.263
Flat: K > 0.263
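The quantile-based measures F and K and the coefficient P can be sketched on the 40 point scores used earlier in the slides. Quartiles and deciles are taken from `statistics.quantiles`, whose interpolation convention is an assumption and may differ slightly from a hand calculation; the mode-based coefficient A is left out because this data set has several modes.

```python
import statistics

points = [
    33, 22, 15, 20, 17, 13, 34, 22, 5, 17,
    17, 14, 26, 27, 26, 34, 12, 6, 21, 23,
    26, 6, 34, 20, 20, 4, 13, 34, 4, 18,
    28, 17, 14, 38, 11, 12, 23, 11, 32, 26,
]

q1, me, q3 = statistics.quantiles(points, n=4)   # quartiles; me is the median
deciles = statistics.quantiles(points, n=10)     # nine decile cut points
d1, d9 = deciles[0], deciles[-1]

# Quantile-based skewness measures.
f_quartile = ((q3 - me) - (me - q1)) / ((q3 - me) + (me - q1))
f_decile = ((d9 - me) - (me - d1)) / ((d9 - me) + (me - d1))

# Skewness coefficient P based on the mean and the median.
mean = statistics.fmean(points)
sigma = statistics.pstdev(points)
p_coefficient = 3 * (mean - me) / sigma

# Quantile-based kurtosis measure.
k = (q3 - q1) / (2 * (d9 - d1))

print(f"F0.25={f_quartile:.4f} F0.1={f_decile:.4f} P={p_coefficient:.4f} K={k:.4f}")
```

On this data F0.25 and P are slightly negative, while K is about 0.232, below the 0.263 threshold.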
Cross-tabulation (contingency table, or crosstab): shows the relationship between two
categorical variables.
1. Association (between nominal data)
2. Mixed (nominal and ratio data)
3. Correlation (quantitative data)

Contingency table: classes of variable A in the rows, classes of variable B in the columns

             C1B       C2B       …   CjB       …   C(k-1)B        CkB        Σj
C1A          f11       f12       …   f1j       …   f1(k-1)        f1k        f1●
C2A          f21       f22       …   f2j       …   f2(k-1)        f2k        f2●
⋮            ⋮         ⋮             ⋮             ⋮              ⋮          ⋮
CiA          fi1       fi2       …   fij       …   fi(k-1)        fik        fi●
⋮            ⋮         ⋮             ⋮             ⋮              ⋮          ⋮
C(r-1)A      f(r-1)1   f(r-1)2   …   f(r-1)j   …   f(r-1)(k-1)    f(r-1)k    f(r-1)●
CrA          fr1       fr2       …   frj       …   fr(k-1)        frk        fr●
Σi           f●1       f●2       …   f●j       …   f●(k-1)        f●k        N

Association (binary variable)

Yule’s coefficient of association


Dog Bear Σ
boy 3 10 13
girl 15 2 17
Σ 18 12 30
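The formula for Yule's coefficient did not survive on this slide, so the sketch below assumes the usual definition for a 2×2 table, Q = (f11·f22 − f12·f21) / (f11·f22 + f12·f21), and applies it to the boy/girl and Dog/Bear counts above.

```python
# Observed 2x2 table from the slide: rows boy/girl, columns Dog/Bear.
f11, f12 = 3, 10   # boy:  Dog, Bear
f21, f22 = 15, 2   # girl: Dog, Bear

# Yule's Q (assumed standard definition): ranges from -1 to 1,
# with 0 meaning no association between the two binary variables.
yule_q = (f11 * f22 - f12 * f21) / (f11 * f22 + f12 * f21)

print(round(yule_q, 4))
```

The value is about -0.92. A magnitude close to 1 indicates a strong association; the sign depends only on the order in which the categories are listed.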

Association

Quality    Factory A   Factory B   Factory C
Accept     31          7           4
Waste      3           35          45

Cramer’s V

Independent

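Association: observed frequency, expected frequency under independence, and contribution (O - E)²/E for each cell

Quality (Q)   Factory   Observed   Expected   (O - E)²/E
A             A         31         11.42      33.57
A             B         7          14.11      3.58
A             C         4          16.46      9.43
W             A         3          22.58      16.98
W             B         35         27.89      1.81
W             C         45         32.54      4.77

Row totals: A = 42, W = 83
Column totals: Factory A = 34, Factory B = 42, Factory C = 49; Total = 125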
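The table can be reproduced step by step. The sketch assumes the usual definitions: expected frequency = row total × column total / N, chi-square = sum of (O − E)²/E, and Cramér's V = sqrt(chi-square / (N · min(r − 1, c − 1))). Only the observed counts come from the slide.

```python
import math

# Observed counts: rows Accept/Waste, columns Factory A/B/C.
observed = [
    [31, 7, 4],
    [3, 35, 45],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if quality and factory were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of the per-cell contributions.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Cramér's V scales chi-square to the range 0..1.
smaller_dim = min(len(observed), len(observed[0])) - 1
cramers_v = math.sqrt(chi2 / (n * smaller_dim))

print(round(chi2, 2), round(cramers_v, 3))
```

Chi-square comes out at about 70.1 and Cramér's V at about 0.75, which points to a fairly strong association between factory and quality.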

Mixed

Day     Action   Romantic   Comedy
1.      57       14         38
2.      61       5          50
3.      -        20         -
Mean    59       13         44

Sum of Squares Within

Sum of Squares Between groups

Sum of Squares Total

The type of the movie explains 93.44 percent of the variation in the number of tickets sold.
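The three sums of squares can be recomputed from the ticket table. The sketch assumes the usual decomposition (total = within + between) and takes the explained share as the between-group sum of squares divided by the total.

```python
# Tickets sold per day, grouped by movie type (missing days are simply omitted).
tickets = {
    "Action": [57, 61],
    "Romantic": [14, 5, 20],
    "Comedy": [38, 50],
}

all_values = [v for group in tickets.values() for v in group]
grand_mean = sum(all_values) / len(all_values)


def group_mean(group):
    return sum(group) / len(group)


# Within groups: deviations of each value from its own group mean.
ss_within = sum((v - group_mean(g)) ** 2 for g in tickets.values() for v in g)

# Between groups: deviations of the group means from the grand mean.
ss_between = sum(len(g) * (group_mean(g) - grand_mean) ** 2 for g in tickets.values())

# Total: deviations of each value from the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

explained_share = ss_between / ss_total
print(ss_within, ss_between, ss_total, explained_share)
```

The sums come out as 194 (within), 2766 (between) and 2960 (total).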

Spearman’s rank correlation coefficient (rho)

Candidates and Referees

Referee   C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
R1        8    10   5    4    7    1    2    6    3    9
R2        8    9    7    4    5    2    1    6    3    10
d_i^2     0    1    4    0    4    1    1    0    0    1      \sum d_i^2 = 12

\[ \rho = 1 - \frac{6 \sum d_i^2}{N (N^2 - 1)} = 1 - \frac{6 \cdot 12}{10 \cdot (10^2 - 1)} = 0.9273 \]
Strong, positive correlation

The coefficient varies: $-1 \le \rho \le 1$
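The calculation can be checked with a few lines of Python, using the two referees' rankings of the ten candidates. The formula below is valid when there are no tied ranks, as is the case here.

```python
# Ranks given by the two referees to candidates C1..C10.
r1 = [8, 10, 5, 4, 7, 1, 2, 6, 3, 9]
r2 = [8, 9, 7, 4, 5, 2, 1, 6, 3, 10]

n = len(r1)

# Squared rank differences for each candidate.
d_squared = [(a - b) ** 2 for a, b in zip(r1, r2)]
sum_d_squared = sum(d_squared)

rho = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(d_squared, sum_d_squared, round(rho, 4))
```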



Pearson’s correlation coefficient

Nr        Price of the car (€), X   Age of the car (years), Y   d_X = X - \bar{X}   d_Y = Y - \bar{Y}   d_X · d_Y   d_X^2     d_Y^2
1         1300                      3                           150                 -1.8                -270        22500     3.24
2         1000                      5                           -150                0.2                 -30         22500     0.04
3         800                       8                           -350                3.2                 -1120       122500    10.24
4         850                       7                           -300                2.2                 -660        90000     4.84
5         1800                      1                           650                 -3.8                -2470       422500    14.44
Sum       5750                      24                          0                   0                   -4550       680000    32.8
Average   \bar{X} = 1150            \bar{Y} = 4.8

\[ R_{XY} = \frac{C_{XY}}{\sigma_X \sigma_Y} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}} = \frac{-4550}{\sqrt{680000 \cdot 32.8}} = -0.9634 \]

Strong, negative correlation
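The same calculation in code, as a minimal sketch based on the deviation form of the formula and the five cars from the table:

```python
import math

price = [1300, 1000, 800, 850, 1800]   # X, in euros
age = [3, 5, 8, 7, 1]                  # Y, in years

mean_x = sum(price) / len(price)
mean_y = sum(age) / len(age)

# Deviations from the respective means.
dx = [x - mean_x for x in price]
dy = [y - mean_y for y in age]

sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

print(mean_x, mean_y, round(sum_dxdy, 1), round(r, 4))
```

The negative sign reflects that older cars tend to have lower prices in this sample.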



INDEX NUMBERS
An index is a means of comparing changes in some variable, often price over time. This
is particularly useful when there are many items involved and when the prices and
quantities are in different units. The best known index is the consumer price index (CPI):
it is the official measure of inflation when it comes to up-rating public sector pensions
and benefits. It compares the price of a "basket" of goods from one month to another.
Consumer Price Index
The Consumer Price Index measures the average change in the prices of
goods and services purchased by households for their own consumption. It
measures the inflation of the national currency.
Consumer price index for pensioners
The consumer price index for pensioners shows how the differences in the
structure of the pensioners' consumption influence the indices of this stratum of
the population. The three groups of commodities (foods, medicines, expenditures
related to housing) which have a major impact on the pensioners'
consumption amount to 60 percent of the consumer basket of this stratum. The
index is calculated by eliminating products and services related to child care.

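SIMPLE INDICES
Changes in quantity, price, value

\[ i_{q_i} = \frac{q_{1i}}{q_{0i}} \qquad i_{p_i} = \frac{p_{1i}}{p_{0i}} \qquad i_{v_i} = \frac{v_{1i}}{v_{0i}} = \frac{q_{1i} \cdot p_{1i}}{q_{0i} \cdot p_{0i}} = i_{q_i} \cdot i_{p_i} \]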

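WEIGHTED AGGREGATE INDICES

Value:
\[ I_v = \frac{\sum q_1 p_1}{\sum q_0 p_0} = \frac{\sum v_1}{\sum v_0} = \frac{\sum v_0 i_v}{\sum v_0} = \frac{\sum v_1}{\sum \frac{v_1}{i_v}} \]

Quantity, Laspeyres (base-weighted):
\[ I_q^0 = \frac{\sum q_1 p_0}{\sum q_0 p_0} = \frac{\sum q_0 p_0 i_q}{\sum q_0 p_0} = \frac{\sum q_1 p_0}{\sum \frac{q_1 p_0}{i_q}} \]

Quantity, Paasche (current-weighted):
\[ I_q^1 = \frac{\sum q_1 p_1}{\sum q_0 p_1} = \frac{\sum q_0 p_1 i_q}{\sum q_0 p_1} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_q}} \]

Price, Laspeyres (base-weighted):
\[ I_p^0 = \frac{\sum q_0 p_1}{\sum q_0 p_0} = \frac{\sum q_0 p_0 i_p}{\sum q_0 p_0} = \frac{\sum q_0 p_1}{\sum \frac{q_0 p_1}{i_p}} \]

Price, Paasche (current-weighted):
\[ I_p^1 = \frac{\sum q_1 p_1}{\sum q_1 p_0} = \frac{\sum q_1 p_0 i_p}{\sum q_1 p_0} = \frac{\sum q_1 p_1}{\sum \frac{q_1 p_1}{i_p}} \]

Fisher indices:
\[ I_q^F = \sqrt{I_q^0 \cdot I_q^1} \qquad I_p^F = \sqrt{I_p^0 \cdot I_p^1} \]

\[ I_v = I_q^0 \cdot I_p^1 = I_q^1 \cdot I_p^0 = I_q^F \cdot I_p^F \]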
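The aggregate indices and the identities between them can be sketched on a small basket. The quantities and prices below are hypothetical (three goods, base period 0 and current period 1); only the formulas follow the slide.

```python
from math import sqrt

# Hypothetical basket of three goods (not from the slides).
q0 = [10, 20, 5]          # base-period quantities
p0 = [2.0, 1.5, 10.0]     # base-period prices
q1 = [12, 18, 6]          # current-period quantities
p1 = [2.5, 1.8, 11.0]     # current-period prices


def aggregate(q, p):
    """Total value of quantities q valued at prices p."""
    return sum(a * b for a, b in zip(q, p))


value_index = aggregate(q1, p1) / aggregate(q0, p0)

laspeyres_q = aggregate(q1, p0) / aggregate(q0, p0)   # base-period prices as weights
paasche_q = aggregate(q1, p1) / aggregate(q0, p1)     # current-period prices as weights

laspeyres_p = aggregate(q0, p1) / aggregate(q0, p0)   # base-period quantities as weights
paasche_p = aggregate(q1, p1) / aggregate(q1, p0)     # current-period quantities as weights

fisher_q = sqrt(laspeyres_q * paasche_q)
fisher_p = sqrt(laspeyres_p * paasche_p)

print(round(value_index, 4), round(laspeyres_q * paasche_p, 4),
      round(paasche_q * laspeyres_p, 4), round(fisher_q * fisher_p, 4))
```

All four printed numbers agree: the value index factors into a quantity index and a price index as long as one is base-weighted and the other current-weighted, or both are Fisher indices.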