
STATISTICS I FORMULA SHEET

PROF. ANGELA MONTANARI

1. Notation

$j$ indicates the generic statistical unit
$n$ indicates the number of observed units
$i$ indicates the generic level of a variable
$k$ indicates the number of levels of a variable
$g$ indicates the generic group
$G$ indicates the number of groups
$n_i$ indicates the absolute frequency of level $i$
$N_i$ indicates the cumulative frequency of level $i$
$f_i$ indicates the relative frequency of level $i$
$F_i$ indicates the relative cumulative frequency of level $i$
$x_{i-1} \dashv x_i$ indicates the generic class for a continuous variable coded in classes in a frequency distribution
$w_i$ indicates the width of class $i$; $w_i = x_i - x_{i-1}$
$h_i$ indicates the frequency density of class $i$; $h_i = n_i / w_i$ or $h_i = f_i / w_i$
$\hat{x}_i$ indicates the central value of the class $x_{i-1} \dashv x_i$; $\hat{x}_i = (x_i + x_{i-1})/2$

2. Mean Values
• Median of a continuous variable X in a frequency distribution
\[
m_c = x_{i-1} + (x_i - x_{i-1}) \, \frac{\frac{n}{2} - N_{i-1}}{n_i}
    = x_{i-1} + (x_i - x_{i-1}) \, \frac{\frac{1}{2} - F_{i-1}}{f_i}
\]
where
⋆ $x_{i-1}$ is the lower bound of the median class
⋆ $(x_i - x_{i-1})$ is the width of the median class
⋆ $n_i$ is the frequency of the median class
⋆ $N_{i-1}$ is the cumulative frequency of the class that comes before the median class
• Arithmetic mean of a variable X in a data set.
\[
\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n}
\]
• Arithmetic mean of a variable X in a frequency distribution
\[
\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} = \sum_{i=1}^{k} x_i f_i
\]

• Arithmetic mean of a continuous variable X (coded in classes) in a frequency distribution


\[
\bar{x} = \frac{\sum_{i=1}^{k} \hat{x}_i n_i}{n} = \sum_{i=1}^{k} \hat{x}_i f_i
\]
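The two formulas for a variable coded in classes (interpolated median and mean of the class midpoints) can be sketched in Python as follows. This is only an illustrative sketch: the function names `grouped_mean` and `grouped_median`, the representation of the classes as a list of consecutive bounds, and the example data are assumptions, not part of the formula sheet.

```python
def grouped_mean(bounds, counts):
    """Arithmetic mean of a variable coded in classes, using the class midpoints."""
    n = sum(counts)
    # Central value of each class: (x_i + x_{i-1}) / 2
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bounds[:-1], bounds[1:])]
    return sum(m * c for m, c in zip(midpoints, counts)) / n


def grouped_median(bounds, counts):
    """Median obtained by linear interpolation inside the median class."""
    n = sum(counts)
    cumulative = 0  # N_{i-1}: cumulative frequency before the current class
    for lo, hi, c in zip(bounds[:-1], bounds[1:], counts):
        # The median class is the first class whose cumulative frequency reaches n/2
        if c > 0 and cumulative + c >= n / 2:
            return lo + (hi - lo) * (n / 2 - cumulative) / c
        cumulative += c
    raise ValueError("counts must contain at least one positive frequency")


# Hypothetical data: classes 0-10, 10-20, 20-40 with frequencies 5, 10, 5
bounds = [0, 10, 20, 40]
counts = [5, 10, 5]
mean_value = grouped_mean(bounds, counts)
median_value = grouped_median(bounds, counts)
```

With these hypothetical data the midpoints are 5, 15 and 30, and the median class is 10-20, since the cumulative frequency first reaches n/2 = 10 there.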

3. Properties of the arithmetic mean


(1) Sum identity: $\sum_{j=1}^{n} x_j = n\bar{x}$
(2) Zero sum of the deviations of all the values of x from their arithmetic mean: $\sum_{j=1}^{n} (x_j - \bar{x}) = 0$


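(3) Minimum sum of squared deviations of all the values of x from their arithmetic mean: the sum of squared deviations from a constant $c$ is smallest when $c = \bar{x}$,
\[
\sum_{j=1}^{n} (x_j - \bar{x})^2 = \min_{c} \sum_{j=1}^{n} (x_j - c)^2
\]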

(4) Associativity (for grouped data):


\[
\bar{x} = \sum_{g=1}^{G} \frac{\bar{x}_g n_g}{n} = \frac{\bar{x}_1 n_1 + \ldots + \bar{x}_G n_G}{n}
\]

(5) Equivariance of the arithmetic mean with respect to linear transformations (translations a and
scale changes b):
\[
x_j^{*} = a + b x_j \;\Rightarrow\; \bar{x}^{*} = a + b\bar{x}
\]
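The five properties can be checked numerically on a small data set. The sketch below uses made-up values and hypothetical helper names (`mean`, `sum_of_squares`); property (3) is only spot-checked against a few alternative centres, which illustrates the property but does not prove it.

```python
from math import isclose


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values, centre):
    """Sum of squared deviations of the values from an arbitrary centre."""
    return sum((v - centre) ** 2 for v in values)


x = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical data
n = len(x)
x_bar = mean(x)

# (1) Sum identity: the sum of the values equals n times the mean
sum_identity_holds = isclose(sum(x), n * x_bar)

# (2) The deviations from the mean sum to zero
zero_deviation_holds = isclose(sum(v - x_bar for v in x), 0.0, abs_tol=1e-12)

# (3) Spot check: no other centre tried gives a smaller sum of squares
minimum_holds = all(
    sum_of_squares(x, x_bar) <= sum_of_squares(x, c)
    for c in (x_bar - 1.0, x_bar + 0.5, 0.0)
)

# (4) Associativity: the overall mean from group means and group sizes
groups = [x[:2], x[2:]]
pooled_mean = sum(mean(g) * len(g) for g in groups) / n

# (5) Equivariance under the linear transformation a + b*x
a, b = 3.0, 2.0
transformed_mean = mean([a + b * v for v in x])
```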

4. Variability Measures
Categorical variables
• Gini heterogeneity measure: $E_1 = 1 - \sum_{i=1}^{k} f_i^2$ (minimum value 0, maximum value $1 - 1/k$)

• Entropy: $E_2 = -\sum_{i=1}^{k} f_i \log f_i$ (minimum value 0, maximum value $\log k$)
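A minimal Python sketch of the two heterogeneity measures, computed from absolute frequencies. The function names are hypothetical, the natural logarithm is an assumption (the sheet does not state the base of the logarithm), and levels with zero frequency are taken to contribute nothing to the entropy.

```python
from math import log


def gini_heterogeneity(counts):
    """E1 = 1 - sum of squared relative frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """E2 = -sum of f_i * log(f_i), natural logarithm assumed."""
    n = sum(counts)
    # Zero-frequency levels are skipped, since f*log(f) tends to 0 as f tends to 0
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


# Hypothetical distributions over k = 4 levels
uniform = [5, 5, 5, 5]      # maximum heterogeneity
degenerate = [20, 0, 0, 0]  # all units share one level

e1_uniform = gini_heterogeneity(uniform)
e2_uniform = entropy(uniform)
e1_degenerate = gini_heterogeneity(degenerate)
e2_degenerate = entropy(degenerate)
```

The uniform case reaches the stated maxima, 1 − 1/k and log k, and the degenerate case reaches the minimum 0 for both measures.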

Numeric variables
• Range: $x_{\max} - x_{\min}$
• Interquartile range: $Q_3 - Q_1$
• Sum of squared deviations from the mean of variable X in a data set (total sum of squares):
\[
\mathrm{Dev}(X) = TSS = \sum_{j=1}^{n} (x_j - \bar{x})^2 = \sum_{j=1}^{n} x_j^2 - n\bar{x}^2
\]

• Variance of a variable X in a data set


\[
\mathrm{Var}(X) = s_x^2 = \frac{\mathrm{Dev}(X)}{n}
\]
• Standard Deviation of a variable X in a data set
\[
\mathrm{SD}(X) = s_x = \sqrt{\mathrm{Var}(X)}
\]
• Sum of squared deviations from the mean of variable X in a frequency distribution
\[
\mathrm{Dev}(X) = TSS = \sum_{i=1}^{k} (x_i - \bar{x})^2 n_i = \sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2
\]
• Variance of a variable X in a frequency distribution
\[
\mathrm{Var}(X) = s_x^2
  = \frac{1}{n} \sum_{i=1}^{k} (x_i - \bar{x})^2 n_i
  = \frac{1}{n} \sum_{i=1}^{k} x_i^2 n_i - \bar{x}^2
  = \sum_{i=1}^{k} (x_i - \bar{x})^2 f_i
  = \sum_{i=1}^{k} x_i^2 f_i - \bar{x}^2
\]
• Standard deviation of a variable X in a frequency distribution
\[
\mathrm{SD}(X) = s_x
  = \sqrt{\frac{1}{n} \sum_{i=1}^{k} (x_i - \bar{x})^2 n_i}
  = \sqrt{\frac{1}{n} \sum_{i=1}^{k} x_i^2 n_i - \bar{x}^2}
  = \sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2 f_i}
  = \sqrt{\sum_{i=1}^{k} x_i^2 f_i - \bar{x}^2}
\]

• Sum of squared deviations from the mean of a continuous variable X (coded into classes) in a frequency distribution
\[
\mathrm{Dev}(X) = TSS = \sum_{i=1}^{k} (\hat{x}_i - \bar{x})^2 n_i = \sum_{i=1}^{k} \hat{x}_i^2 n_i - n\bar{x}^2
\]

• Variance of a continuous variable X (coded into classes) in a frequency distribution


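\[
\mathrm{Var}(X) = s_x^2
  = \frac{1}{n} \sum_{i=1}^{k} (\hat{x}_i - \bar{x})^2 n_i
  = \frac{1}{n} \sum_{i=1}^{k} \hat{x}_i^2 n_i - \bar{x}^2
  = \sum_{i=1}^{k} (\hat{x}_i - \bar{x})^2 f_i
  = \sum_{i=1}^{k} \hat{x}_i^2 f_i - \bar{x}^2
\]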

• Standard Deviation of a continuous variable X (coded into classes) in a frequency distribution


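\[
\mathrm{SD}(X) = s_x
  = \sqrt{\frac{1}{n} \sum_{i=1}^{k} (\hat{x}_i - \bar{x})^2 n_i}
  = \sqrt{\frac{1}{n} \sum_{i=1}^{k} \hat{x}_i^2 n_i - \bar{x}^2}
  = \sqrt{\sum_{i=1}^{k} (\hat{x}_i - \bar{x})^2 f_i}
  = \sqrt{\sum_{i=1}^{k} \hat{x}_i^2 f_i - \bar{x}^2}
\]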

• Coefficient of Variation
\[
CV = \frac{s_x}{\bar{x}}
\]

• Gini’s concentration ratio
If $x_{(1)} + x_{(2)} + \cdots + x_{(j)}$ is the amount of variable X owned by the j poorest units, then $q_j = (x_{(1)} + x_{(2)} + \cdots + x_{(j)})/(n\bar{x})$ is the corresponding proportion of the total amount. Denoting by $p_j = j/n$ the cumulative relative frequency of the first j units, Gini's concentration ratio is defined as
\[
R = \frac{\sum_{j=1}^{n-1} (p_j - q_j)}{\sum_{j=1}^{n-1} p_j}
\]
and takes values between 0 and 1.
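Two of the numeric measures above can be sketched in Python: the variance, standard deviation and coefficient of variation of a frequency distribution (using the shortcut form of the variance), and Gini's concentration ratio for individual amounts. The function names and the data are hypothetical. The coefficient of variation is taken here as the standard deviation divided by the mean, which assumes a nonzero mean, and the concentration ratio assumes non-negative amounts with a positive total.

```python
from math import sqrt


def frequency_summary(values, counts):
    """Mean, variance, SD and CV of a frequency distribution (levels and frequencies)."""
    n = sum(counts)
    x_bar = sum(x * c for x, c in zip(values, counts)) / n
    # Shortcut form: mean of the squares minus the square of the mean
    variance = sum(x * x * c for x, c in zip(values, counts)) / n - x_bar ** 2
    sd = sqrt(variance)
    return x_bar, variance, sd, sd / x_bar


def gini_concentration(amounts):
    """Gini's concentration ratio R for a list of individual amounts."""
    ordered = sorted(amounts)  # poorest units first
    n = len(ordered)
    total = sum(ordered)       # equals n times the mean
    running = 0.0
    numerator = 0.0
    denominator = 0.0
    for j in range(1, n):      # j = 1, ..., n-1
        running += ordered[j - 1]
        p_j = j / n            # cumulative relative frequency of units
        q_j = running / total  # cumulative share of the total amount
        numerator += p_j - q_j
        denominator += p_j
    return numerator / denominator


# Hypothetical frequency distribution: levels 1, 2, 3 with frequencies 1, 2, 1
x_bar, variance, sd, cv = frequency_summary([1, 2, 3], [1, 2, 1])

# Hypothetical amounts: equal shares, one unit owning everything, and a mixed case
r_equal = gini_concentration([10, 10, 10, 10])
r_max = gini_concentration([0, 0, 0, 40])
r_mixed = gini_concentration([4, 1, 3, 2])
```

Equal shares give R = 0 and a single owner gives R = 1, the two ends of the stated range.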

5. Deviance or Total sum of squares Decomposition


• For grouped data: decomposition of the Total Sum of Squared deviations from the mean (Deviance or TSS) of a variable X into Deviance Within or Sum of Squares Within (WSS) and Deviance Between or Sum of Squares Between (BSS) groups:
\[
\mathrm{Dev}(X) = \mathrm{Dev}(X)_{Within} + \mathrm{Dev}(X)_{Between}
\quad \text{or equivalently} \quad
TSS = WSS + BSS
\]
\[
\mathrm{Dev}(X)_{Within} = WSS = \sum_{g=1}^{G} \sum_{j=1}^{n_g} (x_{jg} - \bar{x}_g)^2 = \sum_{g=1}^{G} s_g^2 n_g
\]
\[
\mathrm{Dev}(X)_{Between} = BSS = \sum_{g=1}^{G} (\bar{x}_g - \bar{x})^2 n_g
\]
where $n_g$ and $s_g^2$ are the size and the variance of the g-th group respectively.
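The decomposition can be verified numerically. The following sketch uses made-up groups and hypothetical helper names (`mean`, `deviance`); it computes TSS on the pooled data and compares it with WSS + BSS.

```python
def mean(values):
    return sum(values) / len(values)


def deviance(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Hypothetical data split into G = 3 groups of different sizes
groups = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]
pooled = [v for group in groups for v in group]
x_bar = mean(pooled)

tss = deviance(pooled)
# Within: deviance of each group around its own mean (equal to s_g^2 * n_g)
wss = sum(deviance(group) for group in groups)
# Between: squared distance of each group mean from the overall mean, weighted by group size
bss = sum(len(group) * (mean(group) - x_bar) ** 2 for group in groups)
```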

6. Association Measures
• Chi-Squared
\[
\chi^2 = \sum_{i=1}^{u} \sum_{h=1}^{v} \frac{(n_{ih} - n^{*}_{ih})^2}{n^{*}_{ih}}
\]
• Tchuprov
\[
T = \sqrt{\frac{\chi^2}{n \sqrt{(u - 1)(v - 1)}}}
\]
• Eta-Squared
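\[
\eta^2 = \frac{\mathrm{Dev}(Y)_{Between}}{\mathrm{Dev}(Y)} = 1 - \frac{\mathrm{Dev}(Y)_{Within}}{\mathrm{Dev}(Y)}
\]
\[
\mathrm{Dev}(Y)_{Within} = WSS = \sum_{i=1}^{u} \sum_{h=1}^{v} (y_i - \bar{y}_h)^2 n_{ih}
\]
\[
\mathrm{Dev}(Y)_{Between} = BSS = \sum_{h=1}^{v} (\bar{y}_h - \bar{y})^2 n_{0h}
\]
\[
\mathrm{Dev}(Y) = TSS = \sum_{i=1}^{u} \sum_{h=1}^{v} (y_i - \bar{y})^2 n_{ih}
\]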
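The chi-squared statistic and Tchuprov's T can be sketched in Python for a two-way table of absolute frequencies with u rows and v columns. The sheet does not spell out the theoretical frequencies; the sketch assumes the usual independence frequencies (row total times column total, divided by n) and strictly positive row and column totals. Function names and tables are hypothetical.

```python
from math import sqrt


def chi_squared(table):
    """Chi-squared statistic of a two-way frequency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for h, observed in enumerate(row):
            # Assumed theoretical frequency under independence
            expected = row_totals[i] * col_totals[h] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def tchuprov_t(table):
    """Tchuprov's T: chi-squared normalised by n and the square root of (u-1)(v-1)."""
    u, v = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    return sqrt(chi_squared(table) / (n * sqrt((u - 1) * (v - 1))))


# Hypothetical 2x2 tables: perfect association and independence
perfect = [[10, 0], [0, 10]]
independent = [[5, 5], [5, 5]]

chi2_perfect = chi_squared(perfect)
t_perfect = tchuprov_t(perfect)
chi2_independent = chi_squared(independent)
t_independent = tchuprov_t(independent)
```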

• Linear regression coefficient and intercept


\[
b_1 = \frac{\mathrm{Codev}(X, Y)}{\mathrm{Dev}(X)} = \frac{s_{xy}}{s_x^2}
\]
\[
b_0 = \bar{y} - b_1 \bar{x}
\]
• Codeviance
\[
\mathrm{Codev}(X, Y) = \sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y}) = \sum_{j=1}^{n} x_j y_j - n\bar{x}\bar{y}
\]
• Covariance
\[
\mathrm{Covar}(X, Y) = s_{xy} = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{n} = \frac{\sum_{j=1}^{n} x_j y_j}{n} - \bar{x}\bar{y}
\]
• Linear correlation coefficient
\[
r = \frac{\mathrm{Codev}(X, Y)}{\sqrt{\mathrm{Dev}(X)\,\mathrm{Dev}(Y)}} = \frac{s_{xy}}{s_x s_y} = \sqrt{b_{1yx}\, b_{1xy}} = b_1 \frac{s_x}{s_y}
\]
• Decomposition of the sum of squared deviations from the mean of a variable Y into Model Sum of Squares (MSS) and Residual Sum of Squares (RSS) in a linear regression model:
\[
MSS = \sum_{j=1}^{n} (y_j^{*} - \bar{y})^2 = b_1^2\, \mathrm{Dev}(X)
\]
\[
RSS = \sum_{j=1}^{n} (y_j - y_j^{*})^2 = \sum_{j=1}^{n} e_j^2
\]
\[
TSS(Y) = MSS + RSS
\]
• Coefficient of determination
\[
R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS} = r^2
\]
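The regression formulas above can be tied together in a short Python sketch that computes the coefficients, the correlation and the sum-of-squares decomposition from the shortcut forms of codeviance and deviance. The function name `linear_fit`, the returned dictionary keys and the data are hypothetical, and the sketch assumes that neither variable is constant.

```python
from math import sqrt


def linear_fit(x, y):
    """Least-squares line y* = b0 + b1*x with r, MSS, RSS and TSS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Shortcut forms: sum of products (or squares) minus n times the product of means
    codev = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    dev_x = sum(a * a for a in x) - n * x_bar ** 2
    dev_y = sum(b * b for b in y) - n * y_bar ** 2
    b1 = codev / dev_x
    b0 = y_bar - b1 * x_bar
    r = codev / sqrt(dev_x * dev_y)
    fitted = [b0 + b1 * a for a in x]  # the values y*_j
    mss = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum((b - f) ** 2 for b, f in zip(y, fitted))
    return {"b0": b0, "b1": b1, "r": r, "mss": mss, "rss": rss, "tss": dev_y}


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fit = linear_fit(x, y)
r_squared = fit["mss"] / fit["tss"]
```

For these made-up data the fitted line is y* = 2.2 + 0.6x, and the coefficient of determination MSS/TSS agrees with the squared correlation coefficient.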
