
UNIT 1: INTRODUCTION TO STATISTICS

LECTURE 1: WHAT IS STATISTICS?

1. MOTIVATION
a. Review concepts from probability and apply them, in this unit, to a simple statistics example

2. OVERALL GOALS OF THIS COURSE

a.
i. Obtain a solid introduction to the mathematical theory behind statistical models
ii. Understand the limitations of the statistical methods
iii. THEORETICAL GUARANTEES
1. Can be used to compare methods and to give an objective measure of how good a method is

3. WHY STATISTICS?

a.

i. A computer, using data, decided the fate of House of Cards, i.e. who would direct and who would be in the cast
4. STATISTICS, DATA SCIENCE, AND PROBABILITY

a.
i. All these fields are data driven, i.e. they gather data to obtain insight and then make decisions
1. However, depending on the field, the exact technique might vary, for example
a. In machine learning, if one wants to design a spam filter, one essentially wants a black box, which
looks at the data, and then decides whether it is spam or not
b. In biology, scientists might want to look at genes; however, instead of simply knowing whether a gene is “spam” or not, they want to know the biological underpinnings behind the obtained data
2. In machine learning, one sometimes does not care about the insight, i.e. if it works, it works; there is no need to understand how
ii. At the heart of all these fields are core statistical principles, and these give us principled ways of making decisions
based on data
iii. Computation plays a big role, and algorithms essentially employ statistical principles in an efficient manner

b.
i. COMPUTATIONAL VIEW
1. APPROXIMATE NEAREST NEIGHBORS: This is akin to personalized recommendations, and asking which
vector in the data set is close to my personal vector
a. This is an algorithmic problem since if a profile is 100,000 numbers long, skimming through the
data set is hard
2. LOW DIMENSIONAL EMBEDDING: Project data in higher dimensions into something that can be visualized
a. This projection cannot be done randomly, otherwise the structure in the data will be lost
3. SPECTRAL METHODS: An algorithmic tool, pervasive in data science
4. DISTRIBUTED OPTIMIZATION: Optimization of data sets so large they are not stored on a single server
ii. STATISTICAL VIEW (ASSUMING INFINITE COMPUTATIONAL POWER)
1. Data is generated, essentially, via a random process; we need to understand its underpinnings so as to make predictions
2. Extrapolate from the finite data we are given to the utopian setting, i.e. as if we had infinite data
3. RANDOMNESS
a. Two kinds of randomness:
i. True randomness: natural randomness like dice, cards, etc.
ii. Pretend randomness: things in the data which I do not understand, and therefore I will pretend they are random
b. The goal of statistics is to understand the randomness in data
c. To understand randomness, we need to understand probability!
c.
i. In probability, there is a random process whose parameters are completely known
1. From this, everything is derived, i.e. likely outcomes, and so on

5. STATISTICS AND MODELLING

a.
i. In probability, one is given the description of a random process, i.e. the parameters, and one obtains derivations
pertaining to the description
ii. However, if one does not understand the random process, i.e. does not know that the probability of each outcome in the roll of a fair six-sided die is 1/6, then one needs statistics to estimate the parameters of the random process
1. Statistics basically asks for data to supply information about a random process
iii. TYPES OF RANDOMNESS
1. REAL RANDOMNESS: Flipping a coin, dice, picking a random student
2. COMPLEX DETERMINISM: In marketing campaigns, for example, consumer behavior is not random, but is
too complex, hence we model it as random
a. Essentially these are things which we do not have in our data set
iv. STATISTICAL MODELING
1. Essentially, based on the above, modelling is reduced to
a. Complicated process “=” simple process + random noise
i. The simple process amounts to say 80% of what is going on, and the random noise, i.e.
the 20%, is like a modeling assumption
ii. Essentially, the random noise will average out over people
2. Good modelling essentially places as much weight as possible on the simple process, and the rest on random noise (a small simulation sketch follows at the end of this list)
a. Also, if one knows that the random noise is always positive, then we model it with a random
variable X, which takes values x>0
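A minimal simulation sketch of this decomposition, assuming a hypothetical linear “simple process” and Gaussian noise (both chosen purely for illustration; the lecture does not commit to any particular model):

    import numpy as np

    rng = np.random.default_rng(0)

    n = 1000
    x = rng.uniform(0, 10, size=n)        # an observed covariate (purely illustrative)
    simple_part = 2.0 * x + 1.0           # the "simple process" (assumed linear here)
    noise = rng.normal(0.0, 1.0, size=n)  # the "random noise": everything we do not model
    y = simple_part + noise               # complicated process "=" simple process + random noise

    print(noise.mean())  # close to 0: the noise averages out over observations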
b. CENTRAL DOGMA OF PROBABILITY AND STATISTICS
i.

ii.
1. Some truth, say a fair die, and probability tells you how the observations will look
2. Suppose we don’t know how the observations are being generated, and we just have observations, then,
the purpose of statistics is to extract the truth, using the data, i.e.
a. We reverse engineer probability
b. However, we cannot reverse engineer the complete truth

c.
i. In probability, one has enough data (e.g. from past studies) to determine the truth
ii. In statistics, we do not have any past studies, i.e. there are only, say, 100 patients, and we try to discern the truth

6. ABOUT THIS COURSE

a.
i.
7. LET’S DO SOME STATISTICS

a.
i. In the above experiment, one can intuitively say, if p = 90%, then indeed, there is a preference to turn to the right
1. The question then becomes, at what value of p should one conclude that there is indeed a preference
ii. Essentially, we would like to know the true p, i.e. the value obtained if we look at the population of couples in the
world
1. To do so, we design the following statistical experiment
a. We observe n kissing couples, and if they turn right, we denote a value of 1, else 0
b. The estimate of p is simply the proportion of 1’s in the sequence of 1’s and 0’s
c. Here, the true p is estimated with the observed proportion, $\hat{p}$

b.
i. Now, given our estimate, $\hat{p} = 64.5\%$, and that n = 124, we may say that there is a preference for turning right
ii. However, if n = 3, and we observe {R, R, L}, then we have $\hat{p} = 66.7\%$, and then we cannot say there is a preference
iii. So, the question essentially is, how big should n be, and this is important to know
1. That is, for what value of n can I be satisfied with the $\hat{p}$ I observe
2. And to understand this, one requires mathematics

c.
i. In the above process, we have essentially derived our first estimator of p, $\hat{p}$, which is the sample average
1. The estimator is a formula into which we plug in the numbers, and the sample average is defined as
a. $\hat{p} = \bar{R}_n = \frac{1}{n}\sum_{i=1}^{n} R_i$, where $R_i = 1$ if the i-th couple turns right, and $R_i = 0$ otherwise, for $i = 1, \dots, n$
2. When we have defined the collected data as above, the sample average is essentially the proportion of 1’s
we have collected, i.e.
a. The number of i’s such that $R_i = 1$, since when we sum, the 0’s don’t add to our counter, and we are simply counting the number of 1’s
ii. Now, how good is $\hat{p} = \bar{R}_n$? Or, how accurate an estimator is it? (A simulation sketch follows at the end of this section.)
1. This depends on n, and also depends on the true p
2. The randomness arises because $\bar{R}_n$ is random, since it is a function of random variables, and we want to model how close it is to the true p
3. Here, the observed value is a realization of a random variable
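A minimal simulation sketch of this estimator, assuming the i.i.d. Bernoulli model introduced in the next section; the values p = 0.65 and n = 124 come from the lecture’s example, but the data here are simulated rather than real:

    import numpy as np

    rng = np.random.default_rng(1)

    p_true = 0.65  # the unknown truth; fixed here only so that data can be simulated
    n = 124        # number of observed kissing couples, as in the lecture's example

    R = rng.binomial(1, p_true, size=n)  # R_i = 1 if the i-th couple turns right, else 0
    p_hat = R.mean()                     # the estimator: sample average = proportion of 1's

    print(f"p_hat = {p_hat:.3f} from n = {n} simulated couples")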

8. THE FIRST EXAMPLE: MODELING ASSUMPTIONS

a.
i. Coming up with a statistical model essentially involves making assumptions about our observations, i.e. the $R_i$’s, and for the above case, the assumptions are (summarized compactly after this list)
1. Each Ri is a random variable
2. Then, we make an assumption about the distribution of our random variables
a. Here, each $R_i \sim \text{Bernoulli}(p)$
b. In the couples’ experiment, if we do not have extra data on each couple, say their handedness, and so on, we cannot justify a different parameter p for each $R_i$, so a single p is shared by all of them
3. The $R_i$, for $i = 1, \dots, n$, are mutually independent
a. This assumption is made to enable us to use probability rules
i. For example, if two events are independent, then the probability of their intersection is
given via the multiplication rule
b. If the random variables are dependent, we would have to model their dependence
i. This can be difficult to model; consider, for example, how one would model the dependence among 20,000 genes
c. Nowadays, with data being collected from multiple sources, it is worth thinking about whether the data we have are really independent
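Summarized compactly (this is only a restatement of the three assumptions above, not an addition to them):
$$R_1, \dots, R_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p), \qquad p \in [0, 1]$$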
b.

9. POPULATION VERSUS SAMPLES

a.
i. Using a computer, one can create a population of 5000 couples, and assign them p’s as shown, and then generate
samples of size n = 124, which look as shown
1. Essentially, samples of size n = 124 were drawn 1000 times for each given p, and the 1000 resulting estimates were used to form the histograms
ii. The above histograms basically tell us the distribution of estimates $\hat{p}$ one can get for a given true p
1. When the true p is 65%, the estimates concentrate near 65%; essentially no dataset generated with p = 35% gives an estimate remotely close to 65%
2. Statistics essentially answers which histogram is more likely to have generated the data we actually saw
3. So, in the kissing example, $\hat{p} = 64.5\%$ is likely to have been generated when p = 65% (a simulation sketch of this experiment follows below)
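A minimal simulation sketch of this experiment, assuming direct Bernoulli sampling (a simplification: the 5000-couple population from the slides is not reconstructed, and the histograms are printed as bin counts rather than plotted):

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_datasets = 124, 1000

    for p_true in (0.35, 0.50, 0.65):
        # draw 1000 datasets of n = 124 couples each and record p_hat for every dataset
        p_hats = rng.binomial(1, p_true, size=(n_datasets, n)).mean(axis=1)
        # a coarse text "histogram": counts of estimates falling in each 10%-wide bin
        counts, _ = np.histogram(p_hats, bins=np.arange(0.0, 1.05, 0.1))
        print(f"true p = {p_true:.2f}: {counts}")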
b.
i. Probability plays a crucial role here, since it gives us a good rule for what a histogram is supposed to look like
1. Moreover, probability rules will tell us exactly the probability of seeing specific values in a histogram
ii. In the above case, probability helps us understand the random variable $\hat{p} = \bar{R}_n$, and we would now like to ask specific questions about this random variable
1. UNBIASED
a. Is the expected value of our estimator $\hat{p}$ close to the unknown parameter p?
2. CONSISTENT
a. Does the estimator $\hat{p}$ take values close to p with high probability?
3. Is the variance of $\hat{p}$ large, i.e. does it fluctuate a lot?
a. We would like this to be small, preferably 0, i.e. $\hat{p} = p$
iii. In the above case, the things on the bottom right are all probability questions
1. They can be answered approximately as the sample size becomes larger
iv. SAMPLE AVERAGE
1. The sample average, or sample mean, of n random variables $X_1, \dots, X_n$ is given as
a. $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$
b. Essentially, statistics involves taking averages of random variables and doing fancy stuff
2. The behavior of averages of random variables is governed by two major tools, which give their approximate distribution
a. Central Limit Theorem
b. Law of Large Numbers

LECTURE 2: PROBABILITY REDUX

1. TWO IMPORTANT PROBABILITY TOOLS

a.
i. We work with the following
1. We have n i.i.d. random variables $X_1, \dots, X_n$, where $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$
ii. LAW OF LARGE NUMBERS
1. In statistics, expectations (a notion more within probability) are replaced by averages, and we are justified in doing so by the Law of Large Numbers (a numerical sketch follows at the end of this subsection)
2. Basically, the sample mean (a random variable which is formed by averaging random variables) converges
to the expectation of an individual random variable, i.e. a deterministic number
a. $\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i \to \mu$, as n approaches infinity
b. For the strong law, the convergence is almost surely (a.s.)
i. The strong law subsumes the weak law, i.e. if convergence happens almost surely, it will
also happen in probability
c. For the weak law, the convergence is in probability (P)
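A minimal numerical sketch of the law of large numbers, assuming Bernoulli(0.65) observations for concreteness (any distribution with a finite mean would illustrate the same point):

    import numpy as np

    rng = np.random.default_rng(3)
    mu = 0.65  # E(X_i) for a Bernoulli(0.65) random variable

    for n in (10, 100, 10_000, 1_000_000):
        x_bar = rng.binomial(1, mu, size=n).mean()  # sample mean of n i.i.d. draws
        print(f"n = {n:>9}: sample mean = {x_bar:.4f} (mu = {mu})")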
iii. CENTRAL LIMIT THEOREM
1. HOW LARGE IS n?

a. Suppose that, even though $\bar{X}_n \to \mu$ as $n \to \infty$, we have that
i. $|\bar{X}_n - \mu| \approx \frac{1}{\ln(\ln n)}$; then even for $n = 10^{100}$ the convergence is not fast enough
b. These deviations between the sample mean and the expectation are given by the central limit theorem
2. Per the CLT, for the random variable $\bar{X}_n$, it is the case that
a. $\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$, as n approaches infinity, or the equivalent form, as shown above
i. In most cases, $n \ge 30$ is good enough for approximate convergence to the standard normal
b. Now, a standard Gaussian, we know, will with high probability give us a number in $[-3, 3]$, thus
i. $\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$ is a number in $[-3, 3]$ with very high probability
ii. Thus, we have $|\bar{X}_n - \mu| \le \frac{3\sigma}{\sqrt{n}}$ with very high probability
c. The above bound shows much faster convergence as compared to $\frac{1}{\ln(\ln n)}$ (a numerical sketch follows at the end of this subsection)
d. Now, based on the probability we desire, we can pick numbers other than 3
3. We can make the above statements since the standard Gaussian is stochastically bounded
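A minimal numerical sketch of the CLT claim above, again assuming Bernoulli observations with p = 0.65: many independent sample means are standardized, and the fraction landing in [-3, 3] is checked (about 99.7% for a standard Gaussian):

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, n_reps = 0.65, 124, 100_000
    mu, sigma = p, np.sqrt(p * (1 - p))  # mean and standard deviation of a Bernoulli(p)

    x_bars = rng.binomial(1, p, size=(n_reps, n)).mean(axis=1)
    z = (x_bars - mu) / (sigma / np.sqrt(n))  # standardized sample means

    # per the CLT, z is approximately standard normal, so |z| <= 3 should hold ~99.7% of the time
    print("fraction of runs with |x_bar - mu| <= 3*sigma/sqrt(n):", np.mean(np.abs(z) <= 3))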

2. HOEFFDING’S INEQUALITY
a.
i. A

b.
i. A
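The slide content for this part is not reproduced in these notes; for reference, a standard form of Hoeffding’s inequality for i.i.d. random variables $X_1, \dots, X_n$ bounded in $[a, b]$ almost surely is
$$\mathbb{P}\left( \left| \bar{X}_n - \mathbb{E}[X_1] \right| \ge \varepsilon \right) \le 2 \exp\left( -\frac{2 n \varepsilon^2}{(b - a)^2} \right), \qquad \varepsilon > 0,$$
and, unlike the CLT approximation, this bound holds for every fixed n, not just asymptotically.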
c. REVIEW: MARKOV AND CHEBYSHEV INEQUALITY

i.
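The slide content is again not reproduced here; as a reminder, the standard statements are
1. Markov’s inequality: for a non-negative random variable X and any t > 0,
$$\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[X]}{t}$$
2. Chebyshev’s inequality: for a random variable X with mean $\mu$ and variance $\sigma^2$, and any t > 0,
$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$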
