In probability theory, the normal (or Gaussian) distribution is a very common continuous
probability distribution. Normal distributions are important in statistics and are often used
in the natural and social sciences to represent real-valued random variables whose
distributions are not known.
rnorm()
To draw random numbers from the normal distribution use the rnorm function, which,
optionally, allows the specification of the mean and standard deviation.
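For example, five draws from the standard normal, and five from a normal with mean 10 and standard deviation 2 (being random, your values will differ):
> rnorm(5)
> rnorm(5, mean = 10, sd = 2)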
dnorm()
The probability density function contains information about probability, but its value is not itself a probability: it can be any positive number, even larger than one. It is defined only for continuous random variables.
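For instance, dnorm returns the height of the density curve at a point; with a small standard deviation that height can exceed one:
> dnorm(0)
[1] 0.3989423
> dnorm(0, sd = 0.1)
[1] 3.989423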
pnorm()
Similarly, pnorm calculates the cumulative distribution function of the normal distribution; that is, the probability that a given number, or a smaller number, occurs. This is defined as
F(x) = P(X \le x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2/(2\sigma^2)}\, dt
If you wish to find the probability that a number is larger than the given number you can
use the lower.tail option:
> pnorm(0,lower.tail=FALSE)
> pnorm(1,lower.tail=FALSE)
> pnorm(0,mean=2,lower.tail=FALSE)
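The lower.tail=FALSE result is simply the complement of the usual cumulative probability, as a quick check shows:
> pnorm(1, lower.tail = FALSE)
[1] 0.1586553
> 1 - pnorm(1)
[1] 0.1586553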
Examples:
Question:
Suppose widgits produced at Acme Widgit Works have weights that are normally distributed with mean 17.46 grams and variance 375.67 grams². What is the probability that a randomly chosen widgit weighs more than 19 grams?
Answer: What is P(X > 19) when X has the N(17.46, 375.67) distribution?
R wants the s. d. as the parameter, not the variance. We'll need to take a square root!
> pnorm(19, mean=17.46, sd=sqrt(375.67), lower.tail=FALSE)
[1] 0.4683356
qnorm()
qnorm() is the inverse of pnorm(): you give it a probability, and it returns the number whose cumulative distribution matches that probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:
> qnorm(0.5)
> qnorm(0.5,mean=1)
> qnorm(0.5,mean=1,sd=2)
> qnorm(0.5,mean=2,sd=2)
> qnorm(0.5,mean=2,sd=4)
> qnorm(0.25,mean=2,sd=2)
> qnorm(0.333)
> qnorm(0.333,sd=3)
> qnorm(0.75,mean=5,sd=2)
> x <- seq(0,1,by=.01)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
Example:
Question: Suppose IQ scores are normally distributed with mean 100 and standard
deviation 15. What is the 95th percentile of the distribution of IQ scores?
Answer: What is F⁻¹(0.95) when X has the N(100, 15²) distribution?
> qnorm(0.95, mean=100, sd=15)
[1] 124.6728
dnorm(x, mean, sd) – density function – takes a number and returns the height of the density curve at that point
pnorm(q, mean, sd) – distribution function – takes a number and returns the cumulative probability up to it
qnorm(p, mean, sd) – quantile function – takes a probability and returns a number
rnorm(n, mean, sd) – random number generation function – takes a count n and generates that many random numbers
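As a quick check that qnorm and pnorm are inverses of one another:
> pnorm(qnorm(0.75))
[1] 0.75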
BINOMIAL DISTRIBUTION:
The probability distribution of the random variable X is called a binomial distribution, and is given by the formula
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n
where
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
and n is the number of trials and p is the probability of success of a trial. The mean is np and the variance is np(1 − p). When n = 1 this reduces to the Bernoulli distribution.
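These mean and variance facts can be checked numerically with dbinom; a small sketch for n = 10 and p = 0.4, so np = 4 and np(1 − p) = 2.4:
> n <- 10; p <- 0.4; k <- 0:n
> sum(k * dbinom(k, n, p))             # mean = np
[1] 4
> sum((k - n*p)^2 * dbinom(k, n, p))   # variance = np(1-p)
[1] 2.4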
The binomial distribution model deals with finding the probability of success of an event which
has only two possible outcomes in a series of experiments.
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: a random variable containing a single bit of information: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p).
R has four built-in functions for the binomial distribution: dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob) and rbinom(n, size, prob), where
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
rbinom()
Generating random numbers from the binomial distribution is not simply generating uniform random numbers but rather generating the number of successes in a series of independent trials.
A random value with n=1 (one observation), size=10 (10 trials) and prob=0.4 (probability of success 0.4) is
> rbinom(n=1,size=10,prob=0.4)
> rbinom(n=8,size=150,prob=.4)
> rbinom(n=10,size=150,prob=.4)
> rbinom(8,150,.4)
Setting size to 1 turns the numbers into a Bernoulli random variable, which can take on only the
value 1 (success) or 0 (failure).
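For example, ten Bernoulli draws, each either 0 or 1 (your random values will differ):
> rbinom(10, size = 1, prob = 0.5)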
dbinom()
The density function dbinom gives the probability of an exact value in the distribution, using the probability mass function given above.
Question: Suppose widgits produced at Acme Widgit Works have probability 0.005 of being
defective. Suppose widgits are shipped in cartons containing 25 widgits. What is the probability
that a randomly chosen carton contains exactly one defective widgit?
Answer: What is P(X = 1) when n=25, prob=0.005?
> dbinom(1,25,.005)
[1] 0.1108317
pbinom()
The cumulative distribution function gives the cumulative probability up to and including a given value:
> pbinom(q = 3, size = 10, prob = 0.3)
> pbinom(q = 1:10, size = 10, prob = 0.3)
> x <- 1:50
> y<-pbinom(x,10,.3)
> plot(x,y,main="pbinom()",xlab="random values",ylab="cumulative probability CDF")
Question: Suppose widgits produced at Acme Widgit Works have probability 0.005 of being
defective. Suppose widgits are shipped in cartons containing 25 widgits. What is the probability
that a randomly chosen carton contains no more than one defective widgit?
Answer: What is P(X ≤ 1) when size=25, prob=0.005?
> pbinom(1,25,.005)
[1] 0.9930519
qbinom()
This function takes a probability and returns the number whose cumulative probability matches it; that is, given a certain probability, qbinom returns the quantile, which for this distribution is the number of successes.
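For instance, the median and quartiles of a binomial distribution with size=10 and prob=0.3:
> qbinom(0.5, size = 10, prob = 0.3)
[1] 3
> qbinom(c(0.25, 0.5, 0.75), size = 10, prob = 0.3)
[1] 2 3 4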
POISSON DISTRIBUTION:
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space. It is popular for modelling the number of times an event occurs in an interval of time or space.
A discrete random variable X is said to have a Poisson distribution with parameter λ > 0 if, for k = 0, 1, 2, ..., the probability mass function of X is given by
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
where
• e is Euler's number (e = 2.71828...)
• k! is the factorial of k.
R provides the usual four functions, dpois(x, lambda), ppois(q, lambda), qpois(p, lambda) and rpois(n, lambda), where
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations.
• lambda is the mean number of events per interval (the rate parameter λ).
rpois()
> rpois(1,lambda=1)
> rpois(10,lambda=1)
dpois()
The Poisson probability mass function evaluated by dpois is the formula given above:
> dpois(1,lambda=1)
> dpois(10,lambda=1)
> dpois(1:10,lambda=1)
> dpois(c(1,5,9),lambda=1)
Question: If there are twelve cars crossing a bridge per minute on average, find the probability that exactly sixteen cars cross the bridge in a particular minute.
Answer: The probability of having sixteen cars crossing the bridge in a particular minute is given
by the function dpois.
> dpois(16,lambda = 12)
[1] 0.05429334
ppois()
The cumulative distribution function of a Poisson distribution is given as
P(X \le k) = e^{-\lambda} \sum_{i=0}^{k} \frac{\lambda^i}{i!}
> ppois(1,lambda=1)
> ppois(10,lambda=1)
> ppois(1:10,lambda=1)
> ppois(c(1,5,9),lambda=1)
Question: If there are twelve cars crossing a bridge per minute on average, find the probability of
having seventeen or more cars crossing the bridge in a particular minute.
Answer: The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given by the function ppois; the probability of seventeen or more is its complement, obtained with lower.tail=FALSE.
> ppois(16, lambda = 12)
[1] 0.898709
> ppois(16, lambda = 12, lower.tail = FALSE)
[1] 0.101291
qpois()
This function takes a probability as input and returns the smallest number of events k such that the cumulative probability P(X ≤ k) is at least that probability.
> qpois(1,lambda=1)
> qpois(0.5,lambda=1)
> qpois(seq(0,1,.1),lambda=1)
> qpois(c(0.1,0.5,0.9),lambda=1)
OTHER DISTRIBUTIONS:
R supports many distributions, some of which are very common, while others are quite obscure. All follow the same naming convention seen above: a d, p, q or r prefix attached to the distribution's root name, for example unif (uniform), exp (exponential), gamma, beta, t (Student's t), chisq (chi-squared) and f (F).
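A quick illustration of this convention with the exponential distribution (rate 1):
> dexp(1)      # density at 1
[1] 0.3678794
> pexp(1)      # P(X <= 1)
[1] 0.6321206
> qexp(0.5)    # median
[1] 0.6931472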
BASIC STATISTICS:
The sample mean is the average and is computed as the sum of all the observed outcomes from the sample divided by the total number of events. We use x̄ as the symbol for the sample mean. In math terms,
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
where n is the sample size and the x_i correspond to the observed values.
The mode of a set of data is the number with the highest frequency.
The median is the middle score. If we have an even number of events we take the average of the two middle values.
The variance and the closely related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.
> x<-1:5
> mean(x)
> median(x)
> sd(x)
> var(x)
> min(x)
> max(x)
> summary(x)
Correlation
Correlation is used to test relationships between quantitative variables or categorical
variables. In other words, it’s a measure of how things are related. The study of how variables
are correlated is called correlation analysis.
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):
• The cost of a car wash and how long it takes to buy a soda inside the station.
Correlations are useful because if you can find out what relationship variables have, you can make predictions about future behavior.
A correlation coefficient is a way to put a value to the relationship. Correlation coefficients
have a value between -1 and 1. A “0” means there is no relationship between the
variables at all, while -1 or 1 means that there is a perfect negative or positive
correlation (negative or positive correlation here refers to the type of graph the
relationship will produce).
The Pearson correlation coefficient is computed as
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
where x̄ and ȳ are the means of x and y, and s_x and s_y are the standard deviations of x and y.
It can range between -1 and 1, with higher positive numbers meaning a closer relationship between the two variables, lower negative numbers meaning an inverse relationship, and numbers near zero meaning no relationship.
To compute the correlation between two columns of a numeric data frame df, or the full correlation matrix:
> cor(df[,1], df[,2])
> cor(df)
To visualize the correlations between variables use the pairs() function instead of plot():
> pairs(df)
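A small worked example with two short vectors:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> cor(x, y)
[1] 0.7745967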
Similar to correlation is covariance, which is like a variance between variables. Covariance is a measure of the degree to which two variables (for example, returns on two risky assets) move in tandem. A positive covariance means the values move together, while a negative covariance means they move inversely.
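Continuing the small example above, covariance relates to correlation via r = cov(x, y)/(s_x s_y):
> cov(x, y)
[1] 1.5
> cov(x, y) / (sd(x) * sd(y))
[1] 0.7745967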
The T Score:
The t score is a ratio of the difference between two group means to the variation within the groups. The larger the t score, the more difference there is between groups. The smaller the t score, the more similarity there is between groups. When you run a t-test, the bigger the t-value, the more likely it is that the results are repeatable.
Every t-value has a p-value to go with it. A p-value is the probability that the results from your sample data occurred by chance. P-values range from 0% to 100% and are usually written as a decimal; for example, a p-value of 5% is 0.05. Low p-values are good; they indicate your data did not occur by chance alone. For example, a p-value of .01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a p-value of 0.05 (5%) is accepted as the threshold for statistical significance.
The R function t.test() can be used to perform both one- and two-sample t-tests on vectors of data:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Here x is a numeric vector of data values and y is an optional numeric vector of data values. If y is excluded, the function performs a one-sample t-test on the data contained in x; if it is included, it performs a two-sample t-test using both x and y.
The option mu provides a number indicating the true value of the mean (or difference in
means if you are performing a two sample test) under the null hypothesis. The option
alternative is a character string specifying the alternative hypothesis, and must be one
of the following: "two.sided" (which is the default), "greater" or "less" depending on
whether the alternative hypothesis is that the mean is different from, greater than, or less than mu, respectively. For example, the following call:
> t.test(x, alternative="less", mu=10)
performs a one-sample t-test on the data contained in x where the null hypothesis is that μ = 10 and the alternative is that μ < 10.
The option paired indicates whether or not you want a paired t-test (TRUE = yes and
FALSE = no). If you leave this option out it defaults to FALSE.
The option var.equal is a logical variable indicating whether or not to assume the two
variances as being equal when performing a two-sample t-test. If TRUE then the pooled
variance is used to estimate the variance otherwise the Welch (or Satterthwaite)
approximation to the degrees of freedom is used. If you leave this option out it defaults
to FALSE.
Finally, the option conf.level determines the confidence level of the reported confidence interval for μ in the one-sample case and for μ1 − μ2 in the two-sample case.
A. One-sample t-tests
Let μ be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:
H0: μ = 0.3
Ha: μ > 0.3
Hence, we will need to include the options alternative="greater", mu=0.3. Below is the
relevant R-code:
> x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
> t.test(x, alternative="greater", mu=0.3)
data: x
t = 2.2051, df = 8, p-value = 0.02927
alternative hypothesis: true mean is greater than 0.3
From the output we see that the p-value = 0.029. Hence, there is moderately strong
evidence that the mean Salmonella level in the ice cream is above 0.3 MPN/g.
B. Two-sample t-tests
Ex. 6 subjects were given a drug (treatment group) and an additional 6 subjects a
placebo (control group). Their reaction time to a stimulus was measured (in ms). We
want to perform a two-sample t-test for comparing the means of the treatment and
control groups.
Let μ1 be the mean of the population taking medicine and μ2 the mean of the untreated population. Here the hypothesis of interest can be expressed as:
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 < 0
Here we will need to include the data for the treatment group in x and the data for the
control group in y. We will also need to include the options alternative="less", mu=0.
Finally, we need to decide whether or not the standard deviations are the same in both
groups.
Below is the relevant R-code when assuming equal standard deviation:
> t.test(Control, Treat, alternative="less", var.equal=TRUE)
Below is the relevant R-code when not assuming equal standard deviation:
> t.test(Control, Treat, alternative="less")
Here the pooled t-test and the Welch t-test give roughly the same results (p-value = 0.00313 and 0.00339, respectively).
C. Paired t-tests
There are many experimental settings where each subject in the study is in both the
treatment and control group. For example, in a matched pairs design, subjects are
matched in pairs and different treatments are given to each subject in the pair. The
outcomes are thereafter compared pair-wise. Alternatively, one can measure each
subject twice, before and after a treatment. In either of these situations we can’t use
two-sample t-tests since the independence assumption is not valid. Instead we need to
use a paired t-test. This can be done using the option paired =TRUE.
Ex. A study was performed to test whether cars get better mileage on premium gas than
on regular gas. Each of 10 cars was first filled with either regular or premium gas,
decided by a coin toss, and the mileage for that tank was recorded. The mileage was
recorded again for the same cars using the other kind of gasoline. We use a paired t-test to determine whether cars get significantly better mileage with premium gas.
Below is the relevant R-code:
> reg = c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
> prem = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
> t.test(prem,reg,alternative="greater", paired=TRUE)
Paired t-test
The results show that the t-statistic is equal to 4.47 and the p-value is 0.00075. Since
the p-value is very low, we reject the null hypothesis. There is strong evidence of a
mean increase in gas mileage between regular and premium gasoline.
ANOVA
A t-test is used to compare two groups; if we want to compare more than two groups we use ANOVA (ANalysis Of VAriance).
We are often interested in determining whether the means from more than two populations
or groups are equal or not. To test whether the difference in means is statistically significant
we can perform analysis of variance (ANOVA) using the R function aov(). If the ANOVA F-test
shows there is a significant difference in means between the groups we may want to perform
multiple comparisons between all pair-wise means to determine how they differ.
Ex. A drug company tested three formulations of a pain relief medicine for migraine
headache sufferers. For the experiment 27 volunteers were selected and 9 were randomly
assigned to one of three drug formulations. The subjects were instructed to take the drug
during their next migraine headache episode and to report their pain on a scale of 1 to 10
(10 being most pain).
Drug A 4 5 4 3 2 4 3 4 4
Drug B 6 8 4 5 4 6 5 8 6
Drug C 6 7 6 6 7 5 6 5 5
Ho: Equal means for all three drug groups
Ha: Not all three group means are equal
To make side-by-side boxplots of the variable pain grouped by the variable drug we must first read the data into the appropriate format:
> pain <- c(4,5,4,3,2,4,3,4,4, 6,8,4,5,4,6,5,8,6, 6,7,6,6,7,5,6,5,5)
> drug <- c(rep("A",9), rep("B",9), rep("C",9))
> migraine <- data.frame(pain, drug)
Note the command rep("A",9) constructs a list of nine A's in a row. The variable drug is therefore a list of length 27 consisting of nine A's followed by nine B's followed by nine C's.
If we print the data frame migraine we can see the format the data should be in to make side-by-side boxplots and perform ANOVA:
> migraine
   pain drug
1     4    A
2     5    A
3     4    A
4     3    A
5     2    A
6     4    A
…
25    6    C
26    5    C
27    5    C
We can now make the boxplots by typing:
> boxplot(pain ~ drug, data = migraine)
From the boxplots it appears that the mean pain for drug A is lower than that for drugs B and
C.
We then perform the ANOVA with the aov() function:
> results <- aov(pain ~ drug, data = migraine)
> summary(results)
            Df Sum Sq Mean Sq F value   Pr(>F)
drug         2  28.22  14.111   11.91 0.000256
Residuals   24  28.44   1.185
Studying the output of the ANOVA table above we see that the F-statistic is 11.91 with a p-value equal to 0.0003. We clearly reject the null hypothesis of equal means for all three drug groups.
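Finally, as noted earlier, when the F-test is significant we may want multiple comparisons between all pair-wise means. A minimal sketch using Tukey's honest significant differences on the fitted model (assuming the aov fit is stored in results as above):
> TukeyHSD(results)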